c - Which data structure might be a more efficient implementation?

Question

I was doing an exercise on reading from a setup file in which every line specifies two words and a number. The number denotes the number of words in between the two specified words. Another file – input.txt – has a block of text, and the program attempts to count the number of occurrences in the input file that follow the constraints in each line of the setup file (i.e., two particular words a and b should be separated by n words, where a, b and n are specified in the setup file).

So I've tried to do this as a shell script, but my implementation is probably highly inefficient. I used arrays to store the words from the setup file, and then did a linear search on the text file to find the words, and so on. Here's a bit of the code, if it helps:

#!/bin/bash

j=0
count=0;
m=0;
flag=0;
error=0;
while read line; do
    line=($line);
    a[j]=${line[0]}
    b[j]=${line[1]}
    num=${line[2]}
    c[j]=`expr $num + 0`
    j=`expr $j + 1`
done <input2.txt

while read line2; do
    line2=($line2)
    for (( i=0; $i<=50; i++ )); do
        for (( m=0; $m<j; m++)); do
            g=`expr $i + ${c[m]}`
            g=`expr $g + 1`
            if [ "${line2[i]}" == "${a[m]}" ] ; then
                for (( k=$i; $k<$g; k++)); do
                    if [[ "${line2[k]}" == *.* ]]; then
                        flag=1
                        break
                    fi
                done
                if [ "${b[m]}" == "${line2[g]}" ] ; then
                    if [ "$flag" == 1 ] ; then 
                        error=`expr $error + 1`
                    fi
                    count=`expr $count + 1`
                fi
                flag=0
            fi
            if [ "${line2[i]}" == "${b[m]}" ] ; then
                for (( k=$i; $k<$g; k++)); do
                    if [[ "${line2[k]}" == *.* ]]; then
                        flag=1
                        break
                    fi
                done
                if [ "${a[m]}" == "${line2[g]}" ] ; then
                    if [ "$flag" == 1 ] ; then 
                        error=`expr $error + 1`
                    fi
                    count=`expr $count + 1`
                fi
                flag=0
            fi
        done
    done 
done <input.txt

count=`expr $count - $error`

echo "| Count = $count |"

As you can see, this takes a lot of time.

I was thinking of a more efficient way to implement this, in C or C++ this time. What could be a possible alternative implementation, efficiency considered? I thought of hash tables, but could there be a better way?

I'd like to hear what everyone has to say on this.

score 1 · Accepted Answer

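Here's a fully working possibility. It is not 100% pure bash, since it uses (GNU) sed: I'm using sed to lowercase everything and to get rid of punctuation marks. Maybe you won't need this. Adapt it to your needs.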

#!/bin/bash

input=input.txt
setup=setup.txt

# The Check function
Check() {
   # $1 is word1
   # $2 is word2
   # $3 is number of words between word1 and word2
   nb=0
   # Get all positions of w1
   IFS=, read -a q <<< "${positions[$1]}"
   # Check, for each position, if word2 is at distance $3 from word1
   for i in "${q[@]}"; do
      [[ ${words[$i+$3+1]} = "$2" ]] && ((++nb))
   done
   echo "$nb"
}

# Slurp input file in an array
words=( $(sed 's/[,.:!?]//g;s/\(.*\)/\L\1/' -- "$input") )

# For each word, specify its positions in file
declare -A positions
pos=0
for i in "${words[@]}"; do
   positions[$i]+=$((pos++)),
done

# Do it!
while read w1 w2 p; do
   # Check that w1 w2 are not empty
   [[ -n $w2 ]] || continue
   # Check that p is a number
   [[ $p =~ ^[[:digit:]]+$ ]] || continue
   n=$(Check "$w1" "$w2" "$p")
   [[ $w1 != "$w2" ]] && (( n += $(Check "$w2" "$w1" "$p") ))
   echo "$w1 $w2 $p: $n"
done < <(sed 's/\(.*\)/\L\1/' -- "$setup")

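How it works:

We first read the whole file `input.txt` into the array `words`, one word per field. Note that I'm using sed here to delete all punctuation marks (well, only `,`, `.`, `:`, `!` and `?` for testing purposes; add more if you like) and to lowercase every letter.
We then loop over the array `words` and, for each word, record its position in the associative array `positions`: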
```
w => "position1,position2,...,positionk,"
```
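Finally, we read the `setup.txt` file (filtered through sed again to lowercase everything; this is optional, see below). We do a quick check that the line is valid (two words and a number), and then call the `Check` function (twice, once for each ordering of the two given words, unless both words are equal).
The `Check` function finds all the positions of word1 in the file, thanks to the associative array `positions`, and then uses the array `words` to check whether word2 is at the given "distance" from word1.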
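
To make the index concrete, here is a small sketch run on a made-up sentence (the sample words are my own, not from the question). It assumes bash 4 or later for associative arrays and builds `positions` the same way the script does:

```bash
#!/bin/bash
# Hypothetical sample text, already lowercased and free of punctuation.
words=( the dog saw the cat and the dog ran )

# Build the index: word => "pos1,pos2,...,posk,"
declare -A positions
pos=0
for w in "${words[@]}"; do
   positions[$w]+=$((pos++)),
done

echo "the => ${positions[the]}"   # the => 0,3,6,
echo "dog => ${positions[dog]}"   # dog => 1,7,
```

With this index, checking whether "cat" directly follows "the" only has to look at positions 0, 3 and 6 instead of scanning the whole text.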

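The second sed is optional. I've filtered the `setup.txt` file through sed to lowercase everything. This sed adds only a very small overhead, so efficiency-wise it's no big deal. Later on you'll be able to add more filtering to make sure the data is consistent with how the script uses it (e.g., getting rid of punctuation marks). Otherwise you could:

Get rid of it altogether: replace the corresponding line (the last one) with just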
```
done < "$setup"
```
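In this case, you'll have to trust the guy/girl who writes the `setup.txt` file.
Get rid of it as above, but still convert everything to lowercase. In this case, just below the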
```
while read w1 w2 p; do
```
line, add the following lines:
```
w1=${w1,,}
w2=${w2,,}
```
This is the bash way of lowercasing a string (it requires bash 4.0 or later).

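Caveats. The script will break if:

A number given in the `setup.txt` file starts with a 0 and contains an 8 or a 9. This is because bash will treat it as an octal number, in which 8 and 9 are invalid digits. There are workarounds.
The text in `input.txt` doesn't follow proper typographic practice, namely that a punctuation mark is always followed by a space. For example, if the input file contains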
```
The quick,brown,dog jumps over the lazy fox
```
then, after the sed processing, the text will look like
```
the quickbrowndog jumps over the lazy fox
```
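and the words quick, brown and dog won't be handled properly. You can replace the sed substitution `s/[,.:!?]//g` with `s/[,.:!?]/ /g` to convert these symbols to spaces instead. It's up to you, but in that case abbreviations such as e.g. and i.e. might not be handled correctly… it really depends on what you need to do.
A different character encoding is used… I don't really know how robust you need the script to be, or which languages and encodings you will consider.
(Add stuff here :).)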
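
For the first two caveats, here is a minimal sketch of possible workarounds, not a definitive fix. `to_decimal` and `spread_punct` are hypothetical helper names made up for illustration. The first assumes its argument contains only digits (which the script already checks), and the second only handles the same five punctuation marks as above:

```bash
#!/bin/bash
# Force base 10 so that a leading zero is not read as octal.
# to_decimal is a hypothetical helper, not part of the script above.
to_decimal() {
   echo "$((10#$1))"
}

to_decimal 08    # prints 8
to_decimal 010   # prints 10

# Turn punctuation into spaces instead of deleting it, so that
# "quick,brown" splits into two words. Also a hypothetical helper.
spread_punct() {
   sed 's/[,.:!?]/ /g'
}

echo 'The quick,brown,dog jumps' | spread_punct
# prints: The quick brown dog jumps
```

In the script, this would amount to adding `p=$((10#$p))` right after the digit check and using `s/[,.:!?]/ /g` as the first sed expression; the lowercasing expression can stay as it is.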

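About efficiency. I'd say the algorithm is fairly efficient. bash is probably not the best-suited language for this, but it's a lot of fun, and it's not that difficult after all if we look at it (fewer than 20 lines of relevant code, and even less!). If you only have 50 files of 50,000 words each, that's fine: you won't notice much of a difference between bash and perl/python/awk/C/you-name-it, since bash performs quickly enough on files of that size. Now, if you have 100,000 files each containing several million words, a different approach and a different language should be used (but I don't know which one).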

score 1 · Accepted Answer

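If:

efficiency matters enough that the solution may get complicated,
the text file can be large, and
the setup file can have many lines,

then I would do it like this:

As preparation, I would create:

A hash map with the index of the word as key and the word as value (named, say, WORDS). So WORDS[1] would be the first word, WORDS[2] the second word, and so on.
A hash map with the word as key and a list of indexes as value (named, say, INDEXES). So if WORDS[2] and WORDS[5] are "dog" and no others are, INDEXES["dog"] would yield the numbers 2 and 5. The value can be a dynamic array of indexes or a linked list. A linked list is better if there are words that occur many times.

You can read the text file once and fill both structures at the same time.

Processing:

For each line of the setup file, I would get the indexes in INDEXES[firstword] and check whether WORDS[index + wordsinbetween + 1] equals secondword. If it does, that's a hit.

Notes:

Preparation: you read the text file only once. For each word in the text file you perform fast operations whose performance isn't really affected by the number of words already processed.

Processing: you read the setup file only once. For each line, the operations you perform are affected only by the number of occurrences of firstword in the text file.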
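
The two structures and the lookup described above could be sketched in bash roughly as follows. This is only an illustration under my own assumptions: it needs bash 4 or later, an indexed array stands in for the index-to-word hash map, the index list is a space-separated string rather than a linked list, and the sample text and the function name `count_hits` are hypothetical:

```bash
#!/bin/bash
declare -a WORDS      # index => word (1-based, as in the description)
declare -A INDEXES    # word  => space-separated list of indexes

# Preparation: a single pass over the text fills both structures.
i=0
for w in the dog saw the cat and the dog ran; do
   i=$((i + 1))
   WORDS[i]=$w
   INDEXES[$w]+="$i "
done

# Processing: count the places where secondword sits
# wordsinbetween + 1 positions after firstword.
count_hits() {
   local first=$1 second=$2 between=$3 idx hits=0
   for idx in ${INDEXES[$first]}; do    # unquoted on purpose: split the list
      [[ ${WORDS[idx+between+1]} == "$second" ]] && ((++hits))
   done
   echo "$hits"
}

count_hits the dog 0   # prints 2
count_hits dog the 1   # prints 1
```

As written, this only counts firstword followed by secondword; calling `count_hits` a second time with the two words swapped would cover the other order, as the question's script does.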
