regex - 需要字符串匹配算法

Question

数据：

x.txt，简单文本文件（约 1 MB） y.txt 字典文件（约 10 万字）。

需要查找 y.txt 中的任何单词/s 是否存在于 x.txt 中。

需要一种算法，该算法消耗更少的执行时间和首选语言。

PS：请建议除 BRUTE FORCE METHOD 之外的任何算法。

我需要模式匹配而不是精确的字符串匹配。

例如：

x.txt ：“老秃鹰队于 4 月 27 日解散”

Y.txt：“建立”

输出应该是：在 X.txt 中找到建立：第 1 行

谢谢你。

score 2 · Accepted Answer

我不清楚你是需要这个来完成工作还是在家工作。如果您需要它来完成工作，那么：

#!/usr/bin/bash
Y=`cat y.txt | tr '\n' '|'`
echo "${Y%|}"
grep -E "${Y%|}" x.txt
if [ "$?" -eq 0 ]
then
    echo "found"
else
    echo "no luck"
fi

当您从文件中吸取所有模式，构建正则表达式（回显显示正则表达式）然后将其交给grep为您构建有限状态自动机时，很难被击败。这会飞起来，因为它最多比较文本中的每个字符一次。如果是家庭作业，那么我建议你参考 Cormen 等人的“算法简介”，或者龙书的前几章，这也将解释我刚才所说的。

忘了补充：y.txt 应该每行包含一个模式，但作为一个很好的副作用，你的模式不必是单个单词。

score 0 · Accepted Answer

假设，您的标准库中有任何Set实现，这里有一些伪代码：

dictionary = empty set

def populate_dict():
    for word in dict_file:
        add(dictionary, word)

def validate_text(text_file):
    for word in text_file:      ### O(|text_file|)
        if word in dictionary:  ### O(log |dictonary|)
            report(word)

populate_dict()
every_now_and_then(populate_dict)

这将为您提供输入文本文件和字典的长度而O(t * log d) 不是蛮力。我认为不可能有更快的速度，因为您无法更快地读取文件，也无法比.O(t * d)tdO(t)O(log d)

score 0 · Accepted Answer

这是我考虑了一段时间的搜索算法。基本上，该算法分为两个步骤。

在第一步中，来自 y.txt 的所有单词都被插入到树中。树中从根到叶的每条路径都是一个词。叶子是空的。

例如，单词 dog 和 day 的树如下。

<root>--<d>-<a>-<y>-<>
          \-<o>-<g>-<>

算法的第二部分是向下搜索。当你到达一片空的叶子时，你就找到了一个词。

Groovy 中的实现，如果需要更多评论，请询问

//create a tree to store the words in a compact and fast to search way
//each path of the tree from root to an empty leaf is a word
def tree = [:]
new File('y.txt').eachLine{ word->
    def t=tree
    word.each{ c ->
        if(!t[c]){
            t[c]=[:]
        }
        t=t[c]
    }
    t[0]=0//word terminator (the leaf)
}
println tree//for debug purpose
//search for the words in x.txt
new File('x.txt').eachLine{ str, line->
    for(int i=0; i<str.length(); i++){
        if(tree[str[i]]){
            def t=tree[str[i]]
            def res=str[i]
            def found=false
            for(int j=i+1; j<str.length(); j++){
                if(t[str[j]]==null){
                    if(found){
                        println "Found $res at line $line, col $i"
                        res=str[j]
                        found=false
                    }
                    break
                }else if(t[str[j]][0]==0){
                    found=true
                    res+=str[j]
                    t=t[str[j]]
                    continue
                }else{
                    t=t[str[j]]
                    res+=str[j]
                }
                found=false
            }
            if(found) println "Found $res at line $line, col $i"//I know, an ugly repetition, it's for words at the end of a line. I will fix this later
        }
    }
}

这是我的 y.txt

dog
day
apple
daydream

和 x.txt

This is a beautiful day and I'm walking with my dog while eating an apple.
Today it's sunny.
It's a daydream

输出如下：

$ groovy search.groovy
[d:[o:[g:[0:0]], a:[y:[0:0, d:[r:[e:[a:[m:[0:0]]]]]]]], a:[p:[p:[l:[e:[0:0]]]]]]
Found day at line 1, col 20
Found dog at line 1, col 48
Found apple at line 1, col 68
Found day at line 2, col 2
Found daydream at line 3, col 7

这个算法应该很快，因为树的深度不依赖于 y.txt 中的单词数。深度等于 y.txt 中最长单词的长度。

regex - 需要字符串匹配算法

数据：

3 回答 3

Related

Reference