python - Alternative ways to extract lines from text (python-regex)

Question

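I'm looking for a way to extract lines from a fairly large database in Python. I only need to keep the lines that contain one of my keywords. I thought I could use a regular expression for this and put together the code below. Unfortunately, it gives me an error (possibly also because my keywords, which are written on separate lines in the file listtosearch.txt, are numerous, close to 500).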

import re
data = open('database.txt').read() 
fileout = open("fileout.txt","w+")

with open('listtosearch.txt', 'r') as f:
    keywords = [line.strip() for line in f]

pattern = re.compile('|'.join(keywords))

for line in data:
    if pattern.search(line):
        fileout.write(line)

I also tried a double loop (over the keyword list and the database lines), but it takes too long to run.

The error I get is:

Traceback (most recent call last):
  File "/usr/lib/python2.7/re.py", line 190, in compile 
    return _compile(pattern, flags)   
  File "/usr/lib/python2.7/re.py", line 240, in _compile 
    p = sre_compile.compile(pattern, flags) 
  File "/usr/lib/python2.7/sre_compile.py", line 511, in compile 
    "sorry, but this version only supports 100 named groups" 
AssertionError: sorry, but this version only supports 100 named groups

Any suggestions? Thanks.

score 2 · Accepted Answer

You may want to look at the Aho–Corasick string matching algorithm. A working implementation in Python can be found here.

A simple example of using that module:

from pyahocorasick import Trie

words = ['foo', 'bar']

t = Trie()
for w in words:
    t.add_word(w, w)
t.make_automaton()

print [a for a in t.iter('my foo is a bar')]

>> [(5, ['foo']), (14, ['bar'])]

Integrating it into your code should be straightforward.
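For readers without that module, the idea can be sketched in pure Python. This is only a minimal illustration of the Aho–Corasick technique applied to line filtering, not the linked implementation; the names build_automaton, contains_keyword and filter_lines are hypothetical and chosen for this example.

```python
from collections import deque


def build_automaton(keywords):
    """Build trie transitions, failure links and output sets."""
    goto = [{}]    # goto[state][char] -> next state
    fail = [0]     # fail[state] -> fallback state on a mismatch
    out = [set()]  # out[state] -> keywords that end at this state

    # Phase 1: insert every non-empty keyword into the trie.
    for word in keywords:
        if not word:
            continue
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(word)

    # Phase 2: breadth-first pass that computes the failure links.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit keywords that end at the fallback state.
            out[nxt] |= out[fail[nxt]]
    return goto, fail, out


def contains_keyword(line, automaton):
    """Return True as soon as any keyword occurs in the line."""
    goto, fail, out = automaton
    state = 0
    for ch in line:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        if out[state]:
            return True
    return False


def filter_lines(lines, keywords):
    """Keep the lines that contain at least one keyword."""
    automaton = build_automaton(keywords)
    return [line for line in lines if contains_keyword(line, automaton)]
```

Each line is scanned once regardless of how many keywords there are, which is the property that makes this approach attractive for a list of roughly 500 keywords. Note that it matches substrings, so a keyword such as bar also matches inside rebar.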

score 1 · Accepted Answer

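First, I'm fairly sure you meant data = open('database.txt').readlines() rather than read(). Otherwise data is a single string rather than a list of lines, and your for line in data loop iterates over individual characters.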
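The difference is easy to see on a small in-memory example. This is a minimal sketch; io.StringIO stands in for the real file, and lines_containing is a hypothetical helper name.

```python
import io


def lines_containing(handle, needle):
    """Iterate over a file-like object one line at a time."""
    return [line for line in handle if needle in line]


text = u"first foo\nsecond\n"

# Iterating over a string yields single characters, not lines.
chars = list(text)[:3]

# Iterating over a file object yields whole lines.
matches = lines_containing(io.StringIO(text), u"foo")
```

Iterating over the file object directly also avoids loading the whole database into memory, which readlines() would do.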

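At this point you are really looking for a solution that is indexed by keyword, and a naive search will no longer be efficient enough to give you timely results.

There really is no other approach that is more efficient or simpler. You will have to grit your teeth and accept the cost of going through the entire database.

Besides, if it fits entirely in memory, your database is not that big :)

That said, there are other approaches that may be more efficient:

Put your keywords in a set, then tokenize the input data into words and look each of them up in the set: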

data = open('database.txt').readlines()
fileout = open("fileout.txt","w+")

with open('listtosearch.txt', 'r') as f:
    keywords = [line.strip() for line in f]

keywords = set(keywords)

for line in data:
    # You might have to be smarter about splitting the line to
    # take things like punctuation into consideration.
    for word in line.split():
        if word in keywords:
            fileout.write(line)
            break

Here is an example of tokenization that takes punctuation into account.
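One possible way to handle punctuation, shown here as a minimal sketch with the standard re module (not necessarily what the linked example does), is to extract runs of word characters instead of splitting on whitespace. WORD_RE and line_has_keyword are hypothetical names.

```python
import re

# Runs of letters, digits and underscores count as one token.
WORD_RE = re.compile(r"\w+")


def line_has_keyword(line, keywords):
    """Return True if any token of the line is in the keyword set."""
    return any(token in keywords for token in WORD_RE.findall(line))
```

With this tokenizer, "my foo, is" yields the token foo rather than "foo,". The trade-off is that a keyword made of several words, or one that itself contains punctuation, can never match.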

score 1 · Accepted Answer

Here is my code:

import re
data = open('database.txt', 'r')
fileout = open("fileout.txt","w+")

with open('listtosearch.txt', 'r') as f:
    keywords = [line.strip() for line in f]

# one big pattern can take time to match, so you have a list of them
patterns = [re.compile(keyword) for keyword in keywords]

for line in data:

    for pattern in patterns:
        if not pattern.search(line):
            break
    else:
        fileout.write(line)

I tested it with the following files:

database.txt

"Name jhon" (1995)
"Name foo" (2000)
"Name fake" (3000)
"Name george" (2000)
"Name george" (2500)

listtosearch.txt

"Name (george)"
\(2000\)

This is what I got in fileout.txt:

"Name george" (2000)

So this should work on your machine too.
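Note that the loop above writes a line only when every pattern matches it. If the goal is to keep lines that contain any one keyword, and the keywords are meant as literal text rather than regular expressions (an assumption), a single escaped alternation is one option. The traceback in the question is raised when a pattern contains more than 100 groups, which would happen if many keywords contain parentheses; escaping them avoids creating groups. The function names below are hypothetical.

```python
import re


def build_any_pattern(keywords):
    """Compile one alternation that treats each keyword as literal text."""
    escaped = [re.escape(k) for k in keywords if k]
    if not escaped:
        return None  # nothing to search for
    return re.compile("|".join(escaped))


def filter_any(lines, keywords):
    """Keep the lines that contain at least one keyword."""
    pattern = build_any_pattern(keywords)
    if pattern is None:
        return []
    return [line for line in lines if pattern.search(line)]
```

Because the parentheses are escaped, the compiled pattern has no groups at all, however long the keyword list is.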

score 1 · Accepted Answer

This may not be an efficient solution, but try using a set and its intersection method.

from_db = tuple([line.rstrip("\n") for line in open('database.txt') if line.rstrip('\n')])
keywords = set([line.rstrip("\n") for line in open('listtosearch.txt') if line.rstrip('\n')])
with open("output_file.txt", "w") as fp:
    for line in from_db:
        line_set = set(line.split(" "))
        if line_set.intersection(keywords):
            fp.write(line + "\n")

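Intersection checks for any strings in common. Since hashes are compared, I would expect the search to be faster than iterating over the whole list again and again.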
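A small sketch of the same check, using set.isdisjoint so that no intermediate set has to be built for each line. The helper name has_common_token is hypothetical.

```python
def has_common_token(line, keywords):
    """Return True if any space-separated token of the line is a keyword."""
    return not keywords.isdisjoint(line.split(" "))
```

As with the code above, only whole space-separated tokens are compared, so a keyword followed by a comma or wrapped in quotes in the database line is not found.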
