I have a database of words and a dataset with text lines. Every time there is a word in the line of the text file that appears in the words file as well, I want to do a trick. My code looks like:

import re
f = open(r"words.txt")
flist = f.readlines()
print len(flist)
d = open(r"text.txt", "r")
dlist = d.readlines()

for line in flist:
    lowline = line.lower()
    for word in dlist:
        lowword = word.lower()
        if lowword in lowline:
            *trick*

However, this code finds no matches, although many words are exactly the same in both files. Any thoughts on this one?


1 Answer


First save the words from the database in a set, applying str.strip and str.lower to them. str.strip removes leading and trailing whitespace characters such as '\n'.
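To see why the stripping matters, here is a small sketch with made-up strings (the sample word and line are hypothetical, not taken from your files). Lines returned by readlines() keep their trailing '\n', so a raw word followed by a newline is not found inside a longer line:

```python
# Hypothetical sample data, shaped like the strings readlines() returns.
raw_word = "Apple\n"
raw_line = "I like apple pie\n"

# The trailing newline (and the case difference) prevents a match.
found_raw = raw_word in raw_line

# After strip() and lower() the word is found inside the line.
clean_word = raw_word.strip().lower()
clean_line = raw_line.strip().lower()
found_clean = clean_word in clean_line

print(found_raw)
print(found_clean)
```

Note that the in operator on strings is a substring test, so "app" would also match here. Splitting the line into words and comparing whole words avoids that.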

Sets provide O(1) lookup on average, and set intersection will be far more efficient than your current O(n^2) approach.

Then iterate over each line of the text dataset, applying str.strip and str.lower before looking its words up in the set.

with open(r"words.txt") as f1, open(r"text.txt", "r") as f2:

    dlist = set(line.strip().lower() for line in f1)  # set of words from the database
    for line in f2:
        line = line.strip().lower()     # use strip to remove '\n'
        words = set(line.split())       # use split to get the words from the line
                                        # and convert them into a set
        common_words = words & dlist    # use set intersection to find common words
        for word in common_words:
            *trick*

Swap f1 and f2 if needed, as I was unsure which file is the database and which is the text dataset.
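The same idea can be tried without any files. This is only a sketch: find_common_words is a hypothetical helper name, and the two lists are made-up stand-ins for the contents of words.txt and text.txt.

```python
def find_common_words(database_words, text_lines):
    """Yield (line_number, matched_words) for each text line with a match."""
    # Normalise the database once; blank entries are skipped.
    known = set(w.strip().lower() for w in database_words if w.strip())
    for number, line in enumerate(text_lines, 1):
        # Split the cleaned line into a set of whole words.
        words = set(line.strip().lower().split())
        common = words & known  # set intersection
        if common:
            yield number, sorted(common)


# Hypothetical in-memory stand-ins for words.txt and text.txt.
database_words = ["Apple\n", "banana\n", "\n"]
text_lines = ["I like apple pie\n", "Nothing here\n", "BANANA and Apple\n"]

matches = list(find_common_words(database_words, text_lines))
for number, common in matches:
    print(number, common)
```

One caveat: split() only breaks on whitespace, so a word with punctuation attached (for example "apple,") will not match "apple" unless you strip the punctuation first.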

Answered 2013-06-26T18:50:12.433