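I have a text file in which each line contains several words. Given a set of query words, I need to count the lines in which query words occur together: that is, the number of lines containing two of the query words, the number of lines containing three of them, and so on.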

I tried the following code. Note that rest(list, word) removes "word" from "list" and returns the updated list, and linecount is the number of lines in the original file.

raw=open("raw_dataset_1","r")
queryfile=open("queries","r")
query=queryfile.readline().split()
query_size=len(query)
two=0
three=0
four=0

while linecount>0:
    line=raw.readline().split()
    if query_size>=2:
        for word1 in query:
            beta=rest(query,word1)
            for word2 in beta:
                if (word1 in line) and (word2 in line):
                    two+=1
                    print line
    if (query_size>=3):
        for word3 in query:
            beta=rest(query,word3)
            for word4 in beta:
                gama=rest(beta,word4)
                for word5 in gama:
                    if (((word3 in line) and (word4 in line)) and (word5 in line)):
                        three+=1
                        print line
    linecount-=1

print two
print three

It works, albeit with redundancy (I can divide two by 2 to get the desired number). Is there a better way to do this?
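The redundancy comes from the nested loops visiting ordered pairs and ordered triples of query words. The sketch below reproduces the counting for a single line. It assumes Python 3 and assumes that rest (not shown in the question) returns a copy of the list without the given word; nested_loop_counts is a made-up name for illustration. Under those assumptions, a line containing exactly two query words adds 2 to two, while a line containing three query words adds 6 to both two and three, so dividing two by 2 gives the number of co-occurring pairs rather than the number of lines.

```python
def rest(words, word):
    """Assumed behaviour of the question's helper: a copy of `words` without `word`."""
    return [w for w in words if w != word]


def nested_loop_counts(line_words, query):
    """Return (two, three) as the question's loops would increment them for one line."""
    two = three = 0
    # Ordered pairs of distinct query words.
    for word1 in query:
        for word2 in rest(query, word1):
            if word1 in line_words and word2 in line_words:
                two += 1
    # Ordered triples of distinct query words.
    for word3 in query:
        beta = rest(query, word3)
        for word4 in beta:
            for word5 in rest(beta, word4):
                if word3 in line_words and word4 in line_words and word5 in line_words:
                    three += 1
    return two, three


query = ["a", "b", "c"]
print(nested_loop_counts("a b x".split(), query))  # line with two query words
print(nested_loop_counts("a b c".split(), query))  # line with three query words
```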

2 Answers

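I would take a more general approach. Assuming query is your list of query words and raw_dataset_1 is the name of the file you are analysing, I would do the following: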

# wordcount[k] is the number of lines containing k query words (k = 0, 1, 2, 3, 4).
wordcount = [0, 0, 0, 0, 0]
for line in open("raw_dataset_1").readlines():
    # Split the line into words so that a query word only matches a whole word,
    # not a substring of a longer word.
    words = line.split()
    # Loop over each query word, check whether it occurs in the line, and count the hits.
    # The list comprehension keeps only those elements of the query list (query)
    # that occur in the line (if query_word in words).
    # E.g. if the line contains three query words, those three end up in the list.
    # We do not care which words they are, so we only take the length of the list (len).
    number_query_words_found = len([query_word for query_word in query if query_word in words])
    if number_query_words_found < len(wordcount):
        # Increment the line count. The index is the number of query words present.
        wordcount[number_query_words_found] += 1

print("Number of lines with 2 query words: %d" % wordcount[2])
print("Number of lines with 3 query words: %d" % wordcount[3])

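This code is untested and can be optimized. The file is read in full (inefficient for larger files), and wordcount is a static list that should really be built dynamically (to allow any number of matching words). But something like this should work, unless I have misunderstood your question. For list comprehensions, see e.g. here.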
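The dynamic variant mentioned above could look roughly like the following sketch. It assumes Python 3, whitespace-separated words, and exact, case-sensitive matching; count_lines_by_matches is a hypothetical name, not part of the answer's code. A collections.Counter replaces the fixed-size list, and the function accepts any iterable of lines, so passing an open file object reads one line at a time.

```python
from collections import Counter


def count_lines_by_matches(lines, query):
    """Map k -> number of lines containing exactly k distinct query words.

    `lines` is any iterable of strings, for example an open file object.
    (Hypothetical helper written for illustration.)
    """
    query_set = set(query)
    counts = Counter()
    for line in lines:
        # Split on whitespace so only whole words are compared.
        found = len(query_set.intersection(line.split()))
        counts[found] += 1
    return counts


# Small in-memory example instead of a real file.
sample = ["a b c", "a x y", "b c z", "q r s"]
counts = count_lines_by_matches(sample, ["a", "b", "c"])
print("lines with 2 query words:", counts[2])
print("lines with 3 query words:", counts[3])
```

With a real file, the call would be wrapped in `with open("raw_dataset_1") as raw:` and `raw` passed as `lines`. A Counter returns 0 for keys it has never seen, so asking for a count that never occurred does not raise an error.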

answered 2013-07-03T05:37:40.273

I would use sets for this:

raw = open("raw_dataset_1", "r")
queryfile = open("queries", "r")
query_line = queryfile.readline()
query_words = query_line.split()
query_set = set(query_words)
query_size = len(query_set)  # Note that this isn't actually used below

# Counters for lines containing exactly 2, 3 and 4 of the query words
two = 0
three = 0
four = 0

for line in raw:  # Iterating over a file gives you one line at a time
    words = line.strip().split()
    word_set = set(words)
    common_set = query_set.intersection(word_set)
    if len(common_set) == 2:
        two += 1
    elif len(common_set) == 3:
        three += 1
    elif len(common_set) == 4:
        four += 1

raw.close()
queryfile.close()

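Of course, you may want to save the matching lines to a results file or something similar, rather than just counting occurrences. But this should give you the general idea: using sets will greatly simplify your logic.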
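If the matching lines themselves are needed, the same set intersection can group them by how many query words they contain. This is a minimal sketch, assuming Python 3 and whitespace-separated words; group_lines_by_matches and its min_matches parameter are hypothetical names introduced here for illustration.

```python
from collections import defaultdict


def group_lines_by_matches(lines, query_words, min_matches=2):
    """Return {k: [lines containing exactly k distinct query words]} for k >= min_matches."""
    query_set = set(query_words)
    groups = defaultdict(list)
    for line in lines:
        # Words shared between the query and this line.
        common = query_set & set(line.split())
        if len(common) >= min_matches:
            groups[len(common)].append(line.rstrip("\n"))
    return dict(groups)


# Small in-memory example instead of a real file.
sample = ["red green blue\n", "red car\n", "blue green sky\n"]
groups = group_lines_by_matches(sample, ["red", "green", "blue"])
for k in sorted(groups):
    print(k, groups[k])
```

Writing each group to a results file is then a loop over groups that calls write on an open file. The counts asked for in the question are simply the lengths of the lists.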

answered 2013-07-03T05:53:02.283