
I have a text file with a bunch of text on each line (there are no line numbers in the actual file), like this:

Line #: Text:
0 This is some text
1 More text
2 Whats for lunch

I want a function that returns a dictionary mapping each word to the line numbers it occurs on, essentially designing an inverted index.

i.e. {'This': {0}, 'text': {0, 1}, 'for': {2}, ...}

After scanning in the text file (this takes 0.18 s), I put the lines into a list of lists, so that each position in the list stores a split line, i.e.:

[['This', 'is', 'some', 'text'], ['More', ...] ...]

After that I use enumerate() to extract the positions and build the dictionary. I already have a solution, but it is so ugly and took me so long that I want to see a more elegant one.

For reference, my algorithm ran for 882.28 s, i.e. 15 minutes, on 1099 lines and 753210 words. In other words, definitely not Pythonic.

import time

def invidx(strlist):
    # time the algorithm's execution
    start = time.time()

    f = open(strlist, 'r')
    wordLoc = []
    for line in f:    
        s = line.split()
        wordLoc.append(list(s)) 
    f.close()

    # benchmark
    print 'job completed in %.2fs' % (time.time() - start) 

    try:
        q = {}
        for a, b in enumerate(wordLoc):
            l = set()
            for w in b :
                if w not in q:
                    l = {a for a, b in enumerate(wordLoc) if w in b}
                    q[w] = l
    except KeyboardInterrupt:
        print 'Interrupt detected: aborting...'
        print 'Failed to complete indexing, ran for %.2fs' % \
            (time.time() - start)
        exit(0)                  

    return q

Edit:

Code added above as requested. Go easy on me, guys.

4 Answers


You can use enumerate to get the line numbers while initially scanning the file, and add the line numbers to a dict of sets as you go:

myfile.txt:

a b c
b x y
a c b

Indexing it:

index = {}
with open('myfile.txt') as F:
    for line_num, line in enumerate(F):
        for word in line.split():
            index.setdefault(word, set()).add(line_num)

index
=> {'a': set([0, 2]),
 'b': set([0, 1, 2]),
 'c': set([0, 2]),
 'x': set([1]),
 'y': set([1])}
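
Once built, the index answers queries with ordinary set operations; for example, intersecting the per-word sets finds the lines containing all of several words. A minimal sketch (the lookup helper is hypothetical, not part of the answer above):

def lookup(index, *words):
    # intersect the per-word line sets; an empty query matches no lines
    sets = [index.get(w, set()) for w in words]
    return set.intersection(*sets) if sets else set()

lookup(index, 'a', 'b')
=> set([0, 2])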
answered 2013-07-05T19:36:34.910

The line causing the slowdown is this one:

l = {a for a, b in enumerate(wordLoc) if w in b}

Each time you find a word you haven't seen yet, you re-enumerate every line to see whether it contains that word. That contributes O(NumberOfUniqueWords * NumberOfLines) operations, which is quadratic in the size of the input.

You are already enumerating every word of every line. Why not add to the index as you go?

for w in b :
    if w not in q: q[w] = []
    q[w].append(a)

This should take O(NumberOfWords) time, which is linear (rather than quadratic-ish) in the size of the input. You touch everything once, instead of once per unique word.
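
Folded into a single pass over the file, the whole function shrinks considerably. A sketch (this borrows the with/enumerate pattern from the other answers and keeps the question's benchmark print; it is not the answerer's exact code):

import time

def invidx(strlist):
    start = time.time()
    q = {}
    with open(strlist, 'r') as f:
        for a, line in enumerate(f):   # a is the line number
            for w in line.split():
                if w not in q: q[w] = []
                q[w].append(a)         # a word repeated on one line is appended twice
    print 'job completed in %.2fs' % (time.time() - start)
    return q

If you want sets of line numbers, as in the example output in the question, replace the two list operations with q.setdefault(w, set()).add(a).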

answered 2013-07-05T19:52:23.937

You can use collections.defaultdict:

from collections import defaultdict
dic = defaultdict(set)
with open('abc') as f:
    for i, line in enumerate(f):  # enumerate yields the line number along with the line
        words = line.split()      # split the line using str.split()
        for word in words:        # iterate over the words and add each to its set
            dic[word.lower()].add(i)
print dic

Output:

defaultdict(<type 'set'>,
{'whats': set([2]),
 'for': set([2]),
 'this': set([0]),
 'text': set([0, 1]),
 'is': set([0]),
 'some': set([0]),
 'lunch': set([2]),
 'more': set([1])})
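
One caveat when querying an index built on defaultdict: subscripting a missing key silently inserts an empty set into the dictionary. A small illustration, assuming the dic built above:

dic['banana']    # returns set([]) and quietly adds 'banana' as a new key
'banana' in dic
=> True

To query without mutating the index, use dic.get(word, set()) or test membership first. Note also that the words were lowercased during indexing, so queries should be lowercased too.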
answered 2013-07-05T19:37:26.483

This seems to work, and I'm fairly sure it's faster than your version:

from time import time

def invidx(strlist):
    # time the algorithm's execution
    start = time()

    wordLocs = []
    unique_words = set()
    with open(strlist, 'r') as f:
        for line in f:
            words = line.split()
            unique_words.update(words)
            wordLocs.append(set(words))

    # benchmark
    print 'job completed in %.2fs' % (time() - start)

    try:
        q = {}
        for unique_word in unique_words:
            occurrences = set()
            for line, words in enumerate(wordLocs):
                if unique_word in words:
                    occurrences.add(line)
            q[unique_word] = occurrences

    except KeyboardInterrupt:
        print ('Interrupt detected: aborting...\n'
               'Failed to complete indexing, ran for %.2fs' % (time() - start))
        exit(0)

    return q

from pprint import pprint
pprint(invidx('strlist.txt'))

Output for a simple test file:

job completed in 0.00s
{'More': set([1]),
 'This': set([0]),
 'for': set([2]),
 'is': set([0]),
 'lunch': set([2]),
 'some': set([0]),
 'text': set([0, 1]),
 'whats': set([2])}
answered 2013-07-05T20:24:09.103