python - My inverted index is very slow, any suggestions?

Question

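I have a text file in which every line holds a bunch of text. (The actual file has no line numbers.) Like this:

line#: text:
0 This is some text
1 More text
2 whats for lunch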

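I want a function that returns a dictionary mapping each word to the numbers of the lines it occurs on, which is essentially an inverted index.

i.e. {'This': {0}, 'text': {0, 1}, 'for': {2}, ... }

After scanning the text file (which takes 0.18 seconds), I put the lines into a list of lists, so that each position in the list stores one split line, i.e.: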

[['This', 'is', 'some', 'text'], ['More', ...] ...]

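After that I use enumerate() to extract the positions and build the dictionary. I already have a solution, but it is so ugly and took me so long that I would like to see a more elegant one.

For reference, my algorithm ran for 882.28 seconds, i.e. about 15 minutes, on 1099 lines and 753210 words. In other words, definitely not Pythonic.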

import time

def invidx(strlist):
    # record the start time so the running time can be reported
    start = time.time()

    f = open(strlist, 'r')
    wordLoc = []
    for line in f:    
        s = line.split()
        wordLoc.append(list(s)) 
    f.close()

    # benchmark
    print 'job completed in %.2fs' % (time.time() - start) 

    try:
        q = {}
        for a, b in enumerate(wordLoc):
            l = set()
            for w in b :
                if w not in q:
                    l = {a for a, b in enumerate(wordLoc) if w in b}
                    q[w] = l
    except KeyboardInterrupt:
        print 'Interrupt detected: aborting...'
        print 'Failed to complete indexing, ran for %.2fs' % \
            (time.time() - start)
        exit(0)                  

    return q

Edit:

Code added above as requested. Go easy on me, guys.

score 3 · Accepted Answer

You can use enumerate to get the line numbers while you do the initial scan of the file, and add each line number to a dict of sets as you go.

myfile.txt:

a b c
b x y
a c b

Index it:

index = {}
with open('myfile.txt') as F:
    for line_num, line in enumerate(F):
        for word in line.split():
            index.setdefault(word, set()).add(line_num)

index
=> {'a': set([0, 2]),
 'b': set([0, 1, 2]),
 'c': set([0, 2]),
 'x': set([1]),
 'y': set([1])}
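
In case it helps, here is the same idea wrapped in a small Python 3 function. It is only a sketch: the name build_index is made up for illustration, and it accepts any iterable of lines (an open file object works too) so that it can be tried without a file on disk.

```python
def build_index(lines):
    """Map each word to the set of line numbers it appears on.

    `lines` is any iterable of strings, e.g. an open text file.
    (Hypothetical helper name, not taken from the answer above.)
    """
    index = {}
    for line_num, line in enumerate(lines):
        for word in line.split():
            # setdefault creates the empty set the first time a word is seen
            index.setdefault(word, set()).add(line_num)
    return index


sample = ["a b c", "b x y", "a c b"]
index = build_index(sample)
assert index["a"] == {0, 2}
assert index["b"] == {0, 1, 2}
```

With a real file this would be called as build_index(open('myfile.txt')), ideally inside a with block so the file gets closed.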

score 2 · Accepted Answer

The line causing the slowdown is this one:

l = {a for a, b in enumerate(wordLoc) if w in b}

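Every time you find a word you haven't seen yet, you re-enumerate every line to check whether it contains that word. That contributes O(NumberOfUniqueWords * NumberOfLines) membership tests, each of which is itself a linear scan of a list, so the total is roughly quadratic in the size of the input.

You are already enumerating every word of every line. Why not just add as you go?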

for w in b :
    if w not in q: q[w] = []
    q[w].append(a)

This should take O(NumberOfWords) time, which is linear in the size of the input rather than quadratic(ish). You touch each item once, instead of once per unique word.
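
To check that the single pass really gives the same result as the rescanning version, something like the following could be used. It is a rough Python 3 sketch: slow_index and fast_index are names invented here, and fast_index stores sets rather than lists so that a word repeated on one line is not recorded twice.

```python
def slow_index(word_loc):
    """The approach from the question: rescan every line for each new word."""
    q = {}
    for line in word_loc:
        for w in line:
            if w not in q:
                q[w] = {i for i, words in enumerate(word_loc) if w in words}
    return q


def fast_index(word_loc):
    """Single pass: record the line number as each word is encountered."""
    q = {}
    for i, line in enumerate(word_loc):
        for w in line:
            if w not in q:
                q[w] = set()
            q[w].add(i)
    return q


word_loc = [['This', 'is', 'some', 'text'],
            ['More', 'text'],
            ['whats', 'for', 'lunch']]

# Both functions build the same index; only the amount of work differs.
assert slow_index(word_loc) == fast_index(word_loc)
```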

score 1 · Accepted Answer

You can use collections.defaultdict:

from collections import defaultdict
dic = defaultdict(set)
with open('abc') as f:
   for i,line in enumerate(f): #enumerate returns the line number as well as the line
       words = line.split()    #split the line using str.split()
       for word in words:      #iterate over words and add to its corresponding set
           dic[word.lower()].add(i)
print dic

Output:

defaultdict(<type 'set'>,
{'whats': set([2]),
 'for': set([2]),
 'this': set([0]),
 'text': set([0, 1]),
 'is': set([0]),
 'some': set([0]),
 'lunch': set([2]),
 'more': set([1])})
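
Once the index exists, looking things up is just set arithmetic. Below is a small Python 3 sketch of how that might look, assuming the same lower-casing as above. The helper names build_index and lines_with_all are hypothetical and only for illustration.

```python
from collections import defaultdict


def build_index(lines):
    """Case-insensitive inverted index: word -> set of line numbers."""
    index = defaultdict(set)
    for i, line in enumerate(lines):
        for word in line.split():
            index[word.lower()].add(i)
    return index


def lines_with_all(index, words):
    """Return the line numbers that contain every word in `words`."""
    # .get() avoids creating empty entries in the defaultdict for unknown words
    sets = [index.get(w.lower(), set()) for w in words]
    return set.intersection(*sets) if sets else set()


lines = ["This is some text", "More text", "whats for lunch"]
index = build_index(lines)
assert lines_with_all(index, ["text"]) == {0, 1}
assert lines_with_all(index, ["some", "TEXT"]) == {0}
```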

score 0 · Accepted Answer

This seems to work, and I'm fairly sure it's faster than your version:

from time import time

def invidx(strlist):
    # record the start time so the running time can be reported
    start = time()

    wordLocs = []
    unique_words = set()
    with open(strlist, 'r') as f:
        for line in f:
            words = line.split()
            unique_words.update(words)
            wordLocs.append(set(words))

    # benchmark
    print 'job completed in %.2fs' % (time() - start)

    try:
        q = {}
        for unique_word in unique_words:
            occurrences = set()
            for line, words in enumerate(wordLocs):
                if unique_word in words:
                    occurrences.add(line)
            q[unique_word] = occurrences

    except KeyboardInterrupt:
        print ('Interrupt detected: aborting...\n'
               'Failed to complete indexing, ran for %.2fs' % (time() - start))
        exit(0)

    return q

from pprint import pprint
pprint(invidx('strlist.txt'))

Output for the simple test file:

job completed in 0.00s
{'More': set([1]),
 'This': set([0]),
 'for': set([2]),
 'is': set([0]),
 'lunch': set([2]),
 'some': set([0]),
 'text': set([0, 1]),
 'whats': set([2])}
