python - 使用 Python 在列表中查找字符串匹配计数

Question

我阅读了一个文本文件进行一些分析，每个单词都附加到一个列表并给出一个 id

#!/usr/bin/python3
with fi as myfile:
  for line in myfile:
    for item in line.split(' '):
      db[0].append(id_+1)
      db[2].append(item)
      ...more stuff

然后我通过列表搜索每个单词以找到其匹配项，并将计数存储为sim1。如果找到匹配项，我会测试下一个单词是否也与连续单词匹配，并将其计数存储为sim2。对于sim3也是如此。我的代码如下所示：

for i in range(id_-3):
  sim1=0
  sim2=0
  sim3=0
  for j in range(id_-3):
    if i==j:  continue;
    if db[2][i] == db[2][j]:
      sim1 += 1
      if db[2][i+1] == db[2][j+1]:
        sim2 += 1
        if db[2][i+2] == db[2][j+2]:
          sim3 += 1
  db[3].append(sim1)
  db[4].append(sim2)
  db[5].append(sim3)

这行得通，但是太慢了！我相信python提供了更快的搜索方法，但我还是一个Py新手！

score 2 · Accepted Answer

算法的缓慢主要来自这样一个事实，即您有一个迭代 len(db[2]) 次的内部循环，该内部循环包含在一个也迭代 len(db[2]) 次的外部循环中。这意味着内部代码正在执行 len(db[2])^2 次。例如，如果您的文件很大并且要解析 5000 个单词，那么代码会运行 5000^2 = 25,000,000 次！

因此，解决问题的攻角是找到一种方法来消除或显着降低该内循环的成本。下面是一个示例解决方案，它只需要遍历 len(db[2]) 一次，然后执行第二个单独的循环，该循环遍历一组更小的项目。第二次迭代中有一些内部循环，但它们运行的次数更少，而且成本几乎无关紧要。

我使用重约 48kb 的文本文件对您的算法和我的算法进行计时。你的算法在我的计算机上平均大约 14 秒，而我的算法平均为 0.6 秒。因此，通过去掉那个内部循环，算法现在快了 23 倍以上。我还进行了一些其他小的优化，例如将比较更改为数字而不是文本之间的比较，以及从一开始就以完整大小创建存储数组以避免使用 append()。Append() 使解释器根据需要动态增加数组的大小，这样比较慢。

from collections import defaultdict

# Create zero-filled sim1, sim2, sim3 arrays to avoid append() overhead
len_ = len(db[2]) - 2
for _ in range(3):
    db.append([0] * len_)

# Create dictionary, containing d['word'] = [count, [indexes]]
# Do just one full iteration, and make good use of it by calculating
# sim1 (as 'count') and storing an array of number indexes for each word,
# allowing for a very efficient loop coming up...
d = defaultdict(lambda: [0, []])
for index, word in enumerate(db[2]):
    if index < len_:
        # Accumulate sim1
        d[word][0] += 1 
    # Store all db[2] indexes where this word exists
    d[word][1].append(index) 

# Now loop only through words which occur more than once (smaller loop)
for word, (count, indexes) in d.iteritems():
    if count > 1:
        # Place the sim1 values into the db[3] array
        for i in indexes:
            if i < len_:
                db[3][i] = count - 1
                # Look for sim2 matches by using index numbers
                next_word = db[2][i+1]
                for next_word_index in d[next_word][1]:
                    if next_word_index - 1 != i and next_word_index - 1 in indexes:
                        # Accumulate sim2 value in db[4]
                        db[4][i] += 1
                        # Look for sim3 matches
                        third_word = db[2][i+2]
                        if third_word == db[2][next_word_index + 1]:
                            # Accumulate sim3 value in db[5]
                            db[5][i] += 1

score -2 · Accepted Answer

是的，您正在执行字符串比较。这真的很慢。您想要的是将您的字符串编译为常规模式。:)

看看来自 python 的库re 。蟒蛇：重新

python - 使用 Python 在列表中查找字符串匹配计数

2 回答 2

Related

Reference