design-patterns - Finding patterns from a list of words

Question

Is there a way to match a list of words against a file? I have two files, A and B. A contains a list of words:

A
abcd
xyzt

and file B:

B
abcdefgh abcd
abcdytqw wert
zswertyu xyzt

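I want to extract lines 1 and 3 from file B. That is, I want to match the words in A against the second column of B and, where there is a match, print that line of B.

The output would be: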

abcdefgh abcd
zswertyu xyzt

I have already tried this with grep inside a for loop in Perl, but it is too slow. My list has more than 100K entries.

score 0 · Accepted Answer

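This approach loads all of A into a set to speed things up. If you do not load A into memory, you have to compare each line of A against the whole of file B. With A in memory, you only need to read each file once. In addition, because A is held in a set, checking whether the second column of B is in A is a hash lookup rather than a scan of the list, so it is much faster.

Here is an example in Python: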

#!/usr/bin/env python3

def load_data(filename):
    """Read the word list into a set for fast membership tests."""
    Aset = set()
    with open(filename, 'r') as infile:
        for line in infile:
            word = line.strip()
            if word:  # skip blank lines
                Aset.add(word)
    return Aset

if __name__ == '__main__':
    Aset = load_data('A')

    with open('B', 'r') as infile:
        for line in infile:
            # The column checked is the last one on each line
            # (the second column in the sample data).
            fields = line.split()
            # Blank lines in B are skipped instead of raising IndexError.
            if fields and fields[-1] in Aset:
                print(line.strip())
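
To check the approach against the sample data in the question, the same logic can be wrapped in a function that takes iterables instead of file names. This is only a minimal sketch: `filter_by_column` and its `column` parameter are hypothetical names that do not appear in the answer, and it matches on the second column (index 1), as the question asks, rather than on the last column.

```python
def filter_by_column(words, lines, column=1):
    """Yield each line whose whitespace-separated field at `column` is in `words`."""
    # Build the lookup set once; blank entries are dropped.
    wanted = {w.strip() for w in words if w.strip()}
    for line in lines:
        fields = line.split()
        # Lines that are blank or too short are skipped, not treated as errors.
        if len(fields) > column and fields[column] in wanted:
            yield line.rstrip('\n')


# Sample data copied from the question.
a_words = ['abcd', 'xyzt']
b_lines = ['abcdefgh abcd\n', 'abcdytqw wert\n', 'zswertyu xyzt\n']
result = list(filter_by_column(a_words, b_lines))
print('\n'.join(result))
```

With the sample data this selects lines 1 and 3 of B. Open file objects can be passed in place of the two lists, since both arguments only need to be iterable.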

This will not work if the machine does not have enough (free) memory to load all of file A into a set.
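
If A really were too large for a single set, one possible workaround is to load A in fixed-size chunks and scan B once per chunk, recording which line numbers of B matched. This is a different technique from the answer, sketched under assumptions: `matching_line_numbers`, `chunk_size`, and the file paths are hypothetical, and the set of matched line numbers must itself fit in memory. It trades memory for extra passes over B, so it is only worth considering when the single-set version fails.

```python
from itertools import islice


def matching_line_numbers(a_path, b_path, chunk_size=100000):
    """Return sorted 0-based line numbers of B whose last field appears in A."""
    matched = set()
    with open(a_path) as a_file:
        while True:
            # Hold at most chunk_size lines of A in memory at a time.
            lines = list(islice(a_file, chunk_size))
            if not lines:
                break  # reached the end of A
            chunk = {w.strip() for w in lines} - {''}
            # One full pass over B for each chunk of A.
            with open(b_path) as b_file:
                for number, line in enumerate(b_file):
                    fields = line.split()
                    if fields and fields[-1] in chunk:
                        matched.add(number)
    return sorted(matched)
```

A second pass over B can then print the lines at the returned positions, which keeps the output in B's original order.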
