python - Check if a string is in a 2-GB list of strings in Python

Question

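I have a large 2 GB file, A.txt, containing a list of strings: ['Question','Q1','Q2','Q3','Ans1','Format','links',...].

Now I have another, larger file (1 TB) that contains the above strings in the second position. Sample lines: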

a, Question, b
The, quiz, is
This, Q1, Answer
Here, Ans1, is
King1, links, King2
programming,language,drupal,
.....

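I want to keep only the lines whose second position contains one of the strings from the list stored in A.txt. That is, I want to keep (and store in another file) the lines below: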

a, Question, b
This, Q1, Answer
Here, Ans1, is
King1, links, King2

I know how to do this with any() when the list in A.txt has about 100 entries. But I don't know how to do it when the list in A.txt is 2 GB.

score 8 · Accepted Answer

Don't use a list; use a set instead.

Read the first file into a set:

with open('A.txt') as file_a:
    words = {line.strip() for line in file_a}

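2 GB of words is not too much to store in a set, provided the machine has enough RAM; expect the set to need more memory than the file takes on disk.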

Now you can test membership in words in O(1) constant time:

if second_word in words:
    # ....

Open the second file and process it line by line, perhaps using the csv module if the words on each line are comma-separated.
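Putting the steps above together, a minimal sketch might look like the following. The function name filter_lines and the file names B.txt and kept.txt are assumptions for illustration, not part of the question. It assumes one word per line in A.txt and plain comma-separated fields in the large file; if fields can contain quoted commas, csv.reader would be the safer way to split them.

```python
def filter_lines(words_path, big_path, out_path):
    """Copy the lines of big_path whose second field appears in words_path.

    Returns the number of lines written. All three paths are hypothetical.
    """
    # Load the word list into a set for O(1) average-time membership tests.
    with open(words_path, encoding='utf-8') as file_a:
        words = {line.strip() for line in file_a}

    kept = 0
    with open(big_path, encoding='utf-8') as big, \
            open(out_path, 'w', encoding='utf-8') as out:
        # Iterating over the file object reads one line at a time,
        # so the 1 TB file is never held in memory as a whole.
        for line in big:
            fields = line.split(',', 2)
            # Skip lines without a second field, such as ".....".
            if len(fields) > 1 and fields[1].strip() in words:
                out.write(line)  # keep the original line unchanged
                kept += 1
    return kept


if __name__ == '__main__':
    import os
    # Only run the example when the hypothetical input files exist.
    if os.path.exists('A.txt') and os.path.exists('B.txt'):
        print(filter_lines('A.txt', 'B.txt', 'kept.txt'))
```

Only the set of words has to fit in memory; the large file is streamed and matching lines are written out as they are found.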

For an even larger set of words, use a database instead; Python comes with the sqlite3 library:

import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE words (word UNIQUE)')

with open('A.txt') as file_a, conn:
    cursor = conn.cursor()
    for line in file_a:
        cursor.execute('INSERT OR IGNORE INTO words VALUES (?)', (line.strip(),))

Then test against that:

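cursor = conn.cursor()
for line in second_file:  # second_file: the open 1 TB file
    # Assumes comma-separated fields with the word in the second position.
    fields = line.split(',')
    if len(fields) < 2:
        continue
    second_word = fields[1].strip()
    cursor.execute('SELECT 1 FROM words WHERE word=?', (second_word,))
    if cursor.fetchone():
        pass  # ....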

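I use a :memory: database here, which is a temporary, one-off database that SQLite keeps entirely in RAM. If the word set may outgrow memory, pass an empty string ('') to sqlite3.connect() instead: SQLite then creates a private temporary database that it can flush to a temporary file as the data grows. You can also use a real file path if you want to re-use the word database.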
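As a rough sketch of the file-backed variant, the word table can be built once and then queried while streaming the large file. The function names build_word_db and filter_with_db, and the paths words.db, B.txt, and kept.txt, are assumptions for illustration; the comma-splitting of the second file is also an assumption about its format.

```python
import sqlite3


def build_word_db(words_path, db_path):
    """Load one word per line from words_path into an SQLite table.

    db_path is a hypothetical file path; the returned connection can be
    reused for lookups, and the file can be reopened in later runs.
    """
    conn = sqlite3.connect(db_path)
    # PRIMARY KEY gives the column a unique index, so lookups stay fast.
    conn.execute('CREATE TABLE IF NOT EXISTS words (word TEXT PRIMARY KEY)')
    # "with conn" commits the inserts as a single transaction on success.
    with open(words_path, encoding='utf-8') as file_a, conn:
        conn.executemany(
            'INSERT OR IGNORE INTO words VALUES (?)',
            ((line.strip(),) for line in file_a),
        )
    return conn


def filter_with_db(conn, big_path, out_path):
    """Write the lines of big_path whose second field is in the words table."""
    query = 'SELECT 1 FROM words WHERE word = ?'
    kept = 0
    with open(big_path, encoding='utf-8') as big, \
            open(out_path, 'w', encoding='utf-8') as out:
        for line in big:
            fields = line.split(',', 2)
            if len(fields) < 2:
                continue  # no second field on this line
            if conn.execute(query, (fields[1].strip(),)).fetchone():
                out.write(line)
                kept += 1
    return kept


if __name__ == '__main__':
    import os
    # Only run the example when the hypothetical input files exist.
    if os.path.exists('A.txt') and os.path.exists('B.txt'):
        connection = build_word_db('A.txt', 'words.db')
        print(filter_with_db(connection, 'B.txt', 'kept.txt'))
        connection.close()
```

Loading the words inside one transaction matters for speed: committing after every single insert would be far slower for a 2 GB word list.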

score 1 · Answer

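Start with Martijn Pieters' answer. If that is too slow, you can use a Bloom filter to reduce the number of database lookups by eliminating lines that cannot possibly match any word in the list. Python comes with a built-in hash function that you can use as one of the hashes for the filter table, and you can look up any number of other hash functions.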
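To illustrate the idea, here is a minimal Bloom filter sketch. The class name BloomFilter, its parameters num_bits and num_hashes, and the choice of SHA-256 as the second hash are all assumptions for illustration; a real deployment would size the bit array for the expected number of words and the acceptable false-positive rate.

```python
import hashlib


class BloomFilter:
    """A small Bloom filter: never a false negative, sometimes a false positive."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # all bits start at 0

    def _positions(self, word):
        # Two base hashes: the built-in hash() and part of a SHA-256 digest.
        # Note: hash() of a str is salted per process, so this filter is only
        # valid within the run that built it.
        h1 = hash(word)
        digest = hashlib.sha256(word.encode('utf-8')).digest()
        h2 = int.from_bytes(digest[:8], 'big') | 1
        # Combine the two (double hashing) to derive num_hashes bit positions.
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, word):
        for pos in self._positions(word):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, word):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(word))


# Usage: add every word from the list, then test the second field of each line.
bloom = BloomFilter(num_bits=8 * 1024, num_hashes=4)
for known_word in ['Question', 'Q1', 'Ans1', 'links']:
    bloom.add(known_word)
```

A line whose second word is not in bloom can be dropped immediately. A word that is reported as present may still be a false positive, so it should be confirmed with the database query before the line is kept.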
