python - Compare multiple csv files and find matches

Question

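I have two folders containing csv files: a set of "master" files and a set of "unmatched" files. The master files (~25 files, ~50,000 rows in total) contain unique ids. Each row of the unmatched files (~250 files, ~700,000 rows in total) should contain an id that matches a single id in one of the master files. Furthermore, all of the ids within any one unmatched file should belong to the same master file.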

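Unfortunately, the columns are not always consistent, and the id field might appear in row[2] or row[155]. (I'm using Python for this.) I originally used set.intersection and looked for matches with length > 5 (missing values are marked with "." or left blank, and I wanted to avoid those), but quickly learned that the run time was far too long. In general, I need to match each "unmatched" file to its "master" file, and I'd like to get the index of the column in the "unmatched" file that holds the matching id. So if the ids in unmatched file unmatched_a mostly belong to master_d, and the matching column in unmatched_a is column 35, it would return a line: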

unmatched_a, master_d, 35

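Apologies if this is unclear; I'm happy to try to clarify if needed. This is my first post on stackoverflow. I could post the code I have so far, but I don't think it would be useful, because the problem is my approach to comparing multiple (relatively) large csv files. I've seen plenty of posts that compare two csv files, or files where the id index is known, but nothing with multiple files and multiple possible matches.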

score 0 · Accepted Answer

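You will have to start by reading all of the master files into memory. This is unavoidable, since the matching ids could be anywhere in the master files.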

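Then, for each unmatched file, you can read the first record and find its id (which gives you the id column), then find the master file containing that id (which gives you the matching master file). According to your description, once you have matched the first record, all the remaining ids will be in that same master file, so you are done.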

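Read the ids into a set; checking membership is O(1) on average. Put each set into a dict keyed by the master_file name. Iterating over the dict of masters is O(n). So the whole job is O(nm), where n is the number of master files and m is the number of unmatched files.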
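As a rough illustration of the lookup described above, here is a minimal sketch that uses made-up in-memory data instead of real files. The names `masters`, `find_master` and `id_to_master`, and the sample ids, are hypothetical and not part of the original answer. It also shows one possible variation: inverting the sets into a single dict, which removes the loop over the masters, assuming ids really are unique across all master files.

```python
# Hypothetical in-memory data standing in for the master files.
masters = {
    "master_a.csv": {"A10001", "A10002"},
    "master_d.csv": {"D20001", "D20002"},
}

def find_master(candidate_id, masters):
    """Return the name of the master whose id set contains candidate_id, else None."""
    for name, ids in masters.items():   # O(n) in the number of masters
        if candidate_id in ids:         # O(1) average set membership test
            return name
    return None

# Optional variation: one dict from id to master name, so a lookup needs no loop.
# Assumes ids are unique across masters, as the question states.
id_to_master = {i: name for name, ids in masters.items() for i in ids}

print(find_master("D20002", masters))   # master_d.csv
print(id_to_master.get("A10001"))       # master_a.csv
```

With only about 25 master files, the loop over the masters is cheap either way; the inverted dict mostly pays off if many cells have to be tested per file.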

import csv

def read_master_file(master_file):
    """Return the set of ids found in one master file."""
    with open(master_file, "r", newline="") as file:
        reader = csv.reader(file)
        # I assumed the id is the first value in each row in the master files.
        # Depending on the file format you will want to change this.
        ids = set(row[0] for row in reader if row)
    return ids

def load_master_files(file_list):
    """Map each master file name to its set of ids."""
    return {file: read_master_file(file) for file in file_list}

def check_unmatched_file(unmatched_file, master_dict):
    """Return (unmatched_file, master, id_column), or None if nothing matches."""
    with open(unmatched_file, "r", newline="") as file:
        reader = csv.reader(file)
        record = next(reader, None)  # first row; skip one more if there is a header
    if record is None:
        return None  # empty file
    # If you can identify an id by semantics, rather than by attempting to match
    # it against the masters, you can reduce running time by 25% by finding the
    # id before this step.
    for id_column in [2, 155]:
        if id_column >= len(record):
            continue  # this row is too short to have that column
        candidate = record[id_column]
        for master in master_dict:
            if candidate in master_dict[master]:
                return unmatched_file, master, id_column
    return None  # id not in any master. Feel free to return or raise whatever works best for you
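The question says the id column is not known in advance and that a file's ids "mostly" belong to one master, so checking a fixed list of columns on the first row alone may miss the match. The following is a sketch of one possible variation, not the answer's method: it tests every cell of the first few rows against an id-to-master index and takes a majority vote. The function names, the `sample_rows` and `min_len` parameters, and the choice of column 0 as the master id column are all assumptions made for illustration.

```python
import csv
from collections import Counter
from itertools import islice

def build_id_index(master_files, id_column=0):
    """Map every master id to the master file it came from.

    id_column=0 is an assumption carried over from the answer's code.
    """
    index = {}
    for path in master_files:
        with open(path, newline="") as f:
            for row in csv.reader(f):
                if len(row) > id_column and row[id_column].strip():
                    index[row[id_column].strip()] = path
    return index

def guess_master_and_column(unmatched_file, index, sample_rows=50, min_len=6):
    """Vote over the first sample_rows rows; return (file, master, column) or None."""
    votes = Counter()
    with open(unmatched_file, newline="") as f:
        for row in islice(csv.reader(f), sample_rows):
            for col, value in enumerate(row):
                value = value.strip()
                # Skip blanks, "." placeholders and other short values,
                # following the question's "length > 5" rule.
                if len(value) < min_len:
                    continue
                master = index.get(value)
                if master is not None:
                    votes[(master, col)] += 1
    if not votes:
        return None  # no cell in the sample matched any master id
    (master, col), _count = votes.most_common(1)[0]
    return unmatched_file, master, col
```

Sampling a few dozen rows instead of one makes the result less sensitive to a missing value or a stray match in the first record, while still reading only a small part of each unmatched file. A header row in a master file would end up in the index as an ordinary value; with `min_len=6` a short header such as "id" is never looked up, but longer header names may need to be skipped explicitly.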
