python - How to merge two csv files on a common column when the number of rows is unequal?

Question

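I have a set of 100 files: 50 files containing census information for each US state, and another 50 containing geographic data that needs to be merged with the correct file for each state.

For each state, the census file and its corresponding geo file are related by a common variable, LOGRECNO, which is the 10th column in the census file and the 7th column in the geo file.

The problem is that the geo file has more rows than the census file; my census data does not cover certain subsets of geographic locations, so it has fewer rows than the geo data file.

How can I merge the census data with the geo data (keeping only the rows/geo locations where census data exists; I don't care about the rest)?

I am new to Python and know roughly how to do basic csv file i/o in Python. Manipulating 2 csv files at the same time is proving confusing.

Example: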

sample_state_census.csv

Varname 1 Varname 2 ... Varname 10 (LOGRECNO) ... Varname 16000
xxx       xxx    ...       1             ...               xxx
xxx       xxx    ...       2             ...               xxx
...
...
xxx       xxx   ...        514           ...                xxx
xxx       xxx   ...        1312          ...                xxx
...
...
xxx       xxx   ...        1500          ...                xxx

sample_state_geo.csv

GeoVarname 1 GeoVarname 2 ... GeoVarname 7 (LOGRECNO) ... GeoVarname 65
yyy       yyy    ...       1             ...               yyy
yyy       yyy    ...       2             ...               yyy
...
...
yyy      yyy  ...        514           ...                yyy
yyy      yyy   ...        515          ...                yyy
...
...
yyy     yyy  ...        1500          ...                yyy

Expected output (do not merge rows for LOGRECNO values that do not exist in sample_state_census.csv)

Varname 1 Varname 2 ... Varname 10 (LOGRECNO) GeoVarname 1 GeoVarname 2 ... GeoVarname 65 Varname 11... Varname 16000 
xxx       xxx    ...       1  yyy yyy ... yyy xxx            ...               xxx
xxx       xxx    ...       2 yyy yyy ... yyy xxx            ...               xxx
...
...
xxx       xxx   ...        514    yyy yyy ... yyy xxx       ...                xxx
xxx       xxx   ...        1312      yyy yyy ... yyy xxx    ...                xxx
...
...
xxx       xxx   ...        1500    yyy yyy ... yyy xxx      ...                xxx

score 2 · Accepted Answer

Read the data from the shorter file into memory, as a dictionary of rows keyed on LOGRECNO:

import csv

# Python 3: csv files are opened in text mode with newline=''.
# delimiter='\t' assumes tab-separated files; adjust it if yours use commas.
with open('sample_state_census.csv', newline='') as census_file:
    reader = csv.reader(census_file, delimiter='\t')
    census_header = next(reader, None)  # store header
    census = {row[9]: row for row in reader}  # column 10 is LOGRECNO

Then use this dictionary to match against the geo data and write out the matches:

with open('sample_state_geo.csv', newline='') as geo_file:
    with open('outputfile.csv', 'w', newline='') as outfile:
        reader = csv.reader(geo_file, delimiter='\t')
        geo_header = next(reader, None)  # grab header
        geo_header.pop(6)  # no need to list the LOGRECNO header twice

        writer = csv.writer(outfile, delimiter='\t')
        writer.writerow(census_header + geo_header)

        for row in reader:
            if row[6] not in census:
                # no census data for this LOGRECNO entry
                continue
            # new row is all of the census data plus all of geo minus column 7
            newrow = census[row[6]] + row[:6] + row[7:]
            writer.writerow(newrow)
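
The expected output in the question places the geo columns directly after LOGRECNO and keeps the census row order. Below is a minimal sketch of that variant, assuming Python 3 and tab-delimited files. The function name `merge_state` and its parameters are hypothetical, and it assumes each LOGRECNO appears only once in the geo file (a later duplicate would overwrite an earlier one). It indexes the geo file (65 columns) rather than the census file (16000 columns) and streams the census rows, which should keep the in-memory dictionary smaller.

```python
import csv


def merge_state(census_path, geo_path, out_path,
                census_key=9, geo_key=6, delimiter='\t'):
    """Write each census row with its geo columns inserted after the key column.

    Returns the number of data rows written. Key positions are 0-based,
    so the defaults correspond to column 10 (census) and column 7 (geo).
    """
    # Index geo rows by LOGRECNO, dropping the duplicate key column.
    with open(geo_path, newline='') as geo_file:
        reader = csv.reader(geo_file, delimiter=delimiter)
        geo_header = next(reader)
        geo_header = geo_header[:geo_key] + geo_header[geo_key + 1:]
        geo = {row[geo_key]: row[:geo_key] + row[geo_key + 1:]
               for row in reader}

    written = 0
    cut = census_key + 1  # geo columns go right after the key column
    with open(census_path, newline='') as census_file, \
            open(out_path, 'w', newline='') as out_file:
        reader = csv.reader(census_file, delimiter=delimiter)
        writer = csv.writer(out_file, delimiter=delimiter)
        header = next(reader)
        writer.writerow(header[:cut] + geo_header + header[cut:])
        for row in reader:
            geo_cols = geo.get(row[census_key])
            if geo_cols is None:
                # Assumption: a census row without a geo match is skipped.
                continue
            writer.writerow(row[:cut] + geo_cols + row[cut:])
            written += 1
    return written
```

Calling `merge_state('sample_state_census.csv', 'sample_state_geo.csv', 'outputfile.csv')` once per state, in a loop over the 50 file pairs, would cover the whole set.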

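This all assumes the census file is not so large that it takes up too much memory. If it is, you will have to use a database instead (read all the data into a SQLite database and match against the geo data in the same way).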
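
A rough sketch of the SQLite route, using only the standard library. The names `load_table` and `merge_with_sqlite` are hypothetical. Because SQLite's default limit is 2000 columns per table and the census file has about 16000, this sketch stores each row as JSON text keyed by LOGRECNO instead of creating one database column per field. It also assumes LOGRECNO is unique within each file; a duplicate would raise `sqlite3.IntegrityError`.

```python
import csv
import json
import sqlite3


def load_table(conn, table, path, key_index, delimiter='\t'):
    """Store every data row as JSON text keyed by LOGRECNO; return the header."""
    conn.execute(
        f'CREATE TABLE {table} (logrecno TEXT PRIMARY KEY, fields TEXT)')
    with open(path, newline='') as f:
        reader = csv.reader(f, delimiter=delimiter)
        header = next(reader)
        conn.executemany(
            f'INSERT INTO {table} VALUES (?, ?)',
            ((row[key_index], json.dumps(row)) for row in reader),
        )
    return header


def merge_with_sqlite(census_path, geo_path, out_path, db_path=':memory:',
                      census_key=9, geo_key=6, delimiter='\t'):
    # Pass a new file path as db_path to keep the data on disk;
    # ':memory:' is only convenient for trying the code out.
    conn = sqlite3.connect(db_path)
    try:
        census_header = load_table(conn, 'census', census_path,
                                   census_key, delimiter)
        geo_header = load_table(conn, 'geo', geo_path, geo_key, delimiter)
        conn.commit()

        # An inner join keeps only LOGRECNO values present in both tables.
        query = ('SELECT census.fields, geo.fields FROM census '
                 'JOIN geo ON census.logrecno = geo.logrecno')
        with open(out_path, 'w', newline='') as out_file:
            writer = csv.writer(out_file, delimiter=delimiter)
            writer.writerow(census_header
                            + geo_header[:geo_key] + geo_header[geo_key + 1:])
            for census_json, geo_json in conn.execute(query):
                geo_row = json.loads(geo_json)
                writer.writerow(json.loads(census_json)
                                + geo_row[:geo_key] + geo_row[geo_key + 1:])
    finally:
        conn.close()
```

The query has no `ORDER BY`, so the output row order is not guaranteed to follow the census file.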

score 2 · Accepted Answer

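For merging multiple files (even more than 2) based on one or more common columns, one of the best and most efficient approaches in Python is to use "brewery". You can even specify which fields should be considered for merging and which fields should be saved.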

import brewery
from brewery import ds
import sys

sources = [
    {"file": "grants_2008.csv",
     "fields": ["receiver", "amount", "date"]},
    {"file": "grants_2009.csv",
     "fields": ["id", "receiver", "amount", "contract_number", "date"]},
    {"file": "grants_2010.csv",
     "fields": ["receiver", "subject", "requested_amount", "amount", "date"]}
]

# Create a list of all fields and add the file name to store information
# about the origin of each data record
all_fields = brewery.FieldList(["file"])

# Go through the source definitions and collect the fields
for source in sources:
    for field in source["fields"]:
        if field not in all_fields:
            all_fields.append(field)

out = ds.CSVDataTarget("merged.csv")
out.fields = brewery.FieldList(all_fields)
out.initialize()

for source in sources:
    path = source["file"]

    # Initialize data source: skip reading of headers
    # use XLSDataSource for XLS files
    # We ignore the fields in the header, because we have set up fields
    # previously. We need to skip the header row.
    src = ds.CSVDataSource(path, read_header=False, skip_rows=1)
    src.fields = ds.FieldList(source["fields"])
    src.initialize()

    for record in src.records():
        # Add file reference into output - to know where the row comes from
        record["file"] = path
        out.append(record)

    # Close the source stream
    src.finalize()

cat merged.csv | brewery pipe pretty_printer
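
If brewery is not available, the same operation can be approximated with the standard library. This is a sketch of the technique only, not brewery's API, and `append_sources` is a hypothetical name. Like the code above, it stacks the rows of several files under the union of their fields and records the source file; it does not join rows on a key such as LOGRECNO.

```python
import csv


def append_sources(sources, out_path):
    """Stack rows from several CSV files under the union of their fields.

    `sources` is a list of {"file": path, "fields": [names]} dicts.
    Returns the list of output field names.
    """
    # Collect every field once, starting with the origin column.
    all_fields = ['file']
    for source in sources:
        for field in source['fields']:
            if field not in all_fields:
                all_fields.append(field)

    with open(out_path, 'w', newline='') as out_file:
        # restval='' leaves fields a source lacks empty in the output.
        writer = csv.DictWriter(out_file, fieldnames=all_fields, restval='')
        writer.writeheader()
        for source in sources:
            with open(source['file'], newline='') as src_file:
                reader = csv.DictReader(src_file, fieldnames=source['fields'])
                next(reader, None)  # skip header row; names come from `sources`
                for record in reader:
                    record['file'] = source['file']  # where the row came from
                    writer.writerow(record)
    return all_fields
```

A row with more values than its declared fields would make `DictWriter` raise `ValueError`, so this assumes the field lists match the files.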
