python - Key-based CSV join

Question

This may be a simple/duplicate question, but I could not find or figure out how to do it.

I have two csv files:

info.csv:

"Last Name", First Name, ID, phone, adress, age X [Total age: 100] |009076
abc, xyz, 1234, 982-128-0000, pqt,
bcd, uvw, 3124, 813-222-1111, tre, 
poi, ccc, 9087, 123-45607890, weq,

and

age.csv:

student_id,age_1
3124,20
9087,21
1234,45

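I want to compare the two csv files based on the "ID" column from info.csv and the "student_id" column from age.csv, take the corresponding "age_1" value, and put it into the "age" column of info.csv.

So the final output should be:

info.csv: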

"Last Name", First Name, ID, phone, adress, age X [Total age: 100] |009076
 abc, xyz, 1234, 982-128-0000, pqt,45
 bcd, uvw, 3124, 813-222-1111, tre,20
 poi, ccc, 9087, 123-45607890, weq,21

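I can simply join the tables on the key into new.csv, but I cannot get the data to land under the "age" column header. I used "csvkit" to do the join.

This is what I used: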

csvjoin -c 3,1 info.csv age.csv > new.csv

score 3 · Accepted Answer

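You can use Pandas and update the info dataframe with the age data. To do so, set the index of the two dataframes to ID and student_id respectively, then update the info dataframe. After that, reset the index so that ID becomes a column again.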

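from io import StringIO  # Python 3 location of StringIO
import pandas as pd

info = StringIO("""Last Name,First Name,ID,phone,adress,age X [Total age: 100] |009076
abc, xyz, 1234, 982-128-0000, pqt,
bcd, uvw, 3124, 813-222-1111, tre, 
poi, ccc, 9087, 123-45607890, weq,""")


age = StringIO("""student_id,age_1
3124,20
9087,21
1234,45""")

# skipinitialspace drops the blank that follows each comma in the data rows
info_df = pd.read_csv(info, sep=",", skipinitialspace=True)
age_df = pd.read_csv(age, sep=",", skipinitialspace=True)

age_col = 'age X [Total age: 100] |009076'
info_df = info_df.set_index('ID')
age_df = age_df.set_index('student_id')
# DataFrame.update aligns on the index and modifies info_df in place
info_df.update(age_df.rename(columns={'age_1': age_col}))
# the empty column was read as float, so cast back to a nullable integer
info_df[age_col] = info_df[age_col].astype('Int64')
info_df = info_df.reset_index()
print(info_df)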

Output:

    ID      Last Name   First Name      phone           adress  age X [Total age: 100] |009076
0   1234    abc         xyz              982-128-0000   pqt     45
1   3124    bcd         uvw              813-222-1111   tre     20
2   9087    poi         ccc              123-45607890   weq     21
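
The same lookup can also be done without changing the index. The following is a minimal sketch, not part of the original answer: it assumes the column names shown in the question, the function name `fill_age` is hypothetical, and it swaps `update` for `Series.map`. The in-memory strings stand in for the two files.

```python
from io import StringIO
import pandas as pd

AGE_COL = "age X [Total age: 100] |009076"  # header text copied from the question

def fill_age(info_src, age_src):
    """Fill the age column of info from age_1, matching ID to student_id."""
    # skipinitialspace drops the blank that follows each comma in info.csv
    info_df = pd.read_csv(info_src, skipinitialspace=True)
    age_df = pd.read_csv(age_src, skipinitialspace=True)
    # A Series indexed by student_id acts as an ID -> age lookup table
    lookup = age_df.set_index("student_id")["age_1"]
    # Overwrites the age column; IDs with no match in age.csv become NaN
    info_df[AGE_COL] = info_df["ID"].map(lookup)
    return info_df

# Small in-memory stand-ins for info.csv and age.csv
info = StringIO(
    '"Last Name", First Name, ID, phone, adress, age X [Total age: 100] |009076\n'
    "abc, xyz, 1234, 982-128-0000, pqt,\n"
    "bcd, uvw, 3124, 813-222-1111, tre, \n"
    "poi, ccc, 9087, 123-45607890, weq,\n"
)
age = StringIO("student_id,age_1\n3124,20\n9087,21\n1234,45\n")

result = fill_age(info, age)
print(result)
```

Since `pd.read_csv` also accepts file paths, something like `fill_age("info.csv", "age.csv").to_csv("new.csv", index=False)` should write the filled table to disk (the output file name here is only an example).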

score 1 · Accepted Answer

Try this...

import csv

# Python 3: open CSV files in text mode with newline='' (not 'rb'/'wb')
with open("info.csv", newline='') as inFile:
    info = list(csv.reader(inFile))
with open("age.csv", newline='') as inFile:
    age = list(csv.reader(inFile))

def copyCSV(age, info, outFileName='out.csv'):
    # put age into dict, indexed by ID
    # assumes no duplicate entries

    # 1 - build a dict ageDict to represent data (skipping empty rows)
    ageDict = {entry[0].replace(' ', ''): entry[1] for entry in age[1:] if entry}

    # 2 - setup output
    with open(outFileName, 'w', newline='') as outFile:
        outwriter = csv.writer(outFile)
        # 3 - run through info and slot in ages and write to output
        # nb: had to use .replace(' ','') to strip out whitespaces - these may not be in original .csv
        outwriter.writerow(info[0])
        for entry in info[1:]:
            if entry:
                key = entry[2].replace(' ', '')
                if key in ageDict:  # checks that you have data from age.csv
                    entry[5] = ageDict[key]
            outwriter.writerow(entry)

copyCSV(age, info)

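Let me know whether it works or if anything is unclear. I used a dict because it should be faster if your files are large, since you only loop through the data in age.csv once.

There may be a simpler way / something already implemented... but this should do the trick.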
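
One possible variant of the dict lookup above, sketched with `csv.DictReader` so that columns are addressed by name rather than by position. This is an illustration under assumptions, not the answer's original code: the function name `fill_ages` is hypothetical, the column names are taken from the question, and in-memory strings stand in for the files.

```python
import csv
from io import StringIO

AGE_COL = "age X [Total age: 100] |009076"  # header text copied from the question

def fill_ages(info_file, age_file, out_file):
    """Copy age_1 into the age column of info, matching ID to student_id."""
    # skipinitialspace=True ignores the blank after each comma,
    # so no manual whitespace stripping is needed
    ages = {row["student_id"]: row["age_1"]
            for row in csv.DictReader(age_file, skipinitialspace=True)}
    reader = csv.DictReader(info_file, skipinitialspace=True)
    writer = csv.DictWriter(out_file, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        # One dict lookup per row; keep the old value if the ID is unknown
        row[AGE_COL] = ages.get(row["ID"], row[AGE_COL])
        writer.writerow(row)

# Small in-memory stand-ins for info.csv and age.csv
info = StringIO(
    '"Last Name", First Name, ID, phone, adress, age X [Total age: 100] |009076\n'
    "abc, xyz, 1234, 982-128-0000, pqt,\n"
    "bcd, uvw, 3124, 813-222-1111, tre, \n"
    "poi, ccc, 9087, 123-45607890, weq,\n"
)
age = StringIO("student_id,age_1\n3124,20\n9087,21\n1234,45\n")
out = StringIO()

fill_ages(info, age, out)
print(out.getvalue())
```

With real files, the three arguments would be file objects opened with `newline=''`, as the `csv` module documentation recommends.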
