I'm trying to update a csv file with some student data supplied by another source, but their csv data is formatted slightly differently from ours.

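It needs to match on three criteria: the student's name, the class, and finally the first few letters of the location, so the first few students in Class B listed under "Dumpt" are actually from Dumpton Park.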


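  • If a student's Scorecard in CSV 2 is 0 or blank, the Score column in CSV 1 should not be updated
  • If a student's Number in CSV 2 is 0 or blank, the No column in CSV 1 should not be updated
  • Otherwise it should import the numbers from CSV 2 into CSV 1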
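
The matching and update rules above can be sketched in plain Python. This is only an illustration under stated assumptions: the helper names (match_key, is_usable, apply_update) are hypothetical, the prefix length of 4 is a guess based on the shortest abbreviation in the sample ("Lond"), and rows are assumed to be already split into lists of strings (7 fields for CSV 1, 5 fields for CSV 2).

```python
def match_key(class_name, location, name, prefix_len=4):
    """Build the lookup key: class, first few letters of the location, name."""
    return (class_name.strip(), location.strip()[:prefix_len], name.strip())


def is_usable(value):
    """True when a CSV 2 value is neither blank nor zero."""
    text = value.strip()
    if not text:
        return False
    try:
        return float(text) != 0
    except ValueError:
        return True  # assumption: non-numeric text is passed through unchanged


def apply_update(ours, theirs):
    """Return a copy of our 7-field row with Score and No taken from their 5-field row."""
    updated = list(ours)
    if is_usable(theirs[3]):   # Scorecard -> Score
        updated[5] = theirs[3]
    if is_usable(theirs[4]):   # Number -> No
        updated[6] = theirs[4]
    return updated


ours = ["Class B", "London", "Jemma", "x", "x", "", ""]
theirs = ["Class B", "Lond", "Jemma", "32.2", "0"]
assert match_key(*ours[:3]) == match_key(*theirs[:3])
print(apply_update(ours, theirs))
```

Here the score is imported while the zero number leaves the No column blank, which matches the Jemma row in the desired output.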



CSV 1

Class A,York,Tom,x,x,32,
Class A,York,Jim,x,x,10,
Class A,York,Sam,x,x,32,
Class B,Dumpton Park,Sarah,x,x,,
Class B,Dumpton Park,Bob,x,x,,
Class B,Dumpton Park,Bill,x,x,,
Class A,Dover,Andy,x,x,,
Class A,Dover,Hannah,x,x,,
Class B,London,Jemma,x,x,,
Class B,London,James,x,x,,


CSV 2

"Class A","York","Jim","0","742"
"Class A","York","Sam","0","931"
"Class A","York","Tom","0","653"
"Class B","Dumpt","Bob","23.1","299"
"Class B","Dumpt","Bill","23.4","198"
"Class B","Dumpt","Sarah","23.5","12"
"Class A","Dover","Andy","23","983"
"Class A","Dover","Hannah","1","293"
"Class B","Lond","Jemma","32.2","0"
"Class B","Lond","James","32.0","0"

CSV 1 updated (this is the desired output)

Class A,York,Tom,x,x,32,653
Class A,York,Jim,x,x,10,742
Class A,York,Sam,x,x,32,653
Class B,Dumpton Park,Sarah,x,x,23.5,12
Class B,Dumpton Park,Bob,x,x,23.1,299
Class B,Dumpton Park,Bill,x,x,23.4,198
Class A,Dover,Andy,x,x,23,983
Class A,Dover,Hannah,x,x,1,293
Class B,London,Jemma,x,x,32.2,
Class B,London,James,x,x,32.0,



Here are two solutions: a pandas solution and a plain Python solution. First the pandas solution which, unsurprisingly, looks a lot like the other pandas solutions...


import pandas
import numpy as np

cdf1 = pandas.read_csv('csv1',dtype=object)  #dtype = object allows us to preserve the numeric formats
cdf2 = pandas.read_csv('csv2',dtype=object)

col_order = cdf1.columns  #pandas will shuffle the column order at some point---this allows us to reset to the original column order


In [6]: cdf1
     Class         Local    Name DPE JJK Score   No
0  Class A          York     Tom   x   x    32  NaN
1  Class A          York     Jim   x   x    10  NaN
2  Class A          York     Sam   x   x    32  NaN
3  Class B  Dumpton Park   Sarah   x   x   NaN  NaN
4  Class B  Dumpton Park     Bob   x   x   NaN  NaN
5  Class B  Dumpton Park    Bill   x   x   NaN  NaN
6  Class A         Dover    Andy   x   x   NaN  NaN
7  Class A         Dover  Hannah   x   x   NaN  NaN
8  Class B        London   Jemma   x   x   NaN  NaN
9  Class B        London   James   x   x   NaN  NaN

In [7]: cdf2
     Class Location Student Scorecard Number
0  Class A     York     Jim         0    742
1  Class A     York     Sam         0    931
2  Class A     York     Tom         0    653
3  Class B    Dumpt     Bob      23.1    299
4  Class B    Dumpt    Bill      23.4    198
5  Class B    Dumpt   Sarah      23.5     12
6  Class A    Dover    Andy        23    983
7  Class A    Dover  Hannah         1    293
8  Class B     Lond   Jemma      32.2      0
9  Class B     Lond   James      32.0      0


dcol = cdf2.Location 
cdf2['Location'] = dcol.apply(lambda x: x[0:4])  #Replacement in cdf2 since we don't need original data

dcol = cdf1.Local
cdf1['Location'] = dcol.apply(lambda x: x[0:4])  #Here we add a column leaving 'Local' because we'll need it for the final output

cdf2 = cdf2.rename(columns={'Student': 'Name', 'Scorecard': 'Score', 'Number': 'No'})
cdf2 = cdf2.replace('0', np.nan)  #Replacing '0' by np.nan means zeros don't overwrite

cdf1 = cdf1.set_index(['Class', 'Location', 'Name'])
cdf2 = cdf2.set_index(['Class', 'Location', 'Name'])

Now cdf1 and cdf2 look like

In [16]: cdf1
                                Local DPE JJK Score   No
Class   Location Name                                   
Class A York     Tom             York   x   x    32  NaN
                 Jim             York   x   x    10  NaN
                 Sam             York   x   x    32  NaN
Class B Dump     Sarah   Dumpton Park   x   x   NaN  NaN
                 Bob     Dumpton Park   x   x   NaN  NaN
                 Bill    Dumpton Park   x   x   NaN  NaN
Class A Dove     Andy           Dover   x   x   NaN  NaN
                 Hannah         Dover   x   x   NaN  NaN
Class B Lond     Jemma         London   x   x   NaN  NaN
                 James         London   x   x   NaN  NaN

In [17]: cdf2
                        Score   No
Class   Location Name             
Class A York     Jim      NaN  742
                 Sam      NaN  931
                 Tom      NaN  653
Class B Dump     Bob     23.1  299
                 Bill    23.4  198
                 Sarah   23.5   12
Class A Dove     Andy      23  983
                 Hannah     1  293
Class B Lond     Jemma   32.2  NaN
                 James   32.0  NaN

Update the data in cdf1 with the data in cdf2

cdf1.update(cdf2)  #default overwrite=True: non-NaN values in cdf2 replace those in cdf1, NaN values in cdf2 are skipped


In [19]: cdf1
                                Local DPE JJK Score   No
Class   Location Name                                   
Class A York     Tom             York   x   x    32  653
                 Jim             York   x   x    10  742
                 Sam             York   x   x    32  931
Class B Dump     Sarah   Dumpton Park   x   x  23.5   12
                 Bob     Dumpton Park   x   x  23.1  299
                 Bill    Dumpton Park   x   x  23.4  198
Class A Dove     Andy           Dover   x   x    23  983
                 Hannah         Dover   x   x     1  293
Class B Lond     Jemma         London   x   x  32.2  NaN
                 James         London   x   x  32.0  NaN
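
How DataFrame.update treats NaN is what makes the "0 or blank does not overwrite" rule work, so a toy sketch may help. The frames and the score of 40 below are made up for illustration and are not part of the question's data. It compares the default overwrite=True with overwrite=False.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "ours" already holds some scores, "theirs" has gaps (NaN).
names = ["Tom", "Sarah", "Jim"]
ours = pd.DataFrame({"Score": ["32", np.nan, "10"]}, index=names, dtype=object)
theirs = pd.DataFrame({"Score": ["40", "23.5", np.nan]}, index=names, dtype=object)

# overwrite=False only fills the gaps in ours; existing values are kept.
fill_only = ours.copy()
fill_only.update(theirs, overwrite=False)

# overwrite=True (the default) takes every non-NaN value from theirs.
replaced = ours.copy()
replaced.update(theirs)

print(fill_only["Score"].tolist())
print(replaced["Score"].tolist())
```

In both modes a NaN in the second frame never wipes out a value in the first, which is why zeros are turned into NaN beforehand. The two modes only differ when both frames hold a value for the same cell, which never happens in the sample data.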

Finally, restore cdf1 to its original form and write it to a csv file.

cdf1 = cdf1.reset_index()  #These two steps allow us to remove the 'Location' column
del cdf1['Location']    
cdf1 = cdf1[col_order]     #This will switch Local and Name back to their original order

cdf1.to_csv('temp.csv',index = False)

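Two notes. First, given how easy it is to use cdf1.Local.value_counts() or len(cdf1.Local.value_counts()) etc., I strongly recommend adding some checks to make sure that, in going from the full location to its first few letters, you don't accidentally merge two distinct locations. Second, I sincerely hope there is a typo on line 4 of your desired output.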
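
One way to write such a check, as a rough sketch: compare the number of distinct full locations with the number of distinct prefixes. The helper name truncation_is_safe and the "Dovercourt" example are hypothetical, and nunique() is used here in place of len(value_counts()), which counts the same thing.

```python
import pandas as pd


def truncation_is_safe(locations, prefix_len=4):
    """True when no two distinct locations share the same prefix."""
    full = pd.Series(locations).dropna()
    # If truncating merges two locations, the prefix count drops below the full count.
    return full.nunique() == full.str[:prefix_len].nunique()


print(truncation_is_safe(["York", "Dumpton Park", "Dover", "London"]))
print(truncation_is_safe(["Dover", "Dovercourt"]))
```

With the frames from this answer the call would be truncation_is_safe(cdf1.Local), run before the Location column is added.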


Now the plain Python solution:

#Open all of the necessary files
csv1 = open('csv1','r')
csv2 = open('csv2','r')
csvout = open('csv_out','w')

#Read past both headers and write the header to the outfile
wstr = csv1.readline()
csvout.write(wstr)
csv2.readline()

#Read csv1 into a dictionary with keys of Class,Name,and first four letters of Local and keep a list of keys for line ordering
line_keys = []
line_dict = {}
for line in csv1:
    s = line.rstrip('\n').split(',')
    this_key = (s[0],s[1][0:4],s[2])
    line_dict[this_key] = s
    line_keys.append(this_key)

#Go through csv2 updating the data in csv1 as necessary
for line in csv2:
    s = line.rstrip('\n').replace('"','').split(',')
    this_key = (s[0],s[1][0:4],s[2])
    if this_key in line_dict:   #Lowers the crash rate...
        #Check if need to replace Score...
        if len(s[3]) > 0 and float(s[3]) != 0:
            line_dict[this_key][5] = s[3]
        #Check if need to replace No...
        if len(s[4]) > 0 and float(s[4]) != 0:
            line_dict[this_key][6] = s[4]
    else:
        print("Line not in csv1: %s" % line)

#Write the updated line_dict to csvout
for key in line_keys:
    wstr = ','.join(line_dict[key])
    csvout.write(wstr + '\n')

#Close all of the open filehandles
csv1.close()
csv2.close()
csvout.close()

import pandas as pd

df1 = pd.read_csv(csv1)  # here csv1 and csv2 are the paths to the two files
df2 = pd.read_csv(csv2)

towns = df1.Local.unique()  # assuming this is complete list of towns

from fuzzywuzzy.fuzz import partial_ratio

In [11]: df2['Local'] =  df2.Location.apply(lambda short_location: max(towns, key=lambda t: partial_ratio(short_location, t)))

In [12]: df2
     Class Location Student  Scorecard  Number         Local
0  Class A     York     Jim        0.0     742          York
1  Class A     York     Sam        0.0     931          York
2  Class A     York     Tom        0.0     653          York
3  Class B    Dumpt     Bob       23.1     299  Dumpton Park
4  Class B    Dumpt    Bill       23.4     198  Dumpton Park
5  Class B    Dumpt   Sarah       23.5      12  Dumpton Park
6  Class A    Dover    Andy       23.0     983         Dover
7  Class A    Dover  Hannah        1.0     293         Dover
8  Class B     Lond   Jemma       32.2       0        London
9  Class B     Lond   James       32.0       0        London
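
If the shortened names are always plain prefixes of the full town names (an assumption that holds for the sample data but may not hold in general), a stricter sketch without fuzzywuzzy is possible. This swaps fuzzy scoring for simple prefix matching; expand_location is a hypothetical helper, not part of any library.

```python
def expand_location(short_location, towns):
    """Return the single town that starts with short_location, else raise."""
    prefix = short_location.lower()
    matches = [t for t in towns if t.lower().startswith(prefix)]
    if len(matches) != 1:
        # Zero or several candidates: refuse to guess.
        raise ValueError("%r matched %d towns: %r" % (short_location, len(matches), matches))
    return matches[0]


towns = ["York", "Dumpton Park", "Dover", "London"]
print([expand_location(s, towns) for s in ["York", "Dumpt", "Dover", "Lond"]])
```

Unlike max(..., key=partial_ratio), which always returns some town, this version fails loudly when an abbreviation is ambiguous or unknown.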


In [13]: df2.rename(columns={'Student': 'Name'}, inplace=True)


In [14]: res = df1.merge(df2, how='outer')

In [15]: res
     Class         Local    Name DPE JJK  Score  No Location  Scorecard  Number
0  Class A          York     Tom   x   x     32 NaN     York        0.0     653
1  Class A          York     Jim   x   x     10 NaN     York        0.0     742
2  Class A          York     Sam   x   x     32 NaN     York        0.0     931
3  Class B  Dumpton Park   Sarah   x   x    NaN NaN    Dumpt       23.5      12
4  Class B  Dumpton Park     Bob   x   x    NaN NaN    Dumpt       23.1     299
5  Class B  Dumpton Park    Bill   x   x    NaN NaN    Dumpt       23.4     198
6  Class A         Dover    Andy   x   x    NaN NaN    Dover       23.0     983
7  Class A         Dover  Hannah   x   x    NaN NaN    Dover        1.0     293
8  Class B        London   Jemma   x   x    NaN NaN     Lond       32.2       0
9  Class B        London   James   x   x    NaN NaN     Lond       32.0       0


In [16]: res['Score'] = res.loc[:, ['Score', 'Scorecard']].max(1)

In [17]: del res['Scorecard'] 
         del res['No']
         del res['Location']


In [18]: res
     Class         Local    Name DPE JJK  Score  Number
0  Class A          York     Tom   x   x   32.0     653
1  Class A          York     Jim   x   x   10.0     742
2  Class A          York     Sam   x   x   32.0     931
3  Class B  Dumpton Park   Sarah   x   x   23.5      12
4  Class B  Dumpton Park     Bob   x   x   23.1     299
5  Class B  Dumpton Park    Bill   x   x   23.4     198
6  Class A         Dover    Andy   x   x   23.0     983
7  Class A         Dover  Hannah   x   x    1.0     293
8  Class B        London   Jemma   x   x   32.2       0
9  Class B        London   James   x   x   32.0       0

In [19]: res.to_csv('foo.csv', index=False)

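Note: to force the dtype to object (and so keep a mix of ints and floats rather than all floats), you can use apply. I would recommend against this if you are going to do any analysis!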

res['Score'] = res['Score'].apply(lambda x: int(x) if int(x) == x else x, convert_dtype=False)

Hopefully this code is a bit more readable. ;) The backport of Python's new Enum type is on PyPI (enum34).

from enum import Enum       # see PyPI for the backport (enum34)

class Field(Enum):

    course = 0
    location = 1
    student = 2
    dpe = 3
    jjk = 4
    score = -2
    number = -1

    def __index__(self):
        return self._value_

def Float(text):
    if not text:
        return 0.0
    return float(text)

def load_our_data(filename):
    "return a dict using the first three fields as the key"
    data = dict()
    with open(filename) as input:
        next(input)  # throw away header
        for line in input:
            fields = line.strip('\n').split(',')
            fields[Field.score] = Float(fields[Field.score])
            fields[Field.number] = Float(fields[Field.number])
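            # key on class, first four letters of the location, and student name
            key = (fields[Field.course], fields[Field.location][:4], fields[Field.student])
            data[key] = fields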
    return data

def load_their_data(filename):
    "return a dict using the first three fields as the key"
    data = dict()
    with open(filename) as input:
        next(input)  # throw away header
        for line in input:
            fields = line.strip('\n').split(',')
            fields = [f.strip('"') for f in fields]
            fields[Field.score] = Float(fields[Field.score])
            fields[Field.number] = Float(fields[Field.number])
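            # key on class, first four letters of the location, and student name
            key = (fields[Field.course], fields[Field.location][:4], fields[Field.student])
            data[key] = fields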
    return data

def merge_data(ours, theirs):
    "their data is only used if not blank and non-zero"
    for key, our_data in ours.items():
        their_data = theirs[key]
        if their_data[Field.score]:
            our_data[Field.score] = their_data[Field.score]
        if their_data[Field.number]:
            our_data[Field.number] = their_data[Field.number]

def write_our_data(data, filename):
    with open(filename, 'w') as output:
        for record in sorted(data.values()):
            line = ','.join([str(f) for f in record])
            output.write(line + '\n')

if __name__ == '__main__':
    ours = load_our_data('one.csv')
    theirs = load_their_data('two.csv')
    merge_data(ours, theirs)
    write_our_data(ours, 'three.csv')

Python dictionaries are the way to go here:

csv1Path, csv2Path, outPath = 'csv1.csv', 'csv2.csv', 'out.csv'  # placeholder paths, substitute your own

studentDict = {}

with open(csv1Path, 'r') as f:
  for line in f:
    LL = line.rstrip('\n').replace('"','').split(',')
    # key on class, first four letters of location, and name; keep the whole row as the value
    studentDict[LL[0], LL[1][:4], LL[2]] = LL

with open(csv2Path, 'r') as f:
  for line in f:
    LL = line.rstrip('\n').replace('"','').split(',')
    row = studentDict.get((LL[0], LL[1][:4], LL[2]))
    if row is None: continue  # header line, or a student that is not in csv1
    if LL[-2] not in ('0', ''): row[-2] = LL[-2]
    if LL[-1] not in ('0', ''): row[-1] = LL[-1]

with open(outPath, 'w') as f:
  # dicts keep insertion order on Python 3.7+, so rows come out in csv1 order
  for row in studentDict.values():
    f.write(','.join(row) + '\n')

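Edit: OK, since you can't rely on renaming the values by hand, Roman's suggestion to match on just the first few letters is a good one. We have to change a few things before that, though.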

import numpy as np
import pandas as pd

In [62]: df1 = pd.read_clipboard(sep=',')

In [63]: df2 = pd.read_clipboard(sep=',')

In [68]: df1
     Class Location Student  Scorecard  Number
0  Class A     York     Jim        0.0     742
1  Class A     York     Sam        0.0     931
2  Class A     York     Tom        0.0     653
3  Class B    Dumpt     Bob       23.1     299
4  Class B    Dumpt    Bill       23.4     198
5  Class B    Dumpt   Sarah       23.5      12
6  Class A    Dover    Andy       23.0     983
7  Class A    Dover  Hannah        1.0     293
8  Class B     Lond   Jemma       32.2       0
9  Class B     Lond   James       32.0       0

In [69]: df2
     Class         Local    Name DPE JJK  Score   No
0  Class A          York     Tom   x   x   32.0  653
1  Class A          York     Jim   x   x   10.0  742
2  Class A          York     Sam   x   x   32.0  653
3  Class B  Dumpton Park   Sarah   x   x   23.5   12
4  Class B  Dumpton Park     Bob   x   x   23.1  299
5  Class B  Dumpton Park    Bill   x   x   23.4  198
6  Class A         Dover    Andy   x   x   23.0  983
7  Class A         Dover  Hannah   x   x    1.0  293
8  Class B        London   Jemma   x   x   32.2  NaN
9  Class B        London   James   x   x   32.0  NaN


In [70]: df1 = df1.rename(columns={'Location': 'Local', 'Student': 'Name', 'Scorecard': 'Score', 'Number': 'No'})


In [71]: locations = df2['Local']

In [72]: df1['Local'] = df1['Local'].str.slice(0, 4)

In [73]: df2['Local'] = df2['Local'].str.slice(0, 4)

Use the string methods to truncate to the first 4 characters (assuming this doesn't cause any false matches).


In [78]: df1 = df1.set_index(['Class', 'Local', 'Name'])

In [79]: df2 = df2.set_index(['Class', 'Local', 'Name'])

In [80]: df1
                      Score   No
Class   Local Name              
Class A York  Jim       0.0  742
              Sam       0.0  931
              Tom       0.0  653
Class B Dump  Bob      23.1  299
              Bill     23.4  198
              Sarah    23.5   12
Class A Dove  Andy     23.0  983
              Hannah    1.0  293
Class B Lond  Jemma    32.2    0
              James    32.0    0

In [83]: df1 = df1.replace(0, np.nan)
In [84]: df2 = df2.replace(0, np.nan)


In [85]: df1.update(df2, overwrite=False)


In [91]: df1 = df1.reset_index()
In [92]: df1['Local'] = locations

You can write the output to csv (and a bunch of other formats) with df1.to_csv('path/to/csv').

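You could try using the csv module from the standard library. My solution is very similar to Chris H's, but I use the csv module to read and write the files. (In fact, I stole his technique of storing the keys in a list to preserve the order.)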

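If you use the csv module, you don't have to worry much about the quotes, and it also lets you read rows directly into dictionaries with the column names as keys.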

import csv

# Open first CSV, and read each line as a dictionary with column names as keys.
with open('csv1.csv', newline='') as csvfile1:
    table1 = csv.DictReader(csvfile1,['Class', 'Local', 'Name',
                            'DPE', 'JJK', 'Score', 'No'])
    next(table1) #skip header row
    first_table = {}
    original_order = [] #list keys to save original order
    # build dictionary of rows with name, location, and class as keys
    for row in table1:
        id = "%s from %s in %s" % (row['Name'], row['Local'][:4], row['Class'])
        first_table[id] = row
        original_order.append(id)

# Repeat for second csv, but don't worry about order
with open('csv2.csv', newline='') as csvfile2:
    table2 = csv.DictReader(csvfile2, ['Class', 'Location',
                            'Student', 'Scorecard', 'Number'])
    second_table = {}
    for row in table2:
        id = "%s from %s in %s" % (row['Student'], row['Location'][:4], row['Class'])
        second_table[id] = row

with open('student_data.csv', 'w', newline='') as finalfile:
    results = csv.DictWriter(finalfile, ['Class', 'Local', 'Name',
                             'DPE', 'JJK', 'Score', 'No'])
    results.writeheader()
    # Replace data in first csv with data in second csv when conditions are satisfied.
    for student in original_order:
        if student in second_table: #students missing from the second csv are left unchanged
            if second_table[student]['Scorecard'] != "0" and second_table[student]['Scorecard'] != "":
                first_table[student]['Score'] = second_table[student]['Scorecard']
            if second_table[student]['Number'] != "0" and second_table[student]['Number'] != "":
                first_table[student]['No'] = second_table[student]['Number']
        results.writerow(first_table[student])
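
To illustrate the point about quotes made above, here is a small self-contained sketch. The header row is assumed from the column names used in this answer, and the location "Dumpton Park, Kent" is a made-up value chosen only to show a comma inside a quoted field.

```python
import csv
import io

# Hypothetical sample in the style of the second file, with a comma inside a quoted field.
sample = (
    '"Class","Location","Student","Scorecard","Number"\n'
    '"Class B","Dumpton Park, Kent","Bob","23.1","299"\n'
)

# DictReader takes the column names from the first row and strips the quotes.
rows = list(csv.DictReader(io.StringIO(sample)))
print(rows[0]["Student"], rows[0]["Scorecard"])
print(rows[0]["Location"])

# A naive split breaks the quoted location into two pieces and keeps the quotes.
naive = sample.splitlines()[1].split(',')
print(len(naive))
```

The reader yields five fields for the data row, while the naive split yields six, so manual quote stripping followed by split(',') would shift every column after the location.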


