5

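I'm trying to update a CSV file with student data provided by another source, but their CSV format is slightly different from ours.

Students need to be matched on three criteria: name, class, and finally the first few letters of the location, so the first few Class B students, listed under Dumpt, are actually from Dumpton Park.

When a match is found:

  • If the student's Scorecard in CSV 2 is 0 or blank, the Score column in CSV 1 should not be updated
  • If the student's Number in CSV 2 is 0 or blank, the No column in CSV 1 should not be updated
  • Otherwise the numbers should be imported from CSV 2 into CSV 1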
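
The matching and update rules above can be sketched in plain Python. This is only an illustration: the helper names `match_key` and `should_import` are hypothetical, and truncating the location to four characters is an assumption based on the sample data (`Dumpt`, `Lond`).

```python
def match_key(class_name, location, name, prefix_len=4):
    """Build a comparison key from class, location prefix, and name."""
    return (class_name.strip(), location.strip()[:prefix_len], name.strip())


def should_import(value):
    """Return True when a CSV 2 value is neither blank nor zero."""
    text = value.strip()
    if not text:
        return False
    try:
        return float(text) != 0
    except ValueError:
        # Non-numeric text is treated as importable; this is a guess at the desired behavior.
        return True


# "Dumpton Park" and "Dumpt" produce the same key once truncated.
print(match_key("Class B", "Dumpton Park", "Bob") == match_key("Class B", "Dumpt", "Bob"))
print(should_import("0"), should_import(""), should_import("23.1"))
```

Comparing as numbers rather than strings means values such as `0.0` are also treated as zero.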

Here is some sample data:

CSV 1

Class,Local,Name,DPE,JJK,Score,No
Class A,York,Tom,x,x,32,
Class A,York,Jim,x,x,10,
Class A,York,Sam,x,x,32,
Class B,Dumpton Park,Sarah,x,x,,
Class B,Dumpton Park,Bob,x,x,,
Class B,Dumpton Park,Bill,x,x,,
Class A,Dover,Andy,x,x,,
Class A,Dover,Hannah,x,x,,
Class B,London,Jemma,x,x,,
Class B,London,James,x,x,,

CSV 2

"Class","Location","Student","Scorecard","Number"
"Class A","York","Jim","0","742"
"Class A","York","Sam","0","931"
"Class A","York","Tom","0","653"
"Class B","Dumpt","Bob","23.1","299"
"Class B","Dumpt","Bill","23.4","198"
"Class B","Dumpt","Sarah","23.5","12"
"Class A","Dover","Andy","23","983"
"Class A","Dover","Hannah","1","293"
"Class B","Lond","Jemma","32.2","0"
"Class B","Lond","James","32.0","0"

CSV 1 updated (this is the desired output)

Class,Local,Name,DPE,JJK,Score,No
Class A,York,Tom,x,x,32,653
Class A,York,Jim,x,x,10,742
Class A,York,Sam,x,x,32,653
Class B,Dumpton Park,Sarah,x,x,23.5,12
Class B,Dumpton Park,Bob,x,x,23.1,299
Class B,Dumpton Park,Bill,x,x,23.4,198
Class A,Dover,Andy,x,x,23,983
Class A,Dover,Hannah,x,x,1,293
Class B,London,Jemma,x,x,32.2,
Class B,London,James,x,x,32.0,

I would really appreciate any help with this problem. Thanks, Oliver

4

6 Answers

9

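Here are two solutions: a pandas solution and a plain Python solution. First the pandas solution which, unsurprisingly, looks a lot like the other pandas solutions...

First, load the data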

import pandas
import numpy as np

cdf1 = pandas.read_csv('csv1',dtype=object)  #dtype = object allows us to preserve the numeric formats
cdf2 = pandas.read_csv('csv2',dtype=object)

col_order = cdf1.columns  #pandas will shuffle the column order at some point---this allows us to reset to the original column order

At this point the data frames look like

In [6]: cdf1
Out[6]: 
     Class         Local    Name DPE JJK Score   No
0  Class A          York     Tom   x   x    32  NaN
1  Class A          York     Jim   x   x    10  NaN
2  Class A          York     Sam   x   x    32  NaN
3  Class B  Dumpton Park   Sarah   x   x   NaN  NaN
4  Class B  Dumpton Park     Bob   x   x   NaN  NaN
5  Class B  Dumpton Park    Bill   x   x   NaN  NaN
6  Class A         Dover    Andy   x   x   NaN  NaN
7  Class A         Dover  Hannah   x   x   NaN  NaN
8  Class B        London   Jemma   x   x   NaN  NaN
9  Class B        London   James   x   x   NaN  NaN

In [7]: cdf2
Out[7]: 
     Class Location Student Scorecard Number
0  Class A     York     Jim         0    742
1  Class A     York     Sam         0    931
2  Class A     York     Tom         0    653
3  Class B    Dumpt     Bob      23.1    299
4  Class B    Dumpt    Bill      23.4    198
5  Class B    Dumpt   Sarah      23.5     12
6  Class A    Dover    Andy        23    983
7  Class A    Dover  Hannah         1    293
8  Class B     Lond   Jemma      32.2      0
9  Class B     Lond   James      32.0      0

Next, bring the two data frames into matching formats.

dcol = cdf2.Location 
cdf2['Location'] = dcol.apply(lambda x: x[0:4])  #Replacement in cdf2 since we don't need original data

dcol = cdf1.Local
cdf1['Location'] = dcol.apply(lambda x: x[0:4])  #Here we add a column leaving 'Local' because we'll need it for the final output

cdf2 = cdf2.rename(columns={'Student': 'Name', 'Scorecard': 'Score', 'Number': 'No'})
cdf2 = cdf2.replace('0', np.nan)  #Replacing '0' by np.nan means zeros don't overwrite

cdf1 = cdf1.set_index(['Class', 'Location', 'Name'])
cdf2 = cdf2.set_index(['Class', 'Location', 'Name'])

Now cdf1 and cdf2 look like

In [16]: cdf1
Out[16]: 
                                Local DPE JJK Score   No
Class   Location Name                                   
Class A York     Tom             York   x   x    32  NaN
                 Jim             York   x   x    10  NaN
                 Sam             York   x   x    32  NaN
Class B Dump     Sarah   Dumpton Park   x   x   NaN  NaN
                 Bob     Dumpton Park   x   x   NaN  NaN
                 Bill    Dumpton Park   x   x   NaN  NaN
Class A Dove     Andy           Dover   x   x   NaN  NaN
                 Hannah         Dover   x   x   NaN  NaN
Class B Lond     Jemma         London   x   x   NaN  NaN
                 James         London   x   x   NaN  NaN

In [17]: cdf2
Out[17]: 
                        Score   No
Class   Location Name             
Class A York     Jim      NaN  742
                 Sam      NaN  931
                 Tom      NaN  653
Class B Dump     Bob     23.1  299
                 Bill    23.4  198
                 Sarah   23.5   12
Class A Dove     Andy      23  983
                 Hannah     1  293
Class B Lond     Jemma   32.2  NaN
                 James   32.0  NaN

Update the data in cdf1 with the data in cdf2

cdf1.update(cdf2, overwrite=False)

The result is

In [19]: cdf1
Out[19]: 
                                Local DPE JJK Score   No
Class   Location Name                                   
Class A York     Tom             York   x   x    32  653
                 Jim             York   x   x    10  742
                 Sam             York   x   x    32  931
Class B Dump     Sarah   Dumpton Park   x   x  23.5   12
                 Bob     Dumpton Park   x   x  23.1  299
                 Bill    Dumpton Park   x   x  23.4  198
Class A Dove     Andy           Dover   x   x    23  983
                 Hannah         Dover   x   x     1  293
Class B Lond     Jemma         London   x   x  32.2  NaN
                 James         London   x   x  32.0  NaN
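
A small sketch of how the `overwrite` argument changes `DataFrame.update`, in case cdf1 already holds a non-blank value that cdf2 should replace. The toy frames here are made up for illustration and are not the question's data. In both modes NaN values in the other frame are ignored, which is why replacing '0' with NaN protects existing values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: Tom has a value in both frames, Bob only in theirs,
# and Jim only in ours.
ours = pd.DataFrame({"Score": [32.0, np.nan, 10.0]}, index=["Tom", "Bob", "Jim"])
theirs = pd.DataFrame({"Score": [40.0, 23.1, np.nan]}, index=["Tom", "Bob", "Jim"])

keep_existing = ours.copy()
keep_existing.update(theirs, overwrite=False)  # only fills cells that are NaN in ours

take_theirs = ours.copy()
take_theirs.update(theirs)                     # overwrite=True is the default

print(keep_existing["Score"].tolist())
print(take_theirs["Score"].tolist())
```

With the sample data in the question the two modes give the same result, because every cell that cdf2 fills is blank in cdf1. They differ only when both files carry a non-zero value for the same student.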

Finally, restore cdf1 to its original form and write it to a csv file.

cdf1 = cdf1.reset_index()  #These two steps allow us to remove the 'Location' column
del cdf1['Location']    
cdf1 = cdf1[col_order]     #This will switch Local and Name back to their original order

cdf1.to_csv('temp.csv',index = False)

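Two caveats. First, given how easy it is to use cdf1.Local.value_counts() or len(cdf1.Local.value_counts()) and the like, I strongly recommend adding some checks to make sure that, in moving from the full location to its first few letters, you don't accidentally collapse two locations into one. Second, I sincerely hope there is a typo in row 4 of your desired output (Sam's No is shown as 653, but CSV 2 gives 931).

Now on to a plain Python solution. In what follows, adjust the filenames as needed.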
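
One possible form of such a check, written in plain Python so it works without pandas. The function name `truncation_collisions` and the extra town `Londonderry` are hypothetical; the four-character prefix mirrors the slicing used above.

```python
from collections import defaultdict


def truncation_collisions(locations, prefix_len=4):
    """Map each prefix to the full names sharing it, keeping only ambiguous prefixes."""
    by_prefix = defaultdict(set)
    for loc in locations:
        by_prefix[loc[:prefix_len]].add(loc)
    return {p: sorted(names) for p, names in by_prefix.items() if len(names) > 1}


towns = ["York", "Dumpton Park", "Dover", "London"]
print(truncation_collisions(towns))                    # empty dict: no collisions
print(truncation_collisions(towns + ["Londonderry"]))  # 'Lond' becomes ambiguous
```

An empty result means truncating is safe for that list of locations; anything else lists the names that would be merged.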

#Open all of the necessary files
csv1 = open('csv1','r')
csv2 = open('csv2','r')
csvout = open('csv_out','w')

#Read past both headers and write the header to the outfile
wstr = csv1.readline()
csvout.write(wstr)
csv2.readline()

#Read csv1 into a dictionary with keys of Class,Name,and first four digits of Local and keep a list of keys for line ordering
line_keys = []
line_dict = {}
for line in csv1:
    s = line.split(',')
    this_key = (s[0],s[1][0:4],s[2])
    line_dict[this_key] = s
    line_keys.append(this_key)

#Go through csv2 updating the data in csv1 as necessary
for line in csv2:
    s = line.replace('\"','').split(',')
    this_key = (s[0],s[1][0:4],s[2])
    if this_key in line_dict:   #Lowers the crash rate...
        #Check if need to replace Score (blank or zero values are skipped)...
        if s[3].strip() and float(s[3]) != 0:
            line_dict[this_key][5] = s[3]
        #Check if need to replace No (the last field still carries the newline, so strip before testing)...
        if s[4].strip() and float(s[4]) != 0:
            line_dict[this_key][6] = s[4].strip() + '\n'
    else:
        print("Line not in csv1: %s" % line)

#Write the updated line_dict to csvout
for key in line_keys:
    wstr = ','.join(line_dict[key])
    csvout.write(wstr)
csvout.write('\n')

#Close all of the open filehandles
csv1.close()
csv2.close()
csvout.close()
answered 2013-11-18T20:31:19.033
5

You can use fuzzywuzzy to match the town names, and append the result as a column to df2:

import pandas as pd

df1 = pd.read_csv(csv1)  # csv1 and csv2 hold the paths to the two files
df2 = pd.read_csv(csv2)

towns = df1.Local.unique()  # assuming this is complete list of towns

from fuzzywuzzy.fuzz import partial_ratio

In [11]: df2['Local'] =  df2.Location.apply(lambda short_location: max(towns, key=lambda t: partial_ratio(short_location, t)))

In [12]: df2
Out[12]: 
     Class Location Student  Scorecard  Number         Local
0  Class A     York     Jim        0.0     742          York
1  Class A     York     Sam        0.0     931          York
2  Class A     York     Tom        0.0     653          York
3  Class B    Dumpt     Bob       23.1     299  Dumpton Park
4  Class B    Dumpt    Bill       23.4     198  Dumpton Park
5  Class B    Dumpt   Sarah       23.5      12  Dumpton Park
6  Class A    Dover    Andy       23.0     983         Dover
7  Class A    Dover  Hannah        1.0     293         Dover
8  Class B     Lond   Jemma       32.2       0        London
9  Class B     Lond   James       32.0       0        London
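
If installing fuzzywuzzy is not an option, a plain prefix match may be enough for data like this. The sketch below is a stand-in for the fuzzy match, not the same technique: `expand_location` is a hypothetical helper that returns a town only when exactly one name starts with the short form.

```python
def expand_location(short, towns):
    """Return the single town starting with `short`, or None if zero or several match."""
    matches = [t for t in towns if t.lower().startswith(short.lower())]
    return matches[0] if len(matches) == 1 else None


towns = ["York", "Dumpton Park", "Dover", "London"]
for short in ["York", "Dumpt", "Dover", "Lond", "D"]:
    # "D" is ambiguous (Dumpton Park, Dover), so it maps to None
    print(short, "->", expand_location(short, towns))
```

Unlike `max(..., key=partial_ratio)`, which always picks some town, this refuses to guess when the short form is ambiguous or unknown.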

Make the names consistent (at the moment Student and Name are named differently):

In [13]: df2.rename(columns={'Student': 'Name'}, inplace=True)

Now you can merge (on the overlapping columns):

In [14]: res = df1.merge(df2, how='outer')

In [15]: res
Out[15]: 
     Class         Local    Name DPE JJK  Score  No Location  Scorecard  Number
0  Class A          York     Tom   x   x     32 NaN     York        0.0     653
1  Class A          York     Jim   x   x     10 NaN     York        0.0     742
2  Class A          York     Sam   x   x     32 NaN     York        0.0     931
3  Class B  Dumpton Park   Sarah   x   x    NaN NaN    Dumpt       23.5      12
4  Class B  Dumpton Park     Bob   x   x    NaN NaN    Dumpt       23.1     299
5  Class B  Dumpton Park    Bill   x   x    NaN NaN    Dumpt       23.4     198
6  Class A         Dover    Andy   x   x    NaN NaN    Dover       23.0     983
7  Class A         Dover  Hannah   x   x    NaN NaN    Dover        1.0     293
8  Class B        London   Jemma   x   x    NaN NaN     Lond       32.2       0
9  Class B        London   James   x   x    NaN NaN     Lond       32.0       0

One thing left to clean up is the score; I think I'd take the max of the two:

In [16]: res['Score'] = res.loc[:, ['Score', 'Scorecard']].max(1)

In [17]: del res['Scorecard'] 
         del res['No']
         del res['Location']

Then you're left with the columns you want:

In [18]: res
Out[18]: 
     Class         Local    Name DPE JJK  Score  Number
0  Class A          York     Tom   x   x   32.0     653
1  Class A          York     Jim   x   x   10.0     742
2  Class A          York     Sam   x   x   32.0     931
3  Class B  Dumpton Park   Sarah   x   x   23.5      12
4  Class B  Dumpton Park     Bob   x   x   23.1     299
5  Class B  Dumpton Park    Bill   x   x   23.4     198
6  Class A         Dover    Andy   x   x   23.0     983
7  Class A         Dover  Hannah   x   x    1.0     293
8  Class B        London   Jemma   x   x   32.2       0
9  Class B        London   James   x   x   32.0       0

In [18]: res.to_csv('foo.csv')

Note: to force the dtype to object (and get a mix of ints and floats rather than all floats), you can use apply. If you're going to do any analysis, I'd recommend against this!

res['Score'] = res['Score'].apply(lambda x: int(x) if int(x) == x else x, convert_dtype=False)
answered 2013-11-14T21:20:17.090
5

Hopefully this code is more readable. ;) A backport of Python's new Enum type is available on PyPI (enum34).

from enum import Enum       # see PyPI for the backport (enum34)

class Field(Enum):

    course = 0
    location = 1
    student = 2
    dpe = 3
    jjk = 4
    score = -2
    number = -1

    def __index__(self):
        return self._value_

def Float(text):
    if not text:
        return 0.0
    return float(text)

def load_our_data(filename):
    "return a dict using the first three fields as the key"
    data = dict()
    with open(filename) as input:
        next(input)  # throw away header
        for line in input:
            fields = line.strip('\n').split(',')
            fields[Field.score] = Float(fields[Field.score])
            fields[Field.number] = Float(fields[Field.number])
            key = (
                fields[Field.course].lower(),
                fields[Field.location][:4].lower(),
                fields[Field.student].lower(),
                )
            data[key] = fields
    return data

def load_their_data(filename):
    "return a dict using the first three fields as the key"
    data = dict()
    with open(filename) as input:
        next(input)  # throw away header
        for line in input:
            fields = line.strip('\n').split(',')
            fields = [f.strip('"') for f in fields]
            fields[Field.score] = Float(fields[Field.score])
            fields[Field.number] = Float(fields[Field.number])
            key = (
                fields[Field.course].lower(),
                fields[Field.location][:4].lower(),
                fields[Field.student].lower(),
                )
            data[key] = fields
    return data

def merge_data(ours, theirs):
    "their data is only used if not blank and non-zero"
    for key, our_data in ours.items():
        their_data = theirs.get(key)
        if their_data is None:   # no matching record in their file
            continue
        if their_data[Field.score]:
            our_data[Field.score] = their_data[Field.score]
        if their_data[Field.number]:
            our_data[Field.number] = their_data[Field.number]

def write_our_data(data, filename):
    with open(filename, 'w') as output:
        for record in sorted(data.values()):
            line = ','.join([str(f) for f in record])
            output.write(line + '\n')

if __name__ == '__main__':
    ours = load_our_data('one.csv')
    theirs = load_their_data('two.csv')
    merge_data(ours, theirs)
    write_our_data(ours, 'three.csv')
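
The `__index__` method is what lets the enum members in the code above be used directly as list indices. A minimal sketch of that mechanism, using a hypothetical `Col` enum and a made-up row:

```python
from enum import Enum


class Col(Enum):
    student = 2
    score = -2
    number = -1

    def __index__(self):
        # Lists call __index__ on objects used as subscripts.
        return self.value


row = ["Class A", "York", "Tom", "x", "x", "32", ""]
print(row[Col.student], row[Col.score])

row[Col.number] = "653"   # assignment goes through the same protocol
print(row)
```

Negative values count from the end of the list, which is how the same `score` and `number` members can address both the seven-column and the five-column file.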
answered 2013-11-15T00:37:13.757
4

Python dictionaries are the way to go here:

studentDict = {}
csv1, csv2, outFile = 'csv1.csv', 'csv2.csv', 'out.csv'  # placeholder paths, adjust as needed

with open(csv1, 'r') as f:
  header = next(f)  # keep the header line for the output
  for line in f:
    LL = line.rstrip('\n').replace('"','').split(',')
    # key on class, first four letters of the location, and name
    studentDict[LL[0], LL[1][:4], LL[2]] = LL

with open(csv2, 'r') as f:
  next(f)  # skip the header line
  for line in f:
    LL = line.rstrip('\n').replace('"','').split(',')
    key = (LL[0], LL[1][:4], LL[2])
    if key not in studentDict: continue  # student not present in csv1
    if LL[-2] not in ('0', ''): studentDict[key][-2] = LL[-2]
    if LL[-1] not in ('0', ''): studentDict[key][-1] = LL[-1]

with open(outFile, 'w') as f:
  f.write(header)
  for v in studentDict.values():  # dicts keep insertion order in Python 3.7+
    f.write(','.join(v) + '\n')
answered 2013-11-10T16:40:30.683
4

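pandas makes this kind of task a bit more convenient.

EDIT: OK, since you can't rely on renaming the columns by hand, Roman's suggestion of matching only the first few letters is a good one. We have to change a few things before that, though.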

In [62]: df1 = pd.read_clipboard(sep=',')

In [63]: df2 = pd.read_clipboard(sep=',')

In [68]: df1
Out[68]: 
     Class Location Student  Scorecard  Number
0  Class A     York     Jim        0.0     742
1  Class A     York     Sam        0.0     931
2  Class A     York     Tom        0.0     653
3  Class B    Dumpt     Bob       23.1     299
4  Class B    Dumpt    Bill       23.4     198
5  Class B    Dumpt   Sarah       23.5      12
6  Class A    Dover    Andy       23.0     983
7  Class A    Dover  Hannah        1.0     293
8  Class B     Lond   Jemma       32.2       0
9  Class B     Lond   James       32.0       0

In [69]: df2
Out[69]: 
     Class         Local    Name DPE JJK  Score   No
0  Class A          York     Tom   x   x   32.0  653
1  Class A          York     Jim   x   x   10.0  742
2  Class A          York     Sam   x   x   32.0  653
3  Class B  Dumpton Park   Sarah   x   x   23.5   12
4  Class B  Dumpton Park     Bob   x   x   23.1  299
5  Class B  Dumpton Park    Bill   x   x   23.4  198
6  Class A         Dover    Andy   x   x   23.0  983
7  Class A         Dover  Hannah   x   x    1.0  293
8  Class B        London   Jemma   x   x   32.2  NaN
9  Class B        London   James   x   x   32.0  NaN

Get the column names to match.

In [70]: df1 = df1.rename(columns={'Location': 'Local', 'Student': 'Name', 'Scorecard': 'Score', 'Number': 'No'})

Now for the locations. Save the originals from df2 in a separate Series.

In [71]: locations = df2['Local']

In [72]: df1['Local'] = df1['Local'].str.slice(0, 4)

In [73]: df2['Local'] = df2['Local'].str.slice(0, 4)

The string methods truncate each location to its first 4 characters (assuming this doesn't cause any false matches).

Now set the index:

In [78]: df1 = df1.set_index(['Class', 'Local', 'Name'])

In [79]: df2 = df2.set_index(['Class', 'Local', 'Name'])

In [80]: df1
Out[80]: 
                      Score   No
Class   Local Name              
Class A York  Jim       0.0  742
              Sam       0.0  931
              Tom       0.0  653
Class B Dump  Bob      23.1  299
              Bill     23.4  198
              Sarah    23.5   12
Class A Dove  Andy     23.0  983
              Hannah    1.0  293
Class B Lond  Jemma    32.2    0
              James    32.0    0

In [83]: df1 = df1.replace(0, np.nan)
In [84]: df2 = df2.replace(0, np.nan)

Finally, update the scores as before:

In [85]: df1.update(df2, overwrite=False)

You can get the original locations back with:

In [91]: df1 = df1.reset_index()
In [92]: df1['Local'] = locations

You can write the output to csv (and a bunch of other formats) with df1.to_csv('path/to/csv').

answered 2013-11-10T19:43:14.900
2

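You could try the csv module from the standard library. My solution is very similar to Chris H's, but I use the csv module to read and write the files. (In fact, I stole his technique of storing the keys in a list to preserve the order.)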

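If you use the csv module, you don't have to worry much about the quotes, and it also lets you read rows straight into dictionaries with the column names as keys.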
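
To illustrate the point about quotes, here is a short sketch comparing a naive split with `csv.DictReader`. The student name "Smith, Bob" is made up to show a comma inside a quoted field; it does not appear in the question's data.

```python
import csv
import io

# Hypothetical CSV 2 excerpt with a comma inside a quoted field.
raw = (
    '"Class","Location","Student","Scorecard","Number"\n'
    '"Class B","Dumpt","Smith, Bob","23.1","299"\n'
)

# Stripping quotes and splitting on commas breaks the name in two.
naive = raw.splitlines()[1].replace('"', '').split(',')

# DictReader takes the first row as the header and respects the quoting.
rows = list(csv.DictReader(io.StringIO(raw)))

print(len(naive))           # six pieces instead of five fields
print(rows[0]["Student"])   # the quoted comma stays inside one field
```

When no field names are passed, `DictReader` reads them from the first row, so there is no header row to skip by hand.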

import csv

# Open first CSV, and read each line as a dictionary with column names as keys.
# (Python 3: open in text mode with newline='' as the csv module recommends.)
with open('csv1.csv', 'r', newline='') as csvfile1:
    table1 = csv.DictReader(csvfile1,['Class', 'Local', 'Name',
                            'DPE', 'JJK', 'Score', 'No'])
    next(table1) #skip header row
    first_table = {}
    original_order = [] #list keys to save original order
    # build dictionary of rows with name, location, and class as keys
    for row in table1:
        id = "%s from %s in %s" % (row['Name'], row['Local'][:4], row['Class'])
        first_table[id] = row
        original_order.append(id)

# Repeat for second csv, but don't worry about order
with open('csv2.csv', 'r', newline='') as csvfile2:
    table2 = csv.DictReader(csvfile2, ['Class', 'Location',
                            'Student', 'Scorecard', 'Number'])
    next(table2) #skip header row
    second_table = {}
    for row in table2:
        id = "%s from %s in %s" % (row['Student'], row['Location'][:4], row['Class'])
        second_table[id] = row

with open('student_data.csv', 'w', newline='') as finalfile:
    results = csv.DictWriter(finalfile, ['Class', 'Local', 'Name',
                             'DPE', 'JJK', 'Score', 'No'])
    results.writeheader()
    # Replace data in first csv with data in second csv when conditions are satisfied.
    for student in original_order:
        update = second_table.get(student, {})  # empty if the student is missing from csv2
        if update.get('Scorecard', '') not in ("0", ""):
            first_table[student]['Score'] = update['Scorecard']
        if update.get('Number', '') not in ("0", ""):
            first_table[student]['No'] = update['Number']
        results.writerow(first_table[student])

Hope this helps.

answered 2013-11-21T06:58:44.707