
I want to extract all of the statistics from each of the following entries:

#ChangeColumnFullTimeGraduatesEmployedAtGraduation:74.3%    #ChangeColumnAverageStartingSalaryAndBonus:$134,360 3.4 #ChangeColumnFullTimeGraduatesEmployedThreeMonthsAfterGraduation:81.4%  #ChangeColumnPeerAssessmentScoreOutOf5.:4.3
#ChangeColumnFullTimeGraduatesEmployedAtGraduation:82.0%    #ChangeColumnAverageStartingSalaryAndBonus:$127,368 3.29    #ChangeColumnFullTimeGraduatesEmployedThreeMonthsAfterGraduation:89.8%  #ChangeColumnPeerAssessmentScoreOutOf5.:4.1
#ChangeColumnFullTimeGraduatesEmployedAtGraduation:80.7%    #ChangeColumnAverageStartingSalaryAndBonus:$123,177 3.4 #ChangeColumnFullTimeGraduatesEmployedThreeMonthsAfterGraduation:92.5%  #ChangeColumnPeerAssessmentScoreOutOf5.:4.0

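I have been trying to do this with a regular expression (regex). Since the desired output contains nothing but digits and a percent or $ sign, this is what I cobbled together: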

import re
import csv

with(open('sheet.csv','rU')) as f:

    for row in f:
        re.sub([^0-9\$\%],'',row)

This returns the following syntax error:

re.sub([^0-9\$\%],'',row)

1 Answer


Regular expressions are parsed from strings, so pass the pattern to re.sub as a string, i.e.

>>> re.sub(r'[^0-9\$\%]','',row)
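
As a minimal sketch of how that call behaves on one line of input: the row below is copied from the question and stands in for a line read from sheet.csv, and Python 3 is assumed. Note that re.sub returns a new string rather than modifying row, so the result has to be captured.

```python
import re

# Sample row copied from the question; a stand-in for one line of sheet.csv.
row = ("#ChangeColumnFullTimeGraduatesEmployedAtGraduation:74.3% "
       "#ChangeColumnAverageStartingSalaryAndBonus:$134,360 3.4 "
       "#ChangeColumnFullTimeGraduatesEmployedThreeMonthsAfterGraduation:81.4% "
       "#ChangeColumnPeerAssessmentScoreOutOf5.:4.3")

# The pattern is a raw string; inside a character class, $ and % need no escaping.
# re.sub returns the cleaned copy and leaves row unchanged.
cleaned = re.sub(r'[^0-9$%]', '', row)
print(cleaned)
```

With separators and decimal points removed, the surviving characters run together into a single string, which is why splitting is likely the more useful approach here.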

Or you might want to split instead:

>>> [c for c in re.split(r'[^0-9\$\%\.]',row) if c]
['74.3%', '$134', '360', '3.4', '81.4%', '5.', '4.3']

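That is still not quite right, because your column labels contain digits. If your input looks exactly like your example, something like this might work better: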

>>> [c for c in re.split(r'#[^:]+:|[\s,]',row) if c]
['74.3%', '$134', '360', '3.4', '81.4%', '4.3']
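
Splitting on the comma breaks the salary into two pieces. One possible variation, sketched here under the assumption that every label starts with # and ends at its first colon and that no value contains whitespace, is to split on the labels only and then split each field on whitespace. The names fields and values are illustrative, and Python 3 is assumed.

```python
import re

# Sample row copied from the question; a stand-in for one line of sheet.csv.
row = ("#ChangeColumnFullTimeGraduatesEmployedAtGraduation:74.3% "
       "#ChangeColumnAverageStartingSalaryAndBonus:$134,360 3.4 "
       "#ChangeColumnFullTimeGraduatesEmployedThreeMonthsAfterGraduation:81.4% "
       "#ChangeColumnPeerAssessmentScoreOutOf5.:4.3")

# Split on the "#Label:" markers only, dropping the empty piece before the first label.
fields = [f.split() for f in re.split(r'#[^:]+:', row) if f.strip()]

# Flatten the per-label groups into one list of values.
values = [v for f in fields for v in f]
print(values)
```

This keeps '$134,360' as a single token; removing the comma or the $ sign afterwards is a separate step if numeric values are needed.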
answered 2013-07-25T21:01:53.157