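I've been at this for a while, and I think asking the experts for advice is my best option. I know I'm not writing this the best way, and I've gone down a rabbit hole and confused myself.
I have a .csv. A bunch of them, actually. That part isn't the problem.
The rows at the top of the CSV aren't really CSV data, but they do contain one important piece of information: the date the data is valid for. For some report types it's on one line, and for others it's on a different line.
My data starts some number of lines down from the top, usually at line 10 or 11, but I can't always be sure. I do know that the first column of that row always has the same information (the header of the data table).
I want to pull the report date from the preceding rows, then for file type A do stuffA and for file type B do stuffB, and then write the row to a new file. I'm having trouble advancing through the rows, and I don't know what I'm doing wrong.
Sample data: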
"Attribute ""OPSURVEYLEVEL2_O"" [Category = ""Retail v1""]"
Date exported: 2/16/13
Exported by user: William
Project:
Classification: Online Retail v1
Report type: Attributes
Date range: from 12/14/12 to 12/14/12
"Filter OpSurvey Level 2(mine): [ LEVEL:SENTENCE TYPE:KEYWORD {OPSURVEYLEVEL2_O:""gift certificate redemption"", OPSURVEYLEVEL2_O:""combine accounts"", OPSURVEYLEVEL2_O:""cancel account"", OPSURVEYLEVEL2_O:""saved project moved to purchased project"", OPSURVEYLEVEL2_O:""unlock account"", OPSURVEYLEVEL2_O:""affiliate promotions"", OPSURVEYLEVEL2_O:""print to store coupons"", OPSURVEYLEVEL2_O:""disclaimer not clear"", OPSURVEYLEVEL2_O:""prepaid issue"", OPSURVEYLEVEL2_O:""customer wants to use coupons for print to store"", OPSURVEYLEVEL2_O:""customer received someone else's order"", OPSURVEYLEVEL2_O:""hi-res images unavailable"", OPSURVEYLEVEL2_O:""how to re-order"", OPSURVEYLEVEL2_O:""missing items"", OPSURVEYLEVEL2_O:""missing envelopes: print to store"", OPSURVEYLEVEL2_O:""missing envelopes: mail order"", OPSURVEYLEVEL2_O:""group rooms"", OPSURVEYLEVEL2_O:""print to store"", OPSURVEYLEVEL2_O:""print to store coupons"", OPSURVEYLEVEL2_O:""publisher: card not available for print to store"", OPSURVEYLEVEL2_O:publisher}]"
Total: 905
OPSURVEYLEVEL2_O,Distinct Document,% of Document,Sentiment Score
PRINT TO STORE,297,32.82,-0.1
...
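To make the date extraction concrete, here is a minimal sketch of pulling the end date out of a preamble line like the `Date range` line in the sample. The helper name `extract_report_date` and the regular expression are assumptions made for illustration, not part of the report tool. It assumes the date is written as `M/D/YY` after the word `to`, and it is written for Python 3.

```python
import re

# Hypothetical pattern (an assumption): matches a line such as
# "Date range: from 12/14/12 to 12/14/12" and captures the end date.
DATE_RANGE_RE = re.compile(r"Date range:.*\bto\s+(\d{1,2}/\d{1,2}/\d{2,4})\s*$")


def extract_report_date(line, default="DATE NOT FOUND"):
    """Return the end date from a 'Date range' line, or `default` if absent."""
    match = DATE_RANGE_RE.search(line)
    return match.group(1) if match else default


print(extract_report_date("Date range: from 12/14/12 to 12/14/12"))
print(extract_report_date("Total: 905"))
```

Matching the date by pattern rather than taking a fixed number of trailing characters should also cope with single-digit months or days, such as `2/6/13`.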
Sample code:
#!/usr/bin/python
import csv, os, glob, sys, errno

path = '/path/to/Downloads'

for infile in glob.glob(os.path.join(path,'report_ATTRIBUTE_OP*.csv')):
    if 'OPSURVEYLEVEL2' in infile:
        prime_column = 'ops2'
    elif 'OPSURVEYLEVEL3' in infile:
        prime_column = 'ops3'
    else:
        sys.exit(errno.ENOENT)
    with open(infile, "r") as csvfile:
        reader = csv.reader(csvfile)
        report_date = 'DATE NOT FOUND'
        # import pdb; pdb.set_trace()
        for row in reader:
            foo = 0
            while foo < 1:
                if row[0][0:].find('OPSURVEYLEVEL') == 0:
                    foo = 1
                if "Date range" in row:
                    report_date = row[0][-8:]
                    break
            if foo >= 1:
                if row[0][0:].find('OPSURVEYLEVEL') == 0:
                    break
                if 'ops2' in prime_column:
                    dup_col = row[0]
                    row.insert(0,dup_col)
                    row.append(report_date)
                elif 'ops3' in prime_column:
                    row.append(report_date)
                with open('report_merge.csv', 'a') as outfile:
                    outfile.write(row)
            reader.next()
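For comparison, here is a minimal sketch of the flow described above, written for Python 3 rather than the Python 2 of the sample code. The function name `merge_report`, its signature, and the constant `HEADER_PREFIX` are hypothetical, and the inline sample is a shortened subset of the data shown earlier. The idea it illustrates is that a `csv.reader` is an iterator, so two consecutive `for` loops over the same reader pick up where the previous one stopped, with no manual `next()` calls or flag variables.

```python
import csv
import io

# Assumption: the data table's header row always starts with this text.
HEADER_PREFIX = "OPSURVEYLEVEL"


def merge_report(lines, prime_column, writer):
    """Scan the preamble for the report date, then write the data rows.

    `lines` is any iterable of text lines (an open file works) and
    `writer` is a csv.writer. Name and signature are hypothetical.
    """
    reader = csv.reader(lines)
    report_date = "DATE NOT FOUND"

    # Phase 1: consume the preamble up to and including the header row.
    for row in reader:
        if not row:
            continue  # skip blank lines
        if row[0].startswith("Date range"):
            report_date = row[0].split()[-1]  # last token, e.g. "12/14/12"
        elif row[0].startswith(HEADER_PREFIX):
            break  # header found; the reader now sits at the first data row

    # Phase 2: the same iterator continues with the data rows.
    for row in reader:
        if not row:
            continue
        if prime_column == "ops2":
            row.insert(0, row[0])  # duplicate the first column
        row.append(report_date)
        writer.writerow(row)


sample = io.StringIO(
    "Report type: Attributes\n"
    "Date range: from 12/14/12 to 12/14/12\n"
    "Total: 905\n"
    "OPSURVEYLEVEL2_O,Distinct Document,% of Document,Sentiment Score\n"
    "PRINT TO STORE,297,32.82,-0.1\n"
)
out = io.StringIO()
merge_report(sample, "ops2", csv.writer(out, lineterminator="\n"))
print(out.getvalue())
```

With real files, the input would be opened with `open(infile, newline="")` and the output file opened once, outside the row loop, and wrapped in `csv.writer` so that each row is written with `writerow` rather than passed as a list to `write`.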