I am using a combiner in addition to a mapper and a reducer.
My mapper code is as follows:
#!/usr/bin/env python
import sys
import datetime

def main():
    for line in sys.stdin:
        data = line.strip().split("\t")
        if len(data) != 6:
            continue
        try:
            float(data[4])
        except:
            continue
        else:  # no exception
            sale_date = data[0]
            ymd = sale_date.split('-')
            date_obj = datetime.date(int(ymd[0]), int(ymd[1]), int(ymd[2]))
            print "{0}\t{1}".format(date_obj.weekday(), data[4])

main()
My reducer code is as follows:
#!/usr/bin/env python
# this reducer also acts as a combiner
import collections
import sys

def main():
    sales_counter = collections.defaultdict(int)
    sales_sum = collections.defaultdict(float)
    for line in sys.stdin:
        data = line.strip().split("\t")
        if len(data) == 3:  # acting as reducer
            sales_counter[data[0]] = sales_counter[data[0]] + int(data[1])
            sales_sum[data[0]] = sales_sum[data[0]] + float(data[2])
        elif len(data) == 2:  # acting as combiner
            sales_counter[data[0]] = sales_counter[data[0]] + 1
            sales_sum[data[0]] = sales_sum[data[0]] + float(data[1])
        else:
            continue  # invalid line read, ignore
    for key in sorted(sales_sum):
        print key,"\t",sales_counter[key],"\t",sales_sum[key]

main()
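The data file looks like this (only the first 10 lines are shown; the mapper expects six tab-separated fields per line):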
2012-01-01 09:00 San Jose Men's Clothing 214.05 Amex
2012-01-01 09:00 Fort Worth Women's Clothing 153.57 Visa
2012-01-01 09:00 San Diego Music 66.08 Cash
2012-01-01 09:00 Pittsburgh Pet Supplies 493.51 Discover
2012-01-01 09:00 Omaha Children's Clothing 235.63 MasterCard
2012-01-01 09:00 Stockton Men's Clothing 247.18 MasterCard
2012-01-01 09:00 Austin Cameras 379.6 Visa
2012-01-01 09:00 New York Consumer Electronics 296.8 Cash
2012-01-01 09:00 Corpus Christi Toys 25.38 Discover
2012-01-01 09:00 Fort Worth Toys 213.88 Visa
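For reference, here is a minimal sketch of what the mapper logic yields for the first sample line. It assumes the fields are tab-separated, and `map_line` is a hypothetical helper (not part of my scripts) that returns the pair instead of printing it, so it also runs under Python 3.

```python
import datetime


def map_line(line):
    """Hypothetical helper mirroring the mapper: return (weekday, cost) or None."""
    data = line.strip().split("\t")
    if len(data) != 6:
        return None  # wrong number of fields, skip
    try:
        float(data[4])
    except ValueError:
        return None  # cost is not numeric, skip
    year, month, day = (int(part) for part in data[0].split("-"))
    # weekday() counts Monday as 0 and Sunday as 6
    return datetime.date(year, month, day).weekday(), data[4]


# First sample line, with tabs assumed between the six fields
sample = "2012-01-01\t09:00\tSan Jose\tMen's Clothing\t214.05\tAmex"
print(map_line(sample))  # 2012-01-01 is a Sunday, so the key is 6
```

So the key in the first column of the results is the day of the week, and the value passed on is the sale amount.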
The output I get is:
0 34034 8529272.78
0 567400 141834839.29
1 22715 5660345.68
1 566889 141586312.46
2 22611 5625669.74
2 555219 138745830.2
3 22666 5633975.27
3 567051 141719805.3
4 25365 6363847.75
4 563769 141051081.75
5 34131 8560716.09
5 555310 138849461.48
6 34071 8503163.7
6 567245 141793631.77
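I expected to see only one entry per key (the first column). The correct result can indeed be obtained by combining the partial results for each key, but my question is: why are there partial results for each key at all?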
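To show what I mean by combining the partial results, here is a minimal sketch that merges the two rows reported for key 0. `merge_partials` is a hypothetical helper, and two things are assumptions on my part: that keys should be compared after stripping surrounding whitespace, and the exact whitespace in the second row (the numbers themselves are copied from the output above).

```python
import collections


def merge_partials(rows):
    """Hypothetical helper: sum the count and total of rows that share a key."""
    counts = collections.defaultdict(int)
    sums = collections.defaultdict(float)
    for row in rows:
        # Assumption: whitespace around a field is not significant
        key, count, total = (field.strip() for field in row.split("\t"))
        counts[key] += int(count)
        sums[key] += float(total)
    return {key: (counts[key], sums[key]) for key in sorted(counts)}


# The two partial rows for key 0; the spaces in the second row are assumed
rows = [
    "0\t34034\t8529272.78",
    "0 \t 567400 \t 141834839.29",
]
merged = merge_partials(rows)
print(merged["0"][0])  # 34034 + 567400 = 601434
```

Without the `strip()` call, `"0"` and `"0 "` would be counted as two different keys, which may be worth checking when comparing combiner output with raw mapper output.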