I am using a combiner along with a mapper and a reducer.

My mapper code is as follows:

#!/usr/bin/env python

import sys
import datetime

def main():
    for line in sys.stdin:
        data = line.strip().split("\t")
        if len(data) != 6:
            continue
        try:
            float(data[4])
        except ValueError:  # sale amount is not a number, skip the record
            continue
        else: # no exception
            sale_date = data[0]
            ymd = sale_date.split('-')
            date_obj = datetime.date(int(ymd[0]), int(ymd[1]), int(ymd[2]))
            print "{0}\t{1}".format(date_obj.weekday(), data[4])

main()

My reducer code is as follows:

#!/usr/bin/env python
# this reducer also acts as a combiner
import collections
import sys

def main():

    sales_counter = collections.defaultdict(int)
    sales_sum = collections.defaultdict(float)

    for line in sys.stdin:
        data = line.strip().split("\t")
        if len(data) == 3: # acting as reducer
            sales_counter[data[0]] = sales_counter[data[0]] + int(data[1])
            sales_sum[data[0]] = sales_sum[data[0]] + float(data[2])
        elif len(data) == 2: # acting as combiner
            sales_counter[data[0]] = sales_counter[data[0]] + 1
            sales_sum[data[0]] = sales_sum[data[0]] + float(data[1])
        else:
            continue # invalid line read, ignore

    for key in sorted(sales_sum):
        print key,"\t",sales_counter[key],"\t",sales_sum[key]

main()

The data file is formatted as follows (only the first 10 lines are shown):

2012-01-01  09:00   San Jose    Men's Clothing  214.05  Amex
2012-01-01  09:00   Fort Worth  Women's Clothing    153.57  Visa
2012-01-01  09:00   San Diego   Music   66.08   Cash
2012-01-01  09:00   Pittsburgh  Pet Supplies    493.51  Discover
2012-01-01  09:00   Omaha   Children's Clothing 235.63  MasterCard
2012-01-01  09:00   Stockton    Men's Clothing  247.18  MasterCard
2012-01-01  09:00   Austin  Cameras 379.6   Visa
2012-01-01  09:00   New York    Consumer Electronics    296.8   Cash
2012-01-01  09:00   Corpus Christi  Toys    25.38   Discover
2012-01-01  09:00   Fort Worth  Toys    213.88  Visa

The result I get is as follows:

0   34034   8529272.78
0   567400  141834839.29
1   22715   5660345.68
1   566889  141586312.46
2   22611   5625669.74
2   555219  138745830.2
3   22666   5633975.27
3   567051  141719805.3
4   25365   6363847.75
4   563769  141051081.75
5   34131   8560716.09
5   555310  138849461.48
6   34071   8503163.7
6   567245  141793631.77

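I expected to see only one entry per key (the first column). The correct result can indeed be obtained by adding up the partial results for each key. But my question is: why are there partial results for each key at all?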
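
One thing that might be worth checking (an assumption on my part, not something I have confirmed on the cluster): the Python 2 print statement with comma-separated items puts a space between the key and the following "\t", so a record that has passed through the combiner may carry the key "0 " while a record coming straight from the mapper carries "0". The sketch below reuses the reducer's parsing logic on hypothetical in-memory lines (the helper name aggregate and the sample values are made up for illustration) to show how such padding would split one key into two.

```python
import collections

def aggregate(lines):
    """Sum counts and totals per key, mirroring the reducer's parsing."""
    counter = collections.defaultdict(int)
    total = collections.defaultdict(float)
    for line in lines:
        data = line.strip().split("\t")
        if len(data) == 3:    # already aggregated: key, count, sum
            counter[data[0]] += int(data[1])
            total[data[0]] += float(data[2])
        elif len(data) == 2:  # raw mapper output: key, amount
            counter[data[0]] += 1
            total[data[0]] += float(data[1])
    return counter, total

# Hypothetical input: one pre-aggregated record and one raw mapper record.
# In "padded" the key and count carry a trailing space (assumed combiner output).
padded = ["0 \t2 \t30.0", "0\t12.5"]
clean = ["0\t2\t30.0", "0\t12.5"]

padded_keys = sorted(aggregate(padded)[0])  # "0" and "0 " stay separate
clean_keys = sorted(aggregate(clean)[0])    # a single key "0"
```

If this is what is happening, building the output line with "\t".join(...) or "{0}\t{1}\t{2}".format(...) instead of comma-separated print arguments should keep the keys identical between the combiner and reducer passes.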
