python - Statistics with "group by" or "uniq"

Question

I have the following "basefile.csv":

AAM7676,2012-02-02 11:55:52,32,2012-02-03 19:55:30,62,1
AAM7676,2012-02-11 13:56:11,32,2012-02-12 21:00:18,52,2
AAM7676,2012-02-21 16:30:55,32,2012-02-23 13:29:41,62,1
AAM7676,2012-03-07 20:03:32,32,2012-03-09 13:31:35,62,1
AAM7676,2012-05-28 06:08:05,32,2012-05-29 15:49:55,52,2
AAM7676,2012-08-22 12:47:28,32,2012-08-24 08:03:09,52,1
AAO9229,2012-01-10 07:19:29,32,2012-01-11 16:39:16,52,2
AAP0678,2012-04-09 16:35:19,32,2012-04-10 19:46:55,52,2
AAP0678,2012-04-30 16:44:28,32,2012-05-01 19:20:00,52,2
AAP0678,2012-06-01 19:31:34,32,2012-06-03 10:34:33,52,3
AAU6100,2012-01-09 17:49:13,32,2012-01-11 02:00:33,52,3
AAU6100,2012-01-20 21:18:16,32,2012-01-22 14:09:00,52,3
AAU6100,2012-02-20 13:35:39,32,2012-02-21 19:45:55,52,2
AAU6100,2012-03-13 09:50:51,32,2012-03-14 22:35:51,52,3

Based on column 1 (the plate) and column 4 (a datetime), I want to count how many times each plate (column 1) appears in each month (column 4).

The final format should be:

plate,jan,feb,mar,apr,may,jun,jul,aug,sep,oct,nov,dec,total
AAM7676,0,3,1,0,1,0,0,1,0,0,0,0,6
AAO9229,1,0,0,0,0,0,0,0,0,0,0,0,1
AAP0678,0,0,0,1,1,1,0,0,0,0,0,0,3
AAU6100,2,1,1,0,0,0,0,0,0,0,0,0,4

I have already played with (and searched for solutions in) shell scripts and MySQL, but I haven't figured out how to solve it, probably because I'm a newbie.

Any kind of solution is welcome (MySQL, sh, perl, python, ...).

score 4 · Accepted Answer

This is basically a textbook problem for Python's pandas. I assume the column header names are license, time1, num1, time2, num2, count (in that order).

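import pandas
# Assumes basefile.csv starts with a header row: license,time1,num1,time2,num2,count
df = pandas.read_csv("basefile.csv")
# Extract the month number from the second timestamp (column 4 of the file)
df["month"] = df["time2"].map(lambda x: int(x.split('-')[1]))
df.groupby(["license","month"]).apply(len)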

Output:

license  month
AAM7676  2        3
         3        1
         5        1
         8        1
AAO9229  1        1
AAP0678  4        1
         5        1
         6        1
AAU6100  1        2
         2        1
         3        1

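You then have a multi-indexed pandas Series of counts, so you can format the output table however you need. Getting pandas to print something very close to what you want directly is not hard, though: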

t = df.groupby(["license","month"]).apply(len)
t.unstack(level=0).reindex(index=range(1,13), fill_value=0).T.fillna(0)

This prints:

               1   2   3   4   5   6   7   8   9   10  11  12
      license
count AAM7676   0   3   1   0   1   0   0   1   0   0   0   0
      AAO9229   1   0   0   0   0   0   0   0   0   0   0   0
      AAP0678   0   0   0   1   1   1   0   0   0   0   0   0
      AAU6100   2   1   1   0   0   0   0   0   0   0   0   0

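Although this solution requires (a) header names and (b) two third-party libraries, it comes with big wins. You can aggregate and apply group-wise operations very easily, and they are optimized in the same way NumPy is. That pays off for very large data, or when you need to compute many different secondary statistics from your data.

Let me be clear, because I usually get shot down for answers like this. Knowing how to do this in pure Python is a great thing, and Python programmers should take the time to learn it. But don't reinvent the wheel just for the sake of Python. Pandas offers some great tools for this kind of data manipulation.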

score 3 · Accepted Answer

The following Python solution should work:

import csv
import collections

result = collections.OrderedDict()
for cols in csv.reader(open('basefile.csv')):
    if len(cols) != 6:
        continue
    plate = cols[0]
    month = int(cols[3][5:7])
    result.setdefault(plate, [plate] + [0]*12)[month] += 1

print('plate,jan,feb,mar,apr,may,jun,jul,aug,sep,oct,nov,dec,total')
for row in result.values():
    print(','.join(map(str, row)) + ',' + str(sum(row[1:])))

score 3 · Accepted Answer

Using gawk:

gawk -F, '
    {
        plate[$1]++
        split($4, dt, /-0*/)
        count[$1,dt[2]]++
    }
    END {
        print "plate,jan,feb,mar,apr,may,jun,jul,aug,sep,oct,nov,dec,total"
        n = asorti(plate, ordered_plates)
        for (i=1; i<=n; i++) {
            p = ordered_plates[i]
            printf("%s,", p)
            for (m=1; m<=12; m++) 
                printf("%d,", count[p,m])
            print plate[p]
        }
    }
' basefile.csv

Output:

plate,jan,feb,mar,apr,may,jun,jul,aug,sep,oct,nov,dec,total
AAM7676,0,3,1,0,1,0,0,1,0,0,0,0,6
AAO9229,1,0,0,0,0,0,0,0,0,0,0,0,1
AAP0678,0,0,0,1,1,1,0,0,0,0,0,0,3
AAU6100,2,1,1,0,0,0,0,0,0,0,0,0,4

score 3 · Accepted Answer

A Perl solution. It assumes that no field contains an embedded comma.

#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw/ sum /;

my %data;
while (<DATA>) {
    my ($plate, $col4) = (split /,/)[0, 3];
    my ($month) = $col4 =~ /-(\d\d)-/;
    $data{$plate}{$month}++;
}

print join(",", qw/ plate jan feb mar apr may jun jul aug sep oct nov dec total /), "\n";

for my $plate (sort keys %data) {
    my @per_month = map $data{$plate}{$_} || 0, '01' .. '12';
    print join(",", $plate, @per_month, sum @per_month), "\n";
}

__DATA__
AAM7676,2012-02-02 11:55:52,32,2012-02-03 19:55:30,62,1 
AAM7676,2012-02-11 13:56:11,32,2012-02-12 21:00:18,52,2 
AAM7676,2012-02-21 16:30:55,32,2012-02-23 13:29:41,62,1 
AAM7676,2012-03-07 20:03:32,32,2012-03-09 13:31:35,62,1 
AAM7676,2012-05-28 06:08:05,32,2012-05-29 15:49:55,52,2 
AAM7676,2012-08-22 12:47:28,32,2012-08-24 08:03:09,52,1 
AAO9229,2012-01-10 07:19:29,32,2012-01-11 16:39:16,52,2 
AAP0678,2012-04-09 16:35:19,32,2012-04-10 19:46:55,52,2 
AAP0678,2012-04-30 16:44:28,32,2012-05-01 19:20:00,52,2 
AAP0678,2012-06-01 19:31:34,32,2012-06-03 10:34:33,52,3 
AAU6100,2012-01-09 17:49:13,32,2012-01-11 02:00:33,52,3 
AAU6100,2012-01-20 21:18:16,32,2012-01-22 14:09:00,52,3 
AAU6100,2012-02-20 13:35:39,32,2012-02-21 19:45:55,52,2 
AAU6100,2012-03-13 09:50:51,32,2012-03-14 22:35:51,52,3

score 1 · Accepted Answer

I would keep a dictionary of lists:

from collections import defaultdict
d = defaultdict(lambda : [None]+[0]*12)

with open('yourfile') as f:
    for line in f:
        plate,_,_,time,_,_ = line.split(',')  #maybe use csv instead
        month = int(time.split('-')[1])       #get the month
        d[plate][month] += 1
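
The snippet above stops once the counts are collected. A minimal sketch of one way to finish the job and print the requested table follows. The helper names `monthly_counts` and `format_report` are made up for illustration, the list here has 12 slots (index 0 is January) instead of the 13 used above, and a few rows from the question are inlined so that it runs without the file.

```python
import csv
import io
from collections import defaultdict

MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]

def monthly_counts(lines):
    """Return {plate: [jan_count, ..., dec_count]} from CSV lines."""
    counts = defaultdict(lambda: [0] * 12)
    for row in csv.reader(lines):
        if len(row) != 6:                    # skip blank or malformed rows
            continue
        plate, when = row[0], row[3]
        month = int(when.split("-")[1])      # "2012-02-03 ..." -> 2
        counts[plate][month - 1] += 1
    return counts

def format_report(counts):
    """Render the counts as CSV text with a header row and a total column."""
    out = [",".join(["plate"] + MONTHS + ["total"])]
    for plate in sorted(counts):
        per_month = counts[plate]
        cells = [plate] + [str(n) for n in per_month] + [str(sum(per_month))]
        out.append(",".join(cells))
    return "\n".join(out)

# A few rows from the question, inlined so the sketch runs on its own.
sample = io.StringIO(
    "AAO9229,2012-01-10 07:19:29,32,2012-01-11 16:39:16,52,2\n"
    "AAP0678,2012-04-09 16:35:19,32,2012-04-10 19:46:55,52,2\n"
    "AAP0678,2012-04-30 16:44:28,32,2012-05-01 19:20:00,52,2\n"
)
print(format_report(monthly_counts(sample)))
```

To run it against the real data, pass an open file object for basefile.csv to `monthly_counts` instead of the `io.StringIO` sample.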

score 1 · Accepted Answer

Stage 1: Generate entries with the plate and the month

cut -d, -f1,4 basefile.csv |
sed 's/,2012-\([0-9][0-9]\)-[0-9][0-9] ..:..:..$/ \1/'

This assumes that the dates are all in 2012, and it also maps the comma separator to a space.

Sample output:

AAM7676 02
AAM7676 02
AAM7676 02
AAM7676 03
AAM7676 05
AAM7676 08
AAO9229 01
AAP0678 04
AAP0678 05
AAP0678 06
AAU6100 01
AAU6100 01
AAU6100 02
AAU6100 03

Stage 2: Generate the counts per month

... |
sort | uniq -c

Sample output:

3 AAM7676 02
1 AAM7676 03
1 AAM7676 05
1 AAM7676 08
1 AAO9229 01
1 AAP0678 04
1 AAP0678 05
1 AAP0678 06
2 AAU6100 01
1 AAU6100 02
1 AAU6100 03

Stage 3: Pivot

The data is now ordered by plate and month. At this point, I would use awk to create a control-break report:

cut -d, -f1,4 basefile.csv |
sed 's/,2012-\([0-9][0-9]\)-[0-9][0-9] ..:..:..$/ \1/' |
sort |
uniq -c |
awk '
    {   if ($2 != last_plate && last_plate != "")
        {
            printf "%s", last_plate
            for (i = 1; i <= 12; i++)
            {
                printf ",%d", count[i]
                count[i] = 0;
            }
            print ""
        }
        last_plate = $2
        count[$3+0] = $1
    }
    END {   if (last_plate != "")
            {
                printf "%s", last_plate
                for (i = 1; i <= 12; i++)
                    printf ",%d", count[i]
                print ""
            }
    }'

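The only "trick" there is the subscript count[$3+0]; it converts the string 01 into the plain number 1 for use as the subscript.

The output for the sample data is: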

AAM7676,0,3,1,0,1,0,0,1,0,0,0,0
AAO9229,1,0,0,0,0,0,0,0,0,0,0,0
AAP0678,0,0,0,1,1,1,0,0,0,0,0,0
AAU6100,2,1,1,0,0,0,0,0,0,0,0,0

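If you want column headings too, simply add a BEGIN block with an appropriate print statement to the awk script.

Could it all be done in awk? Probably... be my guest. The sorting is the only tricky bit. It could also all be done in Perl, Python, or another similar scripting language.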
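
As a rough illustration of that last remark, the three stages map onto a few lines of Python. This is only a sketch, not a port of the pipeline: `extract` and `pivot` are hypothetical helper names, a handful of rows from the question are inlined, and `collections.Counter` stands in for the `sort | uniq -c` step. Unlike the sed expression, it does not require the year to be 2012, so rows from different years would be merged into the same month.

```python
from collections import Counter

def extract(lines):
    """Stage 1: yield (plate, month) pairs from comma-separated lines."""
    for line in lines:
        fields = line.strip().split(",")
        if len(fields) < 4:                    # skip blank or short lines
            continue
        # Characters 5-6 of "YYYY-MM-DD hh:mm:ss" hold the month.
        yield fields[0], int(fields[3][5:7])

def pivot(pair_counts):
    """Stage 3: turn {(plate, month): n} into one 12-slot row per plate."""
    rows = {}
    for (plate, month), n in pair_counts.items():
        rows.setdefault(plate, [0] * 12)[month - 1] = n
    return rows

lines = [
    "AAU6100,2012-01-09 17:49:13,32,2012-01-11 02:00:33,52,3",
    "AAU6100,2012-01-20 21:18:16,32,2012-01-22 14:09:00,52,3",
    "AAU6100,2012-02-20 13:35:39,32,2012-02-21 19:45:55,52,2",
    "AAO9229,2012-01-10 07:19:29,32,2012-01-11 16:39:16,52,2",
]

pair_counts = Counter(extract(lines))   # Stage 2: count identical pairs
rows = pivot(pair_counts)
for plate in sorted(rows):              # sorting is only needed for output
    print(",".join([plate] + [str(n) for n in rows[plate]]))
```

Because the counts live in a dictionary, the input does not have to be sorted first; sorting only matters when the rows are printed.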

score 1 · Accepted Answer

Although 12 columns for the months are not entirely natural in an RDBMS, the following SQL with COUNT and GROUP BY does the trick:

drop table if exists toto;
create table toto(
    plate VARCHAR(32),
    date1 DATETIME,
    something1 INT(10),
    date2 DATETIME,
    something2 INT(10),
    something3 INT(10)
);

INSERT INTO toto VALUES('AAM7676','2012-02-02 11:55:52',32,'2012-02-03 19:55:30',62,1);
INSERT INTO toto VALUES('AAM7676','2012-02-11 13:56:11',32,'2012-02-12 21:00:18',52,2);
INSERT INTO toto VALUES('AAM7676','2012-02-21 16:30:55',32,'2012-02-23 13:29:41',62,1);
INSERT INTO toto VALUES('AAM7676','2012-03-07 20:03:32',32,'2012-03-09 13:31:35',62,1);
INSERT INTO toto VALUES('AAM7676','2012-05-28 06:08:05',32,'2012-05-29 15:49:55',52,2);
INSERT INTO toto VALUES('AAM7676','2012-08-22 12:47:28',32,'2012-08-24 08:03:09',52,1);
INSERT INTO toto VALUES('AAO9229','2012-01-10 07:19:29',32,'2012-01-11 16:39:16',52,2);
INSERT INTO toto VALUES('AAP0678','2012-04-09 16:35:19',32,'2012-04-10 19:46:55',52,2);
INSERT INTO toto VALUES('AAP0678','2012-04-30 16:44:28',32,'2012-05-01 19:20:00',52,2);
INSERT INTO toto VALUES('AAP0678','2012-06-01 19:31:34',32,'2012-06-03 10:34:33',52,3);
INSERT INTO toto VALUES('AAU6100','2012-01-09 17:49:13',32,'2012-01-11 02:00:33',52,3);
INSERT INTO toto VALUES('AAU6100','2012-01-20 21:18:16',32,'2012-01-22 14:09:00',52,3);
INSERT INTO toto VALUES('AAU6100','2012-02-20 13:35:39',32,'2012-02-21 19:45:55',52,2);
INSERT INTO toto VALUES('AAU6100','2012-03-13 09:50:51',32,'2012-03-14 22:35:51',52,3);


SELECT 
    t.plate,
    (SELECT COUNT(*) FROM toto tt WHERE tt.plate=t.plate AND EXTRACT(MONTH FROM tt.date2)=1),
    (SELECT COUNT(*) FROM toto tt WHERE tt.plate=t.plate AND EXTRACT(MONTH FROM tt.date2)=2),
    (SELECT COUNT(*) FROM toto tt WHERE tt.plate=t.plate AND EXTRACT(MONTH FROM tt.date2)=3),
    (SELECT COUNT(*) FROM toto tt WHERE tt.plate=t.plate AND EXTRACT(MONTH FROM tt.date2)=4),
    (SELECT COUNT(*) FROM toto tt WHERE tt.plate=t.plate AND EXTRACT(MONTH FROM tt.date2)=5),
    (SELECT COUNT(*) FROM toto tt WHERE tt.plate=t.plate AND EXTRACT(MONTH FROM tt.date2)=6),
    (SELECT COUNT(*) FROM toto tt WHERE tt.plate=t.plate AND EXTRACT(MONTH FROM tt.date2)=7),
    (SELECT COUNT(*) FROM toto tt WHERE tt.plate=t.plate AND EXTRACT(MONTH FROM tt.date2)=8),
    (SELECT COUNT(*) FROM toto tt WHERE tt.plate=t.plate AND EXTRACT(MONTH FROM tt.date2)=9),
    (SELECT COUNT(*) FROM toto tt WHERE tt.plate=t.plate AND EXTRACT(MONTH FROM tt.date2)=10),
    (SELECT COUNT(*) FROM toto tt WHERE tt.plate=t.plate AND EXTRACT(MONTH FROM tt.date2)=11),
    (SELECT COUNT(*) FROM toto tt WHERE tt.plate=t.plate AND EXTRACT(MONTH FROM tt.date2)=12),
    COUNT(*)
FROM toto t
GROUP BY plate;

Result:

AAM7676 0   3   1   0   1   0   0   1   0   0   0   0   6
AAO9229 1   0   0   0   0   0   0   0   0   0   0   0   1
AAP0678 0   0   0   1   1   1   0   0   0   0   0   0   3
AAU6100 2   1   1   0   0   0   0   0   0   0   0   0   4

score 1 · Accepted Answer

Try this query:

SELECT
  plate,
  COUNT(IF(MONTH(dt2) = 1, 1, NULL)) jan,
  COUNT(IF(MONTH(dt2) = 2, 1, NULL)) feb,
  COUNT(IF(MONTH(dt2) = 3, 1, NULL)) mar,
  COUNT(IF(MONTH(dt2) = 4, 1, NULL)) apr,
  COUNT(IF(MONTH(dt2) = 5, 1, NULL)) may,
  COUNT(*) total
FROM
  basefile_table
WHERE
  YEAR(dt2) = 2012
GROUP BY
  plate;

+---------+-----+-----+-----+-----+-----+-------+
| plate   | jan | feb | mar | apr | may | total |
+---------+-----+-----+-----+-----+-----+-------+
| AAM7676 |   0 |   3 |   1 |   0 |   1 |     6 |
| AAO9229 |   1 |   0 |   0 |   0 |   0 |     1 |
| AAP0678 |   0 |   0 |   0 |   1 |   1 |     3 |
| AAU6100 |   2 |   1 |   1 |   0 |   0 |     4 |
+---------+-----+-----+-----+-----+-----+-------+

Add the other months (June, July, ...) in the same way. Note that I have added a year filter to the query.
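
Writing out the remaining columns by hand is repetitive, so here is a hedged sketch that generates all twelve from a list of month names. It swaps in SQLite (through Python's built-in sqlite3 module) for MySQL so that it runs on its own, which means `SUM(CASE ...)` and `strftime` take the place of `COUNT(IF(...))`, `MONTH()`, and `YEAR()`. The table and column names are taken from the query above; storing `dt2` as text, the `build_query` helper, and the four sample rows are assumptions made for illustration.

```python
import sqlite3

MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]

def build_query(table="basefile_table", column="dt2", year=2012):
    """Build a 12-month conditional-count query in the SQLite dialect."""
    cols = ",\n  ".join(
        f"SUM(CASE WHEN strftime('%m', {column}) = '{i:02d}' "
        f"THEN 1 ELSE 0 END) AS \"{name}\""
        for i, name in enumerate(MONTHS, start=1)
    )
    return (
        f"SELECT plate,\n  {cols},\n  COUNT(*) AS total\n"
        f"FROM {table}\n"
        f"WHERE strftime('%Y', {column}) = '{year}'\n"
        f"GROUP BY plate ORDER BY plate"
    )

# In-memory database holding a few of the question's rows (plate and column 4).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE basefile_table (plate TEXT, dt2 TEXT)")
conn.executemany(
    "INSERT INTO basefile_table VALUES (?, ?)",
    [
        ("AAP0678", "2012-04-10 19:46:55"),
        ("AAP0678", "2012-05-01 19:20:00"),
        ("AAP0678", "2012-06-03 10:34:33"),
        ("AAO9229", "2012-01-11 16:39:16"),
    ],
)

rows = conn.execute(build_query()).fetchall()
for row in rows:
    print(",".join(str(value) for value in row))
```

The same list-driven approach could emit the MySQL form of each column instead; only the expression inside the generator would change.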
