database - How to sort the ngrams in Google's dataset (or the copy hosted on AWS) by frequency

Question

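I am looking for a way to order the Google Books Ngrams by frequency.

The original dataset is here: http://books.google.com/ngrams/datasets. Within each file, the ngrams are sorted alphabetically and then chronologically.

My computer is not powerful enough to handle 2.2 TB of data, so I think the only way to sort it is "in the cloud".

The AWS-hosted version is here: http://aws.amazon.com/datasets/8172056142375670.

Is there a cost-effective way to find the 10,000 most frequent 1-grams, 2-grams, 3-grams, 4-grams, and 5-grams?

Note that the dataset contains a separate row for each year of data: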

As an example, here are the 30,000,000th and 30,000,001st lines from file 0 
of the English 1-grams (googlebooks-eng-all-1gram-20090715-0.csv.zip):

circumvallate   1978   313    215   85 
circumvallate   1979   183    147   77

The first line tells us that in 1978, the word "circumvallate" (which means 
"surround with a rampart or other fortification", in case you were wondering) 
occurred 313 times overall, on 215 distinct pages and in 85 distinct books 
from our sample.
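
To make the row layout concrete, here is a minimal Python sketch of parsing one line. It assumes the columns are tab-separated in the raw files (the sample above is shown with spaces), and the names NgramRow and parse_row are made up for illustration, not part of any Google tooling.

```python
from typing import NamedTuple


class NgramRow(NamedTuple):
    """One line of an ngram file (field names are illustrative)."""
    ngram: str
    year: int
    match_count: int   # total occurrences in that year
    page_count: int    # distinct pages
    volume_count: int  # distinct books


def parse_row(line: str) -> NgramRow:
    """Split one tab-separated line into its five columns."""
    ngram, year, matches, pages, volumes = line.rstrip("\n").split("\t")
    return NgramRow(ngram, int(year), int(matches), int(pages), int(volumes))


row = parse_row("circumvallate\t1978\t313\t215\t85\n")
print(row.year, row.match_count)  # prints: 1978 313
```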

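Ideally, the frequency list would only include data from 1980 to the present (summed across the years).

Any help would be greatly appreciated!

Cheers,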

score 4 · Accepted Answer

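I would recommend using Pig!

Pig makes things like this very easy and straightforward. Here is a sample Pig script that does pretty much what you need: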

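-- Load the tab-separated ngram rows (the paths are placeholders).
raw = LOAD '/foo/input' USING PigStorage('\t') AS (ngram:chararray, year:int, count:int, pages:int, books:int);
-- Keep only the rows from 1980 onward.
filtered = FILTER raw BY year >= 1980;
-- Sum the yearly counts for each ngram.
grouped = GROUP filtered BY ngram;
counts = FOREACH grouped GENERATE group AS ngram, SUM(filtered.count) AS count;
-- Sort by total, descending, and keep the top 10,000.
sorted = ORDER counts BY count DESC;
limited = LIMIT sorted 10000;
STORE limited INTO '/foo/output' USING PigStorage('\t');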

Pig on AWS Elastic MapReduce can even operate directly on S3 data, so you would probably replace /foo/input and /foo/output with S3 buckets too.
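
Before paying for a cluster, it may help to check the filter, sum, and top-N logic on a small sample locally. The following is a rough Python sketch of the same steps, not a replacement for the Pig job: it keeps a running total for every distinct ngram in memory, so it is only meant for a small extract. The function name top_ngrams is made up for illustration, and all sample rows except the two "circumvallate" lines quoted in the question are invented.

```python
import heapq
from collections import Counter
from typing import Iterable, List, Tuple


def top_ngrams(lines: Iterable[str], min_year: int = 1980,
               k: int = 10000) -> List[Tuple[str, int]]:
    """Sum per-ngram counts for years >= min_year and return the k largest."""
    totals: Counter = Counter()
    for line in lines:
        ngram, year, count, _pages, _books = line.rstrip("\n").split("\t")
        if int(year) >= min_year:        # same filter as year >= 1980
            totals[ngram] += int(count)  # same as grouping and summing
    # Descending order plus limit, without fully sorting every ngram.
    return heapq.nlargest(k, totals.items(), key=lambda item: item[1])


sample = [
    "circumvallate\t1978\t313\t215\t85\n",  # from the question, filtered out
    "circumvallate\t1979\t183\t147\t77\n",  # from the question, filtered out
    "circumvallate\t1980\t12\t10\t9\n",     # invented row
    "the\t1980\t500\t400\t300\n",           # invented row
    "the\t1981\t600\t450\t320\n",           # invented row
]
print(top_ngrams(sample, k=2))  # prints: [('the', 1100), ('circumvallate', 12)]
```

Using heapq.nlargest here reflects that only the top 10,000 entries are needed, so a full sort of every ngram is avoidable.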
