java - Counting word frequencies in a large text file

Question

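I have a huge text file (larger than the available RAM). I need to count the frequency of every word and write the words with their frequency counts to a new file. The result should be sorted by frequency count in descending order.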

My approach:

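Sort the given file (external sort).
Count the frequency of each word in one sequential pass, storing the counts (together with the words) in another file.
Sort the output file by frequency count (external sort).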
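
For illustration, the counting pass in the second step could look roughly like the sketch below. It assumes the external sort has already produced one word per line, so equal words are adjacent and only the current word and its count need to be held in memory. The class name SortedRunCounter, the method name countRuns, and the tab-separated output format are hypothetical choices made for this example, not something stated in the question.

```java
import java.io.PrintWriter;
import java.util.Iterator;

class SortedRunCounter {
    // Assumption: sortedWords yields one word per element, already sorted,
    // so all occurrences of the same word arrive back to back.
    // Writes one "word<TAB>count" line per distinct word.
    static void countRuns(Iterator<String> sortedWords, PrintWriter out) {
        String current = null; // word of the run being counted
        long count = 0;        // length of that run so far
        while (sortedWords.hasNext()) {
            String word = sortedWords.next();
            if (word.equals(current)) {
                count++;
            } else {
                if (current != null) {
                    out.print(current + "\t" + count + "\n");
                }
                current = word;
                count = 1;
            }
        }
        // Emit the final run, if there was any input at all.
        if (current != null) {
            out.print(current + "\t" + count + "\n");
        }
    }
}
```

With a real file, the iterator could come from BufferedReader.lines().iterator(), so the sorted input is streamed rather than loaded. The resulting word/count file is what the third step would then sort by count.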

I would like to know whether there is a better way to do this. I have heard of disk-based hash tables and B+ trees, but have never tried them.

Note: I have seen similar questions on SO, but none of them had to deal with data larger than memory.

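Edit: Based on the comments, I agree that in practice a dictionary should fit in the memory of today's computers. But let's assume a dictionary of words that is too large to fit in memory.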

score 14 · Accepted Answer

I would go with a MapReduce approach:

Distribute your text file across nodes, assuming each node's portion of the text fits into RAM.
Calculate each word's frequency within the node (using hash tables).
Collect the results on a master node and combine them all.
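
As a rough single-process sketch of those steps, the code below counts each chunk with a hash table and then merges the partial counts. The names ChunkedWordCount, countChunk, and combine are hypothetical, words are assumed to be separated by whitespace, and the merged map is assumed to fit in the master's memory. A real cluster would use a framework for distributing the chunks, which is not shown here.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ChunkedWordCount {
    // "Map" step: count the words of one chunk that fits in RAM.
    // Assumption: words are separated by whitespace.
    static Map<String, Long> countChunk(String chunk) {
        Map<String, Long> counts = new HashMap<>();
        for (String word : chunk.split("\\s+")) {
            if (!word.isEmpty()) { // split can yield a leading empty token
                counts.merge(word, 1L, Long::sum);
            }
        }
        return counts;
    }

    // "Reduce" step: add the per-chunk counts together on the master.
    static Map<String, Long> combine(List<Map<String, Long>> partials) {
        Map<String, Long> total = new HashMap<>();
        for (Map<String, Long> partial : partials) {
            partial.forEach((word, n) -> total.merge(word, n, Long::sum));
        }
        return total;
    }
}
```

If the set of distinct words is itself too large for one machine, the partial results would have to be partitioned by word (for example by hash) across several reducers instead of being merged into a single map.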

score 4 · Accepted Answer

All unique words probably fit in memory so I'd use this approach:

Create a dictionary (HashMap<String, Integer>).
Read the huge text file line by line.
Add new words into the dictionary and set value to 1.
Add 1 to the value of existing words.

After you've parsed the entire huge file:

Sort the dictionary by frequency.
Write, to a new file, the sorted dictionary with words and frequency.

Remember, though, to convert the words to either lowercase or uppercase before counting, so that different capitalizations are treated as the same word.
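
A minimal sketch of this approach, assuming the unique words fit in memory and that words are separated by whitespace. The class name InMemoryWordCount, the Long counters, the alphabetical tie-break, and the tab-separated output are choices made for this example, and the input and output paths are taken from the command line.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.stream.Stream;

class InMemoryWordCount {
    // Counts words from a stream of lines; only the unique words stay in memory.
    static Map<String, Long> count(Stream<String> lines) {
        Map<String, Long> counts = new HashMap<>();
        lines.forEach(line -> {
            // Fold case so "The" and "the" are counted as one word.
            for (String word : line.toLowerCase(Locale.ROOT).split("\\s+")) {
                if (!word.isEmpty()) {
                    // New word: value becomes 1. Existing word: value grows by 1.
                    counts.merge(word, 1L, Long::sum);
                }
            }
        });
        return counts;
    }

    // Returns the entries by descending count, ties broken alphabetically.
    static List<Map.Entry<String, Long>> sortByFrequency(Map<String, Long> counts) {
        List<Map.Entry<String, Long>> entries = new ArrayList<>(counts.entrySet());
        entries.sort(Map.Entry.<String, Long>comparingByValue().reversed()
                .thenComparing(Map.Entry.<String, Long>comparingByKey()));
        return entries;
    }

    public static void main(String[] args) throws IOException {
        // args[0] = input file, args[1] = output file
        try (Stream<String> lines = Files.lines(Paths.get(args[0]));
             PrintWriter out = new PrintWriter(Files.newBufferedWriter(Paths.get(args[1])))) {
            for (Map.Entry<String, Long> e : sortByFrequency(count(lines))) {
                out.println(e.getKey() + "\t" + e.getValue());
            }
        }
    }
}
```

Files.lines reads the file lazily, so the file itself is never held in memory. Only the map of distinct words is.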

score 3 · Accepted Answer

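The best way to achieve this would be to read the file line by line and store the words in a Multimap (e.g. Guava). If this map exceeds your memory, you could try using a key-value store (e.g. Berkeley JE DB or MapDB). These key-value stores work similarly to a map, but they store their values on the HDD. I used MapDB for a similar problem and it was very fast.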

score 1 · Accepted Answer

If the list of unique words and their frequencies fits in memory (not the file, just the unique words), you can use a hash table and read the file sequentially (without storing it).

You can then sort the entries of the hash table by the number of occurrences.
