
I have a file that represents a table recorded in .csv or a similar format. The table may include missing values. I am looking for a solution (preferably in Java) that processes the file incrementally, without loading everything into memory, as the file can be huge. I need to identify duplicate records in the file, with the ability to specify which columns to exclude from the comparison, and then produce output that groups those duplicate records together. I would append an additional value with a group number to the end of each record and write the output in the same format (.csv), sorted by group number.

I hope an effective solution can be found with some hashing function: for example, reading all lines and storing a hash value with each line number, where the hash is calculated from the set of columns I provide as input.
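A minimal sketch of that idea in Java, assuming a simple comma-separated format with no quoted or embedded commas (a real CSV parser would be needed otherwise). It makes two passes: the first keeps only the per-record keys (built from the non-excluded columns) in memory and assigns group numbers, and the second re-reads the file to append each record's group number. The class and method names here are illustrative, not from any existing library:

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;

public class DuplicateGrouper {

    /** Builds the comparison key from the columns that are not excluded. */
    static String keyOf(String line, Set<Integer> excludedCols) {
        String[] cols = line.split(",", -1);          // -1 keeps trailing empty (missing) values
        StringBuilder key = new StringBuilder();
        for (int i = 0; i < cols.length; i++) {
            if (!excludedCols.contains(i)) key.append(cols[i]).append('\u0001');
        }
        return key.toString();
    }

    public static void groupDuplicates(Path in, Path out, Set<Integer> excludedCols)
            throws IOException {
        Map<String, Integer> keyToGroup = new HashMap<>();

        // Pass 1: assign a group number to each distinct key.
        // Only the keys live in memory here, not the full lines.
        try (BufferedReader r = Files.newBufferedReader(in)) {
            for (String line; (line = r.readLine()) != null; ) {
                keyToGroup.computeIfAbsent(keyOf(line, excludedCols),
                        k -> keyToGroup.size() + 1);
            }
        }

        // Pass 2: append the group number to each line.
        List<String> outLines = new ArrayList<>();
        try (BufferedReader r = Files.newBufferedReader(in)) {
            for (String line; (line = r.readLine()) != null; ) {
                outLines.add(line + "," + keyToGroup.get(keyOf(line, excludedCols)));
            }
        }

        // Sort by the appended group number. For a file too large for memory,
        // this final step would have to be an external sort (e.g. the Unix
        // sort utility on the last field) instead of an in-memory one.
        outLines.sort(Comparator.comparingInt(
                l -> Integer.parseInt(l.substring(l.lastIndexOf(',') + 1))));
        Files.write(out, outLines);
    }
}
```

To trade memory for correctness risk, the map keys could be replaced by a hash of the key string, as suggested above, at the cost of possible collisions merging distinct records into one group.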

Any ideas?


1 Answer


OK, here is the paper that holds the key to the answer: P. Gopalan & J. Radhakrishnan, "Finding duplicates in a data stream".

answered 2012-09-12T15:44:18.920