
I have a file that represents a table recorded in .csv or a similar format. The table may include missing values. I am looking for a solution (preferably in Java) that processes the file incrementally, without loading everything into memory, as the file can be huge. I need to identify duplicate records in the file, with the ability to specify which columns to exclude from the comparison, and then produce output that groups those duplicate records together. I would append an additional value with a group number to the end of each record and write the output in the same format (.csv), sorted by group number.

I hope an effective solution can be found with some hashing function: for example, reading all lines and storing a hash value with each line number, where the hash is calculated from the set of columns I provide as input.
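A minimal sketch of that idea in Java, assuming a simple comma-separated format with no quoted or embedded commas (a real CSV parser would be needed otherwise). It makes two passes: the first keeps only the per-record keys (built from the non-excluded columns) in memory and assigns group numbers, and the second re-reads the file to append each record's group number. The class and method names here are illustrative, not from any existing library:

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;

public class DuplicateGrouper {

    /** Builds the comparison key from the columns that are not excluded. */
    static String keyOf(String line, Set<Integer> excludedCols) {
        String[] cols = line.split(",", -1);          // -1 keeps trailing empty (missing) values
        StringBuilder key = new StringBuilder();
        for (int i = 0; i < cols.length; i++) {
            if (!excludedCols.contains(i)) key.append(cols[i]).append('\u0001');
        }
        return key.toString();
    }

    public static void groupDuplicates(Path in, Path out, Set<Integer> excludedCols)
            throws IOException {
        Map<String, Integer> keyToGroup = new HashMap<>();

        // Pass 1: assign a group number to each distinct key.
        // Only the keys live in memory here, not the full lines.
        try (BufferedReader r = Files.newBufferedReader(in)) {
            for (String line; (line = r.readLine()) != null; ) {
                keyToGroup.computeIfAbsent(keyOf(line, excludedCols),
                        k -> keyToGroup.size() + 1);
            }
        }

        // Pass 2: append the group number to each line.
        List<String> outLines = new ArrayList<>();
        try (BufferedReader r = Files.newBufferedReader(in)) {
            for (String line; (line = r.readLine()) != null; ) {
                outLines.add(line + "," + keyToGroup.get(keyOf(line, excludedCols)));
            }
        }

        // Sort by the appended group number. For a file too large for memory,
        // this final step would have to be an external sort (e.g. the Unix
        // sort utility on the last field) instead of an in-memory one.
        outLines.sort(Comparator.comparingInt(
                l -> Integer.parseInt(l.substring(l.lastIndexOf(',') + 1))));
        Files.write(out, outLines);
    }
}
```

To trade memory for correctness risk, the map keys could be replaced by a hash of the key string, as suggested above, at the cost of possible collisions merging distinct records into one group.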

Any ideas?


1 Answer


OK, here is the paper that holds the key to the answer: P. Gopalan & J. Radhakrishnan, "Finding duplicates in a data stream".

answered 2012-09-12T15:44:18.920