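I have multiple projects, each containing a varying number of CSV files, and I use SuperCSV's CsvBeanReader to do the mapping and cell validation. I have created a bean for each CSV file and overridden equals, hashCode and toString for each bean.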

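I'm looking for suggestions on the best "across all projects" implementation approach for identifying duplicate CSV rows: report (not remove) the row number and row content of the original CSV row, along with the row numbers and row content of every duplicate row found. Some files can reach hundreds of thousands of rows and exceed a GB in size, so I'd like to minimize the number of reads of each file, and I thought this could be done while CsvBeanReader has the file open.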



1 Answer


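Given the size of your files, and the fact that you want the row content of both the original and the duplicates, I think the best you can do is make 2 passes over the file.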

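If you only wanted the row content of the latest duplicate, you could get away with 1 pass. Keeping track of the original row content plus all of the duplicates in 1 pass would mean storing the content of every row, and you would probably run out of memory.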
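To illustrate the 1-pass trade-off, here is a minimal sketch that keeps only the first row number per hash and reports each later repeat as it is read. To stay free of library dependencies it hashes the raw line from a BufferedReader rather than a bean, so it assumes no quoted fields span multiple lines; the class and method names are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Super CSV.
final class OnePassDuplicates {

    /**
     * Reads the CSV once. For each hash only the row number of the first
     * occurrence is stored (never the row content), so only the content of
     * the repeated row is available when a duplicate is reported.
     */
    static List<String> reportLatestDuplicates(Reader reader) throws IOException {
        final Map<Integer, Integer> firstRowByHash = new HashMap<Integer, Integer>();
        final List<String> report = new ArrayList<String>();
        try (BufferedReader in = new BufferedReader(reader)) {
            in.readLine(); // skip the header, which is row 1
            int rowNumber = 1;
            String line;
            while ((line = in.readLine()) != null) {
                rowNumber++;
                // putIfAbsent returns the earlier row number if the hash was seen before
                final Integer first = firstRowByHash.putIfAbsent(line.hashCode(), rowNumber);
                if (first != null) {
                    report.add(String.format(
                        "row %d duplicates row %d, line content: %s",
                        rowNumber, first, line));
                }
            }
        }
        return report;
    }

    public static void main(String[] args) throws IOException {
        final String csv = "a,b,c\n1,two,x\n2,two,x\n1,two,x\n";
        for (String message : reportLatestDuplicates(new StringReader(csv))) {
            System.out.println(message); // prints: row 4 duplicates row 2, line content: 1,two,x
        }
    }
}
```

The content of the first occurrence is never printed, which is exactly why reporting the original row as well needs either a second pass or every row held in memory.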

My solution assumes that two beans with the same hashCode() are duplicates. If you have to use equals(), it gets more complicated.
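One possible way to bring equals() into play, sketched here as an assumption rather than as part of the solution: collect only the rows whose hash occurs more than once, then group those candidates by the bean itself, so that a HashMap applies hashCode() and equals() together. The class and method names are hypothetical, and the strings "Aa" and "BB" merely stand in for two unequal beans that share a hash code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Super CSV.
final class EqualsConfirmation {

    /**
     * Takes candidate rows (row number -> bean) and returns, for every bean
     * that occurs at least twice according to equals(), the row numbers it
     * occurs on. Candidates that merely share a hash code are dropped.
     */
    static <T> Map<T, List<Integer>> confirmWithEquals(Map<Integer, T> candidatesByRow) {
        final Map<T, List<Integer>> rowsByBean = new LinkedHashMap<T, List<Integer>>();
        for (Map.Entry<Integer, T> entry : candidatesByRow.entrySet()) {
            rowsByBean
                .computeIfAbsent(entry.getValue(), bean -> new ArrayList<Integer>())
                .add(entry.getKey());
        }
        // a bean seen on only one row was a hash collision, not a duplicate
        rowsByBean.values().removeIf(rows -> rows.size() < 2);
        return rowsByBean;
    }

    public static void main(String[] args) {
        // "Aa" and "BB" have the same String hashCode (2112) but are not equal
        final Map<Integer, String> candidates = new TreeMap<Integer, String>();
        candidates.put(2, "Aa");
        candidates.put(3, "BB");
        candidates.put(4, "Aa");
        System.out.println(confirmWithEquals(candidates)); // prints: {Aa=[2, 4]}
    }
}
```

The cost is that the candidate beans have to be held in memory, so this only stays cheap when duplicates and collisions are a small fraction of the file.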

  • Pass 1: Identify the duplicates (record the row numbers for each duplicate hash)

  • Pass 2: Report on the duplicates

Pass 1: Identify the duplicates

import java.io.IOException;
import java.io.Reader;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

import org.supercsv.cellprocessor.ift.CellProcessor;
import org.supercsv.io.CsvBeanReader;
import org.supercsv.io.ICsvBeanReader;
import org.supercsv.prefs.CsvPreference;

/**
 * Finds the row numbers with duplicate records (using the bean's hashCode()
 * method). The key of the returned map is the hashCode and the value is the
 * Set of duplicate row numbers for that hashcode.
 *
 * @param reader
 *            the reader
 * @param preference
 *            the preferences
 * @param beanClass
 *            the bean class
 * @param processors
 *            the cell processors
 * @return the map of duplicate rows (by hashcode)
 * @throws IOException
 */
private static Map<Integer, Set<Integer>> findDuplicates(
    final Reader reader, final CsvPreference preference,
    final Class<?> beanClass, final CellProcessor[] processors)
    throws IOException {

  ICsvBeanReader beanReader = null;
  try {
    beanReader = new CsvBeanReader(reader, preference);

    final String[] header = beanReader.getHeader(true);

    // the hashes of any duplicates
    final Set<Integer> duplicateHashes = new HashSet<Integer>();

    // the row numbers for each hash
    final Map<Integer, Set<Integer>> rowNumbersByHash = 
      new HashMap<Integer, Set<Integer>>();

    Object o;
    while ((o = beanReader.read(beanClass, header, processors)) != null) {
      final Integer hashCode = o.hashCode();

      // get the row no's for the hash (create if required)
      Set<Integer> rowNumbers = rowNumbersByHash.get(hashCode);
      if (rowNumbers == null) {
        rowNumbers = new HashSet<Integer>();
        rowNumbersByHash.put(hashCode, rowNumbers);
      }

      // add the current row number to its hash
      final Integer rowNumber = beanReader.getRowNumber();
      rowNumbers.add(rowNumber);

      // a second row with the same hash marks the hash as a duplicate
      if (rowNumbers.size() == 2) {
        duplicateHashes.add(hashCode);
      }
    }

    // create a new map with just the duplicates
    final Map<Integer, Set<Integer>> duplicateRowNumbersByHash = 
      new HashMap<Integer, Set<Integer>>();
    for (Integer duplicateHash : duplicateHashes) {
      duplicateRowNumbersByHash.put(duplicateHash,
        rowNumbersByHash.get(duplicateHash));
    }

    return duplicateRowNumbersByHash;

  } finally {
    if (beanReader != null) {
      beanReader.close();
    }
  }
}
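As an alternative to this method, you could use a CsvListReader and make use of getUntokenizedRow().hashCode(). This computes the hash from the raw CSV String. It would be a lot faster, but your data may contain subtle differences that mean it won't work.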
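The "subtle differences" caveat can be illustrated with a minimal sketch of pass 1 that hashes raw lines. It uses a plain BufferedReader instead of CsvListReader so that it runs without the library (an assumption that no quoted field spans several lines), and the class name is hypothetical. The rows `1,two` and `1,"two"` would parse to the same values, yet their raw text, and therefore their hashes, differ.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of Super CSV.
final class RawLineDuplicates {

    /**
     * Pass 1 on raw text: maps the hash of every line that occurs more than
     * once to the row numbers it occurs on (the header counts as row 1).
     */
    static Map<Integer, Set<Integer>> findDuplicates(Reader reader) throws IOException {
        final Map<Integer, Set<Integer>> rowNumbersByHash =
            new HashMap<Integer, Set<Integer>>();
        try (BufferedReader in = new BufferedReader(reader)) {
            in.readLine(); // skip the header
            int rowNumber = 1;
            String line;
            while ((line = in.readLine()) != null) {
                rowNumber++;
                rowNumbersByHash
                    .computeIfAbsent(line.hashCode(), hash -> new TreeSet<Integer>())
                    .add(rowNumber);
            }
        }
        // keep only the hashes seen on two or more rows
        rowNumbersByHash.values().removeIf(rows -> rows.size() < 2);
        return rowNumbersByHash;
    }

    public static void main(String[] args) throws IOException {
        // row 3 holds the same values as rows 2 and 4, but quoted
        final String csv = "a,b\n1,two\n1,\"two\"\n1,two\n";
        System.out.println(findDuplicates(new StringReader(csv)).values()); // prints: [[2, 4]]
    }
}
```

Row 3 goes unreported here, whereas a bean-based hash computed after parsing would treat all three rows as duplicates.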

Pass 2: Report on the duplicates


  /**
   * Reports the details of duplicate records.
   *
   * @param reader
   *            the reader
   * @param preference
   *            the preferences
   * @param beanClass
   *            the bean class
   * @param processors
   *            the cell processors
   * @param duplicateRowNumbersByHash
   *            the row numbers of duplicate records
   * @throws IOException
   */
  private static void reportDuplicates(final Reader reader,
      final CsvPreference preference, final Class<?> beanClass,
      final CellProcessor[] processors,
      final Map<Integer, Set<Integer>> duplicateRowNumbersByHash)
      throws IOException {

    ICsvBeanReader beanReader = null;
    try {
      beanReader = new CsvBeanReader(reader, preference);

      final String[] header = beanReader.getHeader(true);

      Object o;
      while ((o = beanReader.read(beanClass, header, processors)) != null) {
        // look up the rows sharing this bean's hash (null if it is unique)
        final Set<Integer> duplicateRowNumbers = 
          duplicateRowNumbersByHash.get(o.hashCode());
        if (duplicateRowNumbers != null) {
          System.out.println(String.format(
            "row %d is a duplicate of rows %s, line content: %s",
            beanReader.getRowNumber(),
            duplicateRowNumbers,
            beanReader.getUntokenizedRow()));
        }
      }

    } finally {
      if (beanReader != null) {
        beanReader.close();
      }
    }
  }


  // additional imports: java.io.StringReader,
  // org.supercsv.cellprocessor.ParseDate, org.supercsv.cellprocessor.ParseInt,
  // org.supercsv.cellprocessor.constraint.NotNull

  // rows (2,4,8) and (3,7) are duplicates
  private static final String CSV = "a,b,c\n" + "1,two,01/02/2013\n"
      + "2,two,01/02/2013\n" + "1,two,01/02/2013\n"
      + "3,three,01/02/2013\n" + "4,four,01/02/2013\n"
      + "2,two,01/02/2013\n" + "1,two,01/02/2013\n";

  private static final CellProcessor[] PROCESSORS = { new ParseInt(),
      new NotNull(), new ParseDate("dd/MM/yyyy") };

  public static void main(String[] args) throws IOException {

    // pass 1: find the row numbers of the duplicates
    final Map<Integer, Set<Integer>> duplicateRowNumbersByHash = findDuplicates(
        new StringReader(CSV), CsvPreference.STANDARD_PREFERENCE,
        Bean.class, PROCESSORS);

    // pass 2: re-read the same data and report each duplicate row
    reportDuplicates(new StringReader(CSV),
        CsvPreference.STANDARD_PREFERENCE, Bean.class, PROCESSORS,
        duplicateRowNumbersByHash);
  }

Output:

row 2 is a duplicate of rows [2, 4, 8], line content: 1,two,01/02/2013
row 3 is a duplicate of rows [3, 7], line content: 2,two,01/02/2013
row 4 is a duplicate of rows [2, 4, 8], line content: 1,two,01/02/2013
row 7 is a duplicate of rows [3, 7], line content: 2,two,01/02/2013
row 8 is a duplicate of rows [2, 4, 8], line content: 1,two,01/02/2013
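
The Bean class passed to the example as Bean.class is not shown. A minimal sketch of what it might look like follows, assuming fields named after the header columns a, b and c with types matching the cell processors (Integer, String, Date). The field types and the toString format are assumptions, and in a real project this would normally be a public class in its own file.

```java
import java.util.Date;
import java.util.Objects;

// Hypothetical bean for the header "a,b,c"; bean readers typically need a
// no-argument constructor and a setter per mapped column.
class Bean {

    private Integer a;
    private String b;
    private Date c;

    public Integer getA() { return a; }
    public void setA(Integer a) { this.a = a; }

    public String getB() { return b; }
    public void setB(String b) { this.b = b; }

    public Date getC() { return c; }
    public void setC(Date c) { this.c = c; }

    // Two beans are equal when all three column values are equal.
    @Override
    public boolean equals(Object obj) {
        if (this == obj) {
            return true;
        }
        if (!(obj instanceof Bean)) {
            return false;
        }
        final Bean other = (Bean) obj;
        return Objects.equals(a, other.a)
            && Objects.equals(b, other.b)
            && Objects.equals(c, other.c);
    }

    // Consistent with equals(): equal beans always produce the same hash,
    // which is what the duplicate detection relies on.
    @Override
    public int hashCode() {
        return Objects.hash(a, b, c);
    }

    @Override
    public String toString() {
        return "Bean [a=" + a + ", b=" + b + ", c=" + c + "]";
    }
}
```

Because hashCode() is derived from every column, two rows hash alike only when all of their parsed values match, barring the occasional collision.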
answered 2013-03-04T01:02:41.353