java - Alternate lines are skipped when reading a .tsv file

Question

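I have a .tsv file with 39 columns. One column (the last but one) holds a string that is more than 100,000 characters long. Line 1 of the file contains the headers and the data follows on the remaining lines.

When I read the file, the reader goes from line 1 to line 3, then to line 5, then to line 7, even though every line holds the same data. This is the log I get: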

lineNo=3, rowNo=2, customer=503837-100 , last but one cell length=111275
lineNo=5, rowNo=3, customer=503837-100 , last but one cell length=111275
lineNo=7, rowNo=4, customer=503837-100 , last but one cell length=111275
lineNo=9, rowNo=5, customer=503837-100 , last but one cell length=111275
lineNo=11, rowNo=6, customer=503837-100 , last but one cell length=111275
lineNo=13, rowNo=7, customer=503837-100 , last but one cell length=111275
lineNo=15, rowNo=8, customer=503837-100 , last but one cell length=111275
lineNo=17, rowNo=9, customer=503837-100 , last but one cell length=111275
lineNo=19, rowNo=10, customer=503837-100 , last but one cell length=111275

Here is my code:

import java.io.FileReader;
import org.supercsv.cellprocessor.Optional;
import org.supercsv.cellprocessor.constraint.NotNull;
import org.supercsv.cellprocessor.ift.CellProcessor;
import org.supercsv.io.CsvBeanReader;
import org.supercsv.io.ICsvBeanReader;
import org.supercsv.prefs.CsvPreference;

public class readWithCsvBeanReader {
    public static void main(String[] args) throws Exception{
        readWithCsvBeanReader();
    }


private static void readWithCsvBeanReader() throws Exception {

    ICsvBeanReader beanReader = null;

    try {

        // backslashes in a Java string literal must be escaped
        beanReader = new CsvBeanReader(new FileReader("C:\\MAP TSV\\abc.tsv"), CsvPreference.TAB_PREFERENCE);
        // the header elements are used to map the values to the bean (names must match)
        final String[] header = beanReader.getHeader(true);
        final CellProcessor[] processors = getProcessors();
        TSVReaderBrandDTO tsvReaderBrandDTO = new TSVReaderBrandDTO();

        int i = 0;
        int last = 0;

        while( (tsvReaderBrandDTO = beanReader.read(TSVReaderBrandDTO.class, header, processors)) != null ) {
            if(null == tsvReaderBrandDTO.getPage_cache()){
                last = 0;
            }
            else{
                last = tsvReaderBrandDTO.getPage_cache().length();
            }
            System.out.println(String.format("lineNo=%s, rowNo=%s, customer=%s , last but one cell length=%s", beanReader.getLineNumber(),
                beanReader.getRowNumber(), tsvReaderBrandDTO.getUnique_ID(), last));
            i++;
        }

        System.out.println("Number of rows : "+i);

    }
    finally {
        if( beanReader != null ) {
            beanReader.close();
        }
    }
}

private static CellProcessor[] getProcessors() {

    final CellProcessor[] processors = new CellProcessor[] { 
         new Optional(), new NotNull(), new NotNull(), new NotNull(), new NotNull(),
         new NotNull(), new NotNull(), new NotNull(), new NotNull(), new NotNull(),
         new NotNull(), new NotNull(), new NotNull(), new NotNull(), new NotNull(),
         new NotNull(), new NotNull(), new NotNull(), new NotNull(), new NotNull(),
         new NotNull(), new NotNull(), new NotNull(), new NotNull(), new NotNull(),
         new NotNull(), new NotNull(), new NotNull(), new NotNull(), new NotNull(),
         new NotNull(), new NotNull(), new NotNull(), new NotNull(), new NotNull(),
         new NotNull(), new NotNull(), new NotNull(), new Optional()};

        return processors;
    }
}

Please let me know where I am going wrong.

score 1 · Accepted Answer

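If you use a CSV parser to parse TSV input, you are going to have a bad time. Use a proper TSV parser. uniVocity-parsers comes with a TSV parser/writer. You can also use annotated Java beans to parse the file directly into instances of a class.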

Example:

This code parses the TSV into rows:

TsvParserSettings settings = new TsvParserSettings();

// creates a TSV parser
TsvParser parser = new TsvParser(settings);

// parses all rows in one go.
List<String[]> allRows = parser.parseAll(new FileReader(yourFile));
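
For comparison, here is a rough sketch in plain Java of what "no quote handling" means for tab-separated input. It is not how uniVocity-parsers is implemented, and it ignores TSV escape sequences. It only shows that splitting each physical line on tabs keeps one record per line even when a cell contains a stray double quote. The class name `PlainTsvSketch`, the method `readAll`, and the sample data are made up for illustration; the header names are guessed from the getters in the question.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class PlainTsvSketch {

    // Reads one record per physical line and splits on tabs only.
    // Quote characters get no special treatment, so a stray " cannot
    // make a record swallow the following line.
    public static List<String[]> readAll(Reader source) {
        List<String[]> rows = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // limit -1 keeps trailing empty cells
                rows.add(line.split("\t", -1));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Hypothetical sample: header, a row with an unbalanced quote, a row with an empty last cell.
        String sample = "Unique_ID\tPage_cache\n"
                + "503837-100\t<a href=\"x>unbalanced quote\n"
                + "503837-101\t\n";
        List<String[]> rows = readAll(new StringReader(sample));
        System.out.println("rows=" + rows.size());                      // rows=3
        System.out.println("cells in last row=" + rows.get(2).length);  // cells in last row=2
    }
}
```

For a real file, a `FileReader` or `Files.newBufferedReader` could be passed in place of the `StringReader`.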

Parsing into Java beans with a BeanListProcessor:

BeanListProcessor<TestBean> rowProcessor = new BeanListProcessor<TestBean>(TestBean.class);

TsvParserSettings parserSettings = new TsvParserSettings();
parserSettings.setRowProcessor(rowProcessor);

TsvParser parser = new TsvParser(parserSettings);
parser.parse(new FileReader(yourFile));

// The BeanListProcessor provides a list of objects extracted from the input.
List<TestBean> beans = rowProcessor.getBeans();

This is what the TestBean class looks like:

class TestBean {

    // if the value parsed in the quantity column is "?" or "-", it will be replaced by null.
    @NullString(nulls = { "?", "-" })
    // if a value resolves to null, it will be converted to the String "0".
    @Parsed(defaultNullRead = "0")
    private Integer quantity;

    @Trim
    @LowerCase
    @Parsed(index = 4)
    private String comments;

    // you can also explicitly give the name of a column in the file.
    @Parsed(field = "amount")
    private BigDecimal amount;

    @Trim
    @LowerCase
    // values "no", "n" and "null" will be converted to false; values "yes" and "y" will be converted to true
    @BooleanString(falseStrings = { "no", "n", "null" }, trueStrings = { "yes", "y" })
    @Parsed
    private Boolean pending;
}

Disclosure: I am the author of this library. It is open source and free (Apache V2.0 license).

score 0 · Accepted Answer

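I checked http://supercsv.sourceforge.net/examples_reading.html. Look carefully at the example CSV file and the output. Could it be that your rows contain unescaped " (double quote) characters, so the parser thinks a data record spans two physical lines?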
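
To make that hypothesis concrete, here is a simplified model of how a quote-aware reader groups physical lines into records. It is not Super CSV's actual tokenizer, only a sketch of the toggling logic, and the names `QuoteSpanDemo` and `toRecords` as well as the sample lines are invented for illustration. If every line carries exactly one stray quote, lines get merged in pairs, which would match a log where lineNo advances by two for each row.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class QuoteSpanDemo {

    // Groups physical lines into logical records: each quote character toggles
    // "inside quotes", and a record stays open across line breaks while inside.
    public static List<String> toRecords(List<String> physicalLines, char quoteChar) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (String line : physicalLines) {
            for (int i = 0; i < line.length(); i++) {
                if (line.charAt(i) == quoteChar) {
                    inQuotes = !inQuotes;
                }
            }
            current.append(line);
            if (inQuotes) {
                current.append('\n'); // record continues on the next physical line
            } else {
                records.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            records.add(current.toString()); // input ended inside an open quote
        }
        return records;
    }

    public static void main(String[] args) {
        // Hypothetical data: every line contains exactly one unescaped double quote.
        List<String> lines = Arrays.asList(
                "503837-100\t5\" banner",
                "503837-100\t5\" banner",
                "503837-100\t5\" banner",
                "503837-100\t5\" banner");
        System.out.println(toRecords(lines, '"').size());      // 2
        // With a quote character that never occurs, each line is its own record.
        System.out.println(toRecords(lines, '\u0000').size()); // 4
    }
}
```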

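If you do not use the double quote character as a quote character, you can change the CsvPreference (see http://supercsv.sourceforge.net/apidocs/org/supercsv/prefs/CsvPreference.html) so that the double quote is not treated as a quote character: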

CsvPreference MY_PREFERENCES = new CsvPreference.Builder(
    SOME_NEVER_USED_CHARACTER, ',', "\r\n").build();

Of course, for tab-delimited CSV, use the following:

CsvPreference MY_PREFERENCES = new CsvPreference.Builder(
    SOME_NEVER_USED_CHARACTER, '\t', "\r\n").build();

See the CsvPreference javadoc for the signature of the Builder and adjust the actual values accordingly.
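
Before changing the preferences, it may be worth checking whether the file really contains unbalanced quotes. The following is a small diagnostic sketch using only the JDK. It lists the physical lines that hold an odd number of quote characters, which are the lines a quote-aware reader would be likely to join with the next one. `OddQuoteFinder` and `linesWithOddQuotes` are hypothetical names, and the sample input is made up.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class OddQuoteFinder {

    // Returns the 1-based numbers of physical lines that contain an odd
    // number of quoteChar occurrences.
    public static List<Integer> linesWithOddQuotes(Reader source, char quoteChar) {
        List<Integer> suspects = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            int lineNo = 0;
            while ((line = reader.readLine()) != null) {
                lineNo++;
                long quotes = line.chars().filter(c -> c == quoteChar).count();
                if (quotes % 2 != 0) {
                    suspects.add(lineNo);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return suspects;
    }

    public static void main(String[] args) {
        // Line 1 has no quotes, line 2 has a balanced pair, line 3 has a single stray quote.
        String sample = "id\tcache\n"
                + "1\tsay \"hi\"\n"
                + "2\t5\" pipe\n";
        System.out.println(linesWithOddQuotes(new StringReader(sample), '"')); // [3]
    }
}
```

If this reports many lines for the real file, stray quotes are a plausible cause of the skipped lines.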
