
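I have been trying to parse a CSV file in Groovy, currently using the library org.apache.commons.csv 2.4. My requirement: when CSV cells contain invalid data values, such as invalid characters, I do not want an exception thrown at the first invalid row/cell. Instead, I want to collect these exceptions and keep iterating through the CSV file to the end, so that I end up with a complete list of the invalid data in the file.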

To that end I have tried several ways of using this Apache library, but unfortunately the iterator aborts as soon as it advances with CSVParser.getNextRecord().

The code looks like this:

    def records = new CSVParser(reader, CSVFormat.EXCEL.withHeader().withIgnoreSurroundingSpaces())

    // The iterator() inside CSVParser always uses getNextRecord() for its next()
    // implementation, and that call may throw an exception on an invalid character.
    records.each { record ->
        // If the exception is thrown from .each itself, the try/catch below is useless.
        try {

        } catch (e) {
            // want to collect errors here
        }
    }

So, is there anything else in this library I should dig into? Or can someone point me to another, more workable solution? Thanks a lot, everyone!

Update: sample CSV

"Company code for WBS element","WBS Element","PS: Short description (1st text line)","Responsible Cost Center for WBS Element","OBJNR","WBS Status"

"1001","RE-01768-011","Opex - To present a paper on Career con","0000016400","PR00031497","X"
"1001","RE-01768-011","Opex - To present a paper on "Career con","0000016400","PR00031497","X"

The second data row contains an invalid character ("), which causes the parser to throw an exception.


2 Answers


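The problem you are running into is that one of the characters inside a cell is the quote character, which the parser treats specially according to the selected format: CSVFormat.EXCEL.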

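The quote character is

the character used to encapsulate values containing special characters

So in your example the quote is misused, and the parser complains about it.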
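
To illustrate the rule, here is a minimal sketch, assuming Commons CSV 1.2 (the version used further down in this answer). The shortened sample line is made up for illustration. Inside a quoted value, a literal quote is normally written as two quotes, and the parser then accepts the line:

```groovy
@Grab(group='org.apache.commons', module='commons-csv', version='1.2')
import org.apache.commons.csv.CSVFormat
import org.apache.commons.csv.CSVParser

// Hypothetical, shortened line: the inner quote is escaped by doubling it ("").
def line = '"1001","Opex - To present a paper on ""Career con","X"'

def record = CSVParser.parse(line, CSVFormat.EXCEL).getRecords().first()

// The doubled quote is read back as a single literal quote.
println record.get(1)
```

This only helps if you control how the file is produced; a file that already contains unescaped quotes still fails under CSVFormat.EXCEL.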

You can use a different CSVFormat, for example one without a quote character:

    @Grapes(
        @Grab(group='org.apache.commons', module='commons-csv', version='1.2')
    )

    import org.apache.commons.csv.*

    def text = '''"Company code for WBS element","WBS Element","PS: Short description (1st text line)","Responsible Cost Center for WBS Element","OBJNR","WBS Status"

    "1001","RE-01768-011","Opex - To present a paper on Career con","0000016400","PR00031497","X"
    "1002","RE-01768-011","Opex - To present a paper on "Career con","0000016400","PR00031497","X"
    "1003","RE-01768-011","Opex - To present a paper on Career con","0000016400","PR00031497","X"'''

    // withQuote(null) switches off quote handling, so a stray " no longer aborts parsing
    def parsed = CSVParser.parse(text, CSVFormat.EXCEL.withHeader().withIgnoreSurroundingSpaces().withQuote(null))

    // Print the values of every record, including the malformed one
    parsed.getRecords().each {
        println it.toMap().values()
    }

The above produces:

    []
    ["0000016400", "1001", "RE-01768-011", "Opex - To present a paper on Career con", "X", "PR00031497"]
    ["0000016400", "1002", "RE-01768-011", "Opex - To present a paper on "Career con", "X", "PR00031497"]
    ["0000016400", "1003", "RE-01768-011", "Opex - To present a paper on Career con", "X", "PR00031497"]

Of course, with the workaround above, the quote characters (") end up included in every field.

If needed, you can strip them all:

    parsed.getRecords().each {
        println it.toMap().values().collect({ it.replace('"', '') })
    }
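
Note that replace removes every quote, including the stray one inside the value, and that with withQuote(null) a comma inside a value would split it into two fields. If the inner quote should be preserved, a small helper can strip only the enclosing pair. The sketch below is one possible way to do that; stripEnclosingQuotes is a hypothetical name, not part of Commons CSV, and the sample values are copied from the output above:

```groovy
// Hypothetical helper: removes one enclosing pair of quotes and leaves inner quotes alone.
String stripEnclosingQuotes(String value) {
    boolean enclosed = value.length() >= 2 && value.startsWith('"') && value.endsWith('"')
    enclosed ? value.substring(1, value.length() - 1) : value
}

// Raw field values as they come back when quote handling is disabled
def raw = ['"1002"', '"Opex - To present a paper on "Career con"', '"X"']
def cleaned = raw.collect { stripEnclosingQuotes(it) }
println cleaned
```

In the loop above it would be applied as it.toMap().values().collect { stripEnclosingQuotes(it) }.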
answered 2015-11-11T14:45:04.173

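The problem is that if the CSV file contains invalid data, i.e. data that violates the rules of the CSV format, then the parser cannot... parse it. That is why it cannot reliably parse anything beyond the first error it encounters.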
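
One way to still obtain a full list of bad rows is to parse each physical line on its own, so that a failure on one line cannot affect the next. The following is only a sketch under the assumption that no value contains an embedded line break (otherwise a record would span several lines). collectErrors is a hypothetical helper name, Commons CSV 1.2 is assumed, and the sample data is a shortened version of the CSV from the question:

```groovy
@Grab(group='org.apache.commons', module='commons-csv', version='1.2')
import org.apache.commons.csv.CSVFormat
import org.apache.commons.csv.CSVParser

// Hypothetical helper: parses every line separately and records the failures.
// Assumes one record per physical line (no line breaks inside quoted values).
List<Map> collectErrors(String text, CSVFormat format) {
    def errors = []
    text.eachLine { String line, int index ->
        if (line.trim().isEmpty()) return   // skip blank lines
        try {
            CSVParser.parse(line, format).getRecords()
        } catch (Exception e) {
            // The exception type may differ between library versions, so catch broadly.
            errors << [line: index + 1, message: e.message]
        }
    }
    errors
}

def text = '''"Company code","Short description","WBS Status"
"1001","Opex - To present a paper on Career con","X"
"1002","Opex - To present a paper on "Career con","X"
"1003","Opex - To present a paper on Career con","X"'''

collectErrors(text, CSVFormat.EXCEL.withIgnoreSurroundingSpaces()).each {
    println "line ${it.line}: ${it.message}"
}
```

Any line number inside the parser's own message would refer to the one-line input, which is why the sketch tracks the line index itself. Header handling is left out here; the valid lines could be parsed again in a second pass once the bad ones are known.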

answered 2015-11-11T11:29:01.920