java - Choosing a parsing technology for a large project

Question

I have to deal with lots of different file formats. At least 50, maybe more than 100.

I've played around with Antlr in the past. However, I'm not sure that Antlr would be suitable for this project for a couple of reasons:

it's difficult to combine and reuse grammars and/or pieces of grammars
Antlr does code generation -- making a change to an existing parser requires going back to Antlr, making the change, regenerating the code, integrating the code back into the codebase, and running the unit-tests
doing tree-building/-processing requires dealing with another language inside Antlr -- a potential problem for future developers

Basically, I like Antlr, but I think that it may be better suited for creating one or two parsers for complex languages, rather than 100 parsers for somewhat simpler languages/formats.

An alternative to Antlr-like parser generators is parser combinators. The advantage is that the parsers are written directly in code, which makes reuse, testing, and further abstraction very easy. Also, future developers wouldn't have to learn a new tool. The downside is that I don't know of any heavy-duty parser combinator libraries for Java.
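To make the reuse argument concrete, here is a minimal sketch of the combinator idea, assuming Java 8 or later. The names `Combinators`, `Parser`, `Result`, `takeWhile1`, `ch`, and `sepBy1` are hypothetical and do not come from any existing library; the point is only that a parser is an ordinary value that can be shared between formats.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;
import java.util.function.IntPredicate;

// Hypothetical sketch, not a real library.
public class Combinators {

    /** Successful parse: the value produced and the offset where parsing stopped. */
    static final class Result<T> {
        final T value;
        final int next;
        Result(T value, int next) { this.value = value; this.next = next; }
    }

    /** A parser maps (input, offset) to a result, or to empty on failure. */
    interface Parser<T> {
        Optional<Result<T>> parse(String in, int pos);

        /** Transforms the value of a successful parse. */
        default <R> Parser<R> map(Function<T, R> f) {
            return (in, pos) -> parse(in, pos).map(r -> new Result<R>(f.apply(r.value), r.next));
        }
    }

    /** Matches one or more characters accepted by the predicate. */
    static Parser<String> takeWhile1(IntPredicate ok) {
        return (in, pos) -> {
            int end = pos;
            while (end < in.length() && ok.test(in.charAt(end))) end++;
            if (end == pos) return Optional.empty();
            return Optional.of(new Result<String>(in.substring(pos, end), end));
        };
    }

    /** Matches a single literal character. */
    static Parser<Character> ch(char c) {
        return (in, pos) -> {
            if (pos < in.length() && in.charAt(pos) == c) {
                return Optional.of(new Result<Character>(c, pos + 1));
            }
            return Optional.empty();
        };
    }

    /** Matches item (sep item)*, collecting the items into a list. */
    static <T, S> Parser<List<T>> sepBy1(Parser<T> item, Parser<S> sep) {
        return (in, pos) -> {
            Optional<Result<T>> first = item.parse(in, pos);
            if (!first.isPresent()) return Optional.empty();
            List<T> values = new ArrayList<>();
            values.add(first.get().value);
            int end = first.get().next;
            while (true) {
                Optional<Result<S>> s = sep.parse(in, end);
                if (!s.isPresent()) break;
                Optional<Result<T>> more = item.parse(in, s.get().next);
                if (!more.isPresent()) break;
                values.add(more.get().value);
                end = more.get().next;
            }
            return Optional.of(new Result<List<T>>(values, end));
        };
    }

    // The same building blocks are reused across three simple formats.
    static final Parser<String> FIELD = takeWhile1(c -> c != ',' && c != '\t' && c != '\n');
    static final Parser<List<String>> CSV_LINE = sepBy1(FIELD, ch(','));
    static final Parser<List<String>> TSV_LINE = sepBy1(FIELD, ch('\t'));
    static final Parser<Integer> INT = takeWhile1(Character::isDigit).map(Integer::parseInt);
    static final Parser<List<Integer>> INT_LIST = sepBy1(INT, ch(','));

    public static void main(String[] args) {
        System.out.println(CSV_LINE.parse("a,b,c", 0).get().value);    // prints [a, b, c]
        System.out.println(INT_LIST.parse("1,22,333", 0).get().value); // prints [1, 22, 333]
    }
}
```

Each parser here can be unit-tested on its own, and a new delimited format only needs a different separator passed to `sepBy1`. Quoting, escaping, and error reporting are left out of this sketch.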

So the questions are:

Is Antlr suitable/intended for such a massive parsing project?
What are other options for large-scale parsing in Java?

Note: some of the file formats are CSV or tab-delimited, some are somewhat more complex, some are as complex as Java. Semantics-wise, they can also be quite complicated (although not all are).

score 0 · Accepted Answer

I have personally used Apache Tika in the past. It suited my needs well and covers a wide range of formats. I have never used Antlr, so I can't comment on it.

score 0 · Accepted Answer

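There is a parsing technique that is very well suited to combining, reusing, inheriting, and extending parser components (even extending a running parser at runtime).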

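I would never count a code generation tool and a good declarative DSL as a disadvantage, but I may be too far removed from the Java subculture. If those concerns are valid to some extent, it is still not a problem: you can implement Packrat with combinators. It may be a bit clumsy in Java (due to the lack of proper closures and lambdas), but it is still more readable than a typical ad hoc recursive descent parser.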
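As a rough illustration of Packrat built from combinators, the sketch below assumes Java 8 or later, where lambdas are available and much of the clumsiness goes away. All names (`PackratSketch`, `memo`, `seq`, `or`, `lazy`, `SumGrammar`) are hypothetical and not taken from an existing library. The `memo` wrapper caches each result by input position, so backtracking in an ordered choice does not re-parse the same text.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.BiFunction;
import java.util.function.Supplier;

// Hypothetical sketch, not a real library.
public class PackratSketch {

    static final class Result<T> {
        final T value;
        final int next;
        Result(T value, int next) { this.value = value; this.next = next; }
    }

    interface Parser<T> {
        Optional<Result<T>> parse(String in, int pos);
    }

    /** Caches results by position. Assumption: one memoized parser serves one input only. */
    static <T> Parser<T> memo(Parser<T> p) {
        Map<Integer, Optional<Result<T>>> cache = new HashMap<>();
        return (in, pos) -> {
            Optional<Result<T>> hit = cache.get(pos);
            if (hit == null) {
                hit = p.parse(in, pos);
                cache.put(pos, hit);
            }
            return hit;
        };
    }

    /** Matches a single literal character. */
    static Parser<Character> ch(char c) {
        return (in, pos) -> {
            if (pos < in.length() && in.charAt(pos) == c) {
                return Optional.of(new Result<Character>(c, pos + 1));
            }
            return Optional.empty();
        };
    }

    /** Runs a, then b, and combines both values with f. */
    static <A, B, R> Parser<R> seq(Parser<A> a, Parser<B> b, BiFunction<A, B, R> f) {
        return (in, pos) -> a.parse(in, pos).flatMap(ra ->
                b.parse(in, ra.next).map(rb ->
                        new Result<R>(f.apply(ra.value, rb.value), rb.next)));
    }

    /** Ordered choice: tries first, and falls back to second at the same position. */
    static <T> Parser<T> or(Parser<T> first, Parser<T> second) {
        return (in, pos) -> {
            Optional<Result<T>> r = first.parse(in, pos);
            return r.isPresent() ? r : second.parse(in, pos);
        };
    }

    /** Defers the lookup of a parser so that a rule can refer to itself. */
    static <T> Parser<T> lazy(Supplier<Parser<T>> ref) {
        return (in, pos) -> ref.get().parse(in, pos);
    }

    /** Sum <- Num '+' Sum / Num. Create a fresh instance per input. */
    static final class SumGrammar {
        int numCalls = 0; // how often the raw number parser actually ran

        Parser<Integer> num = memo(this::rawNum);
        Parser<Integer> numPlus = seq(num, ch('+'), (n, plus) -> n);
        Parser<Integer> rest = lazy(() -> this.sum);
        Parser<Integer> addition = seq(numPlus, rest, (a, b) -> a + b);
        Parser<Integer> sum = memo(or(addition, num));

        Optional<Result<Integer>> rawNum(String in, int pos) {
            numCalls++;
            int end = pos;
            while (end < in.length() && Character.isDigit(in.charAt(end))) end++;
            if (end == pos) return Optional.empty();
            return Optional.of(new Result<Integer>(Integer.parseInt(in.substring(pos, end)), end));
        }
    }

    public static void main(String[] args) {
        SumGrammar g = new SumGrammar();
        System.out.println(g.sum.parse("1+2+3", 0).get().value); // prints 6
        System.out.println(g.numCalls);                          // prints 3
    }
}
```

For the input `1+2+3`, the last number is requested twice (once by each alternative of `Sum`), but the raw parser runs only three times because the second request is answered from the cache. A fuller implementation would also need to handle left recursion and clear or scope the caches per input.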
