
Is anybody familiar with the RTF document format and with parsing it using any Java libraries? The standard way people have done this is by using the RTFEditorKit in the JDK Swing API:

Swing RTFEditorKit API

but it isn't that accurate when it comes to parsing RTF documents. In fact there's a comment in the API:

The RTF support was not written by the Swing team. In the future we hope to improve the support provided.

I don't think I'm going to wait for this to happen :)
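For reference, the RTFEditorKit approach boils down to something like the minimal sketch below. The class name `RtfEditorKitDemo`, the helper `extractText`, and the sample RTF string are made up for illustration; the Swing calls (`createDefaultDocument`, `read`, `getText`) are the standard ones. It only pulls out plain text, so formatting is lost.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.swing.text.BadLocationException;
import javax.swing.text.Document;
import javax.swing.text.rtf.RTFEditorKit;

class RtfEditorKitDemo {

    // Hypothetical helper: loads RTF into a Swing document and returns its plain text.
    static String extractText(String rtf) {
        // RTF files are normally 7-bit ASCII, so US_ASCII is assumed here.
        InputStream in = new ByteArrayInputStream(rtf.getBytes(StandardCharsets.US_ASCII));
        RTFEditorKit kit = new RTFEditorKit();
        Document doc = kit.createDefaultDocument();
        try {
            kit.read(in, doc, 0);                    // parse starting at offset 0
            return doc.getText(0, doc.getLength());  // styling is discarded
        } catch (IOException | BadLocationException e) {
            throw new IllegalArgumentException("Could not read RTF", e);
        }
    }

    public static void main(String[] args) {
        // Illustrative input only, not taken from a real file.
        String rtf = "{\\rtf1\\ansi\\deff0{\\fonttbl{\\f0 Times New Roman;}}\\f0 Hello, world!}";
        System.out.println(extractText(rtf));
    }
}
```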

The other approach taken is to define a grammar using JavaCC and generate a parser. This works better, but I'm having trouble finding a complete grammar. I've tried:

PMD Applied JavaCC Grammar

which is OK, and the following (which is the best so far):

Koders RTFParserDelegate and ETranslate Grammar

There are various implementations of the ETranslate grammar about (I know the Nutch API may use this). Does anybody know which is the most accurate grammar or whether there is a better approach to this?

I could start ploughing through the JavaCC docs to understand the .jj files and test it against the RTF files... this is my current approach, but it's taking a while... any help would be appreciated


2 Answers


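Does anybody know which is the most accurate grammar or whether there is a better approach to this?

Many years ago I spent some time reading RTF (Wikipedia) with C#. I say reading because, once you understand RTF in detail and use it the way it was designed, you realize that RTF is not meant to be read as a whole and re-parsed as a whole over and over while editing. The documentation does include a grammar for RTF, but don't take that to mean you should use a lexer/parser. The documentation also provides a sample reader for RTF.

Keep in mind that RTF was created many years ago, when memory was measured in KB rather than MB, and editing a document hundreds of pages long in the conventional way would exhaust system resources. RTF was therefore designed so that a document can be edited in smaller subsections without loading or modifying the whole thing. That is how it could handle such large documents with limited memory, and it is also why the grammar looks odd at first.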
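To illustrate the reader style (as opposed to a generated lexer/parser), here is a minimal Java sketch of a single-pass, character-at-a-time reader that keeps only a small stack of group state. It is not the sample reader from the RTF documentation. The class name `RtfStreamReader`, the method `extractText`, and the short list of skipped destinations are assumptions made for this example, `\'hh` bytes are decoded as Latin-1 for simplicity, and `\uN` Unicode escapes are not handled.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Set;

class RtfStreamReader {

    // Destinations whose contents are not body text (deliberately incomplete list).
    private static final Set<String> SKIPPED =
            Set.of("fonttbl", "colortbl", "stylesheet", "info", "pict");

    static String extractText(String rtf) {
        try {
            return extractText(new StringReader(rtf));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static String extractText(Reader in) throws IOException {
        StringBuilder out = new StringBuilder();
        Deque<Boolean> saved = new ArrayDeque<>(); // "skipping" flag saved at each '{'
        boolean skipping = false;
        int c = in.read();
        while (c != -1) {
            if (c == '{') {                        // group start: save state
                saved.push(skipping);
                c = in.read();
            } else if (c == '}') {                 // group end: restore state
                if (!saved.isEmpty()) skipping = saved.pop();
                c = in.read();
            } else if (c == '\\') {
                c = in.read();
                if (c == -1) break;
                if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')) {
                    StringBuilder word = new StringBuilder();
                    while ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')) {
                        word.append((char) c);
                        c = in.read();
                    }
                    if (c == '-') c = in.read();   // optional negative parameter
                    while (c >= '0' && c <= '9') c = in.read();
                    if (c == ' ') c = in.read();   // one space ends the control word
                    String w = word.toString();
                    if (SKIPPED.contains(w)) skipping = true;
                    else if (!skipping && (w.equals("par") || w.equals("line"))) out.append('\n');
                    else if (!skipping && w.equals("tab")) out.append('\t');
                } else if (c == '\'') {            // \'hh: one byte as two hex digits
                    int hi = Character.digit(in.read(), 16);
                    int lo = Character.digit(in.read(), 16);
                    if (!skipping && hi >= 0 && lo >= 0) out.append((char) (hi * 16 + lo));
                    c = in.read();
                } else {                           // control symbol such as \{ \} \\ \*
                    if (c == '*') skipping = true; // \* marks an ignorable destination
                    else if (!skipping && (c == '\\' || c == '{' || c == '}')) out.append((char) c);
                    c = in.read();
                }
            } else {                               // plain text; raw CR/LF carry no meaning
                if (!skipping && c != '\r' && c != '\n') out.append((char) c);
                c = in.read();
            }
        }
        return out.toString();
    }
}
```

Because it never holds more than the current character and the group stack, a reader like this can work through a large file with very little memory, which is the property described above.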

Answered 2013-03-11T12:59:52.057

Presumably, the source of OpenOffice contains what you're looking for.

Answered 2009-05-13T11:46:54.937