java - How to do Document Analysis using Text Mining?

Question

I would like to analyze a given document to determine whether its content belongs to my domain of interest or is unrelated to it.

For example, I have a document that contains data about Android OS, and I have a domain ontology that specifies the full knowledge about Android. Now I have to find out what percentage of the document's content is valid with respect to the domain ontology.

One way to approach a solution is to use ANNIE (GATE) to extract named entities (NEs) from the document and compare them with the instances of the domain ontology; the percentage of valid content can then be computed from the matches.
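The comparison step described above could be sketched roughly as follows. This assumes the named entities have already been extracted (for example by ANNIE) and are available as plain strings, and that the ontology instances are available as a set of labels. The class name `OntologyCoverage`, the method names, and the sample data are hypothetical, and matching is by normalized exact label only.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: not part of GATE or of any ontology library.
public class OntologyCoverage {

    // Normalizes a label so that "Android" and " android " compare equal.
    public static String normalize(String label) {
        return label.trim().toLowerCase(Locale.ROOT);
    }

    // Returns the percentage (0 to 100) of distinct extracted entities
    // that also appear among the ontology instance labels.
    public static double coveragePercent(List<String> extractedEntities,
                                         Set<String> ontologyInstances) {
        Set<String> ontology = new HashSet<>();
        for (String label : ontologyInstances) {
            ontology.add(normalize(label));
        }
        Set<String> entities = new HashSet<>();
        for (String entity : extractedEntities) {
            entities.add(normalize(entity));
        }
        if (entities.isEmpty()) {
            return 0.0; // nothing was extracted, so nothing can match
        }
        int matched = 0;
        for (String entity : entities) {
            if (ontology.contains(entity)) {
                matched++;
            }
        }
        return 100.0 * matched / entities.size();
    }

    public static void main(String[] args) {
        // Assumed sample data, standing in for ANNIE output and ontology instances.
        List<String> entities = Arrays.asList("Android", "Dalvik", "Activity", "Paris");
        Set<String> ontology = new HashSet<>(
                Arrays.asList("android", "dalvik", "activity", "intent"));
        System.out.println(coveragePercent(entities, ontology)); // prints 75.0
    }
}
```

Exact label matching is the simplest choice; synonyms or partial matches would need extra handling that is not shown here.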

Can you suggest a better technique that I can use?
Are any other open source APIs available? I tried LingPipe, but I can't use it in a commercial product.
Are there any open source applications of this kind? I searched extensively but could not find one.

score 1 · Accepted Answer

You can treat this as a document classification problem:

The simplest such classifier is a Bayes classifier.

Or as a document retrieval problem:

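In effect, you are comparing the cosine similarity between the document and the ontology classes. You can use Lucene as the basis for the ontology document storage engine.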
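To make the cosine similarity idea concrete, here is a minimal sketch in plain Java that compares raw term-frequency vectors. It does not use Lucene, which would normally handle indexing and scoring itself; the class name `CosineSimilarity` and the naive tokenizer are assumptions made for illustration only.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: a plain term-frequency cosine, not Lucene's scoring.
public class CosineSimilarity {

    // Splits text on non-alphanumeric characters and counts each token.
    public static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Cosine of the angle between two term-frequency vectors, in [0, 1].
    public static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> entry : a.entrySet()) {
            Integer other = b.get(entry.getKey());
            if (other != null) {
                dot += (double) entry.getValue() * other;
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // an empty vector is treated as similar to nothing
        }
        return dot / (normA * normB);
    }

    // Euclidean length of a term-frequency vector.
    private static double norm(Map<String, Integer> vector) {
        double sum = 0.0;
        for (int count : vector.values()) {
            sum += (double) count * count;
        }
        return Math.sqrt(sum);
    }
}
```

A document would be scored by calling `cosine(termFrequencies(documentText), termFrequencies(ontologyClassText))` for each ontology class and keeping the highest value. Weighting terms by TF-IDF instead of raw counts is a common refinement that is left out here.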

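In both cases, you may want to reduce the number of dimensions (terms) in the document by extracting the top N (say, 10) unigrams (excluding stop words) and statistically significant bigrams, and using them as your bag of words (Naive Bayes) or search query (document retrieval).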
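The unigram part of this reduction might look roughly like the sketch below. The stop-word list is a tiny illustrative sample, the class name `TopTerms` is hypothetical, and the selection of statistically significant bigrams is deliberately left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: picks the most frequent non-stop-word unigrams.
public class TopTerms {

    // Tiny illustrative stop list; a real one would be much longer.
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "an", "the", "is", "of", "and", "to", "in", "on", "for"));

    // Returns up to n unigrams ordered by descending frequency,
    // with ties broken alphabetically so the result is deterministic.
    public static List<String> topUnigrams(String text, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((x, y) -> {
            int byCount = Integer.compare(y.getValue(), x.getValue());
            return byCount != 0 ? byCount : x.getKey().compareTo(y.getKey());
        });
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }

    public static void main(String[] args) {
        String text = "The Android activity starts the Android service and the Android intent";
        System.out.println(topUnigrams(text, 2)); // prints [android, activity]
    }
}
```

The resulting terms can serve either as the bag of words for a Naive Bayes classifier or be joined into a search query for the retrieval approach.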
