nlp - Tagging and Categorizing text automatically using NLP and Ontology respectively

Question

I am working on a project in which a user adds some text to a database and, while saving, also adds tags to the entry so that others can search using those tags.

EXAMPLE:

TEXT: "Next Formula 1 race is in Spain"

TAGS: "Formula 1", "race", "Spain"

Any user who searches for these tags will get this entry in the results.

But I also want users who search for "Sports", "Motor Sport" or "Europe" to get this entry. These tags were not explicitly added to the entry, but they are related, because "Formula 1" is a type of "Motor Sport", which is a type of "Sport", and "Spain" is in "Europe".

At the moment, on my submission form, users write their text in one text box, then write their tags in a second text box below it, and submit.

These tags are later categorized manually by the admin. So in the above case the admin will manually put "Spain" as a child element of "Europe" (MS SQL Server hierarchy column).

I think this can be achieved using some ontology software (dotNetRdf, OWL ...), but I am not sure. I only got to know about this side of the world a few days ago, and I am not sure how these tools can help me. Is this the solution, or am I looking at completely the wrong thing? Any suggestions for achieving the above?

Also, before doing the categorization, I would like to automatically pick tags from the text and fill them into the lower text box as tags.

For this I guess I'll have to use some NLP service? Any ideas which one to use, or any other suggestions?

score 0 · Accepted Answer

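Ontologies could help if the relationships you are looking for are general enough (e.g. countries and continents). For semantic relatedness, such as "race" with "sport", I would suggest making use of some sort of semantic similarity between words (or tags).

Basically, if you generate an MxM matrix that models the dependency or similarity between the different tags, you can use those weights to retrieve similar concepts. For example, "race" and "sport" would be more related than "race" and "Spain".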

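How do you compute the weights? This can be approached with several techniques, such as [Explicit Semantic Analysis](http://en.wikipedia.org/wiki/Explicit_semantic_analysis) or [Distributional Semantics](http://en.wikipedia.org/wiki/Distributional_semantics) techniques. One of the simplest measures is a co-occurrence measure (i.e. the percentage of documents in which "race" and "sport" appear together).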
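As a minimal sketch of the co-occurrence idea, the snippet below scores each tag pair by the share of documents containing either tag that contain both (a Jaccard score, which is only one possible choice of measure). The `documents` corpus and the function names are made up for illustration.

```python
from itertools import combinations

# Hypothetical corpus: each entry is the set of tags attached to one document.
documents = [
    {"Formula 1", "race", "Spain"},
    {"race", "sport", "marathon"},
    {"sport", "race", "Formula 1"},
    {"Spain", "Europe", "travel"},
]

def cooccurrence_weights(docs):
    """Return {(tag_a, tag_b): Jaccard score} for every pair seen together."""
    doc_count = {}   # tag -> number of documents containing it
    pair_count = {}  # sorted (tag_a, tag_b) -> number of documents containing both
    for tags in docs:
        for tag in tags:
            doc_count[tag] = doc_count.get(tag, 0) + 1
        for a, b in combinations(sorted(tags), 2):
            pair_count[(a, b)] = pair_count.get((a, b), 0) + 1
    weights = {}
    for (a, b), both in pair_count.items():
        either = doc_count[a] + doc_count[b] - both
        weights[(a, b)] = both / either
    return weights

def similarity(weights, a, b):
    """Look up a pair regardless of argument order; unseen pairs score 0.0."""
    return weights.get(tuple(sorted((a, b))), 0.0)

weights = cooccurrence_weights(documents)
print(similarity(weights, "race", "sport") > similarity(weights, "race", "Spain"))
```

With this toy corpus, "race" and "sport" score higher than "race" and "Spain". With real data the matrix would be sparse, so storing only the pairs that actually co-occur (as the dictionary does here) is usually enough.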

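In addition, further NLP techniques, such as synonyms, can be used.

You can also combine these weights with ontology relations. If you know that Spain is part of Europe, you can increase their weight in the general matrix.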
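One way to picture combining the two sources is a fixed boost for every pair the ontology relates. This is only a sketch: the weights, the relation list, the `boost` value of 0.5 and the cap at 1.0 are all arbitrary assumptions.

```python
# Hypothetical similarity weights, keyed by sorted tag pairs
# (Python's default string ordering, so "Spain" sorts before "race").
weights = {
    ("race", "sport"): 0.67,
    ("Spain", "race"): 0.25,
}

# Hypothetical ontology facts: (child, parent) pairs known to be related.
ontology_relations = [("Spain", "Europe"), ("Formula 1", "Motor Sport")]

def boost_with_ontology(weights, relations, boost=0.5):
    """Return a copy of weights where each ontology pair gains `boost`, capped at 1.0."""
    combined = dict(weights)
    for child, parent in relations:
        key = tuple(sorted((child, parent)))
        combined[key] = min(1.0, combined.get(key, 0.0) + boost)
    return combined

combined = boost_with_ontology(weights, ontology_relations)
print(combined[("Europe", "Spain")])
```

Pairs that never co-occur in the text, such as "Spain" and "Europe" here, still end up with a non-zero weight because the ontology says they are related.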

For extracting the tags, you should look into entity extraction; nltk could be a good tool to start with.
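To show the shape of the tag suggestion step, here is a deliberately simple stand-in that looks up already known tags in the text. It is a plain dictionary lookup, not NLTK's entity extraction, so unlike a real extractor it cannot find tags that are missing from the list. The `known_tags` list and `suggest_tags` function are hypothetical.

```python
import re

# Hypothetical list of tags already known to the system.
known_tags = ["Formula 1", "race", "Spain", "Europe"]

def suggest_tags(text, vocabulary):
    """Return vocabulary tags that occur in text as whole words, ignoring case."""
    found = []
    for tag in vocabulary:
        # Lookarounds stop "race" from matching inside "racecar".
        pattern = r"(?<!\w)" + re.escape(tag) + r"(?!\w)"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(tag)
    return found

suggestions = suggest_tags("Next Formula 1 race is in Spain", known_tags)
print(suggestions)
```

The suggestions could be pre-filled into the tag text box so the user only has to confirm or correct them.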

I hope this helps.

score 0 · Accepted Answer

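In this case, the solution you are already using (the MS SQL Server hierarchy column) can be complemented by an OWL ontology (which is a hierarchy/taxonomy). I'll give you an example of what it could look like in your case and what you can get out of it.

A sport-related ontology would look like: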

Class: Sport

Class: Formula_1
  SubClassOf: Motor_Sport

Class: Motor_Sport
  SubClassOf: Sport

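Then, with the help of a program called a reasoner, you can ask questions such as: what is more specific than Sport? (i.e. the subclasses of Sport)

The resulting list contains Motor_Sport and Formula_1. You can then use these classes to annotate your data.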
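To make the subclass query concrete, the following sketch walks the three-class hierarchy above by hand. It is not a reasoner and does not use any OWL library; it assumes each class has at most one direct parent and that the hierarchy has no cycles, and the function names are invented for illustration.

```python
# Direct SubClassOf axioms from the ontology above: child -> parent.
subclass_of = {
    "Formula_1": "Motor_Sport",
    "Motor_Sport": "Sport",
}

def subclasses(cls, axioms):
    """Return every class that is a direct or indirect subclass of cls."""
    result = set()
    frontier = [cls]
    while frontier:
        parent = frontier.pop()
        for child, direct_parent in axioms.items():
            if direct_parent == parent and child not in result:
                result.add(child)
                frontier.append(child)
    return result

def superclasses(cls, axioms):
    """Walk upward from cls, nearest ancestor first."""
    chain = []
    while cls in axioms and axioms[cls] not in chain:
        cls = axioms[cls]
        chain.append(cls)
    return chain

print(subclasses("Sport", subclass_of))
print(superclasses("Formula_1", subclass_of))
```

The upward walk is the direction the search feature needs: an entry tagged Formula_1 can also be indexed under Motor_Sport and Sport, so a search for the broader term finds it.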

A good way to get started is to look at the Protégé OWL tutorial.
