I am looking for some general guidance here.

At a high level, I receive product documents from which I need to extract and process some information. Before doing that, I need to verify that each document actually refers to the correct product. To do so, I need to validate the product heading/description from the document against what I know to be correct.

So I have 2 texts:

  1. Text 1 - this refers to the product information extracted from some document
  2. Text 2 - this is the actual product heading/description available with me, which can be considered as correct.

I need to validate that both texts refer to the same product or object.

Example:

Text 1 (to be validated) - Optimus Prime Costume, Blue, with good packaging and warranty
Text 2 (correct info) - Optimus Prime Blue Costume, Medium Size

As you can see, I need to validate that both texts refer to the Optimus Prime Costume.

I tried the following methods:

  • Cosine Similarity
  • TF-IDF similarity
  • Overlapping words between strings

But the problem with them is that they depend on the entire text rather than on the primary object being referred to in the text.
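To make the problem concrete, here is a minimal sketch of the word-overlap measure applied to the example above. The helper names `tokens` and `jaccard` are my own, and Jaccard overlap is only one of several ways to score shared words; the point is just that the extra words in Text 1 pull the score down even though both texts describe the same product.

```python
import re


def tokens(text):
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard overlap: shared tokens divided by all distinct tokens."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


text_1 = "Optimus Prime Costume, Blue, with good packaging and warranty"
text_2 = "Optimus Prime Blue Costume, Medium Size"

# 4 shared tokens out of 11 distinct tokens overall
print(f"{jaccard(text_1, text_2):.2f}")  # prints 0.36
```

All four product-identifying words match, yet the score is only about 0.36 because words like "packaging" and "warranty" count against it.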

I was thinking of processing as follows:

  • Remove colour, size and similar attribute info from Text 2. Text 2 is very concise and does not contain random data; it contains only the product name plus size and colour info.
  • Validate that the remaining elements of Text 2 are present in Text 1, or at least that a majority of them are.
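The two steps above can be sketched roughly as follows. This is only an illustration under assumptions: `ATTRIBUTE_WORDS` is a hypothetical hand-written list of colour and size words (a real one would have to come from the product catalogue), and the 0.6 threshold for "a majority" is an arbitrary choice.

```python
import re

# Hypothetical attribute vocabulary (assumption, not from real data).
ATTRIBUTE_WORDS = {
    "blue", "red", "green", "black", "white", "yellow",
    "small", "medium", "large", "size",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def core_tokens(reference):
    """Step 1: drop colour/size words from the trusted text (Text 2)."""
    return [t for t in tokenize(reference) if t not in ATTRIBUTE_WORDS]


def coverage(candidate, reference):
    """Step 2: fraction of Text 2's core tokens that appear in Text 1."""
    core = core_tokens(reference)
    if not core:
        return 0.0
    candidate_tokens = set(tokenize(candidate))
    return sum(t in candidate_tokens for t in core) / len(core)


def same_product(candidate, reference, threshold=0.6):
    """Accept when at least `threshold` of the core tokens are covered."""
    return coverage(candidate, reference) >= threshold


text_1 = "Optimus Prime Costume, Blue, with good packaging and warranty"
text_2 = "Optimus Prime Blue Costume, Medium Size"

print(core_tokens(text_2))           # ['optimus', 'prime', 'costume']
print(same_product(text_1, text_2))  # True
```

Because the score is computed only over the core tokens of Text 2, the extra words in Text 1 no longer affect it. The obvious weakness is that the attribute list has to be maintained by hand.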

I am not sure which other NLP techniques exist that would work better than this approach, so any suggestions would be appreciated.

1 Answer

Depending on your goals, this could be moderately hard or very hard.

Here are a few things you could use:

NER (named entity recognition) would definitely help.

Wikifier might also be useful: http://cogcomp.org/page/demo_view/Wikifier

There is also semantic role labeling. See more annotators here: http://nlp.cogcomp.org/

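It is hard to tell from a single example what the exact algorithm should be, but if you have more examples, it may be easier to come up with a better formalization.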

An extension of this can be found in what is used in this work.

answered 2017-11-17T05:10:17.503