I am looking for some general guidance here.
The high-level use case is this: I receive product documents from which I need to extract some information and process it. Before doing that, I need to verify that the document actually refers to the correct product. To do so, I need to validate the product heading/description from the document against what I know to be correct.
So I have two texts:
- Text 1 - the product information extracted from some document
- Text 2 - the actual product heading/description that I have, which can be considered correct.
I need to validate that both texts refer to the same product or object.
Example:
Text 1 (to be validated) - Optimus Prime Costume, Blue, with good packaging and warranty
Text 2 (correct info) - Optimus Prime Blue Costume, Medium Size
As you can see, I need to validate that both texts refer to the Optimus Prime Costume.
I tried the following methods:
- Cosine Similarity
- TF-IDF similarity
- Overlapping words between strings
But the problem with them is that they depend on the entire text rather than on the primary object being referred to in the text.
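To illustrate the problem, here is a minimal sketch of a word-overlap (Jaccard) score on the example above. The tokenization (lowercase alphanumeric words) and the function names are assumptions made for illustration, not a fixed recipe.

```python
import re

def tokenize(text):
    # Lowercase and keep only alphanumeric word tokens (an assumed, simple tokenizer).
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    # Shared distinct words divided by all distinct words across both texts.
    tokens_a, tokens_b = tokenize(a), tokenize(b)
    if not tokens_a and not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

text1 = "Optimus Prime Costume, Blue, with good packaging and warranty"
text2 = "Optimus Prime Blue Costume, Medium Size"

score = jaccard(text1, text2)
print(round(score, 2))  # 0.36: 4 shared words out of 11 distinct words
```

Even though both texts clearly describe the same product, the score is low because the extra words ("with good packaging and warranty", "Medium Size") count against the match.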
I was thinking of processing it as follows:
- Remove colour, size info, etc. from Text 2. Text 2 is very concise and does not contain random data. It contains the product name plus size and colour info.
- Validate that the remaining elements from Text 2 are present in Text 1, or at least a majority of them are.
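The two steps above could be sketched roughly as follows. The attribute word list, the function names, and the 0.5 "majority" threshold are all hypothetical choices for illustration; a real attribute list would have to come from the actual product data.

```python
import re

# Hypothetical attribute vocabulary (colours, sizes); assumed for illustration only.
ATTRIBUTE_WORDS = {
    "blue", "red", "green", "yellow", "black", "white",
    "small", "medium", "large", "size",
}

def tokenize(text):
    # Lowercase alphanumeric word tokens, in order of appearance.
    return re.findall(r"[a-z0-9]+", text.lower())

def core_tokens(reference):
    # Step 1: drop colour/size words from the trusted text (Text 2).
    return [t for t in tokenize(reference) if t not in ATTRIBUTE_WORDS]

def coverage(candidate, reference):
    # Step 2: fraction of the remaining Text 2 tokens that also occur in Text 1.
    core = core_tokens(reference)
    if not core:
        return 0.0
    candidate_tokens = set(tokenize(candidate))
    return sum(t in candidate_tokens for t in core) / len(core)

def same_product(candidate, reference, threshold=0.5):
    # "Majority present" is interpreted here as coverage strictly above the threshold.
    return coverage(candidate, reference) > threshold

text1 = "Optimus Prime Costume, Blue, with good packaging and warranty"
text2 = "Optimus Prime Blue Costume, Medium Size"

print(core_tokens(text2))          # ['optimus', 'prime', 'costume']
print(same_product(text1, text2))  # True
print(same_product("Bumblebee Costume, Yellow", text2))  # False
```

Because the score is computed only over the cleaned Text 2 tokens, extra words in Text 1 (packaging, warranty) no longer lower it.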
I am not sure which NLP techniques might work better than this approach, so any suggestions would be appreciated.