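I am classifying input sentences into different categories, such as time, distance, speed, location, etc.
I trained the classifier using MultinomialNB.
The classifier mainly uses tf (term frequency) as its feature. I also tried to take sentence structure into account (using 1-4 grams).
Using MultinomialNB with alpha = 0.001, this is the result for a few queries: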
what is the value of Watch
{"1": {"other": "33.27%"}, "2": {"identity": "25.40%"}, "3": {"desc": "16.20%"}, "4": {"country": "9.32%"}}
what is the price of Watch
{"1": {"other": "25.37%"}, "2": {"money": "23.79%"}, "3": {"identity": "19.37%"}, "4": {"desc": "12.35%"}, "5": {"country": "7.11%"}}
what is the cost of Watch
{"1": {"money": "48.34%"}, "2": {"other": "17.20%"}, "3": {"identity": "13.13%"}, "4": {"desc": "8.37%"}} # the two queries above should also return money
How early can I go to mumbai
{"1": {"manner": "97.77%"}} #result should be time
How fast can I go to mumbai
{"1": {"speed": "97.41%"}}
How come can I go to mumbai
{"1": {"manner": "100.00%"}}
How long is a meter
{"1": {"period": "90.74%"}, "2": {"dist": "9.26%"}} #better result should be distance
MultinomialNB with n-grams (1-4) taken into account:
what is the value of Watch
{"1": {"other": "33.27%"}, "2": {"identity": "25.40%"}, "3": {"desc": "16.20%"}, "4": {"country": "9.32%"}}
what is the price of Watch
{"1": {"other": "25.37%"}, "2": {"money": "23.79%"}, "3": {"identity": "19.37%"}, "4": {"desc": "12.35%"}, "5": {"country": "7.11%"}}
what is the cost of Watch
{"1": {"money": "48.34%"}, "2": {"other": "17.20%"}, "3": {"identity": "13.13%"}, "4": {"desc": "8.37%"}} # the two queries above should also return money
How early can I go to mumbai
{"1": {"manner": "97.77%"}} #result should be time
How fast can I go to mumbai
{"1": {"speed": "97.41%"}}
How come can I go to mumbai
{"1": {"manner": "100.00%"}}
How long is an hour
{"1": {"dist": "99.61%"}} #result should be time
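For reference, the two configurations above can be sketched roughly as follows. This is a minimal sketch, not my exact code: the training sentences, the label names used here, and the helper names `build_classifier` and `ranked` are made up for illustration, and `CountVectorizer` is assumed as the source of the tf features.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data; the real dataset has 40 classes (~300 KB).
TRAIN = [
    ("what is the cost of a ticket", "money"),
    ("how much does the phone cost", "money"),
    ("how long does the flight take", "period"),
    ("how long is the movie", "period"),
    ("how far is the station", "dist"),
    ("what is the distance to delhi", "dist"),
    ("how fast is the train", "speed"),
    ("what is the speed of light", "speed"),
]


def build_classifier(ngram_range=(1, 1), alpha=0.001):
    """Fit raw term counts (tf) into MultinomialNB with the given smoothing."""
    texts = [text for text, _ in TRAIN]
    labels = [label for _, label in TRAIN]
    model = make_pipeline(
        CountVectorizer(ngram_range=ngram_range, lowercase=True),
        MultinomialNB(alpha=alpha),
    )
    model.fit(texts, labels)
    return model


def ranked(model, query, min_prob=0.05):
    """Return (label, 'xx.xx%') pairs, highest probability first."""
    probs = model.predict_proba([query])[0]
    classes = model.steps[-1][1].classes_  # classes of the final estimator
    pairs = sorted(zip(classes, probs), key=lambda p: p[1], reverse=True)
    return [(label, f"{p:.2%}") for label, p in pairs if p >= min_prob]


unigram_model = build_classifier(ngram_range=(1, 1))  # tf only
ngram_model = build_classifier(ngram_range=(1, 4))    # tf over 1-4 grams

print(ranked(unigram_model, "how fast can I go to mumbai"))
print(ranked(ngram_model, "how fast can I go to mumbai"))
```

With such a small alpha, a single word that occurs in only one class (for example "fast" or "cost" in the toy data) dominates the posterior, which matches the behaviour in the outputs above.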
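So the result depends entirely on which words occur. Is there any way to add word sense disambiguation here (or any other method that could bring in some kind of understanding)?
I have already looked at word sense disambiguation in NLTK (Python), but the problem here is identifying the main word in the sentence, which differs from sentence to sentence.
I have already tried POS tagging (it gives NN, JJ, which the sentence class does not depend on) and NER (it depends heavily on capitalization, and it often does not pick up ambiguous words like "early" or "cost" in the sentences above). Neither of them helped.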
**"How long" is sometimes considered time and sometimes distance. So, based on the nearby words in the sentence, the classifier should be able to understand which one is meant. Similarly, "how fast", "how come", "how early" [how + word] should be understandable.**
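To make the [how + word] idea concrete, here is a rough sketch of the kind of feature I have in mind. It is an illustration only, not something I have validated: the function name `cue_features`, the `WH_WORDS` set, and the tiny `STOP` list are all hypothetical. It joins the question word with the word that follows it, then pairs that cue with the later content words of the sentence.

```python
import re

WH_WORDS = {"how", "what", "when", "where", "which", "who", "why"}
# Hypothetical stop list, just large enough for the example queries.
STOP = {"a", "an", "the", "is", "of", "can", "i", "to", "go"}


def cue_features(sentence):
    """Return plain tokens plus 'wh_word' cue and 'cue|context' features."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    feats = list(tokens)
    for i, tok in enumerate(tokens[:-1]):
        if tok in WH_WORDS:
            cue = f"{tok}_{tokens[i + 1]}"  # e.g. "how_long"
            feats.append(cue)
            # Pair the cue with each later content word, e.g. "how_long|meter".
            for ctx in tokens[i + 2:]:
                if ctx not in STOP:
                    feats.append(f"{cue}|{ctx}")
            break  # only the first question word is used
    return feats


print(cue_features("How long is a meter"))
print(cue_features("How long is an hour"))
```

The two "how long" queries then produce different features (`how_long|meter` versus `how_long|hour`). A callable like this can be passed to scikit-learn as `CountVectorizer(analyzer=cue_features)`, but it only helps if the training data contains similar cue and context pairs.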
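I am using nltk, scikit-learn, and Python.
Update:
- 40 classes (each class has sentences belonging to that class)
- Total data: 300 KB
Accuracy depends on the query. Sometimes it is very good (> 90%). Sometimes the result is an irrelevant class. It depends on how well the query matches the dataset.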