
Big-picture goal: I am building an LDA model of product reviews in Python using NLTK and Gensim. I want to run it on varying n-grams.

Problem: Everything works fine with unigrams, but when I run with bigrams, I start to get topics with repeated information. For example, topic 1 might contain: ['good product', 'good value'], and topic 4 might contain: ['great product', 'great value']. To a human, these are obviously conveying the same information, but obviously 'good product' and 'great product' are different bigrams. How do I algorithmically determine that 'good product' and 'great product' are similar enough, so that I can convert every occurrence of one to the other (perhaps to whichever appears more frequently in the corpus)?
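Concretely, once I know which bigrams to merge, the conversion itself is just a replacement pass over the tokenized documents before LDA. A minimal sketch, where merge_map is a hypothetical output of whatever similarity test I end up with:

```python
def normalize_bigrams(docs, merge_map):
    """Replace each bigram token with its canonical form before
    handing the documents to the LDA model."""
    return [[merge_map.get(tok, tok) for tok in doc] for doc in docs]

docs = [['great product', 'great value'], ['good product', 'good value']]
# merge_map here is hand-made for illustration only
merge_map = {'great product': 'good product', 'great value': 'good value'}
print(normalize_bigrams(docs, merge_map))
# [['good product', 'good value'], ['good product', 'good value']]
```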

What I've tried: I have played with WordNet's Synset tree, with little luck. It turns out that good is an "adjective" but great is an "adjective satellite", so path similarity returns None. My thought process was to do the following:

  1. POS-tag the sentence
  2. Use those POS tags to find the right Synsets
  3. Compute the similarity of the two Synsets
  4. If they are above some threshold, count the occurrences of the two words
  5. Replace the less frequent word with the more frequent one
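Steps 3-5 would look roughly like this (a sketch only: the 0.5 threshold is a placeholder, and the WordNet helper naively takes the first synset for each word, which is exactly where the adjective-vs-satellite problem bites):

```python
from collections import Counter

def pick_canonical(word_a, word_b, counts):
    """Step 5: keep whichever variant is more frequent in the corpus."""
    return word_a if counts[word_a] >= counts[word_b] else word_b

def maybe_merge(word_a, word_b, counts, similarity, threshold=0.5):
    """Steps 3-5: merge the pair if the similarity function scores them
    above the threshold; otherwise leave both alone."""
    score = similarity(word_a, word_b)
    if score is None or score < threshold:  # WordNet similarities can return None
        return None
    return pick_canonical(word_a, word_b, counts)

def wordnet_similarity(word_a, word_b, pos='a'):
    """Steps 2-3 with WordNet (needs nltk plus its wordnet data).
    Naively takes the first synset for each word."""
    from nltk.corpus import wordnet as wn
    syns_a = wn.synsets(word_a, pos)
    syns_b = wn.synsets(word_b, pos)
    if not (syns_a and syns_b):
        return None
    return syns_a[0].path_similarity(syns_b[0])

counts = Counter({'good': 120, 'great': 80})
# With a stand-in similarity function the plumbing works:
print(maybe_merge('good', 'great', counts, lambda a, b: 0.9))  # good
```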

Ideally, though, I'd like an algorithm that can determine that good and great are similar in my corpus (perhaps in a co-occurrence sense), so that it can be extended to words that aren't part of general English but appear in my corpus, and so that it can be extended to n-grams (maybe Oracle and terrible are synonyms in my corpus, or feature engineering and feature creation are similar).

Any suggestions of algorithms, or suggestions for getting WordNet synsets to behave?


2 Answers


If you're going to use WordNet, you have these problems:

Problem 1: Word Sense Disambiguation (WSD), i.e. how to automatically determine which synset to use?

>>> for i in wn.synsets('good','a'):
...     print i.name, i.definition
... 
good.a.01 having desirable or positive qualities especially those suitable for a thing specified
full.s.06 having the normally expected amount
good.a.03 morally admirable
estimable.s.02 deserving of esteem and respect
beneficial.s.01 promoting or enhancing well-being
good.s.06 agreeable or pleasing
good.s.07 of moral excellence
adept.s.01 having or showing knowledge and skill and aptitude
good.s.09 thorough
dear.s.02 with or in a close or intimate relationship
dependable.s.04 financially sound
good.s.12 most suitable or right for a particular purpose
good.s.13 resulting favorably
effective.s.04 exerting force or influence
good.s.15 capable of pleasing
good.s.16 appealing to the mind
good.s.17 in excellent physical condition
good.s.18 tending to promote physical well-being; beneficial to health
good.s.19 not forged
good.s.20 not left to spoil
good.s.21 generally admired

>>> for i in wn.synsets('great','a'):
...     print i.name, i.definition
... 
great.s.01 relatively large in size or number or extent; larger than others of its kind
great.s.02 of major significance or importance
great.s.03 remarkable or out of the ordinary in degree or magnitude or effect
bang-up.s.01 very good
capital.s.03 uppercase
big.s.13 in an advanced stage of pregnancy

Let's say you somehow get the correct sense, maybe you tried something like this (https://github.com/alvations/pywsd), and you get the POS and synset right:

good.a.01 having desirable or positive qualities especially those suitable for a thing specified
great.s.01 relatively large in size or number or extent; larger than others of its kind
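In case it helps, the gloss-overlap idea behind such WSD tools fits in a few lines. A toy sketch (the sense inventory below is hand-copied from the WordNet output above; real code should use pywsd or a proper Lesk implementation):

```python
def toy_lesk(context, senses):
    """Pick the sense whose gloss shares the most words with the context.
    `senses` maps a sense name to its gloss string."""
    context_words = set(context.lower().split())
    def overlap(gloss):
        return len(context_words & set(gloss.lower().split()))
    return max(senses, key=lambda name: overlap(senses[name]))

senses = {
    'good.a.01': 'having desirable or positive qualities especially '
                 'those suitable for a thing specified',
    'estimable.s.02': 'deserving of esteem and respect',
    'good.s.17': 'in excellent physical condition',
}
print(toy_lesk('a thing with desirable qualities', senses))  # good.a.01
```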

Problem 2: How are you going to compare the 2 synsets?

Let's try similarity functions, but you realize that they give you no score:

>>> good = wn.synsets('good','a')[0]
>>> great = wn.synsets('great','a')[0]
>>> print max(wn.path_similarity(good,great), wn.path_similarity(great, good))
None
>>> print max(wn.wup_similarity(good,great), wn.wup_similarity(great, good))

>>> print max(wn.res_similarity(good,great,semcor_ic), wn.res_similarity(great, good,semcor_ic))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 1312, in res_similarity
    return synset1.res_similarity(synset2, ic, verbose)
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 738, in res_similarity
    ic1, ic2, lcs_ic = _lcs_ic(self, other, ic)
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 1643, in _lcs_ic
    (synset1, synset2))
nltk.corpus.reader.wordnet.WordNetError: Computing the least common subsumer requires Synset('good.a.01') and Synset('great.s.01') to have the same part of speech.
>>> print max(wn.jcn_similarity(good,great,semcor_ic), wn.jcn_similarity(great, good,semcor_ic))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 1316, in jcn_similarity
    return synset1.jcn_similarity(synset2, ic, verbose)
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 759, in jcn_similarity
    ic1, ic2, lcs_ic = _lcs_ic(self, other, ic)
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 1643, in _lcs_ic
    (synset1, synset2))
nltk.corpus.reader.wordnet.WordNetError: Computing the least common subsumer requires Synset('good.a.01') and Synset('great.s.01') to have the same part of speech.
>>> print max(wn.lin_similarity(good,great,semcor_ic), wn.lin_similarity(great, good,semcor_ic))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 1320, in lin_similarity
    return synset1.lin_similarity(synset2, ic, verbose)
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 789, in lin_similarity
    ic1, ic2, lcs_ic = _lcs_ic(self, other, ic)
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 1643, in _lcs_ic
    (synset1, synset2))
nltk.corpus.reader.wordnet.WordNetError: Computing the least common subsumer requires Synset('good.a.01') and Synset('great.s.01') to have the same part of speech.
>>> print max(wn.lch_similarity(good,great), wn.lch_similarity(great, good))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 1304, in lch_similarity
    return synset1.lch_similarity(synset2, verbose, simulate_root)
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 638, in lch_similarity
    (self, other))
nltk.corpus.reader.wordnet.WordNetError: Computing the lch similarity requires Synset('good.a.01') and Synset('great.s.01') to have the same part of speech.

Let's try a different pair of synsets; since good has both satellite-adjective and adjective senses while great only has satellite senses, let's go with the lowest common denominator:

good.s.13 resulting favorably
great.s.01 relatively large in size or number or extent; larger than others of its kind

You realize that there is still no similarity information for comparing satellite adjectives:

>>> print max(wn.lin_similarity(good,great,semcor_ic), wn.lin_similarity(great, good,semcor_ic))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 1320, in lin_similarity
    return synset1.lin_similarity(synset2, ic, verbose)
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 789, in lin_similarity
    ic1, ic2, lcs_ic = _lcs_ic(self, other, ic)
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 1645, in _lcs_ic
    ic1 = information_content(synset1, ic)
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 1666, in information_content
    raise WordNetError(msg % synset.pos)
nltk.corpus.reader.wordnet.WordNetError: Information content file has no entries for part-of-speech: s
>>> print max(wn.path_similarity(good,great), wn.path_similarity(great, good))
None

Now it seems like WordNet is creating more problems than it's solving here, so let's try another means: word clustering, see http://en.wikipedia.org/wiki/Word-sense_induction

This is also where I give up on answering the broad and open-ended question that the OP has posted, because there's a LOT done in clustering that is automagic to mere mortals like me =)

Answered 2014-01-07T00:26:51.370

You said (emphasis mine):

Ideally, though, I'd like an algorithm that can determine that good and great are similar in my corpus (perhaps in a co-occurrence sense)

You can measure word similarity by measuring how often the words appear in the same sentence as other words (i.e. co-occurrence). To capture more semantic relatedness, you could perhaps also capture collocations, i.e. how often a word appears within a window of words around another word.
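A rough sketch of such co-occurrence features, using a small word window over made-up reviews (real corpora would want larger windows and an association measure such as PMI, both of which are my assumptions here):

```python
from collections import Counter
from math import sqrt

def context_vectors(sentences, window=2):
    """Count, for each word, which other words appear within `window`
    tokens of it (a bag-of-contexts representation)."""
    vecs = {}
    for sent in sentences:
        toks = sent.lower().split()
        for i, w in enumerate(toks):
            ctx = toks[max(0, i - window):i] + toks[i + 1:i + 1 + window]
            vecs.setdefault(w, Counter()).update(ctx)
    return vecs

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

reviews = [
    'good product at a good price',
    'great product at a great price',
    'terrible product awful price',
]
vecs = context_vectors(reviews)
print(cosine(vecs['good'], vecs['great']))     # near 1.0: near-identical contexts
print(cosine(vecs['good'], vecs['terrible']))  # much lower
```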

This paper deals with word sense disambiguation (WSD), and it uses collocations and surrounding words (co-occurrences) as part of its feature space. The results are quite good, so I suppose you could use the same features for your problem.

In Python, you can use sklearn; in particular, you may want to look at SVMs (with example code) to get you started.

The general idea is this:

  1. Get a pair of bigrams you want to check for similarity
  2. Using your corpus, generate collocation and co-occurrence features for each bigram
  3. Train an SVM to learn the features of the first bigram
  4. Run the SVM on the occurrences of the other bigram (you get some score here)
  5. Possibly use the score to decide whether the two bigrams are similar to each other
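Very roughly, and under heavy assumptions (scikit-learn installed, a toy window of 3, CountVectorizer standing in for real collocation features, and a hand-picked contrast bigram for the negative class), the steps above might look like:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

def contexts_of(bigram, sentences, window=3):
    """Step 2: collect the words around each occurrence of `bigram`."""
    first, second = bigram.split()
    out = []
    for sent in sentences:
        toks = sent.lower().split()
        for i in range(len(toks) - 1):
            if toks[i] == first and toks[i + 1] == second:
                ctx = toks[max(0, i - window):i] + toks[i + 2:i + 2 + window]
                out.append(' '.join(ctx))
    return out

sentences = [
    'this is a good product and a good value for the money',
    'a great product and a great value overall',
    'terrible product awful value would not buy again',
]
pos_ctx = contexts_of('good product', sentences)       # contexts of bigram 1
neg_ctx = contexts_of('terrible product', sentences)   # contrast bigram

# Step 3: train on "contexts of bigram 1" vs. "contexts of the contrast bigram"
vec = CountVectorizer()
X = vec.fit_transform(pos_ctx + neg_ctx)
y = [1] * len(pos_ctx) + [0] * len(neg_ctx)
clf = LinearSVC().fit(X, y)

# Steps 4-5: score the contexts of the candidate bigram; scores above those
# of the contrast bigram suggest it behaves like bigram 1
cand = vec.transform(contexts_of('great product', sentences))
print(clf.decision_function(cand))
```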
Answered 2014-01-07T10:13:49.140