text - How to measure syntactic similarity between a query and a document?

Question

Is there a way to measure the syntactic similarity between a query (a sentence) and a document (a set of sentences)?

score 4 · Accepted Answer

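Have you considered deep linguistic processing tools that use deep grammars such as HPSG and LFG? If you are looking for feature-based syntactic similarity, you could look at the work of Kenji Sagae and Andrew S. Gordon, who use PropBank to compute the syntactic similarity of verbs and then cluster similar verbs to improve an HPSG grammar.

For a simpler approach, I would suggest just looking at dependency parses and grouping sentences that have the same parse nodes. Or just POS-tag the sentences and compare the sentences that have the same POS tags.

For a simple example, first download and install NLTK ( http://nltk.org/ ) and the hunpos tagger ( http://code.google.com/p/hunpos/ ). Unzip en_wsj.model.gz and save it in the same location as your Python script.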
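As a rough sketch of the grouping idea, sentences can be bucketed by their POS tag sequence so that sentences with identical sequences land in the same group. The function name `group_by_pos_signature` and the third sentence are made up for illustration, and the (word, tag) pairs are hand-written Penn Treebank style tags rather than real tagger output.

```python
from collections import defaultdict


def group_by_pos_signature(tagged_sentences):
    """Group sentences whose POS tag sequences are identical."""
    groups = defaultdict(list)
    for tagged in tagged_sentences:
        # The tuple of tags acts as the sentence's syntactic "signature"
        signature = tuple(tag for _, tag in tagged)
        words = " ".join(word for word, _ in tagged)
        groups[signature].append(words)
    return dict(groups)


# Hand-written (word, tag) pairs for illustration only (assumption);
# in practice these would come from a POS tagger.
tagged_sentences = [
    [("This", "DT"), ("is", "VBZ"), ("a", "DT"), ("short", "JJ"), ("sentence", "NN")],
    [("That", "DT"), ("is", "VBZ"), ("the", "DT"), ("same", "JJ"), ("sentence", "NN")],
    [("Dogs", "NNS"), ("bark", "VBP")],
]

groups = group_by_pos_signature(tagged_sentences)
for signature, sentences in groups.items():
    print(signature, sentences)
```

Exact-signature grouping is strict: one extra adjective puts a sentence in a different group, so it is best treated as a first pass.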

from nltk.tag.hunpos import HunposTagger
from nltk.tokenize import word_tokenize

s1 = "This is a short sentence"
s2 = "That is the same sentence"

# Load the hunpos model saved next to this script
ht = HunposTagger('en_wsj.model')

# Tag the sentences with HunPos
t1 = ht.tag(word_tokenize(s1))
t2 = ht.tag(word_tokenize(s2))

# Extract only the POS tags
pos1 = [i[1] for i in t1]
pos2 = [j[1] for j in t2]

# Strict comparison: the tag sequences must match exactly
if pos1 == pos2:
    print("same sentence according to POS tags")
else:
    print("diff sentences according to POS tags")

The script above outputs:

>>> print(pos1)
['DT', 'VBZ', 'DT', 'JJ', 'NN']
>>> print(pos2)
['DT', 'VBZ', 'DT', 'JJ', 'NN']
>>> if pos1 == pos2:
...     print("same sentence according to POS tags")
... else:
...     print("diff sentences according to POS tags")
... 
same sentence according to POS tags

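To modify the code above, try:

Use dependency parses instead of comparing POS tags
Instead of a strict list comparison, come up with some statistical measure of the degree of difference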
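One possible way to replace the strict comparison with a graded one is sketched below, using `difflib.SequenceMatcher` from the Python standard library to score two tag sequences between 0 and 1. The function names, the example tag lists, and the choice to score a document by its best-matching sentence are all assumptions made for illustration; averaging over sentences would be an equally reasonable choice.

```python
from difflib import SequenceMatcher


def pos_similarity(tags_a, tags_b):
    """Return a score in [0, 1]; 1.0 means the tag sequences are identical."""
    return SequenceMatcher(None, tags_a, tags_b).ratio()


def query_document_similarity(query_tags, document_tags):
    """Score a query against a document as the best match over its sentences.

    Taking the maximum is an assumption of this sketch, not a standard.
    """
    return max((pos_similarity(query_tags, s) for s in document_tags), default=0.0)


# Hand-written tag sequences for illustration (assumption);
# in practice they would come from a POS tagger.
query = ["DT", "VBZ", "DT", "JJ", "NN"]
document = [
    ["DT", "VBZ", "DT", "NN"],  # same pattern without the adjective
    ["NNS", "VBP"],             # unrelated pattern
]

print(query_document_similarity(query, document))
```

The same scoring could be applied to sequences of dependency labels instead of POS tags, given a parser that produces them.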

score 1 · Accepted Answer

Are you looking for something like Apache Lucene?

answered 2013-03-03T21:24:53.753
