I am retrieving the Wikipedia categories for a given title with a request like this:
http://en.wikipedia.org/w/api.php?format=json&action=query&prop=categories&cllimit=5000&titles=request
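What I want to do next is compare the descriptive article of each category with a string and find which one is the best match. I am looking for a measure of similarity between two texts that takes semantic meaning into account. Do you know of a library that can do this, or one that computes vector space model distances between strings?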
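To make the "vector space model distance" part concrete, here is a minimal sketch in plain Python of what I have in mind: TF-IDF weighted term vectors compared by cosine similarity. The function names (`tokenize`, `tfidf_vectors`, `cosine`, `best_match`) are my own and do not come from any library, and the smoothed IDF formula is just one common choice. Note that this only measures term overlap; it does not capture semantic meaning, so "learn" and "learning" count as different terms.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(texts):
    """Return one sparse TF-IDF vector (a dict of term -> weight) per text."""
    token_lists = [tokenize(t) for t in texts]
    n_docs = len(token_lists)
    # Document frequency: in how many texts does each term occur?
    df = Counter(term for tokens in token_lists for term in set(tokens))
    vectors = []
    for tokens in token_lists:
        tf = Counter(tokens)
        # Smoothed IDF (an assumption): terms found in every text keep a nonzero weight.
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either one is empty."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def best_match(query, documents):
    """Return (index, score) of the document most similar to the query."""
    vectors = tfidf_vectors([query] + list(documents))
    query_vec, doc_vecs = vectors[0], vectors[1:]
    scores = [cosine(query_vec, d) for d in doc_vecs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]
```

Here `documents` would hold the text of each category's article and `query` the string to compare against; `best_match` expects at least one document.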
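For example, the request http://en.wikipedia.org/w/api.php?format=json&action=query&prop=categories&cllimit=5000&titles=Machine%20learning
returns an array like the one shown below. I want to compare the article for each category against a string and find the best-matching article, which in this case would be http://en.wikipedia.org/wiki/Machine_learning
(the entry at index 7, the last one in the array).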
[categories] => Array
    (
        [0] => Array
            (
                [ns] => 14
                [title] => Category:All articles needing additional references
            )
        [1] => Array
            (
                [ns] => 14
                [title] => Category:All articles with unsourced statements
            )
        [2] => Array
            (
                [ns] => 14
                [title] => Category:Articles needing additional references from February 2013
            )
        [3] => Array
            (
                [ns] => 14
                [title] => Category:Articles with unsourced statements from March 2013
            )
        [4] => Array
            (
                [ns] => 14
                [title] => Category:Cybernetics
            )
        [5] => Array
            (
                [ns] => 14
                [title] => Category:Learning
            )
        [6] => Array
            (
                [ns] => 14
                [title] => Category:Learning in computer vision
            )
        [7] => Array
            (
                [ns] => 14
                [title] => Category:Machine learning
            )
    )
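For reference, this is roughly how I pull the category titles out of the decoded JSON before comparing anything. It is only a sketch in Python: the sample response is hand-written to mirror the array above (the page id `12345` is a placeholder, not the real one), and the prefix list used to drop maintenance categories such as "All articles needing additional references" is a heuristic of my own, not something the API provides.

```python
import json

# Hand-written sample in the shape returned for format=json;
# the page id "12345" is a placeholder.
SAMPLE_RESPONSE = """
{"query": {"pages": {"12345": {"pageid": 12345, "ns": 0, "title": "Machine learning",
  "categories": [
    {"ns": 14, "title": "Category:All articles needing additional references"},
    {"ns": 14, "title": "Category:Cybernetics"},
    {"ns": 14, "title": "Category:Learning"},
    {"ns": 14, "title": "Category:Machine learning"}
  ]}}}}
"""

# Heuristic (an assumption): title prefixes that mark maintenance categories.
MAINTENANCE_PREFIXES = ("Category:All articles", "Category:Articles ")


def category_titles(response):
    """Collect the category titles of every page in a decoded API response."""
    titles = []
    for page in response.get("query", {}).get("pages", {}).values():
        for cat in page.get("categories", []):
            titles.append(cat["title"])
    return titles


def topical_categories(titles):
    """Drop categories whose title starts with a maintenance prefix."""
    return [t for t in titles if not t.startswith(MAINTENANCE_PREFIXES)]


def strip_namespace(title):
    """Turn 'Category:Learning' into 'Learning'; leave other titles unchanged."""
    prefix = "Category:"
    return title[len(prefix):] if title.startswith(prefix) else title


titles = topical_categories(category_titles(json.loads(SAMPLE_RESPONSE)))
```

The remaining titles are the candidates whose articles I would then compare against the string.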