php - String comparison of Wikipedia articles

Question

I am retrieving the Wikipedia categories of a request:

http://en.wikipedia.org/w/api.php?format=json&action=query&prop=categories&cllimit=5000&titles=request

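What I want to do next is compare the description article of each category with a string I already have, to find which one is the best match. I would like a metric that computes the similarity between two texts while taking semantic meaning into account. Do you know of any library that does this, or that computes the vector space model distance between strings?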

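For example, the request http://en.wikipedia.org/w/api.php?format=json&action=query&prop=categories&cllimit=5000&titles=Machine%20learning returns an array like the one shown below. I want to compare the article of each category with a string and find the best-matching article, which in this case would be http://en.wikipedia.org/wiki/Machine_learning, the article at index 7.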

[categories] => Array
                                (
                                    [0] => Array
                                        (
                                            [ns] => 14
                                            [title] => Category:All articles needing additional references
                                        )

                                    [1] => Array
                                        (
                                            [ns] => 14
                                            [title] => Category:All articles with unsourced statements
                                        )

                                    [2] => Array
                                        (
                                            [ns] => 14
                                            [title] => Category:Articles needing additional references from February 2013
                                        )

                                    [3] => Array
                                        (
                                            [ns] => 14
                                            [title] => Category:Articles with unsourced statements from March 2013
                                        )

                                    [4] => Array
                                        (
                                            [ns] => 14
                                            [title] => Category:Cybernetics
                                        )

                                    [5] => Array
                                        (
                                            [ns] => 14
                                            [title] => Category:Learning
                                        )

                                    [6] => Array
                                        (
                                            [ns] => 14
                                            [title] => Category:Learning in computer vision
                                        )

                                    [7] => Array
                                        (
                                            [ns] => 14
                                            [title] => Category:Machine learning
                                        )

                                )
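To make the shape of that data concrete, here is a minimal sketch of pulling the category titles out of the decoded JSON. It assumes the response nests the categories under query, pages, and a page id, as the API usually does. The helper name categoryTitles, the page id "1", and the abbreviated sample response are placeholders for illustration, hard-coded so the sketch runs without a network call.

```php
<?php
// Hypothetical helper: collects category titles from a decoded API response.
// Assumes the nesting query -> pages -> <page id> -> categories.
function categoryTitles(array $response): array
{
    $titles = [];
    foreach ($response['query']['pages'] ?? [] as $page) {
        foreach ($page['categories'] ?? [] as $category) {
            // Strip the "Category:" namespace prefix if present.
            $titles[] = preg_replace('/^Category:/', '', $category['title']);
        }
    }
    return $titles;
}

// Abbreviated sample response (placeholder page id, three of the categories).
$json = '{"query":{"pages":{"1":{"title":"Machine learning","categories":[
    {"ns":14,"title":"Category:Cybernetics"},
    {"ns":14,"title":"Category:Learning"},
    {"ns":14,"title":"Category:Machine learning"}]}}}}';

$titles = categoryTitles(json_decode($json, true));
print_r($titles);
```

Each title can then be used to fetch the corresponding article text for comparison.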

score 1 · Accepted Answer

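A common way to compare topic similarity in information retrieval is cosine similarity (http://en.wikipedia.org/wiki/Cosine_similarity). I think that is what you mean by "vector space model distance between strings".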

Several libraries implement it (Lucene, Weka, RapidMiner, ...). If you need to, you can also implement it yourself.
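As a rough illustration of implementing it yourself, here is a minimal sketch of cosine similarity over plain term-frequency vectors. The function names termFrequencies and cosineSimilarity and the two sample descriptions are made up for this example, and lowercasing is ASCII-only. It measures word overlap only, so a real setup would probably add TF-IDF weighting, stop-word removal, or stemming.

```php
<?php
// Builds a term-frequency vector (word => count) from raw text.
function termFrequencies(string $text): array
{
    // Split on anything that is not a letter or digit; lowercase ASCII only.
    $words = preg_split('/[^\p{L}\p{N}]+/u', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    return array_count_values($words ?: []);
}

// Cosine similarity between the term-frequency vectors of two texts.
function cosineSimilarity(string $a, string $b): float
{
    $va = termFrequencies($a);
    $vb = termFrequencies($b);

    $dot = 0.0;
    foreach ($va as $term => $count) {
        $dot += $count * ($vb[$term] ?? 0);
    }

    $normA = 0.0;
    foreach ($va as $count) {
        $normA += $count * $count;
    }
    $normB = 0.0;
    foreach ($vb as $count) {
        $normB += $count * $count;
    }

    // An empty text has no direction, so treat its similarity as 0.
    if ($normA == 0.0 || $normB == 0.0) {
        return 0.0;
    }
    return $dot / (sqrt($normA) * sqrt($normB));
}

// Placeholder descriptions standing in for the fetched category articles.
$query = 'algorithms that learn from data';
$candidates = [
    'Cybernetics' => 'the study of control and communication in animals and machines',
    'Machine learning' => 'the study of algorithms that learn from data and improve with experience',
];

$best = null;
$bestScore = -1.0;
foreach ($candidates as $title => $description) {
    $score = cosineSimilarity($query, $description);
    if ($score > $bestScore) {
        $bestScore = $score;
        $best = $title;
    }
}
echo $best, "\n"; // Machine learning
```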

I hope this helps.

score 1 · Accepted Answer

Levenshtein: it compares two strings and returns the number of changes needed to make them identical.

Easily my favourite-named PHP function:

http://php.net/manual/en/function.levenshtein.php

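Note that this is only a direct comparison between two simple strings, and it is limited to 255 characters (in PHP versions before 8.0), so for longer text you may need to split the text and compare it in chunks.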
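One way the chunked comparison could look is sketched below. The helper name chunkedLevenshtein is hypothetical, and the result is only an approximation: summing the distances of aligned chunks gives an upper bound on the true edit distance, not the exact value, and chunks are cut by bytes rather than by characters.

```php
<?php
// Hypothetical helper: approximates the edit distance of long strings by
// summing levenshtein() over aligned chunks of at most $size bytes.
function chunkedLevenshtein(string $a, string $b, int $size = 255): int
{
    // Guard the empty string, since str_split('') differs between PHP versions.
    $chunksA = $a === '' ? [] : str_split($a, $size);
    $chunksB = $b === '' ? [] : str_split($b, $size);

    $total = 0;
    $n = max(count($chunksA), count($chunksB));
    for ($i = 0; $i < $n; $i++) {
        // A missing chunk is compared against the empty string.
        $total += levenshtein($chunksA[$i] ?? '', $chunksB[$i] ?? '');
    }
    return $total;
}

echo chunkedLevenshtein('kitten', 'sitting'), "\n"; // 3
```

For short inputs that fit in one chunk this is identical to calling levenshtein() directly.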
