php - Comparing a large number of texts (clustering) with a matrix

Question

I have the following PHP function to calculate the relation between two texts:

function check($terms_in_article1, $terms_in_article2) {
    $length1 = count($terms_in_article1); // number of words
    $length2 = count($terms_in_article2); // number of words
    $all_terms = array_merge($terms_in_article1, $terms_in_article2);
    $all_terms = array_unique($all_terms);
    foreach ($all_terms as $all_termsa) {
        $term_vector1[$all_termsa] = 0;
        $term_vector2[$all_termsa] = 0;
    }
    foreach ($terms_in_article1 as $terms_in_article1a) {
        $term_vector1[$terms_in_article1a]++;
    }
    foreach ($terms_in_article2 as $terms_in_article2a) {
        $term_vector2[$terms_in_article2a]++;
    }
    $score = 0;
    foreach ($all_terms as $all_termsa) {
        $score += $term_vector1[$all_termsa]*$term_vector2[$all_termsa];
    }
    $score = $score/($length1*$length2);
    $score *= 500; // for better readability
    return $score;
}

The variable $terms_in_articleX must be an array containing every single word that appears in the text.

Assuming I have a database of 20,000 texts, this function would take a very long time to run through all the pairings.

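How can I speed this process up? Should I add all texts to one huge matrix instead of always comparing only two texts? It would be great if you had some approaches with code, preferably in PHP.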

I hope you can help me. Thanks in advance!

score 4 · Accepted Answer

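You can split the text at the moment you add it. A simple example: preg_match_all('/\w+/', $text, $matches); Real splitting is of course not that simple, but it is possible: just adjust the pattern :)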

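Create a table like this: id (int, primary, autoincrement), value (varchar, unique), and a link table like this: word_id (int), text_id (int), word_count (int). Then fill the tables with the new values after splitting the text.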

Finally, you can do anything you want with this data, working quickly with indexed integers (IDs) in the DB.
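To make the splitting step concrete, here is a minimal sketch that turns one text into per-term counts, which is what one row per term in the link table would hold. The name `count_terms`, the lowercasing, and the naive `\w+` pattern are assumptions for illustration only, not part of the original suggestion.

```php
<?php
// Hypothetical helper: split a text into words and count each one.
// Returns an array mapping term => number of occurrences.
function count_terms($text) {
    // \w+ is a deliberately naive word pattern (assumption); adjust as needed.
    preg_match_all('/\w+/', strtolower($text), $matches);
    return array_count_values($matches[0]);
}

$counts = count_terms('This is just a test, just a test');
// $counts['just'] is 2 and $counts['this'] is 1
```

Each term => count pair would then become one (word_id, text_id, word_count) row once the term has been looked up or inserted in the terms table.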

UPDATE: Here are the tables and queries:

CREATE TABLE terms (
    id int(11) NOT NULL auto_increment, value char(255) NOT NULL,
    PRIMARY KEY  (`id`), UNIQUE KEY `value` (`value`)
);

CREATE TABLE `terms_in_articles` (
    term int(11) NOT NULL, 
    article int(11) NOT NULL, 
    cnt int(11) NOT NULL default '1',
    UNIQUE KEY `term` (`term`,`article`)
);


/* Returns all unique terms in both articles (your $all_terms) */
SELECT DISTINCT t.id, t.value 
FROM terms t, terms_in_articles a 
WHERE a.term = t.id AND a.article IN (1, 2);

/* Returns your $term_vector1, $term_vector2 */
SELECT article, term, cnt 
FROM terms_in_articles 
WHERE article IN (1, 2) ORDER BY article;

/* Returns article and total count of term entries in it ($length1, $length2) */
SELECT article, SUM(cnt) AS total 
FROM terms_in_articles 
WHERE article IN (1, 2) GROUP BY article;

/* Returns your $score, which you may divide by ($length1 * $length2) from the previous query */
SELECT SUM(tmp.term_score) * 500 AS total_score FROM 
(
    SELECT (a1.cnt * a2.cnt) AS term_score 
    FROM terms_in_articles a1, terms_in_articles a2 
    WHERE a1.article = 1 AND a2.article = 2 AND a1.term = a2.term
    GROUP BY a2.term, a1.term
) AS tmp;

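Well, I hope this helps. The last 2 queries are enough to perform your task; the other queries are there just in case. Of course, you can compute more statistics, such as "the most popular terms".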
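For anyone who wants to check the queries against the original function, here is a rough in-memory PHP counterpart of the last two queries. It assumes the rows of terms_in_articles for each article have already been loaded into an array mapping term ID => cnt; the function name `score_from_counts` and the sample term IDs are made up for the example.

```php
<?php
// Hypothetical helper mirroring the last two queries in plain PHP.
// $cnt1 and $cnt2 map term ID => cnt for one article each (assumed to be
// loaded from terms_in_articles beforehand).
function score_from_counts(array $cnt1, array $cnt2) {
    $length1 = array_sum($cnt1); // total term entries, like SUM(cnt)
    $length2 = array_sum($cnt2);
    if ($length1 == 0 || $length2 == 0) {
        return 0.0; // an empty article shares nothing
    }
    $sum = 0;
    // Only terms present in both articles contribute, like a1.term = a2.term.
    foreach (array_intersect_key($cnt1, $cnt2) as $term => $cnt) {
        $sum += $cnt * $cnt2[$term];
    }
    return $sum / ($length1 * $length2) * 500;
}

$article1 = [1 => 1, 2 => 1, 3 => 1, 4 => 1, 5 => 1]; // "this is just a test"
$article2 = [1 => 1, 2 => 1, 6 => 1, 5 => 1];         // "this is not test"
$score = score_from_counts($article1, $article2);     // about 75
```

Three shared terms out of 5 * 4 possible products give 3 / 20 * 500, which is the same value check() returns for the two word arrays.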

score 1 · Accepted Answer

EDIT: Trying to be more explicit:

First, encode every term as an integer. You can use a dictionary associative array, like this:
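```
$count = 0;            // next unused term ID (keep it across documents)
$dict = array();       // term => integer ID (keep it across documents)
$doc_as_int = array(); // term ID => count within this document
foreach ($doc as $term) {
    if (!isset($dict[$term])) {
        $dict[$term] = $count++; // first time we see this term
    }
    $val = $dict[$term];
    $doc_as_int[$val] = isset($doc_as_int[$val]) ? $doc_as_int[$val] + 1 : 1;
}
```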
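This way, you replace string calculations with integer calculations. For example, you can represent the word "cloud" as the number 5, and then use index 5 of an array to store the count of the word "cloud". Note that we only use associative array lookups here; there is no need for CRC etc.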
Store all texts as a matrix, preferably a sparse one.
Use feature selection (PDF).
Maybe use a native implementation in a faster language.
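I suggest you first run K-means with about 20 clusters. This gives a rough idea of which documents are close to each other, and you then compare only the pairs inside each cluster. Assuming uniformly sized clusters, this reduces the number of comparisons to 20*200 + 20*10*9, around 6000 comparisons instead of 19900. (These figures appear to assume 200 documents, since 200*199/2 = 19900.)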
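The arithmetic behind that estimate can be reproduced with a few lines of PHP. This is only a sketch of the counting argument, assuming 200 documents, equal-sized clusters, and a single K-means assignment pass; the function names `all_pairs` and `clustered_estimate` are made up for the example.

```php
<?php
// Number of unordered pairs among $n documents: n * (n - 1) / 2.
function all_pairs($n) {
    return intdiv($n * ($n - 1), 2);
}

// The estimate from the answer (assumptions: equal-sized clusters, one pass):
// $k * $n distance computations for K-means, plus $k * $m * ($m - 1)
// comparisons inside the clusters, where $m = $n / $k is the cluster size.
function clustered_estimate($n, $k) {
    $m = intdiv($n, $k);
    return $k * $n + $k * $m * ($m - 1);
}

$n = 200;
$k = 20;
echo all_pairs($n), "\n";              // 19900
echo clustered_estimate($n, $k), "\n"; // 5800
```

Counting each pair inside a cluster only once would bring the second term down to 20 * 45 = 900, so the answer's figure is on the cautious side.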

score 1 · Accepted Answer

Here is a slightly optimized version of your original function. It produces exactly the same results. (I ran it on two Wikipedia articles with 10000+ terms, 20 runs each):

check():
test A score: 4.55712524522
test B score: 5.08138042619
--Time: 1.0707

check2():
test A score: 4.55712524522
test B score: 5.08138042619
--Time: 0.2624

Here is the code:

function check2($terms_in_article1, $terms_in_article2) {
    $length1 = count($terms_in_article1); // number of words
    $length2 = count($terms_in_article2); // number of words

    $score_table = array();
    foreach($terms_in_article1 as $term){
        if(!isset($score_table[$term])) $score_table[$term] = 0;
        $score_table[$term] += 1;
    }
    $score_table2 = array();
    foreach($terms_in_article2 as $term){
        if(isset($score_table[$term])){
            if(!isset($score_table2[$term])) $score_table2[$term] = 0;
            $score_table2[$term] += 1;
        }
    }
    $score =0;
    foreach($score_table2 as $key => $entry){
        $score += $score_table[$key] * $entry;
    }
    $score = $score / ($length1*$length2);
    $score *= 500;
    return $score;
}

(By the way, the time needed to split all the words into arrays is not included.)

score 0 · Accepted Answer

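If you can compare plain text instead of arrays, and if I understood your goal correctly, you can use the levenshtein PHP function (usually used to provide the Google-like "Did you mean ...?" feature in PHP search engines).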

It works the opposite way from your function: it returns the difference between two strings rather than their similarity.

Example:

<?php
function check($a, $b) {
    return levenshtein($a, $b);
}

$a = 'this is just a test';
$b = 'this is not test';
$c = 'this is just a test';

echo check($a, $b) . '<br />';
//return 5
echo check($a, $c) . '<br />';
//return 0, the strings are identical
?>

I don't know whether this will improve the execution speed, but it might, since you get rid of many foreach loops and the array_merge call.

EDIT:

A simple speed test (a script written in 30 seconds, so it is not 100% accurate):

function check($terms_in_article1, $terms_in_article2) {
    $length1 = count($terms_in_article1); // number of words
    $length2 = count($terms_in_article2); // number of words
    $all_terms = array_merge($terms_in_article1, $terms_in_article2);
    $all_terms = array_unique($all_terms);
    foreach ($all_terms as $all_termsa) {
        $term_vector1[$all_termsa] = 0;
        $term_vector2[$all_termsa] = 0;
    }
    foreach ($terms_in_article1 as $terms_in_article1a) {
        $term_vector1[$terms_in_article1a]++;
    }
    foreach ($terms_in_article2 as $terms_in_article2a) {
        $term_vector2[$terms_in_article2a]++;
    }
    $score = 0;
    foreach ($all_terms as $all_termsa) {
        $score += $term_vector1[$all_termsa]*$term_vector2[$all_termsa];
    }
    $score = $score/($length1*$length2);
    $score *= 500; // for better readability
    return $score;
}


$a = array('this', 'is', 'just', 'a', 'test');
$b = array('this', 'is', 'not', 'test');

$timenow = microtime();
list($m_i, $t_i) = explode(' ', $timenow);

for($i = 0; $i != 10000; $i++){
    check($a, $b);
}
$last = microtime();
list($m_f, $t_f) = explode(' ', $last);
$fine = $m_f+$t_f;
$inizio = $m_i+$t_i;
$quindi = $fine - $inizio;
$quindi = substr($quindi, 0, 7);
echo 'end in ' . $quindi . ' seconds';

Prints: end in 0.36765 seconds

Second test:

<?php
function check($a, $b) {
    return levenshtein($a, $b);
}

$a = 'this is just a test';
$b = 'this is not test';

$timenow = microtime();
list($m_i, $t_i) = explode(' ', $timenow);
for($i = 0; $i != 10000; $i++){
    check($a, $b);
}
$last = microtime();
list($m_f, $t_f) = explode(' ', $last);
$fine = $m_f+$t_f;
$inizio = $m_i+$t_i;
$quindi = $fine - $inizio;
$quindi = substr($quindi, 0, 7);
echo 'end in ' . $quindi . ' seconds';
?>

Prints: end in 0.05023 seconds

So yes, it seems faster. It would be good to try it with many array items (and many words for levenshtein).
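If you want to repeat these measurements with larger inputs, the timing boilerplate can be wrapped in a small helper. This is just a sketch: the name `time_runs` is made up, and it uses microtime(true), which returns the time as a float, instead of splitting the string form by hand.

```php
<?php
// Hypothetical helper: call $fn($a, $b) $runs times and return elapsed seconds.
function time_runs(callable $fn, $a, $b, $runs = 10000) {
    $start = microtime(true); // float seconds since the Unix epoch
    for ($i = 0; $i < $runs; $i++) {
        $fn($a, $b);
    }
    return microtime(true) - $start;
}

$elapsed = time_runs('levenshtein', 'this is just a test', 'this is not test');
printf("end in %.5f seconds\n", $elapsed);
```

Any two-argument function can be passed the same way, so the array-based and string-based variants can be timed with identical loop overhead.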

Second EDIT:

With similar_text, the speed seems to be about the same as with the levenshtein method:

<?php
function check($a, $b) {
    return similar_text($a, $b);
}

$a = 'this is just a test ';
$b = 'this is not test';

$timenow = microtime();
list($m_i, $t_i) = explode(' ', $timenow);
for($i = 0; $i != 10000; $i++){
    check($a, $b);
}
$last = microtime();
list($m_f, $t_f) = explode(' ', $last);
$fine = $m_f+$t_f;
$inizio = $m_i+$t_i;
$quindi = $fine - $inizio;
$quindi = substr($quindi, 0, 7);
echo 'end in ' . $quindi . ' seconds';
?>

Prints: end in 0.05988 seconds

But it can take strings longer than 255 characters. From the manual:

Note also that the complexity of this algorithm is O(N**3), where N is the length of the longest string.

And it can even return the similarity as a percentage:

function check($a, $b) {
    similar_text($a, $b, $p);
    return $p;
}

Yet another EDIT:

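What about creating a database function that does the comparison directly in the SQL query, instead of retrieving all the data and looping over it?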

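If you are running MySQL, take a look at this one (a hand-made levenshtein function, still with the 255-character limit). Otherwise, if you are on PostgreSQL, see this other one (many functions that should be evaluated).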

score 0 · Accepted Answer

Another approach would be Latent Semantic Analysis, which leverages a large corpus of data to find similarities between documents.

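It works by taking the co-occurrence matrix of the text and comparing it to the corpus, essentially giving you an abstract location of your document in a "semantic space". This will speed up your text comparison, because you can compare documents using Euclidean distance in the LSA semantic space. It makes for pretty interesting semantic indexing. As a result, adding new articles will not take much longer.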
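The comparison step itself is cheap once every document has coordinates. As a rough sketch, assuming each document has already been mapped to a numeric vector in the LSA space (that mapping is not shown here, and the name `euclidean_distance` is made up), the distance could be computed like this:

```php
<?php
// Euclidean distance between two equal-length numeric vectors.
// The vectors stand in for document coordinates in an LSA space
// (assumption: they were computed elsewhere).
function euclidean_distance(array $u, array $v) {
    if (count($u) !== count($v)) {
        throw new InvalidArgumentException('Vectors must have the same length');
    }
    $sum = 0.0;
    foreach ($u as $i => $x) {
        $d = $x - $v[$i];
        $sum += $d * $d;
    }
    return sqrt($sum);
}

echo euclidean_distance([0.0, 3.0], [4.0, 0.0]), "\n"; // 5
```

The cost per pair then depends on the number of dimensions kept rather than on the vocabulary size.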

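I can't give a specific use case for this approach, having only learned about it in school, but KnowledgeSearch appears to be an open source implementation of the algorithm.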

(Sorry, this is my first post, so I can't post links; please look it up yourself.)
