php - Finding common phrases of 3-8 words in a body of text using PHP

Question

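I'm looking for a way to find common phrases within a body of text using PHP. If it isn't possible in PHP, I'd be interested in other web languages that would help me get this done.

Memory and speed are not an issue.

Right now I can easily find keywords, but I don't know how to go about searching for phrases.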

score 4 · Accepted Answer

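I've written a PHP script that does just that, right here. It first splits the source text into an array of words and their occurrence counts. Then it counts common sequences of those words with the specified parameters. It's old code with no comments, but maybe you'll find it useful.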
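The linked script is not reproduced here, so the following is only a minimal sketch of the approach described: split the text into words, then count every run of 3 to 8 consecutive words. The function name `commonPhrases` and its parameters are hypothetical, and the code assumes PHP 7.4 or later and plain ASCII text.

```php
<?php
// Hypothetical helper (not the answerer's script): count every run of
// $minWords..$maxWords consecutive words and keep the runs that repeat.
function commonPhrases(string $text, int $minWords = 3, int $maxWords = 8, int $minCount = 2): array
{
    // Lowercase so "The quick" and "the quick" count as the same phrase.
    // Format 1 makes str_word_count() return the words as an array.
    $words = str_word_count(strtolower($text), 1);
    $total = count($words);
    $counts = [];

    for ($n = $minWords; $n <= $maxWords; $n++) {
        // Slide a window of $n words across the text.
        for ($i = 0; $i + $n <= $total; $i++) {
            $phrase = implode(' ', array_slice($words, $i, $n));
            $counts[$phrase] = ($counts[$phrase] ?? 0) + 1;
        }
    }

    // Keep phrases seen at least $minCount times, most frequent first.
    $counts = array_filter($counts, fn($c) => $c >= $minCount);
    arsort($counts);
    return $counts;
}

$sample = 'The quick brown fox jumps. Later the quick brown fox sleeps.';
$result = commonPhrases($sample);
print_r($result);
```

Since the question says memory and speed are not a concern, the sketch keeps every candidate phrase in one array and filters at the end. Note that it ignores sentence boundaries, so a phrase may span a period.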

score 1 · Accepted Answer

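PHP only? The most straightforward approach I can think of is:

Add each phrase to an array
Take the first phrase from the array and remove it
Find the number of phrases that match it and remove those, keeping a count of matches
Push the phrase and the number of matches to a new array
Repeat until the initial array is empty

I'm rubbish at formal CS, but I believe this is of n^2 complexity, specifically n(n-1)/2 comparisons in the worst case. I have no doubt there are better ways to do this, but you mentioned that efficiency is a non-issue, so this will do.

Code follows (I used a function that was new to me, array_keys, which accepts a search parameter):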

// assign the source text to $text
$text = file_get_contents('mytext.txt');

// there are other ways to do this, like preg_match_all,
// but this is computationally the simplest
$phrases = explode('.', $text);

// filter the phrases
// if you're in PHP5, you can use a foreach loop here
$num_phrases = count($phrases);
for($i = 0; $i < $num_phrases; $i++) {
  $phrases[$i] = trim($phrases[$i]);
}

$counts = array();

while(count($phrases) > 0) {
  $p = array_shift($phrases);
  // strict comparison, so numeric-looking phrases such as "1e1" and "10" stay distinct
  $keys = array_keys($phrases, $p, true);
  $c = count($keys);
  $counts[$p] = $c + 1;

  if($c > 0) {
    foreach($keys as $key) {
      unset($phrases[$key]);
    }
  }
}

print_r($counts);

See it in action: http://ideone.com/htDSC
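As one of the "better ways" hinted at above, the same tally can probably be produced in a single pass with the built-in array_count_values(), which avoids the repeated searches. This is a sketch of that swapped-in technique, not the answerer's code; the sample text is made up for illustration, and unlike the loop above it drops empty fragments.

```php
<?php
// Illustrative input; the answer above reads its text from mytext.txt.
$text = 'It rains. It pours. It rains. The end. It rains.';

// Split on periods, trim whitespace, and drop empty fragments
// (such as the one left after the final period).
$phrases = array_filter(array_map('trim', explode('.', $text)), 'strlen');

// array_count_values() tallies identical strings in a single pass.
$counts = array_count_values($phrases);

// Sort by count, highest first, keeping the phrase keys.
arsort($counts);

print_r($counts);
```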

score 1 · Accepted Answer

I think you should go with

str_word_count

$str = "Hello friend, you're
       looking          good today!";

print_r(str_word_count($str, 1));

will give

Array
(
    [0] => Hello
    [1] => friend
    [2] => you're
    [3] => looking
    [4] => good
    [5] => today
)

Then you can use array_count_values()

$array = array(1, "hello", 1, "world", "hello");
print_r(array_count_values($array));

This will give you

Array
(
    [1] => 2
    [hello] => 2
    [world] => 1
)
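Chaining the two calls might look like the sketch below. The sample string is extended for illustration, and the lowercasing step is an assumption about what "same word" should mean.

```php
<?php
$str = "Hello friend, you're looking good today! Hello again, friend.";

// Lowercase first so "Hello" and "hello" are counted together,
// then split into words (format 1 returns an array of the words).
$words = str_word_count(strtolower($str), 1);

// Tally how often each word occurs.
$frequencies = array_count_values($words);

print_r($frequencies);
```

This counts single words only. For phrases, the words would still need to be grouped into consecutive runs before counting.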

score 0 · Accepted Answer

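An ugly solution, since you said ugly is OK, would be to search for the first word of any of your phrases. Then, once that word is found, check whether the next word after it matches the next expected word in the phrase. This would be a loop that keeps going as long as the hits are positive, until either a word does not match or the phrase is complete.

Simple, but exceedingly ugly and probably very, very slow.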
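A rough sketch of that loop, assuming the phrase is already known and both the text and the phrase have been split into lowercase words. The helper name `countPhrase` is hypothetical.

```php
<?php
// Hypothetical helper: count how often a known phrase occurs by finding
// its first word and then checking the words that follow it.
function countPhrase(array $words, array $phrase): int
{
    $hits = 0;
    $len = count($phrase);
    if ($len === 0) {
        return 0; // nothing to search for
    }

    $total = count($words);
    for ($i = 0; $i + $len <= $total; $i++) {
        if ($words[$i] !== $phrase[0]) {
            continue; // first word does not match, move on
        }
        // First word matched: keep comparing until a mismatch or the phrase ends.
        $j = 1;
        while ($j < $len && $words[$i + $j] === $phrase[$j]) {
            $j++;
        }
        if ($j === $len) {
            $hits++; // every word of the phrase matched
        }
    }
    return $hits;
}

$words = str_word_count(strtolower('To be or not to be, that is the question. To be fair, not to be.'), 1);
$count = countPhrase($words, ['not', 'to', 'be']);
echo $count, "\n";
```

This only works for phrases you already have in hand. It does not discover unknown common phrases on its own.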

score 0 · Accepted Answer

Late to the party, but since I stumbled upon this while looking to do something similar, I thought I'd share where I landed in 2019:

https://packagist.org/packages/yooper/php-text-analysis

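This library made my task trivial. In my case, I had an array of search phrases that I ended up breaking into single terms, normalizing, and then turning into two- and three-word ngrams. Looping through the resulting ngrams, I was able to easily summarize the frequency of specific phrases.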

$words   = tokenize($searchPhraseText);
$words   = normalize_tokens($words);
$ngram2  = array_unique(ngrams($words, 2));
$ngram3  = array_unique(ngrams($words, 3));

Very cool library with a lot to offer.

score -2 · Accepted Answer

If you want to do full-text search across HTML files, use Sphinx, a powerful search server. The documentation is here
