php - Finding patterns in two or more sets of text

Question

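I have a lot of data that I need to search for certain patterns.

The problem is that when looking for these patterns I have no reference for what I am looking for.

Or to put it another way, I have two paragraphs, each on a similar topic. I need to be able to compare the two paragraphs and find patterns: phrases that are said in both paragraphs, and how many times.

I can't seem to find a solution, because preg_match and the other functions require you to supply what you are looking for.

Example paragraphs

Paragraph 1:

Bee Pollen is made by honeybees, and is the food of the young bee. It is considered one of nature's most completely nourishing foods as it contains nearly all nutrients required by humans. Bee-gathered pollens are rich in proteins (approximately 40% protein), free amino acids, vitamins, including B-complex, and folic acid.

Paragraph 2:

Bee Pollen is made by honeybees. It is required for the fertilization of the plant. The tiny particles consist of 50/1,000-millimeter corpuscles, formed at the free end of the stamen in the heart of the blossom, nature's most completely nourishing foods. Every variety of flower in the universe puts forth a dusting of pollen. Many orchard fruits and agricultural food crops do, too.

So from these examples you can see these patterns:

Bee Pollen is made by honeybees

And:

nature's most completely nourishing foods

Both phrases are found in both paragraphs.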

score 1 · Accepted Answer

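This can be a complicated problem, depending on whether you are looking for similar phrases or phrases that match word for word.

Finding exact word-for-word matches is fairly simple: you just split on common break points such as punctuation (e.g. .,;:) and possibly on conjunctions too (e.g. and or). The problem comes with adjectives, for example, where two phrases may be identical except for a single word, like so:

The world is spinning around its axis at a tremendous speed.
The world is spinning around its axis at a magnificent speed.

These will not match, because tremendous and magnificent are used in place of one another. You could probably work around this, but that is a considerably more complex problem.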
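One possible way to approach the "similar but not identical" case is to score two phrases with PHP's built-in similar_text() and accept pairs above some cut-off. This is only a minimal sketch: the helper name phraseSimilarity and the 80% threshold are assumptions made for illustration, not part of the answer's method.

```php
<?php
// Hypothetical helper: percentage similarity of two phrases, ignoring case.
function phraseSimilarity(string $a, string $b): float
{
    // similar_text() writes the similarity percentage into its third argument
    similar_text(strtolower($a), strtolower($b), $percent);
    return $percent;
}

$a = 'The world is spinning around its axis at a tremendous speed';
$b = 'The world is spinning around its axis at a magnificent speed';

$threshold = 80.0; // Assumed cut-off; tune it against your own data
$score = phraseSimilarity($a, $b);

echo ($score >= $threshold ? 'similar' : 'different') . "\n";
```

A character-based score like this is cheap but crude: it cannot tell a synonym from an unrelated word of similar length, so treat it as a first filter rather than a real measure of meaning.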

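Answer

If we stick to the simple side, we can achieve phrase matching in just a few lines of code (4 in this example, not counting comments or formatting for readability).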

// $para1 and $para2 are assumed to hold the two paragraphs being compared
$wordSplits = 'and or on of as'; // List of words to split on
preg_match_all('/(?<m1>.*?)([.,;:\-]| '.str_replace(' ', ' | ', trim($wordSplits)).' )/i', $para1, $matches1);
preg_match_all('/(?<m2>.*?)([.,;:\-]| '.str_replace(' ', ' | ', trim($wordSplits)).' )/i', $para2, $matches2);
$commonPhrases = array_filter( // Removes blank $key => $value pairs
                    array_intersect( // Finds matching patterns
                        array_map(function($item){
                            return(strtolower(trim($item))); // Cleans $para1 values: lowercases, removes leading and trailing spaces
                        }, $matches1['m1']),
                        array_map(function($item){
                            return(strtolower(trim($item))); // Cleans $para2 values: lowercases, removes leading and trailing spaces
                        }, $matches2['m2'])
                    )
                );


var_dump($commonPhrases);
/**
OUTPUT:

array(2) {
  [0]=>
  string(31) "bee pollen is made by honeybees"
  [5]=>
  string(41) "nature's most completely nourishing foods"
}
*/

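The code above finds matches by splitting on punctuation (defined between the [ and ] of the preg_match_all pattern). It also concatenates the word list into the pattern, so the text is split on those words too (only where a word from the list has a space before and after it).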
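To try the approach end to end, the same logic can be wrapped in two small functions and run against the paragraphs from the question. This is a sketch only: the function names splitPhrases and commonPhrases are made up for illustration, and array_values() is added merely to renumber the result.

```php
<?php
// Hypothetical helper: split a text into cleaned phrases on punctuation and split words.
function splitPhrases(string $text, string $wordSplits = 'and or on of as'): array
{
    $words = str_replace(' ', ' | ', trim($wordSplits));
    preg_match_all('/(?<m>.*?)([.,;:\-]| ' . $words . ' )/i', $text, $matches);

    // Lowercase and trim so phrases compare equal regardless of case and padding
    return array_map(function ($item) {
        return strtolower(trim($item));
    }, $matches['m']);
}

// Hypothetical helper: phrases that occur in both paragraphs.
function commonPhrases(string $para1, string $para2): array
{
    $common = array_intersect(splitPhrases($para1), splitPhrases($para2));

    // array_filter() drops empty phrases; array_values() renumbers the keys
    return array_values(array_filter($common));
}

$para1 = "Bee Pollen is made by honeybees, and is the food of the young bee. It is considered one of nature's most completely nourishing foods as it contains nearly all nutrients required by humans. Bee-gathered pollens are rich in proteins (approximately 40% protein), free amino acids, vitamins, including B-complex, and folic acid.";
$para2 = "Bee Pollen is made by honeybees. It is required for the fertilization of the plant. The tiny particles consist of 50/1,000-millimeter corpuscles, formed at the free end of the stamen in the heart of the blossom, nature's most completely nourishing foods. Every variety of flower in the universe puts forth a dusting of pollen. Many orchard fruits and agricultural food crops do, too.";

print_r(commonPhrases($para1, $para2));
```

With these two paragraphs the result contains the two phrases shown in the output above.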

Word list

You can change the word list to include any breaks you like; edit the list until you get the phrases you want, for example:

$wordSplits = 'and or';
$wordSplits = 'and but if or';
$wordSplits = 'a an as and by but because if in is it of off on or';

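Punctuation

You can add any punctuation you like to the list (between the [ and the ]), but bear in mind that some characters have a special meaning and may need to be escaped (or placed appropriately): - and ^ should become \- and \^, or be placed where their special meaning does not come into play.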
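If you would rather not escape characters by hand, PHP's preg_quote() can do it when the character class is built. The sketch below assumes a made-up helper name, punctuationClass, and a made-up sample string.

```php
<?php
// Hypothetical helper: build a regex character class from a string of punctuation.
function punctuationClass(string $chars): string
{
    // preg_quote() escapes regex-special characters such as - and ^;
    // passing '/' as the second argument also escapes the pattern delimiter.
    return '[' . preg_quote($chars, '/') . ']';
}

$class = punctuationClass('.,;:-^');

// Split a sample string on any character in the class
$parts = preg_split('/' . $class . '/', 'a-b^c.d');
print_r($parts);
```

Because every character is escaped, the order in which you list the punctuation no longer matters.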

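You may want to consider changing:

([.,;:\-]|

to:

([.,;:\-] | //Adding a space before the pipe

That way you only split on punctuation that is followed by a space. For example, this means that items like 50,000 will not be split.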
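The difference between the two character-class variants can be seen in isolation with preg_split(). This is a small sketch with a made-up sample sentence; the variable names $loose and $strict are illustrative only.

```php
<?php
$text = 'It costs 50,000 dollars, or more.';

// Split on any listed punctuation character
$loose = preg_split('/[.,;:\-]/', $text, -1, PREG_SPLIT_NO_EMPTY);

// Split only on punctuation that is followed by a space
$strict = preg_split('/[.,;:\-] /', $text, -1, PREG_SPLIT_NO_EMPTY);

print_r($loose);  // "50,000" is broken into "50" and "000"
print_r($strict); // "50,000" stays in one piece
```

Note that with the stricter pattern a punctuation mark at the very end of the text is not followed by a space, so it is no longer treated as a break point; you may need to append a space to the text or allow end-of-string in the pattern.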

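Spaces and breaks

You may also want to consider changing the spaces to \s so that tabs, newlines and so on are included, not just spaces. Like so:

'/(?<m1>.*?)([.,;:\-]|\s'.str_replace(' ', '\s|\s', trim($wordSplits)).'\s)/i'

This would also apply to:

([.,;:\-]\s|

if you decide to go down that route.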
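A quick check of the \s variant, using a made-up string in which the split word is surrounded by a tab and a newline rather than spaces. This is a sketch under those assumptions, not a full test of the pattern.

```php
<?php
$wordSplits = 'and or';
$pattern = '/(?<m1>.*?)([.,;:\-]|\s' . str_replace(' ', '\s|\s', trim($wordSplits)) . '\s)/i';

// "and" has a tab before it and a newline after it
$text = "cats\tand\ndogs, birds.";

preg_match_all($pattern, $text, $matches);
print_r($matches['m1']); // Phrases found before each break point
```

Bear in mind that . does not match a newline by default, so a newline in the middle of a phrase (rather than next to a split word) still cuts the phrase short unless you also add the s modifier.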

score 0 · Accepted Answer

I've been playing around with this code; I don't know whether it fits your needs... feel free to extend it!

$p1 = "Bee Pollen is made by honeybees, and is the food of the young bee. It is considered one of nature's most completely nourishing foods as it contains nearly all nutrients required by humans. Bee-gathered pollens are rich in proteins (approximately 40% protein), free amino acids, vitamins, including B-complex, and folic acid.";
$p2 = "Bee Pollen is made by honeybees. It is required for the fertilization of the plant. The tiny particles consist of 50/1,000-millimeter corpuscles, formed at the free end of the stamen in the heart of the blossom, nature's most completely nourishing foods. Every variety of flower in the universe puts forth a dusting of pollen. Many orchard fruits and agricultural food crops do, too.";

// Strip strings of periods etc.
$p1 = strtolower(str_replace(array('.', ',', '(', ')'), '', $p1));
$p2 = strtolower(str_replace(array('.', ',', '(', ')'), '', $p2));

// Extract words from first paragraph
$w1 = explode(" ", $p1);

// Build search string
$search = '';
$old_search = '';
$num_occured = 0;
$add = FALSE; // TRUE while a matching phrase is waiting to be recorded
$found = array();

foreach ($w1 as $word) {
    // Extend the current candidate phrase by one word
    $search = trim($search . ' ' . $word);

    $count = substr_count($p2, $search);
    if ($count) {
        // The candidate still occurs in the second paragraph; remember it
        $old_search = $search;
        $num_occured = $count;
        $add = TRUE;
    } else {
        // The candidate no longer matches: record the last match and restart from this word
        if ($add) {
            $found[] = array('pattern' => $old_search, 'occurences' => $num_occured);
            $add = FALSE;
        }
        $old_search = '';
        $search = $word;
    }
}

// Record a phrase that was still matching when the loop ended
if ($add) {
    $found[] = array('pattern' => $old_search, 'occurences' => $num_occured);
}

print_r($found);

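The code above finds occurrences of patterns from the first string in the second string. I'm sure it could be written better, but since it's past midnight (local time), I'm not as "fresh" as I'd like to be...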
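Since the question also asks how many times a phrase appears in each paragraph, and mentions two or more texts, one alternative is to count fixed-length word sequences (word n-grams) per text and keep those present in every text. This is a different technique from the loop above, sketched under assumptions: the function names ngramCounts and sharedNgrams, the tokenising pattern and the sample strings are all made up for illustration.

```php
<?php
// Hypothetical helper: count every run of $n consecutive words in a text.
function ngramCounts(string $text, int $n): array
{
    // Split on anything that is not a letter, digit or apostrophe
    $words = preg_split('/[^a-z0-9\']+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

    $counts = array();
    for ($i = 0; $i + $n <= count($words); $i++) {
        $gram = implode(' ', array_slice($words, $i, $n));
        $counts[$gram] = ($counts[$gram] ?? 0) + 1;
    }
    return $counts;
}

// Hypothetical helper: n-grams found in every text, mapped to their count in each text.
function sharedNgrams(array $texts, int $n): array
{
    $perText = array_map(function ($text) use ($n) {
        return ngramCounts($text, $n);
    }, $texts);

    // Keep only the n-grams that appear as keys in all of the count arrays
    $shared = array_intersect_key(...$perText);

    $result = array();
    foreach (array_keys($shared) as $gram) {
        $result[$gram] = array_map(function ($counts) use ($gram) {
            return $counts[$gram];
        }, $perText);
    }
    return $result;
}

$texts = array(
    'The quick brown fox. The quick brown dog.',
    'A quick brown cat',
);

print_r(sharedNgrams($texts, 2));
```

The phrase length is fixed per call, so you would run it for several values of $n to find longer shared phrases, and the n-grams here ignore sentence boundaries.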
