php - PHP - Need help with an advanced regular expression

Question

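So I have a lot of large text passages to parse. The end goal is to split each passage into smaller posts so that I can insert them into MySQL.

Here is a very short example of one of the passages, as a string: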

<?php
$longstring = '

(<b>John Smith</b>) at <b class="datetimeGMT">2011-01-10 22:13:01 GMT</b><hr>
Lots of text entered here under the first line.<br>And most of it is html, since it is for displaying in a web browser.<br></br></br>

(<b>Alan Slappy</b>) at <b class="datetimeGMT">2011-01-11 13:12:00 GMT</b><hr>
Forgot to put one more thing in the notes.........<br>blah blah blah
(<b>Joe Mama</b>) at <b class="datetimeGMT">2011-01-13 10:15:00 GMT</b><hr>
Groceries list:<br>Watermelons<br>Floss<br><br>email doctor
';

?>

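Yes, I have the monstrous job of parsing these strings into individual entries. And yes, I agree with anyone who says this is not a fun task. The original developer allowed new text to be appended to the existing text. That is not a bad idea in some situations, but it is for me.

I really need help figuring out how to regex this beast and put it into a foreach loop so I can start cleaning it up.

Here is how far I have gotten: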

<?php

if(preg_match_all('/\(<b>.*?<hr>/', $longstring, $matches)){
print_r($matches);
}
/* output: 
Array 
( 
    [0] => Array 
        ( 
         [0] => (<b>John Smith</b>) at <b class="datetimeGMT">2011-01-10 22:13:01 GMT</b><hr>
         [1] => (<b>Alan Slappy</b>) at <b class="datetimeGMT">2011-01-11 13:12:00 GMT</b><hr> 
         [2] => (<b>Joe Mama</b>) at <b class="datetimeGMT">2011-01-13 10:15:00 GMT</b><hr> 
        ) 
) 
*/ 
?>

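So I am actually doing pretty well at iterating over the header line of each entry. I am kind of proud that I figured it out. (Regular expressions are my nemesis.)

Now I am stuck figuring out how to include the actual text below each header in each iteration.

Does anyone know how to adjust preg_match_all to also capture the text below each "header"?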

score 1 · Accepted Answer

Try using preg_split instead:

$matches  = preg_split("/\s*(\(<b>.*?<hr>)\s*/s", trim($longstring), -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);

print_r($matches);

Note: trim is applied to your string to strip leading and trailing whitespace.

The result will look something like this:

Array
(
    [0] => (<b>John Smith</b>) at <b class="datetimeGMT">2011-01-10 22:13:01 GMT</b><hr>
    [1] => Lots of text entered here under the first line.<br>And most of it is html, since it is for displaying in a web browser.<br></br></br>
    [2] => (<b>Alan Slappy</b>) at <b class="datetimeGMT">2011-01-11 13:12:00 GMT</b><hr>
    [3] => Forgot to put one more thing in the notes.........<br>blah blah blah
    [4] => (<b>Joe Mama</b>) at <b class="datetimeGMT">2011-01-13 10:15:00 GMT</b><hr>
    [5] => Groceries list:<br>Watermelons<br>Floss<br><br>email doctor
)
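
Since the end goal is one database row per post, the flat array above still has to be folded into header/body pairs. The following is only a rough sketch of one way to do that, not part of the original answer: the `$posts` array and its `author`, `posted_at` and `body` keys are hypothetical names, the sample string is a shortened stand-in for the question's data, and the header pattern assumes every header looks exactly like the ones shown in the question.

```php
<?php
// Shortened sample in the same shape as the question's $longstring.
$longstring = '
(<b>John Smith</b>) at <b class="datetimeGMT">2011-01-10 22:13:01 GMT</b><hr>
First note.<br>More text.

(<b>Alan Slappy</b>) at <b class="datetimeGMT">2011-01-11 13:12:00 GMT</b><hr>
Second note
(<b>Joe Mama</b>) at <b class="datetimeGMT">2011-01-13 10:15:00 GMT</b><hr>
Groceries list:<br>Watermelons
';

// Split on the header lines and keep them in the result (DELIM_CAPTURE).
$parts = preg_split(
    '/\s*(\(<b>.*?<hr>)\s*/s',
    trim($longstring),
    -1,
    PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE
);

// Assumed header layout: (<b>NAME</b>) at <b class="datetimeGMT">TIME GMT</b><hr>
$headerPattern = '/^\(<b>(.*?)<\/b>\) at <b class="datetimeGMT">(.*?) GMT<\/b><hr>$/s';

$posts = [];
foreach ($parts as $part) {
    if (preg_match($headerPattern, $part, $m)) {
        // A header starts a new post; its body is filled in by the next part.
        $posts[] = ['author' => $m[1], 'posted_at' => $m[2], 'body' => ''];
    } elseif ($posts) {
        // Anything else is treated as body text of the most recent header.
        $posts[count($posts) - 1]['body'] .= $part;
    }
}

print_r($posts);
```

Checking each part against the header pattern, rather than assuming strict header/body alternation, keeps the pairing correct when a header has no text under it, because PREG_SPLIT_NO_EMPTY drops the empty body in that case.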

score 0 · Accepted Answer

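This would be easier if you parsed the HTML instead of relying on regular expressions alone, unless you can guarantee the format of the HTML.

You may want to take a look at Robust and Mature HTML Parser for PHP.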

score 0 · Accepted Answer

Try this:

if(preg_match_all('/\(<b>(?:(?!\(<b>).)*/s', $longstring, $matches)){
  print_r($matches);
}
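
The pattern above returns each post as one string, header and body together. If the author, timestamp and body are wanted separately, the same idea can probably be extended with named groups. This is a sketch under assumptions, not the answerer's code: the group names `author`, `posted_at` and `body` and the `$entries` array are made up for illustration, the sample string is a shortened stand-in for the question's data, and the header layout is assumed to match the question's examples exactly.

```php
<?php
// Shortened sample in the same shape as the question's $longstring.
$longstring = '
(<b>John Smith</b>) at <b class="datetimeGMT">2011-01-10 22:13:01 GMT</b><hr>
First note.<br>More text.

(<b>Alan Slappy</b>) at <b class="datetimeGMT">2011-01-11 13:12:00 GMT</b><hr>
Second note
(<b>Joe Mama</b>) at <b class="datetimeGMT">2011-01-13 10:15:00 GMT</b><hr>
Groceries list:<br>Watermelons
';

// Header fields are captured by name; the body reuses the answer's
// "any character that does not start the next header" construct.
$pattern = '/\(<b>(?<author>.*?)<\/b>\) at <b class="datetimeGMT">(?<posted_at>.*?) GMT<\/b><hr>'
         . '\s*(?<body>(?:(?!\(<b>).)*)/s';

$entries = [];
// PREG_SET_ORDER yields one array per post instead of one array per group.
if (preg_match_all($pattern, $longstring, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $m) {
        $entries[] = [
            'author'    => $m['author'],
            'posted_at' => $m['posted_at'],
            'body'      => trim($m['body']), // drop whitespace before the next header
        ];
    }
}

print_r($entries);
```

Note that a body containing the literal text `(<b>` would be cut short at that point, so this only holds as long as that sequence appears solely at the start of a header.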
