php - Get all paragraphs from a string

Question

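I pulled several paragraphs out of a database and tried to split them into an array, using regular expressions and various classes... but nothing worked.

This is what I tried: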

    public function get_first_para(){
        $doc = new DOMDocument();
        $doc->loadHTML($this->review);
        foreach($doc->getElementsByTagName('p') as $paragraph) {
            echo $paragraph."<br/><br/><br/>";
        }
    }

But I get:

Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: Unexpected end tag : p in Entity, line: 9 in C:\Inetpub\vhosts\bestcamdirectory.com\httpdocs\sandbox\model\ReviewContentExtractor.php on line 18

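Catchable fatal error: Object of class DOMElement could not be converted to string in C:\Inetpub\vhosts\bestcamdirectory.com\httpdocs\sandbox\model\ReviewContentExtractor.php on line 20

Why am I getting these messages, and is there a simple way to extract all paragraphs from a string?

Update: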

    public function get_first_para(){
        $pattern="/<p>(.+?)<\/p>/i";
        preg_match_all($pattern,$this->review,$matches,PREG_PATTERN_ORDER);
        return $matches;
    }

I prefer the second approach... but it doesn't work well either.

score 4 · Accepted Answer

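DOMDocument::getElementsByTagName returns a DOMNodeList object, which is iterable but is not an array. Inside the foreach, the $paragraph variable is an instance of DOMElement, so simply using it as a string won't work (as the error explains).

What you want is the text content of the DOMElement, which is available through its textContent property (inherited from the DOMNode class):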

foreach($doc->getElementsByTagName('p') as $paragraph) {
  echo $paragraph->textContent."<br/><br/><br/>"; // for text only
}

Or, if you need the full content of the DOMNode, you can use DOMDocument::saveHTML:

foreach($doc->getElementsByTagName('p') as $paragraph) {
    echo $doc->saveHTML($paragraph)."<br/><br/><br/>\n"; // with the <p> tag

    // without the <p>
    // if you don't need the containing <p> tag, you can iterate through its child nodes and output them
    foreach ($paragraph->childNodes as $cnode) {
        echo $doc->saveHTML($cnode);
    }
}
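Since the question asks for the paragraphs as an array rather than echoed output, the two loops above can be combined into one helper. This is only a minimal sketch: the function name extract_paragraphs and the 'text' / 'html' array keys are made up for illustration and are not part of the original answer.

```php
<?php
// Hypothetical helper (name and return shape are assumptions, not from the answer):
// returns one entry per <p>, with its plain text and its inner HTML.
function extract_paragraphs(string $html): array
{
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous); // restore the caller's setting

    $paragraphs = [];
    foreach ($doc->getElementsByTagName('p') as $p) {
        // Serialize the children only, so the wrapping <p> tag is left out.
        $inner = '';
        foreach ($p->childNodes as $child) {
            $inner .= $doc->saveHTML($child);
        }
        $paragraphs[] = ['text' => $p->textContent, 'html' => $inner];
    }
    return $paragraphs;
}

$result = extract_paragraphs('<p>First <b>bold</b> one</p><p>Second</p>');
```

Here $result should hold two entries, the first with the text "First bold one". Note that loadHTML assumes ISO-8859-1 unless the markup declares another encoding, so non-ASCII input may need extra care.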

As for your loadHTML error, the HTML input is invalid; you can suppress the warnings with:

libxml_use_internal_errors(true); // before loading the html content

If you need those errors, see the libxml error handling section of the manual.

Edit

Since you insist on using a regex, you could do it like this:
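If you do want to look at the parser messages instead of discarding them, something along these lines should work. It is a sketch under assumptions: the helper name load_html_collecting_errors is invented here, and which messages (if any) are reported depends on the libxml2 version PHP was built against.

```php
<?php
// Hypothetical helper: parses HTML without emitting PHP warnings and
// returns the document together with whatever messages libxml recorded.
function load_html_collecting_errors(string $html): array
{
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        // Each entry is a LibXMLError object (level, code, line, column, message).
        $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }

    libxml_clear_errors();                 // empty the internal error buffer
    libxml_use_internal_errors($previous); // restore the previous setting

    return [$doc, $messages];
}

// The stray </p> mimics the kind of markup that triggered the warning in the question.
[$doc, $messages] = load_html_collecting_errors('<div><p>One</p></p></div>');
```

The document is still built even when the input is invalid, so the paragraphs can be read from $doc while $messages is logged or ignored.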

preg_match_all('!<p>(.+?)</p>!sim',$html,$matches,PREG_PATTERN_ORDER);

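Pattern modifiers: m means multiline (it only changes how ^ and $ behave, so it has no effect on this particular pattern), s lets . match newlines as well, and i makes the match case-insensitive.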
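To see why the s modifier matters, here is a small comparison on a made-up input string. The [^>]* part in the second pattern is an addition that is not in the pattern above; it is there to tolerate attributes on the opening tag, with the caveat that it would also match other tags starting with "p", such as <pre>.

```php
<?php
// Sample input (invented for this sketch): one paragraph spans two lines,
// the other has an attribute and upper-case tag names.
$html = "<p>one\nline break</p>\n<P class=\"x\">two</P>";

// Pattern from the question: without s, . stops at the newline, and
// "<p>" does not allow attributes, so neither paragraph matches.
preg_match_all('!<p>(.+?)</p>!i', $html, $plain);

// With s, . also matches newlines; [^>]* accepts attributes on the tag.
preg_match_all('!<p[^>]*>(.+?)</p>!si', $html, $loose);

// $plain[1] is an empty array here; $loose[1] holds both paragraph bodies.
```

This kind of pattern still breaks on nested or malformed markup, which is why the DOM approach is usually the more robust one.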
