php - Can't preg_match the following. What exactly am I doing wrong?

Question

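I want to extract the description from a page whose description has the following format. I believe my code is right, but I can't figure out why it doesn't work.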

$file_string = file_get_contents('');

preg_match('/<div class="description">(.*)<\/div>/i', $file_string, $descr);
$descr_out = $descr[1];

echo $descr_out; 


<div class="description">
<p>some text here</p>
</div>

score 3 · Accepted Answer

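It looks like you need to turn on single-line mode in your regex. Change the pattern to add the s modifier (written without a hyphen, after the closing delimiter):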

preg_match('/<div class="description">(.*)<\/div>/si', $file_string, $descr);

Single-line mode allows . to match newline characters. Without it, .* will not match the newlines between your opening and closing div tags.
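
To see the difference, here is a minimal sketch that runs the pattern with and without the s modifier against the sample markup from the question. The variable names ($without, $with, $m1, $m2) are made up for illustration, and the lazy quantifier .*? is an addition that is not part of the answer above; it stops at the first closing </div> in case the page contains more than one.

```php
<?php
// Sample markup from the question, with newlines inside the div.
$file_string = "<div class=\"description\">\n<p>some text here</p>\n</div>";

// Without the s modifier, . does not match "\n", so the pattern fails.
$without = preg_match('/<div class="description">(.*)<\/div>/i', $file_string, $m1);

// With the s modifier (PCRE_DOTALL), . also matches newlines.
// .*? is lazy (an assumption beyond the original answer): it stops at the first </div>.
$with = preg_match('/<div class="description">(.*?)<\/div>/si', $file_string, $m2);

var_dump($without);          // int(0)
var_dump($with);             // int(1)
echo trim($m2[1]), PHP_EOL;  // <p>some text here</p>
```

Checking the return value of preg_match before reading $m2[1] avoids an undefined-index notice when the page does not contain the div at all.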

score 1 · Accepted Answer

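I suggest using the DOMDocument class and XPath to extract arbitrary fragments from an HTML document. Regex-based solutions are very fragile when the input changes (extra attributes, whitespace in odd places, and so on), and for more complex scenarios the DOM approach is also more readable.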

$html = '<html><body><div class="description"><p>some text here</p></div></body></html>';
// or you could fetch external sites 
// $html = file_get_contents('http://example.com');

$doc = new DOMDocument();
// prevent parsing errors (frequent with HTML)
libxml_use_internal_errors(true);
$doc->loadHTML($html);
// restore normal error reporting now that the HTML is parsed and stored in $doc
libxml_use_internal_errors(false);
$xpath = new DOMXPath($doc);

foreach ($xpath->query('//div[@class="description"]') as $el) {
    var_dump($el->textContent);
}
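
One caveat with the query above: @class="description" compares the whole attribute value, so it would miss a div that carries a second class. The sketch below shows one common way to match a single class token in XPath 1.0. The input markup with the extra "wide" class and the $texts array are hypothetical, added only for illustration.

```php
<?php
// Hypothetical input: the div has an extra class, which an exact
// @class="description" comparison would not match.
$html = '<html><body><div class="description wide"><p>some text here</p></div></body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();              // discard the collected warnings
libxml_use_internal_errors(false);

$xpath = new DOMXPath($doc);
// Treat "description" as one whitespace-separated token of the class attribute.
$query = '//div[contains(concat(" ", normalize-space(@class), " "), " description ")]';

$texts = [];
foreach ($xpath->query($query) as $el) {
    $texts[] = trim($el->textContent);
}
print_r($texts);
```

Padding the class value with spaces before calling contains() keeps the query from matching unrelated classes such as "description-long".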
