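When you have structured text (HTML, XML, JSON, YAML, and so on), you should always use a proper parser unless you have a very good reason not to.
In this case you might be able to get away with a regular expression, but the result will be very brittle, and you may run into problems with character encodings, entities, or whitespace. All of the solutions above will break in subtle ways. For example, if you have input like this: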
<html><head><meta http-equiv="content-type" content="text/html;charset=utf-8" />
<title>Page title</title></head>
<body><div><h1 title="attributes or the space in the closing tag may confuse code"
>Title &mdash; maybe emdash counted as 7 characters</h1 >
<p> and whitespace counted excessively too. And here's
a utf-8 character that may get split in the middle: ©; creating
an invalid string.</p></div></body></html>
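To make those failure modes concrete, here is a minimal sketch of what byte-based truncation and markup-based counting do to such text. It is an illustration rather than part of the solution; it assumes PHP 7 or later with the mbstring extension enabled, and the variable names are made up for the example.

```php
<?php
// "\u{00A9}" is the copyright sign, which takes two bytes in UTF-8 (0xC2 0xA9).
$text = "in the middle: \u{00A9}";

// Byte-based truncation: dropping the last byte cuts the character in half.
$byBytes = substr($text, 0, strlen($text) - 1);

// Character-based truncation: dropping the last character removes it whole.
$byChars = mb_substr($text, 0, mb_strlen($text, 'UTF-8') - 1, 'UTF-8');

var_dump(mb_check_encoding($byBytes, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($byChars, 'UTF-8')); // bool(true)

// Entities are a separate trap: the markup is 7 characters long,
// but the text it stands for is a single character.
$entity  = '&mdash;';
$decoded = html_entity_decode($entity, ENT_QUOTES | ENT_HTML5, 'UTF-8');

var_dump(strlen($entity));              // int(7)
var_dump(mb_strlen($decoded, 'UTF-8')); // int(1)
```

Counting characters after the markup has been parsed, as a DOM parser does, avoids both problems at once.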
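Here is a solution using DOMDocument and DOMXPath. It should work on all but the worst HTML, and will always give you a UTF-8 reply of at most 150 characters (not bytes, characters), with all entities normalized to their character values.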
$html = '<html><head><meta http-equiv="content-type" content="text/html;charset=utf-8" />
<title>Page title</title></head>
<body><div><h1 title="attributes or the space in the closing tag may confuse code"
>Title &mdash; maybe emdash counted as 7 characters</h1 >
<p> and whitespace counted excessively too. And here\'s
a utf-8 character that may get split in the middle: ©; creating
an invalid string.</p></div></body></html>';
$doc = new DOMDocument();
$doc->loadHTML($html);
// if you have a URL or filename, you can use this instead:
// $doc->loadHTMLFile($url);
$xp = new DOMXPath($doc);
// you can easily modify the XPath query to match the "title" of different documents
$titlenode = $xp->query('/html/body//h1[1]');
// XPath substring() is 1-based: starting at 1 keeps exactly 150 characters
// (starting at 0 would return only 149)
$xpath = 'normalize-space(substring(
    concat(
        normalize-space(.),
        " ",
        normalize-space(./following-sibling::*)
    ), 1, 150))';
$excerpt = null;
if ($titlenode->length) {
    $excerpt = $xp->evaluate($xpath, $titlenode->item(0));
}
var_export($excerpt);
This code will output:
'Title — maybe emdash counted as 7 characters and whitespace counted excessively too. And here\'s a utf-8 character that may get split in the middle: ©;'
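The basic idea here is to match your h1 (or whatever the title element is) with XPath, then take the string value of that element and of the element that follows it, and truncate the result to 150 characters, also with XPath. Note that in XPath 1.0, passing a node-set to normalize-space() uses only its first node, so only the first following sibling contributes to the excerpt. Keeping everything in XPath spares you the messy character-set and entity issues you would otherwise have to handle in PHP.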
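If you need this in more than one place, the same idea can be wrapped in a small function. The following is only a sketch under a few assumptions: the name html_excerpt and its parameters are hypothetical, libxml_use_internal_errors() is used to keep parser warnings about sloppy markup out of the output, and the sample input is ASCII with entities so that no charset declaration is needed.

```php
<?php
// Hypothetical helper, not part of PHP or of the code above: returns up to
// $length characters taken from the first node matched by $titleQuery and
// from its first following sibling element, or null if nothing matches.
function html_excerpt(
    string $html,
    string $titleQuery = '/html/body//h1[1]',
    int $length = 150
): ?string {
    if ($html === '') {
        return null; // loadHTML() rejects an empty string
    }

    // Collect libxml warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $loaded = $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$loaded) {
        return null;
    }

    $xp = new DOMXPath($doc);
    $nodes = $xp->query($titleQuery);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }

    // XPath substring() is 1-based, so start at 1 to keep $length characters.
    $expr = sprintf(
        'normalize-space(substring(concat(normalize-space(.), " ", '
        . 'normalize-space(./following-sibling::*)), 1, %d))',
        $length
    );

    return $xp->evaluate($expr, $nodes->item(0));
}

// Usage: entities and runs of whitespace are resolved before counting.
$sample = '<html><body><div><h1 class="x">Fish &amp; chips &mdash; a review</h1>
<p>  Crisp   batter,
soft inside.</p></div></body></html>';

$short = html_excerpt($sample, '/html/body//h1[1]', 30);
var_export($short);
```

Passing a different XPath query as the second argument lets you adapt the helper to documents whose title lives in another element.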