
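I have a string variable that holds the contents of a text file (e.g. .html) read with fopen(). Next I will run strip_tags() on it so I can use the untagged text for an article preview. Before that, though, I need to get the h1 nodeValue and count its characters, so that I can replace the zero in the code below with that value and end at 150 plus that value.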

$f = fopen($filepath,"r");
$WholeFile = fread($f, filesize($filepath));
fclose($f);
$StrippedFile=strip_tags($WholeFile);
$TextExtract = mb_substr("$StrippedFile", 0,150);

What is the best way for me to do this? Is a parser the answer? This is the only case [so far] where I need to extract a value from an HTML tag.


2 Answers


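When you have structured text (HTML, XML, JSON, YAML, etc.), you should always use a proper parser unless you have a very good reason not to.

In this case you might be able to get away with a regular expression, but the solution will be very brittle, and you may run into problems with character encoding, entities, or whitespace. The other solutions offered here will all break in subtle ways. For example, suppose you have input like this: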

<html><head><meta http-equiv="content-type" content="text/html;charset=utf-8" />
<title>Page title</title></head>
<body><div><h1 title="attributes or the space in the closing tag may confuse code"
>Title &mdash;    maybe emdash counted as 7 characters</h1 >
<p> and      whitespace counted excessively too. And here's
a utf-8 character that may get split in the middle: ©; creating  
an invalid string.</p></div></body></html>
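To make the entity and whitespace problem concrete, here is a minimal sketch that compares character counts before and after decoding. The `$fragment` string and the variable names are made up for illustration (a reduced version of the h1 above), and the sketch assumes the mbstring extension is available.

```php
<?php
// Hypothetical fragment, reduced from the h1 in the sample input.
$fragment = '<h1>Title &mdash;    maybe</h1>';

// strip_tags() removes the markup but keeps the entity and the run of spaces.
$stripped = strip_tags($fragment);

// Decode entities and collapse whitespace to get the text a reader would see.
$decoded = html_entity_decode($stripped, ENT_QUOTES | ENT_HTML5, 'UTF-8');
$visible = trim(preg_replace('/\s+/u', ' ', $decoded));

$rawLength     = mb_strlen($stripped, 'UTF-8'); // characters, entity still encoded
$visibleLength = mb_strlen($visible, 'UTF-8');  // characters after decoding
$visibleBytes  = strlen($visible);              // bytes, not characters

echo $rawLength, "\n";     // 22
echo $visibleLength, "\n"; // 13
echo $visibleBytes, "\n";  // 15
```

The raw count is 22 because `&mdash;` counts as 7 characters and the four spaces are kept. The visible text is 13 characters but 15 bytes, which is why a byte-based cut can split a multibyte character.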

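Here is a solution using DOMDocument and DOMXPath. It should work on everything except the very worst HTML, and it will always give you a utf-8 result of at most 150 characters (characters, not bytes), with all entities normalized to their character values.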

$html = '<html><head><meta http-equiv="content-type" content="text/html;charset=utf-8" />
<title>Page title</title></head>
<body><div><h1 title="attributes or the space in the closing tag may confuse code"
>Title &mdash;    maybe emdash counted as 7 characters</h1 >
<p> and      whitespace counted excessively too. And here\'s
a utf-8 character that may get split in the middle: ©; creating  
an invalid string.</p></div></body></html>';


$doc = new DOMDocument();
$doc->loadHTML($html);
// if you have a url or filename, you can use this instead:
// $doc->loadHTMLFile($url);
$xp = new DOMXPath($doc);

// you can easily modify the xquery to match the "title" of different documents
$titlenode = $xp->query('/html/body//h1[1]');

// XPath string positions start at 1, so substring(s, 1, 150) selects characters 1 through 150
// (a start position of 0 would return only 149 characters)
$xpath = 'normalize-space(substring(
        concat(
            normalize-space(.),
            " ",
            normalize-space(./following-sibling::*)
        ), 1, 150))';


$excerpt = null;
if ($titlenode->length) {
    $excerpt = $xp->evaluate($xpath, $titlenode->item(0));
}

var_export($excerpt);

This code will output:

'Title — maybe emdash counted as 7 characters and whitespace counted excessively too. And here\'s a utf-8 character that may get split in the middle: ©;'

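The basic idea is to match your h1 (or whichever element holds the title) with XPath, then take the string value of that element and of the sibling element that follows it, and truncate the result to 150 characters, again with XPath. Note that in XPath 1.0, normalize-space() applied to a node-set uses only the first node, so only the first following sibling contributes text. Keeping everything in XPath spares you the messy charset and entity issues you would otherwise have to handle in PHP.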
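If you specifically want the length of the h1 text so you can use it as an offset, as the question describes, the same tools can provide it. The following is only a sketch: the `$html` string is a made-up stand-in for the file contents, and it assumes the h1 is the first text inside the body, so that skipping its length lands at the start of the article text.

```php
<?php
// Hypothetical input; in the question this would be the contents of $filepath.
$html = '<html><head><meta http-equiv="content-type" content="text/html;charset=utf-8" /></head>'
      . '<body><h1>Title &mdash; here</h1><p>First paragraph of the article.</p></body></html>';

$doc = new DOMDocument();
// Collect libxml warnings about imperfect markup instead of printing them.
$previous = libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xp = new DOMXPath($doc);

// Whitespace-normalized, entity-decoded text of the first h1 ('' if there is none).
$title = $xp->evaluate('normalize-space((//h1)[1])');
$titleLength = mb_strlen($title, 'UTF-8');

// Whitespace-normalized text of the whole body.
$bodyText = $xp->evaluate('normalize-space(/html/body)');

// Assumption: the body text starts with the title, so skip it and take 150 characters.
$preview = trim(mb_substr($bodyText, $titleLength, 150, 'UTF-8'));

echo $titleLength, "\n"; // 12
echo $preview, "\n";     // First paragraph of the article.
```

Both strings go through normalize-space() so that the title length and the body text are measured the same way. If the h1 is not the first thing in the body, the offset no longer lines up and the pure XPath approach above is the safer choice.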

answered 2012-10-10T19:01:32.803

If you are certain of the content of the file you are processing, and you know that the title is in an h1, you could split the string into two at the position of the closing </h1> tag (using strstr(), for example, although there are plenty of other ways to do it).

You can then strip tags on the first one to get the title and strip tags on the second one to get the content. This is assuming your file ONLY has a single h1 containing the title, before the dom element that contains the content of the article.

Keep in mind that this is not the best way to parse a wide range of articles found online. For a more general solution, I'd look into a dedicated parser class.

Here is a code sample:

$f = fopen($filepath,"r");
$WholeFile = fread($f, filesize($filepath));
fclose($f);
// Modified part
$content = strip_tags(strstr($WholeFile, '</h1>'));
$title = strip_tags(strstr($WholeFile, '</h1>', true)); // the third argument (before_needle) requires PHP 5.3.0 or later
$TextExtract = mb_substr($content, 0,150);
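
One thing the sample above does not handle is a file with no closing h1 tag, in which case strstr() returns false. A slightly more defensive sketch might look like the following. The `$wholeFile` value is a made-up stand-in for the file contents, and the fallback of treating the whole file as content is only one possible choice.

```php
<?php
// Hypothetical input standing in for $WholeFile.
$wholeFile = '<h1>My title</h1><p>Body text of the article.</p>';

// stristr() is the case-insensitive variant of strstr(), so </H1> matches too.
$rest = stristr($wholeFile, '</h1>');

if ($rest === false) {
    // No closing h1 tag found: treat the whole file as content (an assumption).
    $title   = '';
    $content = strip_tags($wholeFile);
} else {
    // Everything before </h1> is the title, everything from it onward is content.
    $title   = strip_tags(stristr($wholeFile, '</h1>', true));
    $content = strip_tags($rest);
}

$titleLength = mb_strlen($title, 'UTF-8');
$textExtract = mb_substr($content, 0, 150, 'UTF-8');

echo $titleLength, "\n"; // 8
echo $textExtract, "\n"; // Body text of the article.
```

This still leaves entities undecoded and whitespace uncollapsed, so the counts can be off for the kinds of input a real parser handles correctly.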
answered 2012-10-10T17:43:11.687