php - Extracting the body of an HTML document using PHP

Question

I know it's better to use DOM for this purpose, but let's try to extract the text this way:

<?php


$html=<<<EOD
<html>
<head>
</head>
<body>
<p>Some text</p>
</body>
</html>
EOD;


        preg_match('/<body.*?>/', $html, $matches, PREG_OFFSET_CAPTURE);

        if (empty($matches))
            exit;

        $matched_body_start_tag = $matches[0][0];
        $index_of_body_start_tag = $matches[0][1];

        $index_of_body_end_tag = strpos($html, '</body>');


        $body = substr(
                        $html,
                        $index_of_body_start_tag + strlen($matched_body_start_tag),
                        $index_of_body_end_tag - $index_of_body_start_tag + strlen($matched_body_start_tag)
        );

echo $body;

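The result can be seen here: http://ideone.com/vH2FZ

As you can see, I'm getting more text than expected.

There's something I don't understand. To get the correct length for the substr($string, $start, $length) function, I'm using: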

$index_of_body_end_tag - $index_of_body_start_tag + strlen($matched_body_start_tag)

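I can't see what's wrong with this formula.

Can anyone suggest where the problem is?

Many thanks to you all.

EDIT:

Thank you all very, very much. It was just a bug in my head. After reading your answers, I now understand where the problem is. It should be: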

  $index_of_body_end_tag - ($index_of_body_start_tag + strlen($matched_body_start_tag));

or:

  $index_of_body_end_tag - $index_of_body_start_tag - strlen($matched_body_start_tag);

score 11 · Accepted Answer

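The problem is that your string contains newlines, and . in a pattern only matches within a single line; you need to add the /s modifier to make . match across multiple lines.

Here is my solution; I prefer doing it this way.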

<?php

$html=<<<EOD
<html>
<head>
</head>
<body buu="grger"     ga="Gag">
<p>Some text</p>
</body>
</html>
EOD;

    // get anything between <body> and </body> where <body can="have_as many" attributes="as required">
    if (preg_match('/(?:<body[^>]*>)(.*)<\/body>/isU', $html, $matches)) {
        $body = $matches[1];
    }
    // outputting all matches for debugging purposes
    var_dump($matches);
?>

EDIT: I'm updating my answer to better explain why your code fails.

You have this string:

<html>
<head>
</head>
<body>
<p>Some text</p>
</body>
</html>

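Everything looks fine, but every line actually ends with non-printing characters (a line break). You have 55 printable characters and 6 line breaks, and each line break is counted as 2 characters (\r\n) in the offsets below.

When you reach this part of the code: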

$index_of_body_end_tag = strpos($html, '</body>');

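You get the correct position of </body> (it starts at position 51), but that position counts the line-break characters too.

So when you reach this line of code: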

$index_of_body_start_tag + strlen($matched_body_start_tag)

It evaluates to 31 (line breaks included), and:

$index_of_body_end_tag - $index_of_body_start_tag + strlen($matched_body_start_tag)

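It evaluates to 51 - 25 + 6 = 32 (the number of characters you will read), but between <body> and </body> there are only 16 printable text characters and 4 non-printable ones (the line break after <body> and the line break before </body>). That is the problem: you have to group the calculation (precedence) like this: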

$index_of_body_end_tag - ($index_of_body_start_tag + strlen($matched_body_start_tag))

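This evaluates to 51 - (25 + 6) = 51 - 31 = 20 (16 + 4).

:) Hope this helps you understand why precedence matters. (Sorry for misleading you about the newlines; that only applies to the regex example I gave above.)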
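The arithmetic above can be checked with a short sketch. It assumes Windows-style \r\n line endings, which is what the offsets 25 and 51 imply, and writes them out explicitly. The variable names $start, $end, $wrong_length and $right_length are introduced here for illustration only.

```php
<?php
// Assumption: every line break is "\r\n" (2 bytes), matching the offsets quoted above.
$html = implode("\r\n", [
    '<html>',
    '<head>',
    '</head>',
    '<body>',
    '<p>Some text</p>',
    '</body>',
    '</html>',
]);

$tag   = '<body>';
$start = strpos($html, $tag);       // offset of the opening tag
$end   = strpos($html, '</body>');  // offset of the closing tag

// Without parentheses the tag length is added back instead of subtracted.
$wrong_length = $end - $start + strlen($tag);    // 51 - 25 + 6
$right_length = $end - ($start + strlen($tag));  // 51 - (25 + 6)

// Start right after the opening tag and read only up to the closing tag.
$body = substr($html, $start + strlen($tag), $right_length);

var_dump($start, $end, $wrong_length, $right_length);
var_dump($body);
```

With \n-only line endings the individual offsets would be smaller, but the grouping error and its fix are the same.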

score 4 · Accepted Answer

Personally, I wouldn't use a regex.

<?php

$html = <<<EOD

<html>
    <head>
        <title>Example</title>
    </head>
    <body>
        <h1>foobar</h1>
    </body>
</html>

EOD;

$s = strpos($html, '<body>') + strlen('<body>');
$f = '</body>';

echo trim(substr($html, $s, strpos($html, $f) - $s));

?>

Returns <h1>foobar</h1>
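Note that this only finds a bare <body> tag with no attributes, and it does not check whether strpos found anything. A slightly more defensive sketch might look like the following. The function name extract_body is hypothetical, and the sketch assumes a single body element, no > character inside attribute values, and PHP 7.1 or later for the nullable return type.

```php
<?php
// Hypothetical helper: returns the trimmed inner HTML of <body>, or null if not found.
function extract_body(string $html): ?string
{
    // Case-insensitive search for the start of the opening tag.
    $open = stripos($html, '<body');
    if ($open === false) {
        return null;
    }

    // End of the opening tag (assumes no '>' inside attribute values).
    $open_end = strpos($html, '>', $open);
    if ($open_end === false) {
        return null;
    }

    $start = $open_end + 1;
    $close = stripos($html, '</body>', $start);
    if ($close === false) {
        return null;
    }

    // Length is the closing tag's offset minus the start offset.
    return trim(substr($html, $start, $close - $start));
}

echo extract_body("<html><body class=\"x\">\n<h1>foobar</h1>\n</body></html>"), "\n";
var_dump(extract_body('<p>no body here</p>'));
```

Comparing against false with === matters here, because strpos can legitimately return 0 when the match is at the very start of the string.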

score 2 · Accepted Answer

The problem is in your substr computation of the ending index. You should be subtracting all the way through:

$index_of_body_end_tag - $index_of_body_start_tag - strlen($matched_body_start_tag)

But you are doing:

+ strlen($matched_body_start_tag)

That said, you can do this using preg_match alone. You just need to use the s modifier to make sure . matches across newlines:

preg_match('/<body[^>]*>(.*?)<\/body>/s', $html, $matches);
echo $matches[1];

Output:

<p>Some text</p>
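To see what the s modifier changes, here is a small sketch that runs the same pattern with and without it. By default . does not match a newline, so the first call finds nothing when the body spans several lines. The variable names are illustrative only.

```php
<?php
$html = "<body>\n<p>Some text</p>\n</body>";

// Without /s, "." stops at newlines, so the pattern cannot span lines.
$without = preg_match('/<body[^>]*>(.*?)<\/body>/', $html, $m1);

// With /s (PCRE_DOTALL), "." also matches newlines.
$with = preg_match('/<body[^>]*>(.*?)<\/body>/s', $html, $m2);

var_dump($without);       // int(0)
var_dump($with);          // int(1)
echo trim($m2[1]), "\n";  // <p>Some text</p>
```

The captured group still includes the line breaks just inside the body tags, which is why the sketch trims it before printing.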

score 1 · Accepted Answer

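Someone has probably already spotted your error; I didn't read all the replies.
The algebra is wrong.

Code here

By the way, this is the first time I've seen ideone.com, and it's pretty cool.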

$body = substr( 
          $html, 
          $index_of_body_start_tag + strlen($matched_body_start_tag),
          $index_of_body_end_tag - ($index_of_body_start_tag + strlen($matched_body_start_tag))
        );

or ...

$body = substr(
          $html,
          $index_of_body_start_tag + strlen($matched_body_start_tag),
          $index_of_body_end_tag - $index_of_body_start_tag - strlen($matched_body_start_tag)
       );
