php - Getting the content inside html is not working

Question

I am trying to extract the html content from a website. I only want what is inside the body tag.

    // $validlink is a link with an .htm extension; the source is rather large
    // and contains 24,000 lines of html code

    $thehtml = file_get_contents($validlink);
    $thehtml = preg_match("/<body.*?>(.*?)<\/body>/is", $thehtml);

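What else can I do? $thehtml is empty. I am trying to insert it into a wordpress post, but $thehtml is empty for some strange reason. Could there be a timeout problem or something?

It can't be a timeout problem, because simply outputting file_get_contents($validlink); works. For some reason the BODY just isn't being found.

Another possible solution would be to grab the content between the first div and the last div in the document.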

score 0 · Accepted Answer

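    $thehtml = file_get_contents($validlink);
    // preg_match() returns a match count; the matched text lands in $matches
    $thehtml = preg_match("/<body.*?>(.*?)<\/body>/is", $thehtml, $matches);
    // $matches[0] is the entire match, <body> tags included
    $thehtml = $matches[0];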
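
For context on why the question's code came up empty: preg_match() returns 1, 0, or false rather than the matched text, and the text itself is written to the third argument. The snippet below is a small sketch on a made-up sample string (not the asker's page) showing the difference between $matches[0] and $matches[1], plus a check for a regex error.

```php
<?php
// Sample input, invented for illustration only.
$html = '<html><body class="page"><p>Hello</p></body></html>';

// preg_match() returns 1 on a match, 0 on no match, or false on error;
// the matched text is written to the third argument.
$result = preg_match('/<body.*?>(.*?)<\/body>/is', $html, $matches);

if ($result === false) {
    // On very large inputs the pattern can fail, e.g. by hitting pcre.backtrack_limit.
    echo 'Regex error code: ' . preg_last_error() . "\n";
} elseif ($result === 1) {
    echo $matches[0] . "\n"; // whole match, including the <body> tags
    echo $matches[1] . "\n"; // first capture group: only the inner content
}
```

If preg_match() returns false on a 24,000-line page, checking preg_last_error() may be a quicker way to find out why than guessing at timeouts.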

score 0 · Accepted Answer

This is the correct code:

    $thehtml = file_get_contents($validlink);
    preg_match('/<body.*?>(.*?)<\/body>/is', $thehtml, $matches);
    $thehtml = $matches[1];

But I would recommend using a DOM parser instead.
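
A minimal sketch of the DOM-parser route, assuming PHP's bundled DOMDocument extension is available. The helper name extractBodyHtml is hypothetical and not part of the answer above.

```php
<?php
// Hypothetical helper: returns the inner HTML of <body>, or null if none is found.
function extractBodyHtml(string $html): ?string
{
    if ($html === '') {
        return null; // loadHTML() rejects empty input
    }

    $doc = new DOMDocument();
    // Real-world pages are rarely valid, so collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return null;
    }

    // Serialize each child of <body> so the <body> tags themselves are left out.
    $inner = '';
    foreach ($body->childNodes as $child) {
        $inner .= $doc->saveHTML($child);
    }
    return $inner;
}

// Possible usage with the variable from the question:
// $thehtml = extractBodyHtml(file_get_contents($validlink));
```

Unlike the regex, this does not depend on how the body tag is written, and it tolerates much of the malformed markup found on real pages. Pages that are not ASCII may need their character set declared for the parser to decode them correctly.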

score 0 · Accepted Answer

Use strpos() to get the string positions of the tag's start and end, then pass those positions to the substring function, substr().
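
The approach above can be sketched roughly as follows. The helper name sliceBody is hypothetical, and the sketch swaps in stripos() and strripos(), the case-insensitive relatives of strpos(), so that an upper-case BODY tag is also found.

```php
<?php
// Hypothetical helper: returns the text between <body ...> and </body>,
// or null if either tag is missing.
function sliceBody(string $html): ?string
{
    // Case-insensitive search for the start of the opening tag.
    $open = stripos($html, '<body');
    if ($open === false) {
        return null;
    }

    // The opening tag may carry attributes, so find the '>' that closes it.
    $start = strpos($html, '>', $open);
    if ($start === false) {
        return null;
    }
    $start++; // first character after the opening tag

    // Search from the end so the last closing tag in the document is used.
    $end = strripos($html, '</body>');
    if ($end === false || $end < $start) {
        return null;
    }

    return substr($html, $start, $end - $start);
}

// Possible usage with the variable from the question:
// $thehtml = sliceBody(file_get_contents($validlink));
```

This avoids the regex engine entirely, which may help on very large pages, but it is only a sketch: a '>' inside an attribute value of the body tag would cut the opening tag short.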
