php - Parse a web page's text content and view it in the same format

Question

Here I am parsing the text of a page:

<?php
$url= 'http://www.paulgraham.com/herd.html';
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTMLFile($url);
libxml_clear_errors();
$xpath = new DOMXPath($doc);
foreach($xpath->query("//script") as $script) {
    $script->parentNode->removeChild($script);
}
$textContent = $doc->textContent; //inherited from DOMNode
$text=escapeshellarg($textContent);
$test = preg_replace("/[^a-zA-Z]+/", " ", html_entity_decode($text));

echo $test; // This prints the entire content on one line, losing the page's original text formatting
echo nl2br($textContent); // This is not on a single line, but the layout looks unusual

?>

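I also tried using the <pre> tag, but that shows the entire content on a single line as well. What should I change here so that I get paragraphs with line breaks, as on the original page?

I only want the text content, without images, buttons, and everything else.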

score 1 · Accepted Answer

If you replace:

$test = preg_replace("/[^a-zA-Z]+/", " ", html_entity_decode($text));

with

$test = preg_replace("/<br>/", "\r\n", html_entity_decode($text));
$test = preg_replace("/<.+?>/", " ", $test);
$test = preg_replace("/[^a-zA-Z\r\n]+/", " ", $test);
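
Since `textContent` no longer contains any markup, another option is to mark the line breaks in the DOM itself before reading the text. The following is only a minimal sketch under assumptions, not the answer's method: the helper name `htmlToTextWithBreaks` and the list of elements treated as line-breaking (`br`, `p`, `div`, `li`, `tr`, `h1` to `h6`) are hypothetical choices made for illustration.

```php
<?php
// Hypothetical helper (not part of the original answer): returns the visible
// text of an HTML string, with a newline after <br> and block-level elements.
function htmlToTextWithBreaks(string $html): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // tolerate sloppy real-world HTML
    $doc->loadHTML($html);
    libxml_clear_errors();

    if ($doc->documentElement === null) {
        return '';
    }

    $xpath = new DOMXPath($doc);

    // Remove elements whose contents are not visible page text.
    foreach (iterator_to_array($xpath->query('//script | //style')) as $node) {
        $node->parentNode->removeChild($node);
    }

    // Collapse source whitespace inside every text node to single spaces,
    // so that only the newlines inserted below survive as line breaks.
    foreach (iterator_to_array($xpath->query('//text()')) as $textNode) {
        $textNode->data = preg_replace('/\s+/', ' ', $textNode->data);
    }

    // Assumed set of elements that should end a line.
    $breaking = '//br | //p | //div | //li | //tr'
              . ' | //h1 | //h2 | //h3 | //h4 | //h5 | //h6';
    foreach (iterator_to_array($xpath->query($breaking)) as $node) {
        // A null reference node makes insertBefore() append at the end.
        $node->parentNode->insertBefore(
            $doc->createTextNode("\n"),
            $node->nextSibling
        );
    }

    $text = $doc->documentElement->textContent;
    $text = preg_replace('/ *\n */', "\n", $text);    // trim spaces around breaks
    $text = preg_replace('/\n{3,}/', "\n\n", $text);  // at most one blank line
    return trim($text);
}
```

To use it with the page from the question, the HTML could be fetched with `file_get_contents($url)` and the result printed with `echo '<pre>' . htmlspecialchars(htmlToTextWithBreaks($html)) . '</pre>';` so that the browser keeps the newlines.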
