I am trying to parse the text content from a given URL. Here is the code:

<?php
$url = 'http://stackoverflow.com/questions/12097352/how-can-i-parse-dynamic-content-from-a-web-page';
$content = file_get_contents($url);
echo $content;   // This prints everything on the page, including images and markup

$text = escapeshellarg(strip_tags($content));
echo "<br>";
echo $text;      // This still contains source code, not only the visible text of the page
?>

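I only want the text that is written on the page, not the page source. Any ideas? I have already searched Google, but the approach above is the only one I can find anywhere.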

4 Answers

4

You can use DOMDocument and DOMNode:

$doc = new DOMDocument();
$doc->loadHTMLFile($url);
$xpath = new DOMXPath($doc);
foreach($xpath->query("//script") as $script) {
    $script->parentNode->removeChild($script);
}
$textContent = $doc->textContent; //inherited from DOMNode

Instead of using XPath, you can also do the following:

$doc = new DOMDocument();
$doc->loadHTMLFile($url); // Load the HTML
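// getElementsByTagName() returns a live list, so copy it to an array first;
// removing nodes while iterating the live list would skip elements.
foreach(iterator_to_array($doc->getElementsByTagName('script')) as $script) {
    $script->parentNode->removeChild($script); // remove the script and its content
                                               // so it will not appear in the text
}
$textContent = $doc->textContent; // inherited from DOMNode; returns the text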
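
To make the DOM approach easier to try without fetching a live page, here is a minimal sketch that works on an HTML string instead of a URL. The helper name `html_to_text` and the sample markup are made up for illustration, and removing `<style>` elements and collapsing whitespace are additions that the answer above does not include.

```php
<?php
// Hypothetical helper: extract the visible text from an HTML string.
function html_to_text(string $html): string
{
    $doc = new DOMDocument();
    // Real-world markup is rarely valid, so collect parser warnings
    // internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Remove <script> and <style> elements together with their contents.
    foreach ($xpath->query('//script | //style') as $node) {
        $node->parentNode->removeChild($node);
    }

    // Collapse the runs of whitespace that the markup leaves behind.
    return trim(preg_replace('/\s+/', ' ', $doc->textContent));
}

// Assumed sample input, not taken from the page in the question.
$sample = '<html><head><style>p { color: red; }</style></head>'
        . '<body><p>Hello world</p><script>var x = 1;</script></body></html>';
echo html_to_text($sample), "\n";
```

For a remote page, the string could come from `file_get_contents($url)` as in the question. Pages that build their content with JavaScript will still come back mostly empty, because nothing here executes scripts.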
answered 2013-09-22T11:11:53.453
1
$content = strip_tags(file_get_contents($url));

This removes the HTML tags from the page content.

answered 2013-09-22T11:07:45.453
1

To remove HTML tags, use:

$text = strip_tags($text);
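
Note that `strip_tags()` only removes the tags themselves. The text inside `<script>` and `<style>` elements is kept, which is probably why the question still sees source code in the output. The sketch below removes those blocks first. The helper name `plain_text` and the sample string are assumptions for illustration, and a regular expression is only a rough approximation of HTML parsing.

```php
<?php
// Hypothetical helper: drop <script>/<style> blocks, then the remaining tags.
function plain_text(string $html): string
{
    // Remove script and style elements together with their contents.
    $html = preg_replace('#<(script|style)\b[^>]*>.*?</\1\s*>#is', '', $html);
    // Strip the remaining tags and decode entities such as &amp;.
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Collapse runs of whitespace into single spaces.
    return trim(preg_replace('/\s+/', ' ', $text));
}

// Assumed sample input.
$sample = '<p>Fish &amp; chips</p><script>var x = 1;</script>';
echo strip_tags($sample), "\n";  // the script body is still present here
echo plain_text($sample), "\n";  // the script body is gone here
```

For messy or deeply nested markup, a DOM-based approach is likely to be more reliable than the regular expression used here.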
answered 2013-09-22T11:08:45.203
1

A simple cURL request will do the trick. [Tested]

<?php
$ch = curl_init("http://stackoverflow.com/questions/12097352/how-can-i-parse-dynamic-content-from-a-web-page");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1); // follow redirects
echo strip_tags(curl_exec($ch));
curl_close($ch);
?>
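
The snippet above prints whatever `curl_exec()` returns, including `false` when the request fails. A slightly more defensive version might look like the following sketch. The helper name `fetch_text` and the 10-second timeout are assumptions, not part of the original answer.

```php
<?php
// Hypothetical helper: download a URL and return its body with tags stripped.
function fetch_text(string $url): string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow HTTP redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // assumed limit, in seconds

    $body = curl_exec($ch);
    if ($body === false) {
        // curl_error() describes what went wrong (DNS failure, timeout, ...).
        throw new RuntimeException('Request failed: ' . curl_error($ch));
    }
    // The handle is released automatically when it goes out of scope.
    return strip_tags($body);
}
```

It could be called as `echo fetch_text($url);` inside a `try`/`catch` block. As with plain `strip_tags()`, the contents of `<script>` and `<style>` elements remain in the result.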
answered 2013-09-22T11:14:17.077