php - HTML to text with the DOMDocument class

Question

How can I get the source of an HTML page without the HTML tags? For example:

<meta http-equiv="content-type" content="text/html; charset=utf-8" /> 
<meta http-equiv="content-language" content="hu"/> 
<title>this is the page title</title>
<meta name="description" content="this is the description" />
<meta name="keywords" content="k1, k2, k3, k4" />
start the body content
<!-- <div>this is comment</div> -->
<a href="open.php" title="this is title attribute">open</a>
End now one noframes tag.
<noframes><span>text</span></noframes>
<select name="select" id="select"><option>ttttt</option></select>
<div class="robots-nocontent"><span>something</span></div>
<img src="url.png" alt="this is alt attribute" />

I need this result:

this is the page title this is the description k1, k2, k3, k4 start the body content this is title attribute open End now one noframes tag. text ttttt something this is alt attribute

I also need the title and alt attributes. Any ideas?

score 0 · Accepted Answer

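My solution is a bit more involved, but it has worked well for me.

If you are sure you have XHTML, you can simply treat the code as XML (but you have to put everything inside a proper wrapper element).

Then, using XSLT, you can define a few basic templates that do what you need.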
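
A minimal sketch of that idea, assuming the markup is well-formed XHTML and that the xsl extension is enabled. The function name xhtmlToTextViaXslt, the constant name, and the <root> wrapper element are made up for illustration; the stylesheet keeps text nodes, title and alt attributes, and the content of meta tags that carry a name attribute.

```php
<?php
// Stylesheet (illustrative): for every element, visit the attributes we care
// about and then the child nodes. Comments have no template, so the XSLT
// built-in rule drops them.
const TEXT_STYLESHEET = <<<'XSL'
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text" encoding="UTF-8"/>
  <xsl:template match="*">
    <xsl:apply-templates select="@title | @alt | self::meta[@name]/@content | node()"/>
  </xsl:template>
  <xsl:template match="@*">
    <xsl:value-of select="."/><xsl:text> </xsl:text>
  </xsl:template>
  <xsl:template match="text()">
    <xsl:value-of select="normalize-space(.)"/><xsl:text> </xsl:text>
  </xsl:template>
</xsl:stylesheet>
XSL;

// Hypothetical helper: wraps the fragment so it has a single root element,
// runs the stylesheet over it, and tidies the whitespace.
function xhtmlToTextViaXslt(string $xhtmlFragment): string
{
    $xml = new DOMDocument();
    $xml->loadXML('<root>' . $xhtmlFragment . '</root>');

    $xsl = new DOMDocument();
    $xsl->loadXML(TEXT_STYLESHEET);

    $processor = new XSLTProcessor();
    $processor->importStylesheet($xsl);

    $text = (string) $processor->transformToXml($xml);

    // Collapse runs of whitespace left over from the source layout.
    return trim(preg_replace('/\s+/', ' ', $text));
}
```

Note that loadXML() will reject anything that is not well-formed XML, for example HTML entities such as &nbsp; that are not declared, so this only fits input you know to be clean XHTML.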

score 0 · Accepted Answer

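You can do this with a regular expression.

$regex = '/<[^>]*>/';

would be a very simple start for stripping anything wrapped in < and >. To use it, though, you first have to read the HTML in as a string, with file_get_contents() or some other function that gives you the markup as text.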
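
Roughly how that could look in practice. This is only a sketch: the function name stripTagsWithRegex is invented here, and the extra comment-removal step is an assumption about what you want, since a plain tag pattern would leave the text inside <!-- ... --> behind.

```php
<?php
// Hypothetical helper: removes comments, then anything that looks like a tag,
// then collapses the leftover whitespace.
function stripTagsWithRegex(string $html): string
{
    // Drop comments first; otherwise markup nested inside a comment is
    // treated as ordinary tags and the comment text leaks into the result.
    $html = preg_replace('/<!--.*?-->/s', '', $html);

    // Replace each tag with a space so neighbouring words do not fuse.
    $html = preg_replace('/<[^>]*>/', ' ', $html);

    return trim(preg_replace('/\s+/', ' ', $html));
}

// In real use the input would come from something like file_get_contents().
echo stripTagsWithRegex("<title>this is the page title</title>\nstart the body content"), "\n";
```

PHP also ships a built-in strip_tags() that covers the same ground. Neither approach gives you attribute values such as title or alt.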

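Addendum:

If you also want to extract individual attributes, you will have to write a more complex regular expression to pull out that text. For example:

$regex2 = '/<[^>]*\stitle="([^"]*)"/';

Assuming nothing earlier in the tag matches first, this will pull out (I think... I'm still learning regex) whatever sits between title=" and the closing quote, as the first capture group. Either way, this gets to be a very complicated regex process.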
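
One way to make the attribute part a little less fragile is to do it in two passes: find each opening tag first, then look for the attributes inside it. The sketch below assumes double-quoted attribute values; the function name extractTitleAndAlt is made up, and it does not try to skip tags that sit inside comments.

```php
<?php
// Hypothetical helper: returns the values of all title and alt attributes,
// in the order they appear in the markup.
function extractTitleAndAlt(string $html): array
{
    $values = [];

    // Pass 1: every opening tag (closing tags and comments do not start
    // with a letter after "<", so they are not matched here).
    preg_match_all('/<[a-z][^>]*>/i', $html, $tags);

    foreach ($tags[0] as $tag) {
        // Pass 2: title="..." or alt="..." inside that single tag. The
        // leading \s keeps names such as data-title from matching.
        preg_match_all('/\s(?:title|alt)="([^"]*)"/i', $tag, $attributes);
        foreach ($attributes[1] as $value) {
            $values[] = $value;
        }
    }

    return $values;
}

print_r(extractTitleAndAlt('<a href="open.php" title="this is title attribute">open</a>'));
```

This still says nothing about where each value belongs relative to the surrounding text, which is where the regex route gets painful.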

score 0 · Accepted Answer

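This cannot be done automatically. PHP has no way of knowing which node attributes you want to keep and which to omit. You either have to write code that iterates over all attributes and text nodes, driven by a map that defines when a node's content should be used, or you simply select what you want, piece by piece, with XPath.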
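
A sketch of the XPath variant with DOMDocument, under the assumption that the wanted pieces are: non-empty text nodes, title and alt attributes, and the content of meta tags that have a name attribute. The function name htmlToText and the list of expressions are illustrative; adjust the list to whatever you consider content.

```php
<?php
// Hypothetical helper: loads (possibly sloppy) HTML and collects the selected
// text nodes and attribute values in document order.
function htmlToText(string $html): string
{
    $doc = new DOMDocument();

    // Real-world HTML is rarely valid; keep libxml warnings out of the output.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // The "map" of what counts as content. Comments are never selected
    // because text() does not match comment nodes.
    $wanted = [
        '//text()[normalize-space()]',
        '//@title',
        '//@alt',
        '//meta[@name]/@content',
    ];

    $xpath = new DOMXPath($doc);
    $parts = [];
    foreach ($xpath->query(implode(' | ', $wanted)) as $node) {
        $parts[] = trim(preg_replace('/\s+/', ' ', $node->nodeValue));
    }

    return implode(' ', $parts);
}
```

Called as htmlToText(file_get_contents('page.html')), where page.html is a placeholder, this should come close to the result asked for in the question. Text inside script or style elements would be picked up too, so you may want to exclude those in the text() expression.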

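Another option is XMLReader. It lets you walk through the whole document and call a handler of your own for each element name. That way you can define what to do with which element. See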

http://www.ibm.com/developerworks/library/x-pullparsingphp.html
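
A rough sketch of the XMLReader route, assuming the input is well-formed XML (XMLReader will not cope with tag soup). The function name xmlToTextWithReader and the handler map are invented for illustration; each handler returns the attribute value to keep for that element, or null.

```php
<?php
// Hypothetical helper: pulls nodes one at a time and dispatches on element name.
function xmlToTextWithReader(string $xml): string
{
    // Element name => callback deciding what to take from that element.
    $handlers = [
        'a'    => fn (XMLReader $r) => $r->getAttribute('title'),
        'img'  => fn (XMLReader $r) => $r->getAttribute('alt'),
        'meta' => fn (XMLReader $r) => $r->getAttribute('name') !== null
            ? $r->getAttribute('content')
            : null,
    ];

    $reader = new XMLReader();
    $reader->XML($xml);

    $parts = [];
    while ($reader->read()) {
        $value = null;
        if ($reader->nodeType === XMLReader::ELEMENT && isset($handlers[$reader->name])) {
            $value = $handlers[$reader->name]($reader);
        } elseif ($reader->nodeType === XMLReader::TEXT) {
            // Comments and whitespace-only nodes have other node types
            // and are skipped.
            $value = trim(preg_replace('/\s+/', ' ', $reader->value));
        }
        if ($value !== null && $value !== '') {
            $parts[] = $value;
        }
    }
    $reader->close();

    return implode(' ', $parts);
}
```

Because the reader streams through the document, this keeps memory use low on large files, at the cost of more bookkeeping than the XPath approach.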
