php - Extracting specific parts from an HTML document (php cURL, php, preg_match)

Question

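I am trying to extract some information from a web page using php cURL + preg_match or any other function, but for some reason it does not work at all. For example, from this page I want to extract the title "4bed house to rent, Caroline Place, Bayswater, W2", the price "2,300", and the description that starts with "This wonderful..." and ends with "(Circle and District lines).". I tried php cURL + DOM, but I get a lot of errors such as "htmlParseEntityRef: expecting ';' in Entity, line: 243" and no results are shown.

I also tried preg_match and preg_match_all, but that did not work either.

A very basic example would be much appreciated!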

score 1 · Accepted Answer

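You could try Simple HTML DOM Parser and see whether it is more tolerant of malformed markup.

Also take note of the terms and conditions of the site you are scraping.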
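As an alternative to a third-party parser, PHP's built-in DOM extension can also be made tolerant of sloppy markup. Warnings like the htmlParseEntityRef one quoted in the question come from libxml, and they can be buffered instead of printed by calling libxml_use_internal_errors(true) before loading the document. The following is only a minimal sketch: the sample markup, the "price" class name, and the XPath expressions are assumptions for illustration, not the structure of the real page.

```php
<?php
// Made-up sample markup; the real page's structure is not known here.
// The bare "&" in the href is the kind of input libxml tends to complain about.
$html = '<html><head><title>4 bedroom house to rent</title></head>'
      . '<body><p class="price">2,300</p>'
      . '<a href="?a=1&b=2">link with a bare ampersand</a></body></html>';

// Buffer libxml parse warnings instead of emitting them as PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);

$warnings = libxml_get_errors();       // available for inspection if needed
libxml_clear_errors();
libxml_use_internal_errors($previous); // restore the previous setting

// Query the parsed tree with XPath; string() returns the node's text content.
$xpath = new DOMXPath($doc);
$title = trim($xpath->evaluate('string(//title)'));
$price = trim($xpath->evaluate('string(//p[@class="price"])'));

echo $title, "\n";
echo $price, "\n";
```

For a real page the HTML string would come from cURL, and the XPath expressions would have to be adapted to whatever elements actually hold the title, price, and description.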

score 1 · Accepted Answer

A very basic example would be much appreciated

To answer the regular expression part:

preg_match('!<title>(.*)</title>!s', '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
    <head>
<title>

            4 bedroom


        house


    to rent in Caroline Place, Bayswater, W2 through Foxtons (Property to rent)</title>
<meta name="keywords" content="Houses" />', $matches);
print_r($matches);

/* output:
Array
(
    [0] => <title>

            4 bedroom


        house


    to rent in Caroline Place, Bayswater, W2 through Foxtons (Property to rent)</title>
    [1] => 

            4 bedroom


        house


    to rent in Caroline Place, Bayswater, W2 through Foxtons (Property to rent)
)
*/

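The s at the end of the regular expression puts the parser into what is (inappropriately) called single-line mode, in which the dot (.) also matches newline characters.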
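The effect of the s modifier is easiest to see by running the same pattern with and without it. This is a small sketch on a made-up fragment, not the real page. It also uses a lazy quantifier (.*?) and collapses whitespace afterwards, which are additions beyond the answer above.

```php
<?php
// Made-up fragment with a title that spans several lines.
$html = "<head>\n<title>\n    4 bedroom\n    house\n</title>\n</head>";

// Without s, "." does not match newlines, so the pattern cannot
// get from <title> to </title> and nothing matches.
$without = preg_match('!<title>(.*)</title>!', $html, $m1);

// With s (PCRE_DOTALL), "." also matches newlines.
$with = preg_match('!<title>(.*?)</title>!s', $html, $m2);

// Collapse runs of whitespace so the title fits on one line.
$title = trim(preg_replace('/\s+/', ' ', $m2[1]));

var_dump($without); // int(0)
var_dump($with);    // int(1)
echo $title, "\n";  // 4 bedroom house
```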

score 0 · Accepted Answer

I cannot recommend HTMLsql highly enough:

http://www.jonasjohn.de/lab/htmlsql.htm

This little puppy has saved me many times, in countless ways.

score -1 · Accepted Answer

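After fetching the data with curl, the result contains many newlines and spaces. So run some HTML clean-up script to remove those newlines and spaces first. After that, happy preg_match-ing.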
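A minimal sketch of that clean-up step, assuming the page is fetched with cURL and then squeezed down to single spaces before matching. The helper names fetch_html and squeeze_whitespace, the sample markup, and the "price" class are all hypothetical. fetch_html is defined only to show where the input would come from and is not called here.

```php
<?php
// Hypothetical helper: fetch a page body with cURL (not called below).
function fetch_html(string $url): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow redirects
        CURLOPT_TIMEOUT        => 15,   // give up after 15 seconds
    ]);
    $body = curl_exec($ch);
    return $body === false ? '' : $body;
}

// Hypothetical helper: collapse every run of whitespace
// (newlines included) into a single space.
function squeeze_whitespace(string $html): string
{
    return trim(preg_replace('/\s+/', ' ', $html));
}

// Made-up sample standing in for the output of fetch_html().
$raw = "<title>\n   4 bedroom\n\n   house\n to rent</title>\n"
     . "<span class=\"price\">\n  2,300\n</span>";
$clean = squeeze_whitespace($raw);

// With the newlines gone, the patterns no longer need the s modifier.
preg_match('!<title>\s*(.*?)\s*</title>!', $clean, $t);
preg_match('!<span class="price">\s*([\d,]+)\s*</span>!', $clean, $p);

echo $t[1], "\n"; // 4 bedroom house to rent
echo $p[1], "\n"; // 2,300
```

Squeezing whitespace also alters text inside elements such as pre or textarea, so this approach suits one-off extraction better than anything that has to preserve the page faithfully.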
