php - Why can't some websites be scraped?

Question

I've just started learning how to use regular expressions to extract data from websites. My first goal is to extract a site's title. Here is my code:

<?php 
    $data = file_get_contents('http://bctia.org');
    $regex = '/<title>(.+?)<\/title>/';
    preg_match($regex,$data,$match);
    var_dump($match); 
?>

The var_dump output is empty:

array(0) { }

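At first I thought, "Maybe bctia.org has no title?" That isn't the case, though: I checked the page source of bctia.org, and it does have content between <title> and </title>.

Then I thought maybe my code doesn't work. That isn't the case either, because when I replaced bctia.org with other sites, such as bing.com or apple.com, they all returned the correct result. For example, for apple.com I got: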

array(2) { [0]=> string(20) "<title>Apple</title>" [1]=> string(5) "Apple" }

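So I have to conclude that bctia.org is a very special website that prevents me from extracting its title...

Is that really the case? Or does my code have a problem that I haven't spotted?

Thanks in advance!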

score 3 · Accepted Answer

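This particular site's server-side code assumes that the client sends a User-Agent header, and apparently your PHP installation is not configured to send one. As a result, the server returns a 500 Internal Server Error, which causes file_get_contents to return false.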

Source Error:
Line 66: //LOAD: Compatibility Mode
Line 67: //<meta http-equiv="X-UA-Compatible" content="IE=7,IE=9" />
Line 68: string BrowserOS = Request.ServerVariables["HTTP_USER_AGENT"].ToString();
Line 69: HtmlMeta compMode = new HtmlMeta();
Line 70: compMode.Content = "IE=7,IE=9";


Source File: c:\inetpub\wwwroot\BCTIA\Website\bctia\layouts\Main Layout.aspx.cs   
Line: 68

Stack Trace:
[NullReferenceException: Object reference not set to an instance of an object.]
   Layouts.Main_Layout.Page_Load(Object sender, EventArgs e) in c:\inetpub\wwwroot\BCTIA\Website\bctia\layouts\Main Layout.aspx.cs:68
   System.Web.Util.CalliHelper.EventArgFunctionCaller(IntPtr fp, Object o, Object t, EventArgs e) +24
   System.Web.UI.Control.LoadRecursive() +70
   System.Web.UI.Page.ProcessRequestMain(Boolean includeStagesBeforeAsyncPoint, Boolean includeStagesAfterAsyncPoint) +3063

To work around this, set the user agent string before making the request:

ini_set('user_agent', 'Mozilla/5.0 (compatible; Examplebot/0.1; +http://www.example.com/bot.html)');
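
As an alternative to changing the ini setting globally, the user agent can be set per request through a stream context, and the return value can be checked before running the regex. The following is only a sketch: the helper names buildFetchContext and fetchPage, the 10-second timeout, and the Examplebot string are assumptions for illustration, not part of the original answer.

```php
<?php
// Hypothetical helper: build a stream context whose HTTP requests
// carry the given User-Agent header.
function buildFetchContext(string $userAgent)
{
    return stream_context_create([
        'http' => [
            'user_agent' => $userAgent,
            'timeout'    => 10, // seconds; an arbitrary choice
        ],
    ]);
}

// Hypothetical helper: return the response body, or null when
// file_get_contents() fails (for example on a 500 response).
function fetchPage(string $url, string $userAgent): ?string
{
    // @ hides the warning; the failure is reported through the null return.
    $data = @file_get_contents($url, false, buildFetchContext($userAgent));
    return $data === false ? null : $data;
}
```

A call such as fetchPage('http://bctia.org', 'Mozilla/5.0 (compatible; Examplebot/0.1)') would then return either the page body or null. Testing for null before calling preg_match makes a failed request visible instead of silently producing an empty match array.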

score 0 · Accepted Answer

Don't use regular expressions!!

Use XPath instead; see: XPath

Regular expressions will not work reliably here.
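
For illustration, here is one way the XPath suggestion might look with PHP's DOM extension (DOMDocument and DOMXPath). This is a minimal sketch: it assumes the extension is installed, and the function name extractTitleWithXPath and the sample HTML are made up for the example.

```php
<?php
// Hypothetical helper: return the text of the first <title> element,
// or null if the document is empty or has no title.
function extractTitleWithXPath(string $html): ?string
{
    if ($html === '') {
        return null; // loadHTML() rejects an empty string
    }

    $doc = new DOMDocument();

    // Real-world HTML is rarely well-formed, so collect parser
    // warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $loaded = $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$loaded) {
        return null;
    }

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query('//title');
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }

    return trim($nodes->item(0)->textContent);
}

// Sample input with the title spread over several lines.
$html = "<html><head><title>\n  Example Page\n</title></head><body></body></html>";
echo extractTitleWithXPath($html), "\n";
```

Unlike the regex, this does not depend on whether the title sits on one line, and an empty input is reported as null.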

score 0 · Accepted Answer

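Parsing HTML with regular expressions is not a good approach, because its loose structure can surprise you.

The reason your pattern doesn't work is that the dot does not match newlines.

If you want the dot to match newlines, add the s modifier at the end of the pattern, or avoid the dot altogether: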

$regex = '/<title>(.+?)<\/title>/s';

or

$regex = '/<title>([^<]+)<\/title>/';

[^<] is a character class that matches any character except <. As you can see, you no longer need a lazy quantifier: + instead of +?
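
The difference between the three patterns can be checked on a small input whose title spans several lines. The HTML string and variable names below are invented for the demonstration.

```php
<?php
// Sample input: the title text is on its own line.
$html = "<head><title>\nExample Page\n</title></head>";

// Original pattern: the dot cannot match "\n", so there is no match.
$withoutS = preg_match('/<title>(.+?)<\/title>/', $html, $m1);

// With the s modifier the dot also matches newlines.
$withS = preg_match('/<title>(.+?)<\/title>/s', $html, $m2);

// A negated character class matches newlines without any modifier.
$charClass = preg_match('/<title>([^<]+)<\/title>/', $html, $m3);

var_dump($withoutS, $withS, $charClass);

// The captured text keeps the surrounding newlines, so trim it.
var_dump(trim($m2[1]), trim($m3[1]));
```

Both working patterns capture the newlines around the title text, which is why the result is passed through trim.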
