php - Why does my simple-html-dom accept e.g. 'ä' for Wikipedia but not for Wikisource?

Question

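The following script works for some IRIs but not for others. I would like to know why it behaves this way and how to fix it. I suspect a charset problem, but that is only a guess, because it does work for Wikipedia.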

<?php
include('C:\xampp\htdocs\php\simple_html_dom.php');
$html = file_get_html('http://de.wikisource.org/wiki/Am_B%C3%A4chle');
// Title
foreach ($html->find('span#ws-title') as $f)
    echo $f->plaintext;

//1   http://de.wikisource.org/wiki/7._August_1929           OK
//2   http://de.wikisource.org/wiki/%E2%80%99s_ist_Krieg!    -
//3   http://de.wikisource.org/wiki/Am_B%C3%A4chle           -
//4   http://de.wikipedia.org/wiki/Guillaume-Aff%C3%A4re     OK
//5   http://de.wikisource.org/wiki/Solidit%C3%A4t           -
?>

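The 5 IRIs are just examples. The last 3 IRIs contain %C3%A4, which is an 'ä', but only the one from Wikipedia works. The 2nd IRI contains %E2%80%99, which is a '’' (right single quotation mark), and it does not work either.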

The first IRI, however, is from Wikisource and works. The same is true for every Wikisource IRI that does not contain an ä, ö, ...

When it does not work, I get the following messages:

Warning: file_get_contents(http://de.wikisource.org/wiki/Solidit%C3%A4t): failed to open stream: HTTP request failed! HTTP/1.0 403 Forbidden in C:\xampp\htdocs\php\simple_html_dom.php on line 70

Fatal error: Call to a member function find() on a non-object in C:\xampp\htdocs\php\frage.php on line 5

The function in simple_html_dom.php that contains line 70 looks like this:

//65    function file_get_html($url, $use_include_path = false, $context=null, $offset = -1, $maxLen=-1, $lowercase = true, $forceTagsClosed=true, $target_charset = DEFAULT_TARGET_CHARSET, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT)
//66    {
//67    // We DO force the tags to be terminated.
//68    $dom = new simple_html_dom(null, $lowercase, $forceTagsClosed, $target_charset, $defaultBRText);
//69    // For sourceforge users: uncomment the next line and comment the retreive_url_contents line 2 lines down if it is not already done.
//70    $contents = file_get_contents($url, $use_include_path, $context, $offset);
//71    // Paperg - use our own mechanism for getting the contents as we want to control the timeout.
//72    //    $contents = retrieve_url_contents($url);
//73    if (empty($contents))
//74    {
//75        return false;
//76    }
//77    // The second parameter can force the selectors to all be lowercase.
//78    $dom->load($contents, $lowercase, $stripRN);
//79    return $dom;
//80    }

Is there a way to make the script work for every IRI on Wikipedia or Wikisource? (I know there is not always a span#ws-title; that is not my question.)

score 1 · Accepted Answer

Awesome question! :)

They seem to filter by user agent. Try something like

<?php
ini_set("user_agent", "Descriptive user agent string");
file_get_contents("http://de.wikisource.org/wiki/".urlencode("Am_Bächle"));
?>

You can probably skip the urlencode part; I only used it to test whether the encoding was correct.
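
To make the encoding check above reproducible, here is a minimal sketch, assuming the page title is held as a UTF-8 string. The helper name `wiki_page_url` is hypothetical and not part of simple_html_dom.

```php
<?php
// Hypothetical helper: builds a page URL from a host and a UTF-8 page title.
function wiki_page_url($host, $title)
{
    // rawurlencode() percent-encodes each non-unreserved byte of the title.
    return 'http://' . $host . '/wiki/' . rawurlencode($title);
}

// "\xC3\xA4" is the UTF-8 byte sequence for 'ä', written out explicitly so
// the result does not depend on the encoding of this source file.
echo wiki_page_url('de.wikisource.org', "Am_B\xC3\xA4chle"), "\n";
// prints http://de.wikisource.org/wiki/Am_B%C3%A4chle
```

The result is identical to IRI 3 from the question, which suggests the percent-encoding itself is fine and the 403 has another cause.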

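Note that Wikisource evidently does not like automated parsing of its web pages. That said, there may be an API for wiki bots and the like; ask them or search the community pages. In any case, an API will be much easier to work with.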

score 0 · Accepted Answer

The problem has nothing to do with characters or encoding. You get the 403 because of the Wikimedia User-Agent policy, which says:

Scripts should use an informative User-Agent string with contact information, or they may be IP-blocked without notice.

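So that is what you should do: set the User-Agent header to something that identifies your application and can be used to contact you if something goes wrong.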
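
One way to set the header for a single request, rather than globally through ini_set, is a stream context. This is a minimal sketch: the helper name `wikimedia_context`, the tool name, the contact details and the timeout are all placeholder assumptions. It relies on the `$context` parameter that the file_get_html signature quoted in the question passes straight through to file_get_contents.

```php
<?php
// Hypothetical helper: returns a stream context that sends a descriptive User-Agent.
function wikimedia_context($userAgent)
{
    return stream_context_create(array(
        'http' => array(
            'user_agent' => $userAgent,
            'timeout'    => 10, // seconds; an arbitrary choice for this sketch
        ),
    ));
}

// Placeholder identity: replace with your own tool name and contact details.
$context = wikimedia_context('MyWikiTool/0.1 (https://example.org/contact; me@example.org)');

// With simple_html_dom loaded, the context goes in as the third argument:
// $html = file_get_html('http://de.wikisource.org/wiki/Am_B%C3%A4chle', false, $context);
// if ($html === false) {
//     die('Request failed'); // avoids calling find() on a non-object
// }
```

Checking the return value against false before calling find() also removes the fatal error shown in the question when a request is rejected.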

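That being said, accessing the pages directly is probably the worst way to get the data you need. You should use the API instead or, if you want to access a lot of pages, the database dumps.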
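
As a rough illustration of the API route, the sketch below builds a request to the MediaWiki `action=parse` module for one page. The helper name `wiki_api_parse_url` is hypothetical, and the choice of `prop=text` is only an assumption about what the script needs; other modules and properties may suit it better.

```php
<?php
// Hypothetical helper: builds a MediaWiki API URL that asks for the
// rendered HTML of a single page as JSON.
function wiki_api_parse_url($host, $title)
{
    $params = array(
        'action' => 'parse',
        'page'   => $title,
        'prop'   => 'text',
        'format' => 'json',
    );
    // http_build_query() percent-encodes the values; the separator is passed
    // explicitly so the result does not depend on php.ini settings.
    return 'https://' . $host . '/w/api.php?' . http_build_query($params, '', '&');
}

$url = wiki_api_parse_url('de.wikisource.org', "Am_B\xC3\xA4chle");
echo $url, "\n";
// prints https://de.wikisource.org/w/api.php?action=parse&page=Am_B%C3%A4chle&prop=text&format=json
```

The response can then be fetched with file_get_contents (still sending a descriptive User-Agent) and decoded with json_decode; the API requests are subject to the same User-Agent policy as ordinary page requests.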
