php - Get the HTML source of an external web page without headers/encoding

Question

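I just want to know whether it is possible to extract correctly encoded content (as UTF-8) from an HTML file that has no encoding header.

My specific case is this website: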

http://www.metal-archives.com/band/discography/id/203/tab/all

I want to extract all the information, but as you can see, this word, for example, looks bad:

MotÃ¶rhead

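I tried using file_get_html, htmlentities, utf8_decode and utf8_encode, and mixing them with different options, but I could not find a solution...

Edit:

I just want to see the same website, correctly formatted, with this simple code: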

$html_discos = file_get_html("http://www.metal-archives.com/band/discography/id/223/tab/all");
//some transform/decode here
print_r($html_discos);

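I want the content correctly encoded in a string or a DOM object, so that I can extract certain parts later.

Edit 2:

file_get_html is a function from the "simple html dom" library: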

http://simplehtmldom.sourceforge.net/

It has this code:

function file_get_html($url, $use_include_path = false, $context=null, $offset = -1, $maxLen=-1, $lowercase = true, $forceTagsClosed=true, $target_charset = DEFAULT_TARGET_CHARSET, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT, $defaultSpanText=DEFAULT_SPAN_TEXT)
{
    // We DO force the tags to be terminated.
    $dom = new simple_html_dom(null, $lowercase, $forceTagsClosed, $target_charset, $stripRN, $defaultBRText, $defaultSpanText);
    // For sourceforge users: uncomment the next line and comment the retreive_url_contents line 2 lines down if it is not already done.
    $contents = file_get_contents($url, $use_include_path, $context, $offset);
    // Paperg - use our own mechanism for getting the contents as we want to control the timeout.
    //$contents = retrieve_url_contents($url);
    if (empty($contents) || strlen($contents) > MAX_FILE_SIZE)
    {
        return false;
    }
    // The second parameter can force the selectors to all be lowercase.
    $dom->load($contents, $lowercase, $stripRN);
    return $dom;
}

score 2 · Accepted Answer

The Content-Type of the URL

http://www.metal-archives.com/band/discography/id/203/tab/all

is:

Content-Type: text/html

With no charset parameter, this defaults to ISO-8859-1. But you want to use UTF-8. Change the Content-Type so that the charset is signaled correctly:

Content-Type: text/html; charset=utf-8

See: Setting the HTTP charset parameter
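
To see whether a given Content-Type value declares a charset at all, something like the following could be used. This is only a minimal sketch: the helper name charset_from_content_type is made up for illustration and is not part of PHP or of simple html dom.

```php
<?php
// Hypothetical helper (not a built-in): extract the charset parameter from a
// Content-Type header value, or return null when none is declared.
function charset_from_content_type(string $contentType): ?string
{
    // Matches charset=utf-8 as well as charset="utf-8", case-insensitively.
    if (preg_match('/charset\s*=\s*"?([^\s;"]+)/i', $contentType, $m)) {
        return strtolower($m[1]);
    }
    return null;
}

// The header from the question declares no charset, so this is null.
$declared = charset_from_content_type('text/html');

// The corrected header declares UTF-8 explicitly.
$explicit = charset_from_content_type('text/html; charset=utf-8');
```

When the helper returns null, the client has to fall back to a default or guess, which is where the broken characters come from.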

score 1 · Accepted Answer

header('Content-Type: text/html; charset=utf-8');
echo file_get_contents('http://www.metal-archives.com/band/discography/id/203/tab/all');

As long as you send your output as UTF-8, the raw data will display correctly.
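
If you cannot be sure the fetched bytes really are UTF-8, one defensive option is to check them first and convert only when needed. The sketch below assumes the mbstring extension is available. The helper name ensure_utf8 and the ISO-8859-1 fallback are assumptions for illustration, not something the site or the library guarantees.

```php
<?php
// Hypothetical helper: return $bytes unchanged when they are already valid
// UTF-8; otherwise assume they are in $fallback and convert them to UTF-8.
function ensure_utf8(string $bytes, string $fallback = 'ISO-8859-1'): string
{
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes;
    }
    return mb_convert_encoding($bytes, 'UTF-8', $fallback);
}

// "\xC3\xB6" is the UTF-8 encoding of "ö", so this input is left untouched.
$alreadyUtf8 = ensure_utf8("Mot\xC3\xB6rhead");

// "\xF6" is "ö" in ISO-8859-1 and is not valid UTF-8, so this gets converted.
$converted = ensure_utf8("Mot\xF6rhead");
```

The result can then be passed to the library's str_get_html() instead of calling file_get_html() on the URL directly, e.g. str_get_html(ensure_utf8(file_get_contents($url))).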

score 0 · Accepted Answer

Try using html_entity_decode http://php.net/manual/en/function.html-entity-decode.php (the source of that page has encoded characters)
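
A minimal sketch of what that call does, using a made-up input string rather than the actual page source:

```php
<?php
// Illustrative input only: a string containing HTML entities.
$encoded = 'Mot&ouml;rhead &amp; friends';

// Decode named and numeric entities into UTF-8 characters.
$decoded = html_entity_decode($encoded, ENT_QUOTES | ENT_HTML5, 'UTF-8');
// $decoded is now "Motörhead & friends"
```

Note that this only helps if the source really contains entities such as &ouml;. It does not repair bytes that were read with the wrong charset.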
