php - Get a website title of unknown format using PHP cURL and DOMDocument

Question

I want to get a site's title from its URL. This works for most sites, but for Japanese and Chinese sites it returns unreadable text.

Here is my function:

function file_get_contents_curl($url) {
    $ch = curl_init();

    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);

    $data = curl_exec($ch);
    curl_close($ch);

    return $data;
}

Usage

    $html = $this->file_get_contents_curl($url);

Parsing

$doc = new DOMDocument();
@$doc->loadHTML($html);
$nodes = $doc->getElementsByTagName('title');
$title = $nodes->item(0)->nodeValue;

Site URL: https://user.ameba.jp/regist/registerIntro.do?campaignId=0053&frmid=3051

Please suggest a way to get the exact site title in any language.

// Example

    /* Method 4 */
 function file_get_contents_curl($url){
    $ch = curl_init();

    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);

    $data = curl_exec($ch);
    curl_close($ch);

    return $data;
}

    $uurl="http://www.piaohua.com/html/xuannian/index.html";
    $html = file_get_contents_curl($uurl);

//parsing begins here:
$doc = new DOMDocument();
@$doc->loadHTML($html);
$nodes = $doc->getElementsByTagName('title');

//get and display what you need:
if (!empty($nodes->item(0)->nodeValue)) {
    $title = utf8_decode($nodes->item(0)->nodeValue);
} else {
    $title = $uurl;
}
echo $title;

score 1 · Accepted Answer

Make sure your script is using UTF-8 encoding by adding the following line at the beginning of the file:

    mb_internal_encoding('UTF-8');

After doing that, remove the utf8_decode call from your code. Everything should work fine without it.
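To illustrate why dropping utf8_decode helps: DOMDocument hands back node values as UTF-8, and utf8_decode converts UTF-8 to ISO-8859-1, which has no Japanese or Chinese characters. The sketch below is only an illustration under that assumption. It uses mb_convert_encoding to perform roughly the same conversion (utf8_decode itself is deprecated as of PHP 8.2), and the sample title string is made up.

```php
<?php
// Hypothetical title, as DOMDocument would return it (UTF-8).
$title = '日本語';

// Roughly what utf8_decode() does: convert UTF-8 to ISO-8859-1.
// Written with mbstring because utf8_decode() is deprecated as of PHP 8.2.
$latin1 = mb_convert_encoding($title, 'ISO-8859-1', 'UTF-8');

echo $title, PHP_EOL;   // the original UTF-8 text, unchanged
echo $latin1, PHP_EOL;  // characters with no Latin-1 equivalent are substituted
```

Once the text has been forced into ISO-8859-1, the original characters cannot be recovered, so the conversion has to be avoided rather than undone.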

The DOMDocument::loadHTML function gets the encoding from the HTML page's meta tag, so you may run into problems if the page does not explicitly specify its encoding.
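One possible way to handle pages that do not declare their encoding is to work out the charset yourself, convert the markup to UTF-8, and then turn every non-ASCII character into a numeric entity before calling loadHTML, so the parser no longer has to guess. This is a minimal sketch, not a definitive implementation. The helper names detect_charset and extract_title are hypothetical, the candidate list passed to mb_detect_encoding is an assumption aimed at Japanese and Chinese pages, and the detection step is only a heuristic.

```php
<?php
// Hypothetical helper: take the charset from the HTTP Content-Type header,
// then from a <meta> tag, and only then fall back to a guess.
function detect_charset(string $html, ?string $contentType = null): string
{
    if ($contentType !== null
        && preg_match('/charset\s*=\s*["\']?([A-Za-z0-9._:-]+)/i', $contentType, $m)) {
        return $m[1];
    }
    if (preg_match('/<meta\b[^>]*charset\s*=\s*["\']?([A-Za-z0-9._:-]+)/i', $html, $m)) {
        return $m[1];
    }
    // Assumed candidate list; detection is a heuristic and can be wrong.
    $guess = mb_detect_encoding($html, ['UTF-8', 'SJIS', 'EUC-JP', 'GB18030', 'BIG-5'], true);
    return $guess !== false ? $guess : 'UTF-8';
}

// Hypothetical helper: return the <title> text as UTF-8, or null if absent.
function extract_title(string $html, ?string $contentType = null): ?string
{
    $charset = detect_charset($html, $contentType);
    try {
        $utf8 = @mb_convert_encoding($html, 'UTF-8', $charset);
    } catch (\ValueError $e) {
        $utf8 = false; // PHP 8 throws when mbstring does not know the charset name
    }
    if ($utf8 === false) {
        $utf8 = $html; // unknown charset: pass the bytes through unchanged
    }

    // Replace every non-ASCII character with a numeric entity so the markup
    // is plain ASCII and parses the same way whatever encoding is assumed.
    $ascii = mb_encode_numericentity($utf8, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // collect warnings instead of using @
    $doc->loadHTML($ascii);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $node = $doc->getElementsByTagName('title')->item(0);
    return $node !== null ? trim($node->textContent) : null;
}
```

With cURL, the Content-Type value can be read with curl_getinfo($ch, CURLINFO_CONTENT_TYPE) before closing the handle and passed as the second argument. Charset names that mbstring does not recognize fall through to the unconverted bytes, so such pages may still come out garbled.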

score 1 · Accepted Answer

Just add this line at the top of your PHP code.

header('Content-Type: text/html;charset=utf-8');

Code:

<?php
header('Content-Type: text/html;charset=utf-8');
function file_get_contents_curl($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);

    $data = curl_exec($ch);
    curl_close($ch);

    return $data;
}

$html = file_get_contents_curl('http://www.piaohua.com/html/lianxuju/2013/1108/27730.html');
$doc = new DOMDocument();
@$doc->loadHTML($html);
$nodes = $doc->getElementsByTagName('title');
echo $title = $nodes->item(0)->nodeValue;
