php - Extracting content from Wikipedia in PHP and MySQL

Question

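I have a web page that contains links to all of Wikipedia's featured articles, and I extract the title, description, and keywords from each of these articles. But I have a problem: when the web crawler starts extracting the content of the articles, the description and keywords fields in my database remain empty.

How can I extract the description and keywords of Wikipedia articles?

The web crawler is written in PHP and MySQL; this is the actual code: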

<?php
error_reporting(E_ALL | E_STRICT);
set_time_limit(0);
$server_link = mysql_connect("localhost", "root", "");
if (!$server_link) {
    die("Fall&oacute; la Conexi&oacute;n " . mysql_error());
}
$db_selected = mysql_select_db("test", $server_link);
if (!$db_selected) {
    die("No se pudo seleccionar la Base de Datos " . mysql_error());
}
@mysql_query("SET NAMES 'utf8'");
function storeLink($titulo, $descripcion, $url, $keywords) {
    $query = "INSERT INTO webs (webTitulo, webDescripcion, weburl, webkeywords) VALUES ('$titulo', '$descripcion', '$url', '$keywords')";
    mysql_query($query) or die('Error, falló la inserción de datos');
}
function extraer($url, $prof, $patron) {
    $userAgent = 'Interredu';
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HTTPHEADER, array(("Accept-Language: es-es,en")));
    curl_setopt($ch, CURLOPT_FAILONERROR, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, FALSE);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
    curl_setopt($ch, CURLOPT_MAXREDIRS, 2);
    curl_setopt($ch, CURLOPT_AUTOREFERER, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $html = curl_exec($ch);
    saveUrl($url, $prof, $patron, $html);
    if (!$html) {
        echo "<br />cURL error number:" . curl_errno($ch);
        echo "<br />cURL error:" . curl_error($ch);
    }
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $xpath = new DOMXPath($dom);
    $hrefs = $xpath->evaluate("/html/body//a");
    for ($i = 0;$i < $hrefs->length;++$i) {
        $href = $hrefs->item($i);
        $url2 = $href->getAttribute('href');
        $var = strstr($url2, '#', true);
        if ($var !== false) {
            $url2 = $var;
        }

        if (strpos($url2, $patron) === false) {
            continue;
        }

        if ($url2 != $url && $url2 != '') {
            $busqueda = mysql_query("SELECT weburl FROM webs WHERE weburl='$url2'");
            $cantidad = mysql_num_rows($busqueda);
            if (1500 >= $prof && 0 == $cantidad) {
                extraer($url2, ++$prof, $patron);
            }
        }
    }
}
function saveUrl($url, $prof, $patron, $html) {
    $retorno = false;
    $pos = strpos($url, $patron);
    if ($prof >= 1) {
        preg_match_all("(<title>(.*)<\/title>)siU", $html, $title);
        $metas = get_meta_tags($url, 1);
        $titulo = html_entity_decode($title[1][0], ENT_QUOTES, 'UTF-8');
        $descripcion = isset($metas["description"])?$metas["description"] : '';
        $keywords = isset($metas["keywords"])?$metas["keywords"] : '';
        if (empty($descripcion)){
            obtenerMetaDescription($html);
        }
        if (empty($keywords)){
            preg_match_all("#<\s*h1[^>]*>[^<]+</h1>#is", $html, $encabezado);
            preg_match_all("#<\s*b[^>]*>[^<]+</b>#is", $html, $negrita);
            preg_match_all("#<\s*i[^>]*>[^<]+</i>#is", $html, $italica);
            foreach($encabezado[0] as $encabezado){
                $h1 = $encabezado;
            }
            foreach($negrita[0] as $negrita){
                $bold = $negrita;
            }
            foreach($italica[0] as $italica){
                $italic = $italica;
            }
            $keys = $bold." ".$h1." ".$italic." ";
            $keywords = substr(strip_tags($keys), 0, 200);
        }
        storeLink($titulo, $descripcion, $url, $keywords, $prof);
        $retorno = true;
    }
    return $retorno;
}
function obtenerMetaDescription($text) {
    preg_match_all('#<p>(.*)</p>#Us', $html, $parraf);
    foreach($parraf[1] as $parraf){
        $descripcion = substr(strip_tags($parraf), 0, 200);
    }
}
$url = "http://www.mywebsite.com/wikiarticles";
$patron = "http://es.wikipedia.org/wiki/";
$prof = 1500;
libxml_use_internal_errors(true);
extraer($url, 1, $patron);
$errores = libxml_get_errors();
libxml_clear_errors();
mysql_close();
?>

Thanks, everyone. Regards.

score 1 · Accepted Answer

General approach in such situations

The first thing to do is to locate the error:

Check the contents of the variables at different positions ($descripcion, $metas, $parraf) for some known Wikipedia URLs whose expected values you can verify manually.
This shows you where the variables still hold correct values and where they do not.

Then you can come to one of the following conclusions:

Every variable is correct in the code: the problem lies in your MySQL insert method.
Some variable is not set even though it should be: the error is at that specific location in your code.
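
The checkpoint idea above can be sketched roughly as follows. This is only an illustration: debugSnapshot() is a hypothetical helper, the regex-based meta lookup stands in for get_meta_tags(), and a small fixed HTML string replaces a live Wikipedia request so the snippet runs offline.

```php
<?php
// Hypothetical helper (not part of the crawler): formats a labeled
// snapshot of a variable so it can be printed at any checkpoint.
function debugSnapshot($label, $value) {
    return $label . ' = ' . var_export($value, true);
}

// Assumption: a fixed sample page without a <meta name="description"> tag.
$html = '<html><head><title>Demo</title></head>'
      . '<body><p>First paragraph.</p></body></html>';

// Checkpoint 1: result of the meta description lookup.
preg_match('#<meta\s+name="description"\s+content="([^"]*)"#i', $html, $meta);
$descripcion = isset($meta[1]) ? $meta[1] : '';
echo debugSnapshot('descripcion after meta lookup', $descripcion), PHP_EOL;

// Checkpoint 2: result of the paragraph regex used by the fallback.
preg_match_all('#<p>(.*)</p>#Us', $html, $parraf);
echo debugSnapshot('parraf[1]', $parraf[1]), PHP_EOL;
```

If a checkpoint prints the expected value and the next one does not, the error lies between the two.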

How this approach applies to your situation

Wikipedia does not seem to use a meta description (at least not on the article I looked at).
Thus, obtenerMetaDescription() should be called.
So I tried this function with a small example like this:

Code:

function obtenerMetaDescription($text) {
    preg_match_all('#<p>(.*)</p>#Us', $html, $parraf);
    foreach($parraf[1] as $parraf){
        $descripcion = substr(strip_tags($parraf), 0, 200);
        var_dump($descripcion);
    }
}

$html = file_get_contents('https://de.wikipedia.org/wiki/Ehrenmal_Marienfeld');
obtenerMetaDescription($html);

PHP output is: PHP Notice: Undefined variable: html in test.php on line 4

Solution for your situation

You used $html inside the function even though the argument is passed in as $text. It is a simple variable-name problem.
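
A minimal sketch of the fix follows. Besides renaming the parameter, it makes the function return its result, because the function in the question has no return statement and the call in saveUrl() discards whatever it computes. It assumes the caller is changed to assign the return value to $descripcion. The loop still keeps the last paragraph, as in the question, and the names $parrafos, $parrafo and $sample are illustrative.

```php
<?php
// Corrected sketch: the parameter name matches the variable used inside,
// and the result is returned to the caller.
function obtenerMetaDescription($html) {
    $descripcion = '';
    preg_match_all('#<p>(.*)</p>#Us', $html, $parrafos);
    foreach ($parrafos[1] as $parrafo) {
        // Same behaviour as in the question: each paragraph overwrites the previous one.
        $descripcion = substr(strip_tags($parrafo), 0, 200);
    }
    return $descripcion;
}

// Assumed usage inside saveUrl(): assign the result instead of discarding it.
$sample = '<p>Hola <b>mundo</b>.</p><p>Segundo.</p>';
$descripcion = obtenerMetaDescription($sample);
var_dump($descripcion);
```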

Possible other problem

Double-check the assignment to $descripcion in the same function. You assign the contents of each <p> to $descripcion inside a foreach loop, so every iteration overwrites the previous value with the new one and only the last paragraph survives. I cannot imagine this is the intended behaviour. I guess you wanted to implement one of the following two:

Take only the first paragraph: use only $parraf[1][0], if it is not empty.
Concatenate all texts into one large text: use the .= string concatenation operator.
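
Both options might look roughly like this. The function names and the $max parameter are hypothetical, and the first variant skips empty paragraphs rather than checking only index 0, which is a small extension of the suggestion above.

```php
<?php
// Option 1 (hypothetical name): return the first non-empty paragraph.
function descripcionPrimerParrafo($html, $max = 200) {
    preg_match_all('#<p>(.*)</p>#Us', $html, $parrafos);
    foreach ($parrafos[1] as $parrafo) {
        $texto = trim(strip_tags($parrafo));
        if ($texto !== '') {
            return substr($texto, 0, $max);
        }
    }
    return '';
}

// Option 2 (hypothetical name): concatenate all paragraphs with .=
// and truncate once at the end.
function descripcionTodosLosParrafos($html, $max = 200) {
    preg_match_all('#<p>(.*)</p>#Us', $html, $parrafos);
    $texto = '';
    foreach ($parrafos[1] as $parrafo) {
        $texto .= trim(strip_tags($parrafo)) . ' ';
    }
    return substr(trim($texto), 0, $max);
}

$sample = '<p></p><p>Uno.</p><p>Dos <i>tres</i>.</p>';
echo descripcionPrimerParrafo($sample), PHP_EOL;
echo descripcionTodosLosParrafos($sample), PHP_EOL;
```

Truncating after concatenation, rather than per paragraph, keeps the 200-byte limit on the final description.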
