php - Web spider for prices

Question

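I want to compare certain products by price across several websites, so that I can build a price history for myself before buying a product. If the price stays stable I usually order the product; if not, I try to find out why the price keeps going up and down.

I want to write a web crawler for this in PHP so that it runs automatically, because doing it by hand takes a lot of time.

So I created a MySQL database and entered the URLs of all the products I want to follow. After that I use a simple script to output the prices: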

<?php

// Send each table row to the browser as soon as it is ready.
@ini_set("output_buffering", "Off");
@ini_set('implicit_flush', 1);
@ini_set('zlib.output_compression', 0);
@ini_set('max_execution_time', 1200);
$dbhost = 'localhost';
$dbuser = 'salesrep';
$dbpass = 'pas';
$dbname = "spider";
$dbtable = "price_compair";
// mysqli replaces the mysql_* functions, which were removed in PHP 7.
$conn = mysqli_connect($dbhost, $dbuser, $dbpass, $dbname) or die('Error connecting to mysql');
$results = mysqli_query($conn, "SELECT * FROM $dbtable") or die(mysqli_error($conn));

if (ob_get_level() == 0)
    ob_start();

// The URL columns are assumed to be url_site1 and url_site2, matching the checks below.
while ($row = mysqli_fetch_assoc($results)) {
    echo "<tr><td>" . $row['artikel'] . "</td>";

    // Site 1 writes the euro sign as the &euro; entity in front of the price.
    // An empty URL or a failed download produces an empty cell instead of stopping the script.
    if ($row['url_site1'] == "" || !($fp = fopen($row['url_site1'], "r"))) {
        echo "<td>&nbsp;</td>";
    } else {
        $content = "";
        while (!feof($fp)) {
            $content .= fgets($fp, 1024);
        }
        fclose($fp);
        preg_match_all("/\&euro; (\d+\.\d+)/", $content, $pricesite1, PREG_SET_ORDER);
        $replace1 = array("&euro; ");
        // [1][0] is the full text of the second match on the page.
        $price1 = isset($pricesite1[1][0]) ? $pricesite1[1][0] : "";
        echo "<td>" . str_replace($replace1, "", $price1) . "</td>";
    }

    // Site 2 uses a decimal comma, for example 12,34 or 1.234,56.
    if ($row['url_site2'] == "" || !($fp = fopen($row['url_site2'], "r"))) {
        echo "<td>&nbsp;</td>";
    } else {
        $content = "";
        while (!feof($fp)) {
            $content .= fgets($fp, 1024);
        }
        fclose($fp);
        preg_match_all("/\d+\.\d+\,\d+|(\d+\,\d+)/", $content, $pricesite2, PREG_SET_ORDER);
        $replace2 = array("€ ", ".");
        // Prefer the second match; fall back to the first when there is only one.
        $price2 = isset($pricesite2[1][0]) ? $pricesite2[1][0]
            : (isset($pricesite2[0][0]) ? $pricesite2[0][0] : "");
        echo "<td>" . str_replace(",", ".", str_replace($replace2, "", $price2)) . "</td>";
    }
    echo "</tr>";
    ob_flush();
    flush();
}
mysqli_close($conn);
?>

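The main problem I run into is that one website uses the € sign while the other writes it as &euro; in its markup, so finding the price is tricky. On top of that, the euro sign can be placed before or after the price. To make it harder still, a recommended retail price may appear before the actual price.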

My script works for now, but the pattern I use for preg_match_all is far from perfect. Does anyone know how to build it so that it works reliably on any website?
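
One way to cover both spellings of the euro sign and both positions is to normalise the entity first and then look for the sign on either side of the number. The sketch below is only an illustration: the function names parseAmount and extractEuroPrices are made up here, it assumes UTF-8 input, and it assumes amounts look like 12.34, 12,34 or 1.234,56. It cannot tell a recommended retail price from the actual one.

```php
<?php
// Hypothetical helper: turn "1.234,56", "12,34" or "12.34" into a float.
function parseAmount(string $raw): float
{
    if (strpos($raw, ',') !== false) {
        // European style: dots are thousands separators, the comma is the decimal mark.
        $raw = str_replace(',', '.', str_replace('.', '', $raw));
    }
    return (float) $raw;
}

// Hypothetical helper: find every amount with a euro sign directly before or after it.
function extractEuroPrices(string $html): array
{
    // Treat the entity forms as the literal sign so one pattern covers both sites.
    $html = str_replace(array('&euro;', '&#8364;', '&nbsp;'), array('€', '€', ' '), $html);
    $amount = '\d{1,3}(?:\.\d{3})*,\d{2}|\d+[.,]\d{2}';
    $pattern = '/€\s*(' . $amount . ')|(' . $amount . ')\s*€/';
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

    $prices = array();
    foreach ($matches as $m) {
        // Group 1 is filled when the sign comes first, group 2 when it comes last.
        $prices[] = parseAmount($m[1] !== '' ? $m[1] : $m[2]);
    }
    return $prices;
}
```

The result is a list of every price found on the page, in page order, so choosing the right one (for example skipping a retail price that comes first) is still a per-site decision.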

Also, is the fgets approach I use the right way to build a spider?
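
For reference, the fopen/fgets loop can be collapsed into a single file_get_contents call, and a stream context adds a timeout so that one slow shop does not stall the whole run. This is a minimal sketch, assuming allow_url_fopen is enabled; the name fetchPage, the 10-second timeout and the User-Agent string are illustrative choices, not part of the script above.

```php
<?php
// Hypothetical helper: download a page as a string, or return null on failure.
function fetchPage(string $url, int $timeoutSeconds = 10): ?string
{
    $context = stream_context_create(array(
        'http' => array(
            'timeout'    => $timeoutSeconds,     // give up on slow servers
            'user_agent' => 'price-spider/0.1',  // made-up identifier
        ),
    ));
    // @ suppresses the warning on failure; the false return value is checked instead.
    $content = @file_get_contents($url, false, $context);
    return $content === false ? null : $content;
}
```

A null result can then be turned into an empty table cell rather than ending the script.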

I know there are comparison sites that could do this for me, but I find it a fun PHP project :)

score 1 · Accepted Answer

I only have a solution to one of your issues. What you are asking for is very difficult: a script that can work out which of the many prices on any page refers to the product that page is about. You need to write code for each site. I think this will be a waste of time: if you are scraping 20 sites, it is likely that at least one will change its HTML each year.

Anyway, for the HTML entity (€) issue, pass all the HTML through:
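
If you do go the per-site route, it can be kept manageable by storing one rule per shop rather than one code path per shop. The following is a hypothetical sketch: the host names, the patterns, and the names $siteRules and priceForSite are invented for illustration, and skip says how many earlier matches (for example a recommended retail price) to ignore.

```php
<?php
// Hypothetical per-site rules: a pattern with the amount in group 1, and how many
// leading matches to skip (e.g. a recommended retail price shown before the real one).
$siteRules = array(
    'shop-one.example' => array('pattern' => '/&euro; (\d+\.\d+)/', 'skip' => 1),
    'shop-two.example' => array('pattern' => '/€ (\d+,\d+)/',       'skip' => 0),
);

// Return the raw price text for a page, or null when no rule exists or it no longer matches.
function priceForSite(array $rules, string $host, string $html): ?string
{
    if (!isset($rules[$host])) {
        return null; // no rule written for this site yet
    }
    $rule = $rules[$host];
    preg_match_all($rule['pattern'], $html, $matches, PREG_SET_ORDER);
    return $matches[$rule['skip']][1] ?? null;
}
```

A null result is a useful signal that a site has changed its HTML and its rule needs updating.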

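function entitiesToUTF8( $str )
{
    // The /e modifier was removed in PHP 7, so callbacks are used instead.
    // entityToUTF8() is a helper that this answer does not show.
    $str = preg_replace_callback( '~&#x0*([0-9a-f]+);~i', function ( $m ) {
        return entityToUTF8( $m[1] );          // hex digits, passed as a string
    }, $str );
    $str = preg_replace_callback( '~&#0*([0-9]+);~', function ( $m ) {
        return entityToUTF8( (int) $m[1] );    // decimal code point
    }, $str );

    return $str;
}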

and this will convert numeric HTML entities (for example &#8364;) to characters.
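
The function above only touches numeric references, so a named entity such as &euro; passes through unchanged. PHP's built-in html_entity_decode handles named and numeric forms in one call, which allows a shorter variant. A minimal sketch, assuming the page is (or has been converted to) UTF-8; the wrapper name decodeEntities is made up here.

```php
<?php
// Hypothetical wrapper around PHP's built-in decoder.
function decodeEntities(string $html): string
{
    // ENT_QUOTES also decodes quote entities; ENT_HTML5 selects the HTML5 entity table.
    return html_entity_decode($html, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

echo decodeEntities('&euro; 12.50, &#8364; 12.50, &#x20AC; 12.50');
// prints "€ 12.50, € 12.50, € 12.50"
```

After this step a single pattern that looks for the literal € sign works for both sites.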
