
I want to store:

  1. Product name
  2. Category
  3. Sub-category
  4. Price
  5. Product company.

In my table, named products_data, the field names are PID, product_name, category, subcategory, product_price and product_company.

I am using the curl_init() function in PHP to first scrape the website URL, and next I want to store the product data in my database table. This is what I have done so far:

$sites[0] = 'http://www.babyoye.com/';

foreach ($sites as $site)
{
    $ch = curl_init($site);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $html = curl_exec($ch);

    $title_start = '<div class="info">';

    $parts = explode($title_start,$html);
    foreach($parts as $part){
        $link = explode('<a href="/d/', $part);

        $link = explode('">', $link[1]);
        $url = 'http://www.babyoye.com/d/'.$link[0];

        // now for the title we need to follow a similar process:

        $title = explode('<h2>', $part);

        $title = explode('</h2>', $title[1]);

        $title = strip_tags($title[0]);

        // INSERT DB CODE HERE e.g.

        $db_conn = mysql_connect('localhost', 'root', '') or die('error');
        mysql_select_db('babyoye', $db_conn) or die(mysql_error());

        $sql = "INSERT INTO products_data(PID, product_name) VALUES ('".$url."', '".$title."')"

        mysql_query($sql) or die(mysql_error()); 

    }
}

I'm a little confused about the database part - how to insert the data into the table. Any help?


1 Answer


There's a number of things you may wish to consider in your design phase prior to writing some code:

  • Generalise your solutions as much as you can. If you have to write new PHP code for every scrape, the development work needed whenever a target site changes its layout may be too slow to keep up, and may disrupt the enterprise you are building. This is extra-important if you intend to scrape a large number of sites, since the odds of a site restructuring are statistically greater.
  • One way to achieve this generalisation is to use off-the-shelf libraries that are already good at this. So, rather than using cURL, use Goutte or some other programmatic browser system. This will give you sessions for free, which on some sites is necessary to click from one page to another. You'll also get CSS selectors to specify which items of content you are interested in (see the sketch after this list).
  • For tabular content, keep a look-up table in your local database that converts a heading title to a database column name. For product grids, you could use a table that converts a CSS selector (relative to each grid cell, say) to a column. Either of these will make it easier to respond to changes in the format of your target site(s).
  • If you are extracting text from a site, at a minimum you need to run it through a proper escape system, otherwise a target site could in theory add content to their pages to inject SQL of their choosing into your database. In any case, an apostrophe on their side would certainly cause your query to fail, so you should use mysql_real_escape_string.
  • If you are extracting HTML from a site with a view to re-displaying it, always remember to clean it properly first. This means stripping tags that you don't want, removing attributes that may be unwelcome, and ensuring the structure is well-nested. HTMLPurifier is good for this, I've found.
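
To make the Goutte and escaping suggestions above concrete, here is a minimal sketch. The div.info and h2 selectors are taken from the markup in the question; the $selector_map array, the .price selector and the database credentials are assumptions you would adapt to the real site, and the mysql_* calls simply mirror the question's code (mysqli or PDO with prepared statements would be the better modern choice).

require 'vendor/autoload.php';

use Goutte\Client;

// Hypothetical map from a CSS selector (relative to each product cell) to a column
// in products_data; this could equally live in a look-up table in the database.
$selector_map = array(
    'h2'     => 'product_name',
    '.price' => 'product_price', // assumed class name - adjust to the real markup
);

$client  = new Client();
$crawler = $client->request('GET', 'http://www.babyoye.com/');

$db_conn = mysql_connect('localhost', 'root', '') or die('error');
mysql_select_db('babyoye', $db_conn) or die(mysql_error());

// Each div.info block is treated as one product cell, as in the question's markup.
$crawler->filter('div.info')->each(function ($cell) use ($selector_map) {
    $columns = array();
    foreach ($selector_map as $selector => $column) {
        $node = $cell->filter($selector);
        if (count($node) > 0) {
            // Escape everything scraped from a remote site before it goes near SQL.
            $columns[$column] = "'" . mysql_real_escape_string(trim($node->text())) . "'";
        }
    }
    if ($columns) {
        $sql = 'INSERT INTO products_data (' . implode(', ', array_keys($columns)) . ') '
             . 'VALUES (' . implode(', ', $columns) . ')';
        mysql_query($sql) or die(mysql_error());
    }
});

Because the selectors live in data rather than code, a layout change on the target site only means updating the map (or the look-up table backing it), not rewriting the scraper.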

When crawling, remember:

  • Be a good robot and define a unique USER_AGENT for yourself, so site operators can easily block you if they wish. It is poor etiquette to masquerade as a human using, say, Internet Explorer. Include a URL to a friendly help page in your user agent, like the GoogleBot does.
  • Don't crawl through proxies or other systems intended to hide your identity - crawl in the open.
  • Respect robots.txt; if a site wishes to block scrapers, they should be allowed to do so using respected conventions. If you are acting like a search engine, the odds of an operator wishing to block you are very low (don't most people want to be scraped by search engines?)
  • Always do some rate limiting, otherwise this happens. On my development laptop through a slow connection, I can scrape a site at a rate of two pages a second, even without using multi_curl. On a real server, that's likely to be much faster - maybe 20? Either way, making that number of requests to one target IP/domain is a great way to find yourself in someone's blocklist. Thus, if you scrape, do it slowly.
  • I maintain a table of HTTP accesses, and have a rule that if I've made a request in the last 5 seconds, I "pause" this scrape, and scrape something else instead. I come back to paused scrapes once sufficient time has passed. I may be inclined to increase this value, and hold the concurrent state of a larger number of paused operations in memory.
  • If you are scraping a number of sites, one way to maintain performance without sleeping excessively is to interleave the requests you wish to make on a round-robin basis. So, do one HTTP operation each on 50 sites, retain the state of each scrape, and then go back to the first one (see the sketch after this list).
  • If you implement the interleaving of many sites, you can use multi_curl to parallelise your HTTP requests. I wouldn't recommend using this on a single site for reasons already stated (the remote server may well limit the number of connections you can separately open to them anyway).
  • Be careful about basing your entire enterprise on the scraping of a single site. If they block you, you're fairly stuck. If your business model can rely on the scraping of many sites, then being blocked by one becomes less of a risk.
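
As a rough sketch of the pause-and-interleave idea above: the answer keeps a table of HTTP accesses, but its schema isn't shown, so this stand-in uses a simple in-memory array of last-hit times per host; fetch_next_url_for() and process_page() are hypothetical placeholders for your own queueing and parsing code.

// Round-robin over many sites, pausing any host hit within the last 5 seconds.
$sites = array('http://www.babyoye.com/', 'http://example.com/');
$lastHit = array();   // host => unix timestamp of the last request
$minInterval = 5;     // seconds between requests to the same host

while ($sites) {
    foreach ($sites as $i => $site) {
        $host = parse_url($site, PHP_URL_HOST);

        // "Pause" this scrape if we hit the host too recently; move to the next site.
        if (isset($lastHit[$host]) && (time() - $lastHit[$host]) < $minInterval) {
            continue;
        }

        $url = fetch_next_url_for($site);   // hypothetical: next page queued for this site
        if ($url === null) {
            unset($sites[$i]);              // nothing left to crawl on this site
            continue;
        }

        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        // Identify yourself honestly, with a pointer to a help page.
        curl_setopt($ch, CURLOPT_USERAGENT, 'MyProductBot/1.0 (+http://example.com/bot-info)');
        $html = curl_exec($ch);
        curl_close($ch);

        $lastHit[$host] = time();
        process_page($site, $url, $html);   // hypothetical: parse and store the results
    }

    // Avoid a tight busy-loop when every host is currently "paused".
    usleep(200000);
}

The same loop is also the natural place to swap in curl_multi_* later, if you want the requests to different hosts to run in parallel.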

Also, it may be cost-effective to install third-party scraping software, or to get a third-party service to do the scraping for you. My own research in this area has turned up very few organisations that appear to be capable (and bear in mind that, at the time of writing, I've not tried any of them). So, you may wish to look at these:

answered 2013-09-25T07:12:45.190