javascript - Optimizing a crawler script run from a cronjob

Question

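I have about 66 million domains in a MySQL table. I need to run a crawler on every domain and set that row's count = 1 when the crawler finishes.

The crawler script is written in PHP and uses the PHP crawler library. Here is the script.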

set_time_limit(10000);
        try{

            $strWebURL          =   $_POST['url'];
            $crawler    =   new MyCrawler();
            $crawler->setURL($strWebURL);
            $crawler->addContentTypeReceiveRule("#text/html#");
            $crawler->addURLFilterRule("#\.(jpg|jpeg|gif|png)$# i");
            $crawler->enableCookieHandling(true);
            $crawler->setTrafficLimit(1000 * 1024);
            $crawler->setConnectionTimeout(10);

            //start of the table
            echo '<table border="1" style="margin-bottom:10px;width:100% !important;">';
            echo '<tr>';
            echo '<th>URL</th>';
            echo '<th>Status</th>';
            echo '<th>Size (bytes)</th>';
            echo '<th>Page</th>';
            echo '</tr>';
            $crawler->go();
            echo '</table>';

            $this->load->model('urls');
            $this->urls->incrementCount($_POST['id'],'urls');

        }catch(Exception $e){

        }

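$this->urls->incrementCount(); only updates the row and sets the count column = 1.

Because I have 66M domains, I need to run a cronjob on my server. Since a cronjob runs on the command line, I need a headless browser, so I chose PhantomJS; without a headless browser (PhantomJS) the crawler does not work the way I want.

The first problem I faced was loading the domains from the MySQL db and running the crawler script from a js script. I tried this:

Creating a PHP script that returns the domains as JSON, loading it from the js file, looping over the domains and running the crawler. This did not work well and got stuck after some time.
The next thing I tried, which I am still using, is a Python script that loads the domains directly from the MySQL db and runs the PhantomJS script on each domain.

Here is the code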

import MySQLdb
import httplib
import sys
import subprocess
import json

args = sys.argv;

db = MySQLdb.connect("HOST","USER","PW","DB")
cursor = db.cursor()
#tablecount = args[1]
frm = args[1]
limit = args[2]

try:
    sql = "SELECT * FROM urls WHERE count = 0 LIMIT %s,%s" % (frm,limit)
    cursor.execute(sql)
    print "TOTAL RECORDS: "+str(cursor.rowcount)
    results = cursor.fetchall()
    count = 0;
    for row in results:
        try:
            domain = row[1].lower()
            idd = row[0]
            command = "/home/wasif/public_html/phantomjs /home/wasif/public_html/crawler2.js %s %s" % (domain,idd)
            print command
            proc = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE)
            script_response = proc.stdout.read()
            print script_response
        except:
            print "error running crawler: "+domain

except:
    print "Error: unable to fetch data"
db.close()

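It takes 2 arguments (offset and limit) that set which range of domains is selected from the database.

It loops over the domains and runs this command for each one using subprocess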

command = "/home/wasif/public_html/phantomjs /home/wasif/public_html/crawler2.js %s %s" % (domain,idd)
print command
proc = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE)
script_response = proc.stdout.read()
print script_response

The crawler2.js file also takes 2 arguments: the first is the domain and the second is the id used to set count = 1 when the crawler finishes. This is crawler2.js

var args = require('system').args;
var address = '';
var id = '';
args.forEach(function(arg, i) {
    if(i == 1){
       address = arg;
    }

    if(i == 2){
        id = arg;
    }
});

address = "http://www."+address;

var page = require('webpage').create(),
server = 'http://www.EXAMPLE.net/main/crawler',
data = 'url='+address+'&id='+id;

console.log(data);

page.open(server, 'post', data, function (status) {
    if (status !== 'success') {
        console.log(address+' Unable to post!');
    } else {
        console.log(address+' : done');
    }
    phantom.exit();
});

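It works fine, but my script gets stuck after some time and needs to be restarted, and the logs show nothing wrong.

I need to optimize this process and run the crawler as fast as I can. Any help would be appreciated.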

score 0 · Accepted Answer

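Web crawler programmer here. :)

Your Python script runs PhantomJS serially. You should run it in parallel. To do that, start PhantomJS and then leave it; don't wait for it to finish.

In PHP, it would be something like this: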

exec("/your_executable_path > /dev/null &");
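
For the Python script in the question, one way to get the same effect is a bounded worker pool rather than fully detached processes. The following is only a sketch under some assumptions: it targets Python 3 (the question's script is Python 2), the names `run_one` and `run_all` are made up for illustration, and the defaults of 20 workers and a 60 second timeout are arbitrary. The timeout is there so that a single hung PhantomJS process cannot stall the whole batch.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_one(command, timeout):
    """Run one command given as an argument list.

    Returns (returncode, stdout_text). Returns (None, "") if the process
    timed out (it is killed in that case) or could not be started.
    """
    try:
        done = subprocess.run(command, stdout=subprocess.PIPE,
                              stderr=subprocess.DEVNULL, timeout=timeout)
        return done.returncode, done.stdout.decode("utf-8", "replace")
    except (subprocess.TimeoutExpired, OSError):
        return None, ""


def run_all(commands, max_workers=20, timeout=60):
    """Run the commands concurrently, at most max_workers at a time.

    Results come back in the same order as the input commands.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda c: run_one(c, timeout), commands))


if __name__ == "__main__":
    # Hypothetical rows; in the real script they come from the urls table.
    rows = [(1, "example.com"), (2, "example.org")]
    commands = [["/home/wasif/public_html/phantomjs",
                 "/home/wasif/public_html/crawler2.js", domain, str(idd)]
                for idd, domain in rows]
    for (idd, domain), (code, out) in zip(rows, run_all(commands)):
        print(domain, "exit code:", code)
```

Passing the command as a list instead of a formatted string with shell=True also keeps a strange domain value from being interpreted by the shell. The worker count should be chosen with memory in mind, since every worker holds one PhantomJS process.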

Don't use PhantomJS if you don't need it. It renders everything, and it needs more than 50 MB of memory.
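
In the question, crawler2.js only sends a POST with url and id to the PHP endpoint and checks the status, so a plain HTTP client may be enough there. Below is a rough sketch of that idea in Python 3, assuming the endpoint needs no JavaScript rendering on the client side. The helper names `build_request` and `trigger_crawl` are hypothetical, and the endpoint is the placeholder URL from crawler2.js. Unlike crawler2.js, this version URL-encodes the form values.

```python
import urllib.parse
import urllib.request

# Placeholder endpoint copied from crawler2.js in the question.
CRAWLER_ENDPOINT = "http://www.EXAMPLE.net/main/crawler"


def build_request(domain, row_id, endpoint=CRAWLER_ENDPOINT):
    """Build the POST that crawler2.js sends: url=<address>&id=<id>."""
    address = "http://www." + domain
    body = urllib.parse.urlencode({"url": address, "id": row_id}).encode("ascii")
    return urllib.request.Request(endpoint, data=body, method="POST")


def trigger_crawl(domain, row_id, timeout=60):
    """Send the POST; return the HTTP status, or None if the request failed."""
    try:
        request = build_request(domain, row_id)
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return None
```

A function like `trigger_crawl` could be called from worker threads in place of spawning a browser process for each domain.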
