
I am trying to understand the open-source web crawler crawler4j, and I have a few questions, listed below.

Questions:

  1. What does StatisticsDB do in the Counters class? Please explain the following piece of code:

     public Counters(Environment env, CrawlConfig config) throws DatabaseException {
        super(config);
    
        this.env = env;
        this.counterValues = new HashMap<String, Long>();
    
        /*
         * When crawling is set to be resumable, we have to keep the statistics
         * in a transactional database to make sure they are not lost if crawler
         * is crashed or terminated unexpectedly.
         */
        if (config.isResumableCrawling()) {
            DatabaseConfig dbConfig = new DatabaseConfig();
            dbConfig.setAllowCreate(true);
            dbConfig.setTransactional(true);
            dbConfig.setDeferredWrite(false);
            statisticsDB = env.openDatabase(null, "Statistics", dbConfig);
    
            OperationStatus result;
            DatabaseEntry key = new DatabaseEntry();
            DatabaseEntry value = new DatabaseEntry();
            Transaction tnx = env.beginTransaction(null, null);
            Cursor cursor = statisticsDB.openCursor(tnx, null);
            result = cursor.getFirst(key, value, null);
    
            while (result == OperationStatus.SUCCESS) {
                if (value.getData().length > 0) {
                    String name = new String(key.getData());
                    long counterValue = Util.byteArray2Long(value.getData());
                    counterValues.put(name, counterValue);
                }
                result = cursor.getNext(key, value, null);
            }
            cursor.close();
            tnx.commit();
        }
    }
    

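As I understand it, it saves the crawled URLs, so that if the crawler crashes it does not have to start again from the beginning. Please explain the above code line by line.

2. I have not found any good link that explains SleepyCat, which crawler4j uses to store intermediate information. Please point me to some good resources for learning the basics of SleepyCat. (I don't know what Transaction and Cursor mean in the code above.)

Please help me. Looking forward to your kind reply.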


1 Answer


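Basically, crawler4j restores its existing statistics by reading every key/value pair from the database into memory. In fact, the code is slightly off: it opens a transaction even though it never modifies the database. The lines dealing with tnx could therefore be removed.

Line-by-line comments: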

    //Create a database configuration object
    DatabaseConfig dbConfig = new DatabaseConfig();
    //Set some parameters: allow creation, make the DB transactional and don't use deferred write
    dbConfig.setAllowCreate(true);
    dbConfig.setTransactional(true);
    dbConfig.setDeferredWrite(false);
    //Open the database called "Statistics" with the configuration created above
    statisticsDB = env.openDatabase(null, "Statistics", dbConfig);

    OperationStatus result;
    //Create new database entries to receive each key and value
    DatabaseEntry key = new DatabaseEntry();
    DatabaseEntry value = new DatabaseEntry();
    //Start a transaction
    Transaction tnx = env.beginTransaction(null, null);
    //Get a cursor on the DB
    Cursor cursor = statisticsDB.openCursor(tnx, null);
    //Position the cursor on the first key/value pair
    result = cursor.getFirst(key, value, null);
    //Loop for as long as the cursor returns a record
    while (result == OperationStatus.SUCCESS) {
        //If the value at the current cursor position is not empty, read the name and
        //the value of the counter and add them to the HashMap counterValues
        if (value.getData().length > 0) {
            String name = new String(key.getData());
            long counterValue = Util.byteArray2Long(value.getData());
            counterValues.put(name, counterValue);
        }
        //Move the cursor to the next key/value pair
        result = cursor.getNext(key, value, null);
    }
    //Release the cursor
    cursor.close();
    //Commit the transaction; nothing was written, so this only ends the transaction
    tnx.commit();
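
To make the loop in the walkthrough above more concrete without a Berkeley DB environment, here is a minimal, self-contained sketch of the same idea. A plain in-memory map stands in for the "Statistics" database, and the hypothetical helpers longToBytes and bytesToLong assume that each counter is stored as an 8-byte big-endian value. That encoding is an assumption for illustration, not a confirmed description of what Util.byteArray2Long does, and the counter names are only examples.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

class CounterCodecSketch {

    // Hypothetical helper. Assumption: a counter is stored as 8 big-endian bytes.
    static byte[] longToBytes(long value) {
        return ByteBuffer.allocate(Long.BYTES).putLong(value).array();
    }

    // Hypothetical counterpart of Util.byteArray2Long under the same assumption.
    static long bytesToLong(byte[] data) {
        return ByteBuffer.wrap(data).getLong();
    }

    // Stand-in for the cursor loop: 'stored' plays the role of the Statistics
    // database, and each map entry plays the role of one key/value record.
    static Map<String, Long> loadCounters(Map<byte[], byte[]> stored) {
        Map<String, Long> counterValues = new HashMap<>();
        for (Map.Entry<byte[], byte[]> entry : stored.entrySet()) {
            byte[] value = entry.getValue();
            // Same guard as the original code: skip records with an empty value.
            if (value.length > 0) {
                String name = new String(entry.getKey(), StandardCharsets.UTF_8);
                counterValues.put(name, bytesToLong(value));
            }
        }
        return counterValues;
    }

    public static void main(String[] args) {
        // Illustrative records only; the names and numbers are made up.
        Map<byte[], byte[]> stored = new LinkedHashMap<>();
        stored.put("Scheduled-Pages".getBytes(StandardCharsets.UTF_8), longToBytes(120L));
        stored.put("Processed-Pages".getBytes(StandardCharsets.UTF_8), longToBytes(87L));
        stored.put("Empty".getBytes(StandardCharsets.UTF_8), new byte[0]);
        System.out.println(loadCounters(stored));
    }
}
```

The point of the sketch is that the database holds counters (name to number), not URLs: after a restart, the constructor simply copies every stored counter back into the in-memory HashMap.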

I also answered a similar question here. Regarding SleepyCat, is this what you are talking about?

Answered 2013-06-07T08:54:07.390