web-crawler - What does StatisticsDB do in the Crawler4j open-source code?

Question

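I am trying to understand the open-source web crawler Crawler4j, and I have a few questions:

1. What does StatisticsDB do in the Counters class? Please explain the following piece of code: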

 public Counters(Environment env, CrawlConfig config) throws DatabaseException {
    super(config);

    this.env = env;
    this.counterValues = new HashMap<String, Long>();

    /*
     * When crawling is set to be resumable, we have to keep the statistics
     * in a transactional database to make sure they are not lost if crawler
     * is crashed or terminated unexpectedly.
     */
    if (config.isResumableCrawling()) {
        DatabaseConfig dbConfig = new DatabaseConfig();
        dbConfig.setAllowCreate(true);
        dbConfig.setTransactional(true);
        dbConfig.setDeferredWrite(false);
        statisticsDB = env.openDatabase(null, "Statistics", dbConfig);

        OperationStatus result;
        DatabaseEntry key = new DatabaseEntry();
        DatabaseEntry value = new DatabaseEntry();
        Transaction tnx = env.beginTransaction(null, null);
        Cursor cursor = statisticsDB.openCursor(tnx, null);
        result = cursor.getFirst(key, value, null);

        while (result == OperationStatus.SUCCESS) {
            if (value.getData().length > 0) {
                String name = new String(key.getData());
                long counterValue = Util.byteArray2Long(value.getData());
                counterValues.put(name, counterValue);
            }
            result = cursor.getNext(key, value, null);
        }
        cursor.close();
        tnx.commit();
    }
}

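As I understand it, it saves the crawled URLs, so that if the crawler crashes it does not have to start again from the beginning. Could you please explain the code above line by line?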

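2. I could not find any good links that explain SleepyCat, which Crawler4j uses to store intermediate information. Please point me to some good resources for learning the basics of SleepyCat. (I do not know what Transaction and Cursor mean in the code above.)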

Please help. I look forward to your reply.

score 1 · Accepted Answer

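Basically, Crawler4j restores its existing statistics (the named counter values, as the HashMap<String, Long> shows) by reading every key/value pair from the database. Strictly speaking, the code is slightly off: it opens a transaction but never modifies the database. The lines dealing with tnx could therefore be removed, with null passed to openCursor in place of tnx.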

Line-by-line comments:

// Create a database configuration object
DatabaseConfig dbConfig = new DatabaseConfig();
// Set some parameters: allow creation, make the database transactional, and do not use deferred writes
dbConfig.setAllowCreate(true);
dbConfig.setTransactional(true);
dbConfig.setDeferredWrite(false);
// Open the database called "Statistics" with the configuration created above
statisticsDB = env.openDatabase(null, "Statistics", dbConfig);

OperationStatus result;
// Create new, empty database entries to receive each key and value
DatabaseEntry key = new DatabaseEntry();
DatabaseEntry value = new DatabaseEntry();
// Start a transaction
Transaction tnx = env.beginTransaction(null, null);
// Open a cursor on the database
Cursor cursor = statisticsDB.openCursor(tnx, null);
// Position the cursor on the first key/value pair
result = cursor.getFirst(key, value, null);
// Loop for as long as the cursor returns a record
while (result == OperationStatus.SUCCESS) {
    // If the value at the current cursor position is not empty, read the name and the value of the counter and add them to the counterValues HashMap
    if (value.getData().length > 0) {
        String name = new String(key.getData());
        long counterValue = Util.byteArray2Long(value.getData());
        counterValues.put(name, counterValue);
    }
    // Move the cursor to the next key/value pair
    result = cursor.getNext(key, value, null);
}
// Close the cursor
cursor.close();
// Commit the transaction; nothing was modified here, so there are no changes to write to the DB
tnx.commit();
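
To see the shape of that loop without needing Berkeley DB on the classpath, here is a minimal stdlib-only sketch. It is not Crawler4j's code: a TreeMap stands in for the "Statistics" database, iterating over it stands in for the cursor, and the class name CounterLoadSketch, the helpers longToBytes and bytesToLong, and the sample counter names are all made up for illustration. It also assumes each counter is stored as an 8-byte big-endian long, which may not match what Util.byteArray2Long actually expects.

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Stdlib-only sketch: a sorted map plays the role of the "Statistics" database.
class CounterLoadSketch {

    // Assumed encoding (not taken from Crawler4j): 8-byte big-endian long.
    static byte[] longToBytes(long value) {
        return ByteBuffer.allocate(Long.BYTES).putLong(value).array();
    }

    // Inverse of longToBytes; expects at least 8 bytes of data.
    static long bytesToLong(byte[] data) {
        return ByteBuffer.wrap(data).getLong();
    }

    // Mirrors the cursor loop: visit every record in key order, skip
    // records whose value is empty, decode the rest into an in-memory map.
    static Map<String, Long> loadCounters(Map<String, byte[]> store) {
        Map<String, Long> counterValues = new HashMap<>();
        for (Map.Entry<String, byte[]> record : store.entrySet()) {
            byte[] value = record.getValue();
            if (value.length > 0) {
                counterValues.put(record.getKey(), bytesToLong(value));
            }
        }
        return counterValues;
    }

    public static void main(String[] args) {
        // Pretend these records were written before the crawler stopped.
        Map<String, byte[]> store = new TreeMap<>();
        store.put("Processed-Pages", longToBytes(42L));
        store.put("Scheduled-Pages", longToBytes(100L));
        store.put("Empty", new byte[0]);

        // On restart, the counters are rebuilt from the stored records.
        Map<String, Long> restored = loadCounters(store);
        System.out.println(restored.get("Processed-Pages"));
        System.out.println(restored.get("Scheduled-Pages"));
        System.out.println(restored.containsKey("Empty"));
    }
}
```

The point of the sketch is only that the constructor does a read-only pass over the stored records; nothing in that pass needs a write, which is why the transaction in the original looks superfluous.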

I also answered a similar question here. As for SleepyCat, is this what you are talking about?
