java - JSoup SocketTimeoutException and 404 HttpStatusException in a web crawler

Question

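I am trying to write a web crawler that fetches synonyms for certain words from a thesaurus website and then prints them to a text file. Seemingly at random, after crawling a few links, I get either a SocketTimeoutException or a 404 HttpStatusException.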

For context, my code uses a text file of links to supply the crawler with URLs.

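The pattern tends to be that these exceptions are thrown when three or more consecutive URLs contain words that cannot be found on the thesaurus site. Yes, I know this could be solved by simply removing the links for words that are not in the thesaurus, but my URL list is quite long, so locating and verifying which words are actually in the thesaurus is not feasible.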

    import java.io.*;
    import java.util.ArrayList;
    import java.util.Scanner;
    import org.jsoup.*;
    import org.jsoup.nodes.Document;

    public class ThesaurusSpider {
        private static final File urlList = new File("C:\\Users\\DaRkD0Ma1N\\Documents\\s.m.a.r.t\\generatedurls.txt");
        private static ArrayList<String> urlArrayList = new ArrayList<String>();

        // Reads one URL per line from the file into the static urlArrayList.
        public static void CreateUrlArray(ArrayList<String> urlArray, File urlList) throws FileNotFoundException {
            Scanner infile = new Scanner(urlList);
            while (infile.hasNext()) {
                urlArrayList.add(infile.nextLine());
            }
        }

        /*
         *   .userAgent("Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0")
         *   .referrer("http://www.google.com")
         *   .get();
         */
        public static void ExtractData(ArrayList<String> urlArrayList) throws IOException, InterruptedException {
            Document doc = Jsoup.connect("http://www.thesaurus.com/")
                    .userAgent("Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0")
                    .referrer("http://www.google.com")
                    .timeout(1000)
                    .get();
            File synonyms = new File("C:\\Users\\DaRkD0Ma1N\\Documents\\s.m.a.r.t\\generated_syns.txt");
            PrintWriter pw = new PrintWriter(synonyms);
            String test = doc.title();
            System.out.println(test);
            int counter = 0;
            try {
                for (String url : urlArrayList) {
                    // Pause briefly after every 30 processed pages.
                    if (counter == 30) {
                        Thread.sleep(500);
                        counter = 0;
                    } else {
                        Document word_doc = Jsoup.connect(url).get();
                        // Skip pages that report no results for the word.
                        if (word_doc.getElementById("words-gallery-no-results") != null || word_doc.select("class.no_results").hasText()) {
                            Thread.sleep(1000 * 2);
                            continue;
                        }
                        String[] title = word_doc.title().split(" ");
                        System.out.println(title[0]);
                        pw.write(title[0] + "\r\n");
                        if (word_doc.getElementById("synonyms-0") != null) {
                            System.out.println(word_doc.select("em.txt").get(1).text());
                            pw.write(word_doc.select("em.txt").get(1).text() + "\r\n");
                            System.out.print(word_doc.select("span.text").text() + " ");
                            pw.write(word_doc.select("span.text").text() + " " + "\r\n");
                        }
                        System.out.println("");
                        counter++;
                    }
                }
            } catch (HttpStatusException e) {
                e.printStackTrace();
                pw.close();
            }
        }

        public static void PrintArrayList(ArrayList<String> list) {
            System.out.println(list);
        }

        public static void main(String[] args) throws IOException, HttpStatusException, InterruptedException {
            CreateUrlArray(urlArrayList, urlList);
            PrintArrayList(urlArrayList);
            ExtractData(urlArrayList);
        }
    }

The links look like this in the text file:
http://www.thesaurus.com/browse/Abby?s=t
http://www.thesaurus.com/browse/abdicate?s=t
" "
" "
" "
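The links are a collection of words in alphabetical order. Some of the words can be found in the thesaurus and some cannot. I have a loop that is supposed to catch and skip the links for words that are not in the thesaurus, but I guess it is not catching all of the bad links.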
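For illustration, here is a minimal sketch of how a per-URL skip could be structured so that one bad link does not end the whole crawl. It is not a confirmed fix. The names `SkippingCrawler`, `PageFetcher`, `crawl`, `maxAttempts` and `pauseMillis` are hypothetical and do not come from jsoup. The sketch assumes that timeouts are worth retrying a few times and that any other `IOException` (jsoup's `HttpStatusException` is a subclass of `IOException`) means the URL should be skipped. The demo uses a stub instead of the network; with jsoup, the fetcher could be something like `url -> Jsoup.connect(url).timeout(10_000).get().title()`.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SkippingCrawler {

    /** Hypothetical abstraction: download one URL and return the text of interest. */
    @FunctionalInterface
    public interface PageFetcher {
        String fetch(String url) throws IOException;
    }

    /**
     * Fetches every URL independently. Timeouts are retried up to maxAttempts
     * times with a growing pause; any other IOException skips the URL at once.
     * URLs that were given up on are appended to the skipped list.
     */
    public static Map<String, String> crawl(List<String> urls, PageFetcher fetcher,
                                            int maxAttempts, long pauseMillis,
                                            List<String> skipped) {
        Map<String, String> pages = new LinkedHashMap<>();
        for (String url : urls) {
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                try {
                    pages.put(url, fetcher.fetch(url));
                    break; // success, move on to the next URL
                } catch (SocketTimeoutException e) {
                    if (attempt == maxAttempts) {
                        skipped.add(url); // out of attempts, give up on this URL only
                    } else {
                        try {
                            Thread.sleep(pauseMillis * attempt); // back off before retrying
                        } catch (InterruptedException ie) {
                            Thread.currentThread().interrupt();
                            return pages; // stop early, keep what was collected
                        }
                    }
                } catch (IOException e) {
                    // Assumed permanent (for example a 404): skip without retrying.
                    skipped.add(url);
                    break;
                }
            }
        }
        return pages;
    }

    public static void main(String[] args) {
        // Stub fetcher standing in for the real network call.
        PageFetcher stub = url -> {
            if (url.contains("missing")) {
                throw new IOException("HTTP error fetching URL (stubbed 404)");
            }
            return "title of " + url;
        };
        List<String> skipped = new ArrayList<>();
        Map<String, String> pages = crawl(
                Arrays.asList("browse/abdicate", "browse/missing", "browse/able"),
                stub, 3, 100L, skipped);
        System.out.println("fetched: " + pages.keySet());
        System.out.println("skipped: " + skipped);
    }
}
```

The point of the design is that the `try`/`catch` sits inside the loop, so an exception for one link affects only that link. The synonyms would then be written out from the returned map, or directly inside the loop after a successful fetch.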

I have been banging my head against the wall on this one, so any help or advice would be appreciated.
