java - Web crawler in Java: problem downloading web pages

Question

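I am trying to develop a small web crawler that downloads a web page and searches a specific section of it for links. But when I run this code, the links in the "href" attributes come out shortened. For example: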

original link : "/kids-toys-action-figures-accessories/b/ref=toys_hp_catblock_actnfigs?ie=UTF8&node=165993011&pf_rd_m=ATVPDKIKX0DER&pf_rd_s=merchandised-search-4&pf_rd_r=267646F4BB25430BAD0D&pf_rd_t=101&pf_rd_p=1582921042&pf_rd_i=165793011"

becomes: "/kids-toys-action-figures-accessories/b?ie=UTF8&node=165993011"

Can anyone help me? My code is below:

package test;
import java.io.*;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.*;
public class myFirstWebCrawler {

public static void main(String[] args)  {

    String strTemp = "";
    String dir="d:/files/";
    String filename="hello.txt";
    String fullname=dir+filename;

    try {
        URL my_url = new URL("http://www.amazon.com/s/ref=lp_165993011_ex_n_1?rh=n%3A165793011&bbn=165793011&ie=UTF8&qid=1376550433");
        BufferedReader br = new BufferedReader(new InputStreamReader(my_url.openStream(),"utf-8"));
        createdir(dir);
        while(null != (strTemp = br.readLine())){
            writetofile(fullname,strTemp);
            System.out.println(strTemp);
        }
        System.out.println("index of feature category : "  +  readfromfile(fullname,"Featured Categories"));
    } catch (Exception ex) {
        ex.printStackTrace();
    }

}


public static void createdir(String dirname)
{
    File d = new File(dirname);
    d.mkdirs();
}

public static void writetofile(String path, String bbyte)
{
    // Append one line to the file; try-with-resources closes the writer even if the write fails.
    try (BufferedWriter bufferedWriter = new BufferedWriter(new FileWriter(path,true)))
    {
        bufferedWriter.write(bbyte);
        bufferedWriter.newLine();
    }
    catch(IOException e)
    {
        System.out.println("Error: " + e.getMessage());
    }
}

public static int readfromfile(String path, String key)
{
    // Lines that contain the key are also appended to this second file.
    String dir="d:/files/";
    String filename="hello1.txt";
    String fullname=dir+filename;
    int index = -1;
    // try-with-resources closes the reader on every exit path.
    try (BufferedReader bf = new BufferedReader(new FileReader(path)))
    {
        String currentLine;
        while((currentLine = bf.readLine()) != null)
        {
            index=currentLine.indexOf(key);
            if(index >= 0)   // indexOf returns 0 when the key starts the line
            {
                writetofile(fullname,currentLine);
                int count=0;
                int lastIndex=0;
                final String marker="href=\"";
                while((lastIndex=currentLine.indexOf(marker,lastIndex)) != -1)
                {
                    lastIndex+=marker.length();
                    // Find the closing quote of the attribute value.
                    int end=currentLine.indexOf('"',lastIndex);
                    if(end == -1)
                    {
                        break;   // unterminated value: stop scanning this line
                    }
                    System.out.println(currentLine.substring(lastIndex,end));
                    count++;
                    lastIndex=end+1;
                }
                System.out.println("\n count : " + count);
                return index;
            }
        }
    }
    catch(IOException e)
    {
        // Also covers FileNotFoundException, which is a subclass of IOException.
        e.printStackTrace();
        System.out.println("Error");
    }
    return index;
}
}

score 0 · Accepted Answer

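This looks to me like a case where the server application responds differently to requests from a desktop browser and from your Java-based crawler. That could be because your browser sends cookies with its request that your Java-based crawler does not (session-persistence cookies, for example), or because your desktop browser sends a different User-Agent header than the crawler does, or because some other request header differs between your desktop browser and the Java crawler.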
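
One way to test this hypothesis is to make the crawler send the same headers the browser sends. The following is only a minimal sketch: the class name BrowserLikeFetch and the helper open are made up for illustration, and the User-Agent string and Accept-Language value are placeholders that should be replaced with whatever your own browser actually sends (visible in its developer tools). Whether this changes the links the server returns is not guaranteed.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

public class BrowserLikeFetch {

    // Hypothetical helper: opens a connection and sets browser-like request headers.
    // Nothing is sent over the network until the response stream is requested.
    static URLConnection open(String address, String userAgent, String cookie) throws IOException {
        URLConnection conn = new URL(address).openConnection();
        conn.setRequestProperty("User-Agent", userAgent);
        conn.setRequestProperty("Accept-Language", "en-US,en;q=0.8"); // placeholder value
        if (cookie != null && !cookie.isEmpty()) {
            conn.setRequestProperty("Cookie", cookie);
        }
        return conn;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder User-Agent: copy the real one from your browser.
        String userAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
                + "(KHTML, like Gecko) Chrome/28.0.1500.95 Safari/537.36";
        URLConnection conn = open(
                "http://www.amazon.com/s/ref=lp_165993011_ex_n_1?rh=n%3A165793011&bbn=165793011&ie=UTF8&qid=1376550433",
                userAgent,
                null); // pass a cookie string copied from the browser to test the cookie theory
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = br.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```

In the question's code this would replace the call to my_url.openStream(), which sends Java's default User-Agent and no cookies.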

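When writing a crawler, this is one of the biggest problems you run into: it is easy to forget that the same URL requested by different clients will not always be answered with the same markup. I am not sure that this is what is happening to you here, but it is very common.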
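
To check whether two clients really received different markup, you could save the page source from the browser and the page downloaded by the crawler, then compare the links found in each. The sketch below assumes both responses are already available as strings; the class name HrefDiff, its methods, and the sample snippets in main are invented for illustration. It only recognises double-quoted href values, the same simplification the question's code makes.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HrefDiff {

    // Matches href="..." and captures the value between the double quotes.
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]*)\"");

    // Collects every double-quoted href value, in order of first appearance.
    static Set<String> extractHrefs(String html) {
        Set<String> links = new LinkedHashSet<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    // Returns the links present in the first page but absent from the second.
    static Set<String> onlyInFirst(String firstHtml, String secondHtml) {
        Set<String> result = extractHrefs(firstHtml);
        result.removeAll(extractHrefs(secondHtml));
        return result;
    }

    public static void main(String[] args) {
        // Made-up snippets standing in for the two saved responses.
        String fromBrowser = "<a href=\"/b/ref=x?ie=UTF8&node=1&pf_rd_p=2\">Toys</a>";
        String fromCrawler = "<a href=\"/b?ie=UTF8&node=1\">Toys</a>";
        System.out.println("Only in browser copy: " + onlyInFirst(fromBrowser, fromCrawler));
    }
}
```

If the set is non-empty for pages fetched at about the same time, the server is probably tailoring its response to the client rather than the crawler truncating anything.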
