Before asking this question I tried several different approaches and searched Google for pointers. I also checked StackOverflow and could not find a solution.

Basically, I want to build a tool that returns data for a given URL and XPath, for example:

URL:        http://www.google.co.uk/search?q=wicked+games
XPath:      id('rso')/li/div/h3/a

This should return these results:

http://puu.sh/3V4JG.jpg

I can parse XML from other URLs, for example when I fetch an actual XML file such as http://renualsoft.com/jordon/person.xml, but I am not sure how to do the same for Google.

I tried this:

    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    factory.setNamespaceAware(true);
    DocumentBuilder builder;
    Document doc = null;
    XPathExpression expr = null;
    builder = factory.newDocumentBuilder();
    doc = builder.parse("http://www.google.co.uk/search?q=wicked+games");
    XPathFactory xFactory = XPathFactory.newInstance();
    XPath xpath = xFactory.newXPath();

    expr = xpath.compile("id('rso')/li/div/h3/a/@href");
    Object result = expr.evaluate(doc, XPathConstants.NODESET);
    NodeList nodes = (NodeList) result;
    for (int i = 0; i < nodes.getLength(); i++) {
        System.out.println(nodes.item(i).getNodeValue());
    }

But I get this exception:

Exception in thread "main" java.io.IOException: Server returned HTTP response code: 403 for URL: http://www.google.co.uk/search?q=wicked+games
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1625)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.setupCurrentEntity(XMLEntityManager.java:633)
    at com.sun.org.apache.xerces.internal.impl.XMLVersionDetector.determineDocVersion(XMLVersionDetector.java:189)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:799)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:764)
    at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:123)
    at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:237)
    at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:300)
    at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:177)
    at NewEmptyJUnitTest.query(NewEmptyJUnitTest.java:35)
    at NewEmptyJUnitTest.main(NewEmptyJUnitTest.java:77)
Java Result: 1

Any help or guidance would be much appreciated. I have looked for answers elsewhere but, as I said, I could not find anything useful.

1 Answer

Would something like HtmlUnit work for you?

import java.io.IOException;
import java.net.MalformedURLException;
import java.util.List;

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

class Example
{
    public static void main(final String args[]) throws FailingHttpStatusCodeException, MalformedURLException, IOException
    {
        final WebClient webClient = new WebClient(BrowserVersion.FIREFOX_17);
        webClient.getOptions().setCssEnabled(false);

        final HtmlPage page = webClient.getPage("http://www.google.co.uk/search?q=wicked+games");

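        // Select result links under the list whose id is "rso"; a bare ['rso'] predicate is always true.
        final List<?> byXPath = page.getByXPath("//ol[@id='rso']//h3/a");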

        for (final Object object : byXPath)
        {
            System.out.println(((HtmlAnchor) object).getTextContent());
        }
    }
}

This prints:

Chris Isaak - Wicked Game - YouTube
The Weeknd - Wicked Games (Explicit) - YouTube
Emika - Wicked Game - YouTube
Wicked Game - Wikipedia, the free encyclopedia
THE WEEKND - WICKED GAMES LYRICS
THE WEEKND LYRICS - Wicked Games - A-Z Lyrics
The Weeknd – Wicked Games Lyrics | Rap Genius
Chris Isaak - Wicked Game - Video Dailymotion
Wicked Game | Chris Isaak | Music Video | MTV
Wicked Games
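
To restrict the matches to the results list, the predicate on `ol` has to test the `id` attribute. In XPath 1.0 a predicate that is only a non-empty string literal, such as `['rso']`, converts to true and filters nothing. The following is a minimal sketch of that difference using the JDK's own XPath API; the `SAMPLE` markup and the `PredicateDemo` class are hypothetical stand-ins, not real Google output.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class PredicateDemo {
    // Hypothetical, well-formed stand-in for a results page (not real Google markup).
    static final String SAMPLE =
        "<body>"
        + "<ol id='nav'><li><h3><a href='/n'>Navigation</a></h3></li></ol>"
        + "<ol id='rso'><li><h3><a href='/r'>Result</a></h3></li></ol>"
        + "</body>";

    // Counts the nodes that the given XPath expression selects in SAMPLE.
    static int count(String expression) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(
            expression, new InputSource(new StringReader(SAMPLE)), XPathConstants.NODESET);
        return nodes.getLength();
    }

    public static void main(String[] args) throws XPathExpressionException {
        // A bare string predicate is always true, so anchors under both lists match.
        System.out.println(count("//ol['rso']//h3/a"));
        // The attribute test keeps only the list with id="rso".
        System.out.println(count("//ol[@id='rso']//h3/a"));
    }
}
```

On the sample markup the first expression selects two anchors and the second selects one.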

Maven dependency:

<dependency>
    <groupId>net.sourceforge.htmlunit</groupId>
    <artifactId>htmlunit</artifactId>
    <version>2.12</version>
</dependency>
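
As for the 403 in the question: the stack trace shows `DocumentBuilder.parse` opening the URL itself through `HttpURLConnection`, which sends Java's default User-Agent. My guess, not something I have confirmed, is that Google rejects that agent, and that HtmlUnit avoids the problem by identifying itself as a browser. A rough sketch of setting the header by hand follows; the `UserAgentDemo` class, the `prepare` helper, and the User-Agent string are assumptions made for illustration.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

class UserAgentDemo {
    // Assumed browser-like value; any realistic User-Agent string could be used instead.
    static final String USER_AGENT =
        "Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20100101 Firefox/17.0";

    // Prepares, but does not yet send, a request that carries an explicit User-Agent.
    static HttpURLConnection prepare(String address) throws IOException {
        HttpURLConnection connection = (HttpURLConnection) new URL(address).openConnection();
        connection.setRequestProperty("User-Agent", USER_AGENT);
        return connection;
    }

    public static void main(String[] args) throws IOException {
        HttpURLConnection connection = prepare("http://www.google.co.uk/search?q=wicked+games");
        // The request is only sent when the response is first read.
        System.out.println(connection.getResponseCode());
    }
}
```

Even if the request then succeeds, the returned HTML is unlikely to be well-formed XML, so a strict XML parser may still fail on it. That is why an HTML-aware library such as HtmlUnit is the more practical route here.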
answered 2013-08-06T11:35:46.433