
I am trying to write my own Nutch plugin for crawling webpages. The problem is that I need to identify whether a particular special tag is present on the webpage. The official documentation notes that this should be possible using Document.getElementsByTagName("foo"), but it is not working for me. Do you have any idea?

My second question: once I have identified the tag above, I would like to extract some other tags from the same webpage. Is there any way to store the complete source code of the page at the moment it is crawled?

Thanks, Jan.


1 Answer


If you want to extract content based on an HTML tag, have a look at the xpath-filter plugin: http://www.atlantbh.com/precise-data-extraction-with-apache-nutch/ You write an XPath query and configure it in the plugin to extract the information you need.
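To make the XPath approach concrete, here is a minimal, self-contained sketch of the kind of query involved. The class name, sample page snippet, and query are my own illustration using the standard javax.xml.xpath API, not the xpath-filter plugin's actual configuration format (see the linked article for that):

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class XPathExample {
    // Parse a (well-formed) page snippet and evaluate an XPath expression on it
    static String extract(String html, String expr) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(html)));
        return XPathFactory.newInstance().newXPath().evaluate(expr, doc);
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical page fragment; a real crawled page would come from Nutch
        String html = "<html><body><div id=\"content\">Hello</div></body></html>";
        System.out.println(extract(html, "//div[@id='content']/text()"));
    }
}
```

A query like `//div[@id='content']/text()` is the sort of expression you would configure in the plugin to pull out exactly the element you need.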

Another option is to write a plugin (as you are doing now) and use an HTML/XML parser to pull the information out. Here's what I did when I needed to extract content from a specific div:

  // Requires Jsoup on the plugin's classpath, plus the usual Nutch/Hadoop imports
  // (org.apache.nutch.indexer.*, org.apache.hadoop.io.Text, org.jsoup.Jsoup, ...)
  @Override
  public NutchDocument filter(NutchDocument doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks) throws IndexingException {

        // "fullcontent" is a parse-metadata field holding the raw page source;
        // it must be populated by a parse filter earlier in the pipeline
        Metadata metadata = parse.getData().getParseMeta();
        String fullContent = metadata.get("fullcontent");
        if (fullContent == null) {
            return doc; // no stored source for this page
        }

        // Parse the stored source with Jsoup and select <div id="content">
        Document document = Jsoup.parse(fullContent);
        Element contentwrapper = document.select("div#content").first();

        // Guard against pages without the div to avoid a NullPointerException
        if (contentwrapper != null) {
            doc.add("contentwrapper", contentwrapper.text());
        }

        return doc;
  }
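On the first question: Document.getElementsByTagName does work once you have a parsed org.w3c.dom.Document; the usual catch is getting real-world HTML parsed at all, since the standard XML parser rejects malformed markup (lenient parsers such as Jsoup, TagSoup, or NekoHTML handle that). A minimal standalone sketch with a hypothetical well-formed snippet:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class TagCheck {
    // Count occurrences of a tag in a (well-formed) page snippet
    static int count(String html, String tag) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(html)));
        return doc.getElementsByTagName(tag).getLength();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical snippet containing the special tag being searched for
        String html = "<html><body><foo>bar</foo></body></html>";
        System.out.println(count(html, "foo") > 0 ? "tag present" : "tag absent");
    }
}
```

With Jsoup the equivalent check is simply `!Jsoup.parse(html).select("foo").isEmpty()`, which also tolerates broken HTML.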
answered 2013-04-01T12:27:29.390