
I am experimenting with Nutch. I am trying to write something that also detects specific nodes in the DOM structure and extracts text from around each node, e.g. text from parent nodes, sibling nodes, etc. I researched and read some examples, and then tried writing a plugin that does this for an image node. Some of the code:

    if ("img".equalsIgnoreCase(nodeName) && nodeType == Node.ELEMENT_NODE) {
        // For the sake of simpler code, default values are set to
        // avoid a NullPointerException in the findMatches method.
        String imageUrl = "No Url";
        String altText = "No Text";
        String imageName = "No Image Name";

        NamedNodeMap attributes = currentNode.getAttributes();
        List<String> ParentNodesText = getSurroundingText(currentNode);

        // Analyze the attribute values inside the img node: <img src="xxx" alt="myPic">
        for (int i = 0; i < attributes.getLength(); i++) {
            Attr attr = (Attr) attributes.item(i);
            if ("src".equalsIgnoreCase(attr.getName())) {
                imageUrl = getImageUrl(base, attr);
                imageName = getImageName(imageUrl);
            } else if ("alt".equalsIgnoreCase(attr.getName())) {
                altText = attr.getValue().toLowerCase();
            }
        }

    private List<String> getSurroundingText(Node currentNode) {
        List<String> SurroundingText = new ArrayList<String>();
        // Walk up the ancestor chain, starting at the given node.
        while (currentNode != null) {
            if (currentNode.getNodeType() == Node.TEXT_NODE) {
                String text = currentNode.getNodeValue().trim();
                SurroundingText.add(text.toLowerCase());
            }

            // Only the immediately preceding sibling is inspected, and only if it is a text node.
            Node previous = currentNode.getPreviousSibling();
            if (previous != null && previous.getNodeType() == Node.TEXT_NODE) {
                String text = previous.getNodeValue().trim();
                SurroundingText.add(text.toLowerCase());
            }
            currentNode = currentNode.getParentNode();
        }
        return SurroundingText;
    }

This doesn't seem to work properly. The img tag gets detected, and the image name and URL get retrieved, but nothing more. The getSurroundingText method looks too ugly; I tried but couldn't improve it. I don't have a clear idea of where and how I can extract text that could be related to the image. Any help, please?


1 Answer


You're on the right track. However, take a look at this example HTML:

    <div>
       <span>test1</span>
       <img src="http://example.com" alt="test image" title="awesome title">
       <span>test2</span>
    </div>

In your case, I think the problem lies in the sibling nodes of the img node. You're looking at the direct siblings, and you may think that in the previous example these would be the span nodes. In fact, they are whitespace-only text nodes (the line breaks and indentation between the tags), so when you ask for the sibling of the img you get one of these empty nodes with no actual text.

If we rewrite the previous HTML on a single line as: <div><span>test1</span><img src="http://example.com" alt="test image" title="awesome title"><span>test2</span></div> then the sibling nodes of the img are the span nodes that you want.
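The difference between the two layouts can be checked with a small standalone sketch. It is not Nutch code: it assumes the JDK's XML parser as a stand-in for the DOM that Nutch hands to a plugin, so the img is written as a self-closing tag to keep the snippet well-formed. The class and method names (WhitespaceSiblingDemo, siblingsOfImg, describe) are made up for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

class WhitespaceSiblingDemo {

    // Assumption: the JDK XML parser stands in for Nutch's HTML parser here.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    // Labels a node by type; text nodes are marked blank when they hold only whitespace.
    static String describe(Node n) {
        if (n == null) {
            return "NONE";
        }
        if (n.getNodeType() == Node.TEXT_NODE) {
            return n.getNodeValue().trim().isEmpty() ? "TEXT(blank)" : "TEXT";
        }
        if (n.getNodeType() == Node.ELEMENT_NODE) {
            return "ELEMENT(" + n.getNodeName() + ")";
        }
        return "OTHER";
    }

    // Returns "<previous sibling> | <next sibling>" for the first img in the snippet.
    static String siblingsOfImg(String xml) throws Exception {
        Node img = parse(xml).getElementsByTagName("img").item(0);
        return describe(img.getPreviousSibling()) + " | " + describe(img.getNextSibling());
    }

    public static void main(String[] args) throws Exception {
        String indented = "<div>\n   <span>test1</span>\n"
                + "   <img src=\"http://example.com\" alt=\"test image\"/>\n"
                + "   <span>test2</span>\n</div>";
        String compact = "<div><span>test1</span>"
                + "<img src=\"http://example.com\" alt=\"test image\"/>"
                + "<span>test2</span></div>";

        System.out.println(siblingsOfImg(indented)); // TEXT(blank) | TEXT(blank)
        System.out.println(siblingsOfImg(compact));  // ELEMENT(span) | ELEMENT(span)
    }
}
```

With the indented markup, both direct siblings are whitespace-only text nodes, which is why trimming their value yields an empty string.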

I'm assuming that in the previous example you want to get both "test1" and "test2". In that case, you need to keep moving along the siblings until you find a Node.ELEMENT_NODE and then fetch the text inside that node. A good practice is not to grab everything you find, but to limit your scope to p, span, and div elements to improve accuracy.
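One way to sketch this idea is shown below. It is only an illustration under a few assumptions: the JDK XML parser again stands in for the DOM that a Nutch plugin receives, the whitelist contains just the three tags mentioned above, and the names SurroundingTextSketch, nearestElement, surroundingText, and firstImg are hypothetical. Tag names are lowercased before the whitelist check, in the same spirit as the equalsIgnoreCase calls in the question.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;

class SurroundingTextSketch {

    // Assumed whitelist of tags whose text is considered related to the image.
    private static final Set<String> ALLOWED =
            new HashSet<>(Arrays.asList("p", "span", "div"));

    // Moves along the siblings in one direction until an element node is found.
    static Node nearestElement(Node start, boolean forward) {
        Node n = forward ? start.getNextSibling() : start.getPreviousSibling();
        while (n != null && n.getNodeType() != Node.ELEMENT_NODE) {
            n = forward ? n.getNextSibling() : n.getPreviousSibling();
        }
        return n;
    }

    // Collects the text of the nearest whitelisted element before and after the node.
    static List<String> surroundingText(Node img) {
        List<String> result = new ArrayList<>();
        for (boolean forward : new boolean[] {false, true}) {
            Node sibling = nearestElement(img, forward);
            if (sibling == null) {
                continue;
            }
            String name = sibling.getNodeName().toLowerCase(Locale.ROOT);
            if (ALLOWED.contains(name)) {
                String text = sibling.getTextContent().trim();
                if (!text.isEmpty()) {
                    result.add(text.toLowerCase(Locale.ROOT));
                }
            }
        }
        return result;
    }

    // Helper for the demo only: parses a well-formed snippet and returns its first img.
    static Node firstImg(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                .getElementsByTagName("img")
                .item(0);
    }

    public static void main(String[] args) throws Exception {
        String html = "<div>\n   <span>test1</span>\n"
                + "   <img src=\"http://example.com\" alt=\"test image\"/>\n"
                + "   <span>test2</span>\n</div>";
        System.out.println(surroundingText(firstImg(html))); // [test1, test2]
    }
}
```

The sketch stops at the first element in each direction and does not climb to parent nodes; whether to also walk up the tree, as the question's getSurroundingText does, is a separate design choice.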

answered 2017-04-28T11:57:55.430