I am experimenting with Nutch. I am trying to write a plugin that detects specific nodes in the DOM structure and extracts text from around each node, e.g. text from parent nodes, sibling nodes, etc. I researched and read some examples, and then tried writing a plugin that does this for an image node. Here is some of the code:
if ("img".equalsIgnoreCase(nodeName) && nodeType == Node.ELEMENT_NODE) {
    // Default values are set to keep the code simple and to avoid a
    // NullPointerException in the findMatches method.
    String imageUrl = "No Url";
    String altText = "No Text";
    String imageName = "No Image Name";
    NamedNodeMap attributes = currentNode.getAttributes();
    List<String> parentNodesText = getSurroundingText(currentNode);
    // Analyze the attribute values inside the img node, e.g. <img src="xxx" alt="myPic">
    for (int i = 0; i < attributes.getLength(); i++) {
        Attr attr = (Attr) attributes.item(i);
        if ("src".equalsIgnoreCase(attr.getName())) {
            imageUrl = getImageUrl(base, attr);
            imageName = getImageName(imageUrl);
        } else if ("alt".equalsIgnoreCase(attr.getName())) {
            altText = attr.getValue().toLowerCase();
        }
    }
    // ... (rest of the img handling not shown)
}
private List<String> getSurroundingText(Node currentNode) {
    List<String> surroundingText = new ArrayList<String>();
    // Walk from the img node up to the root, one ancestor per iteration.
    while (currentNode != null) {
        // Text of the current node itself, if it is a text node.
        if (currentNode.getNodeType() == Node.TEXT_NODE) {
            String text = currentNode.getNodeValue().trim();
            surroundingText.add(text.toLowerCase());
        }
        // Text of the immediately preceding sibling, if it is a text node.
        Node previous = currentNode.getPreviousSibling();
        if (previous != null && previous.getNodeType() == Node.TEXT_NODE) {
            String text = previous.getNodeValue().trim();
            surroundingText.add(text.toLowerCase());
        }
        currentNode = currentNode.getParentNode();
    }
    return surroundingText;
}
This doesn't seem to work properly. The img tag gets detected, and the image name and URL get retrieved, but nothing else useful comes back. The getSurroundingText method also looks too ugly; I tried but couldn't improve it. I don't have a clear idea of where and how I can extract text that could be related to the image. Any help, please?
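To make the goal more concrete, here is a rough sketch of the kind of traversal I have in mind, written against plain org.w3c.dom and not tied to Nutch. It is only an illustration under a few assumptions: the class name SurroundingText, the methods collect and parse, and the maxLevels parameter are all made up for this example, and it assumes the DOM implementation supports Node.getTextContent() (DOM Level 3). Instead of looking only at the immediately preceding text node, it gathers the text of all previous and next siblings at each level, and stops climbing after a fixed number of ancestors.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class SurroundingText {

    /**
     * Collects non-empty, lower-cased text found in the siblings of the node
     * and of its ancestors, climbing at most maxLevels levels.
     * (Hypothetical helper, not part of any Nutch API.)
     */
    public static List<String> collect(Node node, int maxLevels) {
        List<String> result = new ArrayList<>();
        Node current = node;
        for (int level = 0; level < maxLevels && current != null; level++) {
            // Siblings before the current node, nearest first.
            for (Node s = current.getPreviousSibling(); s != null; s = s.getPreviousSibling()) {
                addText(s, result);
            }
            // Siblings after the current node, nearest first.
            for (Node s = current.getNextSibling(); s != null; s = s.getNextSibling()) {
                addText(s, result);
            }
            current = current.getParentNode();
        }
        return result;
    }

    /** Adds the normalized text of a text or element node, skipping blank text. */
    private static void addText(Node n, List<String> out) {
        short type = n.getNodeType();
        if (type != Node.TEXT_NODE && type != Node.ELEMENT_NODE) {
            return; // ignore comments, processing instructions, etc.
        }
        String text = n.getTextContent();
        if (text == null) {
            return;
        }
        text = text.trim().replaceAll("\\s+", " ").toLowerCase();
        if (!text.isEmpty()) {
            out.add(text);
        }
    }

    /** Parses a small well-formed snippet so the sketch can be tried standalone. */
    public static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(
            "<div><p>Caption Above</p><span> <img src=\"a.png\" alt=\"x\"/> </span><p>Below</p></div>");
        Node img = doc.getElementsByTagName("img").item(0);
        System.out.println(collect(img, 2));
    }
}
```

With maxLevels set to 2, this looks at the siblings of the img and then at the siblings of its parent. Note that getTextContent() on an element returns all descendant text, so script or style content would be included unless it is filtered out separately. I am not sure whether something like this is the right direction or whether there is a better way to decide which text is actually related to the image.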