java - Unable to retrieve web data in Java using tidy and XPath

Question

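What I want to do is extract a simple piece of inner HTML from an XHTML file. I have narrowed the search down to the element node, but I cannot retrieve its content.

Please note: the element node has no child nodes. When I try this, I get a NullPointerException.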

Here is the HTML snippet:

    <div id="dvTitle" class="titlebtmbrdr01" style="line-height: 22px;">BAJAJ AUTO LTD.       </div>

Also note that this file's namespace is http://www.w3.org/1999/xhtml

As you can see, the div element I want contains BAJAJ AUTO LTD.

Here is the code I am using:

    import java.io.IOException;
    import java.net.MalformedURLException;
    import java.net.URL;
    import java.util.Vector;

    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathConstants;
    import javax.xml.xpath.XPathExpression;
    import javax.xml.xpath.XPathExpressionException;
    import javax.xml.xpath.XPathFactory;

    import jxl.read.biff.BiffException;
    import jxl.write.WriteException;
    import jxl.write.biff.RowsExceededException;

    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.Node;
    import org.w3c.dom.NodeList;
    import org.w3c.dom.Text;

    import com.sun.org.apache.xml.internal.serialize.Serializer;

    // valueExtractor, MynamespaceContext and DOMParser are not shown in the question.
    public class BSEQuotesExtractor implements valueExtractor {

        @Override
        public Vector<String> getName(Document d) throws XPathExpressionException,
                RowsExceededException, BiffException, WriteException, IOException {
            XPathFactory factory = XPathFactory.newInstance();
            XPath xpath = factory.newXPath();
            xpath.setNamespaceContext(new MynamespaceContext());

            // Select the element whose id attribute is "dvTitle".
            Object result = xpath.evaluate("//*[@id='dvTitle']", d, XPathConstants.NODESET);
            NodeList nodes = (NodeList) result;

            // Print the match count, the element name, its second attribute, and its text.
            System.out.println(nodes.getLength());
            System.out.println(nodes.item(0).getNodeName());
            System.out.println(nodes.item(0).getAttributes().item(1).getNodeName());
            System.out.println(nodes.item(0).getAttributes().item(1).getNodeValue());
            System.out.println(nodes.item(0).getTextContent());

            return null;
        }

        public static void main(String[] args) throws MalformedURLException, IOException,
                XPathExpressionException, RowsExceededException, BiffException, WriteException {
            BSEQuotesExtractor q = new BSEQuotesExtractor();
            DOMParser parser = new DOMParser(new URL("http://www.bseindia.com/bseplus/StockReach/StockQuote/Equity/BAJAJ%20AUTO%20LTD/BAJAJAUT/532977/Scrips").openStream());
            Document d = parser.getDocument();
            q.getName(d);
        }
    }

This is the output I get:

    1
    div
    dvTitle
    null

Now why am I getting that null value? I should be getting BAJAJ AUTO LTD.

score 1 · Accepted Answer

When I open the page your code references, the div is actually empty for me:

    <div class="titlebtmbrdr01" id="dvTitle" style="line-height: 22px;"></div>
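To see how the same XPath behaves on the two variants of the div, here is a minimal sketch that runs it against in-memory markup. It assumes the standard JDK parser (`DocumentBuilderFactory`) rather than the tidy-based `DOMParser` from the question, so results with tidy's DOM may differ; `TitleProbe` and `titleOf` are hypothetical names used only for illustration.

```java
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of the asker's code.
final class TitleProbe {

    /** Returns the trimmed text of the element with id="dvTitle", or null if no such element exists. */
    static String titleOf(String xhtml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true); // the page declares the XHTML namespace
            Document doc = dbf.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            // '*' matches elements in any namespace, so no NamespaceContext is needed here.
            Node div = (Node) xpath.evaluate("//*[@id='dvTitle']", doc, XPathConstants.NODE);
            return div == null ? null : div.getTextContent().trim();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse or query the document", e);
        }
    }

    public static void main(String[] args) {
        String open = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
                + "<div id=\"dvTitle\" class=\"titlebtmbrdr01\">";
        String close = "</div></body></html>";
        // Variant from the question: the div contains the company name.
        System.out.println("[" + titleOf(open + "BAJAJ AUTO LTD.       " + close) + "]");
        // Variant seen when fetching the page: the div is empty.
        System.out.println("[" + titleOf(open + close) + "]");
    }
}
```

With the JDK parser, the first call yields the company name and the second an empty string, which suggests the query is fine and the text is simply missing from the fetched markup.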

So perhaps you should save the page content to some file to examine if it is the same for you. If it is, but your browser displays things differently, then figure out what combination of cookies and other headers makes a difference there.
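One way to do that check is to dump the raw response to disk before any parsing happens. The following is a rough sketch, not a definitive implementation: `PageDump`, its method names, and the idea of varying only the `User-Agent` header are assumptions for illustration. Which headers or cookies actually matter for this site is something you would have to find out by experiment.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.URLConnection;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical debugging helper, not part of the asker's code.
final class PageDump {

    /** Copies the whole stream into the target file and returns the number of bytes written. */
    static long save(InputStream in, Path target) {
        try (InputStream body = in) {
            return Files.copy(body, target, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Fetches the URL with the given User-Agent header and saves the raw response body. */
    static long fetchAndSave(String url, String userAgent, Path target) {
        try {
            URLConnection conn = URI.create(url).toURL().openConnection();
            // Assumption: the server may answer differently depending on this header.
            conn.setRequestProperty("User-Agent", userAgent);
            return save(conn.getInputStream(), target);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling `fetchAndSave` once with Java's default-looking agent string and once with your browser's, then diffing the two files around `dvTitle`, shows whether the server is sending different markup to your program.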
