java - Finding the text area that contains the article content in HTML

Question

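I recently wanted to extract information from HTML source code using Java. The basic requirement is to get the main content area of the HTML. For example, here is a sample of the HTML source: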

<html>
  <head>
    <title>
      Chinese characters --中文
    </title>
  </head>

  <body>
    <div>
      This is an area including Chinese characters, like a menu, that I don't need.
    </div>
    <div>
      This is an area including Chinese characters, like ads, that I don't need.
    </div>
    <div>
      This is the main content, including the content I need. Almost all of the content consists of Chinese characters, like: 好好学习，天天向上。 我爱stackoverflow.谢谢你的帮助，非常感谢！
    </div>
    <div>
      This is the footer area, also including Chinese characters, but I don't need it.
    </div>
  </body>
</html>

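This HTML source is simple; real pages come in many different and more complex forms. I want to use Java to find the div (or other element) that contains the main content. The result I want is: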

<div>
  This is the main content, including the content I need. Almost all of the content consists of Chinese characters, like: 好好学习，天天向上。 我爱stackoverflow.谢谢你的帮助，非常感谢！
</div>

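There are tens of thousands of divs with different content, and the div ids are unknown or vary from page to page. The divs also come in many different forms, for example some contain p tags. Is there a way to use the presence or distribution of Chinese characters to pick out the content?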

score 0 · Accepted Answer

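I can't say I'm sure what your goal is, but a good starting point might be Apache's HttpComponents package. It has plenty of tools for making HTTP requests and returning the data into a string buffer (which I think is what you want).

Take a look here: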

http://hc.apache.org/httpcomponents-client-ga/tutorial/html/fundamentals.html#d5e43

Also, on the HttpComponents home page, most of the tutorials have Chinese translations, in case that's useful to you :D

http://hc.apache.org/
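
As a rough illustration of the "make an HTTP request and get the page back as a string" step, here is a minimal sketch. Note that it does not use HttpComponents: it swaps in the JDK's built-in java.net.http.HttpClient (Java 11+) so that it runs without extra dependencies. The class name PageFetcher and its methods are hypothetical names chosen for this example.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper, not part of HttpComponents: fetches a page as a String
// using the JDK's own HTTP client.
class PageFetcher {

    // Builds a plain GET request for the given URL.
    static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url)).GET().build();
    }

    // Sends the request and returns the response body as text.
    // BodyHandlers.ofString() decodes with the charset from the Content-Type
    // header and falls back to UTF-8 when the header gives none.
    static String fetch(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response =
                client.send(buildRequest(url), HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        // Usage: java PageFetcher.java <url>
        String html = fetch(args[0]);
        System.out.println(html.length() + " characters fetched");
    }
}
```

Pages that declare their encoding only in a meta tag (common for GBK-encoded Chinese sites) may need the bytes decoded manually; this sketch does not handle that case.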

score 0 · Accepted Answer

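I can't say I'm confident I understand the question, but it seems you want to scrape a certain div out of an HTML page using Java?

I had to do this to scrape some data from a legacy system in order to test a new one - take a look at http://htmlunit.sourceforge.net/. Basically, it lets you navigate to the page you want as if you were in a browser (so even if you would normally have to fill in a form to reach the page, you can do that), and then grab the content of different parts of the page in a number of different ways - for example, you can get a collection of all the divs and then pick the third one, or select the div with the right CSS class, or just use XPath.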
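
The "collect all the divs, then pick one" idea can be combined with the asker's suggestion of looking at how Chinese characters are distributed. The following is only a sketch under stated assumptions, not HtmlUnit code: it uses a plain regular expression from the JDK, considers only divs that contain no nested div, and picks the one with the most Han characters. The class MainContentFinder and its method names are hypothetical. For real pages with deeply nested markup, a proper HTML parser would be a safer way to enumerate the elements.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: scores each innermost <div> by its Han character count.
class MainContentFinder {

    // Matches a <div>...</div> pair whose body contains no nested <div> tag.
    // Assumption: regex matching is good enough for simple, well-formed pages.
    private static final Pattern LEAF_DIV = Pattern.compile(
            "<div\\b[^>]*>((?:(?!<div\\b).)*?)</div>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Counts code points whose Unicode script is HAN (Chinese characters).
    // Punctuation such as the full-width comma is not counted.
    static long countHan(String text) {
        return text.codePoints()
                .filter(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN)
                .count();
    }

    // Returns the innermost div with the most Han characters,
    // or null if no div contains any.
    static String findMainDiv(String html) {
        Matcher m = LEAF_DIV.matcher(html);
        String best = null;
        long bestCount = 0;
        while (m.find()) {
            long count = countHan(m.group(1));
            if (count > bestCount) {
                bestCount = count;
                best = m.group();
            }
        }
        return best;
    }

    public static void main(String[] args) throws Exception {
        // Usage: java MainContentFinder.java <path-to-html-file>
        String html = Files.readString(Path.of(args[0]));
        System.out.println(findMainDiv(html));
    }
}
```

A raw count favors long blocks; if menus or footers on the target pages are also long, the ratio of Han characters to total text length might be a better score, which is a small change to findMainDiv.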
