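Recently I have been trying to extract information from HTML source using Java. The basic requirement is to get the main content area of a page. For example, here is a sample of the HTML source: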
<html>
<head>
<title>
Chinese characters --中文
</title>
</head>
<body>
<div>
This is some area including Chinese characters, like a menu, that I don't need.
</div>
<div>
This is some area including Chinese characters, like ads, that I don't need.
</div>
<div>
This is the main content, including the content I need. Almost all of the content consists of Chinese characters, like: 好好学习,天天向上。 我爱stackoverflow.谢谢你的帮助,非常感谢!
</div>
<div>
This is the footer area, also including Chinese characters, but I don't need it.
</div>
</body>
</html>
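This HTML source is simple; the real sources are many, varied, and complex. I want to use Java to pick out the div (or other element) that holds the main content. The result I want is: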
<div>
This is the main content, including the content I need. Almost all of the content consists of Chinese characters, like: 好好学习,天天向上。 我爱stackoverflow.谢谢你的帮助,非常感谢!
</div>
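There are tens of thousands of divs with different content, and the div ids are unknown or vary from page to page. The divs also come in many different forms, for example containing p tags. Is there a way to locate the main content based on where Chinese characters appear or how they are distributed?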
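
To make the idea concrete, here is a rough sketch of the kind of heuristic I have in mind: score each div by how many Han (Chinese) characters its text contains and keep the highest-scoring one. The class name `MainContentFinder` and its methods are just names I made up for illustration. It uses only the JDK and assumes divs are not nested; a regex is not a real HTML parser, so for real pages the div extraction would probably need a proper parser instead.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MainContentFinder {

    // Assumption: divs are not nested. With nesting, this lazy match
    // would stop at the first closing tag and cut the outer div short.
    private static final Pattern DIV = Pattern.compile(
            "<div\\b[^>]*>(.*?)</div>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Used to strip any inner tags (p, span, a, ...) before counting.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    /** Counts code points whose Unicode script is Han. */
    static int countHan(String text) {
        return (int) text.codePoints()
                .filter(cp -> Character.UnicodeScript.of(cp)
                        == Character.UnicodeScript.HAN)
                .count();
    }

    /** Returns the div with the most Han characters, or null if no div has any. */
    static String findMainDiv(String html) {
        String best = null;
        int bestScore = 0;
        Matcher m = DIV.matcher(html);
        while (m.find()) {
            String text = TAG.matcher(m.group(1)).replaceAll("");
            int score = countHan(text);
            if (score > bestScore) {
                bestScore = score;
                best = m.group();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String html = "<html><body>"
                + "<div>menu 菜单</div>"
                + "<div>main content: 好好学习,天天向上。谢谢你的帮助,非常感谢!</div>"
                + "<div>footer 页脚</div>"
                + "</body></html>";
        System.out.println(findMainDiv(html));
    }
}
```

A raw count favours long blocks, so a menu with many short Chinese links could still win over a short article. Dividing the Han count by the total text length (a density score) might be a better signal, but I am not sure which works better across many different sites.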