If you look closely, you'll see that each category is a <div> with class=accordeonContainer. Its title is in an h2 inside that div, and the list of subcategories is in a <dl> with the "clearfix" CSS class:
<div class="accordeonContainer accordeonExpanded">
    <h2 class=" accordeonTitle "><span>Multimedia</span></h2>
    <div class="accordeonContent" id="Multimedia" style="display: block;">
        <dl class="clearfix">
            <dt>Camera Resolution</dt>
            <dd>1600 x 1200 pixels </dd>
            ...
            <dt>Graphic Formats</dt>
            <dd>BMP, DCF, EXIF, GIF87a, GIF89a, JPEG, PNG, WBMP </dd>
            ...
        </dl>
    </div>
</div>
You can select the list of elements of a given type (e.g. elm) that have a given CSS class (e.g. clazz) with:
Elements elms = doc.select("elm.clazz");
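To make the selector syntax concrete, here is a minimal sketch that runs against an inline HTML string instead of the live page. The class name SelectorSketch and the helper textsOf are hypothetical names introduced only for this illustration, and the HTML string is a cut-down stand-in for the real markup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.List;

public class SelectorSketch {

    // Hypothetical helper: returns the text of every <tag class="cssClass"> element.
    static List<String> textsOf(String html, String tag, String cssClass) {
        Document doc = Jsoup.parse(html);
        // "tag.cssClass" matches elements of that type carrying that class.
        Elements matches = doc.select(tag + "." + cssClass);
        List<String> texts = new ArrayList<String>();
        for (Element match : matches) {
            texts.add(match.text());
        }
        return texts;
    }

    public static void main(String[] args) {
        // Assumed sample markup, not the actual page.
        String html = "<div class=\"accordeonContainer\">"
                + "<h2 class=\"accordeonTitle\"><span>Multimedia</span></h2></div>"
                + "<div class=\"other\"><h2>Ignored</h2></div>";
        System.out.println(textsOf(html, "h2", "accordeonTitle"));
    }
}
```

Only the h2 carrying the accordeonTitle class is matched; the plain h2 in the second div is skipped.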
So, in a nutshell, the code to extract the information you mentioned could be:
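import java.io.IOException;
import java.util.Iterator;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Nokiareviews {
    public static void main(String[] args) throws IOException {
        // Fetch and parse the page; timeout() takes milliseconds.
        Document doc = Jsoup.connect("http://www.developer.nokia.com/Devices/Device_specifications/Nokia_Asha_308/")
                .timeout(1000 * 1000).get();
        // Each category lives in its own div.accordeonContainer.
        Elements content = doc.select("div.accordeonContainer");
        for (Element spec : content) {
            // Category title.
            Elements h2 = spec.select("h2.accordeonTitle");
            System.out.println(h2.text());
            // Subcategory names (dt) and their values (dd).
            Elements dl = spec.select("dl.clearfix");
            Elements dts = dl.select("dt");
            Elements dds = dl.select("dd");
            // Walk both lists in parallel; this assumes each dt has exactly one dd.
            Iterator<Element> dtsIterator = dts.iterator();
            Iterator<Element> ddsIterator = dds.iterator();
            while (dtsIterator.hasNext() && ddsIterator.hasNext()) {
                Element dt = dtsIterator.next();
                Element dd = ddsIterator.next();
                System.out.println("\t\t" + dt.text() + "\t\t" + dd.text());
            }
        }
    }
}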
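Pairing the dt and dd lists by position works only as long as every dt is followed by exactly one dd. As a possible alternative, the sketch below pairs each dt with its immediate sibling instead, so a dt without a value is skipped rather than shifting every later pair. The class name SpecPairsSketch, the helper pairs, and the sample markup are assumptions made for illustration; they are not part of the page or of jsoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.LinkedHashMap;
import java.util.Map;

public class SpecPairsSketch {

    // Hypothetical helper: maps each dt's text to the text of the dd right after it.
    static Map<String, String> pairs(String html) {
        Document doc = Jsoup.parse(html);
        // LinkedHashMap keeps the pairs in document order.
        Map<String, String> result = new LinkedHashMap<String, String>();
        for (Element dt : doc.select("dl.clearfix > dt")) {
            Element next = dt.nextElementSibling();
            // Keep the pair only when the next sibling really is a dd.
            if (next != null && "dd".equals(next.tagName())) {
                result.put(dt.text(), next.text());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed sample markup; "Orphan" deliberately has no dd.
        String html = "<dl class=\"clearfix\">"
                + "<dt>Camera Resolution</dt><dd>1600 x 1200 pixels </dd>"
                + "<dt>Orphan</dt>"
                + "<dt>Graphic Formats</dt><dd>BMP, GIF87a, JPEG, PNG </dd>"
                + "</dl>";
        for (Map.Entry<String, String> entry : pairs(html).entrySet()) {
            System.out.println("\t\t" + entry.getKey() + "\t\t" + entry.getValue());
        }
    }
}
```

The trade-off is that a dt with several dd values keeps only the first one; if the page uses that pattern, the loop would need to keep following siblings until the next dt.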
If you use Maven, make sure to add this to your pom.xml:
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.7.2</version>
</dependency>