I want to use jsoup to extract all the city names and state names from the URL List of cities and towns in India. An HTML snippet from that page is given below.

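Here Abhayapuri is the name of a city and Assam is the name of a state. Similar city and state names appear thousands of times on the page in this same table structure; everything is identical except the URL inside the td tag.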

    <table class="wikitable sortable" style="text-align:;">
    <tr>
    <th>Name of City/Town</th>
    <th>Name of State</th>
    <th>Classification</th>
    <th>Population (2001)</th>
    <th>Population (2011)</th>
    </tr>
    <tr>
    <td><a href="/wiki/Abhayapuri" title="Abhayapuri">Abhayapuri</a></td>
    <td><a href="/wiki/Assam" title="Assam">Assam</a></td>

I am new to jsoup. Any help would be appreciated. Thank you.

1 Answer

Sample code:

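    import java.net.URL;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    // Place the following inside a method that declares "throws IOException".
    // Fetch and parse the page with a 30-second timeout
    Document root = Jsoup.parse(new URL("http://en.wikipedia.org/wiki/List_of_cities_and_towns_in_India"), 30000);
    // Find all tables
    Elements tables = root.select("table");
    for (Element table : tables) {
        Elements th0 = table.select("tbody tr th");
        // Keep only our tables: the first header cell must match.
        // select() never returns null, so check for emptiness instead.
        if (!th0.isEmpty() && th0.get(0).text().trim().equals("Name of City/Town")) {
            Elements es = table.select("tbody tr");
            // Start at 1 to skip the header row
            for (int i = 1; i < es.size(); i++) {
                Elements td = es.get(i).select("td");
                // Skip rows that lack the city and state cells
                if (td.size() < 2) continue;
                Element cityLink = td.get(0).select("a").first();
                Element stateLink = td.get(1).select("a").first();
                // Skip rows whose cells contain no link
                if (cityLink == null || stateLink == null) continue;
                System.out.println(cityLink.text() + " => " + stateLink.text());
            }
        }
    }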

Output:

Abhayapuri => Assam
Achabbal => Jammu and Kashmir
Achalpur => Maharashtra
Achhnera => Uttar Pradesh
Adari => Uttar Pradesh
Adalaj => Gujarat
Adilabad => Andhra Pradesh
Adityana => Gujarat
pereyaapatna => Karnataka
Adoni => Andhra Pradesh
Adoor => Kerala
Adyar => Karnataka
Adra => West Bengal
Afzalpura => Karnataka
Agartala => Tripura
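
To try the same idea without hitting the network, here is a minimal sketch that parses an HTML string shaped like the snippet in the question. It assumes jsoup is on the classpath; the class name `CityStateExtractor`, the method `extractCityStates`, and the selector `table.wikitable tr` are illustrative choices, not part of the code above. It reads the cell text rather than the link text, so a cell without a link is still picked up.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.List;

public class CityStateExtractor {

    // Hypothetical helper: returns "city => state" strings for every data row
    // of every table with the "wikitable" class in the given HTML.
    static List<String> extractCityStates(String html) {
        Document doc = Jsoup.parse(html);
        List<String> pairs = new ArrayList<>();
        for (Element row : doc.select("table.wikitable tr")) {
            Elements cells = row.select("td");
            // Header rows contain th cells only, so they have no td and are skipped
            if (cells.size() < 2) {
                continue;
            }
            pairs.add(cells.get(0).text() + " => " + cells.get(1).text());
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Sample input modelled on the snippet from the question
        String html =
            "<table class=\"wikitable sortable\">"
            + "<tr><th>Name of City/Town</th><th>Name of State</th></tr>"
            + "<tr><td><a href=\"/wiki/Abhayapuri\" title=\"Abhayapuri\">Abhayapuri</a></td>"
            + "<td><a href=\"/wiki/Assam\" title=\"Assam\">Assam</a></td></tr>"
            + "</table>";
        for (String pair : extractCityStates(html)) {
            System.out.println(pair);
        }
    }
}
```

Returning a list instead of printing directly keeps duplicate city names and makes the result easy to reuse or check in a unit test.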
Answered 2013-01-05T07:32:24.993