
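I'm trying to use JSoup to scrape some content from http://dictionary.reference.com/browse/quick. If you go to that page, you'll see that they organize the data so that each "word type" (adjective, verb, noun) of the word quick is presented as its own section, and each section contains a list of one or more definitions.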

To make things a bit more complicated, every word in each definition is a link to another dictionary.com page:

quick
    adjective
        1. done, proceeding, or occurring with promptness or rapidity...
        2. that is over or completed within a short interval of time
        ...
        14. Archaic.
            a. endowed with life
            b. having a high degree of vigor, energy, ...
    noun
        1. living persons; the quick and the dead
        2. the tender, sensitive flesh of the living body...
        ...
    adverb
        ...

What I want to do is use JSoup to get the word types and their respective definitions as lists of strings, like so:

public class Metadata {
    // Ex: "adjective", "noun", etc.
    private String wordType;

    // Ex: String #1: "1. done, proceeding, or occurring with promptness or rapidity..."
    //     String #2: "that is over or completed within a short interval of time..."
    private List<String> definitions;
}

So the page really consists of a List<Metadata>, where each Metadata element is a word type paired with one or more definitions.
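To make the target structure concrete, here is a minimal sketch of how the Metadata class might be fleshed out. The constructor, the getters, the defensive copy, and toString are assumptions added for illustration; the question only declares the two fields.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class Metadata {
    // Ex: "adjective", "noun", etc.
    private final String wordType;

    // One string per definition, in the order they appear on the page.
    private final List<String> definitions;

    // Assumed constructor: not part of the original class.
    Metadata(String wordType, List<String> definitions) {
        this.wordType = wordType;
        // Copy the list so later changes by the caller do not affect this object.
        this.definitions = Collections.unmodifiableList(new ArrayList<>(definitions));
    }

    String getWordType() {
        return wordType;
    }

    List<String> getDefinitions() {
        return definitions;
    }

    @Override
    public String toString() {
        return wordType + " " + definitions;
    }
}
```

With something like this in place, one section of the page maps to one `new Metadata(wordType, definitions)` call.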

I was able to find the list of word types with a very simple API call:

// Contains 1 Element for each word type, like "adjective", "noun", etc.
Document doc = Jsoup.connect("http://dictionary.reference.com/browse/quick").get();
Elements wordTypes = doc.select("div.body div.pbk span.pg");

But I'm struggling to figure out what else I need to do with doc.select(...) in order to obtain each Metadata instance.


2 Answers


If you look at the HTML that Jsoup fetches from that page, you'll see something like

  <div class="body"> 
     <div class="pbk"> 
      <span class="pg">adjective </span> 
      <div class="luna-Ent">
       <span class="dnindex">1.</span>
       <div class="dndata">
        done, proceeding, or occurring with promptness or rapidity, as an action, process, etc.; prompt; immediate: 
        <span class="ital-inline">a quick response.</span> 
       </div>
      </div>
      <div class="luna-Ent">
       <span class="dnindex">2.</span>
       <div class="dndata">
        that is over or completed within a short interval of time: 
        <span class="ital-inline">a quick shower.</span> 
       </div>
      </div>
...
     <div class="pbk"> 
      <span class="pg">adverb </span> 
      <div class="luna-Ent">
       <span class="dnindex">19.</span>
       <div class="dndata">
        <a style="font-style:normal; font-weight:normal;" href="/browse/quickly">quickly</a>.
       </div>
      </div> 
     </div> 

So each section

adjective
    1. done, proceeding, or occurring with promptness or rapidity...
    2. that is over or completed within a short interval of time
    ...
    14. Archaic.
        a. endowed with life
        b. having a high degree of vigor, energy, ...
noun
    1. living persons; the quick and the dead
    2. the tender, sensitive flesh of the living body...
    ...
adverb
    ...

is contained in a <div class="pbk">, which holds the section name in <span class="pg">adjective </span> and the definitions in <div class="luna-Ent"> divs. So you can try something like

Document doc = Jsoup.connect("http://dictionary.reference.com/browse/quick").get();

Elements sections = doc.select("div.body div.pbk");
for (Element element : sections) {
    String elementType = element.getElementsByClass("pg").text();
    System.out.println("--------------------");
    System.out.println(elementType);

    for (Element definitions : element.getElementsByClass("luna-Ent"))
        System.out.println(definitions.text());

}

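This code selects all the sections, then uses element.getElementsByClass("pg") to find each section's name and element.getElementsByClass("luna-Ent") to find its definitions, relying on the fact that the definitions sit in divs with the class luna-Ent. (If you want to skip the numbers 1., 2., and so on, you can select the dndata class instead of luna-Ent.)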

Output:

--------------------
adjective
1. done, proceeding, or occurring with promptness or rapidity, as an action, process, etc.; prompt; immediate: a quick response.
2. that is over or completed within a short interval of time: a quick shower.
3. moving, or able to move, with speed: a quick fox; a quick train.
4. swift or rapid, as motion: a quick flick of the wrist.
5. easily provoked or excited; hasty: a quick temper.
6. keenly responsive; lively; acute: a quick wit.
7. acting with swiftness or rapidity: a quick worker.
8. prompt or swift to do something: quick to respond.
9. prompt to perceive; sensitive: a quick eye.
10. prompt to understand, learn, etc.; of ready intelligence: a quick student.
11. (of a bend or curve) sharp: a quick bend in the road.
12. consisting of living plants: a quick pot of flowers.
13. brisk, as fire, flames, heat, etc.
14. Archaic. a. endowed with life. b. having a high degree of vigor, energy, or activity.
--------------------
noun
15. living persons: the quick and the dead.
16. the tender, sensitive flesh of the living body, especially that under the nails: nails bitten down to the quick.
17. the vital or most important part.
18. Chiefly British. a. a line of shrubs or plants, especially of hawthorn, forming a hedge. b. a single shrub or plant in such a hedge.
--------------------
adverb
19. quickly.
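Building on the loop above, the sections can be collected instead of printed. The following is only a sketch under a few assumptions: DefinitionCollector, collect, and SAMPLE_HTML are hypothetical names, the inline HTML is an abbreviated copy of the markup shown earlier so that the code runs without a network call, and the live page may no longer use this structure. It gathers the definitions into an insertion-ordered map keyed by word type; each map entry then corresponds to one instance of the question's Metadata class.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class DefinitionCollector {
    // Abbreviated copy of the markup shown above, so the sketch runs offline.
    static final String SAMPLE_HTML =
        "<div class=\"body\">"
        + "<div class=\"pbk\"><span class=\"pg\">adjective </span>"
        + "<div class=\"luna-Ent\"><span class=\"dnindex\">1.</span> "
        + "<div class=\"dndata\">done, proceeding, or occurring with promptness or rapidity</div></div>"
        + "<div class=\"luna-Ent\"><span class=\"dnindex\">2.</span> "
        + "<div class=\"dndata\">that is over or completed within a short interval of time</div></div>"
        + "</div>"
        + "<div class=\"pbk\"><span class=\"pg\">adverb </span>"
        + "<div class=\"luna-Ent\"><span class=\"dnindex\">19.</span> "
        + "<div class=\"dndata\"><a href=\"/browse/quickly\">quickly</a>.</div></div>"
        + "</div>"
        + "</div>";

    // Hypothetical helper: maps each word type to its definitions, in page order.
    static Map<String, List<String>> collect(Document doc) {
        Map<String, List<String>> byWordType = new LinkedHashMap<>();
        for (Element section : doc.select("div.body div.pbk")) {
            String wordType = section.select("span.pg").text();
            // Sections that repeat a word type are merged into one list.
            List<String> definitions =
                byWordType.computeIfAbsent(wordType, key -> new ArrayList<>());
            // Selecting dndata rather than luna-Ent leaves out the "1.", "2." prefixes.
            for (Element definition : section.select("div.luna-Ent > div.dndata")) {
                definitions.add(definition.text());
            }
        }
        return byWordType;
    }

    public static void main(String[] args) {
        Map<String, List<String>> result = collect(Jsoup.parse(SAMPLE_HTML));
        for (Map.Entry<String, List<String>> entry : result.entrySet()) {
            System.out.println(entry.getKey() + ": " + entry.getValue());
        }
    }
}
```

To run this against the live page, pass the Document returned by Jsoup.connect(...).get() to collect in place of the parsed sample string.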
answered 2013-11-13T23:37:10.660

Here you go. By the way, to test CSS selectors, you can open the console in Chrome Developer Tools and run a query like this directly on their site: jQuery('div.body div.pbk div.luna-Ent > .dndata')

Document doc = Jsoup.connect("http://dictionary.reference.com/browse/quick").get();
Elements wordTypes = doc.select("div.body div.pbk");

for (Element wordType : wordTypes) {
    Elements typeOfSpeech = wordType.select("span.pg");

    System.out.println("typeOfSpeech: " + typeOfSpeech.text());

    Elements elements = wordType.select("div.luna-Ent > .dndata");

    for (int i = 0; i < elements.size(); i++) {
        Element element = elements.get(i);
        System.out.println((i + 1) + ". " + element.text());
    }
}
answered 2013-11-13T23:32:00.590