What is the best way to extract data from the following HTML in the format given below?

<table class="item over  spicy_logo item_border" item_id="3464864" id="item_3464864" ua-action="Item" ua-label="Item">
    <tbody>
        <tr itemscope itemtype="http://schema.org/MenuItem">
            <td class="item_img_box" item_id="3464864" title="How is it?">
                <table>
                    <tbody>
                        <tr>
                            <td>
                                <div>
                                    <img id='img3464864' src="/yelp_images/s3-media4.fl.yelpcdn.com/bphoto/1P50jjYUA4ofx5hF85wm5Q/ms.jpg" align="left" class="item_img" border="0" alt="How is it?"/>
                                </div>
                            </td>
                        </tr>
                    </tbody>
                </table>
            </td>
            <td class="item_name ">
                <div>
                    <a class="cpa" href="http://miami-beach.eat24hours.com/carrot-express/26721?item_id=3464864" itemprop="name">Teeka Salad</a>
                    <div class="item_desc" itemprop="description">Kale, sunflower sprouts, quinoa, avocado, grape tomato, alfalfa bean sprouts, carrots and cucumber with a choice of dressing.</div>
                </div>
            </td>
            <td class="item_price">
                <div >$<span itemprop="price">9.95</span></div>
            </td>
        </tr>
    </tbody>
</table>

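Expected output:

ITEM_NAME: Teeka Salad

ITEM_DESCRIPTION: Kale, sunflower sprouts, quinoa, avocado, grape tomato, alfalfa bean sprouts, carrots and cucumber with a choice of dressing.

ITEM_PRICE: $9.95

ITEM_IMG: /yelp_images/s3-media4.fl.yelpcdn.com/bphoto/1P50jjYUA4ofx5hF85wm5Q/ms.jpg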

I have tried various approaches with Jsoup and Jaunt, but I still can't figure it out.

1 Answer

The program below extracts the data with Jsoup, using CSS query selectors.

import java.io.File;
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class HTMLDataExtraction {

    public static void main(String[] args) throws IOException {
        // Parse a local copy of the page. To fetch a live URL instead, use Jsoup.connect(url).get().
        Document document = Jsoup.parse(new File("A:/Workspaces/MarsWorkspace/jSoupExample/src/main/java/com/URLs/jSoupExample/HTMLParser.html"),
                                        "UTF-8");

        document.select("#img3464864").forEach(element -> {
            System.out.println("ITEM_IMG :" + element.attr("src")); // element.absUrl("src") resolves it against the document's base URI
        });

        document.select("td.item_name > div [itemprop]").forEach(element -> {
            if (element.hasClass("cpa"))
                System.out.println("ITEM_NAME :" + element.text());
            if (element.hasClass("item_desc"))
                System.out.println("ITEM_DESCRIPTION :" + element.text());
        });

        document.select("td.item_price").forEach(element -> {
            System.out.println("ITEM_PRICE:" + element.text());
        });
    }
}
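
The program above looks the image up by its item-specific id (#img3464864), so it only works for that one item. If the page lists several items, one option is to key off the schema.org itemprop attributes instead. The following is only a sketch under a few assumptions: the class name MenuItemExtractor, the extract and textOf helpers, and the ITEM_IMG_ABS key are made up for illustration; the inline HTML is a trimmed copy of the markup from the question with a shortened description; the base URI is guessed from the link's href; and it needs jsoup 1.11.1 or newer for selectFirst.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class MenuItemExtractor {

    // Trimmed copy of the question's markup (description shortened), inlined so the sketch is self-contained.
    static final String SAMPLE_HTML =
          "<table class=\"item\" id=\"item_3464864\"><tbody>"
        + "<tr itemscope itemtype=\"http://schema.org/MenuItem\">"
        + "<td class=\"item_img_box\"><img id=\"img3464864\" class=\"item_img\""
        + " src=\"/yelp_images/s3-media4.fl.yelpcdn.com/bphoto/1P50jjYUA4ofx5hF85wm5Q/ms.jpg\"/></td>"
        + "<td class=\"item_name\"><div>"
        + "<a class=\"cpa\" href=\"http://miami-beach.eat24hours.com/carrot-express/26721?item_id=3464864\""
        + " itemprop=\"name\">Teeka Salad</a>"
        + "<div class=\"item_desc\" itemprop=\"description\">Kale, sunflower sprouts and quinoa.</div>"
        + "</div></td>"
        + "<td class=\"item_price\"><div>$<span itemprop=\"price\">9.95</span></div></td>"
        + "</tr></tbody></table>";

    // Returns one map per <tr itemscope> row, so the same code handles any number of items.
    static List<Map<String, String>> extract(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        List<Map<String, String>> items = new ArrayList<>();
        for (Element row : doc.select("tr[itemscope]")) {
            Map<String, String> item = new LinkedHashMap<>();
            item.put("ITEM_NAME", textOf(row, "[itemprop=name]"));
            item.put("ITEM_DESCRIPTION", textOf(row, "[itemprop=description]"));
            item.put("ITEM_PRICE", textOf(row, "[itemprop=price]"));
            Element img = row.selectFirst("img.item_img");
            item.put("ITEM_IMG", img == null ? "" : img.attr("src"));
            // absUrl resolves the relative src against the base URI passed to Jsoup.parse.
            item.put("ITEM_IMG_ABS", img == null ? "" : img.absUrl("src"));
            items.add(item);
        }
        return items;
    }

    // Text of the first match inside scope, or an empty string when nothing matches.
    static String textOf(Element scope, String cssQuery) {
        Element match = scope.selectFirst(cssQuery);
        return match == null ? "" : match.text();
    }

    public static void main(String[] args) {
        // Base URI is an assumption taken from the href in the question's markup.
        for (Map<String, String> item : extract(SAMPLE_HTML, "http://miami-beach.eat24hours.com/")) {
            item.forEach((key, value) -> System.out.println(key + ": " + value));
        }
    }
}
```

Selecting [itemprop=price] returns the bare number (9.95) without the dollar sign, whereas td.item_price returns the text of the whole cell. Use whichever matches the output format you need.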
answered 2016-01-17T13:57:54.037