web-scraping - Retrieving sibling elements with the scraper crate

Question

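While learning Rust, I am trying to build a simple web scraper. My goal is to scrape https://news.ycombinator.com/ and get the title, hyperlink, votes, and username of each post. I am using the external libraries reqwest and scraper for this, and have written a program that scrapes the HTML links from that site.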

Cargo.toml

[package]
name = "stackoverflow_scraper"
version = "0.1.0"
edition = "2018"

# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

[dependencies]
scraper = "0.12.0"
reqwest = "0.11.2"
tokio = { version = "1", features = ["full"] }
futures = "0.3.13"

src/main.rs

use scraper::{Html, Selector};
use reqwest;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let url = "https://news.ycombinator.com/";
    let html = reqwest::get(url).await?.text().await?;
    let fragment = Html::parse_fragment(html.as_str());
    let selector = Selector::parse("a.storylink").unwrap();

    for element in fragment.select(&selector) {
        println!("{:?}",element.value().attr("href").unwrap());
        // todo println!("Title");
        // todo println!("Votes");
        // todo println!("User");
    }

    Ok(())
}

How do I get the corresponding title, votes, and username for each link?

score 3 · Accepted Answer

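The items on the front page are stored in a table with the class .itemlist.

Since each item consists of three consecutive <tr> elements, you have to iterate over them in groups of three. I chose to collect all the nodes first.

The first row contains:

the title
the domain

The second row contains:

the points
the author
the age of the post

The third row is a spacer that should be ignored.

Notes:

Posts created within the last hour do not seem to display any points, so this needs to be handled accordingly.
Advertisements do not contain a username.
The last two table rows, tr.morespace and the tr containing a.morelink, should be ignored. This is why I chose to .collect() the nodes first and then use .chunks_exact().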
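As a minimal sketch of why collecting first and then calling .chunks_exact(3) skips the two trailing rows, the std-only example below uses plain strings as hypothetical stand-ins for the <tr> nodes. The helper name group_items is illustrative only and is not part of scraper.

```rust
/// Groups table rows into (title_row, subtext_row) pairs and skips the spacer.
/// `rows` is a stand-in for the collected `<tr>` nodes; plain strings are used
/// so that the sketch runs without any external crates.
fn group_items<'a>(rows: &[&'a str]) -> Vec<(&'a str, &'a str)> {
    rows.chunks_exact(3)
        // chunk[2] is the spacer row and is ignored.
        .map(|chunk| (chunk[0], chunk[1]))
        .collect()
}

fn main() {
    // Two items (three rows each) followed by the two trailing rows.
    let rows = [
        "title A", "subtext A", "spacer",
        "title B", "subtext B", "spacer",
        "morespace", "morelink",
    ];

    // Only complete groups of three are yielded.
    let items = group_items(&rows);
    println!("{:?}", items);

    // The incomplete trailing chunk is still reachable via `remainder()`.
    let leftover = rows.chunks_exact(3).remainder();
    println!("{:?}", leftover);
}
```

Because the two trailing rows never form a complete group of three, chunks_exact never yields them. The complete program applies the same grouping to the real nodes: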

use reqwest;
use scraper::{Html, Selector};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let url = "https://news.ycombinator.com/";
    let html = reqwest::get(url).await?.text().await?;
    let fragment = Html::parse_fragment(html.as_str());

    let selector_items = Selector::parse(".itemlist tr").unwrap();

    let selector_title = Selector::parse("a.storylink").unwrap();
    let selector_score = Selector::parse("span.score").unwrap();
    let selector_user = Selector::parse("a.hnuser").unwrap();

    let nodes = fragment.select(&selector_items).collect::<Vec<_>>();

    // Group the rows in threes: [title row, subtext row, spacer row].
    // chunks_exact(3) drops the incomplete trailing chunk.
    let list = nodes
        .chunks_exact(3)
        .map(|rows| {
            let title_elem = rows[0].select(&selector_title).next().unwrap();
            let title_text = title_elem.text().next().unwrap();
            let title_href = title_elem.value().attr("href").unwrap();

            // Posts without a visible score fall back to a default.
            let score_text = rows[1]
                .select(&selector_score)
                .next()
                .and_then(|n| n.text().next())
                .unwrap_or("0 points");

            // Advertisements have no username.
            let user_text = rows[1]
                .select(&selector_user)
                .next()
                .and_then(|n| n.text().next())
                .unwrap_or("Unknown user");

            [title_text, title_href, score_text, user_text]
        })
        .collect::<Vec<_>>();

    println!("links: {:#?}", list);

    Ok(())
}

That should give you the following list:

[
    [
        "Docker for Mac M1 RC",
        "https://docs.docker.com/docker-for-mac/apple-m1/",
        "327 points",
        "mikkelam",
    ],
    [
        "A Mind Is Born – A 256 byte demo for the Commodore 64 (2017)",
        "https://linusakesson.net/scene/a-mind-is-born/",
        "226 points",
        "matthewsinclair",
    ],
    [
        "Show HN: Video Game in a Font",
        "https://www.coderelay.io/fontemon.html",
        "416 points",
        "ghub-mmulet",
    ],
    ...
]
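The scores in this list are plain text such as "327 points". If a numeric value is needed, one possible approach is to parse the leading number, as in the std-only sketch below. The helper name parse_points is hypothetical, and the sketch assumes the text always has the form "<number> point(s)".

```rust
/// Parses the leading number out of a score string such as "327 points".
/// Returns `None` when the text does not start with a non-negative integer.
fn parse_points(score_text: &str) -> Option<u32> {
    // Take the first whitespace-separated token and try to parse it.
    score_text.split_whitespace().next()?.parse().ok()
}

fn main() {
    for text in ["327 points", "1 point", "0 points", "n/a"] {
        println!("{:?} -> {:?}", text, parse_points(text));
    }
}
```

Returning an Option keeps the "no score" case explicit instead of hiding it behind a default string.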

Alternatively, there is an API available that you can use:

GitHub, HackerNews API

score 1 · Accepted Answer

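This is more of a selector question, and the answer depends on the HTML of the site being scraped. In this case it is easy to get the title, but harder to get the points and the user. Since the selector you are using selects the link that contains both the href and the title, you can get the title with the .text() method: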

let title = element.text().collect::<Vec<_>>();

where element is the same element you read the href from.
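Since .text() yields the text as an iterator of &str pieces, collecting into a Vec gives a list of fragments rather than one string. As a minimal sketch, the pieces can instead be collected into a single String. The helper join_text is hypothetical, and a plain Vec stands in for the iterator returned by element.text() so that the example runs without scraper.

```rust
/// Joins text fragments into one trimmed `String`.
/// `parts` stands in for the iterator returned by `element.text()`.
fn join_text<'a>(parts: impl Iterator<Item = &'a str>) -> String {
    parts.collect::<String>().trim().to_string()
}

fn main() {
    // Hypothetical fragments, as they might come from nested inline elements.
    let parts = vec!["Show HN: ", "Video Game ", "in a Font"];
    println!("{}", join_text(parts.into_iter()));
}
```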

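However, to get the other values, it is easier to change the first selector and work from there. On news.ycombinator.com, the title and link of a news item are in an element with the class .athing, while the votes and user are in the next element, which has no class (making it harder to select). It is therefore best to select "table.itemlist tr.athing" and iterate over those results. From each element found, you can sub-select the "a.storylink" element, then get the following tr element and sub-select the points and user elements from it: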

use scraper::ElementRef; // needed for ElementRef::wrap below

let select_item = Selector::parse("table.itemlist tr.athing").unwrap();
let select_link = Selector::parse("a.storylink").unwrap();
let select_score = Selector::parse("span.score").unwrap();

for element in fragment.select(&select_item) {
    // Get the link element that contains the href and title
    let link_el = element.select(&select_link).next().unwrap();
    println!("{:?}", link_el.value().attr("href").unwrap());

    // Get the next tr element that follows the first, with score and user
    let details_el = ElementRef::wrap(element.next_sibling().unwrap()).unwrap();
    // Get the score element from within the second row element;
    // some rows (e.g. very new posts) have no score, so avoid unwrap here
    if let Some(score) = details_el.select(&select_score).next() {
        println!("{:?}", score.text().collect::<Vec<_>>());
    }
}

This only shows how to get the href and the score. I will leave it to you to get the user from details_el.
