node.js - Scraping Hacker News with x-ray/Node.js

Question

How can I scrape Hacker News ( https://news.ycombinator.com/ ) with x-ray/nodejs?

I would like to get something like this out of it:

[
  {title1, comment1},
  {title2, comment2},
  ...
  {"‘Minimal’ cell raises stakes in race to harness synthetic life", 48}
  ...
  {title 30, comment 30}
]

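There is a table of news, but I can't figure out how to scrape it. Each story on the site consists of three table rows, and those rows have no parent element of their own. So the structure looks like this: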

<tbody>
  <tr class="spacer"> //Markup 1
  <tr class="athing"> //Headline 1 ('.deadmark+ a' contains title)
  <tr class>          //Meta Information 1 (.age+ a contains comments)
  <tr class="spacer"> //Markup 2
  <tr class="athing"> //Headline 2 ('.deadmark+ a' contains title)
  <tr class>          //Meta Information 2 (.age+ a contains comments)
  ...
  <tr class="spacer"> //Markup 30
  <tr class="athing"> //Headline 30 ('.deadmark+ a' contains title)
  <tr class>          //Meta Information 30 (.age+ a contains comments)

So far I have tried:

x("https://news.ycombinator.com/", "tr", [{
  title: [".deadmark+ a"],
  comments: ".age+ a"
}])

and

x("https://news.ycombinator.com/", {
  title: [".deadmark+ a"],
  comments: [".age+ a"]
})

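The second approach returns 30 titles and 29 comments. I don't see any way to map them together, because there is no information about which of the 30 titles is missing its comments.

Any help would be appreciated.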
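To illustrate the mismatch, here is a minimal sketch with made-up data (three stories instead of thirty; the titles and counts are hypothetical, not real scraper output). Pairing the two flat arrays by position shifts every comment count after the story that has none:

```javascript
// Made-up data: three stories, where the second one has no comments link.
var titles = ["Story A", "Story B (no comments yet)", "Story C"];
var comments = ["12 comments", "34 comments"]; // these belong to A and C

// Naive pairing by array position.
var paired = titles.map(function (title, index) {
    return { title: title, comments: comments[index] };
});

// Story B wrongly receives "34 comments" and Story C receives undefined.
console.log(paired);
```

Nothing in the two arrays says where the gap is, so the alignment cannot be repaired after the fact.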

score 4 · Accepted Answer

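This markup is not easy to scrape with the x-ray package, because there is no way to reference the current context in a CSS selector. That would have been useful for getting the next tr sibling after each tr.athing row in order to read its comments.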

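We can still use the "next sibling" notation (the +) to get to the next row. However, instead of targeting the optional comments link, we grab the complete text of that row and then extract the comment count with a regular expression. If no comment count is present, the value is set to 0.

Complete working code: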
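The extraction step can be tried on its own, without any network access. This is only a sketch: the helper name extractCommentCount and the sample row texts are made up for illustration and are not real Hacker News output.

```javascript
// Hypothetical helper: pulls the comment count out of a row's full text.
function extractCommentCount(rowText) {
    var match = rowText.match(/(\d+) comments?/);
    // Fall back to "0" when the row has no "N comments" text.
    return match ? match[1] : "0";
}

// Made-up sample row texts (assumed shape, not scraped data).
var samples = [
    "120 points by alice 3 hours ago | 48 comments",
    "7 points by bob 1 hour ago | 1 comment",
    "3 points by carol 5 minutes ago | discuss"
];

var counts = samples.map(extractCommentCount);
console.log(counts); // [ '48', '1', '0' ]
```

Because every tr.athing row is followed by exactly one metadata row, the extracted counts stay aligned with the titles by index, even for rows that have no comments.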

var Xray = require('x-ray');
var x = Xray();

x("https://news.ycombinator.com/", {
    title: ["tr.athing .deadmark+ a"],
    comments: ["tr.athing + tr"]
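})(function (err, obj) {
    // stop early if the request or the parsing failed
    if (err) { console.error(err); return; }
    // extract the comment counts and map them into an array of objects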
    var result = obj.comments.map(function (elm, index) {
        var match = elm.match(/(\d+) comments?/);
        return {
            title: obj.title[index],
            comments: match ? match[1]: "0"
        };
    });
    console.log(result);
});

Currently prints:

[ { title: 'Follow the money: what Apple vs. the FBI is really about',
    comments: '85' },
  { title: 'Unable to open links in Safari, Mail or Messages on iOS 9.3',
    comments: '12' },
  { title: 'Gogs – Go Git Service', comments: '13' },
  { title: 'Ubuntu Tablet now available for pre-order',
    comments: '56' },
  ...
  { title: 'American Tech Giants Face Fight in Europe Over Encrypted Data',
    comments: '7' },
  { title: 'Moving Beyond the OOP Obsession', comments: '34' } ]
