I'm experimenting with some scraping and need to manipulate some of the data before writing it to my JSON file.

var Xray = require('x-ray');
var x = Xray();


x('http://myUrl.com', '#search_results div div a', [{
    title: '.responsive_search_name_combined .search_name .title',
    price: '.col.search_price.responsive_secondrow',
}])

.paginate('.search_pagination_right a.pagebtn:last-child@href')
    .limit(10)
    .write('data.json');

When saved, price looks like this: "price": "\r\n\t\t\t\t\t\t\t\t13,99€\t\t\t\t\t\t\t".

I guess it's because there's a lot of whitespace inside div.col.search_price.responsive_secondrow.

<div class="col search_price  responsive_secondrow">
								9,99€							</div>

So my question is: Would it be possible to manipulate the data before .write?

2 Answers

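Yes, you can simply pass a callback function that receives your scrape results. Inside that function you have full control over whatever post-processing you want to do.

So your code would end up looking like this: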

var fs = require('fs');

x('http://myUrl.com', '#search_results div div a', [{
    title: '.responsive_search_name_combined .search_name .title',
    price: '.col.search_price.responsive_secondrow',
}])
(function (err, products) {
    // x-ray invokes the callback error-first: (err, result)
    if (err) throw err;

    var cleanedProducts = [];
    products.forEach(function (product) {
        var cleanedProduct = {};
        cleanedProduct.price = product.price.trim();
        // etc. for the other fields
        cleanedProducts.push(cleanedProduct);
    });

    // write out results.json 'manually'
    fs.writeFile('results.json', JSON.stringify(cleanedProducts), function (err) {
        if (err) throw err;
    });
});
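
The cleanup step in the callback can also be pulled out into a small function, which makes it easy to check without hitting the network. This is only a minimal sketch: `cleanProducts` is a hypothetical helper name (not part of x-ray), and the sample input just mimics the whitespace-padded value from the question.

```javascript
// Hypothetical helper (not part of x-ray): trims every string field of
// each scraped item and returns a new array, leaving the input untouched.
function cleanProducts(products) {
    return products.map(function (product) {
        var cleaned = {};
        Object.keys(product).forEach(function (key) {
            var value = product[key];
            // Only strings are trimmed; anything else is copied as-is.
            cleaned[key] = typeof value === 'string' ? value.trim() : value;
        });
        return cleaned;
    });
}

// Sample data shaped like the scrape result in the question (title is made up).
var sample = [{
    title: 'Some Game',
    price: '\r\n\t\t\t\t\t\t\t\t13,99€\t\t\t\t\t\t\t'
}];

console.log(JSON.stringify(cleanProducts(sample)));
// prints [{"title":"Some Game","price":"13,99€"}]
```

Inside the callback you would then pass the scraped array through `cleanProducts` before handing it to `JSON.stringify`.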

Answered 2015-12-23T13:02:53.473

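You can use something X-Ray supports natively, called filter functions, which fully covers the case you describe.

Filters are custom functions that let you apply your own logic while the scraped data is being processed.

See the code sample below. It defines a custom filter function named cleanUpText and applies it to the scraped price data.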

var Xray = require('x-ray');
var x = Xray({
    filters: {
        // Strips the exact leading and trailing whitespace runs seen in the scraped price
        cleanUpText: function (value) { return value.replace('\r\n\t\t\t\t\t\t\t\t', '').replace('\t\t\t\t\t\t\t', ''); },
    }
});


x('http://store.steampowered.com/search/?filter=topsellers', '#search_results div div a', [{
    title: '.responsive_search_name_combined .search_name .title',
    price: '.col.search_price.responsive_secondrow | cleanUpText', // calling filter function 'cleanUpText'
}])

    .paginate('.search_pagination_right a.pagebtn:last-child@href')
    .limit(10)
    .write('data.json');

data.json looks like this:

{"title": "PLAYERUNKNOWN'S BATTLEGROUNDS",
"price": "$29.99"},
{"title": "PAYDAY 2: Ultimate Edition",
"price": "$44.98"}
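
The exact-match `replace` calls in cleanUpText depend on the page emitting precisely that run of tabs. One possible alternative is a filter that trims whatever whitespace surrounds the value. The sketch below uses a hypothetical name, `trimText`, and is plain JavaScript so it can be checked without x-ray; the commented-out line shows how it would presumably be registered through the same `filters` option.

```javascript
// Hypothetical filter (the name is illustrative): collapses runs of
// whitespace to a single space and trims both ends. Non-string values
// (for example a selector that matched nothing) pass through unchanged.
function trimText(value) {
    if (typeof value !== 'string') {
        return value;
    }
    return value.replace(/\s+/g, ' ').trim();
}

// Would be registered the same way as cleanUpText:
// var x = Xray({ filters: { trimText: trimText } });

// The raw value from the question:
console.log(JSON.stringify(trimText('\r\n\t\t\t\t\t\t\t\t13,99€\t\t\t\t\t\t\t')));
// prints "13,99€"
```

The selector would then read '.col.search_price.responsive_secondrow | trimText'.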
Answered 2017-06-13T08:13:40.400