python - How to restrict a Scrapy spider to scraping only certain XPaths

Question

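I'm trying to scrape a website. On the product page I want to scrape the product description, but how do I select only the product description?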

xPath : hxs.select('//div[@class="product-shop"]/p/text()').extract()

The HTML is quite large; please refer to the link specified above.

I want to select only the product description, not the other details...

If I do this:

[" ".join([i.strip() for i in hxs.select('//div[@class="product-shop"]/p/text()').extract()])]

Output:
[u'Itemcode: 12BTS28271 Brand: BASICS InStock - Ships within 2 business days. Tip: 90% of our shipments reach within 4 business days! This product is part of the Basics T.shirts line made of 100% Cotton. Stripes Muscle Fit T.shirts that come in Green Color. Casual that comes with Henley away.']

But I only want:

[u'This product is part of the Basics T.shirts line made of 100% Cotton. Stripes Muscle Fit T.shirts that come in Green Color. Casual that comes with Henley away.']

score 2 · Accepted Answer

Right-clicking the element in the Elements panel in Chrome gives me:

//*[@id="product_addtocart_form"]/div[2]/div[1]/p[3]

which points to:

<p>This product is part of the Basics T.shirts line made of 100% Cotton.<br>
                        Stripes Muscle Fit T.shirts that come in Green Color.<br>
                        Casual that comes with Henley away.</p>
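
To illustrate what that positional path selects, here is a minimal sketch. It does not use Scrapy: it relies on the standard library's xml.etree.ElementTree, which supports only a subset of XPath and needs a leading "." on the path. The page markup is a hand-written, well-formed stand-in (the real HTML is not shown in this thread), so the layout of the first two paragraphs, the PAGE constant and the extract_description helper are all assumptions for illustration. In a spider, the same path would presumably go to hxs.select(...) as in the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the product page. Only the text
# content comes from the thread; the surrounding layout is assumed.
PAGE = """
<html><body>
<form id="product_addtocart_form">
  <div/>
  <div>
    <div class="product-shop">
      <p>Itemcode: 12BTS28271<br/>Brand: BASICS</p>
      <p>InStock - Ships within 2 business days.<br/>
         Tip: 90% of our shipments reach within 4 business days!</p>
      <p>This product is part of the Basics T.shirts line made of 100% Cotton.<br/>
         Stripes Muscle Fit T.shirts that come in Green Color.<br/>
         Casual that comes with Henley away.</p>
    </div>
  </div>
</form>
</body></html>
"""

# Same path as above, with a leading "." because ElementTree paths are relative.
DESCRIPTION_XPATH = ".//*[@id='product_addtocart_form']/div[2]/div[1]/p[3]"


def extract_description(markup):
    """Return the text of the paragraph at DESCRIPTION_XPATH, or None if absent."""
    root = ET.fromstring(markup)
    paragraph = root.find(DESCRIPTION_XPATH)
    if paragraph is None:
        return None
    # itertext() yields the fragments separated by <br/>; strip and re-join them.
    parts = (part.strip() for part in paragraph.itertext())
    return " ".join(part for part in parts if part)


description = extract_description(PAGE)
print(description)
```

Because the path ends in p[3], the item code, brand, stock and tip paragraphs are never selected, so nothing has to be filtered out afterwards.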

Trying the same XPath on this page points to the description there too:

<p>This product is part of the Basics Shirts line made of 100% Cotton.<br>
                    Plain Slim Fit Shirts that come in Orange Color.<br>
                    Casual that comes with Button Down away.</p>

So it looks like all you need to do is evaluate that XPath on the page and you're set. You should still verify that the XPath works in all cases, though, since a positional path like this can break whenever a page's layout differs.
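
One rough way to do that verification is to run the path over a handful of saved sample pages and list the ones where it finds nothing. The sketch below again uses the standard library's xml.etree.ElementTree instead of Scrapy, and the wrap helper, the SAMPLES pages and the find_failures function are hypothetical names built around invented, well-formed markup; only the two description sentences come from the pages mentioned above.

```python
import xml.etree.ElementTree as ET

# Same path as in the answer, with a leading "." for ElementTree.
DESCRIPTION_XPATH = ".//*[@id='product_addtocart_form']/div[2]/div[1]/p[3]"


def wrap(paragraphs):
    """Build a hypothetical well-formed page around the given <p> elements."""
    return (
        "<html><body><form id='product_addtocart_form'><div/>"
        "<div><div class='product-shop'>" + "".join(paragraphs) + "</div></div>"
        "</form></body></html>"
    )


# Assumed sample pages: two with the expected layout, one with a changed layout.
SAMPLES = {
    "t-shirt": wrap([
        "<p>first detail paragraph</p>",
        "<p>second detail paragraph</p>",
        "<p>This product is part of the Basics T.shirts line made of 100% Cotton.</p>",
    ]),
    "shirt": wrap([
        "<p>first detail paragraph</p>",
        "<p>second detail paragraph</p>",
        "<p>This product is part of the Basics Shirts line made of 100% Cotton.</p>",
    ]),
    "changed-layout": wrap(["<p>Only one paragraph here.</p>"]),
}


def find_failures(samples, xpath):
    """Return the names of sample pages where the XPath yields no text."""
    failures = []
    for name, markup in samples.items():
        node = ET.fromstring(markup).find(xpath)
        text = "".join(node.itertext()).strip() if node is not None else ""
        if not text:
            failures.append(name)
    return failures


failures = find_failures(SAMPLES, DESCRIPTION_XPATH)
print(failures)  # ['changed-layout']
```

A check like this only catches pages where the path matches nothing; a page where p[3] exists but holds something other than the description would still need to be inspected by eye.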
