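I am trying to scrape the product description from this link, but how do I scrape the entire text, including the text between the <b> tags? This is my call on the hxs object:
hxs.select('//div[@class="overview"]/div/text()').extract()
and this is the raw HTML: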
These classic sneakers from
<b>Puma</b>
are best known for their neat and simple design. These basketball shoes are crafted by novel tooling that brings the sleek retro sneaker look. The pair is equipped with a
<b>leather and synthetic upper.</b>
A vulcanized non-slip rubber sole that is
<b>abrasion resistant ensures good traction.</b>
If I use the hxs object mentioned above, I get:
hxs.select('//div[@class="overview"]/div/text()').extract()
Output:
[u'These classic sneakers from ',
u' are best known for their neat and simple design. These basketball shoes are crafted by novel tooling that brings the sleek retro sneaker look. The pair is equipped with a ',
u' A vulcanized non-slip rubber sole that is ',
u' sportswear, jeans and tees.',
u' Gently brush away dust or dirt using a soft cleaning brush.',
u'\r\nUse a leather conditioner/wax and a brush for added shine.',
u'Avoid contact with liquids.\xa0']
What I want is this:
These classic sneakers from Puma are best known for their neat and simple design. These
basketball shoes are crafted by novel tooling that brings the sleek retro sneaker look. The pair is equipped with a leather and synthetic upper. A vulcanized non-slip rubber sole
that is abrasion resistant ensures good traction.
As you can see, the text between the <b> tags is missing. Could you tell me how to extract the entire text from the page?
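
For illustration, here is a minimal sketch of why the bold text goes missing and one way to collect it. In XPath, /text() matches only the text nodes that are direct children of the div, so anything wrapped in <b> is skipped, whereas //text() matches text nodes at any depth. With the same hxs object, the analogous change would presumably be hxs.select('//div[@class="overview"]/div//text()').extract() followed by joining the pieces. The sketch below shows the same idea using only the Python standard library. The SAMPLE markup is a shortened, hypothetical copy of the HTML above, and full_text is a made-up helper name. ElementTree only accepts well-formed markup, so a real page would need a proper HTML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened copy of the markup from the question (well-formed on purpose).
SAMPLE = (
    '<html><body><div class="overview"><div>'
    'These classic sneakers from <b>Puma</b> are best known for their neat '
    'and simple design. The pair is equipped with a '
    '<b>leather and synthetic upper.</b> '
    'A vulcanized non-slip rubber sole that is '
    '<b>abrasion resistant ensures good traction.</b>'
    '</div></div></body></html>'
)


def full_text(markup):
    """Return all text under the overview div, including text inside child tags."""
    root = ET.fromstring(markup)
    node = root.find(".//div[@class='overview']/div")
    # itertext() walks the node and every descendant in document order,
    # which is the ElementTree counterpart of XPath's //text().
    joined = "".join(node.itertext())
    # Collapse runs of whitespace (newlines, double spaces) into single spaces.
    return " ".join(joined.split())


description = full_text(SAMPLE)
print(description)
```

Joining the pieces into one string, rather than keeping the extracted list, is what gives a single continuous description like the one I am after.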