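I am trying to scrape the product description from this link. How can I scrape the entire text, including the text between the <b> tags? This is the hxs call I am using: hxs.select('//div[@class="overview"]/div/text()').extract(), but the raw HTML is: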

These classic sneakers from
<b>Puma</b>
are best known for their neat and simple design. These basketball shoes are crafted by novel tooling that brings the sleek retro sneaker look. The pair is equipped with a
<b>leather and synthetic upper.</b>
A vulcanized non-slip rubber sole that is
<b>abrasion resistant ensures good traction.</b>

If I use the hxs call above, I get:

hxs.select('//div[@class="overview"]/div/text()').extract()
Output: 
[u'These classic sneakers from ',
 u' are best known for their neat and simple design. These basketball shoes are crafted by novel tooling that brings the sleek retro sneaker look. The pair is equipped with a ',
 u' A vulcanized non-slip rubber sole that is ',
 u' sportswear, jeans and tees.',
 u' Gently brush away dust or dirt using a soft cleaning brush.',
 u'\r\nUse a leather conditioner/wax and a brush for added shine.',
 u'Avoid contact with liquids.\xa0']

What I want is this:

These classic sneakers from Puma are best known for their neat and simple design. These
 basketball shoes are crafted by novel tooling that brings the sleek retro sneaker look. The pair is equipped with a leather and synthetic upper.A vulcanized non-slip rubber sole 
that is abrasion resistant ensures good traction.

As you can see, the text between the <b> tags is missing. Can you tell me how to extract the entire text from the page?

1 Answer

Try selecting the whole element instead of only its text nodes:

 //div[@class="overview"]/div
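The reason this helps: `text()` matches only the text nodes that are direct children of the div, so anything wrapped in `<b>` is skipped, while selecting the element itself keeps the nested markup. Here is a minimal sketch of that difference. It uses the standard library's `xml.etree.ElementTree` as a stand-in for the Scrapy selector (an assumption made only so the snippet runs without Scrapy), and the `page` string is a shortened, hypothetical version of the HTML from the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the product page in the question.
page = ET.fromstring(
    '<html><body><div class="overview"><div>These classic sneakers from '
    '<b>Puma</b> are best known for their neat and simple design.'
    '</div></div></body></html>'
)

# Select the element itself, not just its direct text nodes.
node = page.find(".//div[@class='overview']/div")

# Serialising the element keeps the nested <b> markup.
markup = ET.tostring(node, encoding="unicode")

# itertext() walks every descendant text node, including text inside <b>.
text = "".join(node.itertext())

print(markup)
print(text)
```

Real pages are often not well-formed XML, so in an actual spider the element would come from the Scrapy selector rather than from `ElementTree`.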

Once you have the markup, you can use a regular expression to strip the tags from it, or leave them in if they are not a problem.

A regex like this:

 re.sub('<[^>]*>', '', mystring)
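As a rough, self-contained sketch of how that substitution behaves on the markup from the question: the `html` string below is a shortened sample, and `strip_tags` is a hypothetical helper name, not part of Scrapy. The extra whitespace collapse is an addition for readability, since the raw markup contains line breaks around the `<b>` tags.

```python
import re

# Shortened sample of the markup from the question. In a spider this string
# would be the extracted HTML of the div rather than a literal.
html = (
    '<div>These classic sneakers from\n'
    '<b>Puma</b>\n'
    'are best known for their neat and simple design.</div>'
)

def strip_tags(fragment):
    """Remove anything that looks like a tag, then collapse whitespace."""
    text = re.sub(r'<[^>]*>', '', fragment)
    # Line breaks and repeated spaces become single spaces.
    return re.sub(r'\s+', ' ', text).strip()

print(strip_tags(html))
# These classic sneakers from Puma are best known for their neat and simple design.
```

This pattern is fine for simple inline tags like `<b>`, but it can misfire on comments, scripts, or attribute values that contain `>`, so treat it as a quick fix rather than a general HTML parser.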
answered 2013-07-01T15:10:32.457