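I am trying to scrape only the text from the body using Python Scrapy, but haven't had any luck yet.

I'm hoping someone here can help me scrape all the text from the <body> tag.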

2 Answers

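Scrapy uses XPath notation to extract parts of an HTML document. Have you tried just using the /html/body path to extract <body> (assuming it's nested in <html>)? It might be even simpler to use the //body selector: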

x.select("//body").extract()    # extract body

You can find more information about the selectors Scrapy provides in the Scrapy selectors documentation.
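
Note that selecting the body element returns its markup, not just its text. The following is a minimal sketch of that difference using only Python's standard library rather than Scrapy itself; the sample page and variable names are hypothetical, and ElementTree only handles well-formed markup, so treat it as an illustration of the idea and not as Scrapy's behavior. In newer Scrapy releases the equivalent selection would go through `response.xpath(...)`.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed page, used only for illustration.
PAGE = (
    "<html><head><title>t</title></head>"
    "<body><h1>Title</h1><p>Hello <i>world</i>.</p></body></html>"
)

root = ET.fromstring(PAGE)

# Child step from the root, roughly what /html/body expresses.
body = root.find("body")
# Descendant step, roughly what //body expresses.
same_body = root.find(".//body")

# Serializing the element keeps all the tags ...
body_markup = ET.tostring(body, encoding="unicode")
# ... while walking its text nodes yields text only.
body_text = "".join(body.itertext())

print(body_markup)
print(body_text)
```

Both lookups land on the same element here; the interesting part is that the serialized body still contains tags, which is why a text-node selection is needed to get text only.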

answered 2011-03-22T11:11:46.097

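It would be nice to get output like that produced by lynx -nolist -dump, which renders the page and then dumps the visible text. I've gotten close by extracting the text of all descendants of the paragraph elements.

I started with //body//text(), which pulls all the text nodes within the body, but this includes the contents of script elements. //body//p gets all the paragraph elements within the body, including the implied paragraph tag around untagged text. Extracting the text with //body//p/text() misses text inside subtags (like bold, italic, span, div). //body//p//text() seems to get most of the desired content, as long as the page doesn't have script tags embedded in paragraphs.

In XPath, / selects direct children, while // selects all descendants.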
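
To make the child versus descendant distinction concrete, here is a small sketch using the standard library's ElementTree, which supports only a subset of XPath and has no text() step. The sample page and the helper `direct_text` are hypothetical, and the three lists are rough analogues of the Scrapy expressions above, not the same queries.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page (ElementTree needs valid XML).
SAMPLE = (
    "<html><body>"
    "<script>var x = 1;</script>"
    "<p>Plain <b>bold</b> tail</p>"
    "<div><p>Nested paragraph</p></div>"
    "</body></html>"
)

body = ET.fromstring(SAMPLE).find("body")

# "/" step: only paragraphs that are direct children of <body>.
direct = body.findall("./p")
# "//" step: paragraphs at any depth below <body>.
descendants = body.findall(".//p")


def direct_text(p):
    """Text nodes that sit directly inside <p>, skipping text in subtags."""
    parts = [p.text or ""] + [child.tail or "" for child in p]
    return [s for s in parts if s]


# Rough analogue of //body//text(): every text node, script included.
all_text = list(body.itertext())
# Rough analogue of //body//p/text(): misses the text inside <b>.
p_direct_text = [t for p in descendants for t in direct_text(p)]
# Rough analogue of //body//p//text(): paragraph text including subtags.
p_all_text = [t for p in descendants for t in p.itertext()]

print(len(direct), len(descendants))
print(all_text)
print(p_direct_text)
print(p_all_text)
```

On this sample, only the last list both skips the script contents and keeps the bold text, which matches the reasoning for preferring //body//p//text().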

% scrapy shell
In[1]: fetch('http://stackoverflow.com/questions/5390133/scrapy-body-text-only')
In[2]: hxs.select('//body//p//text()').extract()

Out[2]:
[u"I am trying to scrape the text only from body using python Scrapy, but haven't had any luck yet.",
u'Wishing some scholars might be able to help me here scraping all the text from the ',
u'&lt;body&gt;',
u' tag.',
u'Thank you in advance for your time.',
u'Scrapy uses XPath notation to extract parts of a HTML document. So, have you tried just using the ',
u'/html/body',
u' path to extract ',
u'&lt;body&gt;',
u"? (assuming it's nested in ",
u'&lt;html&gt;',
u'). It might be even simpler to use the ',
u'//body',
u' selector:',
u'You can find more information about the selectors Scrapy provides ',
u'here',

Join the strings together with spaces and you get pretty good output:

In [43]: ' '.join(hxs.select("//body//p//text()").extract())
Out[43]: u"I am trying to scrape the text only from body using python Scrapy, but haven't had any luck yet. Wishing some scholars might be able to help me here scraping all the text from the  &lt;body&gt;  tag. Thank you in advance for your time. Scrapy uses XPath notation to extract parts of a HTML document. So, have you tried just using the  /html/body  path to extract  &lt;body&gt; ? (assuming it's nested in  &lt;html&gt; ). It might be even simpler to use the  //body  selector: You can find more information about the selectors Scrapy provides  here . This is a collaboratively edited question and answer site for  professional and enthusiast programmers . It's 100% free, no registration required. about \xbb \xa0\xa0\xa0 faq \xbb \r\n             tagged asked 1 year ago viewed 280 times active 1 year ago"
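
The joined output above still contains doubled spaces, non-breaking spaces (\xa0), and a \r\n followed by indentation. One possible cleanup, sketched here under the assumption that collapsing every run of whitespace into a single space is acceptable, uses a regular expression; the `fragments` list is hypothetical sample data shaped like the extract() result, not real output.

```python
import re

# Hypothetical fragments shaped like the extract() output above.
fragments = [
    "help me here scraping all the text from the ",
    "<body>",
    " tag.",
    "about \xbb \xa0\xa0\xa0 faq \xbb \r\n             tagged",
]

joined = " ".join(fragments)

# For str patterns, \s matches Unicode whitespace, including \xa0 and \r\n,
# so every run of whitespace collapses into one ordinary space.
cleaned = re.sub(r"\s+", " ", joined).strip()

print(cleaned)
```

This loses line breaks entirely, so it suits a single flat string of text; keeping paragraph boundaries would need a different join strategy.
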
answered 2012-06-09T02:50:29.413