
I'm using Scrapy to fetch an article:

>>> articletext = hxs.select("//span[@id='articleText']")
>>> for p in articletext.select('.//p'):
...     print p.extract()
...
<p class="byline">By Patricia Reaney</p>
<p>
        <span class="location">NEW YORK</span> |
        <span class="timestamp">Tue Apr 3, 2012 6:19am EDT</span>
</p>
<p><span class="articleLocation">NEW YORK</span> (Reuters) - Ba
track of finances, shopping and searching for jobs are the mai
et users around the globe, according to a new international 
survey.</p>
<p>Nearly 60 percent of people in 24 countries used the web to
account and other financial assets in the past 90 days, making
ar use of the Internet.</p>
<p>Shopping was not far behind at 48 percent, the Ipsos poll fo
 and 41 percent went online in search of a job.</p>
<p>"It is easy. You can do it any time of the day and most of t
on't have fees," said Keren Gottfried, research manager for Ips
Affairs, about banking online.</p>

I would like to remove the byline, timestamp, and article location and keep just the article body. Or, better yet, extract only the article text. How can I do this?


2 Answers


You could try this:

articletext = hxs.select("//span[@id='articleText']/p[position()>2]")

which should return these four <p> tags:

<p><span class="articleLocation">NEW YORK</span> (Reuters) - Ba
track of finances, shopping and searching for jobs are the mai
et users around the globe, according to a new international 
survey.</p>

<p>Nearly 60 percent of people in 24 countries used the web to
account and other financial assets in the past 90 days, making
ar use of the Internet.</p>

<p>Shopping was not far behind at 48 percent, the Ipsos poll fo
 and 41 percent went online in search of a job.</p>
<p>"It is easy. You can do it any time of the day and most of t
on't have fees," said Keren Gottfried, research manager for Ips
Affairs, about banking online.</p>

But you may then have to strip out the articleLocation span manually.
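A minimal sketch of that approach, assuming the old HtmlXPathSelector API from the question; page_html below is a hypothetical, trimmed-down stand-in for the article markup, and the regex at the end is just one rough way to strip the articleLocation span by hand:

import re

from scrapy.selector import HtmlXPathSelector

# Trimmed-down stand-in for the article markup from the question.
page_html = """
<span id="articleText">
  <p class="byline">By Patricia Reaney</p>
  <p>
    <span class="location">NEW YORK</span> |
    <span class="timestamp">Tue Apr 3, 2012 6:19am EDT</span>
  </p>
  <p><span class="articleLocation">NEW YORK</span> (Reuters) - ...</p>
  <p>Nearly 60 percent of people in 24 countries ...</p>
</span>
"""

hxs = HtmlXPathSelector(text=page_html)

# Skip the first two <p> tags (byline and location/timestamp).
for p in hxs.select("//span[@id='articleText']/p[position()>2]"):
    html = p.extract()
    # Crude manual cleanup: drop the leading articleLocation span.
    html = re.sub(r'<span class="articleLocation">[^<]*</span>\s*', '', html)
    print(html)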

answered 2013-06-11T02:53:47.203

Well, you can add predicates to skip those <p>s. Try something like this:

//span[@id="articleText"]//p[not(@class)][not(span[@class])]

That means "all P elements with no class attribute, and with no child SPAN element that has a class". You can filter with multiple predicates :)
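For completeness, a rough sketch of that selector in use, reusing the hypothetical page_html and HtmlXPathSelector setup from the sketch above and joining each surviving paragraph's text nodes into one string:

# Keep only <p> tags with no class attribute and no classed child <span>.
body = hxs.select('//span[@id="articleText"]//p[not(@class)][not(span[@class])]')

# Join the text nodes of each remaining paragraph into the article body.
article = "\n\n".join(
    "".join(p.select(".//text()").extract()).strip() for p in body
)
print(article)

Note that on the sample markup above, the second predicate also drops the paragraph containing <span class="articleLocation">, so you may want to relax it or strip that span separately, as in the first answer.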

answered 2013-06-12T21:40:18.163