Here is my spider.py file:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scinews_com.items import scinews_item
import codecs
import sys 

class sci_news_com(BaseSpider):
   sys.stdout = codecs.getwriter(sys.stdout.encoding)(sys.stdout, errors='replace')
   name = "scinews"
   allowed_domains = ["sci-news.com"]
   start_urls = [
       "http://www.sci-news.com/news/astronomy"
   ]

   def parse(self, response):
       hxs = HtmlXPathSelector(response)
       bottomposts = hxs.select('//div[@class="bottom-recentpost-wrapper-cat"]')      
       items = []
       for bottompost in bottomposts:
           item = scinews_item()
           item['Article_Title'] = bottompost.select('div[@class="bottom-archive"]/div[@class="post-content-holder"]/div[@class="bottom-content-heading-0"]/h2/a/text()').extract()
           item['Article_Desc'] = bottompost.select('div[@class="bottom-archive"]/div[@class="post-content-holder"]/div[@class="post-content"]/p/text()').extract()
           item['Article_Date'] = bottompost.select('div[@class="bottom-archive"]/div[@class="post-content-holder"]/div[@class="recentpost-dateauthor"]/text()').extract()
           item['Article_Author'] = bottompost.select('div[@class="bottom-archive"]/div[@class="post-content-holder"]/div[@class="recentpost-dateauthor"]/a/text()').extract()
           item['Article_Link'] = bottompost.select('div[@class="bottom-archive"]/div[@class="post-content-holder"]/div[@class="bottom-content-heading-0"]/h2/a/@href').extract()
           item['Article_Image'] = bottompost.select('div[@class="bottom-archive"]/div[@class="bottom-recentpost-image-0"]/a/img/@src').extract()
           items.append(item)
       return items

And my items.py:

from scrapy.item import Item, Field

class scinews_item(Item):
    Article_Title = Field()
    Article_Desc = Field()
    Article_Date = Field()
    Article_Author = Field()
    Article_Link = Field()
    Article_Image = Field()  

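For some reason, it does not put the result of each select statement into its own "Field()" per article. Instead, it outputs all of them together as a single item (in both the CSV and JSON output):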

Article_Title,Article_Image,Article_Desc,Article_Date,Article_Link,Article_Author
" HIP 102152: Astronomers Find Oldest Solar Twin 250 Light-Years Away , ALMA Sees Spectacular Newborn Star 1400 Light-Years Away , Astronomers Discover New Earth-Sized Exoplanet Kepler 78b , Hubble Zooms in on Galaxies in Early Universe , Chandra Sees Multimillion-Degree Gas Cloud in NGC 1232 , Pulsar Helps Astronomers Measure Magnetic Field around Milky Way’s Central Black Hole , Astronomers Discover First Ever Six-Image Lensed Quasar , Astronomers See Bizarre Pair in Large Magellanic Cloud ","http://cdn4.sci-news.com/images/2013/08/image_1344_1-HIP102152-195x110.jpg,http://cdn4.sci-news.com/images/2013/08/image_1325-Herbig-Haro-195x110.jpg,http://cdn4.sci-news.com/images/2013/08/image_1324f-Kepler-78b-195x110.jpg,http://cdn4.sci-news.com/images/2013/08/image_1318-galaxies-195x110.jpg,http://cdn4.sci-news.com/images/2013/08/image_1314-NGC-1232-195x110.jpg,http://cdn4.sci-news.com/images/2013/08/image_1312-pulsar-195x110.jpg,http://cdn4.sci-news.com/images/2013/08/image_1304f-quasar-195x110.jpg,http://cdn4.sci-news.com/images/2013/08/image_1301-LMC-195x110.jpg","Astronomers using ESO’s Very Large Telescope have identified the oldest solar twin known to date.
This image shows the Sun-like star HIP 102152. Credit:...,A team of astronomers using the Atacama Large Millimeter/submillimeter Array (ALMA) has captured a beautiful close-up view of an object named Herbig-Haro...,An international team of astronomers reporting in the Astrophysical Journal (arXiv.org) has discovered an Earth-sized exoplanet called Kepler 78b that...,Astronomers using NASA’s Hubble Space Telescope have established that mature-looking galaxies existed much earlier than previously known, when the...,U.S. astronomer using NASA’s Chandra X-ray Observatory has discovered a massive cloud of hot gas likely caused by a collision between a dwarf galaxy...,An international team of astronomers has used observations of the newly discovered pulsar PSR J1745-2900 to measure the magnetic field emanating from a...,A team of scientists at the University of Copenhagen’s Niels Bohr Institute, Denmark, has reported the discovery of a six-image lensed quasar named...,A team of astronomers using the Very Large Telescope (VLT) at ESO’s Paranal Observatory in Chile has captured an image of two distinctive glowing..."," Aug 29, 2013   by 
, Aug 21, 2013   by 
, Aug 21, 2013   by 
, Aug 16, 2013   by 
, Aug 15, 2013   by 
, Aug 15, 2013   by 
, Aug 12, 2013   by 
, Aug 9, 2013   by 
","http://www.sci-news.com/astronomy/science-oldest-solar-twin-01344.html,http://www.sci-news.com/astronomy/science-alma-newborn-star-01325.html,http://www.sci-news.com/astronomy/science-new-earth-sized-exoplanet-kepler78b-01324.html,http://www.sci-news.com/astronomy/science-hubble-galaxies-early-universe-01318.html,http://www.sci-news.com/astronomy/science-chandra-gas-cloud-ngc1232-01314.html,http://www.sci-news.com/astronomy/science-pulsar-magnetic-field-milky-way-black-hole-01312.html,http://www.sci-news.com/astronomy/science-first-ever-six-image-lensed-quasar-01304.html,http://www.sci-news.com/astronomy/science-large-magellanic-cloud-01301.html","Sci-News.com,Enrico de Lazaro,Sergio Prostak,Enrico de Lazaro,Sci-News.com,Sci-News.com,Sergio Prostak,Sci-News.com"

Thanks in advance!

1 Answer

Look at the HTML source of http://www.sci-news.com/news/astronomy:

<div id="content" class="single-wrapper" role="main">
    <section>
        <div class="post-wrapper-archive">
            <div class="related-post-wrapper-block">...
            <div class="bottom-recentpost-wrapper-cat">
                <div class="bottom-archive">
                    <div class="bottom-recentpost-image-0">...
                    <div class="post-content-holder">...
                    <div class="cb"></div>
                </div>
                <div class="bottom-archive">...
                <div class="bottom-archive">...
                <div class="bottom-archive">...
                <div class="bottom-archive">...
                <div class="bottom-archive">...
                <div class="bottom-archive">...
                <div class="bottom-archive">...
                <div class="cb"></div>
            </div>
            <div id="pag">
        </div>
    </section>
</div>
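
The outline above shows one bottom-recentpost-wrapper-cat div holding all of the bottom-archive divs. One way to check how many nodes each XPath matches is sketched below. It uses Python's standard-library xml.etree.ElementTree rather than Scrapy's selectors, and the SAMPLE markup is a hand-simplified stand-in for the page (three posts instead of eight), so treat it as an illustration only:

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the page structure (3 posts, not 8).
SAMPLE = """
<div id="content">
  <div class="bottom-recentpost-wrapper-cat">
    <div class="bottom-archive"><h2><a href="/a1.html">Post 1</a></h2></div>
    <div class="bottom-archive"><h2><a href="/a2.html">Post 2</a></h2></div>
    <div class="bottom-archive"><h2><a href="/a3.html">Post 3</a></h2></div>
    <div class="cb"></div>
  </div>
</div>
"""

root = ET.fromstring(SAMPLE)

# Selector used in the question: matches only the single wrapper div.
wrappers = root.findall(".//div[@class='bottom-recentpost-wrapper-cat']")

# Selector extended by one step: matches each post separately.
posts = root.findall(
    ".//div[@class='bottom-recentpost-wrapper-cat']/div[@class='bottom-archive']"
)

print(len(wrappers), len(posts))
```

A loop over wrappers runs once, while a loop over posts runs once per article.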

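The page has a single bottom-recentpost-wrapper-cat div that contains all eight bottom-archive divs, so your loop runs only once and each relative select matches every post at the same time. I suggest moving the div[@class="bottom-archive"] step into the bottomposts selector, so that the loop iterates over one post at a time: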

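from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scinews_com.items import scinews_item
import codecs
import sys

class sci_news_com(BaseSpider):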
   sys.stdout = codecs.getwriter(sys.stdout.encoding)(sys.stdout, errors='replace')
   name = "scinews"
   allowed_domains = ["sci-news.com"]
   start_urls = [
       "http://www.sci-news.com/news/astronomy"
   ]

   def parse(self, response):
       hxs = HtmlXPathSelector(response)
       bottomposts = hxs.select('//div[@class="bottom-recentpost-wrapper-cat"]/div[@class="bottom-archive"]')      
       items = []
       for bottompost in bottomposts:
           item = scinews_item()
           item['Article_Title'] = bottompost.select('div[@class="post-content-holder"]/div[@class="bottom-content-heading-0"]/h2/a/text()').extract()
           item['Article_Desc'] = bottompost.select('div[@class="post-content-holder"]/div[@class="post-content"]/p/text()').extract()
           item['Article_Date'] = bottompost.select('div[@class="post-content-holder"]/div[@class="recentpost-dateauthor"]/text()').extract()
           item['Article_Author'] = bottompost.select('div[@class="post-content-holder"]/div[@class="recentpost-dateauthor"]/a/text()').extract()
           item['Article_Link'] = bottompost.select('div[@class="post-content-holder"]/div[@class="bottom-content-heading-0"]/h2/a/@href').extract()
           item['Article_Image'] = bottompost.select('div[@class="bottom-recentpost-image-0"]/a/img/@src').extract()
           items.append(item)
       return items
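
To make the before/after difference concrete, here is a minimal sketch that builds items both ways. It again relies on xml.etree.ElementTree instead of Scrapy, and the SAMPLE markup, the scrape helper, and the two-post data are all assumptions made up for the illustration:

```python
import xml.etree.ElementTree as ET

# Hypothetical two-post wrapper; class names follow the page outline.
SAMPLE = """
<div class="bottom-recentpost-wrapper-cat">
  <div class="bottom-archive">
    <div class="post-content-holder">
      <div class="bottom-content-heading-0"><h2><a href="/first.html">First post</a></h2></div>
    </div>
  </div>
  <div class="bottom-archive">
    <div class="post-content-holder">
      <div class="bottom-content-heading-0"><h2><a href="/second.html">Second post</a></h2></div>
    </div>
  </div>
</div>
"""

wrapper = ET.fromstring(SAMPLE)

# Path from a bottom-archive div down to the title link.
TITLE = "div[@class='post-content-holder']/div[@class='bottom-content-heading-0']/h2/a"


def scrape(nodes, prefix=""):
    """Build one dict per node, collecting every title link found under it."""
    items = []
    for node in nodes:
        links = node.findall(prefix + TITLE)
        items.append({
            "Article_Title": [a.text for a in links],
            "Article_Link": [a.get("href") for a in links],
        })
    return items


# Question's approach: loop over the wrapper, so one item holds every post.
before = scrape([wrapper], prefix="div[@class='bottom-archive']/")

# Suggested approach: loop over each bottom-archive div, one item per post.
after = scrape(wrapper.findall("div[@class='bottom-archive']"))

print(len(before), len(after))
```

In the first case every field is a list covering all posts, which is what shows up as one long comma-joined row in the CSV output. In the second case each item carries the values of a single post.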
answered 2013-09-02T07:30:03.700