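I need to scrape this website. It is a riddle ("brain teaser") site: when you click a button, it runs a JavaScript function that pops up a window showing the answer.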
 <tr> 
      <td width="60" bgcolor="#ECF5FF"> <p align="center"><font color="#800000".htm>1</font></p></td>
      <td width="539" bgcolor="#ECF5FF"> <font color="#008080">一种东西,东方人的短,西方人的长,结婚后女的就可以用男的这东西,和尚有但是不用它&nbsp;</font> 
      </td>
      <td width="95" bgcolor="#ECF5FF"> <p align="center"> 
          <INPUT onClick="MM_popupMsg('答案:名字&nbsp;')" type=button value=答案 name=button8639 style='font-size:12px;height:18px;border:1px solid black;'>
        </p></td>
    </tr>
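This is the code I wrote to scrape the questions and answers. I can get the questions successfully, but not the answers. (When I print the answers, I get an empty list, [].)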
    questions = hxs.select('//td[@width="539"]/font/text()').extract()
    answers = hxs.select('//td[@width="95"]/INPUT/@onClick').extract()
The answer is the content of the onclick handler. That is, I want to get this string:
MM_popupMsg('答案:名字&nbsp;')
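Once that attribute value is in hand, pulling the answer text out of it looks straightforward. Here is a minimal sketch, assuming every button uses the `MM_popupMsg('...')` form from the HTML row quoted earlier. The name `answer_from_onclick` is made up for illustration, and the sketch is written for Python 3 rather than the Python 2 my spider uses:

```python
import re
from html import unescape

# Captures whatever sits between MM_popupMsg(' and ').
ONCLICK_RE = re.compile(r"MM_popupMsg\('(.*)'\)")


def answer_from_onclick(onclick):
    """Return the bare answer text from an onclick value, or None if it does not match."""
    match = ONCLICK_RE.search(onclick)
    if match is None:
        return None
    # Decode entities such as &nbsp;, then turn non-breaking spaces into plain ones.
    text = unescape(match.group(1)).replace("\u00a0", " ").strip()
    # Drop the leading "答案" label with either an ASCII or a full-width colon.
    return re.sub(r"^答案[::]\s*", "", text)


print(answer_from_onclick("MM_popupMsg('答案:名字&nbsp;')"))
```

This only helps after the XPath actually returns the attribute, which is the part I am stuck on.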
This is my spider:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
import re
class ReviewSpider(BaseSpider):
    name = "2345jzw"
    allowed_domains = ['2345.com/jzw']
    start_urls = ['http://www.2345.com/jzw/index.htm']
    page = 1
    while page <= 1:
        url = 'http://www.2345.com/jzw/%d.htm' % page
        start_urls.append(url)
        page = page + 1
    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        questions = hxs.select('//td[@width="539"]/font/text()').extract()
        answers = hxs.select('//td[3]/p/INPUT/@onClick').extract()
        print questions
        print answers
        id = 1
        while id <= 50:
            question = questions[id - 1]
            question = re.sub(r'<[^>]*?>', '', str(question.encode('utf8')))
            question = ' '.join(question.split())
            question = question.replace('&', ' ')
            question = question.replace('\'', ' ')
            question = question.replace(',', ';')
            answer = answers[id - 1]
            answer = re.sub(r'<[^>]*?>', '', str(answer.encode('utf8')))
            answer = ' '.join(answer.split())
            answer = answer.replace('&', ' ')
            answer = answer.replace('\'', ' ')
            answer = answer.replace(',', ';')
            file = open('crawled.xml', 'a')
            file.write(question)
            file.write(",")
            file.write(answer)
            file.write("\n")
            file.close()
            id = id + 1
I also tried:
hxs.select('//INPUT/@onClick').extract()
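But it still doesn't work. What is wrong with the path?
Note that the questions are extracted successfully, and the structure for questions and answers is very similar. Why are the answers empty?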
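One thing I have not ruled out, offered as an assumption rather than a confirmed diagnosis: HTML parsers commonly normalize tag and attribute names to lowercase, so an XPath written with `INPUT` and `@onClick` could match nothing even though the page source uses that casing. The sketch below uses Python 3's standard-library `html.parser`, not Scrapy's own selector, purely to show the normalization. `OnclickCollector` and `SAMPLE` are names made up for illustration:

```python
from html.parser import HTMLParser

# The answer cell from the page, trimmed; note the upper-case INPUT and onClick.
SAMPLE = """<td width="95"><p align="center">
<INPUT onClick="MM_popupMsg('答案:名字&nbsp;')" type=button value=答案 name=button8639>
</p></td>"""


class OnclickCollector(HTMLParser):
    """Records tag/attribute names as the parser reports them, plus onclick values."""

    def __init__(self):
        super().__init__()
        self.seen_names = []  # (tag, [attribute names]) for each <input>
        self.onclicks = []    # onclick attribute values

    def handle_starttag(self, tag, attrs):
        # The parser hands over names already lower-cased.
        if tag == "input":
            self.seen_names.append((tag, [name for name, _ in attrs]))
            self.onclicks.extend(value for name, value in attrs if name == "onclick")


collector = OnclickCollector()
collector.feed(SAMPLE)
collector.close()
print(collector.seen_names)
print(collector.onclicks)
```

If the parser behind Scrapy's selectors behaves the same way (I believe it does, but have not verified it here), then an all-lowercase expression such as `//td[@width="95"]//input/@onclick` would be the one to try. The `//` before `input` also allows for the `<p>` that sits between the cell and the button.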