I'm trying to scrape links out of a table that uses pagination. I can get Selenium to step through the pages, and I can pull the links from the first page, but when I try to combine the two, the process stops once it reaches the last page and there is no more "next page" button, and I get nothing back. I'm not sure how to gracefully tell it to just write the data out to a CSV. I'm using a while True: loop, so this has me fairly stumped.
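The pattern I'm imagining is something like the sketch below, where a missing "Next Page" button ends the loop instead of crashing it. This is only a guess at the fix, assuming NoSuchElementException is what gets raised when the button is gone:

from selenium.common.exceptions import NoSuchElementException

while True:
    try:
        next_btn = self.driver.find_element_by_partial_link_text("Next Page")
    except NoSuchElementException:
        # no more pages -- stop paging and fall through to write the CSV
        break
    next_btn.click()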
My other question is about locating the links I'm trying to parse with XPath. The links sit in two different tr classes: one set is under //tr[@class="resultsY"] and the other under //tr[@class="resultsW"]. Is there some kind of OR statement I can use to target all of the links at once?
The one solution I found,
'//tr[@class="resultsY"] | //tr[@class="resultsW"]'
gives me an error every time.
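For reference, these are the two single-query forms I've come across: the union operator between two full paths, and an "or" inside a single predicate, which as far as I can tell should match the same rows. The first one is what errors out for me:

# the union form that errors out for me
rows = self.driver.find_elements_by_xpath(
    '//tr[@class="resultsY"] | //tr[@class="resultsW"]')
# alternative: one predicate with "or" -- should match the same rows
rows = self.driver.find_elements_by_xpath(
    '//tr[@class="resultsY" or @class="resultsW"]')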
Here is the HTML table:
<tr class="resultsW">
  <td></td>
  <td>
    <a href="fdafda"></a>   <!-- a link i'm after -->
  <td>
  <td></td>
</tr>
<tr class="resultsW">
  <td></td>
  <td>
    <a href="fdafda"></a>   <!-- a link i'm after -->
  <td>
  <td></td>
</tr>
Here is my Scrapy spider:
import time
from scrapy.item import Item, Field
from selenium import webdriver
from scrapy.spider import BaseSpider
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select
from selenium.common.exceptions import NoSuchElementException
from scrapy.selector import HtmlXPathSelector

class ElyseAvenueItem(Item):
    link = Field()
    link2 = Field()

class ElyseAvenueSpider(BaseSpider):
    name = "s1"
    allowed_domains = ["nces.ed.gov"]
    start_urls = ['https://nces.ed.gov/collegenavigator/']

    def __init__(self):
        self.driver = webdriver.Firefox()

    def parse(self, response):
        self.driver.get(response.url)
        select = Select(self.driver.find_element_by_id(
            "ctl00_cphCollegeNavBody_ucSearchMain_ucMapMain_lstState"))
        select.deselect_by_visible_text("No Preference")
        select.select_by_visible_text("Alabama")
        self.driver.find_element_by_id(
            "ctl00_cphCollegeNavBody_ucSearchMain_btnSearch").click()

        # here is the while loop. it gets to the end of the table and
        # says... no more "next page" and gives me the middle finger
        '''while True:
            el1 = self.driver.find_element_by_partial_link_text("Next Page")
            if el1:
                el1.click()
            else:
                #return(items)
                self.driver.close()'''

        hxs = HtmlXPathSelector(response)

        # here i tried:
        # titles = self.driver.find_elements_by_xpath('//tr[@class ="resultsW"] | //tr[@class ="resultsY"]')
        # and i got an error saying that
        titles = self.driver.find_elements_by_xpath('//tr[@class ="resultsW"]')
        items = []
        for titles in titles:
            item = ElyseAvenueItem()
            # here i'd like to be able to target all of the hrefs... not sure how
            link = titles.find_element_by_xpath('//tr[@class ="resultsW"]/td[2]/a')
            item["link"] = link.get_attribute('href')
            items.append(item)
        yield(items)
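For what it's worth, here is roughly the loop I'm trying to get to once the selector question is settled. This is only a sketch: the ./ prefix is my guess at how to search relative to each row instead of the whole page, and I believe the items should be yielded one at a time rather than as a list:

titles = self.driver.find_elements_by_xpath(
    '//tr[@class="resultsW" or @class="resultsY"]')
for title in titles:
    item = ElyseAvenueItem()
    # look up the link relative to this row (./ instead of //) so each
    # iteration doesn't just return the first match on the whole page
    link = title.find_element_by_xpath('./td[2]/a')
    item["link"] = link.get_attribute('href')
    yield item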