
I am trying to scrape data from http://virtuacareers.com/new-jersey/staff-nurse/jobid3462987-registered-nurse-%28rn%29-jobs

I want to get the link from this page, but when I look at my CSV file, the link is:

javascript:GetApplyClickCount('https://careers-virtua.icims.com/jobs/5587/1024245/job?apply=yes&hashed=58168622', 'http://virtuacareers.com/list.aspx?state=voorhees&category=staff+nurse&jobtitle=registered+nurse+(rn)&jobid=3025458&dmaid=1286&dmaname=voorhees', 'SameWindow', 'scrollbars=1, toolbar=1, resizable=1, location=1, directories=1, status=1, menubar=1, copyhistory=1, fullscreen=1', 'true', '0', '0', 'virtuacareers.com', '', '', '3025458', 'Registered Nurse (RN)','212','True','','False');

All I want is:

https://careers-virtua.icims.com/jobs/5587/1024245/job?apply=yes&hashed=58168622

How can I handle this? Here is my code:

linker = hxs.select('//div[@class="box jobDesc"]/a')
item["link"] = linker.select('@href').extract()

1 Answer


One approach is to extract the URL with a regular expression:

>>> import re
>>> s = "javascript:GetApplyClickCount('https://careers-virtua.icims.com/jobs/5587/1024245/job?apply=yes&hashed=58168622', 'http://virtuacareers.com/list.aspx?state=voorhees&category=staff+nurse&jobtitle=registered+nurse+(rn)&jobid=3025458&dmaid=1286&dmaname=voorhees', 'SameWindow', 'scrollbars=1, toolbar=1, resizable=1, location=1, directories=1, status=1, menubar=1, copyhistory=1, fullscreen=1', 'true', '0', '0', 'virtuacareers.com', '', '', '3025458', 'Registered Nurse (RN)','212','True','','False');"
>>> re.search(r"'(?P<url>https?://[^\s']+)'", s).group("url")
'https://careers-virtua.icims.com/jobs/5587/1024245/job?apply=yes&hashed=58168622'

In your case, that would be:

link = linker.select('@href').extract()[0]
item["link"] = re.search(r"'(?P<url>https?://[^\s']+)'", link).group("url")
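
Note that `extract()[0]` raises `IndexError` when the XPath matches nothing, and `re.search(...)` returns `None` when the `href` contains no quoted URL, so `.group("url")` would then raise `AttributeError`. If some pages might lack the apply link, a more defensive version could look like the sketch below. The helper name `extract_apply_url` and the shortened sample `href` are assumptions for illustration, not part of the original code:

```python
import re

# Matches the first single-quoted http(s) URL in the string.
# Excluding the quote character from the URL body stops the match
# at the closing quote of the first argument.
URL_IN_QUOTES = re.compile(r"'(?P<url>https?://[^\s']+)'")


def extract_apply_url(href):
    """Return the first quoted URL in a javascript: href, or None if there is none."""
    match = URL_IN_QUOTES.search(href or "")
    return match.group("url") if match else None


# Shortened version of the href from the question (later arguments trimmed).
href = (
    "javascript:GetApplyClickCount("
    "'https://careers-virtua.icims.com/jobs/5587/1024245/job?apply=yes&hashed=58168622', "
    "'http://virtuacareers.com/list.aspx?state=voorhees', 'SameWindow');"
)

print(extract_apply_url(href))
print(extract_apply_url("javascript:void(0)"))
```

In the spider you would then take the first extracted `href` only if the list is non-empty, and pass it to the helper before assigning the result to `item["link"]`.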
answered 2013-08-29T09:37:20.830