0

我已经看到了谷歌提取的结果,但它不适用于此。我想简单地进入代码并更改参数,当运行时,它会进行搜索并抓取职位、位置和日期。这就是我到目前为止所拥有的。任何帮助都会很棒,并在此先感谢。

我希望脚本使用给定的参数(工程师软件 CA)在 monster.com 上执行搜索并抓取结果。

#! /usr/bin/python
import re
import requests
from urllib import urlopen
from BeautifulSoup import BeautifulSoup

parameters = ["Software","Engineer","CA"]
base_url = "http://careers.boozallen.com/search?q="
search_string = "+".join(parameters)

final_url = base_url + search_string

a = requests.get(final_url)
raw_string = a.text.strip()


soup = BeautifulSoup( raw_string )

job_urls = soup.findAll(name = 'a', attrs = { 'class': 'jobTitle fnt11_js' })

for job_url in job_urls:

    print job_url.text
    print

raw_input("Press enter to close: ")

我知道这在下面可以作为标准刮擦。

handle = urlopen("http://jobsearch.monster.com/search/Engineer_5?q=Software&where=AZ&rad=20&sort=rv.di.dt")
responce = handle.read()
soup = BeautifulSoup( responce )

job_urls = soup.findAll(name = 'a', attrs = { 'class': 'jobTitle fnt11_js' })
for job_url in job_urls:
    print job_url.text
    print
4

1 回答 1

1

如果您将浏览器指向http://careers.boozallen.com/search?q=software+engineer+CA并检查 HTML,您将看到如下 HTML:

<tr class="dbOutputRow2">
    <td style="width: 400px;" class="colTitle" headers="hdrTitle"><span class="jobTitle"><a href="http://careers.boozallen.com/job/San-Diego-Network-Engineer%2C-Senior-Job-CA-92101/1645793/">Network Engineer, Senior Job</a></span></td>
    <td style="width: auto;" class="colLocation" headers="hdrLocation"><span class="jobLocation">San Diego, CA, US</span></td>
    <td style="width: 155px;" class="colDate" headers="hdrDate" nowrap="nowrap"><span class="jobDate">Jan 5, 2012</span></td>

您要查找的信息位于<span>标签中,其class属性等于jobTitlejobLocationjobDate

以下是使用lxml 刮取这些位的方法:

import urllib2
import lxml.html as LH

url = 'http://careers.boozallen.com/search?q=software+engineer+CA'
doc = LH.parse(urllib2.urlopen(url))

def text_content(iterable):
    for elt in iterable:
        yield elt.text_content()

data = text_content(doc.xpath('''//span[@class = "jobTitle"
                                        or @class = "jobLocation"
                                        or @class = "jobDate"]'''))

for title, location, date in zip(*[data]*3):
    print(title,location,date)

产量

('Title', 'Location', 'Date')
('Network Engineer, Senior Job', 'San Diego, CA, US', 'Jan 5, 2012')
('Network Integration Engineer, Mid Job', 'San Diego, CA, US', 'Jan 12, 2012')
('Systems Engineer, Senior Job', 'San Diego, CA, US', 'Jan 31, 2012')
('Enterprise Architect, Senior Job', 'Washington, DC, US', 'Jan 23, 2012')
...
于 2012-02-02T17:17:58.650 回答