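I want to use Python requests together with the Splash browser ( https://splash.readthedocs.io/en/stable/ ) and custom headers to scrape some data from a website. Before starting the crawl, however, I decided to check the headers I was sending at http://xhaus.com/headers. As a result, I saw that the headers I wanted to send were not the ones being sent.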

import random

import requests

def headers():
    # Start from the requests defaults and override only the User-Agent.
    headers = requests.utils.default_headers()
    headers.update({
        'User-Agent': random_user_agent()
        })
    return headers

def random_user_agent():
    # Pick one user agent at random from user-agents.txt (one per line).
    with open('user-agents.txt', 'r') as f:
        user_agents = [line.rstrip('\n') for line in f]
    return random.choice(user_agents)

splash = 'http://localhost:8050/render.html'
headers = headers()
url_h = 'http://xhaus.com/headers'
# The headers are attached to the request that goes to the Splash endpoint.
page = requests.get(splash, params={'url': url_h}, headers=headers)

After running this code, I have the following headers, including the User-Agent:

{'Connection': 'keep-alive', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}

However, when I check it through the website I mentioned, it shows me a different User-Agent:

from bs4 import BeautifulSoup

soup = BeautifulSoup(page.text, 'html.parser')
print(soup.prettify())

...

<td class="even">
       User-Agent
      </td>
      <td class="even">
       <b>
        Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/538.1 (KHTML, like Gecko) splash Safari/538.1
       </b>
      </td>

...
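
One possible explanation, which this post does not confirm: the headers passed to requests.get only reach the Splash HTTP API on localhost, and Splash then makes its own request to the target page with its own User-Agent. The Splash documentation describes a `headers` argument for render.html that is honoured only for POST requests with an application/json body. Below is a minimal sketch under that assumption. The helper names `build_splash_payload` and `fetch_via_splash` are hypothetical and used only for illustration.

```python
import json

# Same Splash endpoint as in the code above.
SPLASH_RENDER_URL = 'http://localhost:8050/render.html'


def build_splash_payload(target_url, user_agent):
    """Build the JSON body for render.html (hypothetical helper).

    Assumption: Splash applies the 'headers' argument to the request it
    makes to target_url, instead of using its own default headers.
    """
    return {
        'url': target_url,
        'headers': {'User-Agent': user_agent},
    }


def fetch_via_splash(target_url, user_agent):
    """POST the payload to Splash and return the rendered HTML (hypothetical helper)."""
    # Imported here so that build_splash_payload works without requests installed.
    import requests

    payload = build_splash_payload(target_url, user_agent)
    # json= serialises the payload and sets Content-Type: application/json.
    response = requests.post(SPLASH_RENDER_URL, json=payload)
    response.raise_for_status()
    return response.text
```

With a Splash instance running on localhost:8050, calling fetch_via_splash('http://xhaus.com/headers', random_user_agent()) and parsing the result as above should show whether the custom User-Agent now reaches the site.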