python - Unable to scrape etherscan transaction URL - Cloudflare protection

Question

Code used to scrape the block ID of an etherscan transaction:

def block_chain_explorer_block_id(url):
    import requests
    from bs4 import BeautifulSoup
    
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html5lib')
    tags = soup.findAll('div', attrs = {'class':'col-md-9'}) 
    print(soup.findAll('a'))

block_chain_explorer_block_id('https://etherscan.io/tx/0x4529e9f79139edab871a699df455e57101cca90574e435da89db457df4885c54')

The output is:

[<a href="https://www.cloudflare.com/5xx-error-landing" id="brand_link" rel="noopener noreferrer" target="_blank">Cloudflare</a>]

I get the output above. The same code works fine on polygonscan, but not on etherscan. Any idea how to make it work?

score 1 · Accepted Answer

Etherscan has an API (with a free plan).

You should use it instead of trying to scrape the site. Here is the documentation for Transactions: https://docs.etherscan.io/api-endpoints/stats
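As a rough sketch of the API route, the block number of a transaction can be read from a JSON response instead of parsed out of HTML. The base URL, the `chainid` value, and the `proxy` / `eth_getTransactionByHash` parameters below are assumptions that should be checked against the current Etherscan documentation, and the function names are made up for illustration. An API key from your own Etherscan account is required.

```python
# Assumed endpoint and parameters; verify against the current Etherscan API docs.
API_URL = "https://api.etherscan.io/v2/api"


def build_params(tx_hash, api_key):
    """Query parameters for looking up a transaction by hash."""
    return {
        "chainid": "1",  # assumed to mean Ethereum mainnet
        "module": "proxy",
        "action": "eth_getTransactionByHash",
        "txhash": tx_hash,
        "apikey": api_key,
    }


def parse_block_number(payload):
    """Return the block number as an int, or None if the payload has none."""
    result = payload.get("result")
    if not isinstance(result, dict):
        return None  # assumption: error responses do not carry a dict here
    block_hex = result.get("blockNumber")
    if not block_hex:
        return None  # e.g. a transaction that is not in a block yet
    return int(block_hex, 16)  # the value is a hex string such as "0x10"


def fetch_block_number(tx_hash, api_key, timeout=10):
    """Call the API and return the transaction's block number (or None)."""
    import requests  # imported lazily so the helpers above work without it

    r = requests.get(API_URL, params=build_params(tx_hash, api_key), timeout=timeout)
    r.raise_for_status()
    return parse_block_number(r.json())
```

Calling `fetch_block_number(tx_hash, api_key)` with the hash from the question and your own key would then replace the whole scraping function.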

score 1 · Accepted Answer

Adding some headers to the request to suggest that you might be a "browser" can give temporary relief, but it is far from bulletproof.

You should also consider how often and how fast you access the target page.
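One simple way to control the request rate is to enforce a minimum delay between calls. The `Throttle` class below is a hypothetical helper, not part of requests, and the interval you pick is a guess rather than a known safe value.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between requests
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first

    def wait(self):
        """Sleep just long enough to keep min_interval since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Create one instance, for example `throttle = Throttle(2.0)`, and call `throttle.wait()` right before each `requests.get(...)`.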

Using rotating proxies is another common approach.
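Proxy rotation with requests can be sketched by cycling through a list and passing a different `proxies` mapping on each call. The proxy addresses and function names below are placeholders for illustration; you would need real proxies that you are allowed to use.

```python
from itertools import cycle

# Hypothetical proxy addresses; replace with proxies you are allowed to use.
PROXIES = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
]


def proxy_rotation(proxy_urls):
    """Yield requests-style proxies dicts, cycling through proxy_urls forever."""
    for url in cycle(proxy_urls):
        yield {"http": url, "https": url}


def get_with_rotation(url, rotation, headers=None, timeout=10):
    """Send a GET request through the next proxy in the rotation."""
    import requests  # imported lazily so proxy_rotation works without it

    return requests.get(url, headers=headers, proxies=next(rotation), timeout=timeout)
```

Build the rotation once with `rotation = proxy_rotation(PROXIES)` and reuse it across calls so that consecutive requests go out through different proxies.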

Note: there is no magic formula for this, because Cloudflare keeps adjusting how it detects bot traffic. Using the API mentioned by @Speedlulu would be the best approach.

Example

Add a user agent as one of the headers, and change findAll() to find_all(), since that is the syntax you should use in new code.

import requests
from bs4 import BeautifulSoup

# Browser-like User-Agent; this may help for a while but is not guaranteed to get past Cloudflare
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}

def block_chain_explorer_block_id(url):
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.content, 'html5lib')
    # Same selector as in the question; not used further here
    tags = soup.find_all('div', attrs={'class': 'col-md-9'})
    print(soup.find_all('a'))

block_chain_explorer_block_id('https://etherscan.io/tx/0x4529e9f79139edab871a699df455e57101cca90574e435da89db457df4885c54')
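Since the headers may stop working at any time, it can help to check whether the response is a Cloudflare error page before parsing it. The helper below is a heuristic sketch: the marker string is taken from the link in the question's output, the 403/503 status codes are an assumption, and other block pages may look different.

```python
def looks_like_cloudflare_block(status_code, html):
    """Heuristic: True if the response looks like a Cloudflare error page."""
    # Marker taken from the link shown in the question's output.
    marker = "cloudflare.com/5xx-error-landing"
    # Assumption: blocked requests often come back as 403 or 503.
    return status_code in (403, 503) or marker in html
```

Inside the function above it could be called as `looks_like_cloudflare_block(r.status_code, r.text)` before handing the content to BeautifulSoup.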
