
This is as far as I have gotten! I am trying to fetch the proxies from this page:

import urllib.request

page = urllib.request.urlopen("http://www.samair.ru/proxy/ip-address-01.htm")

page('\d+\.\d+\.\d+\.\d+')

1 Answer


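In this case the table is not a real HTML table; it is just plain text wrapped in <pre></pre>. You can verify this by viewing the page source. Either way, with BeautifulSoup this is a walk in the park: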

In [1]: from bs4 import BeautifulSoup

In [2]: from urllib.request import urlopen

In [3]: bs = BeautifulSoup(urlopen('http://www.samair.ru/proxy/ip-address-01.htm'), 'html.parser')

In [4]: print(bs.find('pre').text)

IP address               Anonymity level   Checked time        Country
056.249.66.50:8080       transparent       Apr-21, 10:33       Bulgaria
1.63.18.22:8080          transparent       Apr-21, 05:56       China
1.9.75.8:8080            transparent       Apr-21, 12:58       Malaysia
103.247.219.165:8080     transparent       Apr-21, 04:01       Indonesia
103.4.165.190:80         transparent       Apr-21, 11:34       Indonesia
103.9.126.110:8080       transparent       Apr-21, 12:19       Indonesia
109.173.98.64:8080       transparent       Apr-20, 22:39       Russian Federation
109.197.194.142:8080     transparent       Apr-21, 12:07       Russian Federation
109.207.61.141:8090      transparent       Apr-21, 11:14       Poland
109.207.61.145:8090      transparent       Apr-21, 13:04       Poland
109.207.61.149:8090      transparent       Apr-21, 10:21       Poland
109.207.61.165:8090      transparent       Apr-21, 03:57       Poland
109.207.61.170:8090      transparent       Apr-21, 11:02       Poland
109.207.61.208:8090      transparent       Apr-21, 10:45       Poland
109.224.55.46:80         transparent       Apr-20, 21:50       Iraq
109.227.124.105:8080     transparent       Apr-21, 09:57       Ukraine
109.69.6.118:8080        transparent       Apr-21, 11:44       Albania
110.138.248.135:8080     transparent       Apr-21, 09:10       Indonesia
110.139.13.121:8080      transparent       Apr-21, 11:31       Indonesia
110.159.179.108:80       transparent       Apr-20, 20:35       Malaysia

In [5]: [l.split()[0] for l in bs.find('pre').text.split('\n')[1:]][1:]
Out[5]: 
['056.249.66.50:8080',
 '1.63.18.22:8080',
 '1.9.75.8:8080',
 '103.247.219.165:8080',
 '103.4.165.190:80',
 '103.9.126.110:8080',
 '109.173.98.64:8080',
 '109.197.194.142:8080',
 '109.207.61.141:8090',
 '109.207.61.145:8090',
 '109.207.61.149:8090',
 '109.207.61.165:8090',
 '109.207.61.170:8090',
 '109.207.61.208:8090',
 '109.224.55.46:80',
 '109.227.124.105:8080',
 '109.69.6.118:8080',
 '110.138.248.135:8080',
 '110.139.13.121:8080',
 '110.159.179.108:80']
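
If you would rather stay with the regular expression idea from the question, that can work too, as long as the pattern is applied to the page text (a string) rather than called on the response object. Below is a minimal sketch using only the standard library. The names `PROXY_RE` and `extract_proxies` are made up for this example, and the sample text is copied from the listing above, so the live page will return different data.

```python
import re

# Matches "a.b.c.d:port". This is a loose pattern: it does not check
# that each octet is <= 255 or that the port is in a valid range.
PROXY_RE = re.compile(r"\b(\d{1,3}(?:\.\d{1,3}){3}):(\d{1,5})\b")


def extract_proxies(text):
    """Return a list of (ip, port) tuples for every ip:port pair in text."""
    return [(ip, int(port)) for ip, port in PROXY_RE.findall(text)]


# Sample rows copied from the output above (hypothetical stand-in for the
# real page content, which changes over time).
sample = """
IP address               Anonymity level   Checked time        Country
1.63.18.22:8080          transparent       Apr-21, 05:56       China
103.4.165.190:80         transparent       Apr-21, 11:34       Indonesia
"""

proxies = extract_proxies(sample)
print(proxies)
```

To run it against the real page, you would pass in the decoded response body, for example `extract_proxies(urlopen(url).read().decode('utf-8', errors='replace'))`. The UTF-8 encoding there is an assumption; check the page's actual encoding if the text comes out garbled.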
Answered 2013-04-21T14:12:31.670