python - Scraping domain names with BeautifulSoup

Question

I'm trying to use BeautifulSoup to scrape the list of domains from sameip.org. My code is as follows:

import urllib, urllib2, cookielib, re, io, sys
from bs4 import BeautifulSoup

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

resp = opener.open('http://sameip.org/ip/141.101.125.122').read()
soup = BeautifulSoup(resp)
for tr in soup.find_all('tr'):
    tds = tr.find_all('td')
    for x in tds:
        print x

It works, but it grabs more data than I need. I only want the domain names, for example:

tcjayfund.org
fjminc.com
amandabillyrock.com
fjmclinics.com

How can I do that?

score 1 · Accepted Answer

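Looking at what your code prints, it's clear that the first row is a header row and that, in every row after it, the second column is the domain. So: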

# Skip the header row, then print the second cell of each remaining row.
for tr in soup.find_all('tr')[1:]:
    tds = tr.find_all('td')
    print tds[1].text.strip()

Or, if you want to collect them all into a list instead of printing them:

domains = [tr.find_all('td')[1].text.strip() for tr in soup.find_all('tr')[1:]]
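The snippets in this thread are Python 2 and depend on the live page. As a rough, self-contained sketch of the same idea in Python 3, the following runs against an inline table. The `SAMPLE_HTML` string and the `extract_domains` helper are made up for illustration and only mimic the layout described above (a header row, with the domain in the second column). It also skips rows with fewer than two cells, so an odd row does not raise `IndexError`.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the fetched page: a header row, then one row per
# domain with the domain name in the second column. The last row is
# deliberately short to show the length check.
SAMPLE_HTML = """
<table>
  <tr><td>#</td><td>Domain</td></tr>
  <tr><td>1</td><td> tcjayfund.org </td></tr>
  <tr><td>2</td><td>fjminc.com</td></tr>
  <tr><td>3</td></tr>
</table>
"""


def extract_domains(html):
    """Return the stripped text of the second cell of every non-header row."""
    soup = BeautifulSoup(html, "html.parser")
    domains = []
    for tr in soup.find_all("tr")[1:]:  # [1:] skips the header row
        tds = tr.find_all("td")
        if len(tds) < 2:  # ignore rows that have no second column
            continue
        domains.append(tds[1].get_text(strip=True))
    return domains


print(extract_domains(SAMPLE_HTML))  # ['tcjayfund.org', 'fjminc.com']
```

Passing `"html.parser"` explicitly selects the parser bundled with Python, which avoids the warning BeautifulSoup emits when no parser is named.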

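On a better-designed site you would want to use IDs and structural relationships rather than fixed indexes like this, but when all they give you is a bare, dumb table like this one, there's really no way around it.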
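For contrast, here is roughly what selecting by structure could look like if a page did label its markup. This is only a sketch in Python 3: the `id="domains"` table, the `class="domain"` cells, and the `domains_by_selector` helper are all invented for the example and are not taken from sameip.org.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for a "better-designed" page: the table has an id and
# each domain cell carries a class, so no positional index is needed.
LABELED_HTML = """
<table id="domains">
  <tr><th>#</th><th>Domain</th></tr>
  <tr><td class="rank">1</td><td class="domain">fjmclinics.com</td></tr>
  <tr><td class="rank">2</td><td class="domain">amandabillyrock.com</td></tr>
</table>
"""


def domains_by_selector(html):
    """Collect domain cells with a CSS selector instead of a fixed column index."""
    soup = BeautifulSoup(html, "html.parser")
    cells = soup.select("table#domains td.domain")
    return [td.get_text(strip=True) for td in cells]


print(domains_by_selector(LABELED_HTML))  # ['fjmclinics.com', 'amandabillyrock.com']
```

A selector like this keeps working if columns are reordered or a new column is added, which is the advantage over `tds[1]`.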
