python - Extracting names and phone numbers from a web page using Python

Question

Here is what I want to do. On this website:

http://www.yellowpages.com/memphis-tn/gift-shops

I want to extract the shop names and their associated phone numbers into a CSV. For example, the first entry should be:

Babcock Gifts, (901) 763-0700

etc.

I'm using Python. After calling urllib2.urlopen(), I get the entire page source. How do I process this text to achieve my goal?

score 1 · Accepted Answer

I'd suggest using regular expressions and keying on something unique in the line.

For example:

<a href="http://www.yellowpages.com/memphis-tn/mip/babcock-gifts-14131113?lid=187490699" class="url " data-analytics="{&quot;click_id&quot;:1600,&quot;rank&quot;:1,&quot;act&quot;:1,&quot;FL&quot;:&quot;list&quot;,&quot;position&quot;:0}" title="Babcock Gifts">Babcock Gifts</a>

You would use something like:

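import re

# page is the file-like object returned by urllib2.urlopen()
re_name = re.compile(r'<a href=.*class="url ?".*')  # lines holding a listing link
re_front = re.compile(r'^.*title="')                # everything up to and including title="
re_back = re.compile(r'".*')                        # the closing quote and everything after it
for line in page:
    if re_name.search(line):
        out = re_front.sub('', line)
        out = re_back.sub('', out).strip()  # apply to out, not line
        print(out)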
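
As a self-contained variation on the same idea, a single pattern with a capture group can pull the title out in one step. This is only a sketch: `extract_name` and `NAME_RE` are hypothetical names, the sample line is a shortened form of the anchor tag quoted above, and it assumes every listing link carries `class="url "` followed later by a `title` attribute inside the same tag.

```python
import re

# Assumption: listing links look like <a href=... class="url " ... title="NAME">.
# The optional space in "url ?" tolerates the trailing space seen in the page source.
NAME_RE = re.compile(r'<a href=[^>]*class="url ?"[^>]*title="([^"]*)"')

def extract_name(line):
    """Return the title attribute of a listing link, or None if the line has none."""
    match = NAME_RE.search(line)
    return match.group(1) if match else None

# Shortened sample of the anchor tag quoted above (data-analytics attribute omitted).
sample = ('<a href="http://www.yellowpages.com/memphis-tn/mip/'
          'babcock-gifts-14131113?lid=187490699" class="url " '
          'title="Babcock Gifts">Babcock Gifts</a>')

print(extract_name(sample))
```

Using `[^>]*` instead of `.*` keeps the match inside one tag, which matters if several links end up on the same line.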

score 1 · Accepted Answer

I tried BeautifulSoup:

import urllib
import re
from BeautifulSoup import BeautifulSoup

url = 'http://www.yellowpages.com/memphis-tn/gift-shops'

u = urllib.urlopen(url)
soup = BeautifulSoup(u)

# each listing sits in a <div class="info">
test = soup.findAll('div', {'class': "info"})

for each in test:
    aref = each.findAll('a', {'class': "url "})
    phone = each.findAll('span', {'class': "business-phone phone"})
    # str(phone) still contains the tags, so keep only the digits
    x = re.sub(r'[^0-9]', "", str(phone))
    print(aref[0]['title'] + " - " + x)

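I derived this script by looking at the HTML source of the page. I found the "div" sections that contain the listings. Each business is then listed in an "a" tag, which I pick up in "aref".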

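Oddly, when I picked up "phone", the text contained the whole string, including the tags. I'm not sure why. So I used a regular expression to strip everything except the digits, which make up the phone number.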
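
The tags most likely show up because findAll returns a list of tag objects, and str() of that list renders the markup. Either way, the digits-only result can be turned back into the `(901) 763-0700` form the question asks for and written out as CSV. The following is a rough sketch rather than part of the script above: `format_phone` and `rows_to_csv` are hypothetical helpers, the sample span markup is assumed from the class name used in the script, ten-digit US numbers are assumed, and it targets Python 3.

```python
import csv
import io
import re

def format_phone(raw):
    """Strip everything but digits, then format a 10-digit US number."""
    digits = re.sub(r'[^0-9]', '', raw)
    if len(digits) != 10:
        return digits  # unexpected length: return the bare digits unchanged
    return '(%s) %s-%s' % (digits[:3], digits[3:6], digits[6:])

def rows_to_csv(rows):
    """Serialize (name, phone) pairs to CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator='\n')
    writer.writerows(rows)
    return buf.getvalue()

# Assumed shape of str(phone) for one listing: a list repr that still has the tags.
raw_phone = '[<span class="business-phone phone">(901) 763-0700</span>]'
rows = [('Babcock Gifts', format_phone(raw_phone))]
print(rows_to_csv(rows))
```

Stripping to digits only works as long as the markup itself contains no digits; if it did, reading the text of the span directly would be the safer route.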

Here is the BeautifulSoup documentation: http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html
