python - Problem removing HTML tags when using Beautiful Soup

Question

I am using Beautiful Soup to scrape some data from a website, but when I print the data I cannot get rid of the HTML tags. The code, for reference:

import csv
import urllib2
import sys  
from bs4 import BeautifulSoup

page = urllib2.urlopen('http://www.att.com/shop/wireless/devices/smartphones.html').read()
soup = BeautifulSoup(page)
soup.prettify()
for anchor1 in soup.findAll('div', {"class": "listGrid-price"}):
    print anchor1
for anchor2 in soup.findAll('div', {"class": "gridPrice"}):
    print anchor2
for anchor3 in soup.findAll('div', {"class": "gridMultiDevicePrice"}):
    print anchor3

The output I get looks like this:

<div class="listGrid-price"> 
                                $99.99 
            </div>
<div class="listGrid-price"> 
                                $0.01 
            </div>
<div class="listGrid-price"> 
                                $0.01 
            </div>

I only want the prices in the output, without any HTML tags. Please forgive my ignorance, as I am new to programming.

score 0 · Accepted Answer

You are printing the tags that were found. To print only the text they contain, use the .string attribute:

print anchor1.string

The .string value is a NavigableString instance; to use it like a regular unicode object, convert it first. You can then use strip() to remove the extra whitespace:

print unicode(anchor1.string).strip()
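
For readers on Python 3, where the unicode built-in no longer exists, the same conversion can be sketched with str(). This is only an illustration: the HTML string is an inline stand-in modeled on the output shown in the question (nothing is fetched from the site), and html.parser is picked here simply to avoid an extra parser dependency.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Inline sample modeled on the question's output (not fetched from the site).
html = '<div class="listGrid-price">\n        $99.99\n    </div>'
soup = BeautifulSoup(html, "html.parser")

div = soup.find("div", {"class": "listGrid-price"})
raw = div.string            # a NavigableString, surrounding whitespace included
price = str(raw).strip()    # plain str with the surrounding whitespace removed

print(isinstance(raw, NavigableString))
print(price)
```

The conversion to a plain string also drops the reference that a NavigableString keeps to the parse tree, which matters if the values are stored after the soup object is no longer needed.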

Adjusted slightly to allow for empty values:

for anchor1 in soup.findAll('div', {"class": "listGrid-price"}):
    if anchor1.string:
        print unicode(anchor1.string).strip()

This gives me:

$99.99
$0.99
$0.99
$299.99
$199.99
$49.99
$49.99
$99.99
$0.99
$99.99
$0.01
$0.01
$0.01
$0.01
$0.01
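
One caveat with .string is that it is None whenever a tag has more than one child, for example when part of the price is wrapped in a nested element. A possible alternative, sketched here for Python 3, is get_text(strip=True), which joins all text inside the tag. The sample markup and the extract_prices helper are hypothetical and exist only for this illustration; the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the second div wraps part of its price in a <span>,
# so it has several children and its .string is None. The third is empty.
html = """
<div class="listGrid-price"> $99.99 </div>
<div class="listGrid-price"> <span>$0</span>.01 </div>
<div class="listGrid-price"></div>
"""
soup = BeautifulSoup(html, "html.parser")


def extract_prices(soup):
    """Return the stripped text of every non-empty listGrid-price div."""
    prices = []
    for div in soup.find_all("div", class_="listGrid-price"):
        text = div.get_text(strip=True)  # strips each text piece, then joins them
        if text:                         # skip divs that contain no text
            prices.append(text)
    return prices


prices = extract_prices(soup)
for price in prices:
    print(price)
```

Because get_text() always returns a plain string, no separate conversion step is needed, and the empty-value check reduces to testing whether the result is non-empty.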
