python - Filtering BeautifulSoup results

Question

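I am trying to get a list of colleges and their websites from another web page.

I have gotten the output to show the HTML for each row I want, but I am trying to format the text further. I only want to display the college name and the link to that college. Any ideas?

Here is my code: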

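import urllib2  # Python 2 standard library
from bs4 import BeautifulSoup  # assumes BeautifulSoup 4; the original import was not shown

url = "http://www.arizona.edu/colleges"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
# Each match is the <span class="field-content"> that wraps one college link
universities = soup.findAll('span', {'class' : 'field-content'})
for eachuniversity in universities:
    print eachuniversity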

Here is an example of what I get for eachuniversity:

<div class="views-field-title">
  <span class="field-content">
    <a href="/colleges/college-agriculture-life-sciences">
    <h3>College of Agriculture &amp; Life Sciences</h3>
    </a>
  </span>
</div>

score 4 · Accepted Answer

The following will give you what you need. The information for doing this is readily available in the BeautifulSoup documentation (version 4 docs).

for uni in universities:
    link = uni.find("a")
    college_name = link.text
    web_page = link["href"]

For the first college (your example):

>>> print web_page
/colleges/college-agriculture-life-sciences
>>> print college_name
College of Agriculture &amp; Life Sciences

I will leave handling relative/absolute links and special HTML characters as an exercise for you.
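For readers who want a starting point for that exercise, here is a minimal sketch using only the standard library. It is written for Python 3 (the snippets above are Python 2), and the variable names base_url, absolute_url and clean_name are illustrative assumptions, not part of the original answer. The input values are copied from the example output above.

```python
# Python 3 sketch; the original snippets use Python 2 (urllib2, print statement).
from html import unescape
from urllib.parse import urljoin

# Values taken from the example in this thread.
base_url = "http://www.arizona.edu/colleges"
web_page = "/colleges/college-agriculture-life-sciences"
college_name = "College of Agriculture &amp; Life Sciences"

# urljoin resolves a relative href against the URL of the page it came from.
absolute_url = urljoin(base_url, web_page)

# unescape turns HTML entities such as &amp; back into plain characters;
# strip() drops any whitespace surrounding the text inside the tag.
clean_name = unescape(college_name).strip()

print(absolute_url)
print(clean_name)
```

Depending on the BeautifulSoup version and parser, entities may already be decoded by the time you read the text, in which case the unescape step is harmless but unnecessary.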

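How this works

From your recent questions, I gather that you are having trouble extracting tags from the uni object. Your universities variable is a collection of Tag objects, each of which is a dictionary-like object that can be used to access its children. Try reading "Navigating the Parse Tree" to get a better understanding of how to parse with BeautifulSoup.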
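To make the Tag behavior concrete, here is a small self-contained sketch that parses the HTML fragment from the question instead of fetching the live page. It assumes BeautifulSoup 4 (the bs4 package) with the built-in "html.parser" backend and Python 3; the names html, href and name are chosen for illustration only.

```python
from bs4 import BeautifulSoup

# The fragment shown in the question, inlined so no network access is needed.
html = """
<div class="views-field-title">
  <span class="field-content">
    <a href="/colleges/college-agriculture-life-sciences">
    <h3>College of Agriculture &amp; Life Sciences</h3>
    </a>
  </span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
uni = soup.find("span", {"class": "field-content"})  # a single Tag object

link = uni.find("a")               # first <a> descendant of the span
href = link["href"]                # attributes are read like dictionary keys
name = link.get_text(strip=True)   # text of the tag and its children, trimmed

print(href)
print(name)
```

Using get_text(strip=True) rather than .text avoids the newlines and indentation that surround the nested h3 element.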
