python - *Update: How to parse HTML with Python/BeautifulSoup

Question

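First off, I'm very new to Python. I'm trying to scrape contact information from offline websites and write it to a CSV. I'd like to grab the page URL (not sure how to do this from the HTML), email, phone, location data if possible, any names, any phone numbers, and the tagline of the HTML site if one exists.

Update #2 code: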

import os, csv, re
from bs4 import BeautifulSoup

topdir = 'C:\\projects\\training\\html'
output = csv.writer(open("scrape.csv", "wb+"))
output.writerow(["headline", "name", "email", "phone", "location", "url"])
all_contacts = []

for root, dirs, files in os.walk(topdir):
    for f in files:
        if f.lower().endswith((".html", ".htm")):
            soup = BeautifulSoup(f)

            def mailto_link(soup):          
            if soup.name != 'a':
                return None
            for key, value in soup.attrs:
                if key == 'href':
                    m = re.search('mailto:(.*)',value)
                if m:
                    all_contacts.append(m)
                return m.group(1)
            return None

            for ul in soup.findAll('ul'):
            contact = []
            for li in soup.findAll('li'):
                s = li.find('span')
                if not (s and s.string):
                    continue
                if s.string == 'Email:':
                    a = li.find(mailto_link)
                    if a:
                    contact['email'] = mailto_link(a)
                elif s.string == 'Website:':
                    a = li.find('a')
                    if a:
                    contact['website'] = a['href']
                elif s.string == 'Phone:':
                    contact['phone'] = unicode(s.nextSibling).strip()
            all_contacts.append(contact)
            output.writerow([all_contacts])

print "Finished"

This currently outputs nothing but the header row. What am I missing here? It should at least return some information from the HTML file, which is this page: http://bendoeslife.tumblr.com/about

score 1 · Accepted Answer

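There are (at least) two problems here.

First, f is the filename, not the file's contents, and not the soup made from those contents. So f.find('h2') looks for 'h2' within the filename, which isn't very useful.

Second, most find methods (including str.find, which is what you're calling) return an index, not a substring. Calling str on that index just gives you the string version of a number. For example: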

>>> s = 'A string with an h2 in it'
>>> i = s.find('h2')
>>> str(i)
'17'

So your code is doing something like this:

>>> f = 'C:\\python\\training\\offline\\somehtml.html'
>>> headline = f.find('h2')
>>> str(headline)
'-1'

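You probably want to call methods on the soup object rather than on f. BeautifulSoup.find returns a "subtree" of the soup, which is exactly what you want to stringify here.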
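To make the difference concrete, here is a minimal sketch, assuming Python 3 and the bs4 package (the question's code is Python 2). The sample markup, the file name somehtml.html, and the helper load_soup are made up for illustration. It shows that str.find on the filename gives an index, while BeautifulSoup.find on the parsed contents gives a tag. Inside os.walk, the path to open would be os.path.join(root, f), since f alone is only the bare name.

```python
import os
import tempfile

from bs4 import BeautifulSoup


def load_soup(path):
    """Read the file's contents and parse them; the path alone is just a string."""
    with open(path, encoding="utf-8") as fh:
        return BeautifulSoup(fh.read(), "html.parser")


with tempfile.TemporaryDirectory() as tmpdir:
    # Hypothetical sample file so the sketch is self-contained.
    path = os.path.join(tmpdir, "somehtml.html")
    with open(path, "w", encoding="utf-8") as fh:
        fh.write("<html><body><h2>About Ben</h2></body></html>")

    filename = os.path.basename(path)
    index = filename.find("h2")   # str.find: an index, or -1 if absent
    soup = load_soup(path)
    headline = soup.find("h2")    # BeautifulSoup.find: a Tag, or None

print(index)                # -1, because "h2" is not in the filename
print(headline)             # <h2>About Ben</h2>
print(headline.get_text())  # About Ben
```

If find returns None, calling get_text() on it raises AttributeError, so real code would check for None first.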

However, without your sample input there's no way to test this, so I can't promise it's the only problem in your code.

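Meanwhile, when you get stuck on something like this, try printing out the intermediate values. Print f, headline, and headline2, and it will be much more obvious why headline3 is wrong.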
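As a rough sketch of that debugging habit (Python 3 syntax assumed; the path is the one from the transcript above, and only f and headline are reproduced because the rest of the original code is not shown here), printing each value with its repr and type makes the mistake visible right away:

```python
# Path taken from the transcript above; it stands in for whatever os.walk yields.
f = 'C:\\python\\training\\offline\\somehtml.html'
headline = f.find('h2')

# repr() shows quotes and escapes, and the type shows int where a tag was expected.
for name, value in [("f", f), ("headline", headline)]:
    print(f"{name} = {value!r} ({type(value).__name__})")
```

Seeing headline reported as an int rather than a tag points straight at the str.find call.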

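Just replacing f with soup in the find calls and fixing the indentation errors is enough to make it run against your sample file, http://bendoeslife.tumblr.com/about.

It doesn't do anything useful, however. Since there is no h2 tag anywhere in the file, headline ends up as None, and the same goes for most of the other fields. The only one that finds anything is url, because you're asking it to find an empty string, which matches something arbitrary. With three different parsers, I get about, <html><body>about</body></html>, and <html><body></body></html> …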

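You need to actually understand the structure of the file you're trying to parse before you can do anything useful with it. In this case, for example, there is an email address, but it's in an <a> element with the title "Email", inside an <li> element whose id is "email". So you'll need to write a find that locates it based on one of those criteria, or on something else it actually matches.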
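One possible way to write such a find is sketched below, assuming Python 3 and bs4. The SAMPLE markup and the address ben@example.com are invented to match the description above (an <a> with title "Email" inside an <li> with id "email"), and the real page may well differ. find_email is a hypothetical helper, not part of BeautifulSoup.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical markup modeled on the description above; not copied from the real page.
SAMPLE = """
<ul>
  <li id="email"><a href="mailto:ben@example.com" title="Email">Email</a></li>
  <li id="twitter"><a href="http://twitter.com/example" title="Twitter">Twitter</a></li>
</ul>
"""


def find_email(soup):
    """Return the address from the mailto: link inside <li id="email">, or None."""
    li = soup.find("li", id="email")
    if li is None:
        return None
    a = li.find("a", title="Email")
    if a is None:
        return None
    href = a.get("href", "")
    prefix = "mailto:"
    if not href.startswith(prefix):
        return None
    return href[len(prefix):]


soup = BeautifulSoup(SAMPLE, "html.parser")
email = find_email(soup)

# Write one flat row per page; an in-memory buffer stands in for scrape.csv.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["email"])
writer.writerow([email or ""])
print(buffer.getvalue())
```

Returning None at every step where something is missing keeps a page without an email from raising, so the CSV row simply gets an empty cell.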
