python - lxml incorrectly parses the Doctype while looking for links

Question

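I have a BeautifulSoup4 (4.2.1) parser that collects all href attributes from our template files, and until now it has worked perfectly. But with lxml installed, one of our guys is now getting the following error: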

TypeError: string indices must be integers.

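I managed to reproduce it on my Linux Mint VM, and the only difference appears to be lxml, so I assume the problem occurs when bs4 uses that HTML parser.
 
The problem function is: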

def collecttemplateurls(templatedir, urlslist):
    """
    Uses BeautifulSoup to extract all the external URLs from the templates dir.

    @return: list of URLs
    """
    for (dirpath, dirs, files) in os.walk(templatedir):
        for path in (Path(dirpath, f) for f in files):
            if path.endswith(".html"):
                for link in BeautifulSoup(
                        open(path).read(),
                        parse_only=SoupStrainer(target="_blank")
                ):
                    if link["href"].startswith('http://'):
                        urlslist.append(link['href'])

                    elif link["href"].startswith('{{'):
                        for l in re.findall("'(http://(?:.*?))'", link["href"]):
                            urlslist.append(l)

    return urlslist

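So for this guy, the line if link["href"].startswith('http://'): raises the TypeError, because BS4 thinks the HTML Doctype is a link.

Can anyone explain what the problem might be here, given that nobody else can recreate it?

I can't see how this could happen when using SoupStrainer like this. I assumed it was somehow related to a system setup issue.

I can't see anything particularly special about our Doctype.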

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-gb">

<head>

score 3 · Accepted Answer

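SoupStrainer will not filter out the document type. It decides which elements remain in the document, but the doctype is retained because it is part of the "container" for the filtered elements. You are looping over all the top-level nodes in the document, so the first one you encounter is the Doctype object.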
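To make the error message itself concrete: bs4's Doctype is a string subclass, so link["href"] on it is a string indexed with a string key. The sketch below reproduces that with a hypothetical stand-in class (FakeDoctype and href_or_error are illustrative names, not part of bs4), so it runs without any third-party package.

```python
class FakeDoctype(str):
    """Hypothetical stand-in for bs4's Doctype, which is likewise a str subclass."""


def href_or_error(node):
    """Return node["href"], or the TypeError text if the node is string-like."""
    try:
        return node["href"]
    except TypeError as exc:
        return "TypeError: %s" % exc


# A dict plays the role of a Tag here: it supports lookup by attribute name.
tag_like = {"href": "http://example.com/"}
doctype_like = FakeDoctype('html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"')

print(href_or_error(tag_like))
print(href_or_error(doctype_like))
```

The second call reports the same "string indices must be integers" failure seen in the question, which is why the loop has to skip or avoid the doctype node.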

Use .find_all() on the "strained" document:

document = BeautifulSoup(open(path).read(), parse_only=SoupStrainer(target="_blank"))
for link in document.find_all(target="_blank"):

Or filter out the Doctype object:

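from bs4 import Doctype  # the class is spelled Doctype, not DocType

for link in BeautifulSoup(
        open(path).read(),
        parse_only=SoupStrainer(target="_blank")
):
    # Skip the doctype node; it is a string, not a tag with attributes.
    if isinstance(link, Doctype):
        continue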
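As a variation on the two fixes above (not taken from the original answer), the loop can keep only Tag instances, which skips the doctype as well as any other non-element node. This is a minimal sketch: extract_blank_links and SAMPLE are hypothetical names, and the "html.parser" default is an assumption made so the example runs without lxml. Pass "lxml" instead if it is installed.

```python
from bs4 import BeautifulSoup, SoupStrainer, Tag


def extract_blank_links(markup, parser="html.parser"):
    """Return the href of every element with target="_blank" in markup."""
    soup = BeautifulSoup(
        markup,
        parser,
        parse_only=SoupStrainer(target="_blank"),
    )
    hrefs = []
    for node in soup:
        # Doctype objects, comments and bare strings are not Tag instances.
        if not isinstance(node, Tag):
            continue
        href = node.get("href")  # None when the attribute is missing
        if href:
            hrefs.append(href)
    return hrefs


# Hypothetical template content, modelled on the Doctype from the question.
SAMPLE = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-gb">
<head><title>t</title></head>
<body>
<a href="http://example.com/a" target="_blank">A</a>
<a href="/local">local</a>
<p><a href="http://example.com/b" target="_blank">B</a></p>
</body></html>"""

print(extract_blank_links(SAMPLE))
```

Using node.get("href") rather than node["href"] also avoids a KeyError on a matching element that has no href attribute.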
