python - Trying to collect data from local files using BeautifulSoup

Question

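I want to run a Python script that parses HTML files and collects a list of all the links that have a target="_blank" attribute.

I've tried the following, but it isn't getting anything back from bs4. The docs say SoupStrainer takes its arguments in the same way as findAll etc., so shouldn't this work? Am I missing some stupid mistake?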

import os
import sys

from bs4 import BeautifulSoup, SoupStrainer
from unipath import Path

def main():

    ROOT = Path(os.path.realpath(__file__)).ancestor(3)
    src = ROOT.child("src")
    templatedir = src.child("templates")

    for (dirpath, dirs, files) in os.walk(templatedir):
        for path in (Path(dirpath, f) for f in files):
            if path.endswith(".html"):
                for link in BeautifulSoup(path, parse_only=SoupStrainer(target="_blank")):
                    print link

if __name__ == "__main__":
    sys.exit(main())

score 2 · Accepted Answer

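Your use of BeautifulSoup is fine, but you need to pass in the HTML itself rather than the path to the HTML file. BeautifulSoup treats a string argument as markup, not as a file path: it will not open the file and read its contents for you, so you have to do that yourself. If you pass in a.html, the soup will be <html><body><p>a.html</p></body></html>. That is the path text wrapped in tags, not the content of the file, so there are certainly no links in it. You should use BeautifulSoup(open(path).read(), ...).

Edit:
It also accepts an open file object, so BeautifulSoup(open(path), ...) is enough.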
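To make the difference concrete, here is a minimal sketch that parses the same file twice: once by passing the path string and once by passing the open file. It is written for Python 3 and assumes the built-in "html.parser" backend; the sample markup and the temporary file are made up for the demonstration and are not from the question.

```python
import os
import tempfile
import warnings

from bs4 import BeautifulSoup, SoupStrainer

only_blank = SoupStrainer(target="_blank")

# Hypothetical sample page written to a temporary file for the demo.
sample = (
    '<html><body>'
    '<a href="http://example.com" target="_blank">external</a>'
    '</body></html>'
)
with tempfile.NamedTemporaryFile("w", suffix=".html", delete=False) as tmp:
    tmp.write(sample)

# Passing the path string: the path text itself is parsed as markup.
with warnings.catch_warnings():
    # bs4 may warn that the markup looks like a filename; silence it here.
    warnings.simplefilter("ignore")
    from_path = BeautifulSoup(tmp.name, "html.parser", parse_only=only_blank)

# Passing the open file: the file's contents are parsed.
with open(tmp.name) as fh:
    from_file = BeautifulSoup(fh, "html.parser", parse_only=only_blank)

os.remove(tmp.name)

path_links = from_path.find_all(target="_blank")
file_links = [tag["href"] for tag in from_file.find_all(target="_blank")]
print(len(path_links), file_links)
```

The first parse finds nothing because there are no tags in a path string, while the second finds the link that is actually in the file.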

score 2 · Accepted Answer

I think you need something like this:

if path.endswith(".html"):
    # Open the HTML file itself (path), not the directory that contains it (dirpath)
    htmlfile = open(path)
    for link in BeautifulSoup(htmlfile, parse_only=SoupStrainer(target="_blank")):
        print link
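For completeness, the whole directory walk could look roughly like the sketch below. It assumes Python 3, uses os.path from the standard library in place of unipath, and names the "html.parser" backend explicitly; the function name collect_blank_links and the demo template are hypothetical.

```python
import os
import tempfile

from bs4 import BeautifulSoup, SoupStrainer


def collect_blank_links(template_dir):
    """Yield (file path, tag) for every target="_blank" tag under template_dir."""
    only_blank = SoupStrainer(target="_blank")
    for dirpath, _dirs, files in os.walk(template_dir):
        for name in files:
            if not name.endswith(".html"):
                continue
            path = os.path.join(dirpath, name)
            # Hand BeautifulSoup the open file so it parses the contents.
            with open(path) as fh:
                soup = BeautifulSoup(fh, "html.parser", parse_only=only_blank)
            for tag in soup.find_all(target="_blank"):
                yield path, tag


# Demo on a throwaway directory holding one hypothetical template.
with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "page.html"), "w") as fh:
        fh.write('<a href="/a">a</a><a href="/b" target="_blank">b</a>')
    found = [
        (os.path.basename(path), tag["href"])
        for path, tag in collect_blank_links(root)
    ]
print(found)
```

Using a with block closes each file as soon as it has been parsed, which matters when the template tree is large.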
