
I have a directory of downloaded HTML files (46 of them) and I'm trying to iterate over each one, read its contents, strip the HTML, and append just the text to a text file. However, I'm not sure where I've gone wrong, because nothing is being written to my text file.

import os
import glob
from bs4 import BeautifulSoup
path = "/"
for infile in glob.glob(os.path.join(path, "*.html")):
        markup = (path)
        soup = BeautifulSoup(markup)
        with open("example.txt", "a") as myfile:
                myfile.write(soup)
                f.close()

-----update---- I've updated my code as follows, but the text file still isn't being created.

import os
import glob
from bs4 import BeautifulSoup
path = "/"
for infile in glob.glob(os.path.join(path, "*.html")):
    markup = (infile)
    soup = BeautifulSoup(markup)
    with open("example.txt", "a") as myfile:
        myfile.write(soup)
        myfile.close()
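
A quick way to see whether this loop is doing anything at all is to check what glob is matching: with path = "/" it only finds HTML files sitting at the filesystem root, and if the list is empty the loop body never runs, so example.txt never gets created. A minimal diagnostic sketch:

import glob
import os

path = "/"
files = glob.glob(os.path.join(path, "*.html"))
print(len(files), "HTML files matched")  # if this prints 0, the loop body never runs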

------Update 2-----

Ah, I found that my directory wasn't correct, so now I have:

import os
import glob
from bs4 import BeautifulSoup

path = "c:\\users\\me\\downloads\\"

for infile in glob.glob(os.path.join(path, "*.html")):
    markup = (infile)
    soup = BeautifulSoup(markup)
    with open("example.txt", "a") as myfile:
        myfile.write(soup)
        myfile.close()

When I run this, I get this error:

Traceback (most recent call last):
  File "C:\Users\Me\Downloads\bsoup.py, line 11 in <module>
    myfile.write(soup)
TypeError: must be str, not BeautifulSoup

I fixed that last error by changing

myfile.write(soup)

to

myfile.write(soup.get_text())
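
As an aside, get_text() also accepts separator and strip arguments, which can help tidy the extracted text; a small self-contained sketch:

from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>  Hello </p><p>world</p>", "html.parser")
print(soup.get_text(separator=" ", strip=True))  # -> "Hello world"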

-----Update 3----

It's working fine now; here is the working code:

import os
import glob
from bs4 import BeautifulSoup

path = "c:\\users\\me\\downloads\\"

for infile in glob.glob(os.path.join(path, "*.html")):
    markup = (infile)
    soup = BeautifulSoup(open(markup, "r").read())
    with open("example.txt", "a") as myfile:
        myfile.write(soup.get_text())
        myfile.close()
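
For completeness, here is a slightly tidier version of the same loop, as a sketch only: it reads each input file inside a with block so the handle gets closed, names a parser explicitly ("html.parser" is in the standard library), and drops the myfile.close() call, which the with statement already handles.

import os
import glob
from bs4 import BeautifulSoup

path = "c:\\users\\me\\downloads\\"

for infile in glob.glob(os.path.join(path, "*.html")):
    with open(infile, "r") as f:                        # input file is closed automatically
        soup = BeautifulSoup(f.read(), "html.parser")   # explicit parser avoids guessing
    with open("example.txt", "a") as myfile:
        myfile.write(soup.get_text())                   # the with block closes myfile on exit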

2 Answers


Actually you're not reading the html file. This should work:

soup = BeautifulSoup(open(webpage, 'r').read(), 'lxml')
Answered 2013-04-26T20:05:24.320
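
To illustrate the point of this answer: handing BeautifulSoup a filename string parses that string itself, not the file it names, so reading the file first is what matters. A small sketch (page.html is a placeholder name, and the 'lxml' parser is assumed to be installed):

from bs4 import BeautifulSoup

# Parsing the filename string just gives the string back as text:
print(BeautifulSoup("page.html", "lxml").get_text())   # -> page.html

# Reading the file first parses the actual markup:
with open("page.html", "r") as webpage:
    soup = BeautifulSoup(webpage.read(), "lxml")
    print(soup.get_text())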

If you want to use lxml.html directly, here's a modified version of some code I've used in a project. If you want to get all of the text, just don't filter by tag. There may be a way to do it without iterating, but I don't know of one. It saves the data as unicode, so you'll have to account for that when opening the output file.

import os
import glob

import lxml.html

path = '/'

# Whatever tags you want to pull text from.
visible_text_tags = ['p', 'li', 'td', 'h1', 'h2', 'h3', 'h4',
                     'h5', 'h6', 'a', 'div', 'span']

for infile in glob.glob(os.path.join(path, "*.html")):
    doc = lxml.html.parse(infile)

    file_text = []

    for element in doc.iter(): # Iterate once through the entire document

        try:  # Grab tag name and text (+ tail text)   
            tag = element.tag
            text = element.text
            tail = element.tail
        except:
            continue

        words = None # text words split to list
        if tail: # combine text and tail
            text = text + " " + tail if text else tail
        if text: # lowercase and split to list
            words = text.lower().split()

        if tag in visible_text_tags:
            if words:
                file_text.append(' '.join(words))

    with open('example.txt', 'ab') as myfile:  # append in binary mode: we write UTF-8 encoded bytes
        myfile.write(' '.join(file_text).encode('utf8'))
Answered 2013-04-26T20:57:22.753
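
As a footnote to this answer: if the goal is all of the text rather than text from particular tags, lxml's text_content() collects it without iterating element by element. A brief sketch along those lines (same path and output file as above):

import glob
import os

import lxml.html

path = '/'

for infile in glob.glob(os.path.join(path, "*.html")):
    root = lxml.html.parse(infile).getroot()
    text = root.text_content()                     # all text in document order
    with open('example.txt', 'ab') as myfile:      # bytes, to match the snippet above
        myfile.write(' '.join(text.split()).encode('utf8'))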