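Is there a way to parse multiple HTML documents with Beautiful Soup?

I am modifying code I found online that fetches HTML .txt files from EDGAR and runs them through Beautiful Soup so that they can be downloaded as formatted text files. However, my code now only outputs one EDGAR document (it is supposed to output 5), and I don't know what is wrong.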

import csv
import requests
import re
from bs4 import BeautifulSoup 

with open('General Motors Co 11-15.csv', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    for line in reader:
        fn1 = line[0]
        fn2 = re.sub(r'[/\\]', '', line[1])
        fn3 = re.sub(r'[/\\]', '', line[2])
        fn4 = line[3]
        saveas = '-'.join([fn1, fn2, fn3, fn4])
        # Reorganize to rename the output filename.
        url = 'https://www.sec.gov/Archives/' + line[4].strip()
        bodytext=requests.get(url).text 
        parsedContent=BeautifulSoup(bodytext, 'html.parser')
        for script in parsedContent(["script", "style"]): 
            script.extract()
        text = parsedContent.get_text()
        lines = (line.strip() for line in text.splitlines())
        chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
        text = '\n'.join(chunk for chunk in chunks if chunk) 
        with open(saveas, 'wb') as f:
            f.write(requests.get('%s' % text).content)
            print(file, 'downloaded and wrote to text file')

Do you know what is wrong with my code?

1 Answer

My guess is that you are overwriting the existing document every time you write to the file. Try changing with open(saveas, 'wb') as f: to with open(saveas, 'ab') as f:

Opening a file in wb mode creates a new file with the name held in saveas, which in effect erases any existing file of that name.
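To see the difference between the two modes in isolation, here is a minimal sketch that writes two sample strings to the same path, once with 'wb' and once with 'ab'. The helper names save_text and read_text, the file names, and the sample strings are made up for illustration and are not from the question. The sketch also writes the extracted text directly instead of passing it to requests.get, which expects a URL; that is an assumption about what the question's code was meant to do.

```python
import os
import tempfile


def save_text(path, text, mode="ab"):
    """Write text to path as UTF-8 bytes using the given binary mode."""
    with open(path, mode) as f:
        f.write(text.encode("utf-8"))


def read_text(path):
    """Read the whole file back as a UTF-8 string."""
    with open(path, "rb") as f:
        return f.read().decode("utf-8")


# Hypothetical stand-ins for the cleaned text of two filings.
documents = ["first filing\n", "second filing\n"]

with tempfile.TemporaryDirectory() as tmp:
    overwritten = os.path.join(tmp, "overwritten.txt")
    appended = os.path.join(tmp, "appended.txt")

    for doc in documents:
        save_text(overwritten, doc, mode="wb")  # truncates on every open
        save_text(appended, doc, mode="ab")     # adds to the end of the file

    wb_result = read_text(overwritten)
    ab_result = read_text(appended)

print(repr(wb_result))  # only the last document survives
print(repr(ab_result))  # both documents are kept
```

This only matters when several rows of the CSV produce the same saveas name. Whether they do depends on the CSV contents, which the question does not show.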

answered 2019-08-05T15:23:22.547