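I'd like to know how to extract the text between all tags in multiple XML documents, get each filename, and then write this information to a CSV file.
This is what I have so far: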
import csv
import glob
from bs4 import BeautifulSoup

dataExtracted = []

for filename in glob.glob(r'*.xml'):
    with open(filename, 'r') as f_in:
        soup = BeautifulSoup(f_in.read(), 'lxml')
        print(filename)
        for i in soup.findAll(text=True):
            print(i)
            dataExtracted.append([filename, i.get_text()])

with open('data.csv', 'w', newline='') as csvfile:
    csv_writer = csv.writer(csvfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    for row in dataExtracted:
        csv_writer.writerow(row)
When I try to run it, I get this error:
AttributeError: 'NavigableString' object has no attribute 'get_text'
I tried adding this:
for i in soup.findAll(text=True):
    try:
        print(i)
        dataExtracted.append([filename, i.get_text(strip=True)])
    except NavigableString:
        pass
But now I get this error:
catching classes that do not inherit from BaseException is not allowed
So I think I'm not handling the error correctly.
Any ideas on how I should handle this?
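For context, `NavigableString` in Beautiful Soup is a node type derived from `str`, not an exception class, which would explain why it cannot be used in an `except` clause. Below is a minimal sketch of telling node types apart with `isinstance` instead. The sample markup and the `kinds` list are made up for illustration, and `html.parser` is used only so the snippet does not depend on lxml.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Hypothetical sample markup, not taken from the real XML files.
soup = BeautifulSoup("<a>hi<b>there</b></a>", "html.parser")

# NavigableString is a str subclass, so it is not valid after `except`.
is_exception = issubclass(NavigableString, BaseException)
is_string = issubclass(NavigableString, str)

kinds = []
for node in soup.a.children:
    if isinstance(node, NavigableString):
        # Plain text node: convert it with str() rather than calling get_text().
        kinds.append(("string", str(node)))
    elif isinstance(node, Tag):
        # Element node: get_text() is available here.
        kinds.append(("tag", node.get_text()))
```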
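One possible approach, sketched under assumptions rather than offered as a definitive fix: since every item returned by a text search is already a string node, it can be converted with `str()` and stripped, so no `try`/`except` is needed. The helper names `extract_rows` and `write_csv`, the `parser` parameter, and the UTF-8 encoding are assumptions introduced for this sketch.

```python
import csv
import glob

from bs4 import BeautifulSoup


def extract_rows(filename, markup, parser="lxml"):
    """Return [filename, text] rows for each non-blank text node in markup."""
    soup = BeautifulSoup(markup, parser)
    rows = []
    for node in soup.find_all(string=True):
        # Each node is a string subclass; str() yields its plain text.
        text = str(node).strip()
        if text:  # skip whitespace-only nodes between tags
            rows.append([filename, text])
    return rows


def write_csv(pattern="*.xml", out_path="data.csv", parser="lxml"):
    """Collect rows from every file matching pattern and write them as CSV."""
    rows = []
    for filename in glob.glob(pattern):
        # Assumption: the XML files are UTF-8 encoded.
        with open(filename, "r", encoding="utf-8") as f_in:
            rows.extend(extract_rows(filename, f_in.read(), parser))
    with open(out_path, "w", newline="", encoding="utf-8") as csvfile:
        csv.writer(csvfile).writerows(rows)
    return rows
```

Calling `write_csv()` in the directory holding the XML files would then produce `data.csv` with one row per text node. Comments and similar special nodes are also string nodes, so they may appear in the output unless they are filtered out.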