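I want to define a function that can be applied to every XML file in a directory, so that it parses the file and collects the content of selected tags into a DataFrame.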
import os
import pandas as pd
from bs4 import BeautifulSoup as bs
from xml.etree import ElementTree

def func(path, filename):
    for filename in os.listdir(path):
        with open(os.path.join(path, filename)) as file:
            # Read each line in the file, readlines() returns a list of lines
            content = file.readlines()
            # Combine the lines in the list into a string
            content = "".join(content)
            bs_content = bs(content, "lxml")
        # Collect every occurrence of each tag of interest
        headline = bs_content.find_all("headline")
        eventtitle = bs_content.find_all("eventtitle")
        city = bs_content.find_all("city")
        companyname = bs_content.find_all("companyname")
        companyticker = bs_content.find_all("companyticker")
        startdate = bs_content.find_all("startdate")
        eventstory = bs_content.find_all("eventstory")
        data = []
        # Build one row per companyname occurrence
        for i in range(0, len(companyname)):
            rows = [companyname[i].get_text(), headline[i].get_text(),
                    city[i].get_text(), eventtitle[i].get_text(),
                    companyticker[i].get_text(), startdate[i].get_text(),
                    eventstory[i].get_text()]
            data.append(rows)
        df = pd.DataFrame(data, columns=['companyname', 'headline',
                                         'city', 'eventtitle', 'companyticker',
                                         'startdate', 'eventstory'], dtype=float)
When I call the function, I get the error below. Unfortunately, none of the existing solutions I found have worked for me.
func('./Calls/', '1000015_T.xml')
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
Input In [58], in <module>
----> 1 func('./Calls/', '1000015_T.xml')
Input In [57], in func(path, filename)
7 for filename in os.listdir(path):
8 with open(os.path.join(path, filename)) as file:
9 # Read each line in the file, readlines() returns a list of lines
---> 10 content = file.readlines()
11 # Combine the lines in the list into a string
12 content = "".join(content)
File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/codecs.py:322, in BufferedIncrementalDecoder.decode(self, input, final)
319 def decode(self, input, final=False):
320 # decode input (taking the buffer into account)
321 data = self.buffer + input
--> 322 (result, consumed) = self._buffer_decode(data, self.errors, final)
323 # keep undecoded input until the next call
324 self.buffer = data[consumed:]
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 3131: invalid start byte
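The byte 0x80 is never valid as the first byte of a UTF-8 sequence, which suggests that at least one file in ./Calls/ is stored in another encoding. Windows-1252, where 0x80 is the euro sign, is a common candidate, but that is only a guess. Because open() is called without an encoding argument, Python falls back to the platform default, which is evidently UTF-8 here. A minimal sketch of one possible workaround follows: read the raw bytes and try a short list of encodings in order. The helper name read_text_with_fallback and the candidate list are assumptions made for illustration, not part of the code above.

```python
from pathlib import Path


def read_text_with_fallback(path, encodings=("utf-8", "cp1252")):
    """Decode a file by trying each candidate encoding in order.

    The candidate list is an assumption; adjust it to whatever
    encodings the files are actually known to use.
    """
    raw = Path(path).read_bytes()
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            # This candidate cannot represent the bytes; try the next one.
            continue
    # latin-1 maps every byte to a character, so this never raises,
    # although the resulting text may contain wrong characters.
    return raw.decode("latin-1")
```

Inside func, content = read_text_with_fallback(os.path.join(path, filename)) could replace the open(), readlines() and join steps. BeautifulSoup also accepts raw bytes and attempts its own encoding detection, so passing the result of file.read() from a file opened in 'rb' mode may be a simpler alternative.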
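Perhaps you could also help me optimize the code. My task is to extract the content of 2k XML files; so far, my plan is to define a function and then run it with pandarallel: parallel_apply(func).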
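As posted, func resets data for every file, overwrites its filename argument with the loop variable, and never returns df, so only the last file could ever end up in the DataFrame. A hedged sketch of one possible restructuring is shown here: a per-file function that returns a plain dict, and a driver that collects the dicts for the whole directory. It uses the standard-library ElementTree instead of BeautifulSoup, keeps only the first occurrence of each tag per file, and matches tag names case-insensitively. All three choices are simplifying assumptions, as are the names parse_file, parse_directory and workers. It also assumes the files are well-formed XML without namespaces.

```python
import os
from concurrent.futures import ProcessPoolExecutor
from xml.etree import ElementTree

# Tag names copied from the question; the real files may spell them
# with different capitalization, hence the lower-casing below.
TAGS = ("companyname", "headline", "city", "eventtitle",
        "companyticker", "startdate", "eventstory")


def parse_file(filepath):
    """Return one dict per file with the first text found for each tag."""
    # Passing a path lets the parser read bytes and honor the encoding
    # named in the XML declaration instead of assuming UTF-8.
    root = ElementTree.parse(filepath).getroot()
    row = {"file": os.path.basename(filepath)}
    for element in root.iter():
        name = element.tag.lower()
        if name in TAGS and name not in row:
            row[name] = (element.text or "").strip()
    return row


def parse_directory(path, workers=1):
    """Parse every .xml file in path; use worker processes if workers > 1."""
    paths = sorted(
        os.path.join(path, name)
        for name in os.listdir(path)
        if name.lower().endswith(".xml")
    )
    if workers > 1:
        # Each worker process parses a chunk of files independently.
        with ProcessPoolExecutor(max_workers=workers) as pool:
            return list(pool.map(parse_file, paths, chunksize=50))
    return [parse_file(p) for p in paths]
```

pd.DataFrame(parse_directory('./Calls/')) would then give one row per file, with NaN wherever a tag is missing. For about 2k files a single process may already be fast enough. If workers is raised above 1 from a notebook, parse_file probably needs to live in an importable module so that the worker processes can find it.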