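Some help would be much appreciated. I have been busy for more than two days, browsing around to understand why I can't access this XML file and put its contents into a df. My goal is to load the worksheets in the XML file into pandas DataFrames. I know there are several posts on this topic, but I seem to be running into errors that complicate things.

The data is downloaded from a well-known ETF provider. It downloads with an ".xls" extension but is actually XML; apparently Excel XML. So a simple pd.read_excel won't work. That is what pushed me into the XML format and libraries such as lxml and xml.etree.ElementTree. I have worked with BS4 for a while, though.

The XML download does not specify any encoding, and when I try to parse it, it returns an error. So I dabbled with chardet and et.XMLParser to discover the encoding and hard-set it in the parser. But to no avail. On parsing it returns:

'lxml.etree.XMLSyntaxError: Document is empty, line 1, column 1'

Instead of parsing the file directly (see xml_tree1 below), I tried reading the XML with fromstring, but I noticed some garbled characters at the start. So I replaced them with nothing: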

xml_str = xml_file.read().replace('', '')

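Now I have clean XML, but I still can't find any children in my root. In fact, the tree seems to be empty and not parsed at all. My knowledge fails me here. Could anybody push me in the right direction? My first problem is at an early stage: I can't seem to parse the file and its underlying format. The second problem is that I need to parse the ss:Table on the individual worksheets in the document. Further down in the code I have noted some examples for my own use. Any comments are very welcome.


The source of the XML can be found here (Dutch version). You can download it in the top right corner.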



<?xml version="1.0"?>
<ss:Workbook xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
<ss:Style ss:ID="Default">
<ss:Alignment ss:Horizontal="Left"/>
<ss:Style ss:ID="wraptext">
<ss:Alignment ss:Horizontal="Left" ss:WrapText="1"/>
<ss:Font ss:Italic="1"/>
<ss:Style ss:ID="disclaimer">
<ss:Alignment ss:Vertical="Top" ss:WrapText="1"/>
<ss:Style ss:ID="DefaultHyperlink">
<ss:Alignment ss:Vertical="Center" ss:WrapText="1"/>
<ss:Font ss:Color="#0000FF" ss:Underline="Single" />
<ss:Style ss:ID="headerstyle">
<ss:Font ss:Bold="1" />
<ss:Style ss:ID="Date">
<ss:NumberFormat ss:Format="dd\-mmm\-yyyy"/>
<ss:Style ss:ID="Left">
<ss:Alignment ss:Horizontal="Left"/>
<ss:NumberFormat ss:Format="Standard"/>
<ss:Style ss:ID="Right">
<ss:Alignment ss:Horizontal="Right"/>
<ss:NumberFormat ss:Format="Standard"/>
<ss:Worksheet ss:Name="Overzicht">
<ss:Row >
<ss:Cell ss:StyleID="headerstyle">
<ss:Data ss:Type="String">iShares Core MSCI World UCITS ETF</ss:Data>


import lxml.etree as et
from lxml import objectify
import io
import chardet

with open('C:\\MSCI.xml') as xml_file:
    parser = et.XMLParser(encoding="iso-8859-5", recover=True)
    xml_str = xml_file.read().replace('', '')  # !!! IShares xml has error in first row !!!

    xml_tree = et.ElementTree(et.fromstring(xml_str, parser=parser))
    root = xml_tree.getroot()
    xml_tree0 = et.iterparse(xml_file, encoding='iso-8859-1')  # Nothing
    xml_tree1 = et.parse(xml_file, parser=parser)  # File seems empty, but is not
    xml_tree2 = objectify.parse(io.StringIO(xml_str))  # This is the same as fromstring

    ### Trying to capture encoding and replace it ###
    detector = chardet.UniversalDetector()
    for line in xml_file.readlines():
        detector.feed(line)  # This doesn't seem to work
        if detector.done: break

    xml_enc = detector.result['encoding']  # The result seems always to be None
    if xml_enc != 'utf-8':
        # content = xml_str(xml_enc, 'replace').encode('utf-8')  # Don't know how to replace encoding
        pass  # placeholder so the commented-out body still leaves a valid block
    xml_clean = et.fromstring(xml_str, parser=parser)

    # The detector loop above and the encoding replacement do not work :(

    ### Some code below is how I'd guess to proceed, after I have a good tree ###

    # ns = {"ss": "urn:schemas-microsoft-com:office:spreadsheet"}
    # https://stackoverflow.com/questions/59945728/how-do-i-pick-up-text-values-of-child-nodes-when-parsing-xml-with-elementtree
    # https://stackoverflow.com/questions/54107550/reading-a-spreadsheet-like-xml-with-elementtree

    ### Something like this to iterate through the children
    # for appt in xml_tree.getchildren():
    #     for elem in appt.getchildren():
    #         if not elem.text:
    #             text = "None"
    #         else:
    #             text = elem.text
    #         print(elem.tag + " => " + text)

    ### Or something like this to iterate to take into account namespaces
    # for ws in xml.findall('ss:Worksheet', namespaces):
    #     for table in ws.findall('ss:Row', namespaces):
    #         for c in table.findall('ss:Cell', namespaces):
    #             data = c.find('ss:Data', namespaces)
    #             if data.text is None:
    #                 text = []
    #                 data = data.findall('html:Font', namespaces)
    #                 for element in data:
    #                     text.append(element.text)
    #                 data_text = ''.join(text)
    #                 print(data_text)
    #             else:
    #                 print(data.text)

    ### Or something like this to iterate to take into account xpaths and namespaces
    # L = []
    # ws = xml.xpath('/ss:Workbook/ss:Worksheet', namespaces=namespaces)
    # if len(ws) > 0:
    #     tables = ws[0].xpath('./ss:Table', namespaces=namespaces)
    #     if len(tables) > 0:
    #         rows = tables[0].xpath('./ss:Row', namespaces=namespaces)
    #         for row in rows:
    #             tmp = []
    #             cells = row.xpath('./ss:Cell/ss:Data', namespaces=namespaces)
    #             for cell in cells:
    #                 #                print(cell.text);
    #                 tmp.append(cell.text)
    #             L.append(tmp)
    # print(L)

1 Answer
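As background for the script that follows, here is a minimal sketch of one way to get past the "Document is empty" error. It assumes the garbled characters at the start of the file are a UTF-8 byte-order mark, which the UTF-8-SIG check in the script suggests but which the question does not confirm. The helper name `parse_xml_bytes` and the tiny sample workbook are made up for illustration, and the standard-library `xml.etree.ElementTree` is used here instead of lxml so the snippet runs without extra packages.

```python
import codecs
import xml.etree.ElementTree as ET


def parse_xml_bytes(raw: bytes) -> ET.Element:
    """Parse XML given as bytes, dropping a leading UTF-8 BOM if present."""
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]
    # Passing bytes lets the parser take the encoding from the XML declaration
    # (UTF-8 when none is declared) instead of the platform default encoding.
    return ET.fromstring(raw)


# Hypothetical stand-in for the downloaded file: a BOM followed by a tiny workbook.
sample = codecs.BOM_UTF8 + (
    b'<?xml version="1.0"?>'
    b'<ss:Workbook xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">'
    b'<ss:Worksheet ss:Name="Overzicht"/>'
    b'</ss:Workbook>'
)

root = parse_xml_bytes(sample)
print(root.tag)   # namespaced tag of the root element
print(len(root))  # number of direct children
```

For the real file, the bytes would come from `open(filepath, 'rb').read()`. Opening the file in binary mode is the design choice that matters here: in text mode on Windows the BOM is decoded with the default code page and ends up as stray characters in front of the XML declaration.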







import lxml.etree as et
import io
import chardet
import pandas as pd

filepath = 'C:\\MSCI.xml'
namespace = '{urn:schemas-microsoft-com:office:spreadsheet}'
find_elem = 'Worksheet'
ws_name = 'Posities'

# Capture encoding
with open(filepath, 'rb') as f:
    data = f.read()
xml_enc = chardet.detect(data).get('encoding')
if xml_enc == 'UTF-8-SIG':
    xml_enc = xml_enc.replace('-SIG', '')

### Parse the xml file, iterate through it, append and build dataframe ###
# https://stackoverflow.com/questions/10242237/lxml-etree-iterparse-error-typeerror-reading-file-objects-must-return-plain-st
# https://stackoverflow.com/questions/36804794/iterparse-large-xml-using-python
# https://riptutorial.com/python/example/25995/opening-and-reading-large-xml-files-using-iterparse--incremental-parsing-
# https://stackoverflow.com/questions/28253006/python-element-tree-iterparse-filter-nodes-and-children
# https://stackoverflow.com/questions/12792998/elementtree-iterparse-strategy
# https://stackoverflow.com/questions/7018326/lxml-iterparse-in-python-cant-handle-namespaces
# https://stackoverflow.com/questions/38790012/how-to-get-all-the-tags-in-an-xml-using-python

# Decode with the detected encoding. Assumption: the stray characters in the first
# row are a byte-order mark (chardet reported UTF-8-SIG), so strip it explicitly.
with open(filepath, encoding=xml_enc) as xml_file:
    xml_str = xml_file.read().lstrip('\ufeff')  # !!! IShares xml has error in first row !!!
    xml_byte = io.BytesIO(xml_str.encode(xml_enc))

worksheet = []
# Listen for 'end' events only: an element's children are complete at that point.
for event, elem in et.iterparse(xml_byte, recover=True, events=('end',)):
    if elem.tag == namespace + find_elem and ws_name in elem.attrib.values():
        for table in elem:
            row_values = []
            for row in table:
                # Collect the text of every ss:Data element in this row
                cell_values = [data.text for cells in row for data in cells]
                row_values.append(cell_values)
            worksheet.append(row_values)

xml_df_concat = pd.concat([pd.DataFrame(rows) for rows in worksheet], ignore_index=True)
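The nested loops above can also be written with namespace-aware `findall` calls, along the lines of the commented-out attempts in the question. The following is a sketch under assumptions, not the method used above: the helper `worksheet_rows`, the sample workbook, and its cell values are invented for illustration, and the standard-library `xml.etree.ElementTree` is used so the snippet runs without lxml.

```python
import xml.etree.ElementTree as ET

NS = {"ss": "urn:schemas-microsoft-com:office:spreadsheet"}
SS = "{urn:schemas-microsoft-com:office:spreadsheet}"


def worksheet_rows(root, sheet_name):
    """Return the cell texts of the named worksheet as a list of rows."""
    for ws in root.findall("ss:Worksheet", NS):
        # Attributes with a prefix are stored under their full namespaced name.
        if ws.get(SS + "Name") == sheet_name:
            # './/ss:Row' also matches rows nested inside an ss:Table element.
            return [
                [data.text for data in row.findall("ss:Cell/ss:Data", NS)]
                for row in ws.findall(".//ss:Row", NS)
            ]
    return []


# Hypothetical miniature workbook with made-up values.
sample = b"""<?xml version="1.0"?>
<ss:Workbook xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
<ss:Worksheet ss:Name="Posities"><ss:Table>
<ss:Row><ss:Cell><ss:Data ss:Type="String">Ticker</ss:Data></ss:Cell>
<ss:Cell><ss:Data ss:Type="String">Naam</ss:Data></ss:Cell></ss:Row>
<ss:Row><ss:Cell><ss:Data ss:Type="String">AAA</ss:Data></ss:Cell>
<ss:Cell><ss:Data ss:Type="String">Example Corp</ss:Data></ss:Cell></ss:Row>
</ss:Table></ss:Worksheet>
</ss:Workbook>"""

rows = worksheet_rows(ET.fromstring(sample), "Posities")
print(rows)
```

If the first row holds the column headers, `pd.DataFrame(rows[1:], columns=rows[0])` would turn the result into a DataFrame. In the real download the header row may sit further down the sheet and rows may differ in length, so that slice is an assumption to check against the actual file.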
answered 2021-04-30T16:52:49.397