
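I would really appreciate some help. I have spent more than two days reading around to understand why I cannot load this XML file into a DataFrame. My goal is to get the worksheets in the XML file into a pandas DataFrame. I know several posts cover this topic, but I seem to be hitting errors that complicate things.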

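The data is downloaded from a well-known ETF provider. It downloads with an ".xls" extension, but the content is actually XML; apparently an Excel XML spreadsheet. So a simple pd.read_excel does not work. That is what pushed me into the XML format and libraries such as lxml and xml.etree.ElementTree. I have worked with BS4 for a while, though.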
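Since the extension and the content disagree here, one way to confirm what a download really contains is to look at its first bytes. This is a minimal sketch, not a definitive check: `sniff_spreadsheet_format` is a hypothetical helper (it is not part of pandas or lxml), and the sample inputs are made up. The two magic numbers are the usual OLE2 (binary .xls) and ZIP (.xlsx) signatures.

```python
import codecs

OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # legacy binary .xls container
ZIP_MAGIC = b"PK\x03\x04"                        # .xlsx is a ZIP archive


def sniff_spreadsheet_format(head: bytes) -> str:
    """Guess the container format from the first bytes of a file (hypothetical helper)."""
    if head.startswith(OLE2_MAGIC):
        return "xls-binary"
    if head.startswith(ZIP_MAGIC):
        return "xlsx-zip"
    # Skip a UTF-8 byte order mark, if present, before looking for markup.
    if head.startswith(codecs.BOM_UTF8):
        head = head[len(codecs.BOM_UTF8):]
    if head.lstrip().startswith(b"<"):
        return "xml"
    return "unknown"


# Made-up inputs standing in for the first bytes of real files.
xml_guess = sniff_spreadsheet_format(codecs.BOM_UTF8 + b'<?xml version="1.0"?>')
xls_guess = sniff_spreadsheet_format(OLE2_MAGIC + b"\x00" * 8)
xlsx_guess = sniff_spreadsheet_format(ZIP_MAGIC + b"\x14\x00")
```

For a real file, the bytes would come from something like `open(path, 'rb').read(16)`.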

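The XML download does not declare any encoding, and when I try to parse it I get an error. So I dabbled with chardet and et.XMLParser to detect the encoding and hard-set it in the parser, but to no avail. Parsing returns: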

'lxml.etree.XMLSyntaxError: Document is empty, line 1, column 1'

Instead of parsing the file directly (see xml_tree1 below), I tried reading the XML with fromstring, but I noticed some garbage characters. So I replaced them with nothing:

xml_str = xml_file.read().replace('', '')
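One possible explanation for the garbage, assuming the file starts with a UTF-8 byte order mark (which would match the UTF-8-SIG result chardet reports for this file further down): when those three bytes are decoded with a single-byte Windows codec, they show up as three visible characters in front of the XML declaration. The sketch below uses made-up bytes and the standard library's xml.etree.ElementTree rather than lxml, so it runs without extra packages.

```python
import codecs
import xml.etree.ElementTree as ET

# Made-up stand-in for the downloaded file's raw bytes: BOM + XML.
XML_BODY = b'<?xml version="1.0"?><root><child>ok</child></root>'
raw = codecs.BOM_UTF8 + XML_BODY

# Decoding with cp1252 (a common Windows default) turns the BOM into text.
as_cp1252 = raw.decode("cp1252")
bom_as_text = as_cp1252[:3]

# Decoding with "utf-8-sig" consumes the BOM, so no manual replace is needed.
clean_text = raw.decode("utf-8-sig")
root = ET.fromstring(clean_text)
```

Under that assumption, opening the file with `open(path, encoding='utf-8-sig')` would have the same effect as the manual replace.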

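Now I have clean XML, but I still cannot find any children in my root. In fact it seems to be empty and not parsed at all. My knowledge fails me here. Can someone push me in the right direction? My problem is at an early stage: I cannot seem to parse the file and its underlying format. A second problem is that I need to parse the ss:Table on the individual worksheets in the document. Further down in the code I have noted some examples for my own use. Any comments are very welcome.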
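For the second problem, here is a rough sketch of pulling the rows of one worksheet's ss:Table out of a parsed tree with a namespace map. It uses xml.etree.ElementTree instead of lxml so it runs on the standard library alone. `sheet_rows` is a hypothetical helper, and the "Posities" rows in the sample are invented for illustration; only the "Overzicht" row comes from the snippet below.

```python
import xml.etree.ElementTree as ET

NS = {"ss": "urn:schemas-microsoft-com:office:spreadsheet"}
SS = "{urn:schemas-microsoft-com:office:spreadsheet}"  # Clark notation, needed for attributes

# Shortened sample; the "Posities" content is made up.
SAMPLE = """<?xml version="1.0"?>
<ss:Workbook xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
<ss:Worksheet ss:Name="Overzicht">
<ss:Table>
<ss:Row><ss:Cell><ss:Data ss:Type="String">iShares Core MSCI World UCITS ETF</ss:Data></ss:Cell></ss:Row>
</ss:Table>
</ss:Worksheet>
<ss:Worksheet ss:Name="Posities">
<ss:Table>
<ss:Row><ss:Cell><ss:Data ss:Type="String">Ticker</ss:Data></ss:Cell><ss:Cell><ss:Data ss:Type="String">Naam</ss:Data></ss:Cell></ss:Row>
<ss:Row><ss:Cell><ss:Data ss:Type="String">AAPL</ss:Data></ss:Cell><ss:Cell><ss:Data ss:Type="String">APPLE INC</ss:Data></ss:Cell></ss:Row>
</ss:Table>
</ss:Worksheet>
</ss:Workbook>"""


def sheet_rows(root, sheet_name):
    """Return the cell texts of the named worksheet as a list of rows."""
    for ws in root.findall("ss:Worksheet", NS):
        # The Name attribute is namespaced too, so it needs the full URI.
        if ws.get(SS + "Name") == sheet_name:
            return [
                [data.text for data in row.findall("ss:Cell/ss:Data", NS)]
                for row in ws.findall("ss:Table/ss:Row", NS)
            ]
    return []


root = ET.fromstring(SAMPLE)
rows = sheet_rows(root, "Posities")
```

If pandas is available, `pd.DataFrame(rows[1:], columns=rows[0])` would turn this into a frame, assuming the first row holds the headers.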

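These are the posts that helped me most:

How do I pick up text values of child nodes when parsing XML with ElementTree

Reading a spreadsheet like .xml with ElementTree

The source of the XML can be found here (Dutch version). You can download it at the top right.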

https://www.ishares.com/nl/professionele-belegger/nl/producten/251882/ishares-msci-world-ucits-etf-acc-fund

Snippet of the XML:

<?xml version="1.0"?>
<ss:Workbook xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
<ss:Styles>
<ss:Style ss:ID="Default">
<ss:Alignment ss:Horizontal="Left"/>
</ss:Style>
<ss:Style ss:ID="wraptext">
<ss:Alignment ss:Horizontal="Left" ss:WrapText="1"/>
<ss:Font ss:Italic="1"/>
</ss:Style>
<ss:Style ss:ID="disclaimer">
<ss:Alignment ss:Vertical="Top" ss:WrapText="1"/>
</ss:Style>
<ss:Style ss:ID="DefaultHyperlink">
<ss:Alignment ss:Vertical="Center" ss:WrapText="1"/>
<ss:Font ss:Color="#0000FF" ss:Underline="Single" />
</ss:Style>
<ss:Style ss:ID="headerstyle">
<ss:Font ss:Bold="1" />
</ss:Style>
<ss:Style ss:ID="Date">
<ss:NumberFormat ss:Format="dd\-mmm\-yyyy"/>
</ss:Style>
<ss:Style ss:ID="Left">
<ss:Alignment ss:Horizontal="Left"/>
<ss:NumberFormat ss:Format="Standard"/>
</ss:Style>
<ss:Style ss:ID="Right">
<ss:Alignment ss:Horizontal="Right"/>
<ss:NumberFormat ss:Format="Standard"/>
</ss:Style>
</ss:Styles>
<ss:Worksheet ss:Name="Overzicht">
<ss:Table>
<ss:Row >
<ss:Cell ss:StyleID="headerstyle">
<ss:Data ss:Type="String">iShares Core MSCI World UCITS ETF</ss:Data>
</ss:Cell>
</ss:Row>

My code so far:

import lxml.etree as et
from lxml import objectify
import io
import chardet

with open('C:\\MSCI.xml') as xml_file:
    parser = et.XMLParser(encoding="iso-8859-5", recover=True)
    xml_str = xml_file.read().replace('', '')  # !!! IShares xml has error in first row !!!

    xml_tree = et.ElementTree(et.fromstring(xml_str, parser=parser))
    root = xml_tree.getroot()
    xml_tree0 = et.iterparse(xml_file, encoding='iso-8859-1')  # Nothing
    xml_tree1 = et.parse(xml_file, parser=parser)  # File seems empty, but is not
    xml_tree2 = objectify.parse(io.StringIO(xml_str))  # This is the same as fromstring

    #################################################
    ### Trying to capture encoding and replace it ###
    #################################################
    detector = chardet.UniversalDetector()
    for line in xml_file.readlines():
        detector.feed(line)  # This doesn't seem to work
        if detector.done: break
    detector.close()
    print(detector.result)

    xml_enc = detector.result['encoding']  # The result seems always to be None
    if xml_enc != 'utf-8':
        # content = xml_str(xml_enc, 'replace').encode('utf-8')  # Don't know how to replace encoding
        pass
    xml_clean = et.fromstring(xml_str, parser=parser)

    # The detector above and the encoding replacement do not work :(

    #############################################################################
    ### Some code below is how I'd guess to proceed, after I have a good tree ###
    #############################################################################

    # ns = {"ss": "urn:schemas-microsoft-com:office:spreadsheet"}
    # https://stackoverflow.com/questions/59945728/how-do-i-pick-up-text-values-of-child-nodes-when-parsing-xml-with-elementtree
    # https://stackoverflow.com/questions/54107550/reading-a-spreadsheet-like-xml-with-elementtree

    ### Something like this to iterate through the children
    # for appt in xml_tree.getchildren():
    #     for elem in appt.getchildren():
    #         if not elem.text:
    #             text = "None"
    #         else:
    #             text = elem.text
    #         print(elem.tag + " => " + text)

    ### Or something like this to iterate to take into account namespaces
    # for ws in xml.findall('ss:Worksheet', namespaces):
    #     for table in ws.findall('ss:Row', namespaces):
    #         for c in table.findall('ss:Cell', namespaces):
    #             data = c.find('ss:Data', namespaces)
    #             if data.text is None:
    #                 text = []
    #                 data = data.findall('html:Font', namespaces)
    #                 for element in data:
    #                     text.append(element.text)
    #
    #                 data_text = ''.join(text)
    #                 print(data_text)
    #             else:
    #                 print(data.text)

    ### Or something like this to iterate to take into account xpaths and namespaces
    # L = []
    # ws = xml.xpath('/ss:Workbook/ss:Worksheet', namespaces=namespaces)
    # if len(ws) > 0:
    #     tables = ws[0].xpath('./ss:Table', namespaces=namespaces)
    #     if len(tables) > 0:
    #         rows = tables[0].xpath('./ss:Row', namespaces=namespaces)
    #         for row in rows:
    #             tmp = []
    #             cells = row.xpath('./ss:Cell/ss:Data', namespaces=namespaces)
    #             for cell in cells:
    #                 #                print(cell.text);
    #                 tmp.append(cell.text)
    #             L.append(tmp)
    # print(L)
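One thing that may explain the "Document is empty" error: in the code above, `xml_file.read()` runs before `et.iterparse(xml_file, ...)`, `et.parse(xml_file, ...)` and `xml_file.readlines()`, so each of those calls sees a handle that has already been read to the end. The sketch below reproduces that effect with an in-memory stream and the standard library's xml.etree.ElementTree (swapped in for lxml here, so the exception type and message differ from the lxml ones).

```python
import io
import xml.etree.ElementTree as ET

XML_TEXT = '<?xml version="1.0"?><root><child>ok</child></root>'

handle = io.StringIO(XML_TEXT)  # stand-in for the open file
first_read = handle.read()      # consumes the whole stream
second_read = handle.read()     # nothing left to read

try:
    ET.parse(handle)            # the parser receives no data at all
    parse_error = None
except ET.ParseError as exc:
    parse_error = str(exc)

handle.seek(0)                  # rewind before handing the stream to a parser
root = ET.parse(handle).getroot()
```

Rewinding with `seek(0)`, or reopening the file, before each parse attempt avoids this.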

1 Answer


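Well, I ended up with the code below.

It works for me, but I still do not understand why I cannot parse the file directly and have to replace the garbage characters in the string.

Thoughts?

Maybe this saves someone else some trouble. It took me far too much time ;):S

Cheers!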

import lxml.etree as et
import io
import chardet
import pandas as pd

filepath = 'C:\\MSCI.xml'
namespace = '{urn:schemas-microsoft-com:office:spreadsheet}'
find_elem = 'Worksheet'
ws_name = 'Posities'

# Capture encoding
with open(filepath, 'rb') as f:
    data = f.read()
xml_enc = chardet.detect(data).get('encoding')
if xml_enc == 'UTF-8-SIG':
    xml_enc = xml_enc.replace('-SIG', '')

'''
##########################################################################
### Parse the xml file, iterate through it, append and build dataframe ###
##########################################################################
# https://stackoverflow.com/questions/10242237/lxml-etree-iterparse-error-typeerror-reading-file-objects-must-return-plain-st
# https://stackoverflow.com/questions/36804794/iterparse-large-xml-using-python
# https://riptutorial.com/python/example/25995/opening-and-reading-large-xml-files-using-iterparse--incremental-parsing-
# https://stackoverflow.com/questions/28253006/python-element-tree-iterparse-filter-nodes-and-children
# https://stackoverflow.com/questions/12792998/elementtree-iterparse-strategy
# https://stackoverflow.com/questions/7018326/lxml-iterparse-in-python-cant-handle-namespaces
# https://stackoverflow.com/questions/38790012/how-to-get-all-the-tags-in-an-xml-using-python
'''

with open(filepath) as xml_file:

    xml_str = xml_file.read().replace('', '')  # !!! IShares xml has error in first row !!!
    xml_byte = io.BytesIO(xml_str.encode(xml_enc))

    worksheet = []
    for event, elem in et.iterparse(xml_byte, recover=True, events=('start', 'end')):
        if elem.tag == et.QName(namespace + find_elem) and event == 'start':
            for name, value in elem.items():
                if value == ws_name:
                    for table in elem:
                        row_values = []
                        for row in table:
                            cell_values = []
                            for cells in row:
                                for data in cells:
                                    content = data.text
                                    cell_values.append(content)
                            row_values.append(cell_values)
                    worksheet.append(row_values)
    xml_df_concat = pd.concat([pd.DataFrame(rows) for rows in worksheet], ignore_index=True)
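A variation that might be more robust: the loop above inspects each Worksheet on its 'start' event, when the parser is not guaranteed to have read the element's children yet. Waiting for the 'end' event means the whole worksheet is available. This is a sketch under stated assumptions: it uses xml.etree.ElementTree in place of lxml, `stream_sheet_rows` is a hypothetical helper, and the sample rows are made up.

```python
import io
import xml.etree.ElementTree as ET

SS = "{urn:schemas-microsoft-com:office:spreadsheet}"

# Made-up sample in the same layout as the download.
SAMPLE = b"""<?xml version="1.0"?>
<ss:Workbook xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
<ss:Worksheet ss:Name="Overzicht">
<ss:Table>
<ss:Row><ss:Cell><ss:Data ss:Type="String">ignored</ss:Data></ss:Cell></ss:Row>
</ss:Table>
</ss:Worksheet>
<ss:Worksheet ss:Name="Posities">
<ss:Table>
<ss:Row><ss:Cell><ss:Data ss:Type="String">Ticker</ss:Data></ss:Cell><ss:Cell><ss:Data ss:Type="String">Naam</ss:Data></ss:Cell></ss:Row>
<ss:Row><ss:Cell><ss:Data ss:Type="String">AAPL</ss:Data></ss:Cell><ss:Cell><ss:Data ss:Type="String">APPLE INC</ss:Data></ss:Cell></ss:Row>
</ss:Table>
</ss:Worksheet>
</ss:Workbook>"""


def stream_sheet_rows(source, sheet_name):
    """Collect the rows of one worksheet while streaming through the file."""
    rows = []
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != SS + "Worksheet":
            continue
        # On 'end' the worksheet and all of its rows have been parsed.
        if elem.get(SS + "Name") == sheet_name:
            for row in elem.iter(SS + "Row"):
                rows.append([data.text for data in row.iter(SS + "Data")])
        elem.clear()  # drop the finished worksheet to keep memory use low
    return rows


rows = stream_sheet_rows(io.BytesIO(SAMPLE), "Posities")
```

The resulting list of rows can be passed to `pd.DataFrame(rows)` in the same way as before.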
answered 2021-04-30T16:52:49.397