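I would really appreciate some help. I have been busy for more than two days, browsing around to understand why I cannot access this xml file and get its contents into a df. My goal is to load the worksheets in the xml file into a pandas DataFrame. I know there are several posts on this topic, but I seem to be running into errors that complicate things.
The data is downloaded from a well-known ETF provider. It downloads as '.xls', but the content is actually 'xml'; apparently an Excel xml. So a simple pd.read_excel won't work. That is what forced me into the xml format and libraries such as LXML and xml.etree.ElementTree. I have worked with BS4 for a while, though.
The xml download does not specify any encoding, and when I try to parse it, it returns an error. So I dabbled with chardet and et.XMLParser to discover its encoding and 'hard set' it in the parser. To no avail. Parsing returns: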
'lxml.etree.XMLSyntaxError: Document is empty, line 1, column 1'
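One possible cause of an "empty document" style error, independent of encoding, is handing the parser a file object whose contents were already consumed by an earlier `read()`. The sketch below reproduces that with the standard-library `xml.etree.ElementTree` on an in-memory buffer. The sample XML is made up, and lxml words its error differently, so treat this as an illustration under those assumptions rather than a diagnosis of the actual file.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical in-memory stand-in for the downloaded file.
buf = io.BytesIO(b"<root><child>text</child></root>")

_ = buf.read()  # the first read consumes the whole stream

try:
    ET.parse(buf)        # the parser now receives zero bytes
    empty_error = None
except ET.ParseError as exc:
    empty_error = exc    # the stdlib parser reports "no element found"

buf.seek(0)              # rewind to the start of the stream
root = ET.parse(buf).getroot()
child_text = root.find("child").text
```

If this is what happens, either parse the string that was already read, or call `seek(0)` on the file object before passing it to a second parser.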
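Instead of parsing it directly (see xml_tree1 below), I tried reading the xml with fromstring, but I noticed some garbled characters at the start. So I replaced them with nothing: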
xml_str = xml_file.read().replace('', '')
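If the stray characters at the start of the file are a byte order mark (an assumption; the post does not show the actual bytes), they can be detected and dropped at the byte level instead of with a string `replace`. A minimal sketch with the standard library, where `raw` is a made-up stand-in for what `open(path, 'rb').read()` would return:

```python
import codecs
import xml.etree.ElementTree as ET

# Hypothetical bytes standing in for the file contents: a UTF-8 BOM
# followed by a tiny XML document.
raw = codecs.BOM_UTF8 + b'<?xml version="1.0"?><root><item>1</item></root>'

# Check whether the data starts with the three-byte UTF-8 BOM.
has_bom = raw.startswith(codecs.BOM_UTF8)

# The 'utf-8-sig' codec decodes UTF-8 and drops a leading BOM if present.
text = raw.decode("utf-8-sig")

root = ET.fromstring(text)
item_text = root.find("item").text
```

Reading in binary mode also avoids depending on the platform's default text encoding, which is what `open()` without an `encoding` argument uses.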
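Now I have clean xml code, but I still cannot find any children in my root. In fact, it seems to be empty and not parsed at all. My knowledge fails me here. Can someone push me in the right direction? My first problem is at an early stage: I can't seem to parse the file and its underlying format. The second problem is that I need to parse the ss:Table on each worksheet in the document. Further down in the code I have noted some examples for my own use. Any comments are very welcome.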
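For the second problem, a rough sketch of walking every worksheet and collecting the rows of its ss:Table could look like this. It uses the standard-library `xml.etree.ElementTree` rather than lxml, `read_sheets` is a hypothetical helper name, and the sample document is made up in the same shape as the snippet in this post.

```python
import xml.etree.ElementTree as ET

NS = {"ss": "urn:schemas-microsoft-com:office:spreadsheet"}

# Made-up sample in the same shape as the snippet in this post.
SAMPLE = """<?xml version="1.0"?>
<ss:Workbook xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
  <ss:Worksheet ss:Name="Overzicht">
    <ss:Table>
      <ss:Row>
        <ss:Cell><ss:Data ss:Type="String">Name</ss:Data></ss:Cell>
        <ss:Cell><ss:Data ss:Type="String">Weight</ss:Data></ss:Cell>
      </ss:Row>
      <ss:Row>
        <ss:Cell><ss:Data ss:Type="String">Example Corp</ss:Data></ss:Cell>
        <ss:Cell><ss:Data ss:Type="Number">1.5</ss:Data></ss:Cell>
      </ss:Row>
    </ss:Table>
  </ss:Worksheet>
</ss:Workbook>"""


def read_sheets(xml_text):
    """Return {worksheet name: list of rows}, each row a list of cell texts."""
    root = ET.fromstring(xml_text)
    # Attribute names need the full namespace URI in braces.
    name_attr = "{%s}Name" % NS["ss"]
    sheets = {}
    for ws in root.findall("ss:Worksheet", NS):
        rows = []
        for row in ws.findall("ss:Table/ss:Row", NS):
            cells = []
            for cell in row.findall("ss:Cell", NS):
                data = cell.find("ss:Data", NS)
                # A cell without ss:Data is recorded as None.
                cells.append(data.text if data is not None else None)
            rows.append(cells)
        sheets[ws.get(name_attr)] = rows
    return sheets


sheets = read_sheets(SAMPLE)
```

Each list of rows could then be passed to something like `pd.DataFrame(rows[1:], columns=rows[0])`, assuming the first row of a sheet holds the column headers, which may not be true for the real file.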
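These are the posts that helped me the most:
How do I pick up text values of child nodes when parsing XML with ElementTree
Reading a spreadsheet like .xml with ElementTree
The source of the xml can be found here (Dutch version). You can download it in the top right corner.
Snippet of the xml: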
<?xml version="1.0"?>
<ss:Workbook xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
    <ss:Styles>
        <ss:Style ss:ID="Default">
            <ss:Alignment ss:Horizontal="Left"/>
        </ss:Style>
        <ss:Style ss:ID="wraptext">
            <ss:Alignment ss:Horizontal="Left" ss:WrapText="1"/>
            <ss:Font ss:Italic="1"/>
        </ss:Style>
        <ss:Style ss:ID="disclaimer">
            <ss:Alignment ss:Vertical="Top" ss:WrapText="1"/>
        </ss:Style>
        <ss:Style ss:ID="DefaultHyperlink">
            <ss:Alignment ss:Vertical="Center" ss:WrapText="1"/>
            <ss:Font ss:Color="#0000FF" ss:Underline="Single" />
        </ss:Style>
        <ss:Style ss:ID="headerstyle">
            <ss:Font ss:Bold="1" />
        </ss:Style>
        <ss:Style ss:ID="Date">
            <ss:NumberFormat ss:Format="dd\-mmm\-yyyy"/>
        </ss:Style>
        <ss:Style ss:ID="Left">
            <ss:Alignment ss:Horizontal="Left"/>
            <ss:NumberFormat ss:Format="Standard"/>
        </ss:Style>
        <ss:Style ss:ID="Right">
            <ss:Alignment ss:Horizontal="Right"/>
            <ss:NumberFormat ss:Format="Standard"/>
        </ss:Style>
    </ss:Styles>
    <ss:Worksheet ss:Name="Overzicht">
        <ss:Table>
            <ss:Row >
                <ss:Cell ss:StyleID="headerstyle">
                    <ss:Data ss:Type="String">iShares Core MSCI World UCITS ETF</ss:Data>
                </ss:Cell>
            </ss:Row>
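Because every element in the snippet carries the `ss` prefix, a parser exposes tag and attribute names with the full namespace URI in braces, so a lookup by bare name finds nothing. A small sketch with the standard-library `xml.etree.ElementTree`, run on a shortened copy of the snippet with the closing tags added for illustration only:

```python
import xml.etree.ElementTree as ET

# Shortened copy of the snippet; the closing tags are added here so that
# the fragment is well-formed.
SNIPPET = """<?xml version="1.0"?>
<ss:Workbook xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
  <ss:Worksheet ss:Name="Overzicht">
    <ss:Table>
      <ss:Row>
        <ss:Cell ss:StyleID="headerstyle">
          <ss:Data ss:Type="String">iShares Core MSCI World UCITS ETF</ss:Data>
        </ss:Cell>
      </ss:Row>
    </ss:Table>
  </ss:Worksheet>
</ss:Workbook>"""

SS = "urn:schemas-microsoft-com:office:spreadsheet"

root = ET.fromstring(SNIPPET)
root_tag = root.tag  # namespace URI in braces, then the local name

# Lookups must use the qualified name, for elements and attributes alike.
first_data = root.find(f".//{{{SS}}}Data")
data_type = first_data.get(f"{{{SS}}}Type")

# The bare local name does not match a namespaced element.
plain_lookup = root.find(".//Data")
```

This may explain a root that looks as if it has no matching children: loops that search for `Worksheet` or `Row` without the namespace come back empty even when the tree parsed correctly.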
My code so far:
import lxml.etree as et
from lxml import objectify
import io
import chardet

with open('C:\\MSCI.xml') as xml_file:
    parser = et.XMLParser(encoding="iso-8859-5", recover=True)
    xml_str = xml_file.read().replace('', '')  # !!! IShares xml has error in first row !!!
    xml_tree = et.ElementTree(et.fromstring(xml_str, parser=parser))
    root = xml_tree.getroot()

    xml_tree0 = et.iterparse(xml_file, encoding='iso-8859-1')  # Nothing
    xml_tree1 = et.parse(xml_file, parser=parser)  # File seems empty, but is not
    xml_tree2 = objectify.parse(io.StringIO(xml_str))  # This is the same as fromstring

    #################################################
    ### Trying to capture encoding and replace it ###
    #################################################
    detector = chardet.UniversalDetector()
    for line in xml_file.readlines():
        detector.feed(line)  # This doesn't seem to work
        if detector.done: break
    detector.close()
    print(detector.result)
    xml_enc = detector.result['encoding']  # The result seems always to be None
    if xml_enc != 'utf-8':
        # content = xml_str(xml_enc, 'replace').encode('utf-8')  # Don't know how to replace encoding
        pass
    xml_clean = et.fromstring(xml_str, parser=parser)
    # The detector function above and the encoding replacer do not work :(
#############################################################################
### Some code below is how I'd guess to proceed, after I have a good tree ###
#############################################################################
# ns = {"ss": "urn:schemas-microsoft-com:office:spreadsheet"}
# https://stackoverflow.com/questions/59945728/how-do-i-pick-up-text-values-of-child-nodes-when-parsing-xml-with-elementtree
# https://stackoverflow.com/questions/54107550/reading-a-spreadsheet-like-xml-with-elementtree

### Something like this to iterate through the children
# for appt in xml_tree.getchildren():
#     for elem in appt.getchildren():
#         if not elem.text:
#             text = "None"
#         else:
#             text = elem.text
#         print(elem.tag + " => " + text)

### Or something like this to iterate to take into account namespaces
# for ws in xml.findall('ss:Worksheet', namespaces):
#     for table in ws.findall('ss:Row', namespaces):
#         for c in table.findall('ss:Cell', namespaces):
#             data = c.find('ss:Data', namespaces)
#             if data.text is None:
#                 text = []
#                 data = data.findall('html:Font', namespaces)
#                 for element in data:
#                     text.append(element.text)
#
#                 data_text = ''.join(text)
#                 print(data_text)
#             else:
#                 print(data.text)

### Or something like this to iterate to take into account xpaths and namespaces
# L = []
# ws = xml.xpath('/ss:Workbook/ss:Worksheet', namespaces=namespaces)
# if len(ws) > 0:
#     tables = ws[0].xpath('./ss:Table', namespaces=namespaces)
#     if len(tables) > 0:
#         rows = tables[0].xpath('./ss:Row', namespaces=namespaces)
#         for row in rows:
#             tmp = []
#             cells = row.xpath('./ss:Cell/ss:Data', namespaces=namespaces)
#             for cell in cells:
#                 # print(cell.text);
#                 tmp.append(cell.text)
#             L.append(tmp)
# print(L)
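Once a list of rows such as `L` exists, shaping it for a DataFrame is mostly a matter of separating the header row from the data rows. A hedged sketch in plain Python, assuming the first row is the header and using made-up values; `rows_to_records` is a hypothetical helper, not part of any library:

```python
def rows_to_records(rows):
    """Split a list of rows into (header, list of dicts keyed by header)."""
    if not rows:
        return [], []
    header, *body = rows
    records = []
    for row in body:
        # Pad short rows with None so every record has all header keys.
        padded = list(row) + [None] * (len(header) - len(row))
        records.append(dict(zip(header, padded)))
    return header, records


# Made-up rows in the shape produced by the commented XPath loop.
L = [
    ["Ticker", "Name", "Weight"],
    ["AAA", "Example Corp", "1.5"],
    ["BBB", "Other Inc"],
]
header, records = rows_to_records(L)
```

A list of dicts like `records` is a shape that `pd.DataFrame` accepts directly. Whether the real file puts its header in the first row of each sheet is something to check against the download.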