So, I realize this is an older answer with a high vote count and an accepted mark, but if you are reading LARGE FILES and find yourself in the same pickle I was in, I hope this helps you.
The real problem with this approach is the iteration. No matter how fast the parser is, doing anything, say... a few hundred thousand times is going to eat your execution time. That said, it came down to really thinking about the problem and understanding how namespaces work (or are "intended to work", because honestly they weren't needed here). Now, if your xml genuinely uses namespaces, meaning you see tags that look like this: <xs:table>, then you will need to adjust the approach here for your use case; a hedged sketch of that adjustment follows. I'll also include the full way of handling things.
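For that prefixed-tag case, here is a minimal sketch of the adjustment (untested against exotic documents, and it assumes prefixes only appear on tag names, not on attributes you need to keep):

import re

def strip_tag_prefixes(raw_xml):
    # '<xs:table>' -> '<table>', '</xs:table>' -> '</table>'
    # '<' is illegal inside text and attribute values, so this only hits tags
    return re.sub(r'(</?)\w+:', r'\1', raw_xml)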
DISCLAIMER: In good conscience, I cannot tell you to use regular expressions when parsing html/xml; go look at SergiyKolesnikov's answer, because it works. But I had an edge case, so, that being said... let's dive into some regex!
The problem: namespace stripping takes a long time... and most of the time the namespaces only live inside the very first tag, our "root". So, thinking about how python reads the information in, and given that our only problem child is the root node, why not use that to our advantage?
PLEASE NOTE: the file I'm using as my example is a raw, horrid, wildly nonsensical structure that holds the promise of data somewhere inside, kept around for the lulz.
my_file is the path to the file used in our example, which I cannot share with you for professional reasons; it has been shrunk down in size for the sake of getting through this answer.
import os, subprocess, re, io, json
from lxml import etree

# Placeholder path -- the real file can't be shared, so point this at your own
# if playing along at home
_biggest_file = "/path/to/your/big_file"
my_file = _biggest_file

meta_stuff = dict(
    exists = os.path.exists(_biggest_file),
    sizeof = os.path.getsize(_biggest_file),
    extension_is_a_real_thing = any(re.findall(r"\.(html|xml)$", my_file, re.I)),
    system_thinks_its_a = subprocess.check_output(
        ["file", "-i", _biggest_file]   # requires the unix `file` utility
    ).decode().split(":")[-1].strip()
)
print(json.dumps(meta_stuff, indent = 2))
So, for starters: the size is respectable, the system thinks it is, at best, html, and the file extension is neither xml nor html...
{
  "exists": true,
  "sizeof": 24442371,
  "extension_is_a_real_thing": false,
  "system_thinks_its_a": "text/html; charset=us-ascii"
}
The approach:
- In order to parse an xml file... it should at the very least be xml, so we need to check for a declaration tag and add one if it's missing
- If I have namespaces... that's bad, because I can't use xpaths, and that is exactly what I want to do
- If my file is huge, I should only be operating on the smallest imaginable portions that need cleaning before I'm ready to parse it (a toy illustration of the idea follows this list)
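To make the root-node idea concrete before the full function, here is a toy illustration (the tag name and uri are made up):

import re

toy = '<actual_name xmlns:xsi="http://example.com/nothing">...</actual_name>'
first_node = re.search(r'^(<.*?>)', toy).group()              # the whole opening tag
tag_name = re.search(r'(?<=^<)(.*?)\S+', first_node).group()  # 'actual_name'
print(toy.replace(first_node, '<{}>'.format(tag_name), 1))
# -> <actual_name>...</actual_name>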
The function
def speed_read(file_path):
    # We're gonna be low-brow and add our own declaration using this string. It's fine
    _xml_dec = '<?xml version="1.0" encoding="utf-8"?>'
    # Even worse.. rgx for xml, here we go
    #
    # We'll need to extract the very first node that we find in our document,
    # because for our purposes that's the one we know has the namespace uri's,
    # ie: "attributes"
    # FiRsT node : <actual_name xmlns:xsi="idontactuallydoanything.com">
    # We're going to pluck out that first node and get the tag's actual name,
    # which means from:
    #   <actual_name xmlns:xsi="idontactuallydoanything.com">...</actual_name>
    # we pluck:
    #   actual_name
    # Then we're gonna replace the entire tag with one we make from that name
    # by simple string substitution
    #
    # -> 'starting from the beginning, capture everything between the < and the >'
    _first_node = re.compile(r'^(<.*?>)', re.I | re.M | re.U)
    # -> 'starting from the beginning, but don't you get me the <, find anything
    #    that happens before the first white-space, which I don't want either, man'
    _first_tagname = re.compile(r'(?<=^<)(.*?)\S+', re.I | re.M | re.U)
    # open the file context
    with open(file_path, "r", encoding = "utf-8") as f:
        # go ahead and strip leading and trailing whitespace, cause why not...
        # plus it adds safety for our regexes
        _raw = f.read().strip()
    # Now, if the file somehow happens to magically have an xml declaration already,
    # we want to remove it, since we plan to add our own. startswith only looks at
    # the first few characters, so the check is cheap
    if _raw.startswith('<?xml'):
        _raw = re.sub(r'<\?xml.*?\?>\n?', '', _raw).strip()
    # Here we grab that first node that has those meaningless namespaces
    root_element = _first_node.search(_raw).group()
    # here we get its name
    first_tag = _first_tagname.search(root_element).group()
    # Here we substitute the entire element with a new one that only contains
    # the element's name. Plain str.replace (count=1) keeps regex metacharacters
    # inside the original tag from biting us
    _raw = _raw.replace(root_element, '<{}>'.format(first_tag), 1)
    # Now we add our declaration tag in the worst way you have ever
    # seen, but I miss sprintf, so this is how I'm rolling. Python is terrible btw
    _raw = "{}{}".format(_xml_dec, _raw)
    # Feeding lxml bytes keeps it from arguing about encodings; this has worked
    # for me consistently, so it stays
    return etree.parse(io.BytesIO(_raw.encode("utf-8")))
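And the payoff: with the namespaces gone, plain xpaths work the way I wanted them to (the element names below are hypothetical, since I can't share the file):

tree = speed_read(my_file)
print(tree.getroot().tag)        # just the name, no '{uri}' clutter
rows = tree.xpath('//table/row') # hypothetical path; adjust to your document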
# a good answer from above:
def safe_read(file_path):
    root = etree.parse(file_path)
    # iter() replaces the deprecated getiterator()
    for elem in root.iter():
        # comments and processing instructions have non-string tags; skip them
        if isinstance(elem.tag, str):
            elem.tag = etree.QName(elem).localname
    # Remove the now-unused namespace declarations
    etree.cleanup_namespaces(root)
    return root
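If you want to convince yourself the shortcut didn't eat anything, here is a quick sanity check comparing the two (just a smoke test, not proof of full equivalence):

fast_tree = speed_read(my_file)
safe_tree = safe_read(my_file)
# same root name and same total node count is a decent smoke test
assert fast_tree.getroot().tag == safe_tree.getroot().tag
assert sum(1 for _ in fast_tree.iter()) == sum(1 for _ in safe_tree.iter())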
Benchmarking (yes, I know there are better ways to do this):
import time
import pandas as pd

safe_times = []
for i in range(5):
    s = time.time()
    safe_read(_biggest_file)
    safe_times.append(time.time() - s)

fast_times = []
for i in range(5):
    s = time.time()
    speed_read(_biggest_file)
    fast_times.append(time.time() - s)

pd.DataFrame({"safe": safe_times, "fast": fast_times})
Results

   safe  fast
0  2.36  0.61
1  2.15  0.58
2  2.47  0.49
3  2.94  0.60
4  2.83  0.53