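I am trying to get 1. the parent attribute, 2. the child attribute, and 3. the grandchild text into a DataFrame. I can get the child attribute and the grandchild text to print to the screen, but I cannot get them into a DataFrame. I get a MemoryError from pandas.
Here is the introductory setup: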
import requests
from lxml import etree, objectify
r = requests.get('https://api.stuff.us/place/getData?security_key=key&period=minutes&startTime=2013-05-01T00:00&endTime=2013-05-01T23:59&sort=channel') #edited for privacy
root = etree.fromstring(r.text)
xml_new = etree.tostring(root, pretty_print=True)
print xml_new[300:900] #gives xml output to show structure
<startTime>2013-05-01 00:00:00</startTime>
<endTime>2013-05-01 23:59:00</endTime>
<summaryPeriod>minutes</summaryPeriod>
<data>
  <channel channel="97925" name="blah">
    <Time Time="2013-05-01 00:00:00">
      <value>258</value>
    </Time>
    <Time Time="2013-05-01 00:01:00">
      <value>259</value>
    </Time>
    <Time Time="2013-05-01 00:02:00">
      <value>258</value>
    </Time>
    <Time Time="2013-05-01 00:03:00">
      <value>257</value>
    </Time>
This shows how I parse the XML to print the child attributes and the grandchild text.
for df in root.xpath('//channel/Time'):
    ## Iterate over attributes of channel/Time
    for attrib in df.attrib:
        print '@' + attrib + '=' + df.attrib[attrib]
    ## value is a child of time, and iterate
    subfields = df.getchildren()
    for subfield in subfields:
        print 'subfield=' + subfield.text
As requested, it prints a long list of output:
...
@Time=2013-05-01 23:01:00
value=100
@Time=2013-05-01 23:02:00
value=101
@Time=2013-05-01 23:03:00
value=99
@Time=2013-05-01 23:04:00
value=101
...
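The loop above reaches the child attribute and the grandchild text, but not the parent attribute (item 1 in the list at the top). A minimal sketch of reaching all three in one pass, assuming a small inline sample that mimics the structure shown earlier and using the standard library's xml.etree.ElementTree in place of lxml so that it runs without the API. The names SAMPLE and iter_readings are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample that mimics the structure of the real response.
SAMPLE = """
<root>
  <data>
    <channel channel="97925" name="blah">
      <Time Time="2013-05-01 00:00:00">
        <value>258</value>
      </Time>
      <Time Time="2013-05-01 00:01:00">
        <value>259</value>
      </Time>
    </channel>
  </data>
</root>
"""


def iter_readings(root):
    """Yield (channel, time, value) string triples, one per <value> element."""
    for channel in root.iter('channel'):               # parent
        for time_el in channel.findall('Time'):        # child
            for value_el in time_el.findall('value'):  # grandchild
                yield (channel.get('channel'),
                       time_el.get('Time'),
                       value_el.text)


root = ET.fromstring(SAMPLE)
readings = list(iter_readings(root))
for channel, time, value in readings:
    print('@channel=' + channel + ' @Time=' + time + ' value=' + value)
```

Nesting the loops keeps the parent element in scope while the children are visited, so no lookup back up the tree is needed. lxml elements also provide iter, findall and get, so the same nesting should carry over to a tree built with lxml.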
However, when I try to put this into a DataFrame, I get a memory error. I tried it with both of them, and also with just the child attribute.
data = []
for df in root.xpath('//channel/Time'):
    ## Iterate over attributes of channel/Time
    for attrib in df.attrib:
        el_data = {}
        el_data[attrib] = df.attrib[attrib]
    data.append(el_data)
from pandas import *
perf = DataFrame(data)
perf
---------------------------------------------------------------------------
MemoryError Traceback (most recent call last)
<ipython-input-6-08c8c74f7192> in <module>()
1 from pandas import *
----> 2 perf = DataFrame(data)
3 perf
/Users/user/Library/Enthought/Canopy_32bit/User/lib/python2.7/site-packages/pandas/core/frame.pyc in __init__(self, data, index, columns, dtype, copy)
417
418 if isinstance(data[0], (list, tuple, collections.Mapping, Series)):
--> 419 arrays, columns = _to_arrays(data, columns, dtype=dtype)
420 columns = _ensure_index(columns)
421
/Users/user/Library/Enthought/Canopy_32bit/User/lib/python2.7/site-packages/pandas/core/frame.pyc in _to_arrays(data, columns, coerce_float, dtype)
5457 return _list_of_dict_to_arrays(data, columns,
5458 coerce_float=coerce_float,
-> 5459 dtype=dtype)
5460 elif isinstance(data[0], Series):
5461 return _list_of_series_to_arrays(data, columns,
/Users/user/Library/Enthought/Canopy_32bit/User/lib/python2.7/site-packages/pandas/core/frame.pyc in _list_of_dict_to_arrays(data, columns, coerce_float, dtype)
5521 for d in data]
5522
-> 5523 content = list(lib.dicts_to_array(data, list(columns)).T)
5524 return _convert_object_array(content, columns, dtype=dtype,
5525 coerce_float=coerce_float)
/Users/user/Library/Enthought/Canopy_32bit/User/lib/python2.7/site-packages/pandas/lib.so in pandas.lib.dicts_to_array (pandas/lib.c:7657)()
MemoryError:
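My XML file contains 12960 "value" entries. I am assuming the memory error means that something about the values in my file is not as expected, but that does not match what a memory error normally indicates, and I cannot figure it out from other SO questions about memory errors or from the pandas documentation.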
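One way to test the assumption that the values are not as expected is to inspect the list before it reaches pandas. A minimal sketch, assuming the same channel/Time/value nesting as in the output shown earlier, a tiny inline sample, and the standard library's xml.etree.ElementTree substituted for lxml. SAMPLE and build_rows are hypothetical names. It builds one flat dict per Time element, then checks the row count and the keys.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample that mimics the structure of the real response.
SAMPLE = """
<data>
  <channel channel="97925" name="blah">
    <Time Time="2013-05-01 00:00:00"><value>258</value></Time>
    <Time Time="2013-05-01 00:01:00"><value>259</value></Time>
    <Time Time="2013-05-01 00:02:00"><value>258</value></Time>
  </channel>
</data>
"""


def build_rows(root):
    """Return one dict per <Time> element with parent, child and grandchild data."""
    rows = []
    for channel in root.iter('channel'):
        for time_el in channel.findall('Time'):
            rows.append({
                'channel': channel.get('channel'),   # parent attribute
                'Time': time_el.get('Time'),         # child attribute
                'value': time_el.findtext('value'),  # grandchild text
            })
    return rows


root = ET.fromstring(SAMPLE)
rows = build_rows(root)

# Sanity checks: one row per <value> element, and identical keys in every row.
n_values = len(root.findall('.//value'))
key_sets = {tuple(sorted(row)) for row in rows}
print(len(rows), n_values, key_sets)
```

If the counts match and every row carries the same keys, passing rows to pandas.DataFrame would be expected to give one row per Time element with three columns. Whether that avoids the MemoryError on this 32-bit install is not something the sketch can show.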
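Trying to get the data types yields no information. Maybe there are no types? Maybe it is because they are elements in an element tree. (I tried printing .pyval, but it only told me there is no such attribute.) The type of el_data is dict.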
print(objectify.dump(root))[700:1000] #print a subset of types
name = 'zone'
Time = None [_Element]
* Time = '2013-05-01 00:00:00'
value = '258' [_Element]
Time = None [_Element]
* Time = '2013-05-01 00:01:00'
value = '259' [_Element]
type(el_data)
dict
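On the question of types: attribute values and element text come back from the parser as plain strings, so any numeric or date type has to be applied explicitly. As far as I know, .pyval belongs to elements parsed through lxml.objectify, which is probably why a tree built with etree.fromstring does not have it. A minimal sketch, assuming value is always an integer and Time always uses the format seen in the output (standard library ElementTree in place of lxml; SAMPLE is a hypothetical one-element sample):

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Hypothetical single <Time> element in the same shape as the real data.
SAMPLE = '<Time Time="2013-05-01 00:01:00"><value>259</value></Time>'

time_el = ET.fromstring(SAMPLE)
raw_time = time_el.get('Time')         # attribute value, a string
raw_value = time_el.findtext('value')  # element text, a string
print(type(raw_time).__name__, type(raw_value).__name__)

# Explicit conversion; the format string is an assumption based on the sample.
parsed_time = datetime.strptime(raw_time, '%Y-%m-%d %H:%M:%S')
parsed_value = int(raw_value)
print(parsed_time, parsed_value)
```

Converting before building the DataFrame keeps the column types under explicit control instead of leaving everything as object strings.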
I built this code from the book Python for Data Analysis and from other XML-parsing examples I found on SO. I am still very new to Python.
Running Python 2.7.2 on Mac OS 10.7.5.