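A few days ago I started learning Python in order to build a basic site that compiles some statistics from BOINC projects (e.g. SETI@home).
Basically, the site:
- downloads a .gz file
- decompresses the .gz file into an XML file
- builds the XML information into a data structure
- writes the data structure back out to a CSV file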
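For reference, the decompress, parse and write steps above can be sketched roughly as follows. This is a minimal illustration rather than my real code: the download step is left out (the sketch starts from bytes that are already in memory), and the `<users>`/`<user>` layout, the `FIELDS` list and the helper names `gz_to_rows` and `rows_to_csv` are assumptions made up for the example, based only on the tag names my parser looks for.

```python
import csv
import gzip
import io
import xml.etree.ElementTree as ET

# Assumed per-user fields; taken from the tags the parsing code checks for.
FIELDS = ["id", "name", "cpid", "total_credit", "expavg_credit", "teamid"]


def gz_to_rows(gz_bytes):
    """Decompress a gzipped XML export and yield one dict per <user> element."""
    xml_bytes = gzip.decompress(gz_bytes)
    for _event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("end",)):
        if elem.tag == "user":  # hypothetical record element name
            yield {field: elem.findtext(field, default="") for field in FIELDS}
            elem.clear()  # release the finished <user> subtree


def rows_to_csv(rows, out):
    """Write the rows to an already open text file object as CSV."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


# Tiny made-up sample standing in for a downloaded .gz file.
sample = gzip.compress(
    b"<users><user><id>1</id><name>alice</name><cpid>abc</cpid>"
    b"<total_credit>10.5</total_credit><expavg_credit>1.25</expavg_credit>"
    b"<teamid>7</teamid></user></users>"
)
buffer = io.StringIO()
rows_to_csv(gz_to_rows(sample), buffer)
print(buffer.getvalue())
```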
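In total there are 34 .gz files from 34 different BOINC projects.
All the code is now complete and working, but the .gz file from one project refuses to parse, while the files from the other projects work fine.
The file is user.gz, from http://www.rnaworld.de/rnaworld/stats/
These are the errors I get: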
Traceback (most recent call last):
  File "C:/Users/chris/PycharmProjects/testproject1/rnaw100.py", line 77, in <module>
    for event, elem in ET.iterparse(str(x_file_name2), events=("start", "end")):
  File "C:\Users\chris\AppData\Local\Programs\Python\Python38-32\lib\xml\etree\ElementTree.py", line 1227, in iterator
    yield from pullparser.read_events()
  File "C:\Users\chris\AppData\Local\Programs\Python\Python38-32\lib\xml\etree\ElementTree.py", line 1302, in read_events
    raise event
  File "C:\Users\chris\AppData\Local\Programs\Python\Python38-32\lib\xml\etree\ElementTree.py", line 1274, in feed
    self._parser.feed(data)
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1, column 0
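Since the error is reported at line 1, column 0, the parser appears to be rejecting the very first byte of the file, which suggests the file handed to `iterparse` may not be plain XML at all. One way to check is to look at its first few bytes. The sketch below is only a diagnostic idea: `sniff_format` is a hypothetical helper name, and the temporary file is a made-up stand-in for the real decompressed file.

```python
import gzip
import os
import tempfile


def sniff_format(path):
    """Guess from the first bytes whether a file is gzip data, XML, or neither."""
    with open(path, "rb") as f:
        head = f.read(16)
    if head.startswith(b"\x1f\x8b"):  # gzip magic number
        return "gzip"
    # Tolerate a UTF-8 byte order mark and leading whitespace before the first tag.
    if head.lstrip(b"\xef\xbb\xbf \t\r\n").startswith(b"<"):
        return "xml"
    return "unknown"


# Demo: a file that is still gzip-compressed, as a stand-in for the suspect file.
with tempfile.TemporaryDirectory() as tmp:
    suspect = os.path.join(tmp, "user.xml")
    with open(suspect, "wb") as f:
        f.write(gzip.compress(b"<users/>"))
    result = sniff_format(suspect)
    print(result)
```

If this reported "gzip" for the file that fails, the decompression step would be the place to look rather than the XML parsing.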
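As a newbie, I'm finding it hard to work out what is going wrong, because (a) the errors point into Python core files such as ElementTree.py, (b) I don't understand why a .gz file that many other BOINC stats websites use won't work here, and (c) I don't see why my code works for the other files but not for this one.
This is the code that downloads the .gz file and parses the XML (I have omitted variable declarations etc.):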
import gzip
import shutil
import xml.etree.ElementTree as ET

import requests

# url2, target_path2, x_file_name2, TEAMID, cnt and dic are declared earlier (omitted).

# Download the .gz file and save it to disk.
response = requests.get(url2, stream=True)
if response.status_code == 200:
    with open(target_path2, 'wb') as f:
        f.write(response.raw.read())

# Decompress the .gz file into an XML file.
with gzip.open(target_path2, 'rb') as f_in:
    with open(x_file_name2, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

# Stream through the XML, remembering the fields of the current user.
for event, elem in ET.iterparse(str(x_file_name2), events=("start", "end")):
    if elem.tag == "total_credit" and event == "end":
        tc = float(elem.text)
        elem.clear()
    if elem.tag == "expavg_credit" and event == "end":
        ac = float(elem.text)
        elem.clear()
    if elem.tag == "id" and event == "end":
        id = elem.text
        elem.clear()
    if elem.tag == "cpid" and event == "end":
        cpid = elem.text
        elem.clear()
    if elem.tag == "name" and event == "end":
        name = elem.text
        elem.clear()
    teamid = TEAMID
    # Keep only users who belong to the team of interest.
    if elem.tag == "teamid" and event == "end":
        if elem.text == TEAMID:
            cnt = cnt + 1
            dic[id] = {"Name": name, "CPID": cpid, "TC": tc, "AC": ac}
        elem.clear()
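One possibility I have not been able to confirm is that this particular file arrives compressed more than once, for example if the server applies a gzip Content-Encoding on top of the already gzipped file and `response.raw.read()` saves those bytes undecoded. Under that assumption, a single `gzip.open` pass would leave gzip data rather than XML in the output file. A minimal sketch of peeling off repeated gzip layers in memory is below; `peel_gzip` and `max_layers` are hypothetical names, and the double-compressed sample is made up for the demo.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of every gzip stream


def peel_gzip(data, max_layers=5):
    """Decompress repeatedly while the data still starts with the gzip magic number."""
    layers = 0
    while data.startswith(GZIP_MAGIC):
        if layers == max_layers:
            raise ValueError("data is still gzip-compressed after max_layers passes")
        data = gzip.decompress(data)
        layers += 1
    return data


# Demo with a made-up payload that has been compressed twice.
double_compressed = gzip.compress(gzip.compress(b"<users/>"))
xml_bytes = peel_gzip(double_compressed)
print(xml_bytes)
```

This reads the whole file into memory, which is simpler than streaming but may not suit very large exports.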