I have a folder of XML files that I would like to parse. I need to get text out of the elements of these files. They will be collected and printed to a CSV file where the elements are listed in columns.
I can actually do this right now for some of my files. That is, for many of my XML files, the process goes fine, and I get the output I want. The code that does this is:
import os, re, csv, string, operator
import xml.etree.cElementTree as ET
import codecs

def parseEO(doc):
    #getting the basic structure
    tree = ET.ElementTree(file=doc)
    root = tree.getroot()
    agencycodes = []
    rins = []
    titles = []
    elements = [agencycodes, rins, titles]
    #pulling in the text from the fields
    for elem in tree.iter():
        if elem.tag == "AGENCY_CODE":
            agencycodes.append(int(elem.text))
        elif elem.tag == "RIN":
            rins.append(elem.text)
        elif elem.tag == "TITLE":
            titles.append(elem.text)
    with open('parsetest.csv', 'w') as f:
        writer = csv.writer(f)
        writer.writerows(zip(*elements))

parseEO('EO_file.xml')
However, on some versions of the input file, I get the infamous error:
'ascii' codec can't encode character u'\x97' in position 32: ordinal not in range(128)
The full traceback is:
---------------------------------------------------------------------------
UnicodeEncodeError Traceback (most recent call last)
<ipython-input-15-28d095d44f02> in <module>()
----> 1 execfile(r'/parsingtest.py') # PYTHON-MODE
/Users/ian/Desktop/parsingtest.py in <module>()
91 writer.writerows(zip(*elements))
92
---> 93 parseEO('/EO_file.xml')
94
95
/parsingtest.py in parseEO(doc)
89 with open('parsetest.csv', 'w') as f:
90 writer = csv.writer(f)
---> 91 writer.writerows(zip(*elements))
92
93 parseEO('/EO_file.xml')
UnicodeEncodeError: 'ascii' codec can't encode character u'\x97' in position 32: ordinal not in range(128)
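For what it's worth, the same error can be reproduced without any XML or CSV involved, which makes me think the failure is in an implicit conversion to ASCII at write time rather than in the parsing itself. A minimal sketch (the sample string is made up; only the u'\x97' character comes from my error message):

```python
# Minimal reproduction, independent of ElementTree and csv.
# The surrounding text is a hypothetical stand-in for one of my TITLE values.
text = u'Some title \x97 with a stray character'

try:
    text.encode('ascii')          # what I assume happens implicitly on write
    ascii_ok = True
except UnicodeEncodeError as exc:
    ascii_ok = False
    failing_position = exc.start  # index of the first character ASCII cannot represent

# Encoding the same text explicitly as UTF-8 does not raise.
utf8_bytes = text.encode('utf-8')
```

Here the ASCII encode fails at the position of the u'\x97' character, while the UTF-8 encode goes through.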
I am fairly confident from reading the other threads that the problem is in the codec being used (and, you know, the error is pretty clear on that as well). However, the solutions I have read haven't helped me (emphasized because I understand I am the source of the problem, not the way people have answered in the past).
Several responses (such as: this one and this one and this one) don't deal directly with ElementTree, and I'm not sure how to translate the solutions into what I'm doing.
Other solutions that do deal with ElementTree (such as: this one and this one) are either using a short string (the first link here) or are using the .tostring/.fromstring methods in ElementTree which I do not. (Though, of course, perhaps I should be.)
Things I have tried that didn't work:
I have attempted to bring in the file via UTF-8 encoding:
infile = codecs.open('/EO_file.xml', encoding="utf-8")
parseEO(infile)
but I think the ElementTree process already understands it to be UTF-8 (which is noted in the first line of all the XML files I have), and so this is not only not correct, but is actually redundantly bad all over again.
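A small sketch of why I believe the parser already handles the encoding on its own: given raw bytes with a UTF-8 declaration, it hands back decoded text without any help from codecs. The sample document is hypothetical, and I am using xml.etree.ElementTree rather than cElementTree here on the assumption that they behave the same way for this.

```python
import xml.etree.ElementTree as ET

# Hypothetical one-element document; \xc2\x97 is the UTF-8 byte sequence for u'\x97'.
raw = b'<?xml version="1.0" encoding="UTF-8"?><TITLE>Rule \xc2\x97 draft</TITLE>'

title = ET.fromstring(raw)

# The parser has already decoded the bytes into text by this point.
decoded_correctly = (title.text == u'Rule \x97 draft')
```

If that holds for my real files too, the text is already decoded correctly by the time it reaches my lists, and the trouble only starts later.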
I attempted to declare an encoding process within the loop, replacing:
tree = ET.ElementTree(file=doc)
with
parser = ET.XMLParser(encoding="utf-8")
tree = ET.parse(doc, parser=parser)
in the code above that does work. This didn't work for me either. The same files that worked before still worked, and the same files that created the error still created the error.
There have been a lot of other random attempts, but I won't belabor the point.
So, while I assume the code I have is both inefficient and offensive to good programming style, it does do what I want for several files. I am trying to understand if there is simply an argument I'm missing that I don't know about, if I should somehow pre-process the files (I have not identified where the offending character is, but do know that u'\x97' translates to a control character of some kind), or some other option.
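To narrow down where the offending character lives, something like the following sketch could be run over a failing file. The helper name find_non_ascii and the sample document are hypothetical, and it uses xml.etree.ElementTree rather than cElementTree on the assumption that the two behave the same for this purpose.

```python
import io
import xml.etree.ElementTree as ET

def find_non_ascii(source):
    """Return (tag, index, character) for each non-ASCII character in element text.

    `source` may be a filename or an open binary file object.
    """
    hits = []
    tree = ET.parse(source)
    for elem in tree.iter():
        if elem.text is None:
            continue  # elements without text have nothing to check
        for index, char in enumerate(elem.text):
            if ord(char) > 127:
                hits.append((elem.tag, index, char))
    return hits

# Hypothetical sample standing in for one of the failing files.
sample = io.BytesIO(
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<ROOT><TITLE>Rule \xc2\x97 draft</TITLE><RIN>1234-AB56</RIN></ROOT>'
)
hits = find_non_ascii(sample)
```

Calling find_non_ascii('EO_file.xml') on a real file should then list which elements contain characters outside ASCII and where in the text they sit.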