I have a folder of XML files that I would like to parse. I need to get text out of the elements of these files. They will be collected and printed to a CSV file where the elements are listed in columns.

I can actually do this right now for some of my files. That is, for many of my XML files, the process goes fine, and I get the output I want. The code that does this is:

import os, re, csv, string, operator
import xml.etree.cElementTree as ET
import codecs
def parseEO(doc):
    #getting the basic structure
    tree = ET.ElementTree(file=doc)
    root = tree.getroot()
    agencycodes = []
    rins = []
    titles =[]
    elements = [agencycodes, rins, titles]
    #pulling in the text from the fields
    for elem in tree.iter():
        if elem.tag == "AGENCY_CODE":
            agencycodes.append(int(elem.text))
        elif elem.tag == "RIN":
            rins.append(elem.text)
        elif elem.tag == "TITLE":
            titles.append(elem.text)
    with open('parsetest.csv', 'w') as f:
        writer = csv.writer(f)
        writer.writerows(zip(*elements))


parseEO('EO_file.xml')     

However, on some versions of the input file, I get the infamous error:

'ascii' codec can't encode character u'\x97' in position 32: ordinal not in range(128)

The full traceback is:

    ---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-15-28d095d44f02> in <module>()
----> 1 execfile(r'/parsingtest.py') # PYTHON-MODE

/Users/ian/Desktop/parsingtest.py in <module>()
     91         writer.writerows(zip(*elements))
     92 
---> 93 parseEO('/EO_file.xml')
     94 
     95 

/parsingtest.py in parseEO(doc)
     89     with open('parsetest.csv', 'w') as f:
     90         writer = csv.writer(f)
---> 91         writer.writerows(zip(*elements))
     92 
     93 parseEO('/EO_file.xml')
UnicodeEncodeError: 'ascii' codec can't encode character u'\x97' in position 32: ordinal not in range(128)

I am fairly confident from reading the other threads that the problem is in the codec being used (and, you know, the error is pretty clear on that as well). However, the solutions I have read haven't helped me (emphasized because I understand I am the source of the problem, not the way people have answered in the past).

Several responses (such as: this one and this one and this one) don't deal directly with ElementTree, and I'm not sure how to translate the solutions into what I'm doing.

Other solutions that do deal with ElementTree (such as: this one and this one) are either using a short string (the first link here) or are using the .tostring/.fromstring methods in ElementTree which I do not. (Though, of course, perhaps I should be.)

Things I have tried that didn't work:

  1. I have attempted to bring in the file via UTF-8 encoding:

    infile = codecs.open('/EO_file.xml', encoding="utf-8")
    parseEO(infile)
    

    but I think the ElementTree process already understands it to be UTF-8 (which is noted in the first line of all the XML files I have), so this is not only incorrect but also redundant.

  2. I attempted to declare an encoding process within the loop, replacing:

    tree = ET.ElementTree(file=doc)
    

    with

    parser = ET.XMLParser(encoding="utf-8")
    tree = ET.parse(doc, parser=parser)
    

    in the loop above that does work. This didn't work for me either. The same files that worked before still worked, the same files that created the error still created the error.

There have been a lot of other random attempts, but I won't belabor the point.

So, while I assume the code I have is both inefficient and offensive to good programming style, it does do what I want for several files. I am trying to understand if there is simply an argument I'm missing that I don't know about, if I should somehow pre-process the files (I have not identified where the offending character is, but do know that u'\x97' translates to a control character of some kind), or some other option.

2 Answers

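You are parsing XML; the XML API gives you unicode values. You then try to write that unicode data to a CSV file without encoding it first. Python then tries to encode it for you and fails. You can see this in the traceback: it is the .writerows() call that fails, and the error tells you that encoding failed, not decoding (parsing the XML).

You need to pick an encoding, then encode the data before writing: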

for elem in tree.iter():
    if elem.tag == "AGENCY_CODE":
        agencycodes.append(int(elem.text))
    elif elem.tag == "RIN":
        rins.append(elem.text.encode('utf8'))
    elif elem.tag == "TITLE":
        titles.append(elem.text.encode('utf8'))

I used the UTF-8 encoding because it can handle any Unicode code point, but you need to make your own explicit choice.
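A small sketch of the failure and the fix in isolation, without ElementTree or csv. The sample string is hypothetical; only the u'\x97' character comes from the traceback in the question. The snippet is written to behave the same under Python 2 and Python 3.

```python
# Hypothetical TITLE text containing the character from the traceback.
text = u"Rule \x97 final"

# Encoding with the ASCII codec fails, which is what happens implicitly
# when the Python 2 csv writer is handed a unicode object.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Encoding explicitly with UTF-8 succeeds for the same text.
encoded = text.encode("utf8")
```

The explicit .encode('utf8') call is the same step applied to elem.text above; the parsing side is left untouched.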

answered 2013-06-22T23:56:19.657

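It sounds like there is a unicode character somewhere in your XML file. Unicode is different from a string encoded as UTF-8.

The Python 2.7 csv library doesn't support unicode characters, so you have to run the data through a function that encodes it before dumping it into the CSV file.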

def normalize(s):
    # Encode unicode text as UTF-8 bytes; convert anything else with str().
    if isinstance(s, unicode):
        return s.encode('utf8', 'ignore')
    else:
        return str(s)

So your code would look something like this:

for elem in tree.iter():
    if elem.tag == "AGENCY_CODE":
        agencycodes.append(int(elem.text))
    elif elem.tag == "RIN":
        rins.append(normalize(elem.text))
    elif elem.tag == "TITLE":
        titles.append(normalize(elem.text))
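One edge case worth noting: ElementTree sets elem.text to None for an element with no text, and the normalize helper above would then return the literal string 'None' via str(s). A possible variant, sketched under the assumption that an empty cell is the preferred output in that case (the name normalize_safe is hypothetical and not part of the original answer):

```python
def normalize_safe(s):
    """Return UTF-8 bytes for text, and an empty byte string for None."""
    if s is None:
        # Assumption: an element with no text should become an empty cell.
        return b""
    if isinstance(s, bytes):
        # Already a byte string (plain str on Python 2); pass it through.
        return s
    if isinstance(s, type(u"")):
        # Unicode text: encode explicitly, as in the answers above.
        return s.encode("utf8")
    # Anything else (for example an int agency code): stringify, then encode.
    return str(s).encode("utf8")
```

This is aimed at Python 2.7, as in the thread. On Python 3 the csv module accepts text directly, so the encoding step would not be needed there.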
answered 2013-06-23T00:14:19.400