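Using the namedtuple documentation example as my template in Python 3.3, I have the following code to download a CSV file and turn it into a series of namedtuple subclass instances: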

from collections import namedtuple
from csv import reader
from urllib.request import urlopen    

SecurityType = namedtuple('SecurityType', 'sector, name')

url = 'http://bsym.bloomberg.com/sym/pages/security_type.csv'
for sec in map(SecurityType._make, reader(urlopen(url))):
    print(sec)

This raises the following exception:

Traceback (most recent call last):
  File "scrap.py", line 9, in <module>
    for sec in map(SecurityType._make, reader(urlopen(url))):
_csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)

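I know the problem is that urlopen returns bytes rather than strings, and that I need to decode the output at some point. Here is how I'm doing it now, using StringIO: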

from collections import namedtuple
from csv import reader
from urllib.request import urlopen
import io

SecurityType = namedtuple('SecurityType', 'sector, name')

url = 'http://bsym.bloomberg.com/sym/pages/security_type.csv'
reader_input = io.StringIO(urlopen(url).read().decode('utf-8'))

for sec in map(SecurityType._make, reader(reader_input)):
    print(sec)

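This feels odd, because I'm basically iterating over the bytes buffer, decoding, re-buffering, and then iterating over the new string buffer. Is there a more Pythonic way to do this without the two iterations?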

1 Answer

Use io.TextIOWrapper() to decode the urllib response:

reader_input = io.TextIOWrapper(urlopen(url), encoding='utf8', newline='')

Now csv.reader is passed exactly the same interface it would get when opening a regular file on the filesystem in text mode.
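
For anyone who wants to try the wrapper without hitting the network, here is a minimal sketch. It assumes io.BytesIO as a stand-in for the urlopen() response (both are binary file-like objects), and the two sample rows are made up for illustration:

```python
import csv
import io
from collections import namedtuple

SecurityType = namedtuple('SecurityType', 'sector, name')

# Assumption: BytesIO stands in for urlopen(url); the sample bytes are illustrative only.
raw = io.BytesIO(b'Market Sector,Security Type\r\nComdty,Calendar Spread Option\r\n')

# The wrapper decodes on demand as csv.reader pulls lines from it;
# newline='' leaves line-ending handling to the csv module.
text = io.TextIOWrapper(raw, encoding='utf-8', newline='')

rows = [SecurityType._make(row) for row in csv.reader(text)]
print(rows)
```

The decoding happens as the reader consumes the stream, so the whole payload is not read and re-buffered up front the way the StringIO version does it.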

With this change, your example URL works for me on Python 3.3.1:

>>> for sec in map(SecurityType._make, reader(reader_input)):
...     print(sec)
... 
SecurityType(sector='Market Sector', name='Security Type')
SecurityType(sector='Comdty', name='Calendar Spread Option')
SecurityType(sector='Comdty', name='Financial commodity future.')
SecurityType(sector='Comdty', name='Financial commodity generic.')
SecurityType(sector='Comdty', name='Financial commodity option.')
...
SecurityType(sector='Muni', name='ZERO COUPON, OID')
SecurityType(sector='Pfd', name='PRIVATE')
SecurityType(sector='Pfd', name='PUBLIC')
SecurityType(sector='', name='')
SecurityType(sector='', name='')
SecurityType(sector='', name='')
SecurityType(sector='', name='')
SecurityType(sector='', name='')
SecurityType(sector='', name='')
SecurityType(sector='', name='')
SecurityType(sector='', name='')
SecurityType(sector='', name='')

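The last several lines appear to yield empty tuples; the original file does indeed have a few lines with nothing but a single comma on them.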
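
If those blank records are unwanted, one possible way to drop them is to filter out rows whose fields are all empty before building the tuples. This is only a sketch: the sample text below is made up to mimic the tail of the file, and io.StringIO stands in for the decoded response.

```python
import csv
import io
from collections import namedtuple

SecurityType = namedtuple('SecurityType', 'sector, name')

# Assumption: made-up sample mimicking the end of the file, including comma-only lines.
sample = io.StringIO('Pfd,PRIVATE\r\nPfd,PUBLIC\r\n,\r\n,\r\n', newline='')

# any(row) is False when every field is an empty string (or the row itself is empty).
securities = [SecurityType._make(row) for row in csv.reader(sample) if any(row)]
print(securities)
```

The same `if any(row)` filter should work unchanged on the TextIOWrapper-based reader.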

Answered 2013-05-04T13:55:21.730