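I'm trying to format a file so that it can be inserted into a database. The file is originally compressed and is about 1.3 MB in size. Each line looks like this: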

398,%7EAnoniem+001%7E,543,480,7525010,1775,0

This is what the code that parses the file looks like:

Village = gzip.open(Root+'\\data'+'\\' +str(Newest_Date[0])+'\\' +str(Newest_Date[1])+'\\' +str(Newest_Date[2])\
            +'\\'+str(Newest_Date[3])+' village.gz');
Village_Parsed = str
for line in Village:
    Village_Parsed = Village_Parsed + urllib.parse.unquote_plus(line);
print(Village.readline());

When I run the program, I get this error:

Village_Parsed = Village_Parsed + urllib.parse.unquote_plus(line);

File "C:\Python31\lib\urllib\parse.py", line 404, in unquote_plus
string = string.replace('+', ' ')
TypeError: expected an object with the buffer interface

Any idea what is wrong here? Thanks in advance for any help :)

2 Answers

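Problem 1 is that urllib.parse.unquote_plus doesn't like the line you are feeding it. The message should really be "please supply a str object" :-) I suggest that you fix problem 2 below, then insert: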

print('line', type(line), repr(line))

immediately after your for statement, so that you can see what you are getting in line.

You'll find that it yields bytes objects:

>>> [line for line in gzip.open('test.gz')]
[b'nudge nudge\n', b'wink wink\n']

Opening with mode 'r' makes no difference, since gzip.open treats 'r' as binary mode:

>>> [line for line in gzip.open('test.gz', 'r')]
[b'nudge nudge\n', b'wink wink\n']

I suggest that instead of passing line to the parsing routine, you pass line.decode('UTF-8') ... or whatever encoding was used when the gz file was written.
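
As a minimal sketch of that suggestion: the snippet below builds a small gzip stream in memory (a stand-in for the real file), decodes each bytes line, and only then unquotes it. The sample line is modeled on the one in the question, and UTF-8 is an assumption about how the file was encoded.

```python
import gzip
import io
import urllib.parse

# In-memory stand-in for the real .gz file, so the sketch is self-contained.
raw = b"398,%7EAnoniem+001%7E,543,480,7525010,1775,0\n"
buffer = io.BytesIO(gzip.compress(raw))

parsed_lines = []
with gzip.open(buffer) as village:
    for line in village:
        # Each line is a bytes object; decode it before unquoting.
        text = line.decode("utf-8")  # assumed encoding
        parsed_lines.append(urllib.parse.unquote_plus(text))

village_parsed = "".join(parsed_lines)
print(repr(village_parsed))
```

On Python 3.3 and later, gzip.open(path, 'rt', encoding='utf-8') should yield str lines directly, which would make the explicit decode unnecessary.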

Problem 2 is in this line:

Village_Parsed = str

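str is a type. You need an empty str object. To get one, you could call the type, i.e. str(), which is formally correct but impractical/unusual/risible/weird compared with using the string constant '' ... so do this: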

Village_Parsed = ''
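
To illustrate the difference, here is a small sketch (the chunk values are made up for the example): concatenating onto the type object raises a TypeError, while starting from an empty string accumulates as expected.

```python
# Concatenating onto the type object fails: str itself is not a string.
try:
    broken = str + "abc"
except TypeError as exc:
    error_message = str(exc)

# str() and '' are the same empty string; '' is the usual spelling.
assert str() == ""

# Starting from an empty string works.
village_parsed = ""
for chunk in ["398,", "~Anoniem 001~", ",543"]:  # hypothetical pieces
    village_parsed = village_parsed + chunk

print(error_message)
print(village_parsed)
```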

You also have problem 3: your last statement attempts to read from the gz file after EOF.
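
A short sketch of what that looks like, again using an in-memory gzip stream as a stand-in for the real file: once the loop has consumed every line, readline() has nothing left to return, and the stream has to be rewound before it can be read again.

```python
import gzip
import io

# In-memory stand-in for the real .gz file.
buffer = io.BytesIO(gzip.compress(b"nudge nudge\nwink wink\n"))

with gzip.open(buffer) as village:
    lines = list(village)             # consumes the whole stream
    after_eof = village.readline()    # at EOF: returns an empty bytes object
    village.seek(0)                   # rewind to the start of the data
    first_again = village.readline()  # the first line is readable again

print(lines, after_eof, first_again)
```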

Answered 2009-11-04T10:05:50.510

import gzip, os, urllib.parse

archive_relpath = os.sep.join(map(str, Newest_Date[:4])) + ' village.gz'  
archive_path = os.path.join(Root, 'data', archive_relpath)

with gzip.open(archive_path) as Village:
    Village_Parsed = ''.join(urllib.parse.unquote_plus(line.decode('ascii'))
                             for line in Village)
    print(Village_Parsed)

Output:

398,~Anoniem 001~,543,480,7525010,1775,0

NOTE: RFC 3986 - Uniform Resource Identifier (URI): Generic Syntax says:

This specification does not mandate any particular character encoding for mapping between URI characters and the octets used to store or transmit those characters. When a URI appears in a protocol element, the character encoding is defined by that protocol; without such a definition, a URI is assumed to be in the same character encoding as the surrounding text.

Therefore 'ascii' in the line.decode('ascii') fragment should be replaced by whatever character encoding you've used to encode your text.
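
The same caveat applies to the percent-escapes themselves, which stand for raw octets. As a rough illustration (the sample string is made up for this example), unquote_plus accepts an encoding argument, and the same escapes produce different text depending on which encoding is assumed:

```python
import urllib.parse

# Hypothetical input: "%C3%A9" is the UTF-8 byte sequence for "é".
quoted = "caf%C3%A9+au+lait"

as_utf8 = urllib.parse.unquote_plus(quoted, encoding="utf-8")
as_latin1 = urllib.parse.unquote_plus(quoted, encoding="latin-1")

print(as_utf8)    # escapes interpreted as UTF-8 octets
print(as_latin1)  # same octets interpreted as Latin-1
```

Only the encoding that was actually used when the data was written gives the intended text back.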

Answered 2009-11-04T10:39:05.010