python - How do I correctly decode a garbled UTF-8 string?

Question

I am trying to read IPTC data using Python and pyexiv2.

import pyexiv2
image = pyexiv2.Image('test.jpg')
image.readMetadata()
print image['Iptc.Application2.Caption']

This gives me the following:

Copyright: Michael Huebner, Kontakt: +4915100000000xxxxxx Höxx (30) ist im Streit mit dem Arbeitsamt in Brandenburg, xxxxxxxxxxxxxx , xxxxxx,

But it should give me:

Kinder: Axxxxx Hxxxxx (10) und Exxxxxx Höxx (5), Rxxxxxxx Höxx (30) ist im Streit mit dem Arbeitsamt in Brandenburg, xxxxxxxxxxxxx , xxxxxxxxxxx, 
Copyright: Michael Huebner, Kontakt: +4915100000000

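It is a bit messy because I had to remove personal data, but you can see what is happening: the "line break" makes the last part overwrite the first part of the string.

But now it gets strange: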

for i in str(image['Iptc.Application2.Caption']):
  print i,

This prints all the characters in the correct order, including the line break. But it garbles the umlaut characters.

This:

unicode(image['Iptc.Application2.Caption'])

gives me:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 41: ordinal not in range(128)

So how can I have both: the umlauts and the correct string order? How can I repair this string?

score 1 · Accepted Answer

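Your data uses a different line separator convention than the one you expect. This is not really a UTF-8-specific problem. The lines are separated by a carriage return (\r), which moves the terminal cursor back to the start of the line, so the second line is printed over the first.

You can split the lines using str.splitlines(); it recognizes \r as a line separator. You can then rejoin the lines with \n: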

>>> sample = 'line 1\rline 2'
>>> print sample
line 2
>>> sample.splitlines()
['line 1', 'line 2']
>>> print '\n'.join(sample.splitlines())
line 1
line 2
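As for the UnicodeDecodeError: in Python 2, calling unicode() on a byte string without an encoding decodes it as ASCII, which fails on the first umlaut byte (0xc3). The following is a minimal Python 3 sketch that combines both fixes, assuming the caption bytes really are UTF-8. The `raw` sample and the `normalize_caption` helper are made up for illustration and are not part of pyexiv2.

```python
# Hypothetical sample standing in for the raw caption bytes:
# UTF-8 encoded text whose lines are separated by a bare carriage return.
raw = 'Kinder: H\u00f6xx (5)\rCopyright: Michael Huebner'.encode('utf-8')


def normalize_caption(data, encoding='utf-8'):
    """Decode bytes to text and rejoin the lines with a line feed.

    The default encoding is an assumption; pass another one if the
    metadata was written differently.
    """
    text = data.decode(encoding)
    # splitlines() treats \r, \n and \r\n (among others) as line boundaries.
    return '\n'.join(text.splitlines())


caption = normalize_caption(raw)
print(repr(caption))  # repr makes any remaining control characters visible
print(caption)
```

Note that in Python 3, str.splitlines() also splits on a few less common boundaries (form feed, for example), and the split-and-join round trip drops a trailing line break. If that matters, text.replace('\r\n', '\n').replace('\r', '\n') is a narrower alternative.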
