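I'm ingesting data into my Postgres database from CSV files via AWS, and I've run into a problem: some of the data is not UTF-8 encoded. I want to identify which rows in my data are causing the problem so that I can fix them at the source.
I've been trying to get what I need from chardet, but I can't get it to report the encoding line by line. I also tried the code below which, like chardet, tells me whether the whole file can be read with a given encoding, but not which lines are causing the problem: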
import codecs

encodings = ['utf-8', 'windows-1250', 'windows-1252']  # add more
for e in encodings:
    try:
        # Decode the whole file with this encoding; any bad byte raises.
        fh = codecs.open('filename.csv', 'r', encoding=e)
        fh.readlines()
        fh.seek(0)
    except UnicodeDecodeError:
        print('got unicode error with %s , trying different encoding' % e)
    else:
        # No error: the entire file decodes with this encoding.
        print('opening the file with encoding: %s ' % e)
        break
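What I'm after is closer to the rough sketch below: read the file in binary mode and try to decode each physical line on its own, collecting the ones that fail. The function name `find_undecodable_lines` and its return shape are made up for illustration, and it assumes every row sits on a single line (a quoted CSV field containing a newline would be split across lines here).

```python
def find_undecodable_lines(path, encoding="utf-8"):
    """Return (line_number, byte_offset, raw_bytes) for each line that fails to decode.

    Hypothetical helper: line numbers are 1-based, and byte_offset is the
    position within the line of the first byte the codec rejected.
    """
    bad = []
    # Binary mode: nothing is decoded while reading, so reading itself cannot fail.
    with open(path, "rb") as fh:
        for lineno, raw in enumerate(fh, start=1):
            try:
                raw.decode(encoding)
            except UnicodeDecodeError as exc:
                # exc.start is the index of the first offending byte in `raw`.
                bad.append((lineno, exc.start, raw))
    return bad
```

Calling `find_undecodable_lines('filename.csv')` and printing each tuple would then show the line number and raw bytes of every row that is not valid UTF-8.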
Any help is appreciated!
Sample text that causes the problem:
Aĺi Hùssaini Buķar Falmatàmi Mohammad Bùlama Mùstapaha Maiďugu Shu"ibu Aĺi Àdamu Ja"o Khaìcalla Makanikì Alì Mòhammad Zaŕami Dànkabo Kelĺumi Umàra Goŕoma Gaptomì Àli Àhmed Àbdullahi Shafi"u Mohammed! Hassañ Aùwal Usaìni Mohàmmed Goniķaka Abdullé ÑGABGL17401598 MUSA ĶAUMI NGAMCH17051ĺ535 NGBOJEGB1708ààaaÅQààp NGYOGJBY3215` NGBOJEAG1709Ź ÙNGBOKD1T17090240 ÑGBOMDMK17100381 ÑGBOMDMK17100382 ÑGBOMDMK17100383 ÑGBOMDMK17100384 ÑGBOMDMK17100385 ÑGBOMDMK17100387 ÑGBOMDMK17100388 ÑGBOMDMK17100389 ÑGBOMDMK17100390 ÑGBOMDMK17100392 ÑGBOMDMK17100393 ÑGBOMDMK17100394 ÑGBOMDMK17100395 ÑGBOMDMK17100396 ÑGBOMDMK17100397 ÑGBOMDMK17100398 ÑGBOMDMK17100399 ÑGBOMDMK17100400 ÑGBOMDMK17100401 ÑGBOMDMK17100402 ÑGBOMDMK17100403 ÑGBOMDMK17100419 Yyģggghyuuiiiuyttttrrrrrrŕ NĢBÒJEGM17100245 NĢBÒJEGM17100479 NĢBÒJEGM17100493 NĢBÒJEGM17100495 NĢBÒJEGM17100524 NĢBÒJEGM17100525 ÑGYOGJGJ122112 ÑGYOYFKG3824 Ngyoýfmy4736 NGBOJEFC1804aaà NGBOJEFC1804à8131 NGYÒGDAKW0717 NGYÒGDAK20609 NGBÒMMST19056545 NGBOMDNY88J00233!! ÀNGYOGDAK21907436 NGBODAAC19110390]