I am trying to import a CSV file in pandas, but it raises an error. The data, as opened in Notepad++, looks like this; the first row contains the column names:

"End Customer Organization ID,End Customer Organization Name,End Customer Top Parent Organization ID,End Customer Top Parent Organization Name,Reseller Top Parent ID,Reseller Top Parent Name,Business,Rev Sum Division,Rev Sum Category,Product Family,Version,Pricing Level,Summary Pricing Level,Detail Pricing Level,MS Sales Amount,MS Sales Licenses,Fiscal Year,Sales Date"
"11027676,Baroda Western Uttar Pradesh Gramin Bankgfhgfnjgfnmjmhgmghmghmghmnghnmghnmhgnmghnghngh,4078446,Bank Of Barodadfhhgfjyjtkyukujkyujkuhykluiluilui;iooi';po'fserwefvegwegf,1809012,""Hcl Infosystems Ltd - Partnerdghftrutyhb frhywer5y5tyu6ui7iukluyj,lgjmfgnhfrgweffw"",Server & CALsdgrgrfgtrhytrnhjdgthjtyjkukmhjmghmbhmgfngdfbndfhtgh,SQL Server & CALdfhtrhtrgbhrghrye5y45y45yu56juhydsgfaefwe,SQL CALdhdfthtrutrjurhjethfdehrerfgwerweqeadfawrqwerwegtrhyjuytjhyj,SQL CALdtrye45y3t434tjkabcjkasdhfhasdjkcbaksmjcbfuigkjasbcjkasbkdfhiwh,2005,Openfkvgjesropiguwe90fujklascnioawfy98eyfuiasdbcvjkxsbhg,Open Lklbjdfoigueroigbjvwioergyuiowerhgosdhvgfoisdhyguiserhguisrh,""Open Stddfm,vdnoghioerivnsdflierohgushdfovhsiodghuiohdbvgsjdhgouiwerho"",125.85,1,FY07,12/28/2006"
"12835756,Uttam Strips Pvt Ltd,12835756,Uttam Strips Pvt Ltd,12565538,Redington C/O Fortis Financial Services Ltd,MBS,Dynamics ERP,Dynamics NAV,Dynamics NAV Business Essentials,Non-specific,Other,MBS SA,MBS New Customer Enhanc. Def,0,0,FY09,9/15/2008"
"12233135,Bhagwan Singh Tondon,12233135,Bhagwan Singh Tondon,2652941,H B S Systems Pvt Ltd,Server & CAL,SQL Server & CAL,SQL CAL,SQL CAL,Non-specific,Open,Open L&SA,Deferred Open L&SA - New,0,0,FY09,9/15/2008"
"11602305,Maya Academy Of Advanced Cinematics,9750934,Maya Entertainment Ltd,336146,Embee Software Pvt Ltd,Server & CAL,Windows Server & CAL,Windows Server HPC,Windows Compute Cluster Server,Non-specific,Open,Open V/MYO - Rec,OLV Perpet L&SA Recur-Def,0,0,FY09,9/25/2008"
"13336009,Remiel Softech Solution Pvt Ltd,13336009,Remiel Softech Solution Pvt Ltd,13335482,Redington C/O Remiel Softech Solutions Pvt Ltd,MBS,Dynamics ERP,Dynamics NAV,Dynamics NAV Business Essentials,Non-specific,Other,MBS SA,MBS New Customer Enhanc. Def,0,0,FY09,12/23/2008"
"7872800,Science Application International Corporation,2839760,GOVERNMENT OF KARNATAKA,10237455,Cubic Computing P.L,Server & CAL,SQL Server & CAL,SQL Server Standard,SQL Server Standard Edition,Non-specific,Open,Open SA/UA,Deferred Open SA - Renewal,0,0,FY09,1/15/2009"
"13096361,Pratham Software Pvt Ltd,13096361,Pratham Software Pvt Ltd,10133086,Krap Computer,Information Worker,Office,Office Standard / Basic,Office Standard,2007,Open,Open L,Open Std,7132.44,28,FY09,9/24/2008"
"12192276,Texmo Precision Castings,12192276,Texmo Precision Castings,4059430,Quadra Systems. - Partner,Server & CAL,Windows Server & CAL,Windows Standard Server,Windows Server Standard,Non-specific,Open,Open L&SA,Deferred Open L&SA - New,0,0,FY09,11/15/2008"

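Note that when I open the same file in Excel by double-clicking it, it opens as comma-separated values, but without the quotes around each row that Notepad++ shows.

I first used UTF-8 as the encoding, which gave the following error: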

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x91 in position 13: invalid start byte
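
For reference, byte 0x91 is the left curly single quote in Windows-1252, and it is never a valid first byte of a UTF-8 sequence, which would explain this error. A minimal sketch with made-up bytes (not taken from the real file):

```python
# Hypothetical sample: a word wrapped in Windows-1252 "smart" single quotes.
raw = b"\x91quoted\x92"

try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    # 0x91 cannot start a UTF-8 sequence, so decoding fails here.
    print(exc)

# cp1252 maps 0x91/0x92 to the curly quotes U+2018/U+2019.
print(raw.decode("cp1252"))        # ‘quoted’

# latin1 never fails, but maps the same bytes to C1 control characters.
print(repr(raw.decode("latin1")))  # '\x91quoted\x92'
```

So latin1 not raising does not by itself mean the text was decoded as intended; cp1252 is the likelier match for a file containing 0x91.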

I then tried encoding='cp1252', and after that latin1:

df=pd.read_csv(filename,encoding='cp1252') 

or 

df=pd.read_csv(filename,encoding='latin1')

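Neither encoding gave an error, and the data was imported, but as a single column instead of separate columns.

Is this related to the "" marks that appear around every row in the data? I have a similar CSV file with comma-separated values but without double quotes around each row, and it imports correctly with both cp1252 and latin1. It does not import with UTF-8, even though the file was saved as UTF-8 in Notepad++. In this case, UTF-8 fails as usual, and the other two encodings import the data as a single column.

Please advise.

Thanks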

1 Answer

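I'm fairly sure the quotes are the cause: because each row is wrapped in double quotes, the parser treats the whole row as one quoted field, and the commas inside it as literal characters instead of separators. So you need to strip those outer quotes. That is relatively straightforward, but because Unicode issues can drive you mad, I suggest reading the file in, stripping the quotes, and writing the result out to a file to use with read_csv (this simplifies the encoding issues).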
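
The effect can be reproduced with the standard csv module. This is only a sketch: the sample line is a shortened, made-up row shaped like the ones in the question.

```python
import csv

# One row wrapped in outer quotes, with an inner quoted field that contains a comma.
line = '"11027676,""Hcl Infosystems Ltd - Partner,Extra"",125.85"'

# The parser sees the outer quotes and returns the whole row as a single field.
outer = next(csv.reader([line]))
print(len(outer))  # 1

# Parsing that single field a second time yields the real columns.
inner = next(csv.reader([outer[0]]))
print(inner)  # ['11027676', 'Hcl Infosystems Ltd - Partner,Extra', '125.85']
```

Note that the first pass also turns the doubled inner quotes ("") back into single quotes, which is why the second pass keeps the embedded comma inside one field.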

Here's how to strip the quotes, write the result to a new file, and then read it in with read_csv:

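import pandas as pd

# Open both files in text mode with an explicit encoding (cp1252 worked for the asker).
with open(filename, encoding='cp1252') as infile, \
        open(tmpfile, 'w', encoding='cp1252') as outfile:
    for line in infile:
        # Drop the newline first, or strip('"') never reaches the trailing quote;
        # then turn the doubled inner quotes ("") back into single ones.
        outfile.write(line.rstrip('\r\n').strip('"').replace('""', '"') + '\n')

result = pd.read_csv(tmpfile, encoding='cp1252')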

You'll also want to delete the temporary file once you have read it.
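
One way to make sure the cleanup always happens is a try/finally around the read. A minimal sketch, assuming every row is wrapped in one pair of outer quotes; with_stripped_copy and its consume parameter are hypothetical names used for illustration only:

```python
import os
import tempfile


def with_stripped_copy(filename, consume, encoding="cp1252"):
    """Write a quote-stripped copy of `filename`, pass its path to `consume`,
    then delete the copy. Hypothetical helper, not part of pandas."""
    fd, tmp_path = tempfile.mkstemp(suffix=".csv")
    os.close(fd)  # only the path is needed; reopen it in text mode below
    try:
        with open(filename, encoding=encoding) as infile, \
                open(tmp_path, "w", encoding=encoding) as outfile:
            for line in infile:
                # Remove the newline, the outer quotes, and the quote doubling.
                row = line.rstrip("\r\n").strip('"').replace('""', '"')
                outfile.write(row + "\n")
        return consume(tmp_path)
    finally:
        os.remove(tmp_path)  # runs even if reading raises
```

With pandas this would be called as with_stripped_copy(filename, lambda p: pd.read_csv(p, encoding='cp1252')).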

The reason I suggest doing it as above is that you avoid dealing with encoding/decoding when passing to a StringIO buffer, which can be finicky in both Python and pandas.
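
For comparison, on Python 3 the in-memory route is usually less troublesome, because open() with an explicit encoding hands back already-decoded text. A minimal sketch, assuming every row (including the header) is wrapped in exactly one extra pair of quotes; read_wrapped_csv is a hypothetical helper, not a pandas function:

```python
import csv
import io

import pandas as pd


def read_wrapped_csv(path, encoding="cp1252"):
    """Read a CSV whose every row is wrapped in one extra pair of quotes.

    Hypothetical helper for illustration; not part of pandas.
    """
    # newline="" lets the csv module handle line endings itself.
    with open(path, encoding=encoding, newline="") as infile:
        # Each physical row parses as a single quoted field; csv.reader also
        # turns the doubled quotes ("") inside it back into single quotes.
        unwrapped = [row[0] for row in csv.reader(infile) if row]
    # Hand the unwrapped rows to pandas as ordinary CSV text.
    return pd.read_csv(io.StringIO("\n".join(unwrapped)))
```

This skips the temporary file entirely, at the cost of holding the whole file in memory as a string before pandas parses it.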

Answered 2013-10-21T05:02:37.520