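I'm trying to load a dataset into pandas but can't seem to get past step 1. I'm new to this, so please forgive me if this is obvious; I searched previous threads but didn't find an answer. The data is mostly Chinese characters, which may be the problem.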

The .csv is very large and can be found here: http://weiboscope.jmsc.hku.hk/datazip/ I'm trying with week 1.

In the code below, I've marked the 3 decodings I attempted, including an attempt to find out which encoding the file uses.

import pandas
import chardet
import os


# This is what I tried first
data = pandas.read_csv('week1.csv', encoding="utf-8")

# Spits out error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9a in position 69: invalid start byte

# Code to check the encoding -- this spits out ascii (it only samples the first 32 bytes)
n_bytes = min(32, os.path.getsize('week1.csv'))
with open('week1.csv', 'rb') as f:
    raw = f.read(n_bytes)
chardet.detect(raw)

#so i tried this! it also fails, which isn't that surprising since i don't know how you'd do chinese chars in ascii anyway
data = pandas.read_csv('week1.csv', encoding="ascii")

#spits out error: UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 0: ordinal not in range(128)

#for god knows what reason this allows me to load data into pandas, but definitely not correct encoding because when I print out first 5 lines its gibberish instead of Chinese chars
data = pandas.read_csv('week1.csv', encoding="latin1")

Any help would be greatly appreciated!

EDIT: The answer provided by @Kristof does work, as does the program my colleague put together yesterday:

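import pandas as pd

def clean_weiboscope(file, nrows=0):
    """Load the CSV, silently dropping bytes that are not valid UTF-8 (nrows=0 reads everything)."""
    headers, res = [], []
    with open(file, 'r', encoding='utf-8', errors='ignore') as f:
        for i, row in enumerate(f):
            row = row.rstrip('\r\n')
            if nrows > 0 and i > nrows:
                break
            if i == 0:
                headers = row.split(',')
            else:
                # Naive split: does not handle quoted fields that contain commas
                res.append(tuple(row.split(',')))
    df = pd.DataFrame(res)
    # Use the header row as column names when its length matches the data
    if len(headers) == df.shape[1]:
        df.columns = headers
    return df

my_df = clean_weiboscope('week1.csv', nrows=0)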

For future searchers, I'd also like to add that this is the Weiboscope open data from 2012.

1 Answer

There seems to be something seriously wrong with the input file. There are encoding errors throughout.

One thing you can do is read the CSV file as binary, decode the byte string, and replace the erroneous characters.
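To make "decode and replace" concrete, here is a minimal sketch on a made-up byte string (the sample bytes are an assumption for illustration, not taken from week1.csv). It compares the default strict decoding with the `replace` and `ignore` error handlers:

```python
# Hypothetical sample: valid UTF-8 for "微博", then a stray 0x9a byte, then ASCII.
raw = "微博".encode("utf-8") + b"\x9a" + b",ok"

try:
    raw.decode("utf-8")  # strict mode is the default and raises on the bad byte
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

replaced = raw.decode("utf-8", errors="replace")  # bad byte becomes U+FFFD
ignored = raw.decode("utf-8", errors="ignore")    # bad byte is dropped silently

print(strict_failed)
print(replaced)
print(ignored)
```

With `replace` you can still see where the damage was; with `ignore` the bad bytes vanish without a trace, so which one you want depends on whether you need to find the broken rows later.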

Example (source for the chunk-reading code):

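import codecs
from functools import partial

import pandas as pd

in_filename = 'week1.csv'
out_filename = 'repaired.csv'
chunksize = 100 * 1024 * 1024  # read 100 MB at a time

# Decode as UTF-8 and replace invalid bytes with U+FFFD. The incremental decoder
# keeps a multi-byte character intact when it straddles two chunks.
decoder = codecs.getincrementaldecoder('utf-8')(errors='replace')
with open(in_filename, 'rb') as in_file:
    with open(out_filename, 'w', encoding='utf-8', newline='') as out_file:
        for byte_fragment in iter(partial(in_file.read, chunksize), b''):
            out_file.write(decoder.decode(byte_fragment))
        # Flush any bytes still buffered at the end of the file
        out_file.write(decoder.decode(b'', final=True))

# Now read the repaired file into a dataframe
df = pd.read_csv(out_filename)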

df.shape
>> (4790108, 11)

df.head()

Sample output
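One thing to watch with chunked decoding: a fixed-size read can cut a multi-byte UTF-8 character in half, and decoding each chunk independently then reports an error that is not really in the file. A small sketch of the effect, using a made-up four-character string and an artificial split point (both are assumptions for illustration):

```python
import codecs

data = "微博数据".encode("utf-8")  # 4 characters, 3 bytes each
chunks = [data[:4], data[4:]]     # the split falls inside the second character

# Decoding each chunk on its own corrupts the character at the boundary.
naive = "".join(c.decode("utf-8", errors="replace") for c in chunks)

# An incremental decoder buffers the incomplete bytes until the next chunk arrives.
decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
safe = "".join(decoder.decode(c) for c in chunks) + decoder.decode(b"", final=True)

print(naive)
print(safe)
```

With 100 MB chunks this affects at most one character per boundary, so it is a minor issue for this file, but the incremental decoder avoids it entirely. If I remember correctly, pandas 1.3 and later also accept `encoding_errors='replace'` in `read_csv`, which may make the intermediate file unnecessary.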

answered 2016-08-03T13:11:37.173