python - Pandas: reading CSV files with a variable-length preamble

Question

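Hi, I am using pandas to read a series of files and concatenate them into a DataFrame. My files start with a block of junk lines of variable length that I want to ignore. pd.read_csv() has a skiprows parameter, and I wrote a function around it to handle this case, but I have to open each file twice to make it work. Is there a better way?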

HEADER = '#Start'

def header_index(file_name):
    with open(file_name) as fp:
        for ind, line in enumerate(fp):
            if line.startswith(HEADER):
                return ind

for row in directories:
    path2file = '%s%s%s' % (path2data, row, suffix)
    myDF = pd.read_csv(path2file, skiprows=header_index(path2file), header=0, delimiter='\t')

Any help would be much appreciated.

score 0 · Accepted Answer

This is now possible (I don't know whether it was back then), as follows:

with open(path2file) as fp:
    pos = 0
    oldpos = None

    while pos != oldpos:  # stop once the position no longer advances, i.e. at EOF
        line = fp.readline()
        if line.startswith(HEADER):
            # rewind to the start of this line so pandas can read the header
            fp.seek(pos)
            break
        oldpos = pos
        pos = fp.tell()  # remember this position as the start of the next line

    # fp is now positioned at the header line (or at EOF if it was never found)
    myDF = pd.read_csv(fp, header=0, delimiter='\t')
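
The same tell()/seek() idea can be packaged as a small reusable helper. The sketch below is only an illustration: the name seek_to_header and its True/False return value are assumptions, not part of the answer, and an in-memory io.StringIO stands in for a real file so the example runs without pandas.

```python
import io

HEADER = '#Start'  # marker taken from the question


def seek_to_header(fp, header=HEADER):
    """Advance fp to the start of the first line beginning with header.

    Hypothetical helper: returns True if the header line was found,
    otherwise leaves fp at EOF and returns False.
    """
    while True:
        pos = fp.tell()       # start of the line about to be read
        line = fp.readline()
        if not line:          # empty string means EOF
            return False
        if line.startswith(header):
            fp.seek(pos)      # rewind so the header line is read next
            return True


# Demo on an in-memory file; a text file opened with open(path) works the same way.
sample = io.StringIO("junk 1\njunk 2\n#Start\tvalue\na\t1\nb\t2\n")
found = seek_to_header(sample)
remaining = sample.read().splitlines()
# With pandas installed, the positioned handle would be passed on instead:
#     df = pd.read_csv(sample, header=0, delimiter='\t')
```

Returning a flag lets the caller skip files that never contain the header, rather than handing pandas a handle that is already at EOF.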

score 0 · Accepted Answer

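Since read_csv() also accepts file-like objects, you can skip the leading junk lines yourself before handing the object over, instead of passing the file name.

Example:

Instead of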

df = pd.read_csv(filename, skiprows=no_junk_lines(filename), ...)

use:

def forward_csv(f, prefix):
    pos = 0
    while True:
        line = f.readline()
        if not line or line.startswith(prefix):
            f.seek(pos)
            return f
        pos += len(line.encode('utf-8'))

df = pd.read_csv(forward_csv(open(filename), HEADER), ...)

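Notes:

readline() returns an empty string when it reaches EOF.
Not calling tell() to track the position saves a few lseek system calls.
The last line of forward_csv() assumes your input file is encoded in ASCII or UTF-8; if it is not, you have to adjust that line. It also assumes every line ending is a single byte, because text mode translates '\r\n' to '\n' before the length is counted.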
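
One way to avoid the encoding assumption is to count raw bytes by opening the file in binary mode. The following is a sketch under stated assumptions rather than part of the answer: forward_binary is a hypothetical variant of forward_csv(), the prefix has to be given as bytes, and the file must use an ASCII-compatible encoding (UTF-8, Latin-1, and similar) so that line breaks can be found at the byte level.

```python
import io


def forward_binary(f, prefix):
    """Position binary file f at the first line starting with prefix (bytes).

    Hypothetical variant of forward_csv(): lengths are counted on raw bytes,
    so multi-byte characters and '\r\n' line endings are measured exactly.
    """
    pos = f.tell()            # one tell() up front; no further calls in the loop
    while True:
        line = f.readline()
        if not line or line.startswith(prefix):
            f.seek(pos)       # start of the header line, or EOF if never found
            return f
        pos += len(line)      # raw byte length, including the line ending


# Demo with non-ASCII junk and Windows line endings, using an in-memory buffer.
raw = "jünk\r\nmore junk\r\n#Start\tvalue\r\na\t1\r\n".encode('utf-8')
f = forward_binary(io.BytesIO(raw), b'#Start')
text = io.TextIOWrapper(f, encoding='utf-8')  # decode from the current position
rows = text.read().splitlines()
# With pandas installed, the text handle would be passed on instead:
#     df = pd.read_csv(text, delimiter='\t')
```

For a real file, open(filename, 'rb') would take the place of io.BytesIO(raw).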
