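What is the most elegant (and/or Pythonic) way to check whether a data file contains only a header before using numpy.loadtxt or numpy.genfromtxt to load columns of data into a numpy array?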

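I have a quantum Monte Carlo code that writes a header to disk when it starts executing and sometimes never writes any data (the cluster's wall clock limit is reached first). I have written Python scripts to process many data files at once, and sometimes a few of those files end up never having data written to them in the allotted time. I need my analysis script to know when a file is empty before I try to load the data and do something with it.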

My approach (which works, but is probably not the most elegant) is to call a function that looks like

def checkIfEmpty(fName, n):
    '''
    takes the first non-header line number and returns True or False
    depending upon whether that line is blank or not.
    '''
    # Count the lines in the file; the file is closed automatically.
    with open(fName) as fp:
        numLines = sum(1 for line in fp)

    # The file is treated as empty when it has exactly n lines.
    return n == numLines

2 Answers

EDIT: Since you've indicated the output files may not really be that much bigger than the header-only files, I've thought of a different way to rid yourself of the explicit for loop.

def checkIfEmpty(fname, n):
    # NOTE: n is the file byte position at the end of the header.
    with open(fname, 'r') as file_open:
        file_open.seek(n)
        # The file is empty if nothing can be read past the header.
        return len(file_open.read()) == 0

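Wherever you currently calculate n in your code, you would just return the byte position instead. file_open.tell() will return this value if you have read the header lines somewhere else with readline() to test your header (iterating over the file with a for loop uses read-ahead, so tell() is not reliable there).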
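As a rough sketch of how n could be obtained, assuming every header line starts with '#' (your format may differ): the helper names header_end_offset and has_only_header are made up for illustration, and the files are opened in binary mode so that tell() and seek() work in plain bytes.

```python
def header_end_offset(fname, comment=b"#"):
    """Return the byte offset just past the last leading comment line.

    Assumption: header lines are the leading lines that start with `comment`.
    """
    with open(fname, "rb") as fp:
        offset = 0
        while True:
            line = fp.readline()  # readline() keeps tell() usable
            if not line or not line.lstrip().startswith(comment):
                return offset
            offset = fp.tell()


def has_only_header(fname, n):
    """Return True if nothing but whitespace follows byte offset n."""
    with open(fname, "rb") as fp:
        fp.seek(n)
        return fp.read().strip() == b""
```

If all your files share the same header, n only needs to be computed once and can then be reused for every file.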

END EDIT

How much data is usually in the file?

If a file that contains data is much larger than a header-only file, you could use:

import os

def checkIfEmpty(fname, header_cutoff):
    # Files smaller than header_cutoff bytes are assumed to hold only the header.
    return os.path.getsize(fname) < header_cutoff

Another reason I would prefer this solution is that, with a lot of large files, opening and reading each one could be slow.
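For a whole batch of files, the size check could be applied up front, something like the sketch below. The function name split_by_size is hypothetical, and header_cutoff is assumed to be a byte count you pick to be larger than the header but smaller than any file that holds data.

```python
import os


def split_by_size(fnames, header_cutoff):
    """Partition file names into (with_data, header_only) by size alone.

    Assumption: any file smaller than header_cutoff bytes has no data.
    """
    with_data, header_only = [], []
    for fname in fnames:
        if os.path.getsize(fname) < header_cutoff:
            header_only.append(fname)
        else:
            with_data.append(fname)
    return with_data, header_only
```

Only the names in with_data would then be passed on to numpy.loadtxt or numpy.genfromtxt, and no file is opened just to be checked.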

Answered 2013-11-01T04:14:59.697

Something like:

def is_header_only(fname):
    with open(fname) as fin:
        return next(fin, '').lstrip().startswith('#') and next(fin, None) is None
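The one-liner above assumes a single header line that starts with '#'. If the header can span several lines, a variant along these lines might work; it is only a sketch, the name has_no_data is made up, and it treats blank lines as non-data. Note that, unlike the version above, it also returns True for a completely empty file.

```python
from itertools import dropwhile


def has_no_data(fname, comment="#"):
    """Return True if every line is a comment line or blank.

    Assumption: header lines start with `comment` (after optional whitespace).
    """
    def is_header_or_blank(line):
        stripped = line.strip()
        return not stripped or stripped.startswith(comment)

    with open(fname) as fin:
        # Skip leading header/blank lines; anything left over is data.
        rest = dropwhile(is_header_or_blank, fin)
        return next(rest, None) is None
```

Since the file is read lazily, the check stops at the first data line rather than reading the whole file.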
Answered 2013-10-31T17:01:48.170