
I need to get the length of each CSV file in '/dir/', not counting empty lines. I tried this:

import os, csv, itertools, glob

#To filter the empty lines
def filterfalse(predicate, iterable):
    # filterfalse(lambda x: x%2, range(10)) --> 0 2 4 6 8
    if predicate is None:
        predicate = bool
    for x in iterable:
        if not predicate(x):
            yield x

#To read each file in '/dir/', compute the length and write the output 'count.csv'
with open('count.csv', 'w') as out:
     file_list = glob.glob('/dir/*')
     for file_name in file_list:
         with open(file_name, 'r') as f:
              filt_f1 = filterfalse(lambda line: line.startswith('\n'), f)
              count = sum(1 for line in f if (filt_f1))
              out.write('{c} {f}\n'.format(c = count, f = file_name))

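I get the output I want, but unfortunately the length reported for each file (in '/dir/') includes the empty lines.

To see where the empty lines come from, I read file.csv as file.txt, and it looks like this: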

*text,favorited,favoriteCount,...
"Retweeted user (@user):...
'empty row'
Do Operators...*

2 Answers


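Your filterfalse() function works correctly. It is exactly the same as the one named ifilterfalse in the itertools standard library module (filterfalse in Python 3), so it is not clear why you don't just use that instead of writing your own. One major advantage is that it has already been tested and debugged. (Built-ins are also often faster, because many of them are written in C.)

The problem is that you are not using the generator function correctly:

  1. Since it returns a generator object, you need to iterate over the values it yields, with something like for line in filt_f1.

  2. The predicate function you pass to it does not correctly handle lines that contain other leading whitespace characters, such as spaces and tabs, so the lambda you pass also needs to be modified to handle those cases.

The code below makes both of these changes.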
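To make the two points concrete, here is a small sketch that runs the three variants against an in-memory sample. The sample text and the variable names are made up for illustration, and it assumes Python 3, where the itertools function is called filterfalse.

```python
import io
from itertools import filterfalse  # Python 3 name; ifilterfalse in Python 2

# Hypothetical sample standing in for a file: 2 real lines and 2 blank ones,
# one of which contains only spaces.
sample = "a,b\n\n   \n1,2\n"

f = io.StringIO(sample)
filtered = filterfalse(lambda line: line.startswith('\n'), f)
# The filter object is always truthy, so this condition removes nothing
# and the loop simply counts every line of f.
wrong_count = sum(1 for line in f if filtered)

f = io.StringIO(sample)
filtered = filterfalse(lambda line: line.startswith('\n'), f)
# Iterating over the filter applies the predicate, but the whitespace-only
# line does not start with '\n' and is still counted.
partial_count = sum(1 for line in filtered)

f = io.StringIO(sample)
filtered = filterfalse(lambda line: not line.strip(), f)
# strip() removes all surrounding whitespace, so blank lines of any kind
# become empty strings and are filtered out.
right_count = sum(1 for line in filtered)

print(wrong_count, partial_count, right_count)
```

Only the last variant counts just the two lines that carry data.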

import os, csv, itertools, glob

#To filter the empty lines
def filterfalse(predicate, iterable):
    # filterfalse(lambda x: x%2, range(10)) --> 0 2 4 6 8
    if predicate is None:
        predicate = bool
    for x in iterable:
        if not predicate(x):
            yield x

#To read each file in '/dir/', compute the length and write the output 'count.csv'
with open('count.csv', 'w') as out:
    file_list = glob.glob('/dir/*')
    for file_name in file_list:
        with open(file_name, 'r') as f:
            filt_f1 = filterfalse(lambda line: not line.strip(), f)  # CHANGED
            count = sum(1 for line in filt_f1)  # CHANGED
            out.write('{c} {f}\n'.format(c=count, f=file_name))
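For comparison, the same count can be written without a separate filter function. The following is only a sketch under a few assumptions: Python 3, hypothetical helper names (count_nonblank_lines, count_directory), and a temporary directory with made-up contents standing in for '/dir/'.

```python
import glob
import os
import tempfile


def count_nonblank_lines(path):
    """Return the number of lines in path that contain non-whitespace text."""
    with open(path, 'r') as f:
        return sum(1 for line in f if line.strip())


def count_directory(pattern):
    """Map each regular file matching pattern to its non-blank line count."""
    return {name: count_nonblank_lines(name)
            for name in sorted(glob.glob(pattern))
            if os.path.isfile(name)}  # skip sub-directories matched by the glob


# Demo on a throwaway directory; the file name and contents are invented.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'demo.csv')
    with open(demo_path, 'w') as f:
        f.write('text,favorited\n"first row",0\n\n \t\n"second row",1\n')
    counts = count_directory(os.path.join(tmp, '*'))
    demo_count = list(counts.values())[0]

print(demo_count)
```

Note that this counts physical lines, as the code above does. A quoted CSV field that contains a line break spans several physical lines, so the result can differ from the number of CSV records.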
answered 2016-04-25T19:11:05.590

I suggest using pandas.

import pandas

# Reads the CSV file into a pandas DataFrame (blank lines are skipped by default).
df = pandas.read_csv('myfile.csv')

# Removes rows in which every field is missing.
df.dropna(how='all', inplace=True)

# Number of data rows, plus 1 for the header line.
df_length = len(df) + 1
print('The length of the CSV file is', df_length)

Documentation: http://pandas.pydata.org/pandas-docs/version/0.18.0/

answered 2016-04-25T18:29:29.210