
I am using the following in Python 2.7:

dfile = 'new_data.txt'   #  Depth file no. 1
d_row = [line.strip() for line in open(dfile)]

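I have loaded the data file into a list with the newlines stripped. Now I want to find the indices of all elements of d_row whose string is empty or does not start with a digit. After that, I need to:

  1. remove all of the non-numeric entries described above, and
  2. save those strings and their indices so they can be inserted into the updated file later.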

Sample data:

Thu Mar 14 18:17:05 2013                                                       
Fri Mar 15 01:40:25 2013

FT

DepthChange: 0.000000,2895.336,0.000
1363285025.250000,9498.970
1363285025.300000,9498.970
1363285026.050000,9498.970
1363287840.450042,9458.010
1363287840.500042,9458.010
1363287840.850042,9458.010
1363287840.900042,9458.010
DepthChange: 0.000000,2882.810,9457.200
1363287840.950042,9458.010
DepthChange: 0.000000,2882.810,0.000
1363287841.000042,9457.170
1363287841.050042,9457.170
1363287841.100042,9457.170
1363287841.150042,9457.170
1363287841.200042,9457.170
1363287841.250042,9457.170
1363287841.300042,9457.170
1363291902.750102,9149.937
1363291902.800102,9149.822
1363291902.850102,9149.822
1363291902.900102,9149.822
1363291902.950102,9149.822
1363291903.000102,9149.822
1363291903.050102,9149.708
1363291903.100102,9149.708
1363291903.150102,9149.708
1363291903.200102,9149.708
1363291903.250102,9149.708
1363291903.300102,9149.592
1363291903.350102,9149.592
1363291903.400102,9149.592
1363291903.450102,9149.592
1363291903.500102,9149.592
DepthChange: 0.000000,2788.770,2788.709
1363291903.550102,9149.479
1363291903.600102,9149.379

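I have been doing the removal step by hand, which is time consuming because the file contains more than a million lines. At the moment I have no way to rewrite the file so that it contains all of the original elements with a few modifications.

Any tips would be appreciated.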


3 Answers

dfile = 'new_data.txt'
with open(dfile) as infile:
  numericLines = set() # line numbers of lines that start with a digit
  emptyLines = set() # line numbers of lines that are empty
  charLines = [] # (line number, text) of lines that start with a letter
  for lineno, line in enumerate(infile):
    if line[0].isalpha():  # first character is a letter
      charLines.append((lineno, line.strip()))
    elif line[0].isdigit():  # first character is a digit
      numericLines.add(lineno)
    elif not line.strip():  # nothing but whitespace
      emptyLines.add(lineno)
answered 2013-10-22T00:33:35.510

Thanks to everyone who answered my question. Using parts of each reply, I was able to get the result I wanted. This is what finally worked:

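import numpy as np

goodrow_ind, badrow_ind, badrows = [], [], []

# ifile and ofile are the input and output file paths
with open(ifile) as fin, open(ofile, 'w') as f:
    for i, row in enumerate(fin):
        if row[0].isdigit():
            # numeric row: keep it in the cleaned file
            f.write(row)
            goodrow_ind.append(i)
        else:
            # header, DepthChange or blank row: remember index and text
            badrow_ind.append(i)
            badrows.append(row)

# both files are closed here, so the cleaned file can be loaded by path
data = np.loadtxt(ofile, delimiter=',')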

The result is the "good" and "bad" rows separated from each other, each with its indices.
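
For the later re-insertion step, something along these lines might work. It is only a sketch: merge_rows is a hypothetical helper name, and it assumes the saved indices are positions in the original file and that the good rows are still in their original order.

```python
def merge_rows(good_rows, badrow_ind, badrows):
    """Rebuild the original line order from the kept rows and the saved bad rows."""
    # Map each original line number to the text that was removed from it
    bad = dict(zip(badrow_ind, badrows))
    total = len(good_rows) + len(bad)
    good_iter = iter(good_rows)
    merged = []
    for i in range(total):
        if i in bad:
            # This position originally held a non-numeric row
            merged.append(bad[i])
        else:
            # Otherwise take the next numeric row in order
            merged.append(next(good_iter))
    return merged
```

The good rows can be modified before merging, as long as none are added or removed, since the positions of the bad rows depend on the total count.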

answered 2013-10-24T01:13:50.893

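The simplest way is to do it in two passes: one to collect the matching rows with their row numbers, and another to collect the non-matching rows with theirs.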

d_rows = [line.strip() for line in open(dfile)]
good_rows = [(i, row) for i, row in enumerate(d_rows) if is_good_row(row)]
bad_rows = [(i, row) for i, row in enumerate(d_rows) if not is_good_row(row)]

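This does mean going over the list twice, but who cares? If the list is small enough to read entirely into memory, as you are already doing, the extra cost is probably negligible.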
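
The snippets here leave is_good_row undefined. One possible definition, assuming "good" simply means "starts with a digit" as in the question, could look like this (the sample list is made up from the data shown in the question):

```python
def is_good_row(row):
    """Return True if the row starts with a digit (one possible definition)."""
    # row[:1] is '' for an empty string and ''.isdigit() is False,
    # so blank rows count as bad without raising IndexError.
    return row[:1].isdigit()

sample = ['Thu Mar 14 18:17:05 2013', '',
          'DepthChange: 0.000000,2895.336,0.000',
          '1363285025.250000,9498.970']
good_rows = [(i, row) for i, row in enumerate(sample) if is_good_row(row)]
bad_rows = [(i, row) for i, row in enumerate(sample) if not is_good_row(row)]
```

Slicing with row[:1] rather than indexing with row[0] matters here because the rows have already been stripped, so a blank line becomes an empty string.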

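Alternatively, if you need to avoid the cost of building the two lists in two passes, you probably also want to avoid reading the whole file up front, so you have to do something a bit cleverer: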

d_rows = (line.strip() for line in open(dfile)) # notice genexp, not list comp
good_rows, bad_rows = [], []
for i, row in enumerate(d_rows):
    if is_good_row(row):
        good_rows.append((i, row))
    else:
        bad_rows.append((i, row))

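If you can push things to the point where you do not even need explicit good_rows and bad_rows lists, you can keep everything in an iterator the whole way through and waste no memory or up-front reading time at all: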

d_rows = (line.strip() for line in open(dfile)) # notice genexp, not list comp
with open(outfile, 'w') as f:
    for i, row in enumerate(d_rows):
        if is_good_row(row):
            f.write(row + '\n')
        else:
            whatever_you_wanted_to_do_with(i, row)
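
As one illustration of what the else branch could do, the removed rows might be streamed to a second file together with their line numbers, so nothing has to be held in memory. This is only a sketch: split_file, the sidecar file, and the tab-separated format are assumptions, not something the question specifies.

```python
def split_file(infile, outfile, sidecar):
    """Stream infile once: numeric rows go to outfile, all others to sidecar."""
    with open(infile) as fin, open(outfile, 'w') as good, open(sidecar, 'w') as bad:
        for i, line in enumerate(fin):
            row = line.strip()
            if row[:1].isdigit():
                good.write(row + '\n')
            else:
                # Stored as "index<TAB>text"; assumes rows contain no tabs
                bad.write('%d\t%s\n' % (i, row))
```

The sidecar file can then be read back later to recover each removed row and the position it came from.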
answered 2013-10-22T00:35:56.057