
I have a text file laid out like this:

First col, Second col, Third col, Fourth col,...

with contents similar to:

Johnny, Rodgers, ID1, 18th July,...   
Johnny, Rodgers, ID1, 18th July,...  
Pat, Bryant, ID2, 29th April,...   
Pat, Bryant, ID2, 9th May,... 
Jim, Williams, ID3, 10th March,...  
Jim, Williams, ID3, 17th March,...   
Jim, Williams, ID3, 21st March,...
etc   

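I want to check whether there are duplicates in column 3 and, if so, check whether column 4 is also identical in the rows that share that column 3 value. If both column 3 and column 4 are identical, delete both rows (the whole rows); if column 4 differs, keep the rows. Afterwards, print/store the result.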

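That is,
* If rows 1 and 2 have the same value in column 3 and also the same value in column 4, delete both rows
* If rows 3 and 4 have the same value in column 3 but different values in column 4, print the rows and increment the counter by 1
* If rows 5, 6 and 7 have the same value in column 3 but different values in column 4, print the rows and increment the counter by 1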

After running this, the result would look like:

Pat, Bryant, ID2, 29th April,...   
Pat, Bryant, ID2, 9th May,... 
Jim, Williams, ID3, 10th March,...  
Jim, Williams, ID3, 17th March,...   
Jim, Williams, ID3, 21st March,...

counter = 2 #Number of different ID present

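My idea was to build two lists and store the rows in them, but I have not managed to pick out the target column while also comparing the other column. With my current logic I would also need to loop and pop, and I am not doing that well.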

val = []
duplicated = []

with open('file.txt', 'rt') as myf:
    for line in myf:
        col = line.strip().split(',')
        if col[2] not in val:
            val.append( THE ROW HERE ) #How to copy and parse the row?
        else:
            duplicated.append( THE ROW HERE ) #Same question
#Comparisons

for x in val:
    if x in duplicated:
        val.pop(x)
        duplicated.pop(x)

counter = len(val) #Counter of total cases not erased
val.extend(duplicated)

### I would like to print the whole set of rows ordered by the 3rd col

for element in val:
    print(element)

print("counter of cases: ", counter)

Any help and suggestions for improving my code would be very welcome.


2 Answers

1

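I started from your sample code and assumed that the rows to be merged and deleted are adjacent. I simply keep the previous row's values for comparison, and append the last row at the end if it qualifies.

I use a set to count the distinct IDs.

I also sort the kept rows by the 3rd field and then by the 4th field, parsing the latter as a date with the full month name in the current locale.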
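
The date handling can be looked at in isolation. The following is only a minimal sketch, assuming every date field has the form "18th July" (day, ordinal suffix, full English month name); the helper name `date_key` and the sample list are made up for illustration.

```python
import re
from datetime import datetime

# Hypothetical pattern: day number, ordinal suffix, month name.
ORDINAL = re.compile(r"^\s*(\d{1,2})(?:st|nd|rd|th)\s+(\w+)\s*$")

def date_key(field):
    """Return a sortable (month, day) tuple for a field such as ' 9th May'."""
    match = ORDINAL.match(field)
    if match is None:
        raise ValueError("unrecognised date field: %r" % field)
    day, month_name = match.groups()
    # %B depends on the current locale; a fixed leap year is supplied
    # so that "29th February" would also parse.
    parsed = datetime.strptime("%s %s 2000" % (day, month_name), "%d %B %Y")
    return (parsed.month, parsed.day)

dates = [" 9th May", " 29th April", " 21st March", " 10th March"]
print(sorted(dates, key=date_key))
```

Returning a tuple rather than a datetime keeps the key independent of the placeholder year.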

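Tested on your example, the output is what you asked for, even if the input rows are shuffled, as long as the 2 rows to be deleted are adjacent.

The full code is: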

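import re
import datetime

val = []            # rows that survive the filtering

old = None          # text of the previous distinct row
oldcount = 0        # consecutive occurrences of that row's (ID, date) pair
oldcols = None      # split columns of the previous distinct row

ids = set()         # distinct IDs among the kept rows

with open('file.txt', 'rt') as myf:
    for line in myf:
        cols = line.strip().split(',')
        if (old is not None) and (oldcols[2] == cols[2]) \
                and (oldcols[3] == cols[3]):
            # same ID and same date as the previous row: count the repeat
            oldcount += 1
        else:
            # a new (ID, date) pair starts: keep the previous row if it was unique
            if oldcount == 1:
                val.append(old)
                ids.add(oldcols[2])
            old = line.strip()
            oldcount = 1
            oldcols = cols

# the last row has not been flushed by the loop
if oldcount == 1:
    val.append(old)
    ids.add(oldcols[2])

### I would like to print the whole set of rows ordered by the 3rd col
# drop the ordinal suffix ("18th July" -> "18 July") so strptime can parse it
rx = re.compile(r'\s*([ 0-9]{2}).. *(\w*)')
val.sort(key = lambda x: datetime.datetime.strptime(
    rx.sub(r'\g<1> \g<2>',x.split(',')[3]),'%d %B'))
val.sort(key = lambda x: x.split(',')[2])
for element in val:
    print (element)

print ("counter of cases: " , len(ids))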
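
The code above relies on the rows to delete being adjacent. If that cannot be guaranteed, one possible variant is to count each (3rd column, 4th column) pair first and keep only the rows whose pair occurs once. This is a rough sketch rather than a tested replacement: the function name `drop_repeated_pairs` and the shuffled sample are assumptions, and every line is assumed to have at least four comma-separated fields.

```python
from collections import Counter

def pair_key(row):
    """Return the stripped (3rd column, 4th column) pair of a comma-separated row."""
    cols = [c.strip() for c in row.split(',')]
    return (cols[2], cols[3])

def drop_repeated_pairs(lines):
    """Return (kept_rows, distinct_id_count); adjacency is not required."""
    rows = [line.strip() for line in lines if line.strip()]
    pair_counts = Counter(pair_key(row) for row in rows)
    # Keep only rows whose (ID, date) pair appears exactly once.
    kept = [row for row in rows if pair_counts[pair_key(row)] == 1]
    # Stable sort by ID, so rows of one ID keep their input order.
    kept.sort(key=lambda row: pair_key(row)[0])
    ids = {pair_key(row)[0] for row in kept}
    return kept, len(ids)

# Hypothetical shuffled input based on the question's sample rows.
sample = [
    "Jim, Williams, ID3, 10th March",
    "Johnny, Rodgers, ID1, 18th July",
    "Pat, Bryant, ID2, 29th April",
    "Jim, Williams, ID3, 17th March",
    "Johnny, Rodgers, ID1, 18th July",
    "Pat, Bryant, ID2, 9th May",
]
kept, counter = drop_repeated_pairs(sample)
for row in kept:
    print(row)
print("counter of cases:", counter)
```

The trade-off is that all rows are held in memory at once, which the single-pass version avoids.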
answered 2014-08-11T15:59:06.777
1

Assuming they are always adjacent, and using your sample data:

import csv

fn = 'file.txt'  # path to the input file
with open(fn, 'r') as fin:
    reader=csv.reader(fin, skipinitialspace=True)
    header=next(reader)
    data={k:[] for k in header}
    for row in reader:
        row_di={k:v for k,v in zip(header, row)}
        if (all(len(data[e]) for e in header) 
               and row_di['Third col']==data['Third col'][-1] 
               and row_di['Fourth col']==data['Fourth col'][-1]):
            for e in header:
                data[e].pop()
        else:
            for e in header:
                data[e].append(row_di[e])

>>> data
{'Second col': ['Bryant', 'Bryant', 'Williams', 'Williams', 'Williams'], 'First col': ['Pat', 'Pat', 'Jim', 'Jim', 'Jim'], 'Fourth col': ['29th April', '9th May', '10th March', '17th March', '21st March'], 'Third col': ['ID2', 'ID2', 'ID3', 'ID3', 'ID3'], '...': ['...   ', '... ', '...  ', '...   ', '...']}

To print in your format:

unique_ids=set(data['Third col'])

while True:
    try:
        # take the first remaining value of every column to rebuild one row
        print(', '.join([data[e].pop(0) for e in header]))
    except IndexError:
        break
print('Unique IDs:', len(unique_ids))

Prints:

Pat, Bryant, ID2, 29th April, ...   
Pat, Bryant, ID2, 9th May, ... 
Jim, Williams, ID3, 10th March, ...  
Jim, Williams, ID3, 17th March, ...   
Jim, Williams, ID3, 21st March, ...
Unique IDs: 2

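Notes:

  1. For csv data, it is usually best to use the csv module;
  2. Use a set(iterable) to get the unique entries of an iterable; its len() gives the number of unique entries;
  3. If you have a lot of data, you might consider a dict of deques rather than a dict of lists. The pop(0) that the printing loop relies on is O(n) on a list, while popleft() on a deque is O(1), so a deque would be much faster.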
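
Note 3 could look roughly like the following. It is only an illustrative sketch: the rows are hard-coded here instead of being read with csv.reader, and the `printed` list exists only so the output can be inspected.

```python
from collections import deque

header = ["First col", "Second col", "Third col", "Fourth col"]
# Hypothetical rows standing in for the filtered data.
rows = [
    ["Pat", "Bryant", "ID2", "29th April"],
    ["Pat", "Bryant", "ID2", "9th May"],
    ["Jim", "Williams", "ID3", "10th March"],
]

# One deque per column instead of one list per column.
data = {name: deque() for name in header}
for row in rows:
    for name, value in zip(header, row):
        data[name].append(value)

unique_ids = set(data["Third col"])

printed = []
while data[header[0]]:
    # popleft() removes from the front in O(1); list.pop(0) is O(n).
    line = ", ".join(data[name].popleft() for name in header)
    printed.append(line)
    print(line)
print("Unique IDs:", len(unique_ids))
```

A deque also supports pop() from the right in O(1), so the filtering step that removes the most recent row would work unchanged.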
answered 2014-08-11T14:55:07.000