
I have a text file that I need to sort by the first column, merging all duplicates into a count on the left side of the data, and then write the sorted/counted data to an already-created csv file.

Text file, before:

, 00.000.00.000, word, 00
, 00.000.00.001, word, 00
, 00.000.00.002, word, 00
, 00.000.00.000, word, 00
, 00.000.00.002, word, 00
, 00.000.00.000, word, 00

Desired result:

, 3, 00.000.00.000, word, 00
, 1, 00.000.00.001, word, 00
, 2, 00.000.00.002, word, 00

My code:

for ip in open("list.txt"):
    with open(ip.strip()+".txt", "a") as ip_file:
        for line in open("data.txt"):
            new_line = line.split(" ")
            if "blocked" in new_line:
                if "src="+ip.strip() in new_line:
                    ip_file.write(", " + new_line[11])
                    ip_file.write(", " + new_line[12])
                    ip_file.write(", " + new_line[13])

for ip_file in os.listdir(sub_dir):
        with open(os.path.join(sub_dir, ip_file), "a") as f:
            data = f.readlines()
            data.sort(key = lambda l: float(l.split()[0]), reverse = True)

Whenever I test the code, I get TypeError: 'str' object is not callable or something similar. I can't use .split(), .read(), .strip(), etc. without getting an error.

Question: how can I sort the file contents and count duplicate lines (without defining a function)?

I'm basically trying to do:

sort -k1 | uniq -c | sed 's/^/,/' >> test.csv
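In Python terms, what I'm after is roughly this sketch (using `collections.Counter`, with the sample lines from above inlined instead of read from a file):

```python
from collections import Counter

# Count identical lines, then emit them sorted with a ", <count>" prefix,
# a rough analogue of `sort | uniq -c | sed 's/^/,/'`.
lines = [
    ", 00.000.00.000, word, 00",
    ", 00.000.00.001, word, 00",
    ", 00.000.00.002, word, 00",
    ", 00.000.00.000, word, 00",
    ", 00.000.00.002, word, 00",
    ", 00.000.00.000, word, 00",
]

counts = Counter(lines)
rows = [", %d%s" % (counts[line], line) for line in sorted(counts)]
for row in rows:
    print(row)
```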

3 Answers


How about this:

import itertools

# 'lines' instead of 'input' to avoid shadowing the builtin
lines = ''', 00.000.00.000, word, 00
, 00.000.00.001, word, 00
, 00.000.00.002, word, 00
, 00.000.00.000, word, 00
, 00.000.00.002, word, 00
, 00.000.00.000, word, 00'''.split('\n')

lines.sort(key=lambda line: line.split(',')[1])

for key, values in itertools.groupby(lines, lambda line: line.split(',')[1]):
    values = list(values)
    print(', %d%s' % (len(values), values[0]))

This is missing all error checking (malformed lines, etc.), but you can add that yourself as needed. Also, split runs twice: once for sorting and once for grouping. That could probably be improved.
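One way to run the split only once (a sketch, not part of the original answer): pre-compute (key, line) pairs, then sort and group on the stored key.

```python
import itertools

lines = ''', 00.000.00.000, word, 00
, 00.000.00.001, word, 00
, 00.000.00.002, word, 00
, 00.000.00.000, word, 00
, 00.000.00.002, word, 00
, 00.000.00.000, word, 00'''.split('\n')

# Split each line once, keeping (key, original_line) pairs;
# sorting tuples orders by key first.
keyed = sorted((line.split(',')[1], line) for line in lines)

rows = []
for key, group in itertools.groupby(keyed, key=lambda pair: pair[0]):
    group = list(group)
    rows.append(', %d%s' % (len(group), group[0][1]))

for row in rows:
    print(row)
```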

answered 2013-08-23T14:08:31.517
D = {}
for k in open('data.txt'): # use a dict to count and filter duplicate lines
    if k in D:
        D[k] += 1 # increment the count if the line was already seen
    else:
        D[k]  = 1 # first sighting: initialize the count to one

with open('test.csv', 'a') as out: # open the output once instead of per line
    for sk in sorted(D): # sort keys
        print(',', D[sk], sk.rstrip(), file=out) # print a comma, followed by the count, then the line

#Output
, 3, 00.000.00.000, word, 00
, 1, 00.000.00.001, word, 00
, 2, 00.000.00.002, word, 00    
answered 2013-08-23T16:45:40.823

I would consider using the pandas data-processing module:

import pandas as pd
my_data = pd.read_csv(r"C:\Where My Data Lives\Data.txt", header=None)  # raw string so the backslashes survive
sorted_data = my_data.sort_index(by=[1], ascending=1)  # sort my data (sort_values(by=1) in modern pandas)
sorted_data = sorted_data.drop_duplicates([1])         # leaves only unique values, sorted in order
counted_data = list(my_data.groupby(1).size())         # counts the unique values in data, converts to a list
sorted_data[0] = counted_data                          # inserts the list into your data frame
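A self-contained variant of the same idea (a sketch assuming a reasonably current pandas; the sample data is inlined rather than read from Data.txt):

```python
import pandas as pd

# Inline the question's sample rows instead of reading them from disk.
rows = [
    ["", "00.000.00.000", "word", "00"],
    ["", "00.000.00.001", "word", "00"],
    ["", "00.000.00.002", "word", "00"],
    ["", "00.000.00.000", "word", "00"],
    ["", "00.000.00.002", "word", "00"],
    ["", "00.000.00.000", "word", "00"],
]
df = pd.DataFrame(rows)

# groupby sorts by the key by default, so size() already gives the
# duplicate counts in sorted order of column 1.
counts = df.groupby(1).size()
print(counts)
```

Because the counts come back aligned to the sorted unique keys, they can be glued onto the drop_duplicates output exactly as the last line of the answer above does.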
answered 2013-08-23T15:02:46.760