I have a CSV file with about 30,000 rows and 24 columns of data. The last column is a geography column that looks like this:

 Ethiopia
 IL
 IL
 TX
 TX
 MD
 NY
 NY
 Ethiopia
 Ethiopia
 Sweden
 CA
 CA
 HI
 Latvia
 OH

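Now I want the entire CSV, with all of its columns, to contain only the rows whose geography corresponds to the US, which would be the 2-character state abbreviations (CA, HI, OH, etc.).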

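Basically, I want to remove from the CSV any data that is not related to the US or, even better if possible, put the US locations in the first X rows and all the other data after them at the end of the CSV.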

Here is my code so far:

import csv

ask = "Y"

while ask != "N":
    inputfile = input("Please enter filename: ")
    filename = open(inputfile, "r")

    data = []
    with filename as f:
        reader = csv.reader(f, delimiter=',')
        for row in reader:
            if len(row[24]) == 3:
                data = row[24]
        datalist = row[0:23].join(data)
        output = open("Newly Created Data.csv","w")
        output.write(datalist)
        print ("Done.")

    output.close()

    ask = input("Another file, Y or N? ")

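It correctly arranges only the data in column 24 by reading the US locations, but I don't know how to sort the rest of the file, the other 23 columns, so that it matches only the US locations.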

I'm using Python 3. Thanks.

2 Answers

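import csv

# Two-letter abbreviations of the 50 US states.
states = {"AL","AK","AZ","AR","CA","CO","CT","DE","FL","GA","HI","ID","IL","IN","IA","KS","KY","LA","ME","MD","MA","MI","MN","MS","MO","MT","NE","NV","NH","NJ","NM","NY","NC","ND","OH","OK","OR","PA","RI","SC","SD","TN","TX","UT","VT","VA","WA","WV","WI","WY"}

# newline='' lets the csv module handle line endings itself, which avoids
# extra blank lines between rows on Windows.
with open('file.txt', newline='') as f, open('ofile.txt', 'w', newline='') as o:
    reader = csv.reader(f)
    writer = csv.writer(o)
    # sorted() is stable and False sorts before True, so rows whose last
    # column is a state come first and keep their original relative order.
    writer.writerows(sorted(reader, key=lambda row: row[-1] not in states))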

This will sort the file

A,B,C,Ethiopia
A,B,C,IL
A,B,C,IL
A,B,C,TX
A,B,C,TX
A,B,C,MD
A,B,C,NY
A,B,C,NY
A,B,C,Ethiopia
A,B,C,Ethiopia
A,B,C,Sweden
A,B,C,CA
A,B,C,CA
A,B,C,HI
A,B,C,Latvia
A,B,C,OH

into

A,B,C,IL
A,B,C,IL
A,B,C,TX
A,B,C,TX
A,B,C,MD
A,B,C,NY
A,B,C,NY
A,B,C,CA
A,B,C,CA
A,B,C,HI
A,B,C,OH
A,B,C,Ethiopia
A,B,C,Ethiopia
A,B,C,Ethiopia
A,B,C,Sweden
A,B,C,Latvia

which, when read back like this:

with open('ofile.txt') as f:
    for line in csv.reader(f):
        print(line)

yields:

>>> 
['A', 'B', 'C', 'IL']
['A', 'B', 'C', 'IL']
['A', 'B', 'C', 'TX']
['A', 'B', 'C', 'TX']
['A', 'B', 'C', 'MD']
['A', 'B', 'C', 'NY']
['A', 'B', 'C', 'NY']
['A', 'B', 'C', 'CA']
['A', 'B', 'C', 'CA']
['A', 'B', 'C', 'HI']
['A', 'B', 'C', 'OH']
['A', 'B', 'C', 'Ethiopia']
['A', 'B', 'C', 'Ethiopia']
['A', 'B', 'C', 'Ethiopia']
['A', 'B', 'C', 'Sweden']
['A', 'B', 'C', 'Latvia']
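
If the goal is to drop the non-US rows entirely rather than move them to the end (the first thing the question asks for), a minimal sketch along the same lines might look like the following. The helper name keep_us_rows and the in-memory sample are assumptions made for illustration; with real files you would pass csv.reader(f) and write to a csv.writer opened on the output file instead.

```python
import csv
import io

# Two-letter abbreviations of the 50 US states, as in the answer above.
STATES = {
    "AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "FL", "GA",
    "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD",
    "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ",
    "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC",
    "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY",
}


def keep_us_rows(rows, states=STATES):
    """Yield only the rows whose last column is a US state abbreviation.

    Hypothetical helper, not part of the csv module.
    """
    for row in rows:
        # Skip empty rows, and ignore stray whitespace around the value.
        if row and row[-1].strip() in states:
            yield row


# Demo on in-memory text so the sketch runs without any files on disk.
sample = "A,B,C,Ethiopia\nA,B,C,IL\nA,B,C,Sweden\nA,B,C,CA\n"
out = io.StringIO(newline="")
csv.writer(out).writerows(keep_us_rows(csv.reader(io.StringIO(sample))))
filtered = out.getvalue()
print(filtered)
```

Because the rows are filtered one at a time, this variant never needs the whole file in memory, and the surviving rows keep their original order.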
answered 2013-05-19T03:52:05.923

For a pure standard-library solution, perhaps something like

import csv

with open('location.csv', newline='') as fp_in:
    reader = csv.reader(fp_in, delimiter=',')
    data = list(reader)

data.sort(key=lambda x: (len(x[-1].strip()) != 2, x[-1].strip()))

with open("locout.csv", "w", newline='') as fp_out:
    writer = csv.writer(fp_out, delimiter=',')
    writer.writerows(data)

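The sort key function, lambda x: (len(x[-1].strip()) != 2, x[-1].strip()), means that the data is sorted first by whether the last column has two characters, which puts the 2-character locations first, and then by name (effectively alphabetizing them, at least if they all start with a capital letter).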
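
To make the tuple key concrete, here is a small sketch on a handful of made-up rows (the rows and the name sort_key are illustrative assumptions, not data from the question). Python compares tuples element by element and False sorts before True, so every two-character value lands ahead of the longer ones, and ties are broken by the stripped name. Note that the length test is only a heuristic: any other two-character value in that column would also be treated as a state.

```python
rows = [
    ["A", "B", "Ethiopia"],
    ["A", "B", "TX"],
    ["A", "B", "Sweden"],
    ["A", "B", " CA"],  # leading space, handled by strip() in the key
    ["A", "B", "IL"],
]


def sort_key(row):
    place = row[-1].strip()
    # First element: False for two-character values, True for everything else.
    # Second element: the name itself, used to order rows within each group.
    return (len(place) != 2, place)


ordered = sorted(rows, key=sort_key)
last_column = [row[-1].strip() for row in ordered]
print(last_column)  # ['CA', 'IL', 'TX', 'Ethiopia', 'Sweden']
```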

I'm assuming the file isn't too large: 30,000 rows isn't a lot, even with 24 columns, so we may as well work entirely in memory.
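
If that assumption did not hold, one possible alternative is to read the input twice and stream the rows straight to the output: the first pass writes the two-character rows, the second pass writes everything else. This is only a sketch; the function names and the paths passed in are hypothetical, and unlike the sort above it keeps each group in its original order rather than alphabetizing it.

```python
import csv


def is_us_state(row):
    # Same heuristic as the sort key above: a two-character last column.
    return bool(row) and len(row[-1].strip()) == 2


def write_us_first(src_path, dst_path):
    """Copy src_path to dst_path with the US rows first.

    The input is read twice, so only one row is held in memory at a time.
    """
    with open(dst_path, "w", newline="") as fp_out:
        writer = csv.writer(fp_out)
        for want_us in (True, False):
            with open(src_path, newline="") as fp_in:
                for row in csv.reader(fp_in):
                    if is_us_state(row) == want_us:
                        writer.writerow(row)
```

A call such as write_us_first("location.csv", "locout.csv") would then play the same role as the in-memory version.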

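(By the way: if you're doing a lot of CSV manipulation, you might be interested in the pandas library. It makes many operations much simpler than they would otherwise be.)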

answered 2013-05-19T04:22:23.707