python - Compare two csv files using given columns and build a third file with specific columns from the matching rows

Question

one.csv:

12.23496740, -11.95760385, 3, 5, 11.1, 4
12.58295928, -11.39857395, 4, 7, 12.3, 6
12.42572572, -11.09478502, 2, 5, 12.3, 8
12.58300286, -11.95762569, 5, 11, 3.4, 7

two.csv:

12.5830, -11.3986, .2, 4
12.4257, -11.0948, .7, 3

I want to match the two csv files on columns 0 and 1 and output a csv file that includes the corresponding values from column 4 of one.csv and column 2 of two.csv, like this:

three.csv:

12.5830, -11.3986, 12.3, .2
12.4257, -11.0948, 12.3, .7

score 0 · Accepted Answer

I don't think this is a great answer, but a solution to your problem looks like this:

import sys
import math

def dist(point1, point2):
    # Euclidean distance between the (x, y) parts of two points
    return math.sqrt((point1[0]-point2[0])**2 + (point1[1]-point2[1])**2)

one = []
two = []

with open('one.csv', 'r') as f:
    for line in f.readlines():
        x, y, _, _, _4, _ = line.split(',')
        one.append((float(x), float(y), float(_4)))

with open('two.csv', 'r') as f:
    for line in f.readlines():
        x, y, _2, _ = line.split(',')
        two.append((float(x), float(y), float(_2)))

with open('three.csv', 'w') as f:
    for point in two:
        nearest = None
        distance = sys.float_info.max
        for point2 in one:
            d = dist(point2, point)
            if d < distance:
                distance = d
                nearest = point2
        f.write("%f, %f, %f, %f\n" % (point[0], point[1], nearest[2], point[2]))

This writes the following output to three.csv:

12.583000, -11.398600, 12.300000, 0.200000
12.425700, -11.094800, 12.300000, 0.700000

If you need different formatting, just change the format string on the last line of the snippet.
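For instance, to get closer to the look of the three.csv in the question, the format string could be given explicit precisions. This is only a sketch: the precisions are assumptions (the question does not specify them), and `format_row` is a hypothetical helper name. Note that `%f` formatting always prints a leading zero, so `.2` comes out as `0.2`.

```python
def format_row(x, y, one_col4, two_col2):
    # Hypothetical helper: 4 decimals for the coordinates,
    # 1 decimal for the two data columns (assumed precisions).
    return "%.4f, %.4f, %.1f, %.1f\n" % (x, y, one_col4, two_col2)

line = format_row(12.583, -11.3986, 12.3, 0.2)
print(line, end="")
```

In the answer's code this would replace the string passed to `f.write(...)`.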

score 0 · Accepted Answer

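I'm not sure exactly where your problem lies. If you want an algorithm for computing the distance between sets of coordinates, feel free to use the following code: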

from math import radians, cos, sin, asin, sqrt

def haversine(lat1, lng1, lat2, lng2, metric=False):
    """
    Calculate the great circle distance between two points 
    on the earth (specified in decimal degrees)
    """
    earths_radius_km = 6378.1
    # convert decimal degrees to radians 
    lat1, lng1, lat2, lng2 = map(radians, [lat1, lng1, lat2, lng2])
    # haversine formula 
    dlat = lat2 - lat1
    dlng = lng2 - lng1
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlng/2)**2
    c = 2 * asin(sqrt(a)) 
    km = earths_radius_km * c
    if not metric:
        km_to_miles = 0.621371192
        dist = km * km_to_miles
        units = 'miles'
    else:
        dist = km
        units = 'km'
    return dist, units

if __name__ == '__main__':
    print('Please call from within another script')
    # example...
    lat1, lng1, lat2, lng2 = 51.0820266, 1.1834209, 52.4931226, -2.1786751
    print('e.g. distance in km is:', haversine(lat1, lng1, lat2, lng2, True))
    print('e.g. distance in miles is:', haversine(lat1, lng1, lat2, lng2))

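If I understand correctly, you want to iterate over the coordinates in one file and find the closest match in the other? If so, initialise min_distance to an arbitrarily high value, e.g. 1000000, for each value in the first set. Then loop over the second set of coordinates, calling the formula above (or whatever distance function you want to use), and reset min_distance to the result whenever the result is < the current min_distance (also storing the extra values you need from the second list in a temporary variable, which is overwritten each time a lower distance is found). Once all iterations of the inner loop are done, you can store the data you need in a list before starting the next iteration of the outer loop.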
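The loop described above could be sketched roughly as follows. The names `planar_distance` and `nearest_matches` are made up for illustration, a plain Euclidean distance stands in for whichever distance function you prefer, and `float("inf")` is used instead of a literal such as 1000000 as the "arbitrarily high" starting value. The sample rows are the coordinates and data columns from the question.

```python
import math

def planar_distance(p, q):
    # Straight-line distance on the first two values; a stand-in for any
    # distance function (assumption, not the haversine formula).
    return math.hypot(p[0] - q[0], p[1] - q[1])

def nearest_matches(first, second, distance=planar_distance):
    """For each row in `first`, find the closest row in `second`."""
    results = []
    for a in first:
        min_distance = float("inf")  # arbitrarily high starting value
        best = None                  # temporary, overwritten on each improvement
        for b in second:
            d = distance(a, b)
            if d < min_distance:
                min_distance = d
                best = b
        if best is not None:
            # Store what is needed before the next outer iteration.
            results.append((a, best, min_distance))
    return results

# (x, y, value) rows: column 2 of two.csv and column 4 of one.csv
two = [(12.5830, -11.3986, 0.2), (12.4257, -11.0948, 0.7)]
one = [(12.23496740, -11.95760385, 11.1),
       (12.58295928, -11.39857395, 12.3),
       (12.42572572, -11.09478502, 12.3),
       (12.58300286, -11.95762569, 3.4)]

results = nearest_matches(two, one)
for a, best, d in results:
    print(a[0], a[1], best[2], a[2])
```

Passing the distance function as a parameter keeps the loop unchanged if you later switch to a great-circle distance.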

score 0 · Accepted Answer

There is an elegant solution to this problem using numpy:

import numpy as np

def compare_files( f1name, f2name, f3name, ctc1, ctc2, columns, TOL=0.001 ):
    f1 = np.loadtxt( f1name, delimiter=',', ndmin=2 )
    f2 = np.loadtxt( f2name, delimiter=',', ndmin=2 )
    # check[i, j] is True when row i of f1 and row j of f2 agree within TOL
    # on every pair of compared columns
    check = np.logical_and.reduce(
        [np.absolute(np.subtract.outer(f1[:,i], f2[:,j])) < TOL
         for i,j in zip(ctc1,ctc2)] )
    # row indices of each matching pair: ind1 into f1, ind2 into f2
    ind1, ind2 = np.nonzero( check )
    data = {'f1': f1[ind1], 'f2': f2[ind2]}
    # build the output column by column, as listed in `columns`
    new = np.column_stack( [data[f][:, c] for f,c in columns] )
    np.savetxt(f3name, new, delimiter=',', fmt='%f')

The function is general and can be applied to the case described in your question as follows:

f1name = 'one.csv'
f2name = 'two.csv'
f3name = 'three.csv'
ctc1 = [0,1] # columns to compare from file 1
#       ^ ^
#       | | # these arrows just emphasize which column is compared with which
#       v v
ctc2 = [0,1] # columns to compare from file 2
columns = [['f2',0], # file 2 column 0
           ['f2',1], # file 2 column 1
           ['f1',4], # file 1 column 4
           ['f2',2]] # file 2 column 2
TOL = 0.001
compare_files( f1name, f2name, f3name, ctc1, ctc2, columns, TOL )

Here ctc1 and ctc2 tell the function which columns to compare (ctc), and columns tells it how to build the new file. In this example, the new file is built from columns 0 and 1 of f2, column 4 of f1, and column 2 of f2.

Testing with this one.csv:

12.23496740, -11.95760385, 3, 5, 11.1, 4
12.58295928, -11.39857395, 4, 7, 12.3, 6
12.42572572, -11.09478502, 2, 5, 12.3, 8
12.58300286, -11.95762569, 5, 11, 3.4, 7

and this two.csv:

12.43, -11.0948, .7, 3
12.43, -11.0948, .7, 3
12.4257, -11.0948, .7, 3
12.43, -11.0948, .7, 3
12.5830, -11.3986, .2, 4

gives this three.csv:

12.583000,-11.398600,12.300000,0.200000
12.425700,-11.094800,12.300000,0.700000
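To see why only two rows survive (the 12.43 rows of the test two.csv differ from 12.42572572 by more than the tolerance), here is a small sketch of the tolerance check on its own. It assumes numpy is installed and uses `np.subtract.outer` to form all pairwise differences; the arrays are just columns 0 and 1 copied from the test files above, and the variable names are illustrative.

```python
import numpy as np

TOL = 0.001
# Columns 0 and 1 of the test one.csv and two.csv
one_xy = np.array([[12.23496740, -11.95760385],
                   [12.58295928, -11.39857395],
                   [12.42572572, -11.09478502],
                   [12.58300286, -11.95762569]])
two_xy = np.array([[12.43, -11.0948],
                   [12.43, -11.0948],
                   [12.4257, -11.0948],
                   [12.43, -11.0948],
                   [12.5830, -11.3986]])

# close0[i, j] is True when column 0 of one.csv row i and two.csv row j
# differ by less than TOL; close1 is the same test for column 1.
close0 = np.abs(np.subtract.outer(one_xy[:, 0], two_xy[:, 0])) < TOL
close1 = np.abs(np.subtract.outer(one_xy[:, 1], two_xy[:, 1])) < TOL

# A pair of rows matches only when both columns are close.
check = close0 & close1
rows_one, rows_two = np.nonzero(check)
pairs = list(zip(rows_one.tolist(), rows_two.tolist()))
print(pairs)
```

Each pair is (row index in one.csv, row index in two.csv). The last row of one.csv is close to 12.5830 on column 0 but not on column 1, so the combined check rejects it.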

score 0 · Accepted Answer

I would read the two csv files into lists of lists, so that you have csv1 and csv2. Then, to iterate over all of them, you would do:

for e1 in csv1:
    for e2 in csv2:
        distance = d(e1[0], e1[1], e2[0], e2[1])  # d is a call to your distance formula

To save the results you can use a dictionary, which makes them easy to output later. So, when saving a new entry:

output_dict[(e1[0], e1[1])] = [e1[4], e2[2]]  # column 4 of one.csv, column 2 of two.csv
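Put together, the approach might look something like the sketch below. Everything here beyond `csv1`, `csv2`, `d` and `output_dict` is an assumption made for illustration: the helper names `read_rows`, `match` and `write_rows`, the Euclidean distance used for `d`, the `max_distance` threshold of 0.001 that decides whether two rows count as a match, and the choice of columns 4 and 2 (taken from the question).

```python
import csv
import math

def read_rows(path):
    # Read a csv file into a list of lists of floats, skipping blank lines.
    with open(path, newline="") as f:
        return [[float(v) for v in row] for row in csv.reader(f) if row]

def d(x1, y1, x2, y2):
    # Assumed distance formula: plain Euclidean distance.
    return math.hypot(x1 - x2, y1 - y2)

def match(csv1, csv2, max_distance=0.001):
    output_dict = {}
    for e1 in csv1:
        for e2 in csv2:
            if d(e1[0], e1[1], e2[0], e2[1]) <= max_distance:
                # Keyed on the coordinates from the first file.
                output_dict[(e1[0], e1[1])] = [e1[4], e2[2]]
    return output_dict

def write_rows(path, output_dict):
    # One output line per dictionary entry: x, y, then the stored values.
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        for (x, y), values in output_dict.items():
            writer.writerow([x, y] + values)

# In-memory demo using two rows of one.csv and one row of two.csv
csv1 = [[12.58295928, -11.39857395, 4, 7, 12.3, 6],
        [12.58300286, -11.95762569, 5, 11, 3.4, 7]]
csv2 = [[12.5830, -11.3986, 0.2, 4]]
matched = match(csv1, csv2)
print(matched)

# With real files this would be, for example:
# write_rows('three.csv', match(read_rows('one.csv'), read_rows('two.csv')))
```

A dictionary keeps only the last match per key, so if several rows of the second file fall within the threshold of the same row in the first file, only one of them is written.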
