python - Matching different columns and combining them using Python

Question

I have two text files.

The first is a space-delimited list:

23 dog 4
24 cat 5
28 cow 7

The second is a '|'-delimited list:

?dog|parallel|numbering|position
Dogsarebarking
?cat|parallel|nuucers|position
CatisBeautiful

I want to get an output file like this:

?dog|parallel|numbering|position|23
?cat|parallel|nuucers|position|24

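This is a '|'-delimited list containing the values from the second file, with the value from the first column of the first file appended, where the values in the second column of both files match.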

score 3 · Accepted Answer

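Use a dictionary to store the file1 rows, keyed by the name in the second column, and the csv module to write the output. The second file is in a FASTA-like format, so we only take the lines that start with ?: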

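import csv

# Map each name (second column of file1) to its ID (first column).
with open('file1') as file1:
    file1_data = dict(line.split(None, 2)[1::-1] for line in file1 if line.strip())

# newline='' lets the csv module control line endings (Python 3).
with open('file2') as file2, open('output', 'w', newline='') as outputfile:
    output = csv.writer(outputfile, delimiter='|')
    for line in file2:
        if line[:1] == '?':  # header line; sequence lines are skipped
            row = line.strip().split('|')
            key = row[0][1:]  # drop the leading '?'
            if key in file1_data:
                output.writerow(row + [file1_data[key]])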

For your input example, this produces:

?dog|parallel|numbering|position|23
?cat|parallel|nuucers|position|24
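The same lookup-table idea can be sketched as a small function that works on any iterable of lines, which makes it easy to try without creating files. This is only an illustration for Python 3: the function name append_ids and the in-memory sample data are assumptions, not part of the answer above.

```python
def append_ids(id_lines, record_lines):
    """Return '|'-joined header rows with the matching ID appended.

    Hypothetical helper: id_lines look like '23 dog 4'; record_lines
    mix '?key|...' header lines with sequence lines.
    """
    # Map the second column (name) to the first column (ID).
    ids = {}
    for line in id_lines:
        parts = line.split()
        if len(parts) >= 2:
            ids[parts[1]] = parts[0]

    result = []
    for line in record_lines:
        if not line.startswith('?'):
            continue  # skip sequence lines
        row = line.strip().split('|')
        key = row[0][1:]  # drop the leading '?'
        if key in ids:
            result.append('|'.join(row + [ids[key]]))
    return result


file1 = ["23 dog 4", "24 cat 5", "28 cow 7"]
file2 = [
    "?dog|parallel|numbering|position",
    "Dogsarebarking",
    "?cat|parallel|nuucers|position",
    "CatisBeautiful",
]
for out_line in append_ids(file1, file2):
    print(out_line)
```

Rows without a match, such as cow, are silently dropped, which mirrors the behavior of the file-based version.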

score 3 · Accepted Answer

This is the kind of task the pandas library excels at:

import pandas as pd

df1 = pd.read_csv("c1.txt", sep="|", header=None).dropna()
df2 = pd.read_csv("c2.txt", sep=" ", header=None)
# join on column 1 (the name), then drop the last column
merged = df1.merge(df2, on=1).iloc[:, :-1]
merged.to_csv("merged.csv", sep="|", header=False, index=False)

Some explanation follows. First, we read the files into objects called DataFrames:

>>> df1 = pd.read_csv("c1.txt", sep="|", header=None).dropna()
>>> df1
               0      1          2         3
0      ?parallel    dog  numbering  position
3      ?parallel    cat    nuucers  position
6  ?non parallel  honey  numbering  position
>>> df2 = pd.read_csv("c2.txt", sep=" ", header=None)
>>> df2
    0    1  2
0  23  dog  4
1  24  cat  5
2  28  cow  7

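.dropna() drops the rows with missing values: the sequence lines contain no '|', so their remaining columns are empty. Alternatively, df1 = df1[df1[0].str.startswith("?")] would be another way.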
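As a quick check that the two filters select the same rows, here is a small sketch that needs pandas installed. The in-memory text is a hypothetical stand-in for c1.txt, and the names raw, by_dropna, and by_prefix are illustrative only.

```python
import io

import pandas as pd

# Hypothetical stand-in for c1.txt: header lines followed by sequence lines.
sample = io.StringIO(
    "?parallel|dog|numbering|position\n"
    "Dogsarebarking\n"
    "?parallel|cat|nuucers|position\n"
    "CatisBeautiful\n"
)
raw = pd.read_csv(sample, sep="|", header=None)

# Sequence lines have no '|', so columns 1-3 are NaN and dropna() removes them.
by_dropna = raw.dropna()

# Alternative: keep only the rows whose first field starts with '?'.
by_prefix = raw[raw[0].str.startswith("?")]

print(by_dropna)
print(by_prefix)
```

The prefix filter states the intent more directly, while dropna() would also discard a header row that happened to have an empty field.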

Then we merge them on column 1 (the name column in both frames):

>>> df1.merge(df2, on=1)
         0_x    1        2_x         3  0_y  2_y
0  ?parallel  dog  numbering  position   23    4
1  ?parallel  cat    nuucers  position   24    5

We don't need the last column, so we slice it off:

>>> df1.merge(df2, on=1).iloc[:, :-1]
         0_x    1        2_x         3  0_y
0  ?parallel  dog  numbering  position   23
1  ?parallel  cat    nuucers  position   24

Then we write it out with to_csv, producing:

>>> !cat merged.csv
?parallel|dog|numbering|position|23
?parallel|cat|nuucers|position|24

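Now, for many simple tasks pandas can be overkill, and it's important to learn how to use lower-level tools like the csv module too. OTOH, when you just want to get something done right now(tm), it's very, very handy.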
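For comparison, here is a minimal sketch of the same join using only the standard library's csv module. It assumes the layout shown in this answer (the name in the second field of both files) and uses in-memory text instead of c1.txt and c2.txt; the sample data and variable names are illustrative.

```python
import csv
import io

# Hypothetical stand-ins for c1.txt and c2.txt.
c1 = io.StringIO(
    "?parallel|dog|numbering|position\n"
    "Dogsarebarking\n"
    "?parallel|cat|nuucers|position\n"
    "CatisBeautiful\n"
)
c2 = io.StringIO("23 dog 4\n24 cat 5\n28 cow 7\n")

# Build a lookup from name (second field) to ID (first field).
ids = {row[1]: row[0] for row in csv.reader(c2, delimiter=" ") if len(row) >= 2}

merged = io.StringIO()
writer = csv.writer(merged, delimiter="|", lineterminator="\n")
for row in csv.reader(c1, delimiter="|"):
    # Header rows have four fields; sequence lines have only one.
    if len(row) == 4 and row[1] in ids:
        writer.writerow(row + [ids[row[1]]])

print(merged.getvalue(), end="")
```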

score 0 · Accepted Answer

This seems to be exactly what JOIN in a relational database is for.

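An inner join is the most common join operation used in applications and can be regarded as the default join type. An inner join creates a new result table by combining the column values of two tables (A and B) based on the join predicate. The query compares each row of A with each row of B to find all pairs of rows that satisfy the join predicate. When the join predicate is satisfied, the column values of each matched pair of rows of A and B are combined into a result row.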
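The definition above can be sketched literally in plain Python as a nested loop over both tables. This is a naive illustration of the concept, not how a database engine actually executes a join; the function name inner_join and the sample rows are assumptions.

```python
def inner_join(table_a, table_b, predicate):
    """Naive inner join: compare every row of A with every row of B."""
    result = []
    for row_a in table_a:
        for row_b in table_b:
            if predicate(row_a, row_b):
                # Combine the column values of the matching pair.
                result.append(row_a + row_b)
    return result


a = [("23", "dog", "4"), ("24", "cat", "5"), ("28", "cow", "7")]
b = [("?parallel", "dog"), ("?parallel", "cat")]

# Join predicate: the second column of both tables must be equal.
joined = inner_join(a, b, lambda row_a, row_b: row_a[1] == row_b[1])
print(joined)
```

The cow row has no partner in b, so it does not appear in the result.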

Take a look at this example:

import sqlite3
conn = sqlite3.connect('example.db')

# get a cursor for the database
c = conn.cursor()

# create and populate table1 (blank lines are skipped)
c.execute("DROP TABLE IF EXISTS table1")
c.execute("CREATE TABLE table1 (col1 text, col2 text, col3 text)")
with open("file1") as f:
    for line in f:
        fields = line.split()
        if len(fields) == 3:
            c.execute("INSERT INTO table1 VALUES (?, ?, ?)", fields)

# create and populate table2 (sequence lines without '|' are skipped)
c.execute("DROP TABLE IF EXISTS table2")
c.execute("CREATE TABLE table2 (col1 text, col2 text, col3 text, col4 text)")
with open("file2") as f:
    for line in f:
        fields = line.strip().split('|')
        if len(fields) == 4:
            c.execute("INSERT INTO table2 VALUES (?, ?, ?, ?)", fields)

# make changes persistent
conn.commit()

# retrieve desired data and write it to file
with open("file3", "w") as f:
    for x in c.execute(
        """
        SELECT table2.col1
             , table2.col2
             , table2.col3
             , table2.col4
             , table1.col1
        FROM table1 JOIN table2 ON table1.col2 = table2.col2
        """):
        f.write("%s\n" % "|".join(x))

# close connection
conn.close()

The output file looks like this:

paralle|dog|numbering|position|23
parallel|cat|nuucers|position|24
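To experiment with the join without creating example.db or any input files, a minimal sketch with an in-memory database might look like the following. The inline sample rows assume the name sits in the second column of both tables, and the ORDER BY clause is added here only to make the row order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # nothing is written to disk
conn.execute("CREATE TABLE table1 (col1 TEXT, col2 TEXT, col3 TEXT)")
conn.execute("CREATE TABLE table2 (col1 TEXT, col2 TEXT, col3 TEXT, col4 TEXT)")

# Hypothetical sample rows standing in for file1 and file2.
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?, ?)",
    [("23", "dog", "4"), ("24", "cat", "5"), ("28", "cow", "7")],
)
conn.executemany(
    "INSERT INTO table2 VALUES (?, ?, ?, ?)",
    [
        ("?parallel", "dog", "numbering", "position"),
        ("?parallel", "cat", "nuucers", "position"),
    ],
)

rows = conn.execute(
    """
    SELECT table2.col1, table2.col2, table2.col3, table2.col4, table1.col1
    FROM table1 JOIN table2 ON table1.col2 = table2.col2
    ORDER BY table1.col1
    """
).fetchall()
conn.close()

lines = ["|".join(row) for row in rows]
for out_line in lines:
    print(out_line)
```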
