0

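I want to do some simple database operations on a single table without using database software. For example, I can use the "filo" package from GitHub to do something similar to a "group by". Is there something similar that implements a simple "join"? Or can I do this with Python or Bash? Specifically, I have a table like this: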

Col5a2  NM_007737   chr1    -   45447828    45447829
Slc40a1 NM_016917   chr1    -   45870140    45870141
Gm3852  NM_001177356    chr1    -   45956809    45956810
Slc39a10    NM_172653   chr1    -   46798055    46798056
Obfc2a  NM_028696   chr1    -   51422944    51422945
Myo1b   NM_001161817,NM_010863  chr1    -   51860519    51860520
.
.
.

and I have a list:

Slc40a1
Myo1b
Col5a2
Obfc2a
.
.
.

I want to pull the rows for the items in the list out of the table, so that I get:

Slc40a1 NM_016917   chr1    -   45870140    45870141
Myo1b   NM_001161817,NM_010863  chr1    -   51860519    51860520
Col5a2  NM_007737   chr1    -   45447828    45447829
Obfc2a  NM_028696   chr1    -   51422944    51422945
.
.
.

5 Answers

3

Here's one way using awk:

awk 'FNR==NR { a[$1]=$0; next } $1 in a { print a[$1] }' table list

or with formatting:

awk 'FNR==NR { a[$1]=$0; next } $1 in a { print a[$1] }' table list | column -t

Results:

Slc40a1  NM_016917               chr1  -  45870140  45870141
Myo1b    NM_001161817,NM_010863  chr1  -  51860519  51860520
Col5a2   NM_007737               chr1  -  45447828  45447829
Obfc2a   NM_028696               chr1  -  51422944  51422945

Explanation:

'FNR==NR { ... }' is a conditional that is only true for the first file in the argument list.

So for each line in the file called 'table', the first column ($1) is used as a key in an array (called 'a'), and the whole line ($0) is stored as its value. 'next' then skips the remainder of the code and jumps to the next line of input, until all lines in the 'table' file have been processed.

'$1 in a' is another conditional.

This asks whether column one of the 'list' file is a key in the array. If it is, print the whole table line that was stored under that key (a[$1]).
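
For readers who find the one-liner terse, here is a rough Python sketch of the same two-pass logic. It keeps `nr` and `fnr` counters by hand to show why `FNR==NR` holds only while the first file is being read. The function name `awk_like_join` and the in-memory line lists are illustrative assumptions, not part of the awk command; unlike awk, the sketch skips blank lines.

```python
def awk_like_join(files):
    """Mimic: awk 'FNR==NR { a[$1]=$0; next } $1 in a { print a[$1] }' table list

    `files` is a list of line lists, processed in order like awk's arguments.
    """
    a = {}    # first column -> whole table line
    out = []  # lines that awk would print
    nr = 0    # NR: lines read so far across all files
    for lines in files:
        for fnr, line in enumerate(lines, 1):  # FNR restarts at 1 for each file
            nr += 1
            fields = line.split()
            if not fields:
                continue  # skip blank lines (a simplification)
            if fnr == nr:
                # Only possible while reading the first file.
                a[fields[0]] = line.rstrip("\n")
                continue  # plays the role of awk's 'next'
            if fields[0] in a:
                out.append(a[fields[0]])
    return out


table = [
    "Col5a2 NM_007737 chr1 - 45447828 45447829",
    "Slc40a1 NM_016917 chr1 - 45870140 45870141",
    "Gm3852 NM_001177356 chr1 - 45956809 45956810",
]
wanted = ["Slc40a1", "Col5a2"]
for row in awk_like_join([table, wanted]):
    print(row)
```

As in awk, the `fnr == nr` test also holds for the second file when the first file is empty, so the trick assumes a non-empty table.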

answered 2012-12-04T23:09:22.863
3

You can indeed accomplish this with two standard unix tools, join(1) and sort(1):

$ join <(sort table) <(sort list)

Col5a2 NM_007737 chr1 - 45447828 45447829
Myo1b NM_001161817,NM_010863 chr1 - 51860519 51860520
Obfc2a NM_028696 chr1 - 51422944 51422945
Slc40a1 NM_016917 chr1 - 45870140 45870141

The calls to sort are needed because (from the join man page):

Important: FILE1 and FILE2 must be sorted on the join fields. E.g., use 'sort -k 1b,1' if 'join' has no options. Note, comparisons honor the rules specified by 'LC_COLLATE'. If the input is not sorted and some lines cannot be joined, a warning message will be given.

Update: A solution inspired by this answer, keeping order:

$ join -1 2 <(cat -n list | sort -k2,2) <(sort table) | sort -nk2,2 | cut -d\  -f1,3-
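
The pipeline above is a decorate, sort, join, re-sort, undecorate sequence. A minimal Python sketch of that idea follows, assuming keys are unique in both inputs; `ordered_join` and its arguments are hypothetical names used only for illustration.

```python
def ordered_join(table_lines, list_lines):
    """Join table rows to list keys, keeping the order of the list."""
    # Decorate each key with its position (like `cat -n list`), then sort by key.
    numbered = sorted(
        (key.strip(), pos)
        for pos, key in enumerate(list_lines, 1)
        if key.strip()
    )
    # Sort table rows by their first field (like `sort table`).
    rows = sorted(line.split() for line in table_lines if line.strip())

    # Merge join over the two sorted sequences (like `join`).
    joined = []
    i = j = 0
    while i < len(numbered) and j < len(rows):
        key, pos = numbered[i]
        if key == rows[j][0]:
            joined.append((pos, rows[j]))
            i += 1
            j += 1
        elif key < rows[j][0]:
            i += 1  # key has no matching table row
        else:
            j += 1  # table row was not requested

    # Restore list order (like `sort -nk2,2`) and drop the position (like `cut`).
    joined.sort(key=lambda item: item[0])
    return [" ".join(row) for _, row in joined]
```

Python compares strings by code point, which roughly corresponds to sorting under `LC_ALL=C` in the shell.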
answered 2012-12-04T23:10:21.143
1

If you're only doing very simple lookups by the table's first column, a Python dict may be all the data structure you need.

Build it like this:

table = {}
with open(table_file) as f:
    for line in f:
        row = line.split()
        table[row[0]] = row

You can then do your "join" against this dictionary using a list comprehension:

results = [table[key] for key in keys_list]

Or if your second list is also a data file, you can do this instead:

with open(second_file) as f:
    results = [table[line.strip()] for line in f]
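
Note that a plain `table[key]` lookup raises `KeyError` when a key from the list is not in the table. A small sketch that collects the missing keys instead, assuming whitespace-separated columns; `lookup_rows` is a hypothetical helper name:

```python
def lookup_rows(table_lines, key_lines):
    """Return (found_rows, missing_keys), with rows in the order of the keys."""
    table = {}
    for line in table_lines:
        row = line.split()
        if row:  # ignore blank lines
            table[row[0]] = row

    found, missing = [], []
    for line in key_lines:
        key = line.strip()
        if not key:
            continue
        if key in table:
            found.append(table[key])
        else:
            missing.append(key)
    return found, missing
```

Both arguments can be open file objects or plain lists of strings, since the function only iterates over lines.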
answered 2012-12-04T23:00:23.687
0
import numpy as np

data = []
with open("file1.txt") as f:  # the data file
    for row in f:
        data.append(row.split())
with open("file2.txt") as f:  # the keys file
    keys = [line.strip() for line in f]
np_data = np.array(data, dtype=str)  # assumes every row has the same number of columns
mask = np.isin(np_data[:, 0], keys)  # True where the first column matches a key
print(np_data[mask])                 # matching rows, in table order

This is how I would do it if I really didn't want to use a database.
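
One thing to keep in mind with the mask approach: matching rows come back in table order, not in the order of the keys file. A standard-library sketch of the same membership filter, without NumPy (`filter_rows` is a hypothetical name):

```python
def filter_rows(table_lines, key_lines):
    """Keep table rows whose first column is among the keys (table order)."""
    keys = {line.strip() for line in key_lines if line.strip()}
    rows = (line.split() for line in table_lines)
    return [row for row in rows if row and row[0] in keys]
```

If the result has to follow the order of the keys file instead, a dict keyed on the first column is the more natural structure.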

answered 2012-12-04T23:00:00.350
0

You can use pandas, which works like a database for "select" and pivot operations. Here is a working example; note that it stores each gene as a column rather than as a row:

import pandas as pd
data = pd.DataFrame({ 
        'Col5a2'  : ['NM_007737',    '-',         'chr1', 45447828, 45447829],
        'Slc40a1' : ['NM_016917',    '-',         'chr1', 45870140, 45870141],
        'Gm3852'  : ['NM_001177356', '-',         'chr1', 45956809, 45956810],
        'Slc39a10': ['NM_172653',    '-',         'chr1', 46798055, 46798056],
        'Obfc2a'  : ['NM_028696',    '-',         'chr1', 51422944, 51422945],
        'Myo1b'   : ['NM_001161817', 'NM_010863', 'chr1', 51860519, 51860520],
    })
data.get(['Slc40a1', 'Myo1b', 'Col5a2', 'Obfc2a'])

# Output:

     Slc40a1         Myo1b     Col5a2     Obfc2a
0  NM_016917  NM_001161817  NM_007737  NM_028696
1          -     NM_010863          -          -
2       chr1          chr1       chr1       chr1
3   45870140      51860519   45447828   51422944
4   45870141      51860520   45447829   51422945
answered 2012-12-04T23:21:58.560