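I have a MySQL table with 7 columns (chr, pos, num, iA, iB, iC, iD) and a file with 40 million lines, each containing one data set. Each line has 4 tab-separated columns. The first three columns always contain data, while the fourth can contain up to four different key=value pairs separated by semicolons: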

chr   pos   num   info
1     10203 3     iA=0.34;iB=nerv;iC=45;iD=dskf12586
1     10203 4     iA=0.44;iC=45;iD=dsf12586;iB=nerv
1     10203 5     
1     10213 1     iB=nerv;iC=49;iA=0.14;iD=dskf12586
1     10213 2     iA=0.34;iB=nerv;iD=cap1486
1     10225 1     iD=dscf12586

The key=value pairs in the info column are in no particular order. I am also not sure whether a key can appear twice (I hope not).
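Whether a key ever repeats can be checked before importing. The following is a minimal sketch, not part of the original question: the function name report_duplicate_keys is hypothetical, and it assumes the file is tab-separated with a single header line, as described above.

```shell
#!/bin/sh
# report_duplicate_keys FILE  (hypothetical helper name)
# Prints "line <n>: duplicate key <k>" for each key that occurs more than
# once in the info column of a line.
# Assumption: tab-separated input with one header line.
report_duplicate_keys() {
  awk -F'\t' '
    NR > 1 && length($4) {
      split("", seen)                  # portable way to empty the array
      n = split($4, pairs, ";")
      for (i = 1; i <= n; i++) {
        if (pairs[i] == "") continue   # ignore empty pieces such as a trailing ";"
        eq = index(pairs[i], "=")
        key = (eq > 0) ? substr(pairs[i], 1, eq - 1) : pairs[i]
        if (key in seen)
          printf "line %d: duplicate key %s\n", NR, key
        seen[key] = 1
      }
    }
  ' "$1"
}
```

If the function prints nothing, no line of the file repeats a key. The reported line numbers count the header as line 1.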

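I want to write the data into the database. The first three columns are no problem, but extracting the values from the info column puzzles me, because the key=value pairs are unordered and not every key has to be present in a line. For a similar data set (with an ordered info column) I used a Java program with regular expressions, which allowed me to (1) check and (2) extract the data, but now I am stuck.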

How can I solve this task, preferably with a bash script or directly in MySQL?

2 Answers

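You did not say exactly how you want to write the data, but the awk example below shows how to get each individual key and value on every line. Instead of the printf, you can use your own logic to write the data.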

[[bash_prompt$]]$ cat test.sh; echo "###########"; awk -f test.sh log
{
  # Only process lines whose fourth (info) column is non-empty.
  if (length($4)) {
    # Split the info column into its key=value pairs.
    split($4, array, ";");
    print "In " $1, $2, $3;
    # Note: for-in visits the pairs in an unspecified order.
    for (element in array) {
      # Position of the first "=" in the pair.
      eq = index(array[element], "=");
      # The key is everything before "="; substr positions start at 1.
      key = substr(array[element], 1, eq - 1);
      value = substr(array[element], eq + 1);
      printf("found %s key and %s value for %d line from %s\n", key, value, NR, array[element]);
    }
  }
}
###########
In 1 10203 3
found iD key and dskf12586 value for 1 line from iD=dskf12586
found iA key and 0.34 value for 1 line from iA=0.34
found iB key and nerv value for 1 line from iB=nerv
found iC key and 45 value for 1 line from iC=45
In 1 10203 4
found iB key and nerv value for 2 line from iB=nerv
found iA key and 0.44 value for 2 line from iA=0.44
found iC key and 45 value for 2 line from iC=45
found iD key and dsf12586 value for 2 line from iD=dsf12586
In 1 10213 1
found iD key and dskf12586 value for 4 line from iD=dskf12586
found iB key and nerv value for 4 line from iB=nerv
found iC key and 49 value for 4 line from iC=49
found iA key and 0.14 value for 4 line from iA=0.14
In 1 10213 2
found iA key and 0.34 value for 5 line from iA=0.34
found iB key and nerv value for 5 line from iB=nerv
found iD key and cap1486 value for 5 line from iD=cap1486
In 1 10225 1
found iD key and dscf12586 value for 6 line from iD=dscf12586
answered 2013-05-14T08:37:32.363

An awk solution building on @abasu's answer. It generates INSERT statements and also handles the unordered key=value pairs.

parse.awk:

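# Skip the header line.
NR>1 {
  # Default every info column to SQL NULL for this row.
  col["iA"]=col["iB"]=col["iC"]=col["iD"]="null";

  if(length($4)) {
    # Split the info column into key=value pairs; their order does not matter.
    split($4,array,";");
    for(element in array) {
      # keyval[1] is the key, keyval[2] the value.
      split(array[element],keyval,"=");
      # Quote the value for SQL; values containing ' or = would need extra handling.
      col[keyval[1]] = "'" keyval[2] "'";
    }
  }
  print "INSERT INTO tbl VALUES (" $1 "," $2 "," $3 "," col["iA"] "," col["iB"] "," col["iC"] "," col["iD"] ");";
}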

Test run:

$ awk -f parse.awk file
INSERT INTO tbl VALUES (1,10203,3,'0.34','nerv','45','dskf12586');
INSERT INTO tbl VALUES (1,10203,4,'0.44','nerv','45','dsf12586');
INSERT INTO tbl VALUES (1,10203,5,null,null,null,null);
INSERT INTO tbl VALUES (1,10213,1,'0.14','nerv','49','dskf12586');
INSERT INTO tbl VALUES (1,10213,2,'0.34','nerv',null,'cap1486');
INSERT INTO tbl VALUES (1,10225,1,null,null,null,'dscf12586');
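
With 40 million lines, a bulk load may be preferable to running one INSERT per row. The sketch below is a variation on the script above and was not part of the original answer. It writes tab-separated rows in which a missing key becomes \N, which MySQL's LOAD DATA reads as NULL under its default field options. The function name info_to_tsv and the names data.txt, data.tsv and dbname are hypothetical. It assumes tab-separated input with one header line, and values that contain neither tabs nor backslashes.

```shell
#!/bin/sh
# info_to_tsv FILE  (hypothetical helper name)
# Writes one tab-separated row per data line: chr, pos, num, iA, iB, iC, iD.
# Keys missing from the info column are written as \N (NULL for LOAD DATA).
# Assumption: values contain no tab or backslash characters.
info_to_tsv() {
  awk -F'\t' -v OFS='\t' '
    NR > 1 {
      col["iA"] = col["iB"] = col["iC"] = col["iD"] = "\\N"
      n = split($4, pairs, ";")
      for (i = 1; i <= n; i++) {
        eq = index(pairs[i], "=")      # split at the first "=" only
        if (eq > 1)
          col[substr(pairs[i], 1, eq - 1)] = substr(pairs[i], eq + 1)
      }
      print $1, $2, $3, col["iA"], col["iB"], col["iC"], col["iD"]
    }
  ' "$1"
}

# Possible usage (file and database names are placeholders; LOCAL loading
# must be enabled on both the client and the server):
#   info_to_tsv data.txt > data.tsv
#   mysql --local-infile=1 -e "LOAD DATA LOCAL INFILE 'data.tsv' INTO TABLE tbl" dbname
```

Splitting at the first "=" with index keeps values that themselves contain "=" intact, which the keyval split above would truncate.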
answered 2013-05-14T10:21:00.303