shell - Use an awk loop to subset a file

Question

I have a file with lots of pieces of information that I want to split on the first column.

Example (example.gen):

1 rs3094315 752566 A G 0 1 0 1 0 0 1 0 0 0 1 0 0 1
1 rs2094315 752999 A G 0 1 0 1 0 0 1 0 0 0 1 0 0 1
2 rs3044315 759996 A G 0 1 0 1 0 0 1 0 0 0 1 0 0 1
2 rs3054375 799966 A G 0 1 0 1 0 0 1 0 0 0 1 0 0 1
2 rs3094375 999566 A G 0 1 0 1 0 0 1 0 0 0 1 0 0 1
3 rs3078315 799866 A G 0 1 0 1 0 0 1 0 0 0 1 0 0 1
3 rs4054315 759986 A G 0 1 0 1 0 0 1 0 0 0 1 0 0 1
4 rs4900215 752998 A G 0 1 0 1 0 0 1 0 0 0 1 0 0 1
4 rs5094315 759886 A G 0 1 0 1 0 0 1 0 0 0 1 0 0 1
4 rs6094315 798866 A G 0 1 0 1 0 0 1 0 0 0 1 0 0 1

Desired output:

Chr1.gen

1 rs3094315 752566 A G 0 1 0 1 0 0 1 0 0 0 1 0 0 1
1 rs2094315 752999 A G 0 1 0 1 0 0 1 0 0 0 1 0 0 1

Chr2.gen

2 rs3044315 759996 A G 0 1 0 1 0 0 1 0 0 0 1 0 0 1
2 rs3054375 799966 A G 0 1 0 1 0 0 1 0 0 0 1 0 0 1
2 rs3094375 999566 A G 0 1 0 1 0 0 1 0 0 0 1 0 0 1

Chr3.gen

3 rs3078315 799866 A G 0 1 0 1 0 0 1 0 0 0 1 0 0 1
3 rs4054315 759986 A G 0 1 0 1 0 0 1 0 0 0 1 0 0 1

Chr4.gen

4 rs4900215 752998 A G 0 1 0 1 0 0 1 0 0 0 1 0 0 1
4 rs5094315 759886 A G 0 1 0 1 0 0 1 0 0 0 1 0 0 1
4 rs6094315 798866 A G 0 1 0 1 0 0 1 0 0 0 1 0 0 1

I've tried to do this with the following shell scripts, but they don't work: I can't work out how to get awk to recognise a variable defined outside the awk script itself.

First script attempt (no awk loop):

for i in {1..23}
do
    awk '{$1 = $i}' example.gen > Chr$i.gen
done

Second script attempt (with awk loop):

for i in {1..23}
do
    awk '{for (i = 1; i <= 23; i++) $1 = $i}' example.gen > Chr$i.gen
done

I'm sure it's probably quite basic, but I just can't work it out...

Thank you!

score 3 · Accepted Answer

With awk:

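awk '{print > ("Chr" $1 ".gen")}' file

This prints each line and redirects it to a file whose name is built by concatenating "Chr", the first column, and ".gen". The parentheses around the file name keep the concatenation unambiguous, since some awk implementations parse an unparenthesized expression after > differently.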

With your sample input it creates 4 files. For example the 4th is:

$ cat Chr4.gen 
4 rs4900215 752998 A G 0 1 0 1 0 0 1 0 0 0 1 0 0 1
4 rs5094315 759886 A G 0 1 0 1 0 0 1 0 0 0 1 0 0 1
4 rs6094315 798866 A G 0 1 0 1 0 0 1 0 0 0 1 0 0 1
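
If the file has many distinct values in the first column, awk may hit the limit on simultaneously open files. A minimal sketch of one way around that, assuming the input is already grouped by the first column (as in the sample): close each output file when the key changes. The temporary directory and the shortened sample data are only there to make the sketch self-contained; the variable names prev and out are illustrative.

```shell
# Work in a scratch directory with a shortened copy of the question's sample.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
cat > example.gen <<'EOF'
1 rs3094315 752566 A G 0 1 0 1 0 0 1 0 0 0 1 0 0 1
1 rs2094315 752999 A G 0 1 0 1 0 0 1 0 0 0 1 0 0 1
2 rs3044315 759996 A G 0 1 0 1 0 0 1 0 0 0 1 0 0 1
4 rs4900215 752998 A G 0 1 0 1 0 0 1 0 0 0 1 0 0 1
EOF

awk '
  NF == 0 { next }             # skip blank lines, which have no first column
  $1 != prev {                 # first column changed: switch output file
      if (out != "") close(out)
      out = "Chr" $1 ".gen"
      prev = $1
  }
  { print > out }              # assumes grouped input: each file is opened once
' example.gen
```

If the input is not grouped, a key that reappears later would reopen its file with > and truncate it, so sort the input on the first column beforehand.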

score 3 · Answer

First, use @fedorqui's answer, as that is best. But to understand the mistake you made with your first attempt (which was close), read on.

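Your first attempt failed because you put the test inside the action (in the braces) rather than before it as a pattern. In addition, inside single quotes the shell never expands $i, so awk reads it as a field reference using its own, unset variable i; = is assignment, not the comparison ==; and the action contains no print, so nothing is written. The minimal fix: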

awk "\$1 == $i" example.gen > Chr$i.gen

This uses double quotes so that the shell expands $i before awk sees the script, but you then have to escape the dollar sign in $1 so that the shell doesn't substitute its own first positional parameter. Cleaner but longer:

awk -v i="$i" '$1 == i' example.gen > "Chr$i.gen"

This creates a variable i inside the awk script with the same value as the shell's i variable.
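
For completeness, here is a sketch of the -v form put back into a loop over 1 to 23, the range from the question. It uses a plain while loop instead of {1..23} so that it also runs in shells without brace expansion, and it deletes outputs that end up empty; both choices are assumptions of this sketch, not part of the original answer. The scratch directory and shortened sample data are only there to make it self-contained.

```shell
# Work in a scratch directory with a shortened copy of the question's sample.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
cat > example.gen <<'EOF'
1 rs3094315 752566 A G 0 1 0 1 0 0 1 0 0 0 1 0 0 1
2 rs3044315 759996 A G 0 1 0 1 0 0 1 0 0 0 1 0 0 1
4 rs4900215 752998 A G 0 1 0 1 0 0 1 0 0 0 1 0 0 1
4 rs5094315 759886 A G 0 1 0 1 0 0 1 0 0 0 1 0 0 1
EOF

i=1
while [ "$i" -le 23 ]; do
    # Pass the shell value into awk; the pattern alone prints matching lines.
    awk -v i="$i" '$1 == i' example.gen > "Chr$i.gen"
    # The redirection always creates the file, so drop it if nothing matched.
    [ -s "Chr$i.gen" ] || rm -f "Chr$i.gen"
    i=$((i + 1))
done
```

Note that this reads example.gen 23 times, whereas the single awk command in the accepted answer reads it once.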
