r - Split a data-frame based on an ordered multi-level factor column

Question

I would like to split a data-frame into a list of data-frames. The rows always come in the same order: father, followed by mother, followed by offspring. However, each family member may occupy more than one row, and those rows are always consecutive (e.g. father number 1 is in rows 1 and 2). My example below contains two families, so I am trying to get a list with two data-frames.

My input:

df <- 'Chr  Start   End Family
1   187546286   187552094   father
3   108028534   108032021   father
1   4864403 4878685 mother
1   18898657    18904908    mother
2   460238  461771  offspring
3   108028534   108032021   offspring
1   71481449    71532983    father
2   74507242    74511395    father
2   181864092   181864690   mother
1   71481449    71532983    offspring
2   181864092   181864690   offspring
3   160057791   160113642   offspring'

df <- read.table(text = df, header = TRUE)

Thus, my expected output dfout[[1]] would look like:

dfout <- 'Chr   Start   End Family
1   187546286   187552094   father
3   108028534   108032021   father
1   4864403 4878685 mother
1   18898657    18904908    mother
2   460238  461771  offspring
3   108028534   108032021   offspring'

dfout <- read.table(text = dfout, header = TRUE)

score 1 · Accepted Answer

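To split each family into its own data frame, you need an index that marks where one family ends and the next begins. For the index, I use "father" as the change point. But we can't simply use indx <- df$Family == "father", because there can be several consecutive "father" rows. Instead, we test where the data switch from "offspring" to "father": diff() of the logical vector equals 1 exactly at those positions, and the leading 1L marks the first row as the start of the first family.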

indx <- cumsum(c(1L, diff(df$Family == "father")) == 1L)
split(df, indx)
# $`1`
#   Chr     Start       End    Family
# 1   1 187546286 187552094    father
# 2   3 108028534 108032021    father
# 3   1   4864403   4878685    mother
# 4   1  18898657  18904908    mother
# 5   2    460238    461771 offspring
# 6   3 108028534 108032021 offspring
# 
# $`2`
#    Chr     Start       End    Family
# 7    1  71481449  71532983    father
# 8    2  74507242  74511395    father
# 9    2 181864092 181864690    mother
# 10   1  71481449  71532983 offspring
# 11   2 181864092 181864690 offspring
# 12   3 160057791 160113642 offspring
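To see why the one-liner works, it may help to build the index one step at a time. The following is only a sketch on a small made-up vector standing in for df$Family; the names family, is_father, change and starts are illustrative and not part of the original answer.

```r
# Toy stand-in for df$Family (illustrative values only)
family <- c("father", "father", "mother", "offspring",
            "father", "mother", "mother", "offspring", "offspring")

is_father <- family == "father"   # TRUE on every father row
change    <- diff(is_father)      # 1 where a non-father row is followed by a father row
starts    <- c(1L, change) == 1L  # TRUE only on the first row of each family
indx      <- cumsum(starts)       # running family number

indx
# [1] 1 1 1 1 2 2 2 2 2
```

The second "father" row of a family gives a diff of 0, so it does not start a new group. Using df$Family == "father" directly as the grouping would not separate the families at all.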

score 0 · Accepted Answer

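It would be more helpful if you posted the code used to generate your actual data frame. I don't have time to redo everything, but I'll show you how it works in a general sense.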

gender <- c("M","M","F","F","F","F","M","M","M","M","F","F")
values <- c(20,22,24,19,9,17,18,22,12,14,7,8)
fruit <- c("apple","pear","mango","mango","mango","apple","banana","banana","banana","mango","apple","apple")
df <- data.frame(gender, values, fruit)


> df
   gender values  fruit
1       M     20  apple
2       M     22   pear
3       F     24  mango
4       F     19  mango
5       F      9  mango
6       F     17  apple
7       M     18 banana
8       M     22 banana
9       M     12 banana
10      M     14  mango
11      F      7  apple
12      F      8  apple

split(df, df$gender)

$F
   gender values fruit
3       F     24 mango
4       F     19 mango
5       F      9 mango
6       F     17 apple
11      F      7 apple
12      F      8 apple

$M
   gender values  fruit
1       M     20  apple
2       M     22   pear
7       M     18 banana
8       M     22 banana
9       M     12 banana
10      M     14  mango
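Note that splitting on the column values themselves groups rows by level, not by consecutive block. Applied to data shaped like the question's, this would likely give one data frame per role rather than one per family. The sketch below contrasts the two on a made-up data frame; fam, by_role and by_family are hypothetical names used only for illustration.

```r
# Made-up data with the same father/mother/offspring ordering as the question
fam <- data.frame(
  id     = 1:8,
  Family = c("father", "father", "mother", "offspring",
             "father", "mother", "offspring", "offspring")
)

# Splitting on the column itself: one element per role
by_role <- split(fam, fam$Family)
names(by_role)
# [1] "father"    "mother"    "offspring"

# Splitting on a run-based index: one element per family
indx <- cumsum(c(1L, diff(fam$Family == "father")) == 1L)
by_family <- split(fam, indx)
names(by_family)
# [1] "1" "2"
```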
