r - Parsing a .txt file to read into a data.frame

Question

p<-readLines("output.txt")

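After reading the txt file, I need to format the data into a data frame with a fixed number of columns. Any letters, and any lines that do not have 21 columns, need to be removed. I am doing the following to strip out the dashes and any letters.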

p<-gsub("-","",p)
p<-gsub("[A-Za-z]","",p)

System configuration: lcpu=96 mem=196608MB ent=16.00

   kthr            memory                         page                       faults                 cpu             time  
----------- --------------------- ------------------------------------ ------------------ ----------------------- --------
  r   b   p        avm        fre    fi    fo    pi    po    fr     sr    in     sy    cs us sy id wa    pc    ec hr mi se
 19   0   0   21337487    7123470     0   201     0     0     0      0  3576  66723 30304 19  4 77  0  5.97  37.3 00:02:30
 27   0   0   21337431    7121069     0   123     0     0     0      0  4298  81526 36157 19  4 78  0  5.61  35.1 00:03:00
 18   0   0   21333631    7122351     0   195     0     0     0      0  3696  65163 30794 23  4 74  0  6.49  40.6 00:03:30
 19   0   0   21333590    7119082     0   194     0     0     0      0  5217 102823 47621 27  5 68  0  7.79  48.7 00:04:00

   kthr            memory                         page                       faults                 cpu             time  
 ----------- --------------------- ------------------------------------ ------------------ ----------------------- --------
   r   b   p        avm        fre    fi    fo    pi    po    fr     sr    in     sy    cs us sy id wa    pc    ec hr mi se
  20   0   0   21347610    7204383     0   167     0     0     0      0  3645  73642 33333 21  3 75  0  6.21  38.8 00:12:30
  16   0   0   21347576    7201448     0   110     0     0     0      0  4882  84287 40503 23  4 73  0  6.77  42.3 00:13:00

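Once I strip out the unwanted characters, I am left with some empty lines. This is not a data frame yet. How do I get rid of the empty lines here?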

score 3 · Accepted Answer

You can do this with readLines and count.fields.

# path is the path to your data file
read.table(text=readLines(path)[count.fields(path, blank.lines.skip=FALSE) == 21])

#   V1 V2 V3       V4      V5 V6  V7 V8 V9 V10 V11  V12    V13   V14 V15 V16 V17 V18  V19  V20      V21
# 1 19  0  0 21337487 7123470  0 201  0  0   0   0 3576  66723 30304  19   4  77   0 5.97 37.3 00:02:30
# 2 27  0  0 21337431 7121069  0 123  0  0   0   0 4298  81526 36157  19   4  78   0 5.61 35.1 00:03:00
# 3 18  0  0 21333631 7122351  0 195  0  0   0   0 3696  65163 30794  23   4  74   0 6.49 40.6 00:03:30
# 4 19  0  0 21333590 7119082  0 194  0  0   0   0 5217 102823 47621  27   5  68   0 7.79 48.7 00:04:00
# 5 20  0  0 21347610 7204383  0 167  0  0   0   0 3645  73642 33333  21   3  75   0 6.21 38.8 00:12:30
# 6 16  0  0 21347576 7201448  0 110  0  0   0   0 4882  84287 40503  23   4  73   0 6.77 42.3 00:13:00
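
To try this approach without the original file, here is a minimal self-contained sketch. It assumes a shortened sample (`sample_lines`, copied from the question's output) written to a temporary file. The column names in `vmstat_cols` are my own choice based on the header lines in the question: the three "faults" columns get a `faults_` prefix so that `in` (a reserved word in R) and the duplicated `sy` do not get mangled, and the header's `hr mi se` is collapsed into a single `time` column because each data row stores it as one hh:mm:ss value.

```r
# Shortened sample in the same layout as the question's file
sample_lines <- c(
  "System configuration: lcpu=96 mem=196608MB ent=16.00",
  "",
  "   kthr            memory                         page                       faults                 cpu             time  ",
  "----------- --------------------- ------------------------------------ ------------------ ----------------------- --------",
  "  r   b   p        avm        fre    fi    fo    pi    po    fr     sr    in     sy    cs us sy id wa    pc    ec hr mi se",
  " 19   0   0   21337487    7123470     0   201     0     0     0      0  3576  66723 30304 19  4 77  0  5.97  37.3 00:02:30",
  " 27   0   0   21337431    7121069     0   123     0     0     0      0  4298  81526 36157 19  4 78  0  5.61  35.1 00:03:00",
  "",
  " 20   0   0   21347610    7204383     0   167     0     0     0      0  3645  73642 33333 21  3 75  0  6.21  38.8 00:12:30"
)
path <- tempfile(fileext = ".txt")
writeLines(sample_lines, path)

# One field count per physical line; blank lines are not skipped, so the
# counts stay aligned with the result of readLines()
n_fields <- count.fields(path, blank.lines.skip = FALSE)
keep <- !is.na(n_fields) & n_fields == 21

# Assumed names (not from the answer): 21 names for the 21 data fields
vmstat_cols <- c("r", "b", "p", "avm", "fre", "fi", "fo", "pi", "po", "fr",
                 "sr", "faults_in", "faults_sy", "faults_cs",
                 "us", "sy", "id", "wa", "pc", "ec", "time")

vm <- read.table(text = readLines(path)[keep], col.names = vmstat_cols,
                 stringsAsFactors = FALSE)
str(vm)
```

The `!is.na()` guard is there because `count.fields` can return `NA` for lines inside an unterminated quoted string; the sample has none, but an `NA` in a logical index would otherwise produce `NA` elements in the subset.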

score 1 · Accepted Answer

Regular expressions can help.

### For each element of your character vector "text", search for lines where...
  # we start at the beginning of the line, match a blank repeated
  # any number of times (including zero), then reach the end of the line
is_blank <- grepl('^[[:blank:]]*$', text)

### Now that we know which lines contain only blanks, keep all the others
### (a logical mask is also safe when no line matches)
text <- text[!is_blank]
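
One detail worth checking when dropping lines by position: if no line matches, `grep()` returns `integer(0)`, and subsetting with `-integer(0)` selects nothing, so the whole vector comes back empty. A logical mask from `grepl()` does not have that problem. The sketch below illustrates both cases on made-up vectors (`lines_in` and `no_blanks` are hypothetical names).

```r
# Hypothetical input: data lines mixed with empty and whitespace-only lines
lines_in <- c(" 19   0   0", "", "   ", " 27   0   0", "\t")

# TRUE for lines that are empty or contain only spaces/tabs
is_blank <- grepl("^[[:blank:]]*$", lines_in)
cleaned <- lines_in[!is_blank]
print(cleaned)

# The pitfall with negative indexing: nothing matches here, so grep()
# returns integer(0) and the subset is empty
no_blanks <- c("a", "b")
idx <- grep("^[[:blank:]]*$", no_blanks)
length(no_blanks[-idx])                                  # 0: everything dropped
length(no_blanks[!grepl("^[[:blank:]]*$", no_blanks)])   # 2: nothing dropped
```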

score 0 · Accepted Answer

dat <- readLines(textConnection(' 
   kthr            memory                         page                       faults                 cpu             time  
----------- --------------------- ------------------------------------ ------------------ ----------------------- --------
  r   b   p        avm        fre    fi    fo    pi    po    fr     sr    in     sy    cs us sy id wa    pc    ec hr mi se
 19   0   0   21337487    7123470     0   201     0     0     0      0  3576  66723 30304 19  4 77  0  5.97  37.3 00:02:30
 27   0   0   21337431    7121069     0   123     0     0     0      0  4298  81526 36157 19  4 78  0  5.61  35.1 00:03:00
 18   0   0   21333631    7122351     0   195     0     0     0      0  3696  65163 30794 23  4 74  0  6.49  40.6 00:03:30
 19   0   0   21333590    7119082     0   194     0     0     0      0  5217 102823 47621 27  5 68  0  7.79  48.7 00:04:00

   kthr            memory                         page                       faults                 cpu             time  
 ----------- --------------------- ------------------------------------ ------------------ ----------------------- --------
   r   b   p        avm        fre    fi    fo    pi    po    fr     sr    in     sy    cs us sy id wa    pc    ec hr mi se
  20   0   0   21347610    7204383     0   167     0     0     0      0  3645  73642 33333 21  3 75  0  6.21  38.8 00:12:30
  16   0   0   21347576    7201448     0   110     0     0     0      0  4882  84287 40503 23  4 73  0  6.77  42.3 00:13:00'))

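dat <- gsub('-','',dat)                  # drop the dashes of the separator lines
dat <- gsub('[ ]{1,}','|',dat)           # collapse each run of spaces into one "|"
dat <- strsplit(dat,split='\\|')         # split every line into its fields
n <- lengths(dat)                        # fields per line; the leading blank adds one empty field
col.names <- dat[n==24][[1]]             # header lines: 23 names plus the empty first field
dat <- do.call(rbind,dat[n==22])         # data lines: 21 values plus the empty first field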

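This gives you the following character matrix (not yet a data.frame; the first column is empty because every line starts with a blank):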

    [,1] [,2] [,3] [,4] [,5]       [,6]      [,7] [,8]  [,9] [,10] [,11] [,12] [,13]  [,14]    [,15]   [,16] [,17] [,18] [,19] [,20]  [,21] 
[1,] ""   "19" "0"  "0"  "21337487" "7123470" "0"  "201" "0"  "0"   "0"   "0"   "3576" "66723"  "30304" "19"  "4"   "77"  "0"   "5.97" "37.3"
[2,] ""   "27" "0"  "0"  "21337431" "7121069" "0"  "123" "0"  "0"   "0"   "0"   "4298" "81526"  "36157" "19"  "4"   "78"  "0"   "5.61" "35.1"
[3,] ""   "18" "0"  "0"  "21333631" "7122351" "0"  "195" "0"  "0"   "0"   "0"   "3696" "65163"  "30794" "23"  "4"   "74"  "0"   "6.49" "40.6"
[4,] ""   "19" "0"  "0"  "21333590" "7119082" "0"  "194" "0"  "0"   "0"   "0"   "5217" "102823" "47621" "27"  "5"   "68"  "0"   "7.79" "48.7"
[5,] ""   "20" "0"  "0"  "21347610" "7204383" "0"  "167" "0"  "0"   "0"   "0"   "3645" "73642"  "33333" "21"  "3"   "75"  "0"   "6.21" "38.8"
[6,] ""   "16" "0"  "0"  "21347576" "7201448" "0"  "110" "0"  "0"   "0"   "0"   "4882" "84287"  "40503" "23"  "4"   "73"  "0"   "6.77" "42.3"
     [,22]     
[1,] "00:02:30"
[2,] "00:03:00"
[3,] "00:03:30"
[4,] "00:04:00"
[5,] "00:12:30"
[6,] "00:13:00"

I think you still need to convert the data to numeric...
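
One possible way to finish that conversion, sketched on a small made-up matrix `m` with the same shape as the result above: an empty first column, numbers stored as strings, and the time string in the last column. The object names are illustrative and not part of the answer.

```r
# Hypothetical character matrix shaped like the rbind() result:
# empty first column, numeric strings, time in the last column
m <- rbind(c("", "19", "0", "5.97", "00:02:30"),
           c("", "27", "0", "5.61", "00:03:00"))

m <- m[, -1, drop = FALSE]                        # drop the empty leading column
df <- as.data.frame(m, stringsAsFactors = FALSE)  # columns are still character

# Convert every column except the last one (the time) to numeric
num_cols <- seq_len(ncol(df) - 1)
df[num_cols] <- lapply(df[num_cols], as.numeric)

str(df)
```

If the header captured in `col.names` is to be reused as column names, note that it holds 23 names after its empty first element (`hr`, `mi`, `se` are separate), while each data row has 21 values, so the last three names would have to be merged into one first.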
