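I have a file containing a mix of data and text. I want to read it so that only the lines with three coordinates are kept, that is, lines in a format such as 490353.36, 3755632.81, 109.73. In other words, I want to keep the data that follows each SURFACE LINE: marker. The data holds the x, y, and z coordinates of different cross sections.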

The sample data looks like this:

ENDSTREAMNETWORK:

BEGIN CROSS-SECTIONS:

  CROSS-SECTION:
    STREAM ID:Sipsey Fork     
    REACH ID:Sipsey Fork     
    STATION:13.60   
    NODE NAME:                
    CUT LINE:
      490353.358391478 , 3755632.80772044 
      490254.511677942 , 3755640.28160111 
      490229.8 , 3755642.15 
      490205.088314326 , 3755644.01839947 
      490130.953109393 , 3755649.62143546 
    SURFACE LINE:
     490353.36,   3755632.81,   109.73
     490341.00,   3755633.74,   103.63
     490331.74,   3755634.44,   97.54
     490276.13,   3755638.65,   91.44
     490263.78,   3755639.58,   85.34
     490254.51,   3755640.28,   79.25
     490254.51,   3755640.28,   79.25
     490242.16,   3755641.22,   75.59
     490229.80,   3755642.15,   75.59
     490217.44,   3755643.08,   75.59
     490205.09,   3755644.02,   79.25
     490205.09,   3755644.02,   79.25
     490186.55,   3755645.42,   85.34
     490177.29,   3755646.12,   91.44
     490158.75,   3755647.52,   97.54
     490146.40,   3755648.45,   103.63
     490130.95,   3755649.62,   109.73
  END:

  CROSS-SECTION:
    STREAM ID:Sipsey Fork     
    REACH ID:Sipsey Fork     
    STATION:13.552* 
    NODE NAME:                
    CUT LINE:
      490348.236792825 , 3755554.44864345 
      490248.581497463 , 3755561.99219479 
      490223.87626427 , 3755563.8637565 
      490199.171038808 , 3755565.73531763 
      490122.732478269 , 3755571.5258566 
    SURFACE LINE:
     490348.24,   3755554.45,   109.73
     490335.78,   3755555.39,   103.68
     490332.73,   3755555.62,   101.72
     490326.44,   3755556.10,   97.65
     490321.09,   3755556.50,   96.98
     490279.74,   3755559.63,   92.42
     490270.38,   3755560.34,   91.35
     490262.42,   3755560.94,   87.53
     490258.64,   3755561.23,   85.56
     490257.92,   3755561.29,   85.22
     490253.65,   3755561.61,   82.50
     490248.58,   3755561.99,   79.27
     490248.58,   3755561.99,   79.27
     490245.75,   3755562.21,   78.40
     490243.64,   3755562.37,   77.73
     490236.08,   3755562.94,   75.58
     490223.88,   3755563.86,   75.58
     490212.36,   3755564.74,   75.58
     490209.15,   3755564.98,   76.44
     490206.21,   3755565.20,   77.24
     490200.50,   3755565.63,   78.84
     490199.17,   3755565.74,   79.26
     490199.17,   3755565.74,   79.26
     490197.66,   3755565.85,   79.78
     490193.00,   3755566.20,   81.22
     490186.72,   3755566.68,   83.20
     490182.06,   3755567.03,   84.83
     490180.06,   3755567.18,   85.47
     490170.51,   3755567.91,   91.44
     490170.23,   3755567.93,   91.52
     490151.40,   3755569.35,   97.45
     490141.55,   3755570.10,   102.06
     490138.66,   3755570.32,   103.48
     490133.49,   3755570.71,   105.53
     490122.73,   3755571.53,   109.73
  END:

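I have thousands of lines like the ones shown above. I just want to collect all the data that has three comma-separated columns and save it as a data frame in R.

The output I want for the data set above is shown below. The commas should also be removed.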

     490353.36,   3755632.81,   109.73
     490341.00,   3755633.74,   103.63
     490331.74,   3755634.44,   97.54
     490276.13,   3755638.65,   91.44
     490263.78,   3755639.58,   85.34
     490254.51,   3755640.28,   79.25
     490254.51,   3755640.28,   79.25
     490242.16,   3755641.22,   75.59
     490229.80,   3755642.15,   75.59
     490217.44,   3755643.08,   75.59
     490205.09,   3755644.02,   79.25
     490205.09,   3755644.02,   79.25
     490186.55,   3755645.42,   85.34
     490177.29,   3755646.12,   91.44
     490158.75,   3755647.52,   97.54
     490146.40,   3755648.45,   103.63
     490130.95,   3755649.62,   109.73
     490348.24,   3755554.45,   109.73
     490335.78,   3755555.39,   103.68
     490332.73,   3755555.62,   101.72
     490326.44,   3755556.10,   97.65
     490321.09,   3755556.50,   96.98
     490279.74,   3755559.63,   92.42
     490270.38,   3755560.34,   91.35
     490262.42,   3755560.94,   87.53
     490258.64,   3755561.23,   85.56
     490257.92,   3755561.29,   85.22
     490253.65,   3755561.61,   82.50
     490248.58,   3755561.99,   79.27
     490248.58,   3755561.99,   79.27
     490245.75,   3755562.21,   78.40
     490243.64,   3755562.37,   77.73
     490236.08,   3755562.94,   75.58
     490223.88,   3755563.86,   75.58
     490212.36,   3755564.74,   75.58
     490209.15,   3755564.98,   76.44
     490206.21,   3755565.20,   77.24
     490200.50,   3755565.63,   78.84
     490199.17,   3755565.74,   79.26
     490199.17,   3755565.74,   79.26
     490197.66,   3755565.85,   79.78
     490193.00,   3755566.20,   81.22
     490186.72,   3755566.68,   83.20
     490182.06,   3755567.03,   84.83
     490180.06,   3755567.18,   85.47
     490170.51,   3755567.91,   91.44
     490170.23,   3755567.93,   91.52
     490151.40,   3755569.35,   97.45
     490141.55,   3755570.10,   102.06
     490138.66,   3755570.32,   103.48
     490133.49,   3755570.71,   105.53
     490122.73,   3755571.53,   109.73
4 Answers

I would do something like this, first reading the text file with readLines:

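# Read every line of the file into a character vector
tt <- readLines("myfile.txt")
# Optional leading spaces, three comma-separated fields, optional trailing spaces
pat <- "^[ ]*(.*),(.*),(.*)[ ]*$"
# Keep only the matching lines and rebuild them as field1,field2,field3
tt <- gsub(pat, "\\1,\\2,\\3", grep(pat, tt, value=TRUE))
# Parse the filtered lines as comma-separated values
dat <- read.table(textConnection(tt), sep=",", header=FALSE)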

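The idea: first we read the whole file into tt so that we can make all the changes we need, filter the lines, and so on. Then we decide which lines to keep and which to discard. For that we build a pattern that reads: zero or more spaces, then any amount of anything, then a comma, then any amount of anything, then a comma, then any amount of anything, then zero or more trailing spaces. This keeps only the lines that have three columns separated by commas. (Because .* can itself match a comma, a line with more than two commas would also match; the sample data contains none.)

So we first use pat with grep to keep only the lines that match the pattern (value=TRUE returns the lines themselves rather than their indices). Then we use gsub to strip the leading whitespace and keep just the fields between the commas (I don't think this is strictly necessary, but it makes sure). At that point we have the data we need. All that is left is to pass it to textConnection and read it with read.table as usual. Hope this helps.

The code is deliberately split into separate lines. Run them one at a time and look at the output, and you should understand it right away.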
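
For anyone who wants to try the filter-then-parse idea without the original file, here is a minimal sketch on a few lines taken from the question. It is not the code above: it swaps in a stricter pattern, [^,]+ instead of .*, so that a line matches only when it has exactly two commas. The names lines, strict_pat, keep and xyz, and the column names x, y, z, are assumptions made for the illustration.

```r
# Small in-memory sample standing in for the real file (values from the question)
lines <- c(
  "    CUT LINE:",
  "      490353.358391478 , 3755632.80772044 ",
  "    SURFACE LINE:",
  "     490353.36,   3755632.81,   109.73",
  "     490341.00,   3755633.74,   103.63",
  "  END:"
)

# Exactly three comma-separated fields: [^,] cannot match across a comma
strict_pat <- "^[^,]+,[^,]+,[^,]+$"
keep <- grep(strict_pat, lines, value = TRUE)

# With sep = ",", read.table() strips the padding around numeric fields,
# so no separate whitespace clean-up step is needed here
xyz <- read.table(text = keep, sep = ",", header = FALSE,
                  col.names = c("x", "y", "z"))
print(xyz)
```

The two-column CUT LINE rows are rejected because they contain only one comma, and the label lines are rejected because they contain none.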

answered 2013-07-03T21:24:12.427

This is so ugly that I almost didn't post it. But it works. I read in your data like this:

raw<-read.table(textConnection('ENDSTREAMNETWORK:

BEGIN CROSS-SECTIONS:

  CROSS-SECTION:
    STREAM ID:Sipsey Fork     
    REACH ID:Sipsey Fork     
    STATION:13.60   
    NODE NAME:                
    CUT LINE:
      490353.358391478 , 3755632.80772044 
      490254.511677942 , 3755640.28160111 
      490229.8 , 3755642.15 
      490205.088314326 , 3755644.01839947 
      490130.953109393 , 3755649.62143546 
    SURFACE LINE:
     490353.36,   3755632.81,   109.73
     490341.00,   3755633.74,   103.63
     490331.74,   3755634.44,   97.54
     490276.13,   3755638.65,   91.44
     490263.78,   3755639.58,   85.34
     490254.51,   3755640.28,   79.25
     490254.51,   3755640.28,   79.25
     490242.16,   3755641.22,   75.59
     490229.80,   3755642.15,   75.59
     490217.44,   3755643.08,   75.59
     490205.09,   3755644.02,   79.25
     490205.09,   3755644.02,   79.25
     490186.55,   3755645.42,   85.34
     490177.29,   3755646.12,   91.44
     490158.75,   3755647.52,   97.54
     490146.40,   3755648.45,   103.63
     490130.95,   3755649.62,   109.73
  END:

  CROSS-SECTION:
    STREAM ID:Sipsey Fork     
    REACH ID:Sipsey Fork     
    STATION:13.552* 
    NODE NAME:                
    CUT LINE:
      490348.236792825 , 3755554.44864345 
      490248.581497463 , 3755561.99219479 
      490223.87626427 , 3755563.8637565 
      490199.171038808 , 3755565.73531763 
      490122.732478269 , 3755571.5258566 
    SURFACE LINE:
     490348.24,   3755554.45,   109.73
     490335.78,   3755555.39,   103.68
     490332.73,   3755555.62,   101.72
     490326.44,   3755556.10,   97.65
     490321.09,   3755556.50,   96.98
     490279.74,   3755559.63,   92.42
     490270.38,   3755560.34,   91.35
     490262.42,   3755560.94,   87.53
     490258.64,   3755561.23,   85.56
     490257.92,   3755561.29,   85.22
     490253.65,   3755561.61,   82.50
     490248.58,   3755561.99,   79.27
     490248.58,   3755561.99,   79.27
     490245.75,   3755562.21,   78.40
     490243.64,   3755562.37,   77.73
     490236.08,   3755562.94,   75.58
     490223.88,   3755563.86,   75.58
     490212.36,   3755564.74,   75.58
     490209.15,   3755564.98,   76.44
     490206.21,   3755565.20,   77.24
     490200.50,   3755565.63,   78.84
     490199.17,   3755565.74,   79.26
     490199.17,   3755565.74,   79.26
     490197.66,   3755565.85,   79.78
     490193.00,   3755566.20,   81.22
     490186.72,   3755566.68,   83.20
     490182.06,   3755567.03,   84.83
     490180.06,   3755567.18,   85.47
     490170.51,   3755567.91,   91.44
     490170.23,   3755567.93,   91.52
     490151.40,   3755569.35,   97.45
     490141.55,   3755570.10,   102.06
     490138.66,   3755570.32,   103.48
     490133.49,   3755570.71,   105.53
     490122.73,   3755571.53,   109.73
  END:'),sep='\n',stringsAsFactors=FALSE)

Then I turn it into a data.frame:

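# Flatten the one-column data frame into a character vector of lines
vec <- unlist(raw)

# Each data block starts right after "SURFACE LINE:" and stops right before "END:"
start <- grep('SURFACE LINE:', vec) + 1
end <- grep('END:', vec) - 1

# Read each block as comma-separated values and stack the pieces
data <- do.call(rbind,
  lapply(seq_along(start),
    function(x) read.table(textConnection(vec[start[x]:end[x]]), sep = ','))
)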
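
Since the data covers several cross sections, it can be handy to remember which block each row came from. The following is only a sketch of the same start/end index idea on a tiny inline sample. The section column and the names first, last, blocks and surface are hypothetical, and it assumes every SURFACE LINE: has a matching END: with at least one data row in between.

```r
# Tiny stand-in for the vector of lines (values from the question)
vec <- c(
  "  CROSS-SECTION:",
  "    SURFACE LINE:",
  "     490353.36,   3755632.81,   109.73",
  "     490341.00,   3755633.74,   103.63",
  "  END:",
  "  CROSS-SECTION:",
  "    SURFACE LINE:",
  "     490348.24,   3755554.45,   109.73",
  "  END:"
)

# First and last data line of every block
first <- grep("SURFACE LINE:", vec, fixed = TRUE) + 1
last  <- grep("END:", vec, fixed = TRUE) - 1
stopifnot(length(first) == length(last))  # markers must pair up

# Read each block and tag its rows with a running block number
blocks <- Map(function(i, from, to) {
  block <- read.table(text = vec[from:to], sep = ",",
                      col.names = c("x", "y", "z"))
  block$section <- i  # hypothetical id: position of the block in the file
  block
}, seq_along(first), first, last)

surface <- do.call(rbind, blocks)
print(surface)
```

Using fixed = TRUE treats the markers as literal text rather than regular expressions, which is all that is needed here.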
answered 2013-07-03T21:24:22.943

Not the shortest, but easier for me to understand:

raw_text <- "ENDSTREAMNETWORK:

BEGIN CROSS-SECTIONS:

  CROSS-SECTION:
    STREAM ID:Sipsey Fork     
    REACH ID:Sipsey Fork     
    STATION:13.60   
    NODE NAME:                
    CUT LINE:
      490353.358391478 , 3755632.80772044 
      490254.511677942 , 3755640.28160111 
      490229.8 , 3755642.15 
      490205.088314326 , 3755644.01839947 
      490130.953109393 , 3755649.62143546 
    SURFACE LINE:
     490353.36,   3755632.81,   109.73
     490341.00,   3755633.74,   103.63
     490331.74,   3755634.44,   97.54
     490276.13,   3755638.65,   91.44
     490263.78,   3755639.58,   85.34
     490254.51,   3755640.28,   79.25
     490254.51,   3755640.28,   79.25
     490242.16,   3755641.22,   75.59
     490229.80,   3755642.15,   75.59
     490217.44,   3755643.08,   75.59
     490205.09,   3755644.02,   79.25
     490205.09,   3755644.02,   79.25
     490186.55,   3755645.42,   85.34
     490177.29,   3755646.12,   91.44
     490158.75,   3755647.52,   97.54
     490146.40,   3755648.45,   103.63
     490130.95,   3755649.62,   109.73
  END:"

Here are the steps:

## read the data
raw_data <- readLines(textConnection(raw_text))

## split by ","
split_list <- strsplit(raw_data, ",")

## check for 3 columns
data <- split_list[sapply(split_list, length) == 3]

## remove space and ","
data <- lapply(data, function(x) gsub("\\s+|\\,", "", x))

## bind the data 
do.call("rbind", data)


##       [,1]        [,2]         [,3]    
##  [1,] "490353.36" "3755632.81" "109.73"
##  [2,] "490341.00" "3755633.74" "103.63"
##  [3,] "490331.74" "3755634.44" "97.54" 
##  [4,] "490276.13" "3755638.65" "91.44" 
##  [5,] "490263.78" "3755639.58" "85.34" 
##  [6,] "490254.51" "3755640.28" "79.25" 
##  [7,] "490254.51" "3755640.28" "79.25" 
##  [8,] "490242.16" "3755641.22" "75.59" 
##  [9,] "490229.80" "3755642.15" "75.59" 
## [10,] "490217.44" "3755643.08" "75.59" 
## [11,] "490205.09" "3755644.02" "79.25" 
## [12,] "490205.09" "3755644.02" "79.25" 
## [13,] "490186.55" "3755645.42" "85.34" 
## [14,] "490177.29" "3755646.12" "91.44" 
## [15,] "490158.75" "3755647.52" "97.54" 
## [16,] "490146.40" "3755648.45" "103.63"
## [17,] "490130.95" "3755649.62" "109.73"
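
Note that the result printed above is a character matrix, not yet a data frame of numbers. A minimal sketch of the last step, using only the first two rows of that output, could look like this. The names m and xyz and the column names x, y, z are assumptions.

```r
# A character matrix shaped like the output above (first two rows only)
m <- rbind(c("490353.36", "3755632.81", "109.73"),
           c("490341.00", "3755633.74", "103.63"))

# Convert the strings to doubles in place; the matrix keeps its dimensions
storage.mode(m) <- "double"

# Turn the numeric matrix into a data frame with named columns
xyz <- as.data.frame(m)
names(xyz) <- c("x", "y", "z")  # column names are an assumption
str(xyz)
```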
answered 2013-07-03T21:27:17.790

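I would like to propose another approach. As @dickoa pointed out, if you are a Linux or Mac user you can use third-party programs such as awk or egrep to do the filtering for you. There is no need to filter by hand outside R; a single system call does it all. Both of these work:

With awk, as @dickoa suggested: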

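# -F',' sets the field separator before the first line is read;
# NF == 3 prints only the lines that have exactly three fields
read.table(text = system("awk -F',' 'NF == 3' test.txt",
                         intern = TRUE),
           sep = ',')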

With egrep:

read.table(text = system("egrep '^[^,]+,[^,]+,[^,]+$' test.txt", intern = TRUE),
           sep = ',')

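The advantage is that R never has to hold the whole file in memory, only the lines that pass the filter, which can make a difference when you are reading from a very large file. It is also shorter than the other suggested answers.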
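
If awk or egrep is not available (on Windows, for instance), a similar memory saving can probably be had in plain R by reading the file through a connection a chunk at a time and keeping only the matching lines. This is a sketch under assumptions, not part of either command above: the temporary file, chunk_size, and the names pat, kept and xyz are all made up for the illustration.

```r
# Write a tiny stand-in file so the sketch is self-contained
path <- tempfile(fileext = ".txt")
writeLines(c(
  "    CUT LINE:",
  "      490353.358391478 , 3755632.80772044 ",
  "    SURFACE LINE:",
  "     490353.36,   3755632.81,   109.73",
  "     490341.00,   3755633.74,   103.63",
  "  END:"
), path)

chunk_size <- 2                 # hypothetical; a real file would use far more
pat <- "^[^,]+,[^,]+,[^,]+$"    # exactly three comma-separated fields

con <- file(path, open = "r")
kept <- character(0)
repeat {
  # An open connection remembers its position between calls
  chunk <- readLines(con, n = chunk_size)
  if (length(chunk) == 0) break
  kept <- c(kept, grep(pat, chunk, value = TRUE))
}
close(con)

xyz <- read.table(text = kept, sep = ",", col.names = c("x", "y", "z"))
unlink(path)
print(xyz)
```

Only one chunk plus the lines kept so far are in memory at any time. For a very large number of matches, collecting the chunks in a list and calling unlist once at the end would avoid growing kept repeatedly.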

answered 2013-07-04T00:11:00.270