r - Why am I getting X. in my column names when reading a data frame?

Question

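I asked a question about this a few months back and thought the answer had solved my problem, but I have run into the problem again and that solution doesn't work for me.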

I'm importing a CSV:

orders <- read.csv("<file_location>", sep=",", header=T, check.names = FALSE)

Here's the structure of the data frame:

str(orders)

'data.frame':   3331575 obs. of  2 variables:
 $ OrderID  : num  -2034590217 -2034590216 -2031892773 -2031892767 -2021008573 ...
 $ OrderDate: Factor w/ 402 levels "2010-10-01","2010-10-04",..: 263 263 269 268 301 300 300 300 300 300 ...

If I run the length command on the first column, OrderID, I get this:

length(orders$OrderID)
[1] 0

If I run length on OrderDate, it returns the correct value:

length(orders$OrderDate)
[1] 3331575

This is a copy/paste of the head of the CSV.

OrderID,OrderDate
-2034590217,2011-10-14
-2034590216,2011-10-14
-2031892773,2011-10-24
-2031892767,2011-10-21
-2021008573,2011-12-08
-2021008572,2011-12-07
-2021008571,2011-12-07
-2021008570,2011-12-07
-2021008569,2011-12-07

Now, if I re-run read.csv but take out the check.names option, the first column of the data frame has an X. at the start of its name.

orders2 <- read.csv("<file_location>", sep=",", header=T)

str(orders2)

'data.frame':   3331575 obs. of  2 variables:
 $ X.OrderID: num  -2034590217 -2034590216 -2031892773 -2031892767 -2021008573 ...
 $ OrderDate: Factor w/ 402 levels "2010-10-01","2010-10-04",..: 263 263 269 268 301 300 300 300 300 300 ...

length(orders2$X.OrderID)
[1] 3331575

This works correctly.

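My question is: why does R add an X. to the beginning of the first column name? As you can see from the CSV file, there are no special characters. It should be a simple load. Adding check.names = FALSE imports the name as it appears in the CSV, but then the first column cannot be accessed by that name, so I can't perform any analysis on it.

What can I do to fix this?

Side note: I realize this is a minor issue. I'm mostly frustrated that I believe I am loading the file correctly, yet I am not getting the result I expected. I could rename the column using colnames(orders)[1] <- "OrderID", but I would still like to know why it doesn't load correctly.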

score 87 · Accepted Answer

read.csv() is a wrapper around the more general read.table() function. That latter function has the argument check.names, which is documented as:

check.names: logical.  If ‘TRUE’ then the names of the variables in the
         data frame are checked to ensure that they are syntactically
         valid variable names.  If necessary they are adjusted (by
         ‘make.names’) so that they are, and also to ensure that there
         are no duplicates.

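If your header contains labels that are not syntactically valid, make.names() will replace them with valid names derived from the invalid ones, translating invalid characters to a dot and, if necessary, prepending X: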

R> make.names("$Foo")
[1] "X.Foo"

This is documented in ?make.names:

Details:

    A syntactically valid name consists of letters, numbers and the
    dot or underline characters and starts with a letter or the dot
    not followed by a number.  Names such as ‘".2way"’ are not valid,
    and neither are the reserved words.

    The definition of a _letter_ depends on the current locale, but
    only ASCII digits are considered to be digits.

    The character ‘"X"’ is prepended if necessary.  All invalid
    characters are translated to ‘"."’.  A missing value is translated
    to ‘"NA"’.  Names which match R keywords have a dot appended to
    them.  Duplicated values are altered by ‘make.unique’.
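The rules quoted above can be tried directly in the console. The following is a small sketch, not part of the original answer; the input labels are made up for illustration.

```r
# Made-up header labels that are not syntactically valid R names.
labels <- c("3_in", "$Foo", "order id", "if", "")

fixed <- make.names(labels)
print(fixed)
# "3_in"     starts with a digit       -> "X3_in"
# "$Foo"     starts with an invalid character, which is then translated -> "X.Foo"
# "order id" contains a space          -> "order.id"
# "if"       is a reserved word        -> "if."
# ""         is empty                  -> "X"

# Duplicates are only made unique when unique = TRUE,
# which is how read.table() calls make.names().
deduplicated <- make.names(c("a", "a"), unique = TRUE)
print(deduplicated)  # "a" "a.1"
```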

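The behaviour you are seeing is entirely consistent with the documented way read.table() loads your data. That suggests you have syntactically invalid labels in the header row of your CSV file. Note the point above from ?make.names that what counts as a letter depends on your system's locale: the CSV file might include a character that your text editor displays as valid, but if R is not running in the same locale, that character may not be valid there.

I would look at the CSV file and identify any non-ASCII characters in the header line; there may also be non-visible characters (or escape sequences; \t?) in the header row. A lot can happen between reading in the file with the invalid names and displaying them in the console, and that can mask the invalid characters, so don't take the fact that nothing looks wrong when you use check.names = FALSE as an indication that the file is OK.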
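One way to look for invisible characters is to inspect the raw bytes of the header. The sketch below assumes, purely for illustration, that the culprit is a UTF-8 byte order mark (BOM) at the start of the file; the thread does not confirm that this was the cause. The temporary file and its contents are made up, and the fileEncoding = "UTF-8-BOM" option relies on the encoding name documented in ?file.

```r
# Build a small CSV whose header starts with a UTF-8 byte order mark (EF BB BF).
tmp <- tempfile(fileext = ".csv")
bom <- as.raw(c(0xEF, 0xBB, 0xBF))
writeBin(c(bom, charToRaw("OrderID,OrderDate\n-2034590217,2011-10-14\n")), tmp)

# Look at the first bytes of the file: anything before "OrderID" is suspect.
first_bytes <- readBin(tmp, what = "raw", n = 10)
print(first_bytes)
has_bom <- identical(first_bytes[1:3], bom)

# Dump the bytes of each column name as R actually read it.
# What the first name contains depends on the locale R is running in.
raw_read <- read.csv(tmp, check.names = FALSE)
print(lapply(names(raw_read), charToRaw))

# Ask R to drop the BOM while reading, then check the first column again.
fixed <- read.csv(tmp, fileEncoding = "UTF-8-BOM")
print(names(fixed))
print(length(fixed$OrderID))
```

If the byte dump of a name shows more bytes than the visible characters account for, the header contains something the console is not displaying.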

Posting the output of sessionInfo() would also be useful.

score 12 · Answer

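I just ran into this problem, and the reason was simple. I had labels that began with a number, and R was adding an X in front of each of them. A name that starts with a digit is not a syntactically valid R name, so R prepends a letter to make it valid.

So "3_in" became "X3_in", and so on. I fixed it by changing the label to "in_3", and the problem went away.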
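A short sketch of this case, using a made-up two-column CSV passed through the text argument of read.csv(). It also shows that, if renaming the label is not an option, check.names = FALSE keeps the original name, which then has to be quoted with backticks.

```r
# Made-up CSV with one label that starts with a digit.
csv_text <- "3_in,in_3\n1,2\n"

# Default behaviour: the invalid name gets an X prepended.
checked <- read.csv(text = csv_text)
print(names(checked))    # "X3_in" "in_3"

# With check.names = FALSE the label is kept verbatim ...
raw_names <- read.csv(text = csv_text, check.names = FALSE)
print(names(raw_names))  # "3_in" "in_3"

# ... but it is not a valid R name, so it needs backticks (or [[ ]]) to access.
print(raw_names$`3_in`)
print(raw_names[["3_in"]])
```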

I hope this helps someone.

score 7 · Answer

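When a column name is not well formed, R puts an "X" at the start of it during import. This typically happens when the column name starts with a number or certain special characters. Setting check.names = FALSE prevents it, so there will be no "X". However, some functions may not work if column names start with numbers or other special characters. The rbind.fill function is one example.

So, after applying that function (with the "corrected" colnames), I use this simple helper to get rid of the "X".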

destroyX <- function(es) {
  f <- es
  for (col in seq_len(ncol(f))) {             # for each column in the data frame
    if (startsWith(colnames(f)[col], "X")) {  # if the name starts with "X" ...
      # ... drop that first character, whatever the length of the name
      # (note: this also strips a legitimate leading "X")
      colnames(f)[col] <- substring(colnames(f)[col], 2)
    }
  }
  # assign the corrected data back to the original variable name
  assign(deparse(substitute(es)), f, inherits = TRUE)
}
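For comparison, here is a vectorised sketch that returns the corrected data frame instead of assigning into the caller's environment. The helper name strip_leading_x and the example data frame are hypothetical, and the choice to strip the "X" only when it is followed by a digit or a dot is an assumption about what make.names() added; it cannot restore characters that were translated to dots, and it would also change a genuine name such as "X1".

```r
# Hypothetical helper: remove a leading "X" only where it is followed by a
# digit or a dot, the cases in which make.names() plausibly prepended it.
strip_leading_x <- function(df) {
  names(df) <- sub("^X(?=[0-9.])", "", names(df), perl = TRUE)
  df
}

# Made-up example: two names that look adjusted, one that genuinely starts with X.
example <- data.frame(X3_in = 1, X.Foo = 2, Xylophone = 3)
example <- strip_leading_x(example)
print(names(example))  # "3_in" ".Foo" "Xylophone"
```

Returning the modified data frame keeps the function free of side effects, so the caller decides which variable receives the result.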

score 6 · Answer

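I came across a similar problem and wanted to share the following lines of code for correcting the column names. They are certainly not perfect, since clean programming up front would be better, but they may be helpful as a starting point for a quick-and-dirty approach. (I would have liked to add them as a comment on Ryan's question/Gavin's answer, but my reputation is not high enough, so I had to post an additional answer - sorry.)

In my case, several rounds of writing and reading data produced one or more columns named "X", "X.1", ..., with real content in the X column and row numbers in the X.1, ... columns. The content of the X column should be used as row names, and the other X.1, ... columns should be deleted.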

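Correct_Colnames <- function(df) {

  # columns named exactly "X", or "X." followed by digits
  delete.columns <- grep("(^X$)|(^X\\.)(\\d+)($)", colnames(df), perl = TRUE)

  if (length(delete.columns) > 0) {

    # use the contents of the "X" column, if there is one, as row names
    x.column <- grep("^X$", colnames(df))
    if (length(x.column) == 1) {
      row.names(df) <- as.character(df[, x.column])
    }
    # other data types might apply than character, or
    # introducing a new separate column might be suitable

    # drop = FALSE keeps a data frame even if only one column remains
    df <- df[, -delete.columns, drop = FALSE]

    colnames(df) <- gsub("^X", "", colnames(df))
    # X might be replaced by different characters, instead of being deleted
  }

  return(df)
}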

score 4 · Answer

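I solved a similar problem by passing row.names=FALSE as an argument to the write.csv function. write.csv was including the row names as an unnamed column in the CSV file, and read.csv was naming that column "X" when it read the file back in.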
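A minimal round-trip sketch of this behaviour, using a made-up data frame and a temporary file. The last step, reading the unnamed column back as row names with row.names = 1, is an additional option not mentioned in the answer above.

```r
# Made-up data frame and a temporary file for the round trip.
df <- data.frame(a = 1:2, b = c("x", "y"))
tmp <- tempfile(fileext = ".csv")

# Default write.csv() stores the row names in an unnamed first column,
# which read.csv() then names "X".
write.csv(df, tmp)
with_rownames <- read.csv(tmp)
print(names(with_rownames))     # "X" "a" "b"

# Alternatively, keep that file and read the first column back as row names.
as_rownames <- read.csv(tmp, row.names = 1)
print(names(as_rownames))       # "a" "b"

# Writing without row names avoids the extra column entirely.
write.csv(df, tmp, row.names = FALSE)
without_rownames <- read.csv(tmp)
print(names(without_rownames))  # "a" "b"
```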
