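I have received several CSV files that contain numbers in the local German format, i.e. with a comma as the decimal separator and a point as the thousands separator, e.g. 10.380,45. The values in the CSV files are separated by ";". The files also contain columns of the classes character, date, date-time, and logical.
The problem with the read.table functions is that you can specify the decimal separator with dec=",", but not the thousands separator. (If I'm wrong, please correct me.)
I know that preprocessing is a workaround, but I want to write my code in such a way that others can use it without me.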
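To illustrate the limitation, here is a minimal sketch in base R. The inline CSV text is made up for the demonstration. With dec="," applied (the read.csv2 default), a column that also contains thousands points is not recognised as numeric and stays character:

```r
# Two-row stand-in for the real files (hypothetical sample data)
csv_text <- "col_text;col_num\na;5.200,39\nb;250,36"

# read.csv2 already uses sep = ";" and dec = ","
df <- read.csv2(text = csv_text, stringsAsFactors = FALSE)

# col_num is not parsed as a number because of the thousands point
str(df)
class(df$col_num)
```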
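By setting up my own classes, I found a way to read the CSV files the way I want with read.csv2, as shown in the following example. It is based on "Most elegant way to load csv with point as thousands separator in R".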
# Create test example
df_test_write <- cbind.data.frame(c("a","b","c","d","e","f","g","h","i","j",rep("k",times=200)),
c("5.200,39","250,36","1.000.258,25","3,58","5,55","10.550,00","10.333,00","80,33","20.500.000,00","10,00",rep("3.133,33",times=200)),
c("25.03.2015","28.04.2015","03.05.2016","08.08.2016","08.08.2016","08.08.2016","08.08.2016","08.08.2016","08.08.2016","08.08.2016",rep("08.08.2016",times=200)),
stringsAsFactors=FALSE)
colnames(df_test_write) <- c("col_text","col_num","col_date")
# write test csv
write.csv2(df_test_write,file="Test.csv",quote=FALSE,row.names=FALSE)
#### read with read.csv2 ####
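# First, define a custom numeric class to use as a label in colClasses
setClass("myNum")
# Conversion: drop the thousands points, then turn the decimal comma into a point
setAs("character", "myNum", function(from) as.numeric(gsub(",", ".", gsub(".", "", from, fixed = TRUE), fixed = TRUE)))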
# own date class
library(lubridate)
setClass('myDate')
setAs("character","myDate",function(from) dmy(from))
# Read the csv file, in colClasses the columns class can be defined
df_test_readcsv <- read.csv2(paste0(getwd(),"/Test.csv"),
stringsAsFactors = FALSE,
colClasses = c(
col_text = "character",
col_num = "myNum",
col_date = "myDate"
)
)
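The conversion behind myNum can be checked in isolation. This is a minimal sketch; the helper name de_to_num is hypothetical and only used for this illustration:

```r
# Hypothetical helper: same two-step replacement as in the setAs() call
de_to_num <- function(x) {
  no_thousands <- gsub(".", "", x, fixed = TRUE)               # "5.200,39" -> "5200,39"
  as.numeric(sub(",", ".", no_thousands, fixed = TRUE))        # "5200,39"  -> 5200.39
}

de_to_num(c("5.200,39", "250,36", "1.000.258,25"))
```

The thousands points have to be removed before the comma is replaced, otherwise the new decimal point would be stripped as well.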
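My problem now is that the different datasets have up to 200 columns and 350000 rows. With the solution above I need 40 to 60 seconds to load one CSV file, and I would like to speed this up.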
Through my research I found fread() from the data.table package, which is really fast. Loading the CSV file takes about 3 to 5 seconds.
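Unfortunately, it is not possible to specify the thousands separator there either. So I tried to use my solution with colClasses, but there seems to be an issue that you cannot use custom classes with fread: https://github.com/Rdatatable/data.table/issues/491
See also my test code below: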
##### read with fread ####
library(data.table)
# Test without colclasses
df_test_readfread1 <- fread(paste0(getwd(),"/Test.csv"),
stringsAsFactors = FALSE,
dec = ",",
sep=";",
verbose=TRUE)
str(df_test_readfread1)
# PROBLEM: In my real dataset it turns the number into a numeric column;
# unfortunately it sees the "." as the decimal separator, so it turns e.g. 10.550,
# into 10.5
# Here it keeps everything as character
# Test with colclasses
df_test_readfread2 <- fread(paste0(getwd(),"/Test.csv"),
stringsAsFactors = FALSE,
colClasses = c(
col_text = "character",
col_num = "myNum",
col_date = "myDate"
),
sep=";",
verbose=TRUE)
str(df_test_readfread2)
# Keeps everything as character
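For comparison, one possible workaround is to let fread read the formatted columns as plain character and convert them afterwards. This is only a sketch under assumptions: the temporary file and its three rows are made up, and it has not been timed against the real 200-column data, so whether it keeps fread's speed advantage is an open question.

```r
library(data.table)

# Small stand-in for Test.csv, written to a temporary file (hypothetical data)
tmp <- tempfile(fileext = ".csv")
writeLines(c("col_text;col_num;col_date",
             "a;5.200,39;25.03.2015",
             "b;250,36;28.04.2015",
             "c;1.000.258,25;03.05.2016"), tmp)

# Force the formatted columns to character so fread does not guess a type
dt <- fread(tmp, sep = ";",
            colClasses = list(character = c("col_num", "col_date")))

# Convert by reference: strip thousands points, swap the decimal comma, parse dates
dt[, col_num := as.numeric(gsub(",", ".", gsub(".", "", col_num, fixed = TRUE), fixed = TRUE))]
dt[, col_date := as.Date(col_date, format = "%d.%m.%Y")]
str(dt)
```

This avoids custom classes in colClasses altogether, at the cost of a second pass over the affected columns.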
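So my question is: is there a way to read CSV files with numeric values like 10.380,45 using fread?
(Alternatively: what is the fastest way to read a CSV with such numeric values?)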