r - Split a large file from a database by date

Question

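I am reading a large data file from a database (table test1). It has millions of rows, which I cannot read and process in R in one go.

I want to create sub-files from this large file based on the "horodatage" column. Below is an example that extracts a single file from the large one, but I now want to do this for every interval, not only between these two dates.

The split must start at "23/03/2005 11:00" and run to the end of the large file (around "31/12/2005 23:59" in test1). Each sub-file must cover 30 minutes (in other words, exactly 36000 rows per sub-file).

Each sub-file must then be saved under a name of the form A200503231100.dat, A200503231130.dat, A200503231200.dat, A200503231230.dat, and so on.

The horodatage column is already in this format: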

> class(montableau$horodatage)
[1] "POSIXct" "POSIXt"

The code I started with is:

heuredebut = "23/03/2005 11:00"
heurefin = "23/03/2005 11:30"
query = paste("select * from test1 where horodatage >= ",heuredebut," and horodatage < ",heurefin," order by horodatage;",sep="'")
montableau <- dbGetQuery (connection_db,query)

Any insight into how to loop over this large file would be very helpful.

score 1 · Accepted Answer

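Dates in R are notoriously annoying.

The key trick here is to parse the date strings with the strptime function, and then print them back out in whatever format you need (one for the SQL query, one for the file name).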
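As a quick illustration of that round trip, here is a minimal sketch. The `tz = "UTC"` argument is an assumption added here so the result does not depend on the local time zone; it is not part of the question.

```r
# Parse a day/month/year string into POSIXct.
# tz = "UTC" is an assumption of this sketch, not something the question specifies.
x <- as.POSIXct(strptime("23/03/2005 11:00", "%d/%m/%Y %H:%M", tz = "UTC"))

# Format the same instant back for the SQL query ...
sql_stamp <- format(x, "%d/%m/%Y %H:%M")

# ... and for the output file name.
file_name <- format(x, "A%Y%m%d%H%M.dat")

print(sql_stamp)
print(file_name)
```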

# Quick function to go from string to time
cleanDate <- function(x){
    as.POSIXct(strptime(x, '%d/%m/%Y %H:%M'))
}

# Function to print time in the format used in the SQL query
printDate <- function(x){
    # format() applies a format string to a POSIXct value
    format(x, '%d/%m/%Y %H:%M')
}


# Create sequence of times
times <- seq(
    cleanDate('23/03/2005 11:00'), 
    cleanDate('01/01/2006 00:00'), 
    by = 60 * 30) # adding 30 minutes

for( i in 1:(length(times) - 1) ){

    # Generate SQL
    sql <- paste("select * from test1 where horodatage >= ", 
        printDate(times[i]),
        " and horodatage < ",
        printDate(times[i+1]),
        " order by horodatage;",sep="'")

    # Query
    montableau <- dbGetQuery (connection_db, sql)

    # Write table, named after the start of the window (e.g. A200503231100.dat)
    write.table(montableau, 
        file = format(times[i], 'A%Y%m%d%H%M.dat'), 
        row.names=FALSE, sep="\t", quote=FALSE)

}
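Before running the loop against the database, it may help to build the list of windows and file names on its own and inspect it. The sketch below does that without any database connection. `make_windows` is a hypothetical helper introduced here for illustration, and `tz = "UTC"` is an assumption that the timestamps are plain wall-clock values with no daylight-saving shifts; if the data follow a local time zone, the boundaries around the clock changes would need checking.

```r
# Hypothetical helper: one row per window, holding the query bounds
# and the output file name. Not part of the original answer.
make_windows <- function(start, end, minutes = 30, tz = "UTC") {
    fmt <- "%d/%m/%Y %H:%M"
    from <- as.POSIXct(strptime(start, fmt, tz = tz))
    to   <- as.POSIXct(strptime(end,   fmt, tz = tz))

    # Window boundaries, spaced `minutes` apart (seq() steps in seconds)
    times <- seq(from, to, by = minutes * 60)
    n <- length(times)

    data.frame(
        from = format(times[-n], fmt),               # inclusive lower bound
        to   = format(times[-1], fmt),               # exclusive upper bound
        file = format(times[-n], "A%Y%m%d%H%M.dat"), # named after window start
        stringsAsFactors = FALSE
    )
}

windows <- make_windows("23/03/2005 11:00", "01/01/2006 00:00")
print(head(windows, 3))
print(nrow(windows))
```

Each row of `windows` corresponds to one pass of the loop above. Since the question expects exactly 36000 rows per sub-file, comparing `nrow(montableau)` against that number inside the loop would be one way to spot gaps in the data.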
