
I have a working generator function. I have a large list of .txt files, and each file is itself quite long. The task now is to write a generator function that:

  1. takes a batch of files,
  2. and then takes batches of size 128 from within a single file.

My current code:

library(readr)      # read_lines()
library(stringr)    # str_to_lower(), str_c()
library(tm)         # removeNumbers()
library(tokenizers) # tokenize_characters()
library(purrr)      # map(), transpose()
library(magrittr)   # the %>% pipe
# data_dir, chars, maxlen and mini_batch_size are defined globally

data_files_generator <- function(train_set) {

  files <- train_set
  next_file <- 0

  function() {

    # move to the next file (note the <<- assignment operator)
    next_file <<- next_file + 1

    # if we've exhausted all of the files then start again at the
    # beginning of the list (keras generators need to yield
    # data infinitely -- termination is controlled by the epochs
    # and steps_per_epoch arguments to fit_generator())
    if (next_file > length(files))
    {next_file <<- 1}

    # determine the file name
    file <- files[[next_file]]

    # read the file, lowercase it, strip digits, and split into characters
    text <- read_lines(paste(data_dir, file, sep = "" )) %>%
      str_to_lower() %>%
      str_c(collapse = "\n") %>%
      removeNumbers() %>%
      tokenize_characters(strip_non_alphanum = FALSE, simplify = TRUE)

    # keep only characters that are in the model vocabulary
    text <- text[text %in% chars]

    # cut the character stream into overlapping snippets of length maxlen
    # (stride 3), each paired with the character that follows it
    dataset <- map(
      seq(1, length(text) - maxlen - 1, by = 3), 
      ~list(sentece = text[.x:(.x + maxlen - 1)], next_char = text[.x + maxlen])
    )

    # turn the list of pairs into two parallel lists: $sentece and $next_char
    dataset <- transpose(dataset)

    # Vectorization: one-hot encode the snippets (x) and the next characters (y)
    x <- array(0, dim = c(length(dataset$sentece), maxlen, length(chars)))
    y <- array(0, dim = c(length(dataset$sentece), length(chars)))

    for(i in 1:length(dataset$sentece)){

      x[i,,] <- sapply(chars, function(x){
        as.integer(x == dataset$sentece[[i]])
      })

      y[i,] <- as.integer(chars == dataset$next_char[[i]])

    }
    # trim the arrays to a whole number of batches
    # (mini_batch_size is assumed to equal 128)
    rounded_dim <- floor(dim(x)[1]/mini_batch_size)
    match_size_to_batch <- 128 * rounded_dim

    x <- x[1:match_size_to_batch, 1:maxlen, 1:length(chars)]
    y <- y[1:match_size_to_batch, 1:length(chars)]

    return(list(x, y))

  }
}

So what comes in is a text file, which is converted into smaller text snippets (of length maxlen) that are then one-hot encoded into matrices of 0s and 1s.
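As a tiny worked example of the one-hot step (toy values, not the real vocabulary), with maxlen = 3 and a three-character vocabulary:

chars <- c("a", "b", "c")    # toy vocabulary
snippet <- c("b", "a", "c")  # one snippet of length maxlen = 3
sapply(chars, function(ch) as.integer(ch == snippet))
#      a b c
# [1,] 0 1 0
# [2,] 1 0 0
# [3,] 0 0 1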

The problem is that with my code the output is a data cube of size samples x maxlen x length(chars), where the number of samples is very large. That is why I want my generator function to always output a cube of 128 samples, i.e. of size 128 x maxlen x length(chars), then output the next batch of that size, and so on until the whole text file has been read in, and then move on to the next text file...

What I get now is an error:

 Error in py_call_impl(callable, dots$args, dots$keywords) : 
  ValueError: Cannot feed value of shape (112512, 40, 43) for Tensor 'lstm_layer_input_1:0', which has shape '(128, 40, 43)' 
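(Note that 112512 = 879 × 128, so the trimming to a multiple of the batch size did work; the issue is that all 879 batches' worth of samples are handed to the model in a single call instead of one 128-sample slice per call.)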

I hope I have explained this well enough to be understood. I think I have to add some kind of for loop that iterates over the samples, but I have no idea how to include it in the generator function.


2 Answers


Based on the error, you are trying to feed an object of shape (112512, 40, 43), but your LSTM layer expects one of shape (128, 40, 43). Some of the code seems to be missing, but did you fix the batch size when you defined the input layer? I have had good luck defining my input layer as:

l_input = Input(shape = (None, num_features), name = 'input_layer')
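In the R interface to keras, an equivalent sketch (the model layers here are assumptions, not taken from the question) would leave the batch dimension free by using input_shape, which only fixes (timesteps, features):

library(keras)

# input_shape leaves the batch size free, unlike
# batch_input_shape = c(128, maxlen, length(chars)),
# which pins every batch to exactly 128 samples
model <- keras_model_sequential() %>%
  layer_lstm(units = 128, input_shape = c(maxlen, length(chars))) %>%
  layer_dense(units = length(chars), activation = "softmax")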

I suspect the error is caused by these lines of code:

rounded_dim <- floor(dim(x)[1]/mini_batch_size)
match_size_to_batch <- 128 * rounded_dim

These give you a batch size that is much larger than 128. From the Keras documentation, the input shape should be (batch_size, timesteps, input_dim). Batch sizes don't have to be the same across an entire epoch, but within a single batch all samples need to have the same number of timesteps (which it looks like you handle with maxlen).
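As a sketch of how such a generator is then bounded in training (the argument values here are illustrative assumptions): steps_per_epoch tells fit_generator() how many 128-sample batches to draw per epoch from the otherwise infinite generator:

library(keras)

# each call to the generator yields one list(x, y) with 128 samples;
# steps_per_epoch caps how many such batches make up one epoch
model %>% fit_generator(
  data_files_generator(train_set),
  steps_per_epoch = 879,  # e.g. 112512 / 128 slices available per file
  epochs = 10
)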

answered 2018-11-01T14:03:18.373

I have implemented a for loop which now returns batches of size 128:

The changed code:

data_files_generator <- function(train_set) {

  files <- train_set
  next_file <- 0

  function() {

    # move to the next file (note the <<- assignment operator)
    next_file <<- next_file + 1

    # if we've exhausted all of the files then start again at the
    # beginning of the list (keras generators need to yield
    # data infinitely -- termination is controlled by the epochs
    # and steps_per_epoch arguments to fit_generator())
    if (next_file > length(files))
    {next_file <<- 1}

    # determine the file name
    file <- files[[next_file]]

    text <- read_lines(paste(data_dir, file, sep = "" )) %>%
      str_to_lower() %>%
      str_c(collapse = "\n") %>%
      removeNumbers() %>%
      tokenize_characters(strip_non_alphanum = FALSE, simplify = TRUE)

    text <- text[text %in% chars]

    dataset <- map(
      seq(1, length(text) - maxlen - 1, by = 3), 
      ~list(sentece = text[.x:(.x + maxlen - 1)], next_char = text[.x + maxlen])
    )

    dataset <- transpose(dataset)

    # Vectorization
    x <- array(0, dim = c(length(dataset$sentece), maxlen, length(chars)))
    y <- array(0, dim = c(length(dataset$sentece), length(chars)))

    for(i in 1:length(dataset$sentece)){

      x[i,,] <- sapply(chars, function(x){
        as.integer(x == dataset$sentece[[i]])
      })

      y[i,] <- as.integer(chars == dataset$next_char[[i]])

    }
    rounded_dim <- floor(dim(x)[1]/mini_batch_size)
    match_size_to_batch <- 128 * rounded_dim

    x <- x[1:match_size_to_batch, 1:maxlen, 1:length(chars)]
    y <- y[1:match_size_to_batch, 1:length(chars)]

    # Edit: slice off one 128-sample batch. Note that return() exits the
    # function on the first pass through the loop, so each call emits the
    # first slice of the freshly loaded file.
    span_start <- 1
    for (iter in 1:rounded_dim) {
      span_end <- iter * 128
      x_batch <- x[span_start:span_end, 1:maxlen, 1:length(chars)]
      y_batch <- y[span_start:span_end, 1:length(chars)]
      span_start <- span_end + 1
      return(list(x_batch, y_batch))
    }
  }
}
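Since return() fires on the first pass through the loop, the version above still yields only the first 128-sample slice of each file. Below is a minimal sketch of a generator that walks through a whole file slice by slice before advancing to the next one; vectorize_file() is a hypothetical helper standing in for the read/tokenize/one-hot code above, assumed to return list(x = ..., y = ...) with at least one full batch per file:

data_files_generator <- function(train_set, batch_size = 128) {

  files <- train_set
  next_file <- 0
  cache <- NULL      # vectorized x and y of the current file
  n_batches <- 0     # number of full batches available in the cache
  batch_idx <- 0     # batches already emitted from the cache

  function() {

    # (re)load only when the current file is exhausted
    if (is.null(cache) || batch_idx >= n_batches) {
      # advance to the next file, wrapping around at the end of the list
      next_file <<- if (next_file >= length(files)) 1 else next_file + 1
      cache <<- vectorize_file(files[[next_file]])  # hypothetical helper
      n_batches <<- floor(dim(cache$x)[1] / batch_size)
      batch_idx <<- 0
    }

    # slice out the next batch and advance the position within the file
    span <- (batch_idx * batch_size + 1):((batch_idx + 1) * batch_size)
    batch_idx <<- batch_idx + 1

    list(cache$x[span, , , drop = FALSE], cache$y[span, , drop = FALSE])
  }
}

Caching the vectorized file in closure variables also avoids re-reading and re-encoding the file on every call, which the version above does each time it is invoked.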
answered 2018-11-01T14:39:03.687