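I'm facing a most annoying behavior: an R script that runs fine in RStudio produces an error in Azure ML.

At first I thought it was caused by a difference in the inputs and outputs, but as you can see in the script below, I have removed every dependency on inputs and outputs.

The error is produced by the call to chartr: 'old' is longer than 'new'.

Any input is appreciated.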

accented_characters <- list('Š'='S', 'š'='s', 'Ž'='Z', 'ž'='z', 'À'='A', 'Á'='A', 'Â'='A', 'Ã'='A', 'Ä'='A', 'Å'='A', 'Æ'='A', 'Ç'='C', 'È'='E', 'É'='E',
                        'Ê'='E', 'Ë'='E', 'Ì'='I', 'Í'='I', 'Î'='I', 'Ï'='I', 'Ñ'='N', 'Ò'='O', 'Ó'='O', 'Ô'='O', 'Õ'='O', 'Ö'='O', 'Ø'='O', 'Ù'='U',
                        'Ú'='U', 'Û'='U', 'Ü'='U', 'Ý'='Y', 'Þ'='B', 'ß'='s', 'à'='a', 'á'='a', 'â'='a', 'ã'='a', 'ä'='a', 'å'='a', 'æ'='a', 'ç'='c',
                        'è'='e', 'é'='e', 'ê'='e', 'ë'='e', 'ì'='i', 'í'='i', 'î'='i', 'ï'='i', 'ð'='o', 'ñ'='n', 'ò'='o', 'ó'='o', 'ô'='o', 'õ'='o',
                        'ö'='o', 'ø'='o', 'ù'='u', 'ú'='u', 'û'='u', 'ý'='y', 'ý'='y', 'þ'='b', 'ÿ'='y' )

input <- data.frame(text = c("some piZzaé pizZa word a to : here $","or there € with 28'89.5"))
stop_words <- data.frame(international = c('pizza'))

stop_words <- as.character(stop_words$international)
stop_words <- gsub("^\\s+|\\s+$", "", stop_words) # trim
stop_words <- tolower(stop_words) # lowercase

input <- as.character(input$text)
input <- gsub("[[:space:]]+", ' ', input) # remove multiple spaces
input <- gsub("[0-9!\"#$€%&'()*+,./:;<=>@}~^_|`\\?\\[\\{]+", '', input) # remove punctuation, numbers and some others. Note, does not remove closing bracket, can't figure out why
input <- chartr(paste(names(accented_characters), collapse = ''),
            paste(accented_characters, collapse = ''), input) # remove accents
input <- tolower(input) # lowercase everything         
input <- gsub("\\b[a-z]{1,2}\\b", '', input) #remove too short words
input <- gsub(paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b"), '', input) # remove stop words (whole-word match on any stop word)

input <- data.frame(input) # set as data.frame class
1 Answer

Try using Microsoft R Open and its checkpoint feature. https://mran.revolutionanalytics.com/documents/rro/reproducibility/
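
Separately from the checkpoint suggestion, it may help to confirm what chartr actually sees in each environment. The following is a minimal diagnostic sketch, not something taken from the question or from Azure ML documentation. It assumes (without confirmation from this thread) that the accented literals might be decoded differently on the two platforms, which would change the character count of 'old'. The variable names and the three-letter sample are hypothetical.

```r
# Hypothetical diagnostic sketch: compare how many characters chartr() will
# count in 'old' and 'new'. Unicode escapes are used so the result does not
# depend on the encoding of the script file itself.
old <- "\u00e9\u00e8\u00e7"   # the letters e-acute, e-grave, c-cedilla
new <- "eec"

# chartr() fails with "'old' is longer than 'new'" when 'old' has more
# characters than 'new', so these two counts are the ones to compare.
n_chars <- nchar(old, type = "chars")
n_bytes <- nchar(old, type = "bytes")
lengths_ok <- n_chars <= nchar(new, type = "chars")

print(n_chars)        # number of characters R recognises in 'old'
print(n_bytes)        # number of bytes used to store them
print(lengths_ok)     # FALSE would reproduce the condition behind the error

# Context that is worth comparing between RStudio and the other runtime.
print(Encoding(old))  # declared encoding of the string
print(l10n_info())    # locale details, including whether UTF-8 is in use
```

Running the same checks on `paste(names(accented_characters), collapse = '')` in both RStudio and Azure ML would show whether the two environments agree on the length of 'old'. If they do not, writing the accented characters as `\u` escapes is one possible workaround to try.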

Answered 2016-02-29T20:07:33.750