Thank you very much for your help with my question. I have been researching this for several weeks but cannot find a solution... Sorry for the long description; I have included plenty of detail so you can fully understand the problem, although it is not hard to grasp.
Situation:
A data frame of 4 million entries (total size: 250 MB) with two columns: ID and TEXT. Each text string is at most 200 characters long.
Task:
Preprocess the text strings.
Problem:
It takes far too long. On my 8 GB RAM dual-core machine: cancelled after one day. On a 70 GB, 8-core Amazon EC2 instance: cancelled after one day.
Details:
Essentially, I
- count how often certain words occur in a string
- write this number into a new column (COUNT)
- replace these (counted) words
- replace other words (which I do not need to count first)
- replace some regular expressions
The vectors used as patterns look like this:
"\\bWORD1\\b|\\bWORD2\\b|\\bWORD3\\b|\\bWORD4\\b..."
So these "replacement vectors" are character vectors of length 1, each containing up to 800 words.
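For reference, such a length-one alternation pattern can be built from a plain word vector (WORD1..WORD4 below are placeholders, not my real word lists):

```r
# Collapse a vector of words into a single alternation pattern with
# word boundaries; the words here are made-up placeholders.
words <- c("WORD1", "WORD2", "WORD3", "WORD4")
pattern <- paste0("\\b", words, "\\b", collapse = "|")
cat(pattern, "\n")
```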
Main function:
library("parallel")
library("stringr")
preprocessText<-function(x){
# Replace the 'html-and'
arguments<-list(pattern="\\&\\;",replacement="and",x=x, ignore.case=TRUE)
y<-do.call(gsub, arguments)
# Remove some special characters
arguments<-list(pattern="[^-[:alnum:]\\'\\:\\/\\$\\%\\.\\,\\+\\-\\#\\@\\_\\!\\?+[:space:]]",replacement="",x=y, ignore.case=TRUE)
y<-do.call(gsub, arguments)
# Lowercase
arguments<-list(string=y,pattern=tolower(rep_ticker))
first<-do.call(str_match,arguments)
# Identify signal words and count them
# Need to be done in parts, because otherwise R can't handle this many at once
arguments<-list(string=x, pattern=rep_words_part1)
t1<-do.call(str_extract_all,arguments)
arguments<-list(string=x, pattern=rep_words_part2)
t2<-do.call(str_extract_all,arguments)
arguments<-list(string=x, pattern=rep_words_part3)
t3<-do.call(str_extract_all,arguments)
arguments<-list(string=x, pattern=rep_words_part4)
t4<-do.call(str_extract_all,arguments)
count=length(t1[[1]])+length(t2[[1]])+length(t3[[1]])+length(t4[[1]])
signal_words=c(t1[[1]],t2[[1]],t3[[1]],t4[[1]])
# Replacements
arguments<-list(pattern=rep_wordsA,replacement=" [wordA] ",x=y, ignore.case=TRUE)
y<-do.call(gsub, arguments)
arguments<-list(pattern=rep_wordB_part1,replacement=" [wordB] ",x=y, ignore.case=TRUE)
y<-do.call(gsub, arguments)
arguments<-list(pattern=rep_wordB_part2,replacement=" [wordB] ",x=y, ignore.case=TRUE)
y<-do.call(gsub, arguments)
arguments<-list(pattern=rep_wordB_part3,replacement=" [wordB] ",x=y, ignore.case=TRUE)
y<-do.call(gsub, arguments)
arguments<-list(pattern=rep_wordB_part4,replacement=" [wordB] ",x=y, ignore.case=TRUE)
y<-do.call(gsub, arguments)
arguments<-list(pattern=rep_email,replacement=" [email_address] ",x=y, ignore.case=TRUE)
y<-do.call(gsub, arguments)
arguments<-list(pattern=rep_url,replacement=" [url] ",x=y, ignore.case=TRUE)
y<-do.call(gsub, arguments)
arguments<-list(pattern=rep_wordC,replacement=" [wordC] ",x=y, ignore.case=TRUE)
y<-do.call(gsub, arguments)
# Some regular expressions
arguments<-list(pattern="\\+[[:digit:]]*\\.?[[:digit:]]+%",replacement=" [positive_percentage] ",x=y, ignore.case=TRUE) # \\. = literal decimal point; an unescaped . matches any character
y<-do.call(gsub, arguments)
arguments<-list(pattern="-[[:digit:]]*\\.?[[:digit:]]+%",replacement=" [negative_percentage] ",x=y, ignore.case=TRUE)
y<-do.call(gsub, arguments)
arguments<-list(pattern="[[:digit:]]*\\.?[[:digit:]]+%",replacement=" [percentage] ",x=y, ignore.case=TRUE)
y<-do.call(gsub, arguments)
arguments<-list(pattern="\\$[[:digit:]]*\\.?[[:digit:]]+",replacement=" [dollar_value] ",x=y,ignore.case=TRUE)
y<-do.call(gsub, arguments)
arguments<-list(pattern="\\+[[:digit:]]*\\.?[[:digit:]]+",replacement=" [pos_number] ",x=y, ignore.case=TRUE)# remaining numbers
y<-do.call(gsub, arguments)
arguments<-list(pattern="\\-[[:digit:]]*\\.?[[:digit:]]+",replacement=" [neg_number] ",x=y, ignore.case=TRUE)
y<-do.call(gsub, arguments)
arguments<-list(pattern="[[:digit:]]*\\.?[[:digit:]]+",replacement=" [number] ",x=y, ignore.case=TRUE)
y<-do.call(gsub, arguments)
arguments<-list(pattern=rep_question,replacement=" [question] ", x=y, ignore.case=TRUE)
y<-do.call(gsub, arguments)
# Unify synonyms
arguments<-list(pattern=rep_syno1,replacement="happy", x=y, ignore.case=TRUE)
y<-do.call(gsub, arguments)
arguments<-list(pattern=rep_syno2,replacement="sad", x=y, ignore.case=TRUE)
y<-do.call(gsub, arguments)
arguments<-list(pattern=rep_syno3,replacement="people", x=y, ignore.case=TRUE)
y<-do.call(gsub, arguments)
arguments<-list(pattern=rep_syno4,replacement="father", x=y, ignore.case=TRUE)
y<-do.call(gsub, arguments)
arguments<-list(pattern=rep_syno5,replacement="mother", x=y, ignore.case=TRUE)
y<-do.call(gsub, arguments)
arguments<-list(pattern=rep_syno6,replacement="money", x=y, ignore.case=TRUE)
y<-do.call(gsub, arguments)
# Remove words
# Punctuation (I know there are pre-defined R commands for this, but I need to customize it)
arguments<-list(pattern=rem_punct,replacement="", x=y, ignore.case=TRUE)
y<-do.call(gsub, arguments)
arguments<-list(pattern=rem_linebreak,replacement=" ", x=y, ignore.case=TRUE) #Remove line breaks
y<-do.call(gsub, arguments)
#Append Positive or Negative Emotion
arguments<-list(x=y)
y<-do.call(appendEmotion, arguments)
# Output
result<-list(
textclean=y,
first_signal=first,
all_signals=signal_words,
signal_count=count)
return(result)
}
resultList<-mclapply(dataframe$text_column,preprocessText,mc.cores=detectCores()) # without mc.cores, mclapply defaults to 2 cores
(The return value will be a list, which I plan to convert to a data.frame.)
Previously, I also tried calling each gsub separately, i.e. running the first gsub on each text string, then the second, and so on.. but I suppose that is even less efficient.
The code itself works, but it seems to me it could be sped up. Unfortunately, I am not familiar with hash tables, which I have heard might be a solution.
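For what it is worth, one direction I have been looking at (just a sketch; both patterns below are made-up stand-ins for my real rep_* vectors): gsub() and stringr's str_count() are vectorized over character vectors, so each substitution or count could be applied once to the entire column instead of once per row via mclapply:

```r
library(stringr)

# Sketch: vectorized preprocessing over the whole column at once.
texts <- c("My dad and me finished the race top 5%",
           "Look at this http://www.google.com, Like it ?")

rep_syno4 <- "\\bdad\\b"                     # placeholder synonym pattern
texts <- gsub(rep_syno4, "father", texts, ignore.case = TRUE)

rep_words_part1 <- "\\brace\\b|\\blike\\b"   # placeholder signal words
# One signal-word count per row, computed in a single vectorized call:
counts <- str_count(texts, regex(rep_words_part1, ignore_case = TRUE))
```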
Thanks a lot for any advice and help!
One function called inside preprocessText is defined as:
appendEmotion<-function(x){
if (grepl(app_pos,x)){
x<-paste(x," [posemo] ")
}
if(grepl(app_neg,x)){
x<-paste(x," [negemo] ")
}
#Output
return(x)
}
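Since grepl() is also vectorized, appendEmotion could process the whole column in one go; a sketch where the if becomes ifelse() (app_pos/app_neg defaults below are placeholder patterns, not my real emotion word lists):

```r
# Vectorized variant of appendEmotion: appends the [posemo]/[negemo]
# tags to every matching element of x at once.
appendEmotionVec <- function(x,
                             app_pos = "\\bhappy\\b",
                             app_neg = "\\bsad\\b") {
  x <- ifelse(grepl(app_pos, x), paste(x, " [posemo] "), x)
  x <- ifelse(grepl(app_neg, x), paste(x, " [negemo] "), x)
  x
}
```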
Sample data:
+-----+-----------------------------------------------+
| ID  | Text                                          |
+-----+-----------------------------------------------+
| 123 | My dad and me finished the race top 5%        |
| 456 | Look at this http://www.google.com, Like it ? |
| ... |                                               |
+-----+-----------------------------------------------+
should become
+-----+-----------------------------------------------------+
| ID  | Text                                                |
+-----+-----------------------------------------------------+
| 123 | my father and me finished the race top [percentage] |
| 456 | look at this [url] like it [question]               |
| ... |                                                     |
+-----+-----------------------------------------------------+
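As a sanity check, the numeric placeholder rules can be tried on small snippets like the ones in the sample (note the escaped decimal point \\.?; an unescaped . would match any character before the digits):

```r
# The order matters: signed percentages first, then plain percentages,
# then dollar values. \\. matches a literal decimal point.
x <- c("top 5%", "up +3.2%", "down -1.5%", "price $12.50")
x <- gsub("\\+[[:digit:]]*\\.?[[:digit:]]+%", " [positive_percentage] ", x)
x <- gsub("-[[:digit:]]*\\.?[[:digit:]]+%", " [negative_percentage] ", x)
x <- gsub("[[:digit:]]*\\.?[[:digit:]]+%", " [percentage] ", x)
x <- gsub("\\$[[:digit:]]*\\.?[[:digit:]]+", " [dollar_value] ", x)
```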