4

非常感谢您对我的问题的帮助。我已经研究了好几个星期了,但找不到解决方案……抱歉,描述太长了。我想我发布了很多细节,所以你完全明白了这个问题,虽然它并不难掌握。

情况:

由 400 万个条目组成的数据框(总大小:250 MB)。两列:IDTEXT。每个文本字符串最多 200 个字符。

任务:

预处理文本字符串

问题:

时间太长了。在我的 8GB RAM 双核机器上:1 天后取消。在 70GB 8 核 Amazon EC2 实例上:1 天后取消。

细节:

我基本上是

  • 计算某些单词在一个字符串中出现的频率
  • 将此数字写入新列 (COUNT)
  • 替换这个(计数的)单词
  • 替换其他词(我之前不需要计算)
  • 替换一些正则表达式

用作模式的向量如下所示:

"\\bWORD1\\b|\\bWORD2\\b|\\bWORD3\\b|\\bWORD4\\b..."

因此,这些“替换向量”是长度为 1 的字符向量,每个包含最多 800 个单词

主要的:

library("parallel")
library("stringr")

preprocessText<-function(x){

  # Replace the 'html-and'
  arguments<-list(pattern="\\&amp\\;",replacement="and",x=x, ignore.case=TRUE)
  y<-do.call(gsub, arguments)  

  # Remove some special characters
   arguments<-list(pattern="[^-[:alnum:]\\'\\:\\/\\$\\%\\.\\,\\+\\-\\#\\@\\_\\!\\?+[:space:]]",replacement="",x=y, ignore.case=TRUE)
  y<-do.call(gsub, arguments)  

  # Lowercase 
  arguments<-list(string=y,pattern=tolower(rep_ticker))
  first<-do.call(str_match,arguments)  

  # Identify signal words and count them
  # Need to be done in parts, because otherwise R can't handle this many at once
  arguments<-list(string=x, pattern=rep_words_part1)
  t1<-do.call(str_extract_all,arguments)

  arguments<-list(string=x, pattern=rep_words_part2)
  t2<-do.call(str_extract_all,arguments)

  arguments<-list(string=x, pattern=rep_words_part3)
  t3<-do.call(str_extract_all,arguments)

  arguments<-list(string=x, pattern=rep_words_part4)
  t4<-do.call(str_extract_all,arguments)

  count=length(t1[[1]])+length(t2[[1]])+length(t3[[1]])+length(t4[[1]])
  signal_words=c(t1[[1]],t2[[1]],t3[[1]],t4[[1]])


  # Replacements

  arguments<-list(pattern=rep_wordsA,replacement=" [wordA] ",x=y, ignore.case=TRUE)
  y<-do.call(gsub, arguments) 

  arguments<-list(pattern=rep_wordB_part1,replacement=" [wordB] ",x=y, ignore.case=TRUE)
  y<-do.call(gsub, arguments)   

  arguments<-list(pattern=rep_wordB_part2,replacement=" [wordB] ",x=y, ignore.case=TRUE)
  y<-do.call(gsub, arguments)   

  arguments<-list(pattern=rep_wordB_part3,replacement=" [wordB] ",x=y, ignore.case=TRUE)
  y<-do.call(gsub, arguments)   

  arguments<-list(pattern=rep_wordB_part4,replacement=" [wordB] ",x=y, ignore.case=TRUE)
  y<-do.call(gsub, arguments)   

  arguments<-list(pattern=rep_email,replacement=" [email_adress] ",x=y, ignore.case=TRUE)
  y<-do.call(gsub, arguments)   

  arguments<-list(pattern=rep_url,replacement=" [url] ",x=y, ignore.case=TRUE)
  y<-do.call(gsub, arguments)   

  arguments<-list(pattern=rep_wordC,replacement=" [wordC] ",x=y, ignore.case=TRUE)
  y<-do.call(gsub, arguments)   

  # Some regular expressions
  arguments<-list(pattern="\\+[[:digit:]]*.?[[:digit:]]+%",replacement=" [positive_percentage] ",x=y, ignore.case=TRUE)
  y<-do.call(gsub, arguments)   

  arguments<-list(pattern="-[[:digit:]]*.?[[:digit:]]+%",replacement=" [negative_percentage] ",x=y, ignore.case=TRUE)
  y<-do.call(gsub, arguments)   

  arguments<-list(pattern="[[:digit:]]*.?[[:digit:]]+%",replacement=" [percentage] ",x=y, ignore.case=TRUE)
  y<-do.call(gsub, arguments)   

  arguments<-list(pattern="\\$[[:digit:]]*.?[[:digit:]]+",replacement=" [dollar_value] ",x=y,ignore.case=TRUE)
  y<-do.call(gsub, arguments)   

  arguments<-list(pattern="\\+[[:digit:]]*.?[[:digit:]]+",replacement=" [pos_number] ",x=y, ignore.case=TRUE)# remaining numbers 
  y<-do.call(gsub, arguments)   

  arguments<-list(pattern="\\-[[:digit:]]*.?[[:digit:]]+",replacement=" [neg_number] ",x=y, ignore.case=TRUE)
  y<-do.call(gsub, arguments)   

  arguments<-list(pattern="[[:digit:]]*.?[[:digit:]]+",replacement=" [number] ",x=y, ignore.case=TRUE)
  y<-do.call(gsub, arguments)   

  arguments<-list(pattern=rep_question,replacement=" [question] ", x=y, ignore.case=TRUE)
  y<-do.call(gsub, arguments)    


  # Unify synonyms
  arguments<-list(pattern=rep_syno1,replacement="happy", x=y, ignore.case=TRUE)
  y<-do.call(gsub, arguments)  

  arguments<-list(pattern=rep_syno2,replacement="sad", x=y, ignore.case=TRUE)
  y<-do.call(gsub, arguments)  

  arguments<-list(pattern=rep_syno3,replacement="people", x=y, ignore.case=TRUE)
  y<-do.call(gsub, arguments)  

  arguments<-list(pattern=rep_syno4,replacement="father", x=y, ignore.case=TRUE)
  y<-do.call(gsub, arguments)  

  arguments<-list(pattern=rep_syno5,replacement="mother", x=y, ignore.case=TRUE)
  y<-do.call(gsub, arguments)  

  arguments<-list(pattern=rep_syno6,replacement="money", x=y, ignore.case=TRUE)
  y<-do.call(gsub, arguments)  

  # Remove words
  # Punctuation (I know there a pre-defined R commands for this, but I need to customize this
  arguments<-list(pattern=rem_punct,replacement="", x=y, ignore.case=TRUE) 
  y<-do.call(gsub, arguments)  

  arguments<-list(pattern=rem_linebreak,replacement=" ", x=y, ignore.case=TRUE) #Remove line breaks
  y<-do.call(gsub, arguments) 

  #Append Positive or Negative Emotion  
  arguments<-list(x=y)
  y<-do.call(appendEmotion, arguments)  


  # Output
  result<-list(
    textclean=y,
    first_signal=first,
    all_signals=signal_words,
    signal_count=count)

  return(result)
}

resultList<-mclapply(dataframe$text_column,preprocessText)

(返回将是一个列表,我计划将其转换为 data.frame)。

之前,我也尝试分别调用每个,因此在每个文本字符串上gsub执行第一个,然后是第二个,依此类推..但我想这效率更低。gsubgsub

代码本身有效,但对我来说似乎可以加快速度。不幸的是,我不熟悉哈希表,我听说这可能是一个解决方案。

非常感谢您的建议和帮助!

定义内部调用的一个函数preprocessText

appendEmotion<-function(x){

  if (grepl(app_pos,x)){
    x<-paste(x," [posemo] ")
  } 
  if(grepl(app_neg,x)){
    x<-paste(x," [negemo] ")
  }  
  #Output
  return(x)
}

示例数据:

+------------+-----------------------------------------+
|  ID  |                     Text                      |
+------------+-----------------------------------------+
| 123  | My dad and me finished the race top 5%        |
| 456  | Look at this http://www.google.com, Like it ? |
  ..
+------------+-----------------------------------------+ 

应该成为

+------------+-------------------------------------------------+
|  ID  |                     Text                              |
+------------+-------------------------------------------------+
| 123  | my father and me finished the race top [percentage]   |
| 456  | look at this [url] like it [question]                 |
  ..
+------------+--------------------------------------------------+ 
4

0 回答 0