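I am trying to create a distance matrix (to be used for clustering) for strings, based on a custom distance function. I ran the code on a list of 6000 words and it is still running after 90 minutes. I have 8 GB of RAM and an Intel i5, so the problem lies in the code alone. Here is my code: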

library(stringdist)
#Calculate distance between two monograms/bigrams
stringdist2 <- function(word1, word2)
{
    #for bigrams - phrases with two words
    if (grepl(" ",word1)==TRUE) {
        #"Hello World" and "World Hello" are not so different for me
        d=min(stringdist(word1, word2),
        stringdist(word1, gsub(word2, 
                          pattern = "(.*) (.*)", 
                          repl="\\2,\\1")))
    }
    #for monograms(words)
    else{
        #add penalty of 5 points if first character is not same
        #brave and crave are more different than brave and bravery
        d=ifelse(substr(word1,1,1)==substr(word2,1,1),
                            stringdist(word1,word2),
                            stringdist(word1,word2)+5)
    }   
    d
}
#create distance matrix
stringdistmat2 = function(arr)
{
    mat = matrix(nrow = length(arr), ncol= length(arr))
    for (k in 1:(length(arr)-1))
    {
        for (j in k:(length(arr)-1))
        {           
            mat[j+1,k]  = stringdist2(arr[k],arr[j+1])      
        }
    }
    as.dist(mat)    
}

test = c("Hello World","World Hello", "Hello Word", "Cello Word")
mydmat = stringdistmat2(test)
> mydmat
  1 2 3
2 1    
3 1 2  
4 2 3 1

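I think the problem may be that I am using loops instead of apply, but then I have read in many places that loops are not that inefficient. More importantly, I am not skilled enough to use apply, because my loops are nested, like k in 1:n and j in k:n. I would like to know whether there is anything else that can be optimized.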

3 Answers

Interesting problem. Let's go step by step:

1 - The stringdist function is already vectorized:

#> stringdist("byye", c('bzyte','byte'))
#[1] 2 1

#> stringdist(c('doggy','gadgy'), 'dodgy')
#[1] 1 2

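But given two vectors of the same length, stringdist walks through both vectors in parallel, pairing element with element as Map would (it does not produce a matrix of crossed results):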

#> stringdist(c("byye","alllla"), c('bzyte','byte'))
#[1] 2 6
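
The difference between paired and crossed results can be seen without stringdist at all. The sketch below is only an illustration: it swaps in base R's adist() (generalized Levenshtein distance) as a stand-in, which is an assumption made so the example runs without extra packages. stringdist() uses a different default method, so values can differ when transpositions are involved.

```r
# Base R stand-in: utils::adist() (Levenshtein), not stringdist() itself.
x <- c("byye", "alllla")
y <- c("bzyte", "byte")

# Element-wise pairing, like Map(): x[1] with y[1], x[2] with y[2].
paired <- mapply(function(a, b) adist(a, b)[1, 1], x, y, USE.NAMES = FALSE)

# Full cross product: every element of x against every element of y.
crossed <- adist(x, y)

print(paired)   # one value per pair
print(crossed)  # 2 x 2 matrix; its diagonal holds the paired values
```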

2 - Rewrite your function so that the new function keeps this vectorized behavior:

stringdistFast <- function(word1, word2)
{
    d1 = stringdist(word1, word2)
    d2 = stringdist(word1, gsub("(.+) (.+)", "\\2 \\1", word2))

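    # d1 == d2 when swapping changes nothing (e.g. single words): add the
    # 5-point penalty if the first letters of the two words differ
    ifelse(d1==d2,d1+5*(substr(word1,1,1)!=substr(word2,1,1)),pmin(d1,d2))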
}

It works in the same vectorized way:

#> stringdistFast("byye", c('bzyte','byte'))
#[1] 2 1

#> stringdistFast("by ye", c('bzyte','byte','ye by'))
#[1] 3 2 0
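
The first-letter penalty from the question can be computed in vectorized form as well. A minimal sketch, where first_letter_penalty is a hypothetical helper name and not part of the code above:

```r
# Hypothetical helper: penalty of `points` wherever the first letters differ.
# substr() and != are vectorized, so word2 may be a whole vector.
first_letter_penalty <- function(word1, word2, points = 5) {
  points * (substr(word1, 1, 1) != substr(word2, 1, 1))
}

penalty <- first_letter_penalty("brave", c("crave", "bravery", "bold"))
print(penalty)  # 5 0 0
```

The logical comparison is coerced to 0/1 by the multiplication, so no ifelse() is needed for this part.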

3 - Rewrite the distance-matrix function with a single loop that covers only one triangle of the matrix (no outer here, it is slow!):

stringdistmatFast <- function(test)
{
    m = diag(0, length(test))
    sapply(1:(length(test)-1), function(i)
    {
        m[,i] <<- c(rep(0,i), stringdistFast(test[i],test[(i+1):length(test)]))
    }) 

    `dimnames<-`(m + t(m), list(test,test))
}

4 - Use the function:

#> stringdistmatFast(test)
#            Hello World World Hello Hello Word Cello Word
#Hello World           0           0          1          2
#World Hello           0           0          1          2
#Hello Word            1           1          0          1
#Cello Word            2           2          1          0
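
If a dist object is what the clustering step needs, the same one-call-per-column idea can fill it directly and skip the full square matrix. This is only a sketch under assumptions: pairwise_dist and dist_fun are hypothetical names, and base R's adist() stands in for the custom distance so the example runs without stringdist.

```r
# Sketch: build a "dist" object directly, one vectorized call per column.
# A dist object stores the lower triangle column by column.
pairwise_dist <- function(words, dist_fun) {
  n <- length(words)
  stopifnot(n >= 2)
  # Column i of the lower triangle: words[i] against words[(i + 1):n].
  cols <- lapply(seq_len(n - 1), function(i) {
    dist_fun(words[i], words[(i + 1):n])
  })
  structure(unlist(cols), Size = n, Labels = words,
            Diag = FALSE, Upper = FALSE, class = "dist")
}

# Stand-in distance (assumption): base R Levenshtein, one word vs. a vector.
lev <- function(a, b) as.vector(adist(a, b))

d <- pairwise_dist(c("abc", "abd", "xyz"), lev)
print(d)
```

Any vectorized distance with the same shape (one word against a vector of words) could be passed as dist_fun.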
answered 2015-09-02T11:59:34.870

Loops are indeed very inefficient, and here is a quick example that shows that:

x=rnorm(1000000)
system.time({y1=sum(x)})
system.time({
        y2=0
        for(i in 1:length(x)){
                y2=y2+x[i]
        }
})

This is a simple comparison between the internal vectorised function sum(), which essentially adds up all the elements in a loop internally, and a loop that does the same in R code. The R loop has to call another internal function, +, over and over, which is not very efficient.

First of all, you have a couple of mistakes/inconsistencies in your user-defined function. This part: gsub(word2, pattern = "(.*) (.*)", repl="\\2,\\1") swaps the two words but joins them with a comma instead of a space, which automatically adds +1 to the distance score (was that intended?). Second, you don't compare first letters for strings that have spaces in them, because then only the first branch of the function is executed. That is true even if only the first of the compared words contains a space, so the "Hello " and "Cello" comparison would come out as a closer distance than "Hello" and "Cello".
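
The effect of the separator can be checked directly. A small sketch, using base R's adist() as a stand-in for stringdist() (an assumption, so that it runs without the package):

```r
w <- "World Hello"

# Same pattern as in the question, once with a comma and once with a space.
swapped_comma <- gsub("(.*) (.*)", "\\2,\\1", w)
swapped_space <- gsub("(.*) (.*)", "\\2 \\1", w)

print(swapped_comma)  # "Hello,World"
print(swapped_space)  # "Hello World"

# Distance to the phrase it should match after swapping.
dist_comma <- adist("Hello World", swapped_comma)[1, 1]
dist_space <- adist("Hello World", swapped_space)[1, 1]
print(c(dist_comma, dist_space))  # 1 0
```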

Other than that, your code seems easy to vectorise, because all the functions you use are already vectorised: stringdist(), grepl(), gsub(), substr() etc. Basically you perform 3 calculations for each word pair: a simple stringdist(), a stringdist() against the swapped words (if there is a space in the first word), and a simple comparison of first letters that adds +5 points if they differ.

Here is code that reproduces your function in a vectorised manner, which gives around a 50x speed-up when calculating a 300x300 matrix:

stringdist3<-function(words1,words2){
        m1<-stringdist(words1,words2)
        m2<-stringdist(words1,gsub(words2,
                                   pattern = "(.*) (.*)",
                                   repl="\\2,\\1"))
        m=mapply(function(x,y) min(x,y),m1,m2)

        m3<-5*(substr(words1,1,1)!=substr(words2,1,1) & !grepl(" ",words1))

        m3+m
}
stringdistmat3 = function(arr){
        outer(arr,arr,function(x,y) stringdist3(x,y))
}
test = c("Hello World","World Hello", "Hello Word", "Cello Word")
arr=sample(test,size=300,replace=TRUE)
system.time({mat = stringdistmat2(arr)})
system.time({
        mat2=stringdistmat3(arr)
        })
answered 2015-09-02T10:57:09.543

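I also tried an alternative approach to improve on my own solution. Basically I removed the function that computes a single distance and built the distance matrix directly. This is what I came up with. I know this solution can be improved, so any suggestions are welcome.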

strdistmat2 <- function(v1,v2,type="m"){
    #for monograms
    if (type=="m")  {
        penalty = sapply(substr(v1,1,1),stringdist,b=substr(v2,1,1)) * 5
        # element-wise sum keeps the matrix; sum() would collapse it to one number
        d = sapply(v1,stringdist,b=v2) + penalty
    }
    #for bigrams
    else if(type=="b")  {       
        d1 = sapply(v1,stringdist,b=v2) 
        d2 = sapply(v1,stringdist,b=gsub(v2,pattern = "(.*) (.*)", repl="\\2 \\1"))
        d = pmin(d1,d2)
    }
    d
}

I compared the timings of the various solutions below.

> test = c("Hello World","World Hello", "Hello Word", "Cello Word")
> arr=sample(test,size=6000,replace=TRUE)
> system.time({mat=strdistmat2(arr,arr,"b")})
   user  system elapsed 
  96.89    1.63   70.36 
> system.time({mat2=stringdistmat3(arr)})
   user  system elapsed 
 469.40    5.69  439.96 
> system.time({mat3=stringdistmatFast(arr)})
   user  system elapsed 
  57.34    0.72   41.22 

So the Colonel's answer (stringdistmatFast) is the fastest.

Also, on the real data, both my code and Maksim's code crashed, and only the Colonel's answer worked. Here is the result:

> system.time({mat3=stringdistmatFast(words)})
   user  system elapsed 
 314.63    1.78  291.94 

When I ran my solution on the real data, the error message said that a vector of 684 MB could not be allocated, and when I ran Maksim's solution, R stopped working.
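
A back-of-the-envelope check of the memory involved, as a sketch only: it assumes n = 6000 (the list size from the question; the length of the real words vector is not stated) and 8 bytes per numeric value. The helper names are hypothetical, and the figures are not meant to reproduce the 684 MB from the error message.

```r
# Rough size, in MB, of the numeric storage needed for n words.
full_matrix_mb    <- function(n) n * n * 8 / 1e6            # full n x n matrix
lower_triangle_mb <- function(n) n * (n - 1) / 2 * 8 / 1e6  # dist-style storage

n <- 6000  # assumption: list size taken from the question
print(full_matrix_mb(n))     # 288
print(lower_triangle_mb(n))  # 143.976
```

In its "b" branch, strdistmat2 holds d1, d2 and the pmin result at the same time, each a full n x n matrix, so its peak usage is several times the first figure.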

answered 2015-09-03T05:52:15.460