Aim:
I'm trying to compare each element in a list with all the other elements below it using Levenshtein distance from this package stringsim
to find text that is similar.
Obstacle:
The problem is that due to the time and space complexity, it will take much time to run. This is the complexity for a 5 element array, ending in 10 comparisons/iterations (4+3+2+1):
The calculator and theory can be found here link
Attempt:
I'll reproduce using a normal for loop.
fruits <- fruit[1:5] # 5 elements from fruit
n <- len(fruits) # n set to 5
score_df <- data_frame(x=character(0),y=character(0),score=numeric(0)) # initialize an a matrix to host the strings compare and the score
cnt=0 # Count, for counting the how many iterations ran
i=j=0
for(i in 1:(n-1)){
print(i)
print('----')
for(j in i+1:(n-i)){
cnt = cnt+1
print(j)
initial_term = fruits[i] # First element
compared_term = fruits[j] # second element beneath it
score <- stringsim(initial_term,compared_term, method = 'lv') # Compute Levenshtein distance
term <- data_frame(x=initial_term, y=compared_term, score=score) # Adding term to a dataframe
score_df <- bind_rows(score_df, term) # Appending rows to a dataframe
}
print('====')
}
print(paste('operations count: ', cnt)) # Print the iterations count
You can see the result appears correctly of the 10 elements compared:
> as_tibble(fruits)
# A tibble: 5 x 1
value
<chr>
1 apple
2 apricot
3 avocado
4 banana
5 bell pepper
> score_df
# A tibble: 10 x 3
x y score
<chr> <chr> <dbl>
1 apple apricot 0.286
2 apple avocado 0.143
3 apple banana 0.167
4 apple bell pepper 0.273
5 apricot avocado 0.143
6 apricot banana 0
7 apricot bell pepper 0.0909
8 avocado banana 0.143
9 avocado bell pepper 0
10 banana bell pepper 0.0909
Request:
I was finally able to convert that ordinary loop to a parallelized one. Below is a sample running on this dataset stringr::fruit
I need assistance on optimizing the below loop, so I can run it on ~6k to ~7k rows, if there is one; As my attempt using the below code led my RStudio to crash.
My processor is the below
PS> Get-WmiObject -Class Win32_Processor -ComputerName. | Select-Object -Property Name,NumberOfCores,NumberOfEnabledCore,NumberOfLogicalProcessors,Description
Name : Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz
NumberOfCores : 6
NumberOfEnabledCore : 6
NumberOfLogicalProcessors : 12
Description : Intel64 Family 6 Model 158 Stepping 10
PS>
library(foreach)
library(parallel)
library(doParallel)
fruits <- fruit
n <- length(fruits)
score_df <- data_frame(x=character(0),y=character(0),score=numeric(0))
numCores <- detectCores() # 12
registerDoParallel(numCores - 1) # Assigning 11 threads out of 12
i=j=0
score_df <- foreach(i = 1:(n-1), .combine = 'rbind') %:%
foreach(j = i+1:(n-i), .packages = c("stringdist","tibble","dplyr"), .combine = 'rbind') %dopar% {
initial_term = fruits[i]
compared_term = fruits[j]
score <- stringsim(initial_term,compared_term, method = 'lv')
term <- data_frame(x=initial_term, y=compared_term, score=score)
}
stopImplicitCluster()
The result was the correct expected number of (3160 rows)
> score_df
# A tibble: 3,160 x 3
x y score
<chr> <chr> <dbl>
1 apple apricot 0.286
2 apple avocado 0.143
3 apple banana 0.167
4 apple bell pepper 0.273
5 apple bilberry 0.125
6 apple blackberry 0.200
7 apple blackcurrant 0.0833
8 apple blood orange 0.0833
9 apple blueberry 0.111
10 apple boysenberry 0.0909
# ... with 3,150 more rows
References:
parallel
https://nceas.github.io/oss-lessons/parallel-computing-in-r/parallel-computing-in-r.html
foreach
https://cran.r-project.org/web/packages/foreach/vignettes/foreach.html
Nested foreach
https://cran.r-project.org/web/packages/foreach/vignettes/nested.html