r - How to create a word grouping report using R language and .Net?

Question

I would like to create a simple application in C# that takes in a group of words, then returns all groupings of those individual words from a data set.

For example, given car and bike, return a list of groups/combinations of words (with the number of combinations found) from a data set.

To further clarify - given a category named "car", I would like to see a list of word groupings with the word "car". This category could also be several words rather than just one.

With a sample data set of:

CAR:

Another car for sale
Blue car on the horizon
For Sale - used car
this car is painted blue

should return

car : for sale : 2
car : blue : 2

I'd like to set a threshold, say 20 or greater, so that if there are 20 or more instances of the word(s) appearing with car, they are displayed as category, words, count. Only the category is known; the words and the count are determined by the algorithm.

The data set is in a SQL Server 2008 table, and I was hoping to use something like a .Net implementation of R to accomplish this.

I am guessing that the best way to accomplish this may be with the R programming language, and am only now looking at R.Net.

I would prefer to do this with .Net, as that is what I am most familiar with, but open to suggestions.

Can someone with some experience with this lead me in the right direction?

Thanks.

score 0 · Accepted Answer

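Your question seems to consist of 4 parts:

1. Getting data from SQL Server 2008
2. Extracting substrings from a set of strings
3. Setting a threshold for when to accept that count
4. Producing some document or other output containing this (?)

For part 1, I think that is a separate question (see the RODBC package), and I won't deal with it here since it is not the main part of your question. You left part 4 a little vague, and I think it is also peripheral to the core of your question.

Part 2 can be handled easily with regular expressions: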

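countstring <- function(string, pattern){
  # Number of elements of `string` that contain `pattern`, ignoring case
  stringcount <- sum(grepl(pattern, string, ignore.case=TRUE), na.rm=TRUE)
  # Build "category : pattern : count"; the category is the name of the variable passed in
  paste(deparse(substitute(string)), pattern, stringcount, sep=" : ")
}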

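This function takes a character vector and a pattern to search for. It finds which elements match the pattern and sums the matches (i.e. the count). It then pastes the category name, the pattern, and the count together into a single string. For example: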

car <- c("Another car for sale", "Blue car on the horizon", "For Sale - used car", "this car is painted blue")
countstring(car, "blue")
## [1] "car : blue : 2"
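
To check several candidate phrases in one go, the same function can be applied over a vector of patterns. The following is a minimal sketch, not part of the original answer: countstring is repeated so the block runs on its own, and the patterns vector is a hypothetical list chosen to match the sample data in the question.

```r
# countstring as defined above, repeated so this block is self-contained
countstring <- function(string, pattern){
  stringcount <- sum(grepl(pattern, string, ignore.case=TRUE), na.rm=TRUE)
  paste(deparse(substitute(string)), pattern, stringcount, sep=" : ")
}

car <- c("Another car for sale", "Blue car on the horizon",
         "For Sale - used car", "this car is painted blue")

# Hypothetical phrases to look for (an assumption for illustration)
patterns <- c("for sale", "blue")

# One report line per pattern; unname() drops the names that vapply attaches
report <- unname(vapply(patterns, function(p) countstring(car, p), character(1)))
report
## [1] "car : for sale : 2" "car : blue : 2"
```

Because the call inside the anonymous function still passes the variable car by name, deparse(substitute(string)) keeps reporting "car" as the category.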

Part 3 requires a small change to the function:

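countstring <- function(string, pattern, threshold=20){
  # Number of elements of `string` that contain `pattern`, ignoring case
  stringcount <- sum(grepl(pattern, string, ignore.case=TRUE), na.rm=TRUE)

  # Report only when the count reaches the threshold; otherwise return NULL invisibly
  if(stringcount >= threshold){
    paste(deparse(substitute(string)), pattern, stringcount, sep=" : ")
  }
}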
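
With the default threshold of 20, the four-line sample produces no output, so a lower threshold is needed to see a result. The question also asks for the words themselves to be found by the algorithm rather than supplied. The sketch below shows one possible way to do that in base R, as an assumption rather than part of the original answer: it splits each line into lowercase words, counts how many lines each word appears in, and keeps the words that reach the threshold. It counts single words only, so a phrase such as "for sale" shows up as two separate entries.

```r
# Threshold version from above, repeated so this block is self-contained
countstring <- function(string, pattern, threshold=20){
  stringcount <- sum(grepl(pattern, string, ignore.case=TRUE), na.rm=TRUE)
  if(stringcount >= threshold){
    paste(deparse(substitute(string)), pattern, stringcount, sep=" : ")
  }
}

car <- c("Another car for sale", "Blue car on the horizon",
         "For Sale - used car", "this car is painted blue")

above <- countstring(car, "blue", threshold=2)  # count of 2 meets the threshold
below <- countstring(car, "blue")               # default of 20 is not met, so NULL

# Assumption: candidate words are not known in advance.
category <- "car"
threshold <- 2

# Distinct lowercase words per line, so a word is counted once per line
words <- lapply(strsplit(tolower(car), "[^a-z]+"),
                function(w) unique(w[nzchar(w)]))

# Number of lines each word appears in
counts <- table(unlist(words))

# Drop the category word itself and anything below the threshold
keep <- names(counts) != category & as.vector(counts) >= threshold
counts <- counts[keep]

wordreport <- paste(category, names(counts), as.vector(counts), sep=" : ")
wordreport
## [1] "car : blue : 2" "car : for : 2"  "car : sale : 2"
```

Common filler words such as "for" will appear in the report; filtering them out with a stop-word list would be a natural next step on real data.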
