I would like to create a simple application in C# that takes in a group of words, then returns all groupings of those individual words from a data set.

For example, given car and bike, return a list of groups/combinations of words (with the number of combinations found) from a data set.

To further clarify - given a category named "car", I would like to see a list of word groupings with the word "car". This category could also be several words rather than just one.

With a sample data set of:

CAR:

  • Another car for sale
  • Blue car on the horizon
  • For Sale - used car
  • this car is painted blue

should return

car : for sale : 2
car : blue : 2

I'd like to set a threshold, say 20, so that if there are 20 or more instances of the word(s) appearing with car, they are displayed as category, words, count. Only the category is known; the words and the count are determined by the algorithm.

The data set is in a SQL Server 2008 table, and I was hoping to use something like a .Net implementation of R to accomplish this.

I am guessing that the best way to accomplish this may be with the R programming language, and am only now looking at R.Net.

I would prefer to do this with .Net, as that is what I am most familiar with, but open to suggestions.

Can someone with some experience with this lead me in the right direction?

Thanks.

1 Answer

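Your question seems to consist of 4 parts:

  1. Getting the data from SQL Server 2008
  2. Extracting substrings from a set of strings
  3. Setting a threshold for when to accept the count
  4. Producing some document or other output containing this (?).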

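Part 1 is, I think, a separate question (see the RODBC package), and I won't deal with it here because it isn't the main part of your question. You left part 4 somewhat vague, and I think it is also peripheral to the core of your question.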

Part 2 can be handled easily with regular expressions:

countstring <- function(string, pattern){
  stringcount <- sum(grepl(pattern, string, ignore.case=TRUE), na.rm=TRUE)
  paste(deparse(substitute(string)), pattern, stringcount, sep=" : ")
}

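This function takes a vector of strings and a pattern to search for. It finds which of the strings match the pattern and sums the matches (i.e. the count), then pastes the vector's name, the pattern, and the count together into a single string. For example: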

car <- c("Another car for sale", "Blue car on the horizon", "For Sale - used car", "this car is painted blue")
countstring(car, "blue")
## [1] "car : blue : 2"
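
One caveat: `grepl` matches substrings, so a pattern such as "car" also matches "carpet". If whole-word matching is what you want, a minimal sketch could wrap the pattern in `\\b` word boundaries. The helper name `count_whole_word` and the sample vector `ads` are made up for illustration and are not part of the function above.

```r
# Hypothetical helper: count the strings that contain `word` as a whole word.
# `word` is still treated as a regular expression, so escape any metacharacters.
count_whole_word <- function(strings, word) {
  # \\b marks a word boundary, so "car" does not match inside "carpet"
  pattern <- paste0("\\b", word, "\\b")
  sum(grepl(pattern, strings, ignore.case = TRUE, perl = TRUE), na.rm = TRUE)
}

ads <- c("Another car for sale", "Red carpet for sale", "Blue car on the horizon")

sum(grepl("car", ads, ignore.case = TRUE))  # substring match: 3
count_whole_word(ads, "car")                # whole-word match: 2
count_whole_word(ads, "for sale")           # multi-word phrase: 2
```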

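Part 3 requires a small change to the function, so that it returns the string only when the count reaches the threshold: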

countstring <- function(string, pattern, threshold=20){
  stringcount <- sum(grepl(pattern, string, ignore.case=TRUE), na.rm=TRUE)

  if(stringcount >= threshold){
    paste(deparse(substitute(string)), pattern, stringcount, sep=" : ")
  }

}
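
The functions above need the pattern to be supplied, whereas the question asks for the words to be discovered. A rough sketch of one way to do that, assuming single words are enough (phrases such as "for sale" would need n-grams on top of this): keep the strings that mention the category, split them into words, and count how many strings each word appears in. The name `cooccurring_words` and its arguments are hypothetical.

```r
# Hypothetical helper, not part of the answer's countstring():
# for every word that appears alongside `category`, count the strings that
# contain it, and keep the words whose count is at least `threshold`.
cooccurring_words <- function(strings, category, threshold = 20) {
  lowered <- tolower(strings)
  # keep only the strings that mention the category (literal, case-insensitive)
  hits <- lowered[grepl(tolower(category), lowered, fixed = TRUE)]
  # split on anything that is not a letter or digit;
  # unique() makes a word count once per string
  words <- lapply(strsplit(hits, "[^a-z0-9]+"), unique)
  counts <- table(as.character(unlist(words)))
  # drop empty tokens and the category itself (when it is a single word)
  keep <- !(names(counts) %in% c("", tolower(category))) &
    as.vector(counts) >= threshold
  sort(counts[keep], decreasing = TRUE)
}

car <- c("Another car for sale", "Blue car on the horizon",
         "For Sale - used car", "this car is painted blue")

res <- cooccurring_words(car, "car", threshold = 2)
# format as "category : word : count"
paste("car", names(res), as.vector(res), sep = " : ")
```

With the sample data and a threshold of 2, this keeps three words (blue, for and sale), each with a count of 2. With the default threshold of 20 nothing would be returned for such a small sample.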
answered 2013-02-11T06:32:45.113