python - 从 R 或 Python 中的列中获取经常出现的字符串模式

Question

我有一列字符串名称，我想找到经常出现的模式（单词）。有没有办法返回比 X 更长（或等于）长度的字符串，并且在整个列中出现的频率高于 Y 次？

column <- c("bla1okay", "okay1243bla", "blaokay", "bla12okay", "okaybla")
getOftenOccuringPatterns <- function(.....) 
getOftenOccuringPatterns(column, atleaststringsize=3, atleasttimes=4)
>     what   times 
[1]   bla    5
[2]   okay   5

参考蒂姆的评论：

我希望删除嵌套的，所以如果有“aaaaaaa”和“aaaa”，并且两者都会出现在输出中，那么只有“aaaaaaaa”和一个出现的次数很重要。

如果atleaststringsize=3和atleaststringsize=4，则两者的输出将相同。比方说atleasttimes=10，“aaaaaaaa”出现 15 次，“aaaaaa”出现 15 次，那么：

getOftenOccurringPatterns(column, atleaststringsize=3, atleasttimes=10)
>    what      times
[1]  aaaaaaaa    15

和

getOftenOccurringPatterns(column, atleaststringsize=4, atleasttimes=10) 
>    what      times
[1]  aaaaaaaa    15

最长的停留时间，对于 atleast=3 和 atleast=4 都是一样的。

score 2 · Accepted Answer

这将创建一个包含所有子字符串的所有出现的向量；它天真地这样做了，迭代输入字符串 max(nchar(x)) 的最大长度并查找长度为 1、2、... max(nchar(x)) 的所有子序列，因此在多项式时间内进行缩放——对于超大问题，它不会有效。

此修订包含以下更改：

.accumulate在先前版本的内部和外部循环中实现了可怕的“复制和附加”模式；现在我们将结果累积到一个预先分配的列表中answer0，然后在内循环之后累积这些结果。
allSubstrings()有参数min_occur, min_nchar(and max_nchar) 来限制搜索空间。特别是，min_occur（保留子字符串必须出现的最小次数）有助于减少在其中搜索较长子字符串的字符向量的长度。
该函数.filter()可用于更积极地删除不包含长度为 i 的子字符串的字符串；这可能代价高昂，因此useFilter可以设置启发式和参数。过滤器的使用使整个解决方案看起来更像是一种 hack，而不是一种算法——关于子字符串的信息已经被提取出来，所以我们不必再回去寻找它们的出现。

这是修改后的主要功能

allSubstrings <-
    function(x, min_occur=1L, min_nchar=1L, max_nchar=max(nchar(x)),
             ..., useFilter=max(nchar(x)) > 100L)
{
    len <- nchar(x)
    x <- x[len >= min_nchar]; len <- len[len >= min_nchar]
    answer <- vector("list", max_nchar - min_nchar + 1L)
    for (i in seq(min_nchar, max_nchar)) {
        ## suffix of length i, starting at character j
        x0 <- x; len0 <- len; n <- max(len0) - i + 1L
        answer0 <- vector("list", n)
        for (j in seq_len(n)) {
            end <- j + i - 1L
            f <- factor(substr(x0, j, end))
            answer0[[j]] <- setNames(tabulate(f), levels(f))
            x0 <- x0[len0 != end]; len0 <- len0[len0 != end]
        }
        answer0 <- unlist(answer0)        # accumulate across start positions
        answer0 <- vapply(split(answer0, names(answer0)), sum, integer(1))
        answer0 <- answer0[answer0 >= min_occur]
        if (length(answer0) == 0L)
            break
        answer[[i - min_nchar + 1L]] <- answer0

        idx <- len != i                   # no need to process some strings
        if (useFilter)
            idx[idx] <- .filter(x[idx], names(answer0))
        x <- x[idx]; len <- len[idx]
        if (length(x) == 0L)
            break
    }
    unlist(answer[seq_len(i)])
}

和.filter功能

.filter <-
    function(s, q)
{
    ## which 's' contain at least one 'q'
    answer <- rep(FALSE, length(s))
    idx <- !answer      # use this to minimize the number of greps
    for (elt in q) {
        answer[idx] <- answer[idx] | grepl(elt, s[idx], fixed=TRUE)
        idx[idx] <- !answer[idx]
    }
    answer
}

和以前一样，结果是一个命名向量，其中名称是字符串，值是它们出现的次数。

> column <- c("bla1okay", "okay1243bla", "blaokay", "bla12okay", "okaybla")
> xx <- allSubstrings(column)
> head(sort(xx, decreasing=TRUE))
 a  b  o  k  l  y 
10  5  5  5  5  5 
> xtabs(~nchar(names(xx)) + xx)
                xx
nchar(names(xx))  1  2  3  5 10
              1   2  1  1  5  1
              2   8  2  0  5  0
              3  15  1  0  3  0
              4  20  1  0  1  0
              5  22  0  0  0  0
....

像原始问题中的查询很容易执行，例如，所有 >= 3 个字符的子字符串出现超过 4 次：

> (ok <- xx[nchar(names(xx)) >= 3 & xx > 4])
 bla  oka  kay okay 
   5    5    5    5

该代码没有完全回答问题，例如存在嵌套子字符串，但可能会替换lapply@user1609452 答案的嵌套部分。对这个结果进行后处理以消除嵌套子序列有点不雅，但由于后处理的结果不大，可能会足够快，例如，消除嵌套子串

> fun <- function(p, q) length(grep(p, q, fixed=TRUE))
> ok[ sapply(names(ok), fun, names(ok)) == 1L ]   
 bla okay 
   5    5

在这里，我们使用笔记本电脑上的 99k 单词词典进行输入，并为修改后的算法提供了一些基本时序

> timer <- function(n, x, ...)
    system.time(allSubstrings(head(x, n), ...))[[3]]
> n <- c(100, 1000, 10000, 20000)
> data.frame(n=n, elapsed=sapply(n, timer, words))
      n elapsed
1   100   0.050
2  1000   0.074
3 10000   0.490
4 20000   1.031

这比原始算法快了大约 10 倍，在这种情况下完全归功于修订版 1（使用预分配和填充，然后是累积）。

这是一个较长句子的语料库

shakes <- readLines("http://www.gutenberg.org/cache/epub/100/pg100.txt")
shakes <- paste(shakes[nchar(shakes) != 0], collapse=" ")
shakes <- gsub(" +", " ", shakes)
shakes <- strsplit(shakes, "\\. +",)[[1]]

和一些时间。min_occur这从指定参数和使用过滤器中受益匪浅。

> n <- c(100, 1000, 2000, 5000)
> data.frame(n=n, elapsed=sapply(n, timer, shakes, min_occur=10))
     n elapsed
1  100   1.725
2 1000   7.724
3 2000  12.415
4 5000  60.914

使用过滤器的需要和较长字符串的较差性能导致人们想要获得更好的算法，例如suffix array；“Rlibstree”包也可能有用，尽管我不确定从哪里获得当前版本或者接口的暴露部分是否足以回答原始问题。

score 2 · Accepted Answer

它绝不经过测试，也不会赢得任何速度比赛：

getOftenOccuringPatterns <- function(column, atleaststringsize, atleasttimes, uniqueInColumns = FALSE){

  res <- 
  lapply(column,function(x){
    lapply(atleaststringsize:nchar(x),function(y){
      if(uniqueInColumns){
        unique(substring(x, 1:(nchar(x)-y+1), y:nchar(x)))
      }else{
        substring(x, 1:(nchar(x)-y+1), y:nchar(x))
      }
    })
  })

  orderedRes <- unlist(res)[order(unlist(res))]
  encodedRes <- rle(orderedRes)

  partRes <- with(encodedRes, {check = (lengths >= atleasttimes);
                               list(what = values[check], times = lengths[check])})
  testRes <- sapply(partRes$what, function(x){length(grep(x, partRes$what)) > 1})

  lapply(partRes, '[', !testRes)

}


column <- c("bla1okay", "okay1243bla", "blaokay", "bla12okay", "okaybla")
getOftenOccuringPatterns(column, atleaststringsize=3, atleasttimes=4)
$what

 "bla" "okay" 

$times

5 5 


getOftenOccuringPatterns(c("aaaaaaaa", "aaaaaaa", "aaaaaa", "aaaaa", "aaaa", "aaa"), atleaststringsize=3, atleasttimes=4)
$what
[1] "aaaaaa"

$times
[1] 6


getOftenOccuringPatterns(c("aaaaaaaa", "aaaaaaa", "aaaaaa", "aaaaa", "aaaa", "aaa"), atleaststringsize=3, atleasttimes=4, uniqueInColumn = TRUE)
$what
[1] "aaaaa"

$times
[1] 4

score 1 · Accepted Answer

好的，我已经用 Python 编写了一个解决方案。抱歉，我不能给你一个有效的 R 程序，但你应该可以从中实现一个。正如您所看到的，这是一个非常强力的解决方案，但我并没有真正看到从输入中的所有字符串构建所有可能的子字符串的方法。

我已将问题分解为简单、独立的步骤。这些应该很容易翻译成 R。我确信 R 中有用于列表、集合和计数器的可比较的数据结构。

from collections import Counter
strings = ["bla1okay", "okay1243bla", "blaokay", "bla12okay", "okaybla"]

def substrings(s, minlength=3):
    """Finds all possible unique substrings of s, given a minimum length.

    >>> substrings("12345")
    {'1234', '234', '345', '12345', '123', '2345'}
    >>> substrings("123123")
    {'2312', '123123', '12312', '123', '23123', '1231', '231', '3123', '312'}
    >>> substrings("aaaaa")
    {'aaaaa', 'aaaa', 'aaa'}
    """
    maxsize = current = len(s)
    result = []
    while current >= minlength:
        result.extend([s[start:start+current] 
                       for start in range(maxsize-current+1)])
                                  # range(5) is [0,1,2,3,4]
        current -= 1
    return set(result) # set() removes duplicates

def all_substrings(strings, minlength=3):
    """Returns the union of all the sets of substrings of a list of strings.

    >>> all_substrings(["abcd", "1234"])
    {'123', 'abc', 'abcd', '1234', 'bcd', '234'}
    >>> all_substrings(["abcd", "bcde"])
    {'abc', 'bcd', 'cde', 'abcd', 'bcde'}
    """
    result = set()
    for s in strings:
        result |= substrings(s, minlength)
        # "|=" is the set union operator
    return result

def count(strings, minlength=3):
    """Counts the occurrence of each substring within the provided list of strings,
    given a minimum length for each substring.

    >>> count(["abcd", "bcde"])
    Counter({'bcd': 2, 'bcde': 1, 'abc': 1, 'abcd': 1, 'cde': 1})
    """
    substrings = all_substrings(strings, minlength)
    counts = Counter()
    for substring in substrings:       # Check each substring
         for string in strings:        # against each of the original strings
             if substring in string:   # to see whether it is contained there
                 counts[substring] += 1
    return counts

def prune(counts, mincount=4):
    """Returns only the longest substrings whose count is >= mincount.
    First, all the substrings with a count < mincount are eliminated.
    Then, only those that aren't substrings of a longer string are kept.
    >>> prune(Counter({'bla': 5, 'kay': 5, 'oka': 5, 'okay': 5, 'la1': 2, 'bla1': 2}))
    [('okay', 5), ('bla', 5)]
    """
    # Throw out all counts < mincount. Sort result by length of the substrings.
    candidates = sorted(((s,c) for s,c in counts.items() if c >= mincount), 
                        key=lambda l: len(l[0]), reverse=True) # descending sort
    result = []
    seenstrings = set()      # Set of strings already in our result
    # (we could also look directly in the result, but set lookup is faster)
    for item in candidates:
        s = item[0]          # item[0] contains the substring
        # Make sure that s is not already in our result list
        if not any(s in seen for seen in seenstrings): 
            result.append(item)
            seenstrings.add(s)
    return result

counts = count(strings)
print(prune(counts))

输出：

[('okay', 5), ('bla', 5)]

python - 从 R 或 Python 中的列中获取经常出现的字符串模式

3 回答 3

Related

Reference