python - 包含python中所有项目的重复子序列

Question

想象一下，我们有一个列表，比如 [255,7,0,0,255,7,0,0,255,7,0,0,255,7,0,0] 我们想找到包含子序列中所有项目的最短公共子序列（NOT SUBSTRING），255,7,0,0在这种情况下，我们不知道模式的长度。

即使中间有一些像这个序列这样的乱码，该程序也应该可以工作。255,7,0,0,4,3,255,5,6,7,0,0,255,7,0,0,255,7,0,1,2,0，它应该返回重复的子序列255,7,0,0。

我尝试了最长的公共子序列，但由于算法是贪婪的，它不适用于这种情况，因为它会返回所有匹配而不是最短的匹配。非常感谢您的帮助。

import numpy as np
cimport numpy as np
from libc.stdlib cimport *
from clcs cimport *
np.import_array()
def lcs_std(x, y):

"""Standard Longest Common Subsequence (LCS)
algorithm as described in [Cormen01]_.Davide Albanese
The elements of sequences must be coded as integers.

:Parameters:
   x : 1d integer array_like object (N)
      first sequence
   y : 1d integer array_like object (M)
      second sequence
:Returns:
   length : integer
      length of the LCS of x and y
   path : tuple of two 1d numpy array (path_x, path_y)
      path of the LCS
"""

cdef np.ndarray[np.int_t, ndim=1] x_arr
cdef np.ndarray[np.int_t, ndim=1] y_arr
cdef np.ndarray[np.int_t, ndim=1] px_arr
cdef np.ndarray[np.int_t, ndim=1] py_arr
cdef char **b
cdef int i
cdef Path p
cdef int length

x_arr = np.ascontiguousarray(x, dtype=np.int)
y_arr = np.ascontiguousarray(y, dtype=np.int)

b = <char **> malloc ((x_arr.shape[0]+1) * sizeof(char *))
for i in range(x_arr.shape[0]+1):
    b[i] = <char *> malloc ((y_arr.shape[0]+1) * sizeof(char))    

length = std(<long *> x_arr.data, <long *> y_arr.data, b,
              <int> x_arr.shape[0], <int> y_arr.shape[0])

trace(b, <int> x_arr.shape[0], <int> y_arr.shape[0], &p)

for i in range(x_arr.shape[0]+1):
    free (b[i])
free(b)

px_arr = np.empty(p.k, dtype=np.int)
py_arr = np.empty(p.k, dtype=np.int)

for i in range(p.k):
     px_arr[i] = p.px[i]
     py_arr[i] = p.py[i]

free (p.px)
free (p.py)

return length, (px_arr, py_arr)

score 0 · Accepted Answer

看看顺序模式挖掘

您似乎重新发明了序列中的频繁项集，但我认为有十几种算法。

韩，J。程，H。辛，D。闫 X. (2007)。《频繁模式挖掘：现状与未来方向》。数据挖掘和知识发现 15 (1): 55–86。

python - 包含python中所有项目的重复子序列

1 回答 1

Related

Reference