string - 面试街字符串相似性挑战的 Haskell 解决方案

Question

这是我为 InterviewStreet解决字符串相似性挑战的最佳尝试。

import Control.Monad
import Data.Text as T
import qualified Data.Text.IO as TIO


sumSimilarities s = (T.length s) + (sum $ Prelude.map (similarity s) (Prelude.tail $ tails s))

similarity :: Text -> Text -> Int
similarity a b = case commonPrefixes a b of
                     Just (x,_,_) -> T.length x
                     Nothing -> 0

main = do
    cases <- fmap read getLine
    inputs <- replicateM cases TIO.getLine
    forM_ inputs $ print . sumSimilarities

它只通过了 7/10 的测试用例。测试用例 7、8 和 9 失败，因为它们超过了分配的执行时间。

我一半试图验证这确实可以在 Haskell 中解决，一半在寻找优化的 Haskell 程序的样子。

谢谢！泰勒

score 5 · Accepted Answer

像user5402一样，我很好奇等效（对于某些等效值）C 程序是否会在时间限制内完成或超时。ByteString如果可以的话，看看使用s的等效程序是否能及时完成会很有趣。- 并不是说ByteString s 本身比快Text，但由于输入必须转换为Textwhile的内部表示ByteString，因此可能会有所不同。ByteStrings 可能更快的另一个可能原因（如果测试机器具有 32 位 GHC）是该文本的 fusion 至少过去需要比 32 位架构上通常可用的更多寄存器才能获得全部利润[很久以前，在 text-0.5 到 text-0.7 的日子里，在我的 32 位盒子上，字节串曾经是快了很多，不知道这是否仍然适用于较新的文本版本]。

好的，因为user5402已经验证了 naïve 算法在 C 中足够快，所以我继续使用ByteStrings编写了 naïve 算法的实现

{-# LANGUAGE BangPatterns #-}
module Main (main) where

import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as C
import qualified Data.ByteString.Unsafe as U
import Control.Monad
import Data.Word

main :: IO ()
main = do
    cl <- C.getLine
    case C.readInt cl of
      Just (cases,_) -> replicateM_ cases (C.getLine >>= print . similarity)
      Nothing -> return ()

-- Just to keep the condition readable.
(?) :: B.ByteString -> Int -> Word8
(?) = U.unsafeIndex

similarity :: B.ByteString -> Int
similarity bs
    | len == 0  = 0
    | otherwise = go len 1
      where
        !len = B.length bs
        go !acc i
            | i < len   = go (acc + prf 0 i) (i+1)
            | otherwise = acc
        prf !k j
            | j < len && bs ? k == bs ? j   = prf (k+1) (j+1)
            | otherwise = k

并将其与Text一些不良情况下的 OP 版本进行比较。在我的机器上，这比Text版本快四倍多，所以它是否足够快会很有趣（C 版本又快了 4.5 倍，所以很可能不是）。

但是，我认为由于使用具有二次最坏情况行为的朴素算法，超过时间限制的可能性更大。可能有一些测试用例会唤起朴素算法的最坏情况。

所以解决方案是使用一种可以更好地扩展、最佳线性的算法。一种计算字符串相似度的线性算法是Z 算法。

这个想法很简单（但是，像大多数好想法一样，不容易拥有）。让我们将一个（非空）子字符串称为前缀子字符串，它也是字符串的前缀。为了避免重新计算，该算法使用前缀子字符串的窗口，该窗口在当前考虑的索引之前开始，向右延伸最远（最初，该窗口是空的）。

使用的变量和算法的不变量：

i，所考虑的索引，从 1 开始（对于基于 0 的索引；不考虑整个字符串）并递增到length - 1
left和right，前缀子字符串窗口的第一个和最后一个索引；不变量：
1. left < i, left <= right < length(S),left > 0或者right < 1,
2. 如果left > 0, 那么S[left .. right]是 and 的最大公共S前缀S[left .. ],
3. 如果1 <= j < i和S[j .. k]是的前缀S，则k <= right
一个数组Z，不变量： for 1 <= k < i，包含andZ[k]的最长公共前缀的长度。S[k .. ]S

算法：

设置i = 1, left = right = 0（left <= right < 1允许任何值），并Z[j] = 0为所有索引设置1 <= j < length(S)。
如果i == length(S)，停止。
如果i > right，求和l的最长公共前缀的长度S，S[i .. ]将其存储在中Z[i]。如果l > 0我们发现一个窗口比前一个窗口延伸得更远，则设置left = iand right = i+l-1，否则保持不变。递增i并转到 2。
在这里left < i <= right，子串S[i .. right]是已知的——因为它S[left .. right]是的前缀S，所以它等于S[i-left .. right-left]。

S现在考虑从 index 开始的子字符串的最长公共前缀i - left。它的长度是Z[i-left]，因此S[k] = S[i-left + k]对于0 <= k < Z[i-left]和
S[Z[i-left]] ≠ S[i-left+Z[i-left]]。现在，如果Z[i-left] <= right-i，则i + Z[i-left]在已知窗口内，因此
```
S[i + Z[i-left]] = S[i-left + Z[i-left]] ≠ S[Z[i-left]]
S[i + k]         = S[i-left + k]         = S[k]   for 0 <= k < Z[i-left]
```
我们看到和的最长公共前缀的S长度S[i .. ]有 length Z[i-left]。然后设置Z[i] = Z[i-left]，递增i，然后转到 2。

否则，S[i .. right]是的前缀，S我们检查它的扩展范围，开始比较索引处的字符right+1和right+1 - i。设长度为l。设置Z[i] = l, left = i, right = i + l - 1, increment i，然后转到 2。

由于窗口从不向左移动，并且比较总是在窗口结束后开始，因此字符串中的每个字符最多与字符串中较早的字符比较一次成功，并且对于每个起始索引，最多有一个不成功的比较，因此该算法是线性的。

代码（ByteString出于习惯使用，应该可以轻松移植到Text）：

{-# LANGUAGE BangPatterns #-}
module Main (main) where

import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as C
import qualified Data.ByteString.Unsafe as U
import Data.Array.ST
import Data.Array.Base
import Control.Monad.ST
import Control.Monad
import Data.Word

main :: IO ()
main = do
    cl <- C.getLine
    case C.readInt cl of
      Just (cases,_) -> replicateM_ cases (C.getLine >>= print . similarity)
      Nothing -> return ()

-- Just to keep the condition readable.
(?) :: B.ByteString -> Int -> Word8
(?) = U.unsafeIndex

-- Calculate the similarity of a string using the Z-algorithm
similarity :: B.ByteString -> Int
similarity bs
    | len == 0  = 0
    | otherwise = runST getSim
      where
        !len = B.length bs
        getSim = do
            za <- newArray (0,len-1) 0 :: ST s (STUArray s Int Int)
            -- The common prefix of the string with itself is entire string.
            unsafeWrite za 0 len
            let -- Find the length of the common prefix.
                go !k j
                    | j < len && (bs ? j == bs ? k) = go (k+1) (j+1)
                    | otherwise = return k
                -- The window with indices in [left .. right] is the prefix-substring
                -- starting before i that extends farthest.
                loop !left !right i
                    | i >= len  = count 0 0 -- when done, sum
                    | i > right = do
                        -- We're outside the window, simply
                        -- find the length of the common prefix
                        -- and store it in the Z-array.
                        w <- go 0 i
                        unsafeWrite za i w
                        if w > 0
                          -- We got a non-empty common prefix and a new window.
                          then loop i (i+w-1) (i+1)
                          -- No new window, same procedure at next index.
                          else loop left right (i+1)
                    | otherwise = do
                        -- We're inside the window, so the substring starting at
                        -- (i - left) has a common prefix with the substring
                        -- starting at i of length at least (right - i + 1)
                        -- (since the [left .. right] window is a prefix of bs).
                        -- But we already know how long the common prefix
                        -- starting at (i - left) is.
                        z <- unsafeRead za (i-left)
                        let !s = right-i+1 -- length of known prefix starting at i
                        if z < s
                          -- If the common prefix of the substring starting at
                          -- (i - left) is shorter than the rest of the window,
                          -- the common prefix of the substring starting at i
                          -- is the same. Store it and move on with the same window.
                          then do
                              unsafeWrite za i z
                              loop left right (i+1)
                          else do
                              -- Otherwise, find out how far the common prefix
                              -- extends, starting at (right + 1) == s + i.
                              w <- go s (s+i)
                              unsafeWrite za i w
                              loop i (i+w-1) (i+1)
                count !acc i
                    | i == len  = return acc
                    | otherwise = do
                        n <- unsafeRead za i
                        count (acc+n) (i+1)
            loop 0 0 1

string - 面试街字符串相似性挑战的 Haskell 解决方案

1 回答 1

Related

Reference