递归是你的朋友,但我会做一些不同的事情。首先,当我分裂时,我会让我的条件为真,而不是让它为假。其次,我会使用Data.List
被调用的一个方便的函数break
> :t break
break :: (a -> Bool) -> [a] -> ([a], [a])
> break (== ' ') "This is a test"
("This", " is a test")
我会用它来定义我的函数
splitWith' :: (a -> Bool) -> [a] -> [[a]]
splitWith' cond [] = []
splitWith' cond xs = first : splitWith' cond (safeTail rest)
where
(first, rest) = break cond xs
-- Need this function to handle an empty list
safeTail [] = []
safeTail (_:ys) = ys
或者,如果你想写得尽可能混乱
splitWith'' :: (a -> Bool) -> [a] -> [[a]]
splitWith'' _ [] = []
splitWith'' cond xs = uncurry (:) $ fmap (splitWith'' cond . safeTail) $ break cond xs
where
safeTail [] = []
safeTail (_:ys) = ys
这是有效的,因为fmap
超过 2 元组将该函数应用于元组的第二个元素。然后它 uncurries:
并将其应用于第一个和休息。
更新
如果您希望它在谓词为假时拆分,您可以使用span
而不是break
,或者将其定义为
splitWithWeird cond xs = splitWith' (not . cond) xs
虽然第二个显然会产生稍小的开销(除非编译器可以优化它)
更新 2
如果你想处理重复的字符,如果它适合你的需要,有一个简单、快速的修复:
> filter (not . null) $ splitWithWeird (/= ' ') "This is a test"
["This","is","a","test"]
通过如此简单的修复,我们可能会想将其构建到算法本身中:
splitWithWeird :: (a -> Bool) -> [a] -> [[a]]
splitWithWeird cond [] = []
splitWithWeird cond xs = filter (not . null) $ first : splitWithWeird cond (safeTail rest)
where
(first, rest) = span cond xs
safeTail [] = []
safeTail (_:ys) = ys
但这将是一个坏主意。由于这是一个递归函数,因此您将filter (not . null)
在每个级别添加对的调用,因此在函数中的每个拆分位置。所有这些都必须在返回之前扫描整个列表,因此必须执行额外的检查。最好将它定义为一个单独的函数,以便filter (not . null)
只调用一次:
splitWithWeird' :: (a -> Bool) -> [a] -> [[a]]
splitWithWeird' cond xs = filter (not . null) $ splitWithWeird cond xs
或者,如果您希望将其内置到算法中:
splitWithWeird :: (a -> Bool) -> [a] -> [[a]]
splitWithWeird cond xs = filter (not . null) $ splitWithHelper cond xs
where
safeTail [] = []
safeTail (_:ys) = ys
splitWithHelper cond [] = []
splitWithHelper cond xs =
let (first, rest) = span cond xs
in first : splitWithHelper cond (safeTail rest)
这实际上只是在内部做与定义两个函数相同的事情。请注意,我必须在let ... in ...
此处使用附加语句(我不喜欢嵌套 wheres),因为(first, rest) = span cond xs
属于splitWithHelper
,而不是splitWithWeird
。如果将其留在 where 子句中,算法将不起作用。
更新 3
不想在这里只留下一个不理想的解决方案,我已经继续编写了一个算法,用于在子序列上而不是在条件或元素上进行拆分。它确实使用了first
from 函数Control.Arrow
,但只是为了使代码更加紧凑。
import Control.Arrow (first)
isPrefixOf :: Eq a => [a] -> [a] -> Bool
isPrefixOf [] _ = True
isPrefixOf _ [] = False
isPrefixOf (x:xs) (y:ys) = x == y && isPrefixOf xs ys
splitSubseq :: Eq a => [a] -> [a] -> [[a]]
splitSubseq sub [] = []
splitSubseq sub xs = initial : splitSubseq sub rest
where
lsub = length sub
splitter [] = ([], [])
splitter yss@(y:ys)
| isPrefixOf sub yss = ([], drop lsub yss)
| otherwise = first (y :) $ splitter ys
(initial, rest) = splitter xs
我并不是说这是一个有效的解决方案,但它应该很容易遵循。首先,我定义了一个函数isPrefixOf
,如果第二个列表以第一个列表开头,则返回 True。
我想保持相同的递归模式(first : recursive rest
),所以我写splitter
来代替span
or break
,这就是进来的地方。如果子序列是列表的前缀,isPrefixOf
则返回([], restAfterSubsequence)
列表,然后从下一个元素开始重复此操作。我在first
这里的使用只是为了让我可以递归而简洁地编写这个函数。它只适用(y :)
于返回值的第一个元素splitter
。从返回的元组的第二个元素splitter
只是尚未消耗的输入的其余部分。
如果您有兴趣,这里是该算法的性能统计数据(使用--make -O2
, i5 quad 编译):
main = print $ sum $ take (10 ^ 7) $ map length $ splitSubseq " " $ cycle "Testing "
70000000
6,840,052,808 bytes allocated in the heap
2,032,868 bytes copied during GC
42,900 bytes maximum residency (2 sample(s))
22,636 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 13114 colls, 0 par 0.06s 0.07s 0.0000s 0.0001s
Gen 1 2 colls, 0 par 0.00s 0.00s 0.0002s 0.0004s
TASKS: 3 (1 bound, 2 peak workers (2 total), using -N1)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.00s ( 0.00s elapsed)
MUT time 3.68s ( 3.74s elapsed)
GC time 0.06s ( 0.07s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 3.74s ( 3.81s elapsed)
然后去嵌入总和和长度:
main = print $ sum $ take (10 ^ 7) $ map length $ repeat "Testing"
70000000
240,052,572 bytes allocated in the heap
12,812 bytes copied during GC
42,900 bytes maximum residency (2 sample(s))
22,636 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 458 colls, 0 par 0.00s 0.00s 0.0000s 0.0000s
Gen 1 2 colls, 0 par 0.00s 0.00s 0.0001s 0.0001s
TASKS: 3 (1 bound, 2 peak workers (2 total), using -N1)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.00s ( 0.00s elapsed)
MUT time 0.09s ( 0.09s elapsed)
GC time 0.00s ( 0.00s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 0.11s ( 0.09s elapsed)
所以我们可以看到,这只花费了大约 0.1 秒的时间,留给我们大约 3.64 秒的时间让这个算法分割一个包含"Testing "
重复 1000 万次的字符串,所有这些都只使用了少量的内存。-threaded
唯一的缺点是,当使用更多内核编译和运行时,该算法实际上会显着减慢。