haskell - Parsing a string in Haskell

Question

I have some strings that I want to parse into a list of "chunks". My strings look like this:

"some text [[anchor]] some more text, [[another anchor]]. An isolated ["

I would like to get something like this:

[
   TextChunk "some text ",
   Anchor "anchor",
   TextChunk " some more text, ",
   Anchor "another anchor",
   TextChunk ". An isolated ["
]

I have managed to write a function and some types that do what I need, but they look overly ugly. Is there a better way to do this?

data Token = TextChunk String | Anchor String deriving (Show)
data TokenizerMode = EatString | EatAnchor deriving (Show)

tokenize::[String] -> [Token]
tokenize xs =  
  let (_,_,tokens) = tokenize' (EatString, unlines xs, [TextChunk ""])
  in reverse tokens

tokenize' :: (TokenizerMode, String, [Token]) -> (TokenizerMode, String,[Token])
-- If we're starting an anchor, add a new anchor and switch modes
tokenize' (EatString, '[':'[':xs, tokens) = tokenize' (EatAnchor, xs, (Anchor ""):tokens )
-- If we're ending an anchor, add a new text chunk and switch modes
tokenize' (EatAnchor, ']':']':xs, tokens) = tokenize' (EatString, xs, (TextChunk ""):tokens )
-- Otherwise if we've got stuff to consume append it
tokenize' (EatString, x:xs, (TextChunk t):tokens) = tokenize'( EatString, xs, (TextChunk (t++[x])):tokens)
tokenize' (EatAnchor, x:xs, (Anchor t):tokens) = tokenize'( EatAnchor, xs, (Anchor (t++[x])):tokens)
--If we've got nothing more to consume we're done.
tokenize' (EatString, [], tokens) = ( EatString, [], tokens)
--We'll only get here if we're given an invalid string
tokenize' xx = error ("Error parsing .. so far " ++ (show xx))

score 11 · Accepted Answer

This should work, including lone brackets:

import Control.Applicative ((<$>), (<*), (*>))
import Text.Parsec

data Text = TextChunk String
          | Anchor String
          deriving Show

chunkChar = noneOf "[" <|> try (char '[' <* notFollowedBy (char '[')) 
chunk     = TextChunk <$> many1 chunkChar
anchor    = Anchor <$> (string "[[" *> many (noneOf "]") <* string "]]")
content   = many (chunk <|> anchor)

parseS :: String -> Either ParseError [Text]
parseS input = parse content "" input

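Note the use of try in chunkChar, which allows backtracking when the parser meets two opening brackets. Without try, the first bracket would already have been consumed at that point, so the parser could not go back and try anchor.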
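To make the effect of try concrete, here is a minimal sketch that builds the same content parser twice, once with try and once without. The names contentWith, withTry and withoutTry are illustrative helpers introduced here; they are not part of the answer or of Parsec.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

data Text = TextChunk String
          | Anchor String
          deriving (Show, Eq)

anchor :: Parser Text
anchor = Anchor <$> (string "[[" *> many (noneOf "]") <* string "]]")

-- Hypothetical helper: build the content parser from a given
-- chunk-character parser, so the two variants can be compared.
contentWith :: Parser Char -> Parser [Text]
contentWith c = many ((TextChunk <$> many1 c) <|> anchor)

-- With try: a failed "single bracket" attempt consumes no input.
withTry :: Parser Char
withTry = noneOf "[" <|> try (char '[' <* notFollowedBy (char '['))

-- Without try: the first '[' stays consumed when the attempt fails,
-- so the anchor alternative is never tried.
withoutTry :: Parser Char
withoutTry = noneOf "[" <|> (char '[' <* notFollowedBy (char '['))
```

With these definitions, parse (contentWith withTry) "" "a[[b]]" should yield Right [TextChunk "a",Anchor "b"], while parse (contentWith withoutTry) "" "a[[b]]" should yield a Left parse error, because the failed alternative has already consumed the first bracket.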

score 4 · Accepted Answer

Here is a simple version that uses two mutually recursive functions.

module Tokens where

data Token = TextChunk String | Anchor String deriving (Show)

tokenize :: String -> [Token]
tokenize = textChunk emptyAcc


textChunk :: Acc -> String -> [Token]
textChunk acc []           = [TextChunk $ getAcc acc]
textChunk acc ('[':'[':ss) = TextChunk (getAcc acc) : anchor emptyAcc ss 
textChunk acc (s:ss)       = textChunk (snocAcc acc s) ss

anchor :: Acc -> String -> [Token]
anchor acc []              = error $ "Anchor not terminated" 
anchor acc (']':']':ss)    = Anchor (getAcc acc) : textChunk emptyAcc ss
anchor acc (s:ss)          = anchor (snocAcc acc s) ss


-- This is a Hughes list (also called DList) which allows 
-- efficient 'Snoc' (adding to the right end).
--
type Acc = String -> String

emptyAcc :: Acc
emptyAcc = id

snocAcc :: Acc -> Char -> Acc
snocAcc acc c = acc . (c:)

getAcc :: Acc -> String
getAcc acc = acc []

This version has one problem: if the input starts or ends with an Anchor, or if the text contains two consecutive anchors, it will generate empty TextChunks.

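Adding a check so that no TextChunk is generated when the accumulator is empty is straightforward, but it makes the code about twice as long - maybe I would reach for Parsec after all...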
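As a rough sketch of that check, one option is to route every TextChunk through a single helper that drops empty text. The helper name flushText is an assumption introduced here, and the accumulator functions from the answer are inlined as id and function composition to keep the sketch short.

```haskell
data Token = TextChunk String | Anchor String deriving (Show, Eq)

-- Hughes list accumulator, as in the answer above.
type Acc = String -> String

-- Hypothetical helper: emit a TextChunk only if some text was accumulated.
flushText :: Acc -> [Token] -> [Token]
flushText acc rest = case acc [] of
  "" -> rest
  s  -> TextChunk s : rest

tokenize :: String -> [Token]
tokenize = textChunk id

textChunk :: Acc -> String -> [Token]
textChunk acc []           = flushText acc []
textChunk acc ('[':'[':ss) = flushText acc (anchor id ss)
textChunk acc (s:ss)       = textChunk (acc . (s:)) ss

anchor :: Acc -> String -> [Token]
anchor _   []           = error "Anchor not terminated"
anchor acc (']':']':ss) = Anchor (acc []) : textChunk id ss
anchor acc (s:ss)       = anchor (acc . (s:)) ss
```

With this variant, tokenize "[[a]][[b]]" produces only the two Anchor tokens, with no empty TextChunk before, between, or after them. An unterminated anchor still raises an error, as in the original.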

score 1 · Accepted Answer

A solution using monadic Parsec.

import Text.ParserCombinators.Parsec

data Text = TextChunk String
          | Anchor String
          deriving Show

inputString = "some text [[anchor]] some more text, [[another anchor]]."  

content :: GenParser Char st [Text]
content = do
    s1 <- many (noneOf "[")
    string "[["
    s2 <- many (noneOf "]")
    string "]]"
    s3 <- many (noneOf "[")
    string "[["
    s4 <- many (noneOf "]")
    string "]]."
    return $ [TextChunk s1, Anchor s2, TextChunk s3, Anchor s4]


parseS :: String -> Either ParseError [Text]
parseS input = parse content "" input

How it works:

> parseS inputString 
Right [TextChunk "some text ",Anchor "anchor",TextChunk " some more text, ",Anchor "another anchor"]
it :: Either ParseError [Text]
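The content parser above is written for exactly two anchors followed by a period. A possible generalization in the same monadic style, accepting any number of text chunks and anchors, might look like the sketch below. The names anchorP, textP and contentMany are illustrative assumptions, and unlike the question's requirement this sketch does not accept an isolated "[".

```haskell
import Text.ParserCombinators.Parsec

data Text = TextChunk String
          | Anchor String
          deriving (Show, Eq)

-- An anchor: "[[", any characters except ']', then "]]".
anchorP :: GenParser Char st Text
anchorP = do
    _ <- try (string "[[")
    s <- many (noneOf "]")
    _ <- string "]]"
    return (Anchor s)

-- A non-empty run of characters that are not '['.
textP :: GenParser Char st Text
textP = do
    s <- many1 (noneOf "[")
    return (TextChunk s)

-- Any sequence of anchors and text chunks, up to the end of input.
contentMany :: GenParser Char st [Text]
contentMany = do
    items <- many (anchorP <|> textP)
    eof
    return items
```

Running parse contentMany "" on the inputString from the answer should return the same four tokens followed by TextChunk "." for the trailing period.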
