parsing - Performing actions on parsed data with attoparsec

Question

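Background

I have written a log file parser using attoparsec. All of my smaller parsers succeed, as does the final parser composed from them; I have confirmed this with tests. But I am having trouble performing operations on the parsed stream.

What I have tried

I started by trying to pass the successfully parsed input to a function. But all I seem to get is Done (), which I assume means the log file has already been consumed at that point.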

prepareStats :: Result Log -> IO ()
prepareStats r =
  case r of
    Fail _ _ _       -> putStrLn $ "Parsing failed"
    Done _ parsedLog -> putStrLn "Success" -- This now has a [LogEntry] array. Do something with it.

main :: IO ()
main = do
  [f] <- getArgs
  logFile <- B.readFile (f :: FilePath)
  let results = parseOnly parseLog logFile
  putStrLn "TBC"

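What I am trying to do

I want to accumulate some statistics from the log file as the input is consumed. For example, I am parsing response codes, and I would like to count how many 2** responses and how many 4**/5** responses there were. I am parsing the number of bytes of each response as Ints, and I would like to sum these efficiently (sounds like a foldl'?). I have defined a data type like this: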

data Stats = Stats {
    successfulRequestsPerMinute :: Int
  , failingRequestsPerMinute    :: Int
  , meanResponseTime            :: Int
  , megabytesPerMinute          :: Int
  } deriving Show

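I would like to update this continually as I parse the input. But performing operations as I consume the input is the part where I am stuck. So far, print is the only function I have successfully passed the output to, and it shows that parsing succeeded by returning Done before printing the output.

My main parsers look like this: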

parseLogEntry :: Parser LogEntry
parseLogEntry = do
  ip <- logItem
  _ <- char ' '
  logName <- logItem
  _ <- char ' '
  user <- logItem
  _ <- char ' '
  time <- datetimeLogItem
  _ <- char ' '
  firstLogLine <- quotedLogItem
  _ <- char ' '
  finalRequestStatus <- intLogItem
  _ <- char ' '
  responseSizeB <- intLogItem
  _ <- char ' '
  timeToResponse <- intLogItem
  return $ LogEntry ip logName user time firstLogLine finalRequestStatus responseSizeB timeToResponse

type Log = [LogEntry]

parseLog :: Parser Log
parseLog = many $ parseLogEntry <* endOfLine

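Desired outcome

I want to pass each parsed line to a function that updates the data type above. Ideally, I want this to be very memory efficient, because it will run on large files.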

score 3 · Accepted Answer

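You have to make your unit of parsing a single log entry rather than a list of log entries.

It's not pretty, but here is an example of how to interleave parsing and processing:

(Depends on bytestring, attoparsec and mtl.)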

{-# LANGUAGE NoMonomorphismRestriction, FlexibleContexts #-}

import qualified Data.ByteString.Char8 as BS
import qualified Data.Attoparsec.ByteString.Char8 as A
import Data.Attoparsec.ByteString.Char8 hiding (takeWhile)
import Data.Char
import Control.Monad.State.Strict

aWord :: Parser BS.ByteString
aWord = skipSpace >> A.takeWhile isAlphaNum

getNext :: MonadState [a] m => m (Maybe a)
getNext = do
  xs <- get
  case xs of
    [] -> return Nothing
    (y:ys) -> put ys >> return (Just y)

loop iresult =
  case iresult of
    Fail _ _ msg  -> error $ "parse failed: " ++ msg
    Done x' aword -> do lift $ process aword; loop (parse aWord x')
    Partial _     -> do
      mx <- getNext
      case mx of
        Just y  -> loop (feed iresult y)
        Nothing -> case feed iresult BS.empty of
                     Fail _ _ msg  -> error $ "parse failed: " ++ msg
                     Done x' aword -> do lift $ process aword; return ()
                     Partial _     -> error $ "partial returned"  -- probably can't happen

process :: Show a => a -> IO ()
process w = putStrLn $ "got a word: " ++ show w

theWords = map BS.pack [ "this is a te", "st of the emergency ", "broadcasting sys", "tem"]


main = runStateT (loop (Partial (parse aWord))) theWords

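Notes:

We parse one aWord at a time and call process after each word is recognized.
Use feed to give the parser more input when it returns a Partial.
When there is no more input, feed the parser an empty string.
When Done is returned, process the recognized word and continue with parse aWord.
getNext is just an example of a monadic function that gets the next unit of input. Replace it with your own version, e.g. something that reads the next line from a file.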

Update

Here is a solution using parseWith, as @dfeuer suggested (fromMaybe additionally needs import Data.Maybe):

noMoreInput = fmap null get

loop2 x = do
  iresult <- parseWith (fmap (fromMaybe BS.empty) getNext) aWord x
  case iresult of
    Fail _ _ msg  -> error $ "parse failed: " ++ msg
    Done x' aword -> do lift $ process aword;
                        if BS.null x'
                           then do b <- noMoreInput
                                   if b then return ()
                                        else loop2 x'
                           else loop2 x'
    Partial _     -> error $ "huh???" -- this really can't happen

main2 = runStateT (loop2 BS.empty) theWords
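
The notes above suggest replacing getNext with something that reads from a file. The following is a minimal sketch of that idea, not the answerer's code: foldParsed and sumNumbers are hypothetical names, the 64 KiB chunk size is arbitrary, and the parser is assumed to consume at least one byte each time it succeeds (otherwise the loop would not terminate).

```haskell
import qualified Data.ByteString.Char8 as BS
import Data.Attoparsec.ByteString.Char8
  (Parser, IResult(..), parseWith, decimal, endOfLine)
import System.IO (Handle, IOMode(ReadMode), hIsEOF, withFile)

-- Hypothetical helper: run parser p repeatedly over the contents of h,
-- folding every parsed value into a strictly evaluated accumulator.
foldParsed :: Handle -> Parser a -> (s -> a -> s) -> s -> IO (Either String s)
foldParsed h p step = go BS.empty
  where
    -- parseWith calls this whenever the parser asks for more input;
    -- hGetSome returns an empty string at end of file, which tells
    -- attoparsec that the input is exhausted.
    refill = BS.hGetSome h 65536

    go leftover acc = do
      -- Stop once nothing is left over and the handle is at EOF.
      finished <- if BS.null leftover then hIsEOF h else return False
      if finished
        then return (Right acc)
        else do
          r <- parseWith refill p leftover
          case r of
            Fail _ _ msg -> return (Left msg)
            Partial _    -> return (Left "unexpected Partial")
            Done rest x  -> let acc' = step acc x
                            in acc' `seq` go rest acc'

-- Usage example: sum a file containing one decimal number per line.
sumNumbers :: FilePath -> IO (Either String Int)
sumNumbers path =
  withFile path ReadMode $ \h -> foldParsed h (decimal <* endOfLine) (+) 0
```

Only the unconsumed remainder of the current chunk and the accumulator are kept between iterations, so memory use should not grow with the size of the file.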

score 1 · Accepted Answer

If each log entry is exactly one line, here is a simpler solution:

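import qualified Data.ByteString.Char8 as BS
import Data.Attoparsec.ByteString.Char8 (parseOnly)
import Data.List (foldl')

main = do
  loglines <- fmap BS.lines $ BS.readFile "input-file.log"
  -- foldl' is pure, so its result has to be printed (or returned) explicitly
  print $ foldl' go initialStats loglines
  where
    go stats logline =
      case parseOnly yourParser logline of
        Left e  -> error $ "oops: " ++ e
        Right r -> let stats' = ... combine r with stats ...
                   in stats'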

Basically, you just read the file line by line, call parseOnly on each line, and accumulate the results.
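
As an illustration of what the "combine r with stats" step could look like, here is a small sketch that uses only base. Entry, Totals, step and summarize are hypothetical names; Entry stands in for the asker's LogEntry, and the totals are simple counts rather than the per-minute fields of the Stats type in the question.

```haskell
import Data.List (foldl')

-- Hypothetical, trimmed-down stand-in for a parsed LogEntry.
data Entry = Entry
  { entryStatus :: Int  -- HTTP status code
  , entryBytes  :: Int  -- response size in bytes
  } deriving Show

-- Running totals. The strict fields (!) stop unevaluated thunks from
-- piling up inside the record while foldl' walks the input.
data Totals = Totals
  { okCount    :: !Int  -- 2xx responses
  , failCount  :: !Int  -- 4xx and 5xx responses
  , totalBytes :: !Int  -- bytes across all responses
  } deriving (Show, Eq)

emptyTotals :: Totals
emptyTotals = Totals 0 0 0

-- Fold one entry into the running totals.
step :: Totals -> Entry -> Totals
step (Totals ok bad bytes) (Entry status size)
  | status >= 200 && status < 300 = Totals (ok + 1) bad (bytes + size)
  | status >= 400 && status < 600 = Totals ok (bad + 1) (bytes + size)
  | otherwise                     = Totals ok bad (bytes + size)

summarize :: [Entry] -> Totals
summarize = foldl' step emptyTotals
```

A function shaped like step can serve as the body of go once the parsed value has been extracted from the Right case.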

score 1 · Accepted Answer

This is properly done with a streaming library:

main = do
  f:_ <- getArgs
  withFile f ReadMode $ \h -> do
       result <- foldStream $ streamProcess $ streamHandle h
       print result
  where
    streamHandle  = undefined
    streamProcess = undefined
    foldStream    = undefined

Any streaming library can fill in the blanks, e.g.

 import qualified Pipes.Prelude as P
 import Pipes
 import qualified Pipes.ByteString as PB
 import Pipes.Group (folds)
 import qualified Control.Foldl as L
 import Control.Lens (view) -- or import Lens.Simple (view), or whatever
 import qualified Data.ByteString as B
 import Data.ByteString (ByteString)

 streamHandle = PB.fromHandle :: Handle -> Producer ByteString IO ()

In that case we might then divide the labor further:

 streamProcess :: Monad m => Producer ByteString m r -> Producer LogEntry m r
 streamProcess p =  streamLines p >-> lineParser

 -- regroup the byte stream into one strict ByteString per line
 streamLines :: Monad m => Producer ByteString m r -> Producer ByteString m r
 streamLines p = L.purely folds L.list (view PB.lines p) >-> P.map B.concat

 lineParser :: Monad m => Pipe ByteString LogEntry m r
 lineParser = P.map (parseOnly line_parser) >-> P.concat -- concat removes lefts

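(This is slightly laborious because pipes is careful about accumulating lines, and about memory in general: we are just trying to get a producer of individual strict bytestring lines, then convert that into a producer of parsed lines, and then throw out the bad parses, if there are any. With io-streams or conduit, things would be basically the same, and that particular step would be easier.)

We are now in a position to fold over our Producer LogEntry IO (). This can be done explicitly using Pipes.Prelude.fold, which makes a strict left fold. Here we will just copy the structure from user5402's answer: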

 foldStream str = P.fold go initial_stats id str
  where
   go stats_till_now new_entry = undefined

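If you get used to the foldl library and to applying a fold to a Producer with L.purely P.fold some_fold, then you can build Control.Foldl.Folds for your LogEntries out of components and slot in different requests as you please.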
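
To make the idea of building a Fold out of components concrete, here is a small sketch using the foldl package. Entry is a hypothetical stand-in for LogEntry, and countIf, successes, failures, byteTotal, summary and summarizeList are illustrative names, not part of any library.

```haskell
import qualified Control.Foldl as L

-- Hypothetical stand-in for the asker's LogEntry.
data Entry = Entry
  { entryStatus :: Int  -- HTTP status code
  , entryBytes  :: Int  -- response size in bytes
  }

-- Count the inputs that satisfy a predicate.
countIf :: (a -> Bool) -> L.Fold a Int
countIf ok = L.premap (\x -> if ok x then 1 else 0) L.sum

successes, failures, byteTotal :: L.Fold Entry Int
successes = countIf (\e -> entryStatus e >= 200 && entryStatus e < 300)
failures  = countIf (\e -> entryStatus e >= 400 && entryStatus e < 600)
byteTotal = L.premap entryBytes L.sum

-- Combining folds applicatively yields one fold that computes
-- all three results in a single pass over the input.
summary :: L.Fold Entry (Int, Int, Int)
summary = (,,) <$> successes <*> failures <*> byteTotal

-- Running the combined fold over a plain list.
summarizeList :: [Entry] -> (Int, Int, Int)
summarizeList = L.fold summary
```

The same summary value could be run over a pipes Producer with L.purely P.fold summary, which is the form mentioned above.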

If you use pipes-attoparsec and include the newline bit in your parser, then you can just write

 handleToLogEntries :: Handle -> Producer LogEntry IO ()
 handleToLogEntries h = void $ parsed my_line_parser (PB.fromHandle h) >-> P.concat

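and get the Producer LogEntry IO () more directly. (This ultra-simple way of writing it will, however, stop at the first parse error; dividing by lines first will also be faster than using attoparsec to recognize newlines.) This is very simple with io-streams too; you would write something like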

import qualified System.IO.Streams as Streams

io :: Handle -> IO ()
io h = do  
    bytes <- Streams.handleToInputStream h
    log_entries <- Streams.parserToInputStream my_line_parser bytes
    fold_result <- Streams.fold go initial_stats log_entries
    print fold_result

or, to keep to the structure above:

 where
  streamHandle = Streams.handleToInputStream
  streamProcess io_bytes =
      io_bytes >>= Streams.parserToInputStream my_line_parser
  foldStream io_logentries =
      io_logentries >>= Streams.fold go initial_stats

Either way, my_line_parser should return a Maybe LogEntry and should recognize the newline.
