haskell - More efficient way to parse a file of lines of numbers in Haskell

Question

I have a file of about 8 MB in which each line contains 6 integers separated by spaces.

My current method for parsing it is:

tuplify6 :: [a] -> (a, a, a, a, a, a)
tuplify6 [l, m, n, o, p, q] = (l, m, n, o, p, q)

toInts :: String -> (Int, Int, Int, Int, Int, Int)
toInts line =
        tuplify6 $ map read stringNumbers
        where stringNumbers = split " " line

and I map toInts over the lines produced by

liftM lines . readFile

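This returns a list of tuples. However, when I run it, loading and parsing the file takes nearly 25 seconds. Is there any way to speed this up? The file is just plain text.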

score 8 · Accepted Answer

You can speed it up by using ByteStrings, e.g.

module Main (main) where

import System.Environment (getArgs)
import qualified Data.ByteString.Lazy.Char8 as C
import Data.Char

main :: IO ()
main = do
    args <- getArgs
    mapM_ doFile args

doFile :: FilePath -> IO ()
doFile file = do
    bs <- C.readFile file
    let tups = buildTups 0 [] $ C.dropWhile (not . isDigit) bs
    print (length tups)

buildTups :: Int -> [Int] -> C.ByteString -> [(Int,Int,Int,Int,Int,Int)]
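-- acc is built by consing, so it holds the six values in reverse order;
-- reverse it to keep the fields in file order.
buildTups 6 acc bs = tuplify6 (reverse acc) : buildTups 0 [] bs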
buildTups k acc bs
    | C.null bs = if k == 0 then [] else error ("Bad file format " ++ show k)
    | otherwise = case C.readInt bs of
                    Just (i,rm) -> buildTups (k+1) (i:acc) $ C.dropWhile (not . isDigit) rm
                    Nothing -> error ("No Int found: " ++ show (C.take 100 bs))

tuplify6:: [a] -> (a, a, a, a, a, a)
tuplify6 [l, m, n, o, p, q] = (l, m, n, o, p, q)

runs pretty fast:

$ time ./fileParse IntList 
200000

real    0m0.119s
user    0m0.115s
sys     0m0.003s

for an 8.1 MiB file.
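
A line-oriented variant of the same idea, as a minimal sketch rather than the answer's own method. It assumes one record per line and uses only `C.lines`, `C.words`, `C.null` and `C.readInt` from the bytestring package; the names `Row`, `readField`, `parseLine` and `parseFile` are made up for illustration. Unlike the code above, it reports a malformed line as `Nothing` instead of calling `error`, and because separators are not skipped with `isDigit`, `C.readInt` still sees a leading minus sign.

```haskell
import qualified Data.ByteString.Lazy.Char8 as C

-- Hypothetical alias for one record of six integers.
type Row = (Int, Int, Int, Int, Int, Int)

-- Parse one field; succeed only if the whole field is consumed.
readField :: C.ByteString -> Maybe Int
readField w = case C.readInt w of
    Just (n, rest) | C.null rest -> Just n
    _                            -> Nothing

-- Parse one line of exactly six whitespace-separated integers.
parseLine :: C.ByteString -> Maybe Row
parseLine l = case mapM readField (C.words l) of
    Just [a, b, c, d, e, f] -> Just (a, b, c, d, e, f)
    _                       -> Nothing

-- Parse every non-empty line; Nothing if any line is malformed.
parseFile :: C.ByteString -> Maybe [Row]
parseFile = mapM parseLine . filter (not . C.null) . C.lines
```

It could be used as `parseFile <$> C.readFile "IntList"`. The trade-off is that splitting into lines and words allocates more than the single pass above, so it may well be slower; no timing was measured for it here.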

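~~On the other hand, using Strings and your conversion (with a couple of seqs to force evaluation) also took only 0.66 seconds, so most of the time seems to be spent not on parsing but on working with the result.~~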

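Oops, I missed a seq, so the reads were not actually evaluated in the String version. With that fixed, String + read takes about four seconds, slightly more than with the custom Int parser from @Rotsor's comment: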

foldl' (\a c -> 10*a + fromEnum c - fromEnum '0') 0

so parsing apparently does take a significant amount of the time.
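
To make the point about the missing seq concrete, here is a rough sketch of a String-based version that uses the digit fold quoted above and forces every field, so a timing of it would include the parsing work. The names `Row`, `parseInt`, `toRow` and `countRows` are invented for this example, and `parseInt` assumes each field is a non-empty run of ASCII digits (no sign, no validation).

```haskell
import Data.List (foldl')

-- Hypothetical alias for one record of six integers.
type Row = (Int, Int, Int, Int, Int, Int)

-- The digit fold from the comment: no sign handling, no validation.
parseInt :: String -> Int
parseInt = foldl' (\a c -> 10*a + fromEnum c - fromEnum '0') 0

-- Parse one line and force all six fields before returning the tuple,
-- so no unevaluated parseInt thunks are left inside it.
toRow :: String -> Maybe Row
toRow line = case map parseInt (words line) of
    [a, b, c, d, e, f] ->
        a `seq` b `seq` c `seq` d `seq` e `seq` f `seq` Just (a, b, c, d, e, f)
    _ -> Nothing

-- Count the well-formed rows, forcing each one along the way;
-- lines without exactly six fields are skipped.
countRows :: String -> Int
countRows = foldl' step 0 . lines
  where
    step n l = case toRow l of
        Just _  -> n + 1
        Nothing -> n
```

Without the chain of `seq`s, matching on `Just _` would only check that the line has six fields, and the numbers themselves would never be computed, which is the kind of measurement error described above.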
