17

我一直在探索Stack Overflow 数据转储,到目前为止,我一直在利用友好的 XML 和正则表达式“解析”。我尝试使用各种 Haskell XML 库来查找特定用户按文档顺序发布的第一篇文章都遇到了令人讨厌的颠簸。

标签汤

import Control.Monad
import Text.HTML.TagSoup

userid = "83805"

main = do
  posts <- liftM parseTags (readFile "posts.xml")
  print $ head $ map (fromAttrib "Id") $
                 filter (~== ("<row OwnerUserId=" ++ userid ++ ">"))
                 posts

hxt

import Text.XML.HXT.Arrow
import Text.XML.HXT.XPath

userid = "83805"

main = do
  runX $ readDoc "posts.xml" >>> posts >>> arr head
  where
    readDoc = readDocument [ (a_tagsoup, v_1)
                           , (a_parse_xml, v_1)
                           , (a_remove_whitespace, v_1)
                           , (a_issue_warnings, v_0)
                           , (a_trace, v_1)
                           ]

posts :: ArrowXml a => a XmlTree String
posts = getXPathTrees byUserId >>>
        getAttrValue "Id"
  where byUserId = "/posts/row/@OwnerUserId='" ++ userid ++ "'"

xml

import Control.Monad
import Control.Monad.Error
import Control.Monad.Trans.Maybe
import Data.Either
import Data.Maybe
import Text.XML.Light

userid = "83805"

main = do
  [posts,votes] <- forM ["posts", "votes"] $
    liftM parseXML . readFile . (++ ".xml")
  let ps = elemNamed "posts" posts
  putStrLn $ maybe "<not present>" show
           $ filterElement (byUser userid) ps

elemNamed :: String -> [Content] -> Element
elemNamed name = head . filter ((==name).qName.elName) . onlyElems

byUser :: String -> Element -> Bool
byUser id e = maybe False (==id) (findAttr creator e)
  where creator = QName "OwnerUserId" Nothing Nothing

我哪里做错了?使用 Haskell 处理大量 XML 文档的正确方法是什么?

4

6 回答 6

17

I notice you're doing String IO in all these cases. You absolutely must use either Data.Text or Data.Bytestring(.Lazy) if you hope to process large volumes of text efficiently, as String == [Char], which is an inappropriate representation for very large flat files.

That then implies you'll need to use a Haskell XML library that supports bytestrings. The couple-of-dozen xml libraries are here: http://hackage.haskell.org/packages/archive/pkg-list.html#cat:xml

I'm not sure which support bytestrings, but that's the condition you're looking for.

于 2010-02-18T23:36:39.857 回答
9

下面是一个使用hexpat的例子:

{-# LANGUAGE PatternGuards #-}

module Main where

import Text.XML.Expat.SAX

import qualified Data.ByteString.Lazy as B

userid = "83805"

main :: IO ()
main = B.readFile "posts.xml" >>= print . earliest
  where earliest :: B.ByteString -> SAXEvent String String
        earliest = head . filter (ownedBy userid) . parse opts
        opts = ParserOptions Nothing Nothing

ownedBy :: String -> SAXEvent String String -> Bool
ownedBy uid (StartElement "row" as)
  | Just ouid <- lookup "OwnerUserId" as = ouid == uid
  | otherwise = False
ownedBy _ _ = False

的定义ownedBy有点笨拙。也许是一个视图模式:

{-# LANGUAGE ViewPatterns #-}

module Main where

import Text.XML.Expat.SAX

import qualified Data.ByteString.Lazy as B

userid = "83805"

main :: IO ()
main = B.readFile "posts.xml" >>= print . earliest
  where earliest :: B.ByteString -> SAXEvent String String
        earliest = head . filter (ownedBy userid) . parse opts
        opts = ParserOptions Nothing Nothing

ownedBy :: String -> SAXEvent String String -> Bool
ownedBy uid (ownerUserId -> Just ouid) = uid == ouid
ownedBy _ _ = False

ownerUserId :: SAXEvent String String -> Maybe String
ownerUserId (StartElement "row" as) = lookup "OwnerUserId" as
ownerUserId _ = Nothing
于 2010-02-22T20:02:33.627 回答
8

你可以试试我的fast-tagsoup库。它是 tagoup 的简单替代品,并以20-200MB /秒的速度解析。

tagsoup 包的问题在于,即使您使用 Text 或 ByteString 接口,它也可以在内部使用 String 。fast-tagsoup 使用高性能低级解析与严格的 ByteStrings 一起工作,同时仍然返回惰性标签列表作为输出。

于 2012-10-18T12:05:16.893 回答
3

I had a similar problem (using HXT) - I avoided the memory issue by using the Expat parser with HXT. On a 5MB XML file, just reading the document and printing it: peak memory consumption went from 2Gigs to about 180MB, and the execution time was much shorter (didn't measure).

于 2012-10-18T05:05:13.437 回答
3

TagSoup 通过其类 Text.StringLike 支持 ByteString。您的示例所需的唯一更改是调用 ByteString.Lazy's readFile,并将 a 添加fromStringfromAttrib

import Text.StringLike
import qualified Data.ByteString.Lazy as BSL
import qualified Data.ByteString.Char8 as BSC

userid = "83805"
file = "blah//posts.xml"
main = do
posts <- liftM parseTags (BSL.readFile file)
print $ head $ map (fromAttrib (fromString "Id")) $
               filter (~== ("<row OwnerUserId=" ++ userid ++ ">"))
               posts  

您的示例为我运行(4 gig RAM),需要 6 分钟;ByteString 版本花了 10 分钟。

于 2010-02-28T17:17:39.963 回答
1

也许您需要一个惰性 XML 解析器:您的使用看起来像是对输入的非常简单的扫描。HaXml 有一个惰性解析器,尽管您必须通过导入正确的模块来明确要求它。

于 2010-02-19T11:55:22.790 回答