I've written a daemon in Haskell that scrapes information from a webpage every 5 minutes.

The daemon originally ran fine for about 50 minutes, but then it unexpectedly died with out of memory (requested 1048576 bytes). Every time I ran it, it died after the same amount of time. When I set it to sleep only 30 seconds between iterations, it died after 8 minutes instead.

I realized the code that scrapes the website was incredibly memory inefficient (going from about 30M while sleeping to 250M while parsing 9M of html), so I rewrote it so that it now uses only about 15M extra while parsing. Thinking the problem was fixed, I ran the daemon overnight, and when I woke up it was actually using less memory than it had been the night before. I thought I was done, but roughly 20 hours after it started, it crashed with the same error.

I started looking into GHC profiling but couldn't get it to work. Next I started experimenting with RTS options: I tried -H64m to set the default heap size larger than what my program was using, and -Ksize to shrink the maximum stack size to see if that would make it crash sooner.
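
For reference, the invocation looked roughly like this (-K8m is only an example value; -rtsopts has to be passed at compile time for the runtime to accept the flags):

ghc -O2 -rtsopts bannerstalkerd.hs
./bannerstalkerd +RTS -H64m -K8m -RTS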

Despite every change I've made, the daemon still seems to crash after a constant number of iterations. Making the parsing more memory efficient raised that number, but it still crashes. This doesn't make sense to me because none of these runs has come close to using all of my memory, much less swap space. The heap size is supposed to be unlimited by default, shrinking the stack size made no difference, and all my ulimits are either unlimited or significantly higher than what the daemon uses.

In the original code I pinpointed the crash to somewhere in the html parsing, but I haven't done the same for the more memory-efficient version because 20 hours takes so long to run. I don't know how useful that would be anyway, since no specific part of the program seems broken: it runs successfully for dozens of iterations before crashing.

Out of ideas, I even looked through the GHC source code for this error. It appears to be a failed call to mmap, which wasn't very helpful to me, since I assume that isn't the root of the problem.

(Edit: code rewritten and moved to end of post)

I'm pretty new to Haskell, so I'm hoping this is some quirk of lazy evaluation or something else with a quick fix. Otherwise I'm fresh out of ideas.

I'm using GHC 7.4.2 on FreeBSD 9.1.

Edit:

Replacing the download with static html got rid of the problem, so I've narrowed it down to how I'm using http-conduit. I've edited the code below to include my networking code. The Hackage docs recommend sharing a Manager, so I've done that. They also say that for http you have to explicitly close connections, but I don't think I need to do that for httpLbs.
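
One thing I may try, to rule out state accumulating inside the shared Manager, is closing and recreating it every iteration. A sketch (daemonLoop' is a hypothetical variant of my daemonLoop below; closeManager is from Network.HTTP.Conduit):

daemonLoop' :: IO ()
daemonLoop' = do
    manager <- newManager def
    rows <- scrapeWebpage manager
    putStrLn $ "number of rows parsed: " ++ show (length rows)
    -- Release the manager's pooled connections before sleeping.
    closeManager manager
    doSleep
    daemonLoop'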

Here's my code.

{-# LANGUAGE OverloadedStrings #-}

import Control.Concurrent (threadDelay)
import Data.Default (def)
import qualified Data.ByteString.Lazy as BL
import Text.Regex.PCRE
import Network.HTTP.Conduit

main :: IO ()
main = do
    manager <- newManager def
    daemonLoop manager

daemonLoop :: Manager -> IO ()
daemonLoop manager = do
    rows <- scrapeWebpage manager
    putStrLn $ "number of rows parsed: " ++ (show $ length rows)
    doSleep
    daemonLoop manager

scrapeWebpage :: Manager -> IO [[BL.ByteString]]
scrapeWebpage manager = do
    putStrLn "before makeRequest"
    html <- makeRequest manager
    -- Force evaluation of html.
    putStrLn $ "html length: " ++ (show $ BL.length html)
    putStrLn "after makeRequest"
    -- Breaks ~10M html table into 2d list of bytestrings.
    -- Max memory usage is about 45M, which is about 15M more than when sleeping.
    return $ map tail $ html =~ pattern
    where
        pattern :: BL.ByteString
        pattern = BL.concat $ replicate 12 "<td[^>]*>([^<]+)</td>\\s*"

makeRequest :: Manager -> IO BL.ByteString
makeRequest manager = runResourceT $ do
    defReq <- parseUrl url
    let request = urlEncodedBody params $ defReq
                    -- Don't throw errors for bad statuses.
                    { checkStatus = \_ _ -> Nothing
                    -- 1 minute.
                    , responseTimeout = Just 60000000
                    }
    response <- httpLbs request manager
    return $ responseBody response

-- Sleep 5 minutes between iterations.
doSleep :: IO ()
doSleep = threadDelay (5 * 60 * 1000000)

-- (url :: String and params, the POST parameters passed to
-- urlEncodedBody, are elided from this post.)

and its output:

before makeRequest
html length: 1555212
after makeRequest
number of rows parsed: 3608
...
before makeRequest
html length: 1555212
after makeRequest
bannerstalkerd: out of memory (requested 2097152 bytes)

Getting rid of the regex computation fixed the problem, but the error seems to happen after the networking and during the regex, presumably because of something I'm doing wrong with http-conduit. Any ideas?
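
One untested variation would be to force the page into a single strict ByteString before matching, in case the lazy-ByteString PCRE instance is part of the problem. A sketch (scrapeWebpageStrict is a hypothetical replacement for scrapeWebpage and needs import qualified Data.ByteString as B):

scrapeWebpageStrict :: Manager -> IO [[B.ByteString]]
scrapeWebpageStrict manager = do
    html <- makeRequest manager
    -- Collapse the lazy chunks into one strict ByteString.
    let strictHtml = B.concat (BL.toChunks html)
    return $ map tail $ strictHtml =~ pattern'
    where
        pattern' :: B.ByteString
        pattern' = B.concat $ replicate 12 "<td[^>]*>([^<]+)</td>\\s*"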

Also, when I try to compile with profiling enabled I get this error:

Could not find module `Network.HTTP.Conduit'
Perhaps you haven't installed the profiling libraries for package `http-conduit-1.8.9'?

Indeed, I have not installed profiling libraries for http-conduit and I don't know how.
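
(If I understand cabal correctly, the usual approach is to reinstall the package with profiling enabled, something like the command below; setting library-profiling: True in ~/.cabal/config makes it the default for future installs:)

cabal install --reinstall --enable-library-profiling http-conduit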

2 Answers

So you've found yourself a leak. By tweaking compiler options and memory settings you can only postpone the moment your program crashes; you can't remove the source of the problem, so no matter what you set there, you will still run out of memory eventually.

I suggest you walk carefully through all your non-pure code, primarily the parts that work with resources. Check that all resources are released correctly. Check whether you have accumulating state, such as a growing unbounded channel. And, of course, as nm wisely suggests, profile it.
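
For example, a basic heap profile looks something like this (Main.hs stands in for your source file; -auto-all is the GHC 7.4 spelling, and the run produces a Main.hp file that hp2ps turns into a graph):

ghc -O2 -prof -auto-all -rtsopts Main.hs
./Main +RTS -hc -p -RTS
hp2ps -c Main.hp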

I have a scraper that parses pages without pausing and downloads files, and it does it all concurrently. I've never seen it use more than ~60M of memory. I've been compiling it with GHC 7.4.2, GHC 7.6.1 and GHC 7.6.2 and have had problems with none of them.

It should be noted that the source of your problem may also be in the libraries you're using. In my scraper I use http-conduit, http-conduit-browser, HandsomeSoup and HXT.

answered 2013-02-25T07:44:38

I ended up solving my own problem. It seems to be a GHC bug on FreeBSD. I filed a bug report and switched to Linux, and it has been running flawlessly for the past few days.

answered 2013-03-03T03:23:41