I've written a daemon in Haskell that scrapes information from a webpage every 5 minutes.
The daemon originally ran fine for about 50 minutes, but then it unexpectedly died with out of memory (requested 1048576 bytes). Every time I ran it, it died after the same amount of time. When I set it to sleep only 30 seconds between iterations, it instead died after 8 minutes.
I realized the code to scrape the website was incredibly memory inefficient (going from about 30M while sleeping to 250M while parsing 9M of HTML), so I rewrote it so that it now only uses about 15M extra while parsing. Thinking the problem was fixed, I ran the daemon overnight, and when I woke up it was actually using less memory than it had been the night before. I thought I was done, but roughly 20 hours after it had started, it crashed with the same error.
I started looking into GHC profiling but I wasn't able to get it to work. Next I started messing with RTS options: I tried -H64m to make the suggested heap size larger than what my program was using, and I also tried -Ksize to shrink the maximum stack size, to see whether that would make it crash sooner.
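For reference, this is roughly how I'm passing the RTS options (the -K value here is just an example):

ghc -O2 -rtsopts bannerstalkerd.hs
./bannerstalkerd +RTS -H64m -K8m -RTS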
Despite every change I've made, the daemon still seems to crash after a constant number of iterations. Making the parsing more memory efficient raised that number, but it still crashes. This doesn't make sense to me because none of these runs has come anywhere close to using all of my memory, much less swap space. The heap size is supposed to be unlimited by default, shrinking the stack size didn't make a difference, and all my ulimits are either unlimited or significantly higher than what the daemon is using.
In the original code I pinpointed the crash to somewhere in the HTML parsing, but I haven't done the same for the more memory-efficient version because a 20-hour run takes so long. I don't know whether that would even be useful to know, because it doesn't seem like any specific part of the program is broken: it runs successfully for dozens of iterations before crashing.
Out of ideas, I even looked through the GHC source code for this error, and it appears to be a failed call to mmap. That wasn't very helpful to me, because I assume it isn't the root of the problem.
(Edit: code rewritten and moved to end of post)
I'm pretty new at Haskell, so I'm hoping this is some quirk of lazy evaluation or something else that has a quick fix. Otherwise, I'm fresh out of ideas.
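If it is a laziness quirk, the only extra forcing I know how to add is to walk every cell of the parse result instead of just taking the length of the outer list (using the names from the code at the end of this post), something like:

-- Force every cell of every row, not just the spine of the outer list.
let totalBytes = sum [BL.length cell | row <- rows, cell <- row]
putStrLn $ "total cell bytes: " ++ show totalBytes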
I'm using GHC version 7.4.2 on FreeBSD 9.1.
Edit:
Replacing the downloading with static HTML got rid of the problem, so I've narrowed it down to how I'm using http-conduit. I've edited the code below to include my networking code. The Hackage docs recommend sharing a Manager, so I've done that. They also say that with http you have to explicitly close connections, but I don't think I need to do that for httpLbs.
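Concretely, the static-HTML test just swapped makeRequest out for a file read, roughly like this (file name is made up):

makeRequest :: Manager -> IO BL.ByteString
makeRequest _ = BL.readFile "page.html"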
Here's my code.
{-# LANGUAGE OverloadedStrings #-}

import Control.Monad.IO.Class (liftIO)
import qualified Data.Text as T
import qualified Data.ByteString.Lazy as BL
import Text.Regex.PCRE
import Network.HTTP.Conduit

-- The definitions of url, params, and doSleep are omitted here.

main :: IO ()
main = do
    manager <- newManager def
    daemonLoop manager

daemonLoop :: Manager -> IO ()
daemonLoop manager = do
    rows <- scrapeWebpage manager
    putStrLn $ "number of rows parsed: " ++ (show $ length rows)
    doSleep
    daemonLoop manager

scrapeWebpage :: Manager -> IO [[BL.ByteString]]
scrapeWebpage manager = do
    putStrLn "before makeRequest"
    html <- makeRequest manager
    -- Force evaluation of html.
    putStrLn $ "html length: " ++ (show $ BL.length html)
    putStrLn "after makeRequest"
    -- Breaks ~10M html table into 2d list of bytestrings.
    -- Max memory usage is about 45M, which is about 15M more than when sleeping.
    return $ map tail $ html =~ pattern
  where
    pattern :: BL.ByteString
    pattern = BL.concat $ replicate 12 "<td[^>]*>([^<]+)</td>\\s*"

makeRequest :: Manager -> IO BL.ByteString
makeRequest manager = runResourceT $ do
    defReq <- parseUrl url
    let request = urlEncodedBody params $ defReq
            -- Don't throw errors for bad statuses.
            { checkStatus = \_ _ -> Nothing
            -- 1 minute.
            , responseTimeout = Just 60000000
            }
    response <- httpLbs request manager
    return $ responseBody response
and its output:
before makeRequest
html length: 1555212
after makeRequest
number of rows parsed: 3608
...
before makeRequest
html length: 1555212
after makeRequest
bannerstalkerd: out of memory (requested 2097152 bytes)
Getting rid of the regex computations fixed the problem, and the error seems to happen after the networking finishes and during the regex, presumably because of something I'm doing wrong with http-conduit. Any ideas?
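One thing I could try next is matching against a strict ByteString instead of a lazy one, in case the lazy-ByteString regex instance is part of the problem. An untested sketch of that, using the same pattern:

{-# LANGUAGE OverloadedStrings #-}

import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as BL
import Text.Regex.PCRE

-- Pack the lazy chunks into one strict ByteString before matching.
parseRows :: BL.ByteString -> [[B.ByteString]]
parseRows html = map tail $ strictHtml =~ pattern
  where
    strictHtml = B.concat (BL.toChunks html)
    pattern :: B.ByteString
    pattern = B.concat $ replicate 12 "<td[^>]*>([^<]+)</td>\\s*"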
Also, when I try to compile with profiling enabled I get this error:
Could not find module `Network.HTTP.Conduit'
Perhaps you haven't installed the profiling libraries for package `http-conduit-1.8.9'?
Indeed, I have not installed profiling libraries for http-conduit, and I don't know how.
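My best guess from the cabal-install docs is something like the following, but I haven't verified it (and presumably the same would be needed for all of its dependencies):

cabal install --reinstall --enable-library-profiling http-conduit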