There's a huge disconnect between the methodology of parseTags and the methodology of conduit and pipes: parseTags assumes it can access the next chunk of data purely, while pipes/conduit let you handle situations where that's impossible, such as streaming from a file. In order to mix parsing into pipes/conduit you must have a way to interleave the steps that consume a parse with the steps that pull new chunks of data.
(I'll use pipes from here on because I'm more familiar with them, but the idea is transferable.)
We can see this disconnect in the types, though I'll begin with a slightly restricted version.
parseTags :: Lazy.ByteString -> [Tag Lazy.ByteString]
We can think of Lazy.ByteString as a streaming apparatus all by itself; it is, after all, essentially just
type LazyByteString = [Strict.ByteString]
such that if we were generating the Lazy.ByteString ourselves, we could rely on the laziness of lists to ensure that we don't generate more than parseTags needs in order to proceed (I'll assume, without looking, that parseTags is written so that it can incrementally parse a streaming structure like that).
-- the chunk literal below needs OverloadedStrings
sillyGen :: LazyByteString
sillyGen = gen 10 where
  gen 0 = []
  gen n = "<tag> </tag>" : gen (n-1)
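To see that laziness at work, here is a small sketch that needs only the bytestring package. The names endlessChunks, endlessDoc and firstBytes are mine, purely for illustration. The point is that Lazy.fromChunks consumes its list lazily, so a consumer that asks for a short prefix only forces the chunks it needs, even when the supply of chunks is unbounded.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.ByteString.Char8 as Strict
import qualified Data.ByteString.Lazy.Char8 as Lazy

-- Hypothetical generator: an unbounded supply of strict chunks.
endlessChunks :: [Strict.ByteString]
endlessChunks = repeat "<tag> </tag>"

-- Assemble the chunks into a lazy ByteString. Nothing is generated yet;
-- chunks are only produced as a consumer demands them.
endlessDoc :: Lazy.ByteString
endlessDoc = Lazy.fromChunks endlessChunks

-- A consumer that wants 18 bytes. This terminates even though the
-- underlying list of chunks is infinite.
firstBytes :: Lazy.ByteString
firstBytes = Lazy.take 18 endlessDoc
```

A parser that consumes its input incrementally would behave like Lazy.take here: it pulls on the list only as far as it needs to.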
Now the problem here is that the streaming behavior of a list depends crucially upon being able to generate the tail of the list purely. In the discussion so far there hasn't been any mention of a monad at all. Unfortunately, that purity cannot hold when a string is being streamed from a file: we need to somehow integrate an IO action between each streamed chunk, where we check whether we've reached EOF and close the file as necessary.
This is exactly the realm of pipes and conduit, so let's look at what they do to solve that issue.
-- from pipes-bytestring
fromHandle :: Handle -> Producer' Strict.ByteString IO ()
We can think of fromHandle as being the "monadically-interwoven" equivalent to
Lazy.hGetContents :: Handle -> IO Lazy.ByteString
The types suggest a crucial difference between these two operations: hGetContents can be executed in exactly one IO action, while when we pass a Handle to pipes-bytestring's fromHandle it returns a type which is parameterized over IO but cannot be simply freed from it. This is exactly indicative of hGetContents using lazy IO (which can be unpredictable due to the use of unsafeInterleaveIO) while fromHandle uses deterministic streaming.
We can write a type similar to Producer Strict.ByteString IO () as
data IOStreamBS = IOSBS { stepStream :: IO (Strict.ByteString, Either IOStreamBS ()) }
In other words we can think of Producer Strict.ByteString IO () as not much more than an IO action which produces exactly the next chunk of the file and (possibly) a new action to get the next chunk. This is how pipes and conduit provide deterministic streaming.
But it also means that you cannot escape from the IO in one fell swoop; you have to carry it around.
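As a rough sketch of what "carrying the IO around" looks like, here is the IOStreamBS type from above driven by a file handle, using only base and bytestring. The helpers streamHandle, drain and readInChunks are names I've invented for this illustration; this is not how pipes-bytestring implements fromHandle, just the same idea in miniature.

```haskell
import qualified Data.ByteString as Strict
import System.IO (Handle, IOMode (ReadMode), hIsEOF, withFile)

-- The simplified stream type from the text: each step is an IO action
-- yielding one chunk plus either the rest of the stream or the end.
data IOStreamBS = IOSBS
  { stepStream :: IO (Strict.ByteString, Either IOStreamBS ()) }

-- Hypothetical helper: read at most chunkSize bytes per step.
streamHandle :: Int -> Handle -> IOStreamBS
streamHandle chunkSize h = IOSBS $ do
  chunk <- Strict.hGetSome h chunkSize
  eof   <- hIsEOF h
  pure (chunk, if eof then Right () else Left (streamHandle chunkSize h))

-- Walk the stream to the end. Every chunk costs one explicit IO step,
-- so the caller always knows when reads happen.
drain :: IOStreamBS -> IO [Strict.ByteString]
drain s = do
  (chunk, rest) <- stepStream s
  case rest of
    Right ()  -> pure [chunk]
    Left more -> (chunk :) <$> drain more

-- All reads finish before withFile closes the handle.
readInChunks :: Int -> FilePath -> IO [Strict.ByteString]
readInChunks chunkSize path =
  withFile path ReadMode (drain . streamHandle chunkSize)
```

Note that there is no function of type IOStreamBS -> [Strict.ByteString] here; the best we can do is IOStreamBS -> IO [Strict.ByteString].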
We might thus want to adjust parseTags, which is capable of some generalization over its input, to just accept Producer Strict.ByteString IO () as a StringLike type:
parseTags :: StringLike str => str -> [Tag str]
Let's assume for argument that we've instantiated StringLike (Producer Strict.ByteString IO ()). That would mean that applying parseTags to our producer would provide us with a list of Tag (Producer Strict.ByteString IO ()).
type DetStream = Producer Strict.ByteString IO ()
parseTags :: DetStream -> [Tag DetStream]
For this to happen we would have had to peek into our Producer and cut it up into chunks without executing anything in the IO monad. By this point it should be clear that such a function is impossible: we couldn't even get the first chunk from the file without doing something in IO.
To remedy this situation, systems like pipes-parse and pipes-group have arisen which replace the function signature with something more like
parseTagsGrouped :: Producer Strict.ByteString IO ()
                 -> FreeT (Producer (Tag Strict.ByteString) IO) IO ()
which is scary looking but serves an identical purpose to parseTags except that it generalizes the list to a structure which allows us to execute arbitrary IO actions between each element. This kind of transformation, as the type shows, can be done purely and thus allows us to assemble our streaming machinery using pure combinations and only incur an IO step when we execute it at the end (using runEffect).
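To make "assembled purely, executed at the end" concrete without pulling in pipes-group, here is a deliberately simplified stand-in that needs only base. Stream, mapS, toListS and counted are names I've made up for this sketch and are not part of pipes; the real Producer and FreeT types are considerably richer. What it shows is that transforming the stream is an ordinary pure function, and no effect fires until something finally runs it.

```haskell
import Data.IORef (IORef, modifyIORef')

-- A list whose tail must be obtained by running an action in m.
newtype Stream m a = Stream { next :: m (Maybe (a, Stream m a)) }

-- Pure transformation: builds a new description and runs nothing.
mapS :: Functor m => (a -> b) -> Stream m a -> Stream m b
mapS f (Stream step) =
  Stream (fmap (fmap (\(a, rest) -> (f a, mapS f rest))) step)

-- The only place where effects are actually executed.
toListS :: Monad m => Stream m a -> m [a]
toListS (Stream step) = do
  r <- step
  case r of
    Nothing        -> pure []
    Just (a, rest) -> (a :) <$> toListS rest

-- Demo source: bumps a counter every time an element is pulled,
-- standing in for "read the next chunk from the file".
counted :: IORef Int -> [a] -> Stream IO a
counted ref = go
  where
    go []       = Stream (pure Nothing)
    go (x : xs) = Stream $ do
      modifyIORef' ref (+ 1)
      pure (Just (x, go xs))
```

Applying mapS to a counted stream leaves the counter untouched; it only moves once toListS runs the result, which plays the role runEffect plays in pipes.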
So, all said and done, it's probably not going to be possible to use pipes or conduit to stream to parseTags: it simply assumes that certain transformations can be done purely, pushing all the IO to one point in time, while pipes/conduit are basically mechanisms for spreading IO throughout a computation without too much mental overhead.
If you're stuck using parseTags, however, you can get by using lazy IO as long as you're careful. Try a few variations with hGetContents from Data.ByteString.Lazy. The primary problem will be that the file may close prior to the unsafeInterleaveIO'd operations actually getting around to reading it. You'll thus need to manage strictness very carefully.
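One way to manage that strictness, sketched under the assumption that you only need a summary of the input, is to force the result while the handle is still open. In the example below countOpenAngles is a hypothetical helper of mine, and counting '<' bytes is just a stand-in for whatever you would compute from parseTags. The important part is that evaluate runs inside withFile, so every lazy read has already happened by the time the file is closed.

```haskell
import Control.Exception (evaluate)
import qualified Data.ByteString.Lazy as Lazy
import Data.Int (Int64)
import System.IO (IOMode (ReadMode), withFile)

-- Hypothetical helper: count '<' bytes (ASCII 60) in a file via lazy IO.
countOpenAngles :: FilePath -> IO Int64
countOpenAngles path =
  withFile path ReadMode $ \h -> do
    -- Lazy read: no bytes have been pulled from the handle yet.
    contents <- Lazy.hGetContents h
    -- Force the whole computation before withFile closes the handle.
    -- Returning the unevaluated count instead would defer the reads
    -- until after the file is closed.
    evaluate (Lazy.count 60 contents)
```

If you return a larger structure, such as a list of tags, forcing it only to weak head normal form is not enough; you would need to evaluate it as deeply as you intend to use it before leaving withFile.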
Essentially that's the big difference between pipes/conduit and lazy IO. When using lazy IO, all of the "read a chunk" operations are made invisible and implicitly controlled by Haskell laziness. This is dynamic, implicit, and tough to observe or predict. In pipes/conduit all of this motion is made extraordinarily explicit and static, but it's up to you to manage the complexity.