I'm trying to parse a very large file using FParsec. The file is 61GB, which is too big to hold in RAM, so I'd like to generate a sequence of results (i.e. seq<'Result>) rather than a list, if possible. Can this be done with FParsec? (I've come up with a jury-rigged implementation that actually does this, but it doesn't work well in practice because CharStream.Seek is O(n).)

The file is line-oriented (one record per line), which should make it possible in theory to parse in batches of, say, 1000 records at a time. The FParsec "Tips and tricks" section says:

If you’re dealing with large input files or very slow parsers, it might also be worth trying to parse multiple sections within a single file in parallel. For this to be efficient there must be a fast way to find the start and end points of such sections. For example, if you are parsing a large serialized data structure, the format might allow you to easily skip over segments within the file, so that you can chop up the input into multiple independent parts that can be parsed in parallel. Another example could be a programming languages whose grammar makes it easy to skip over a complete class or function definition, e.g. by finding the closing brace or by interpreting the indentation. In this case it might be worth not to parse the definitions directly when they are encountered, but instead to skip over them, push their text content into a queue and then to process that queue in parallel.

This sounds perfect for me: I'd like to pre-parse each batch of records into a queue, and then finish parsing them in parallel later. However, I don't know how to accomplish this with the FParsec API. How can I create such a queue without using up all my RAM?

FWIW, the file I'm trying to parse is here if anyone wants to give it a try with me. :)

1 Answer

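The "obvious" thing that comes to mind is to pre-process the file with something like File.ReadLines and then parse one line at a time.

If that doesn't work (from your PDF, it looks like a record is a few lines long), you can build a seq of records, or of batches of 1000 records or so, using ordinary FileStream reads. This doesn't require knowing the details of a record, but it would be convenient if you can at least delimit the records.

Either way, you end up with a lazy seq that the parser can then read.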
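
A minimal sketch of the line-at-a-time approach, assuming one record per line. The `Record` type and the `<id>,<name>` line format are made up for illustration (the real file format isn't shown here), and `parseLine`, `parseLines` and `parseFile` are hypothetical helper names. `File.ReadLines` enumerates the file lazily, `Seq.chunkBySize` groups the lines into batches, and `Array.Parallel.map` parses each batch in parallel, so only about one batch should be in memory at a time:

```fsharp
open System.IO
open FParsec

// Hypothetical record type; the real file's fields will differ.
type Record = { Id: int; Name: string }

// Assumed line format "<id>,<name>"; replace with the real record parser.
let recordParser : Parser<Record, unit> =
    ((pint32 .>> pchar ',') .>>. restOfLine false .>> eof)
    |>> (fun (id, name) -> { Id = id; Name = name })

// Run the FParsec parser on a single line held in memory.
let parseLine (line: string) : ParserResult<Record, unit> =
    run recordParser line

// Lazily parse a sequence of lines, one batch at a time.
// Batches are pulled on demand; the lines within a batch are parsed in parallel.
let parseLines (batchSize: int) (lines: seq<string>) : seq<ParserResult<Record, unit>> =
    lines
    |> Seq.chunkBySize batchSize
    |> Seq.collect (Array.Parallel.map parseLine)

// File.ReadLines streams the file instead of loading it all at once.
let parseFile (path: string) : seq<ParserResult<Record, unit>> =
    File.ReadLines path |> parseLines 1000
```

Because the result is a `seq`, nothing is read until it is consumed, e.g. with `parseFile path |> Seq.iter handle` for some handler of your own. Materializing it with `Seq.toList` would bring the memory problem back.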

answered 2015-05-12T06:52:50.687