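I am new to Haskell and I keep running into efficiency problems.
The task: build a CSV file from a 4GB text file whose columns have constant widths.
The column widths are known, e.g. [col1: 4 chars wide, col2: 2 chars wide, etc.]
The file can only contain [A-Z0-9] ASCII characters, so there is no point in escaping cells.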
I have:
$ cat example.txt
AAAABBCCCC...
AAA1B1CCC1...
... (72 chars per line, usually 50 million lines)
I need:
$ cat done.csv
AAAA,BB,CCCC, ...
AAA1,B1,CCC1, ...
...
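To make the intended transformation concrete, here is a minimal String-based sketch of what should happen to a single line. It is only a reference for the expected output, not a candidate solution: String is far too slow for a 4GB file. The names splitFixed and toCsvLine are made up for this illustration.

```haskell
import Data.List (intercalate)

-- Cut one fixed-width line into fields of the given widths.
-- Input beyond the listed widths is dropped; a line that is too
-- short simply yields short or empty trailing fields.
splitFixed :: [Int] -> String -> [String]
splitFixed []     _  = []
splitFixed (w:ws) xs = field : splitFixed ws rest
  where (field, rest) = splitAt w xs

-- Join the fields of one line with commas (no escaping is needed,
-- because the data only contains [A-Z0-9]).
toCsvLine :: [Int] -> String -> String
toCsvLine widths = intercalate "," . splitFixed widths
```

For example, toCsvLine [4, 2, 4] "AAAABBCCCC" evaluates to "AAAA,BB,CCCC".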
This is my fastest code in Haskell so far; it takes about 2 minutes to process the whole 4GB file.
I need it to take 30 seconds at most.
{-# LANGUAGE BangPatterns #-}
import qualified Data.ByteString.Lazy as BL
import qualified Data.ByteString as B
import qualified Data.ByteString.Unsafe as U
import Data.ByteString.Lazy.Builder -- newer bytestring releases expose this as Data.ByteString.Builder
import Data.Monoid
import Data.List

-- Mask over one output row: 0 = copy an input byte, 1 = emit a column separator
col_sizes = intercalate [1] $ map (`replicate` 0) cs
  where
    cs = [4, 4, 4, 3, 5, 1, 1, 3, 3, 3, 3, 3, 3, 10, 3, 1, 1, 1, 2, 3, 10]

sp = char8 ',' -- column separator
nl = char8 '\n'

-- The two bounds checks come first so that c is never read past the end of the mask
separator !cs !cl !xs !xl !ci !xi
  | xi == xl  = mempty -- at the end of bytestring, end recursion
  | cl == ci  = pr
  | c == 1    = ps
  | otherwise = pc
  where
    c = U.unsafeIndex cs ci -- get column separation indicator
    w = word8 . U.unsafeIndex xs -- get char from BS at position
    p = separator cs cl xs xl -- partial recursion call
    pr = nl <> p 0 (xi + 1) -- end of row, put '\n', reset counter, recur
    ps = sp <> p (ci + 1) xi -- end of column, put column separator, recur
    pc = w xi <> p (ci + 1) (xi + 1) -- in the middle of column, copy byte, recur

main = do
  contents <- B.getContents
  BL.putStr . toLazyByteString $ init_sep sp_after_char contents

init_sep cs xs = separator cs (l cs) xs (l xs) 0 0
  where l = fromIntegral . B.length

sp_after_char = B.pack col_sizes
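Since everything in separator is driven by the col_sizes mask, it may help to see what that mask looks like for a small width list. The sketch below repeats the same intercalate/replicate construction under the made-up name rowMask, with an explicit Word8 type so it can be tried on its own.

```haskell
import Data.List (intercalate)
import Data.Word (Word8)

-- Build the per-row mask: a run of w zeros for each column of width w,
-- with a single 1 between neighbouring columns.
-- 0 means "copy one input byte", 1 means "emit a column separator".
rowMask :: [Int] -> [Word8]
rowMask = intercalate [1] . map (`replicate` 0)
```

For example, rowMask [4, 2, 4] evaluates to [0,0,0,0,1,0,0,1,0,0,0,0]. For the real width list above, the 21 widths sum to 71, so the mask has 71 zeros plus 20 ones (91 entries), and each input line is 71 data bytes followed by the newline, which matches the 72 chars per line.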
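Here is my implementation in C: http://pastebin.com/Kjz3Mugs
(too long to paste here...)
It takes about 5 seconds to process the same file.
So my Haskell code is roughly 20 times slower.
Since Haskell's ByteString filter and map are at least as fast as my implementations of them in C
(both take less than 2 seconds to process the same file, doing some simple modifications),
I hope that something is wrong with my Haskell code and that I won't be forced to use C.
UPDATE: a test data generator is available here: http://pastebin.com/aJ3RW3jG
In production the data is piped from one binary to another, so there is no hard drive IO.
To test my solutions I use an SSD drive, but I think Ext4 caches that file in RAM anyway: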
time cat test.txt > /dev/null
cat test.txt > /dev/null 0,00s user 0,35s system 99% cpu 0,353 total
Vanilla generator:
time ./data_builder | head -50000000 > /dev/null
./data_builder 0,02s user 1,09s system 30% cpu 3,709 total
head -50000000 > /dev/null 2,95s user 0,76s system 99% cpu 3,708 total
My C solution:
time ./tocsvc < test.txt > /dev/null
./tocsvc < test.txt > /dev/null 5,35s user 0,35s system 100% cpu 5,689 total
With generator:
time ./data_builder | head -50000000 | ./tocsvc > /dev/null
./data_builder 0,02s user 1,18s system 18% cpu 6,460 total
head -50000000 3,15s user 1,19s system 67% cpu 6,459 total
./tocsvc > /dev/null 5,81s user 0,55s system 98% cpu 6,459 total
@GabrielGonzalez's Haskell solution:
time ./tocsvh1 < test.txt > /dev/null
./tocsv < test.txt > /dev/null 19,56s user 0,41s system 100% cpu 19,950 total
With generator:
time ./data_builder | head -50000000 | ./tocsvh1 > /dev/null
./data_builder 0,11s user 3,04s system 7% cpu 41,320 total
head -50000000 7,29s user 3,56s system 26% cpu 41,319 total
./tocsvh2 > /dev/null 33,01s user 2,42s system 85% cpu 41,327 total
My Haskell solution:
time ./tocsvh2 < test.txt > /dev/null
./tocsvh2 < test.txt > /dev/null 128,63s user 2,95s system 100% cpu 2:11,45 total
With generator:
time ./data_builder | head -50000000 | ./tocsvh2 > /dev/null
./data_builder 0,02s user 1,26s system 28% cpu 4,526 total
head -50000000 3,17s user 1,33s system 99% cpu 4,524 total
./tocsvh2 > /dev/null 129,95s user 3,33s system 98% cpu 2:14,75 total
@LukeTaylor's solution:
time ./tocsvh3 < test.txt > /dev/null
./tocsv < test.txt > /dev/null 324,38s user 4,13s system 100% cpu 5:28,18 total
With generator:
time ./data_builder | head -50000000 | ./tocsvh3 > /dev/null
./data_builder 0,43s user 4,46s system 1% cpu 5:30,34 total
head -50000000 5,20s user 2,82s system 2% cpu 5:30,34 total
./tocsv > /dev/null 329,08s user 4,21s system 100% cpu 5:32,96 total