我想读取包含数百万条记录的大二进制文件,并且我想获取一些记录报告。我BinaryReader
用来读取(我认为在读取器中具有最佳性能)并将读取的字节转换为数据模型。由于记录的数量,将模型传递给报表层是另一个问题:我更喜欢IEnumerable
在开发报表时使用 LINQ 功能和特性。
这是示例数据类:
Public Class MyData
Public A1 As UInt64
Public A2 As UInt64
Public A3 As Byte
Public A4 As UInt16
Public A5 As UInt64
End Class
我用这个子来创建文件:
Sub CreateSampleFile()
Using streamWriter As New FileStream(fileName, FileMode.Create, FileAccess.Write, FileShare.Write)
For i As Integer = 1 To 1000
For j As Integer = 1 To 1000
For k = 1 To 30
Dim item As New MyData With {.A1 = i, .A2 = j, .A3 = k, .A4 = j, .A5 = i * j}
Dim bytes() As Byte = BitConverter.GetBytes(item.A1).Concat(BitConverter.GetBytes(item.A2)).Concat({item.A3}).Concat(BitConverter.GetBytes(item.A4)).Concat(BitConverter.GetBytes(item.A5)).ToArray
streamWriter.Write(bytes, 0, bytes.Length)
Next
Next
Next
End Using
End Sub
这是我的读者课:
Imports System.IO
Public Class FileReader
Public Const BUFFER_LENGTH As Long = 4096 * 256 * 27
Public Const MY_DATA_LENGTH As Long = 27
Private _buffer(BUFFER_LENGTH - 1) As Byte
Private _streamWriter As FileStream
Public Event OnByteRead(sender As FileReader, bytes() As Byte, index As Long)
Public Sub StartReadBinary(fileName As String)
Dim currentBufferReadCount As Long = 0
Using fileStream As New FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.Read)
Using streamReader As New BinaryReader(fileStream)
currentBufferReadCount = streamReader.Read(Me._buffer, 0, Me._buffer.Length)
While currentBufferReadCount > 0
For i As Integer = 0 To currentBufferReadCount - 1 Step MY_DATA_LENGTH
RaiseEvent OnByteRead(Me, Me._buffer, i)
Next
currentBufferReadCount = streamReader.Read(Me._buffer, 0, Me._buffer.Length)
End While
End Using
End Using
End Sub
Public Iterator Function GetAll(fileName As String) As IEnumerable(Of MyData)
Dim currentBufferReadCount As Long = 0
Using fileStream As New FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.Read)
Using streamReader As New BinaryReader(fileStream)
currentBufferReadCount = streamReader.Read(Me._buffer, 0, Me._buffer.Length)
While currentBufferReadCount > 0
For i As Integer = 0 To currentBufferReadCount - 1 Step MY_DATA_LENGTH
Yield GetInstance(_buffer, i)
Next
currentBufferReadCount = streamReader.Read(Me._buffer, 0, Me._buffer.Length)
End While
End Using
End Using
End Function
Public Function GetInstance(bytes() As Byte, index As Long) As MyData
Return New MyData With {.A1 = BitConverter.ToUInt64(bytes, index), .A2 = BitConverter.ToUInt64(bytes, index + 8), .A3 = bytes(index + 16), .A4 = BitConverter.ToUInt16(bytes, index + 17), .A5 = BitConverter.ToUInt64(bytes, index + 19)}
End Function
End Class
我在考虑IEnumerable
性能,所以我尝试对从文件中读取的每条记录同时使用GetAll
方法IEnumerable
和引发事件。这是测试模块:
Imports System.IO
Module Module1
Private fileName As String = "MyData.dat"
Private readerJustTraverse As New FileReader
Private WithEvents readerWithoutInstance As New FileReader
Private WithEvents readerWithInstance As New FileReader
Private readerIEnumerable As New FileReader
Sub Main()
Dim s As New Stopwatch
s.Start()
readerJustTraverse.StartReadBinary(fileName)
s.Stop()
Console.WriteLine("Read bytes: {0}", s.ElapsedMilliseconds)
s.Restart()
readerWithoutInstance.StartReadBinary(fileName)
s.Stop()
Console.WriteLine("Read bytes, raise event: {0}", s.ElapsedMilliseconds)
s.Restart()
readerWithInstance.StartReadBinary(fileName)
s.Stop()
Console.WriteLine("Read bytes, raise event, get instance: {0}", s.ElapsedMilliseconds)
s.Restart()
For Each item In readerIenumerable.GetAll(fileName)
Next
Console.WriteLine("Read bytes, get instance, return yield: {0}", s.ElapsedMilliseconds)
s.Stop()
Console.ReadLine()
End Sub
Private Sub readerWithInstance_OnByteRead(sender As FileReader, bytes() As Byte, index As Long) Handles readerWithInstance.OnByteRead
Dim item As MyData = sender.GetInstance(bytes, index)
End Sub
Private Sub readerWithoutInstance_OnByteRead(sender As FileReader, bytes() As Byte, index As Long) Handles readerWithoutInstance.OnByteRead
'do nothing
End Sub
End Module
我想知道的是每个进程的经过时间,这是测试结果(在华硕超极本 - Zenbook Core i7 上测试):
读取字节:384(不触及读取字节!)
读取字节,引发事件:583
读取字节,引发事件,获取实例:3923
读取字节,获取实例,返回产量:4917
它表明以字节形式读取文件非常快,而将字节转换为模型很慢。同样引发事件而不是获得 IEnumerable 结果,速度提高了 25%。
在 IEnumerable 中进行迭代真的有这种性能成本还是我错过了什么?