2

我想读取包含数百万条记录的大二进制文件,并且我想获取一些记录报告。我BinaryReader用来读取(我认为在读取器中具有最佳性能)并将读取的字节转换为数据模型。由于记录的数量,将模型传递给报表层是另一个问题:我更喜欢IEnumerable在开发报表时使用 LINQ 功能和特性。

这是示例数据类:

Public Class MyData
    Public A1 As UInt64
    Public A2 As UInt64
    Public A3 As Byte
    Public A4 As UInt16
    Public A5 As UInt64
End Class

我用这个子来创建文件:

Sub CreateSampleFile()
    Using streamWriter As New FileStream(fileName, FileMode.Create, FileAccess.Write, FileShare.Write)
        For i As Integer = 1 To 1000
            For j As Integer = 1 To 1000
                For k = 1 To 30
                    Dim item As New MyData With {.A1 = i, .A2 = j, .A3 = k, .A4 = j, .A5 = i * j}
                    Dim bytes() As Byte = BitConverter.GetBytes(item.A1).Concat(BitConverter.GetBytes(item.A2)).Concat({item.A3}).Concat(BitConverter.GetBytes(item.A4)).Concat(BitConverter.GetBytes(item.A5)).ToArray
                    streamWriter.Write(bytes, 0, bytes.Length)
                Next
            Next
        Next
    End Using
End Sub

这是我的读者课:

Imports System.IO

Public Class FileReader

    Public Const BUFFER_LENGTH As Long = 4096 * 256 * 27
    Public Const MY_DATA_LENGTH As Long = 27
    Private _buffer(BUFFER_LENGTH - 1) As Byte
    Private _streamWriter As FileStream
    Public Event OnByteRead(sender As FileReader, bytes() As Byte, index As Long)

    Public Sub StartReadBinary(fileName As String)
        Dim currentBufferReadCount As Long = 0
        Using fileStream As New FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.Read)
            Using streamReader As New BinaryReader(fileStream)
                currentBufferReadCount = streamReader.Read(Me._buffer, 0, Me._buffer.Length)
                While currentBufferReadCount > 0
                    For i As Integer = 0 To currentBufferReadCount - 1 Step MY_DATA_LENGTH
                        RaiseEvent OnByteRead(Me, Me._buffer, i)
                    Next
                    currentBufferReadCount = streamReader.Read(Me._buffer, 0, Me._buffer.Length)
                End While
            End Using
        End Using
    End Sub

    Public Iterator Function GetAll(fileName As String) As IEnumerable(Of MyData)
        Dim currentBufferReadCount As Long = 0
        Using fileStream As New FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.Read)
            Using streamReader As New BinaryReader(fileStream)
                currentBufferReadCount = streamReader.Read(Me._buffer, 0, Me._buffer.Length)
                While currentBufferReadCount > 0
                    For i As Integer = 0 To currentBufferReadCount - 1 Step MY_DATA_LENGTH
                        Yield GetInstance(_buffer, i)
                    Next
                    currentBufferReadCount = streamReader.Read(Me._buffer, 0, Me._buffer.Length)
                End While
            End Using
        End Using
    End Function

    Public Function GetInstance(bytes() As Byte, index As Long) As MyData
        Return New MyData With {.A1 = BitConverter.ToUInt64(bytes, index), .A2 = BitConverter.ToUInt64(bytes, index + 8), .A3 = bytes(index + 16), .A4 = BitConverter.ToUInt16(bytes, index + 17), .A5 = BitConverter.ToUInt64(bytes, index + 19)}
    End Function

End Class

我在考虑IEnumerable性能,所以我尝试对从文件中读取的每条记录同时使用GetAll方法IEnumerable和引发事件。这是测试模块:

Imports System.IO

Module Module1

    Private fileName As String = "MyData.dat"
    Private readerJustTraverse As New FileReader
    Private WithEvents readerWithoutInstance As New FileReader
    Private WithEvents readerWithInstance As New FileReader
    Private readerIEnumerable As New FileReader

    Sub Main()

        Dim s As New Stopwatch

        s.Start()
        readerJustTraverse.StartReadBinary(fileName)
        s.Stop()
        Console.WriteLine("Read bytes: {0}", s.ElapsedMilliseconds)

        s.Restart()
        readerWithoutInstance.StartReadBinary(fileName)
        s.Stop()
        Console.WriteLine("Read bytes, raise event: {0}", s.ElapsedMilliseconds)

        s.Restart()
        readerWithInstance.StartReadBinary(fileName)
        s.Stop()
        Console.WriteLine("Read bytes, raise event, get instance: {0}", s.ElapsedMilliseconds)

        s.Restart()
        For Each item In readerIenumerable.GetAll(fileName)

        Next
        Console.WriteLine("Read bytes, get instance, return yield: {0}", s.ElapsedMilliseconds)
        s.Stop()

        Console.ReadLine()

    End Sub

    Private Sub readerWithInstance_OnByteRead(sender As FileReader, bytes() As Byte, index As Long) Handles readerWithInstance.OnByteRead
        Dim item As MyData = sender.GetInstance(bytes, index)
    End Sub

    Private Sub readerWithoutInstance_OnByteRead(sender As FileReader, bytes() As Byte, index As Long) Handles readerWithoutInstance.OnByteRead
        'do nothing
    End Sub

End Module

我想知道的是每个进程的经过时间,这是测试结果(在华硕超极本 - Zenbook Core i7 上测试):

读取字节:384(不触及读取字节!)

读取字节,引发事件:583

读取字节,引发事件,获取实例:3923

读取字节,获取实例,返回产量:4917

它表明以字节形式读取文件非常快,而将字节转换为模型很慢。同样引发事件而不是获得 IEnumerable 结果,速度提高了 25%。

在 IEnumerable 中进行迭代真的有这种性能成本还是我错过了什么?

4

1 回答 1

2

是的,使用迭代器函数会带来性能损失。

我编译了你的代码,得到了和你一样的结果。我查看了生成的 IL 代码。从 GetAll 方法创建的状态机确实包含很多东西,但大多数指令是 nop 或简单操作。

正如您所说,使用/不使用迭代器函数的结果相差 25%。这还不算太多。当您使用 StartReadBinary 时,只需一个大循环即可(通过事件)调用 OnByteRead 方法 30 亿次。但是,当您在 foreach 循环中创建对象时,您必须为每个对象调用生成的枚举器的 GetCurrent() 方法和 MoveNext(),后者并非微不足道(GetAll 中的大部分代码是移到那里)并使用大量编译器生成的变量。

使用“Yield”通常会减慢您的程序,因为编译器必须创建复杂的 IL 代码来表示状态机。

于 2014-01-24T21:39:02.540 回答