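I have a file of space-separated numbers. It is about 1 GB in size, and I want to read the numbers from it. I decided to use a memory-mapped file for fast reading, but I don't understand how to do it. I tried the following: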

var mmf = MemoryMappedFile.CreateFromFile("test", FileMode.Open, "myFile");
var mmfa = mmf.CreateViewAccessor(0, 0, MemoryMappedFileAccess.Read);
var nums = new int[6];
var a = mmfa.ReadArray<int>(0, nums, 0, 6); 

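But if "test" contains only "01", I get 12337 in nums[0]. 12337 = 48*256 + 49. I searched the internet but found nothing that covers my problem, only material about byte arrays or inter-process communication. Can you tell me how to get 1 in nums[0]?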

3 Answers

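The following example reads ASCII integers from a memory-mapped file as fast as possible, without creating any strings. The solution MiMo provided is much slower. It runs at about 5 MB/s, which will not help you much. The biggest problem with MiMo's solution is that it calls a method (Read) for every single character, which costs up to a factor of 15 in performance. If your original problem was performance, I wonder why you accepted his solution. You can get 20 MB/s with a naive string reader that parses the strings into integers. Fetching every byte through a method call really ruins the read performance you could otherwise get.

The code below maps the file in 200 MB chunks so that it does not fill up the 32-bit address space. It then scans the buffer with a byte pointer, which is very fast. Integer parsing is easy if you do not have to consider localization. Interestingly, once I create a mapped view, the only way to get a pointer to the view buffer does not let me start at the beginning of the mapped region.

I think this is a bug in the .NET Framework that is still not fixed in .NET 4.5. The SafeMemoryMappedViewHandle buffer is allocated according to the allocation granularity of the OS. If you advance to some offset, the pointer you get back still points to the start of the buffer. That is really unfortunate, because it makes the difference between 5 MB/s and 77 MB/s in parsing performance.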

Did read 258.888.890 bytes with 77 MB/s


using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.IO.MemoryMappedFiles;
using System.Runtime.InteropServices;

unsafe class Program
{
    static void Main(string[] args)
    {
        new Program().Start();
    }

    private void Start()
    {
        var sw = Stopwatch.StartNew();
        string fileName = @"C:\Source\BigFile.txt";//@"C:\Source\Numbers.txt";
        var file = MemoryMappedFile.CreateFromFile(fileName);
        var fileSize = new FileInfo(fileName).Length;
        int viewSize = 200 * 1000 * 1000; // 200 MB per view (200 * 100 * 1000 would only be 20 MB)
        long offset = 0;
        for (; offset < fileSize-viewSize; offset +=viewSize ) // create 200 MB views
        {
            using (var accessor = file.CreateViewAccessor(offset, viewSize))
            {
                int unReadBytes = ReadData(accessor, offset);
                offset -= unReadBytes;
            }
        }

        using (var rest = file.CreateViewAccessor(offset, fileSize - offset))
        {
            ReadData(rest, offset);
        }
        sw.Stop();
        Console.WriteLine("Did read {0:N0} bytes with {1:F0} MB/s", fileSize, (fileSize / (1024 * 1024)) / sw.Elapsed.TotalSeconds);
    }


    List<int> Data = new List<int>();

    private int ReadData(MemoryMappedViewAccessor accessor, long offset)
    {
        using(var safeViewHandle = accessor.SafeMemoryMappedViewHandle)
        {
            byte* pStart = null;
            safeViewHandle.AcquirePointer(ref pStart);
            ulong correction = 0;
            // Needed to correct the offset, because the view handle does not start at the offset specified in the CreateViewAccessor call.
            // This makes AcquirePointer nearly useless.
            // http://connect.microsoft.com/VisualStudio/feedback/details/537635/no-way-to-determine-internal-offset-used-by-memorymappedviewaccessor-makes-safememorymappedviewhandle-property-unusable
            pStart = Helper.Pointer(pStart, offset, out correction);
            var len = safeViewHandle.ByteLength - correction;
            bool digitFound = false;
            int curInt = 0;
            byte current =0;
            for (ulong i = 0; i < len; i++)
            {
                current = *(pStart + i);
                if (current >= (byte)'0' && current <= (byte)'9')
                {
                    // accumulate the next decimal digit
                    curInt = curInt * 10 + (current - '0');
                    digitFound = true;
                }
                else if (digitFound)
                {
                    // any non-digit byte (space, line break, zero padding) ends the number
                    Data.Add(curInt);
                    digitFound = false;
                    curInt = 0;
                }
            }

            // scan backwards to count the digits of a number that was cut off at the end of the view
            int unread = 0;
            if (digitFound)
            {
                byte* pEnd = pStart + len;
                while (pEnd > pStart)
                {
                    pEnd--;
                    if (*pEnd < (byte)'0' || *pEnd > (byte)'9')
                    {
                        break;
                    }
                    unread++;
                }
            }

            safeViewHandle.ReleasePointer();
            return unread;
        }
    }

    public unsafe static class Helper
    {
        static SYSTEM_INFO info;

        static Helper()
        {
            GetSystemInfo(ref info);
        }

        public static byte* Pointer(byte *pByte, long offset, out ulong diff)
        {
            var num = offset % info.dwAllocationGranularity;
            diff = (ulong)num; // return difference

            byte* tmp_ptr = pByte;

            tmp_ptr += num;

            return tmp_ptr;
        }

        [DllImport("kernel32.dll", SetLastError = true)]
        internal static extern void GetSystemInfo(ref SYSTEM_INFO lpSystemInfo);

        internal struct SYSTEM_INFO
        {
            internal int dwOemId;
            internal int dwPageSize;
            internal IntPtr lpMinimumApplicationAddress;
            internal IntPtr lpMaximumApplicationAddress;
            internal IntPtr dwActiveProcessorMask;
            internal int dwNumberOfProcessors;
            internal int dwProcessorType;
            internal int dwAllocationGranularity;
            internal short wProcessorLevel;
            internal short wProcessorRevision;
        }
    }

    void GenerateNumbers()
    {
        using (var file = File.CreateText(@"C:\Source\BigFile.txt"))
        {
            for (int i = 0; i < 30 * 1000 * 1000; i++)
            {
                file.Write(i.ToString() + " ");
            }
        }
    }

}
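
If unsafe code is not an option, the per-byte method call can also be avoided by copying each chunk into a managed byte[] and scanning that array. The following is only a sketch of that idea, not a measured replacement for the pointer version above: the names ChunkedReaderDemo, ParseChunk and ReadNumbers are made up for illustration, the chunk size is an arbitrary assumption, and the extra copy per chunk may well make it slower than the pointer scan. It does sidestep the pointer offset problem, because CreateViewStream takes care of the offset internally.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.MemoryMappedFiles;

static class ChunkedReaderDemo
{
    // Hypothetical helper: parses the digits in buffer[0..count) and keeps a number
    // that is split across two chunks in 'current' / 'inNumber' for the next call.
    public static void ParseChunk(byte[] buffer, int count, List<int> output,
                                  ref int current, ref bool inNumber)
    {
        for (int i = 0; i < count; i++)
        {
            byte b = buffer[i];
            if (b >= (byte)'0' && b <= (byte)'9')
            {
                current = current * 10 + (b - (byte)'0');
                inNumber = true;
            }
            else if (inNumber)
            {
                output.Add(current); // any non-digit byte ends the number
                current = 0;
                inNumber = false;
            }
        }
    }

    // Hypothetical helper: maps the file one chunk at a time and parses every chunk.
    public static List<int> ReadNumbers(string path, int chunkSize)
    {
        var numbers = new List<int>();
        long fileSize = new FileInfo(path).Length;
        var buffer = new byte[chunkSize];
        int current = 0;
        bool inNumber = false;

        using (var file = MemoryMappedFile.CreateFromFile(path, FileMode.Open))
        {
            for (long offset = 0; offset < fileSize; offset += chunkSize)
            {
                int count = (int)Math.Min((long)chunkSize, fileSize - offset);
                using (var view = file.CreateViewStream(offset, count, MemoryMappedFileAccess.Read))
                {
                    // Stream.Read may return fewer bytes than requested, so loop until the chunk is full.
                    int read = 0;
                    while (read < count)
                    {
                        int n = view.Read(buffer, read, count - read);
                        if (n == 0) break;
                        read += n;
                    }
                    ParseChunk(buffer, read, numbers, ref current, ref inNumber);
                }
            }
        }

        if (inNumber) numbers.Add(current); // last number without a trailing separator
        return numbers;
    }

    static void Main(string[] args)
    {
        // args[0] is assumed to be the path of a non-empty number file.
        List<int> numbers = ReadNumbers(args[0], 16 * 1024 * 1024);
        Console.WriteLine("Read {0:N0} numbers", numbers.Count);
    }
}
```

Because the parser state is carried from one chunk to the next, no backward scan for a cut-off number is needed here.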
answered 2012-05-05T19:56:59.957
You need to parse the file content, converting the characters into numbers, like this:

List<int> nums = new List<int>();
long curPos = 0;
int curV = 0;
bool hasCurV = false;
while (curPos < mmfa.Capacity) {
  byte c;
  mmfa.Read(curPos++, out c);
  if (c == 0) {
    break;
  }
  if (c == 32) {
    if (hasCurV) {
      nums.Add(curV);
      curV = 0;
    }
    hasCurV = false;
  } else {
    curV = checked(curV*10 + (int)(c-48));
    hasCurV = true;
  }
}
if (hasCurV) {
  nums.Add(curV);
}

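This assumes that mmfa.Capacity is the total number of characters to read and that the file contains only digits separated by spaces (i.e. no line endings or other whitespace). Since a view is sized in whole pages, Capacity can be larger than the file itself; the c == 0 check stops the loop when it reaches the zero padding.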

answered 2012-05-05T19:46:29.440
48 = 0x30 = '0',49 = 0x31 = '1'

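So you did get the actual characters; they are just ASCII-encoded.

The string "01" takes 2 bytes, and both fit into a single int, so you get both of them packed into one int. If you want them separately, you need to request an array of bytes.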
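
To make the difference concrete, here is a small sketch that takes the two ASCII bytes of "01", reproduces the arithmetic from the question, and then turns the same bytes into the number 1. The class AsciiDigitsDemo and the helper ParseAsciiDigits are hypothetical names used only for this illustration.

```csharp
using System;
using System.Text;

static class AsciiDigitsDemo
{
    // Hypothetical helper: folds ASCII digit bytes into their decimal value.
    public static int ParseAsciiDigits(byte[] raw)
    {
        int value = 0;
        foreach (byte b in raw)
        {
            value = value * 10 + (b - (byte)'0'); // '0' is 48, so 48 -> 0 and 49 -> 1
        }
        return value;
    }

    static void Main()
    {
        byte[] raw = Encoding.ASCII.GetBytes("01"); // { 48, 49 }
        Console.WriteLine(raw[0] * 256 + raw[1]);   // the formula from the question: 12337
        Console.WriteLine(ParseAsciiDigits(raw));   // the digits interpreted as text: 1
    }
}
```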


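Edit: If you need to parse "01" into the number 1, i.e. convert from the ASCII representation to binary, you need a different approach. I would suggest: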

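  1. Do not use a memory-mapped file,
  2. Use a StreamReader to read the file line by line (see the example here)
  3. Use string.Split to split each line into chunks
  4. Use int.Parse to parse each chunk into a number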
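
The steps above could look roughly like the sketch below. LineReaderDemo and ReadNumbers are hypothetical names, and the file path is assumed to arrive as the first command-line argument.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;

static class LineReaderDemo
{
    // Hypothetical helper: reads space-separated integers from any TextReader.
    public static List<int> ReadNumbers(TextReader reader)
    {
        var numbers = new List<int>();
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            // Split on spaces and drop the empty entries produced by repeated separators.
            string[] tokens = line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
            foreach (string token in tokens)
            {
                numbers.Add(int.Parse(token, CultureInfo.InvariantCulture));
            }
        }
        return numbers;
    }

    static void Main(string[] args)
    {
        // args[0] is assumed to be the path of the number file.
        using (var reader = new StreamReader(args[0]))
        {
            List<int> numbers = ReadNumbers(reader);
            Console.WriteLine("Read {0} numbers", numbers.Count);
        }
    }
}
```

Keep in mind that ReadLine returns a whole line as one string. If the 1 GB file has no line breaks at all, this loads everything into memory at once, so this approach fits best when the numbers are spread over many lines.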
answered 2012-05-05T19:21:35.977