c# - Base-N encoding of a byte array

Question

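A few days ago I came across this CodeReview question on Base-36 encoding a byte array. However, the answers that followed did not touch on decoding back into a byte array, or on reusing the answer to perform encodings in different bases (radixes).

The answer to the linked question uses BigInteger. So, as far as implementation goes, the base and its digits could be parameterized.

The problem with BigInteger, though, is that it treats our input as an assumed integer. Our input, a byte array, is just an opaque series of values.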

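If the byte array ends in a series of zero bytes, e.g. {0xFF,0x7F,0x00,0x00}, those bytes are lost when using the algorithm in the answer (it would only encode {0xFF,0x7F}).
If the last non-zero byte has the sign bit set, the zero byte that follows it is consumed, because it is treated as the BigInteger's sign delimiter. So {0xFF,0xFF,0x00,0x00} would only encode as {0xFF,0xFF,0x00}.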
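
Both cases can be reproduced without any encoder by round-tripping the arrays through BigInteger. This is a minimal sketch (the helper name RoundTrip is mine, not from the linked answer); it only relies on the documented behavior that BigInteger reads a byte array as a little-endian two's-complement integer:

```csharp
using System;
using System.Numerics;

// Hypothetical helper: bytes -> BigInteger -> bytes, with no encoding in between
static byte[] RoundTrip(byte[] bytes) => new BigInteger(bytes).ToByteArray();

byte[] pos = RoundTrip(new byte[] { 0xFF, 0x7F, 0x00, 0x00 });
byte[] neg = RoundTrip(new byte[] { 0xFF, 0xFF, 0x00, 0x00 });

Console.WriteLine(BitConverter.ToString(pos)); // FF-7F     (both ending zero bytes are gone)
Console.WriteLine(BitConverter.ToString(neg)); // FF-FF-00  (one zero byte survives as the sign byte)

// Without that sign byte the same two bytes would be read as a negative number
Console.WriteLine(new BigInteger(new byte[] { 0xFF, 0xFF })); // -1
```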

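How could a .NET programmer use BigInteger to create a reasonably efficient, radix-agnostic encoder with decoding support, the ability to handle endianness, and a way to "work around" the ending zero bytes being lost?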

score 12 · Accepted Answer

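Edit [2020/01/26]: FWIW, the code below, along with its unit test, lives alongside my open source libraries on Github.

Edit [2016/04/19]: If you're fond of exceptions, you may wish to change some of the Decode implementation code to throw InvalidDataException instead of just returning null.

Edit [2014/09/14]: I've added a "HACK" to Encode() to handle cases where the last byte in the input is signed (if you were to convert it to sbyte). The only sane solution I can think of right now is to Resize() the array by one. Additional unit tests for this case passed, but I didn't rerun the perf code to account for such cases. If you can help it, always have your input to Encode() include a dummy 0 byte at the end to avoid the additional allocation.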
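
On newer runtimes the extra zero byte may not be needed at all. The following is a minimal sketch, assuming .NET Core 2.1 or later, where BigInteger has a span-based constructor and a ToByteArray overload that both take an isUnsigned flag. It is not part of the RadixEncoding class in this answer, which predates those overloads:

```csharp
using System;
using System.Numerics;

byte[] input = { 0xFF, 0xFF };

// Default constructor: the set high bit of the last byte makes the value negative
var asSigned = new BigInteger(input);
// Span-based overload (assumes .NET Core 2.1+): the bytes are read as an unsigned magnitude
var asUnsigned = new BigInteger(input, isUnsigned: true);

Console.WriteLine(asSigned);   // -1
Console.WriteLine(asUnsigned); // 65535

// Converting back with isUnsigned: true does not append a sign byte
byte[] back = asUnsigned.ToByteArray(isUnsigned: true);
Console.WriteLine(BitConverter.ToString(back)); // FF-FF
```

Ending zero bytes are still dropped by this approach, so it only replaces the sign-bit hack, not the zero padding.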

Usage

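I've created a RadixEncoding class (found in the "Code" section) which is initialized with three parameters:

the radix digits as a string (the length determines the actual radix, of course),
the assumed byte ordering (endian) of input byte arrays,
and whether or not the user wants the encode/decode logic to acknowledge ending zero bytes.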
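
Because the digit string's length fixes the radix, it also fixes how many characters a given number of bytes can need. The sketch below mirrors the EncodingCharsCount method from the Code section as a standalone function (the name EncodedLength is hypothetical):

```csharp
using System;

// Upper bound on the number of radix digits needed to represent byteCount bytes
static int EncodedLength(int byteCount, int radix)
{
    double bitsPerDigit = Math.Log(radix, 2); // about 5.17 bits per digit for base 36
    return (int)Math.Ceiling(byteCount * 8 / bitsPerDigit);
}

const string base36Digits = "0123456789abcdefghijklmnopqrstuvwxyz";

// "A test 1234" is 11 bytes: 88 bits / 5.17 bits per digit is about 17.02, rounded up
Console.WriteLine(EncodedLength(11, base36Digits.Length)); // 18
```

This is the length that zero-digit padding fills up to when ending zero bytes are acknowledged.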

To create a Base-36 encoding with little-endian input that does not preserve ending zero bytes (pass true as the last argument to preserve them):

const string k_base36_digits = "0123456789abcdefghijklmnopqrstuvwxyz";
var base36_no_zeros = new RadixEncoding(k_base36_digits, EndianFormat.Little, false);

Then to actually perform the encoding/decoding:

const string k_input = "A test 1234";
byte[] input_bytes = System.Text.Encoding.UTF8.GetBytes(k_input);
string encoded_string = base36_no_zeros.Encode(input_bytes);
byte[] decoded_bytes = base36_no_zeros.Decode(encoded_string);

Performance

Timed with Diagnostics.Stopwatch, on an i7 860 @ 2.80GHz. The timing EXE was run by itself, not under a debugger.

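The encoding was initialized with the same k_base36_digits string from above, EndianFormat.Little, and with ending zero bytes acknowledged (even though the UTF8 bytes don't have any extra ending zero bytes).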

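Encoding the UTF8 bytes of "A test 1234" 1,000,000 times takes 2.6567905 secs
Decoding the same string the same number of times takes 3.3916248 secs

Encoding the UTF8 bytes of "A test 1234. Made slightly larger!" 100,000 times takes 1.1577325 secs
Decoding the same string the same number of times takes 1.244326 secs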
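
The timing harness itself isn't shown in this answer. A minimal sketch of how such a Stopwatch measurement could be set up follows; the helper name TimeIterations is hypothetical, and Convert.ToBase64String is only a placeholder workload standing in for RadixEncoding.Encode:

```csharp
using System;
using System.Diagnostics;
using System.Text;

// Runs the action once untimed, then times the given number of iterations
static TimeSpan TimeIterations(Action action, int iterations)
{
    action(); // warm-up call so JIT compilation is not part of the measurement
    var stopwatch = Stopwatch.StartNew();
    for (int i = 0; i < iterations; i++)
        action();
    stopwatch.Stop();
    return stopwatch.Elapsed;
}

byte[] input = Encoding.UTF8.GetBytes("A test 1234");

// Placeholder workload; swap in the encoder call you want to measure
TimeSpan elapsed = TimeIterations(() => Convert.ToBase64String(input), 1000000);
Console.WriteLine(elapsed.TotalSeconds + " secs");
```

As with the numbers above, run a release build outside the debugger, otherwise the results are not comparable.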

Code

If you don't have the CodeContracts generators, you will have to reimplement the contracts with if/throw code.

using System;
using System.Collections.Generic;
using System.Numerics;
using Contract = System.Diagnostics.Contracts.Contract;

public enum EndianFormat
{
    /// <summary>Least Significant Byte first (lsb)</summary>
    /// <remarks>Right-to-Left</remarks>
    /// <see cref="BitConverter.IsLittleEndian"/>
    Little,
    /// <summary>Most Significant Byte first (msb)</summary>
    /// <remarks>Left-to-Right</remarks>
    Big,
};

/// <summary>Encodes/decodes bytes to/from a string</summary>
/// <remarks>
/// Encoded string is always in big-endian ordering
/// 
/// <p>The constructor takes an <b>includeProceedingZeros</b> parameter which acts as a work-around
/// for an edge case with our BigInteger implementation.
/// MSDN says BigInteger byte arrays are in LSB->MSB ordering. So a byte buffer with zeros at the 
/// end will have those zeros ignored in the resulting encoded radix string.
/// If such a loss in precision absolutely cannot occur pass true to <b>includeProceedingZeros</b>
/// and for a tiny bit of extra processing it will handle the padding of zero digits (encoding)
/// or bytes (decoding).</p>
/// <p>Note: doing this for decoding <b>may</b> add an extra byte more than what was originally 
/// given to Encode.</p>
/// </remarks>
// Based on the answers from http://codereview.stackexchange.com/questions/14084/base-36-encoding-of-a-byte-array/
public class RadixEncoding
{
    const int kByteBitCount = 8;

    readonly string kDigits;
    readonly double kBitsPerDigit;
    readonly BigInteger kRadixBig;
    readonly EndianFormat kEndian;
    readonly bool kIncludeProceedingZeros;

    /// <summary>Numerical base of this encoding</summary>
    public int Radix { get { return kDigits.Length; } }
    /// <summary>Endian ordering of bytes input to Encode and output by Decode</summary>
    public EndianFormat Endian { get { return kEndian; } }
    /// <summary>True if we want ending zero bytes to be encoded</summary>
    public bool IncludeProceedingZeros { get { return kIncludeProceedingZeros; } }

    public override string ToString()
    {
        return string.Format("Base-{0} {1}", Radix.ToString(), kDigits);
    }

    /// <summary>Create a radix encoder using the given characters as the digits in the radix</summary>
    /// <param name="digits">Digits to use for the radix-encoded string</param>
    /// <param name="bytesEndian">Endian ordering of bytes input to Encode and output by Decode</param>
    /// <param name="includeProceedingZeros">True if we want ending zero bytes to be encoded</param>
    public RadixEncoding(string digits,
        EndianFormat bytesEndian = EndianFormat.Little, bool includeProceedingZeros = false)
    {
        Contract.Requires<ArgumentNullException>(digits != null);
        int radix = digits.Length;

        kDigits = digits;
        kBitsPerDigit = System.Math.Log(radix, 2);
        kRadixBig = new BigInteger(radix);
        kEndian = bytesEndian;
        kIncludeProceedingZeros = includeProceedingZeros;
    }

    // Number of characters needed for encoding the specified number of bytes
    int EncodingCharsCount(int bytesLength)
    {
        return (int)Math.Ceiling((bytesLength * kByteBitCount) / kBitsPerDigit);
    }
    // Number of bytes needed for decoding the specified number of characters
    int DecodingBytesCount(int charsCount)
    {
        return (int)Math.Ceiling((charsCount * kBitsPerDigit) / kByteBitCount);
    }

    /// <summary>Encode a byte array into a radix-encoded string</summary>
    /// <param name="bytes">byte array to encode</param>
    /// <returns>The bytes encoded into a radix-encoded string</returns>
    /// <remarks>If <paramref name="bytes"/> is zero length, returns an empty string</remarks>
    public string Encode(byte[] bytes)
    {
        Contract.Requires<ArgumentNullException>(bytes != null);
        Contract.Ensures(Contract.Result<string>() != null);

        // Don't really have to do this, our code will build this result (empty string),
        // but why not catch the condition before doing work?
        if (bytes.Length == 0) return string.Empty;

        // if the array ends with zeros, having the capacity set to this will help us know how much
        // 'padding' we will need to add
        int result_length = EncodingCharsCount(bytes.Length);
        // List<> has a(n in-place) Reverse method. StringBuilder doesn't. That's why.
        var result = new List<char>(result_length);

        // HACK: BigInteger uses the last byte as the 'sign' byte. If that byte's MSB is set, 
        // we need to pad the input with an extra 0 (ie, make it positive)
        if ( (bytes[bytes.Length-1] & 0x80) == 0x80 )
            Array.Resize(ref bytes, bytes.Length+1);

        var dividend = new BigInteger(bytes);
        // IsZero's computation is less complex than evaluating "dividend > 0"
        // which invokes BigInteger.CompareTo(BigInteger)
        while (!dividend.IsZero)
        {
            BigInteger remainder;
            dividend = BigInteger.DivRem(dividend, kRadixBig, out remainder);
            int digit_index = System.Math.Abs((int)remainder);
            result.Add(kDigits[digit_index]);
        }

        if (kIncludeProceedingZeros)
            for (int x = result.Count; x < result.Capacity; x++)
                result.Add(kDigits[0]); // pad with the character that represents 'zero'

        // orientate the characters in big-endian ordering
        if (kEndian == EndianFormat.Little)
            result.Reverse();
        // If we didn't end up adding padding, ToArray will end up returning a TrimExcess'd array, 
        // so nothing wasted
        return new string(result.ToArray());
    }

    void DecodeImplPadResult(ref byte[] result, int padCount)
    {
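        // result is null when the string contained an invalid character; keep it null
        // instead of dereferencing it below
        if (result != null && padCount > 0)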
        {
            int new_length = result.Length + DecodingBytesCount(padCount);
            Array.Resize(ref result, new_length); // new bytes will be zero, just the way we want it
        }
    }
    #region Decode (Little Endian)
    byte[] DecodeImpl(string chars, int startIndex = 0)
    {
        var bi = new BigInteger();
        for (int x = startIndex; x < chars.Length; x++)
        {
            int i = kDigits.IndexOf(chars[x]);
            if (i < 0) return null; // invalid character
            bi *= kRadixBig;
            bi += i;
        }

        return bi.ToByteArray();
    }
    byte[] DecodeImplWithPadding(string chars)
    {
        int pad_count = 0;
        for (int x = 0; x < chars.Length; x++, pad_count++)
            if (chars[x] != kDigits[0]) break;

        var result = DecodeImpl(chars, pad_count);
        DecodeImplPadResult(ref result, pad_count);

        return result;
    }
    #endregion
    #region Decode (Big Endian)
    byte[] DecodeImplReversed(string chars, int startIndex = 0)
    {
        var bi = new BigInteger();
        for (int x = (chars.Length-1)-startIndex; x >= 0; x--)
        {
            int i = kDigits.IndexOf(chars[x]);
            if (i < 0) return null; // invalid character
            bi *= kRadixBig;
            bi += i;
        }

        return bi.ToByteArray();
    }
    byte[] DecodeImplReversedWithPadding(string chars)
    {
        int pad_count = 0;
        for (int x = chars.Length - 1; x >= 0; x--, pad_count++)
            if (chars[x] != kDigits[0]) break;

        var result = DecodeImplReversed(chars, pad_count);
        DecodeImplPadResult(ref result, pad_count);

        return result;
    }
    #endregion
    /// <summary>Decode a radix-encoded string into a byte array</summary>
    /// <param name="radixChars">radix string</param>
    /// <returns>The decoded bytes, or null if an invalid character is encountered</returns>
    /// <remarks>
    /// If <paramref name="radixChars"/> is an empty string, returns a zero length array
    /// 
    /// Using <see cref="IncludeProceedingZeros"/> has the potential to return a buffer with an
    /// additional zero byte that wasn't in the input. So a 4 byte buffer was encoded, this could end up
    /// returning a 5 byte buffer, with the extra byte being null.
    /// </remarks>
    public byte[] Decode(string radixChars)
    {
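        Contract.Requires<ArgumentNullException>(radixChars != null);

        // Match the documented behavior: a zero BigInteger would otherwise decode to { 0 }
        if (radixChars.Length == 0) return new byte[0];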

        if (kEndian == EndianFormat.Big)
            return kIncludeProceedingZeros ? DecodeImplReversedWithPadding(radixChars) : DecodeImplReversed(radixChars);
        else
            return kIncludeProceedingZeros ? DecodeImplWithPadding(radixChars) : DecodeImpl(radixChars);
    }
};
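
For the 2016 edit note about preferring exceptions, the digit-parsing loop shared by the DecodeImpl methods could throw instead of returning null. This is only a sketch of that variation as a standalone function (ParseDigitsOrThrow is a hypothetical name, not a member of RadixEncoding); InvalidDataException lives in System.IO:

```csharp
using System;
using System.IO;
using System.Numerics;

// Accumulates the digits most-significant first, like DecodeImpl, but throws on a bad character
static BigInteger ParseDigitsOrThrow(string chars, string digits)
{
    var radix = new BigInteger(digits.Length);
    var value = BigInteger.Zero;
    for (int x = 0; x < chars.Length; x++)
    {
        int i = digits.IndexOf(chars[x]);
        if (i < 0)
            throw new InvalidDataException(
                "Character '" + chars[x] + "' at index " + x + " is not a digit of this radix.");
        value = value * radix + i;
    }
    return value;
}

const string base36Digits = "0123456789abcdefghijklmnopqrstuvwxyz";
Console.WriteLine(ParseDigitsOrThrow("zz", base36Digits)); // 1295 (35 * 36 + 35)
```

Callers then no longer need a null check after Decode, at the cost of a try/catch where invalid input is expected.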

Basic unit tests

using System;
using Microsoft.VisualStudio.TestTools.UnitTesting;

[TestClass]
public class RadixEncodingTests
{
    // True if output starts with every element of input (decoding may append extra zero bytes)
    static bool ArraysCompareN<T>(T[] input, T[] output)
        where T : IEquatable<T>
    {
        if (output.Length < input.Length) return false;
        for (int x = 0; x < input.Length; x++)
            if (!output[x].Equals(input[x])) return false;

        return true;
    }
    // Round-trips the bytes through Encode and Decode and compares the result with the input
    static bool RadixEncodingTest(RadixEncoding encoding, byte[] bytes)
    {
        string encoded = encoding.Encode(bytes);
        byte[] decoded = encoding.Decode(encoded);

        return ArraysCompareN(bytes, decoded);
    }
    [TestMethod]
    public void TestRadixEncoding()
    {
        const string k_base36_digits = "0123456789abcdefghijklmnopqrstuvwxyz";
        var base36 = new RadixEncoding(k_base36_digits, EndianFormat.Little, true);
        // the "no zeros" encoder must not pad for ending zero bytes, so pass false
        var base36_no_zeros = new RadixEncoding(k_base36_digits, EndianFormat.Little, false);

        byte[] ends_with_zero_neg = { 0xFF, 0xFF, 0x00, 0x00 };
        byte[] ends_with_zero_pos = { 0xFF, 0x7F, 0x00, 0x00 };
        byte[] text = System.Text.Encoding.ASCII.GetBytes("A test 1234");

        Assert.IsTrue(RadixEncodingTest(base36, ends_with_zero_neg));
        Assert.IsTrue(RadixEncodingTest(base36, ends_with_zero_pos));
        Assert.IsTrue(RadixEncodingTest(base36_no_zeros, text));
    }
}

score 1 · Answer

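Interestingly, I was able to port Kornman's techniques over to Java and got the expected output up to and including base36. However, when running his code from c#, compiled with the csc from C:\Windows\Microsoft.NET\Framework\v4.0.30319, the output was not as expected.

For example, when I tried to base16-encode the MD5 hashBytes obtained for the string "hello world" below with Kornman's RadixEncoding, I could see that the two characters representing each byte were in the wrong order.

Rather than 5eb63bbbe01eeed093cb22bb8f5acdc3

I saw something like e56bb3bb0ee1....

This was on Windows 7.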
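
One plausible reading of that output, assuming the encoder was constructed with EndianFormat.Big (the post doesn't say): in that mode RadixEncoding.Encode does not reverse the digits, so they come out least-significant first, while BigInteger still reads the bytes as little-endian. The sketch below reproduces the effect on the first two hash bytes. ToDigits is a hypothetical helper, and the isUnsigned/isBigEndian constructor flags assume .NET Core 2.1 or later:

```csharp
using System;
using System.Numerics;
using System.Text;

// Emits digits by repeated DivRem (least-significant digit first), optionally reversing them
static string ToDigits(BigInteger value, string digits, bool reverse)
{
    var sb = new StringBuilder();
    while (!value.IsZero)
    {
        value = BigInteger.DivRem(value, digits.Length, out BigInteger remainder);
        sb.Append(digits[(int)remainder]);
    }
    if (!reverse) return sb.ToString();
    char[] chars = sb.ToString().ToCharArray();
    Array.Reverse(chars);
    return new string(chars);
}

const string hex = "0123456789abcdef";
byte[] firstTwoHashBytes = { 0x5e, 0xb6 };

// Bytes read little-endian: the value is 0xb65e
var littleEndianValue = new BigInteger(firstTwoHashBytes, isUnsigned: true);
// Bytes read big-endian: the value is 0x5eb6
var bigEndianValue = new BigInteger(firstTwoHashBytes, isUnsigned: true, isBigEndian: true);

Console.WriteLine(ToDigits(littleEndianValue, hex, reverse: false)); // e56b (swapped pairs, as observed)
Console.WriteLine(ToDigits(littleEndianValue, hex, reverse: true));  // b65e (byte order reversed)
Console.WriteLine(ToDigits(bigEndianValue, hex, reverse: true));     // 5eb6 (conventional hex)
```

Conventional hex output therefore needs both a big-endian reading of the bytes and reversed digits. Leading zero bytes would still be dropped by this sketch.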

using System;
using System.Text;

// Wrapper class name is illustrative; the original snippet showed only the members
public static class Md5HexExample
{
    const string input = "hello world";

    public static void Main(string[] args)
    {
        using (System.Security.Cryptography.MD5 md5 = System.Security.Cryptography.MD5.Create())
        {
            byte[] inputBytes = System.Text.Encoding.ASCII.GetBytes(input);

            byte[] hashBytes = md5.ComputeHash(inputBytes);

            // Convert the byte array to a hexadecimal string, two digits per byte
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < hashBytes.Length; i++)
            {
                // "x2" gives lowercase digits, matching the expected string above
                sb.Append(hashBytes[i].ToString("x2"));
            }
            Console.WriteLine(sb.ToString());
        }
    }
}

The Java code is below for anyone interested. As mentioned above, it only works up to base 36.

import java.math.BigInteger;

// Wrapper class name is illustrative; the original snippet showed only the members
public final class BaseX
{
    private static final char[] BASE16_CHARS = "0123456789abcdef".toCharArray();
    private static final BigInteger BIGINT_16 = BigInteger.valueOf(16);

    private static final char[] BASE36_CHARS = "0123456789abcdefghijklmnopqrstuvwxyz".toCharArray();
    private static final BigInteger BIGINT_36 = BigInteger.valueOf(36);

    public static String toBaseX(byte[] bytes, BigInteger base, char[] chars)
    {
        if (bytes == null) {
            return null;
        }

        final int bitsPerByte = 8;
        double bitsPerDigit = Math.log(chars.length) / Math.log(2);

        // Number of chars to encode specified bytes
        int size = (int) Math.ceil((bytes.length * bitsPerByte) / bitsPerDigit);

        StringBuilder sb = new StringBuilder(size);

        // Signum 1 reads the bytes as an unsigned big-endian magnitude, so a first byte with
        // its high bit set no longer yields a negative value, and an empty array is allowed
        for (BigInteger value = new BigInteger(1, bytes); !value.equals(BigInteger.ZERO);) {
            BigInteger[] quotientAndRemainder = value.divideAndRemainder(base);
            // Prepend, because digits are produced least-significant first
            sb.insert(0, chars[quotientAndRemainder[1].intValue()]);
            value = quotientAndRemainder[0];
        }

        return sb.toString();
    }
}
