python - Python format size application (converting B to KB, MB, GB, TB)

Question

I am trying to write an application that converts bytes to KB, MB, GB, and TB. Here is what I have so far:

def size_format(b):
    if b < 1000:
        return '%i' % b + 'B'
    elif 1000 <= b < 1000000:
        return '%.1f' % float(b/1000) + 'KB'
    elif 1000000 <= b < 1000000000:
        return '%.1f' % float(b/1000000) + 'MB'
    elif 1000000000 <= b < 1000000000000:
        return '%.1f' % float(b/1000000000) + 'GB'
    elif 1000000000000 <= b:
        return '%.1f' % float(b/1000000000000) + 'TB'

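The problem is that when I try the application, everything after the decimal point comes out as zero. For example, size_format(623) yields "623B", but with size_format(6200), instead of getting "6.2kb" I get "6.0kb". Any ideas why?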

score 45 · Accepted Answer

A fixed version of Bryan_Rch's answer:

def format_bytes(size):
    # 2**10 = 1024
    power = 2**10
    n = 0
    power_labels = {0 : '', 1: 'kilo', 2: 'mega', 3: 'giga', 4: 'tera'}
    # Use >= so that exactly 1024 becomes 1 kilobyte, and stop at the largest label.
    while size >= power and n < 4:
        size /= power
        n += 1
    return size, power_labels[n]+'bytes'

score 28 · Accepted Answer

def humanbytes(B):
    """Return the given bytes as a human friendly KB, MB, GB, or TB string."""
    B = float(B)
    KB = float(1024)
    MB = float(KB ** 2) # 1,048,576
    GB = float(KB ** 3) # 1,073,741,824
    TB = float(KB ** 4) # 1,099,511,627,776

    if B < KB:
        return '{0} {1}'.format(B, 'Bytes' if B == 0 or B > 1 else 'Byte')
    elif KB <= B < MB:
        return '{0:.2f} KB'.format(B / KB)
    elif MB <= B < GB:
        return '{0:.2f} MB'.format(B / MB)
    elif GB <= B < TB:
        return '{0:.2f} GB'.format(B / GB)
    elif TB <= B:
        return '{0:.2f} TB'.format(B / TB)


tests = [1, 1024, 500000, 1048576, 50000000, 1073741824, 5000000000, 1099511627776, 5000000000000]

for t in tests: print("{0} == {1}".format(t,humanbytes(t)))

Output:

1 == 1.0 Byte
1024 == 1.00 KB
500000 == 488.28 KB
1048576 == 1.00 MB
50000000 == 47.68 MB
1073741824 == 1.00 GB
5000000000 == 4.66 GB
1099511627776 == 1.00 TB
5000000000000 == 4.55 TB

And for future me, here it is in Perl too:

sub humanbytes {
   my $B = shift;
   my $KB = 1024;
   my $MB = $KB ** 2; # 1,048,576
   my $GB = $KB ** 3; # 1,073,741,824
   my $TB = $KB ** 4; # 1,099,511,627,776

   if ($B < $KB) {
      return "$B " . (($B == 0 || $B > 1) ? 'Bytes' : 'Byte');
   } elsif ($B >= $KB && $B < $MB) {
      return sprintf('%0.02f',$B/$KB) . ' KB';
   } elsif ($B >= $MB && $B < $GB) {
      return sprintf('%0.02f',$B/$MB) . ' MB';
   } elsif ($B >= $GB && $B < $TB) {
      return sprintf('%0.02f',$B/$GB) . ' GB';
   } elsif ($B >= $TB) {
      return sprintf('%0.02f',$B/$TB) . ' TB';
   }
}

score 14 · Accepted Answer

This seems like a good idea to me:

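def convert_bytes(num):
    """
    Convert a byte count to a human-readable string (bytes, KB, MB, GB, TB).
    """
    step_unit = 1000.0  # 1024 gives the wrong size

    for x in ['bytes', 'KB', 'MB', 'GB']:
        if num < step_unit:
            return "%3.1f %s" % (num, x)
        num /= step_unit
    # Anything larger is reported in TB instead of falling through and returning None.
    return "%3.1f %s" % (num, 'TB')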

score 11 · Accepted Answer

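Warning: every other answer contains bugs. Literally none of them can handle file sizes that are close to the boundary of the next unit. This is the only bug-free answer.

Dividing bytes to get a human-readable answer seems easy, right? Wrong!

All of the other answers are incorrect and contain floating-point rounding bugs that produce wrong output such as "1024 KiB" instead of "1 MiB". They shouldn't feel bad about it, though, since it's a bug that even Android's OS programmers had in the past, and tens of thousands of programmers' eyes never noticed the bug in the world's most popular StackOverflow answer either, despite years of people using that old Java answer.

So what's the problem? It's due to the way floating-point rounding works. A float such as "1023.95" will actually round up to "1024.0" when told to format itself as a single-decimal number. Most programmers don't think about that bug, but it completely breaks "human-readable bytes" formatting. Their code thinks "Oh, 1023.95, that's fine, we've found the correct unit since the number is less than 1024", but it doesn't realize that the value will get rounded to "1024.0", which should be formatted as the NEXT size unit.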
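To make the rounding problem concrete, here is a minimal sketch. The helper name naive_kib is hypothetical and not taken from any answer; it picks KiB because the value is below 1024, yet the one-decimal format still rounds the result up.

```python
def naive_kib(num_bytes):
    """Format a byte count as KiB with one decimal, the naive way."""
    value = num_bytes / 1024  # 1048575 / 1024 == 1023.9990234375
    # The unit check passes: the value really is below 1024 ...
    assert value < 1024
    # ... but formatting to one decimal rounds it up into the next unit's range.
    return "{:.1f} KiB".format(value)


print(naive_kib(2**20 - 1))  # prints "1024.0 KiB", although "1.0 MiB" is wanted
```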

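Furthermore, many of the other answers use very slow code with a bunch of math functions such as pow/log, which may look "neat" but completely wrecks performance. Most of the other answers use crazy if/else nesting, or other performance killers such as temporary lists, live string concatenation/creation, etc. In short, they waste CPU cycles doing pointless heavy work.

Most of them also forget to include larger units, and therefore only support a small subset of the most common file sizes. Given a larger number, such code would output something like "1239213919393491123.1 Gigabytes", which is silly. Some of them won't even do that, and will simply break if the input number is larger than the largest unit they've implemented.

Furthermore, almost none of them handle negative input, such as "minus 2 megabytes", and they break completely on such input.

They also hardcode very personal choices such as precision (how many decimals) and unit type (metric or binary), which means that their code is barely reusable.

So... okay, we have a situation where the current answers aren't correct... so why not do everything right instead? Here's my function, which focuses on both performance and configurability. You can choose between 0-3 decimals, and whether you want metric (powers of 1000) or binary (powers of 1024) representation. It contains some code comments and usage examples, to help people understand why it does what it does and which bugs it avoids by working this way. If all the comments are deleted, the line count shrinks a lot, but I suggest keeping the comments when copypasta-ing, so that you understand the code again in the future. ;-)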

from typing import List, Union

class HumanBytes:
    METRIC_LABELS: List[str] = ["B", "kB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]
    BINARY_LABELS: List[str] = ["B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB", "ZiB", "YiB"]
    PRECISION_OFFSETS: List[float] = [0.5, 0.05, 0.005, 0.0005] # PREDEFINED FOR SPEED.
    PRECISION_FORMATS: List[str] = ["{}{:.0f} {}", "{}{:.1f} {}", "{}{:.2f} {}", "{}{:.3f} {}"] # PREDEFINED FOR SPEED.

    @staticmethod
    def format(num: Union[int, float], metric: bool=False, precision: int=1) -> str:
        """
        Human-readable formatting of bytes, using binary (powers of 1024)
        or metric (powers of 1000) representation.
        """

        assert isinstance(num, (int, float)), "num must be an int or float"
        assert isinstance(metric, bool), "metric must be a bool"
        assert isinstance(precision, int) and precision >= 0 and precision <= 3, "precision must be an int (range 0-3)"

        unit_labels = HumanBytes.METRIC_LABELS if metric else HumanBytes.BINARY_LABELS
        last_label = unit_labels[-1]
        unit_step = 1000 if metric else 1024
        unit_step_thresh = unit_step - HumanBytes.PRECISION_OFFSETS[precision]

        is_negative = num < 0
        if is_negative: # Faster than ternary assignment or always running abs().
            num = abs(num)

        for unit in unit_labels:
            if num < unit_step_thresh:
                # VERY IMPORTANT:
                # Only accepts the CURRENT unit if we're BELOW the threshold where
                # float rounding behavior would place us into the NEXT unit: F.ex.
                # when rounding a float to 1 decimal, any number ">= 1023.95" will
                # be rounded to "1024.0". Obviously we don't want ugly output such
                # as "1024.0 KiB", since the proper term for that is "1.0 MiB".
                break
            if unit != last_label:
                # We only shrink the number if we HAVEN'T reached the last unit.
                # NOTE: These looped divisions accumulate floating point rounding
                # errors, but each new division pushes the rounding errors further
                # and further down in the decimals, so it doesn't matter at all.
                num /= unit_step

        return HumanBytes.PRECISION_FORMATS[precision].format("-" if is_negative else "", num, unit)

print(HumanBytes.format(2251799813685247)) # 2 pebibytes
print(HumanBytes.format(2000000000000000, True)) # 2 petabytes
print(HumanBytes.format(1099511627776)) # 1 tebibyte
print(HumanBytes.format(1000000000000, True)) # 1 terabyte
print(HumanBytes.format(1000000000, True)) # 1 gigabyte
print(HumanBytes.format(4318498233, precision=3)) # 4.022 gibibytes
print(HumanBytes.format(4318498233, True, 3)) # 4.318 gigabytes
print(HumanBytes.format(-4318498233, precision=2)) # -4.02 gibibytes

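By the way, the hardcoded PRECISION_OFFSETS is created that way for maximum performance. We could have calculated the offsets programmatically using the formula unit_step_thresh = unit_step - (0.5/(10**precision)) to support arbitrary precisions. But it really makes no sense to format file sizes with 4 or more trailing decimals. That's why my function supports exactly what people use: 0, 1, 2 or 3 decimals. Thus we avoid a bunch of pow and division math. This decision is one of many small attention-to-detail choices that make this function fast. Another example of a performance choice is the decision to use a string-based if unit != last_label check to detect the end of the list, rather than iterating by index and checking whether we've reached the final list index. Generating indices via range() or tuples via enumerate() is slower than just doing an address comparison of Python's immutable string objects stored in the _LABELS lists, which is what this code does instead!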
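As a sketch of the formula mentioned above (the function name rounding_threshold is an assumption made for illustration, not part of the answer's class), the offsets can be computed instead of hardcoded. For precisions 0-3 they should match the predefined table:

```python
def rounding_threshold(unit_step, precision):
    """Values at or above this threshold are treated as the next unit."""
    return unit_step - (0.5 / (10 ** precision))


# Offsets computed from the formula for the four supported precisions.
computed_offsets = [0.5 / (10 ** p) for p in range(4)]
print(computed_offsets)

# Threshold for binary units (step 1024) with 1 decimal.
print(rounding_threshold(1024, 1))
```

The trade-off is one power and one division per call, which is exactly what the predefined list avoids.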

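Sure, it's a bit excessive to put that much work into performance, but I hate the "write sloppy code and only optimize after the thousands of slow functions in a project have made the whole project sluggish" attitude. The "premature optimization" quote that most programmers live by is completely misunderstood and used as an excuse for sloppiness. :-P

I place this code in the public domain. Feel free to use it in your projects, both freeware and commercial. I actually suggest that you place it in a .py module and change it from a "class namespace" into a normal module instead. I only used a class to keep the code neat for StackOverflow and to make it easy to paste into self-contained Python scripts if you don't want to use modules.

Enjoy and have fun! :-)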

score 10 · Accepted Answer

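Another humanbytes version, with no loops/if..else, in python3 syntax.

Test numbers stolen from @whereisalext's answer.

Mind you, it's still a sketch; for example, if the numbers are large enough it will produce a traceback.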

import math as m


MULTIPLES = ["B", "k{}B", "M{}B", "G{}B", "T{}B", "P{}B", "E{}B", "Z{}B", "Y{}B"]


def humanbytes(i, binary=False, precision=2):
    base = 1024 if binary else 1000
    multiple = m.trunc(m.log2(i) / m.log2(base))
    value = i / m.pow(base, multiple)
    suffix = MULTIPLES[multiple].format("i" if binary else "")
    return f"{value:.{precision}f} {suffix}"


if __name__ == "__main__":
    sizes = [
        1, 1024, 500000, 1048576, 50000000, 1073741824, 5000000000,
        1099511627776, 5000000000000]

    for i in sizes:
        print(f"{i} == {humanbytes(i)}, {humanbytes(i, binary=True)}")

Results:

1 == 1.00 B, 1.00 B
1024 == 1.02 kB, 1.00 kiB
500000 == 500.00 kB, 488.28 kiB
1048576 == 1.05 MB, 1.00 MiB
50000000 == 50.00 MB, 47.68 MiB
1073741824 == 1.07 GB, 1.00 GiB
5000000000 == 5.00 GB, 4.66 GiB
1099511627776 == 1.10 TB, 1.00 TiB
5000000000000 == 5.00 TB, 4.55 TiB

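Update:

As pointed out in the comments (and as originally noted: "Mind you, it's still a sketch"), this code is slow and buggy. Please see @mitch-mcmabers's answer.

Update 2: I also lied about it having no ifs.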

score 5 · Accepted Answer

I have a quite readable function for converting bytes into larger units:

def bytes_2_human_readable(number_of_bytes):
    if number_of_bytes < 0:
        raise ValueError("!!! number_of_bytes can't be smaller than 0 !!!")

    step_to_greater_unit = 1024.

    number_of_bytes = float(number_of_bytes)
    unit = 'bytes'

    if (number_of_bytes / step_to_greater_unit) >= 1:
        number_of_bytes /= step_to_greater_unit
        unit = 'KB'

    if (number_of_bytes / step_to_greater_unit) >= 1:
        number_of_bytes /= step_to_greater_unit
        unit = 'MB'

    if (number_of_bytes / step_to_greater_unit) >= 1:
        number_of_bytes /= step_to_greater_unit
        unit = 'GB'

    if (number_of_bytes / step_to_greater_unit) >= 1:
        number_of_bytes /= step_to_greater_unit
        unit = 'TB'

    precision = 1
    number_of_bytes = round(number_of_bytes, precision)

    return str(number_of_bytes) + ' ' + unit

score 3 · Accepted Answer

Instead of modifying your code, you can change the behavior of division:

from __future__ import division

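This provides "true" division in place of the "classic" style that Python 2.x uses. See PEP 238 - Changing the Division Operator for further details.

This is now the default behavior in Python 3.x.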
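A small illustration of the difference, assuming Python 3, where / is true division and // is floor division (which is what Python 2's / did for two integers):

```python
b = 6200

true_div = b / 1000    # true division keeps the fractional part
floor_div = b // 1000  # floor division drops it, like Python 2's int / int

print('%.1f' % true_div + 'KB')          # prints "6.2KB"
print('%.1f' % float(floor_div) + 'KB')  # prints "6.0KB"
```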

score 3 · Accepted Answer

A very simple solution would be:

SIZE_UNITS = ['B', 'KB', 'MB', 'GB', 'TB', 'PB']

def get_readable_file_size(size_in_bytes):
    index = 0
    while size_in_bytes >= 1024:
        size_in_bytes /= 1024
        index += 1
    try:
        return f'{size_in_bytes} {SIZE_UNITS[index]}'
    except IndexError:
        return 'File too large'

score 1 · Accepted Answer

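When you divide the values, you are using integer division, since both values are integers. You need to convert one of them to a float first (note the parentheses: % and / have the same precedence, so without them the string would be formatted before the division):

return '%.1f' % (float(b)/1000) + 'KB'

or even just

return '%.1f' % (b/1000.0) + 'KB'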

score 1 · Accepted Answer

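Here is a compact version that converts B (bytes) to any higher order such as MB or GB without using a lot of if...else in Python. I use bitwise operations to deal with this. It can also return a float output if you set the return_float parameter of the function to True: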

import math

def bytes_conversion(number, return_float=False):

    def _conversion(number, return_float=False):

        length_number = int(math.log10(number))

        if return_float:

           length_number = int(math.log10(number))
           return length_number // 3, '%.2f' % (int(number)/(1 << (length_number//3) *10))

        return length_number // 3, int(number) >> (length_number//3) * 10

    unit_dict = {
        0: "B",  1: "kB",
        2: "MB", 3: "GB",
        4: "TB", 5: "PB",
        6: "EB"
    }

    if return_float:

        num_length, number = _conversion(number, return_float=return_float)

    else:
        num_length, number = _conversion(number)

    return "%s %s" % (number, unit_dict[num_length])

#Example usage:
#print(bytes_conversion(491266116, return_float=True))

This is just one of my few posts on StackOverflow. Please let me know if I have made any mistakes or violations.

score 1 · Accepted Answer

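I have, in my opinion, improved @whereisalext's answer into a somewhat more generic function that does not require adding more if statements once more units are added: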

AVAILABLE_UNITS = ['bytes', 'KB', 'MB', 'GB', 'TB']

def get_amount_and_unit(byte_amount):
    for index, unit in enumerate(AVAILABLE_UNITS):
        lower_threshold = 0 if index == 0 else 1024 ** (index - 1)
        upper_threshold = 1024 ** index
        if lower_threshold <= byte_amount < upper_threshold:
            if lower_threshold == 0:
                return byte_amount, unit
            else:
                return byte_amount / lower_threshold, AVAILABLE_UNITS[index - 1]
    # Default to the maximum
    max_index = len(AVAILABLE_UNITS) - 1
    return byte_amount / (1024 ** max_index), AVAILABLE_UNITS[max_index]

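Note that this differs slightly from @whereisalext's algorithm:

This returns a tuple containing the converted amount at the first index and the unit at the second index
This does not try to distinguish between a single byte and multiple bytes (so "1 bytes" is an output of this approach)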

score 1 · Accepted Answer

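I think this one is short and succinct. The idea is based on some graph scaling code I wrote many years ago. The code snippet round(log2(size)*4)//40 does the magic here, calculating the boundaries in increments of powers of 2**10. The "correct" implementation would be trunc(log2(size)/10), but then you get strange behavior when the size is close to a new boundary. For instance, datasize(2**20-1) would return (1024.00, 'KiB'). By using round and scaling the log2 result, you get a nice cutoff when approaching a new boundary.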

from math import log2
def datasize(size):
    """
    Calculate the size of a code in B/KB/MB.../
    Return a tuple of (value, unit)
    """
    assert size>0, "Size must be a positive number"
    units = ("B", "KiB", "MiB", "GiB", "TiB", "PiB",  "EiB", "ZiB", "YiB") 
    scaling = round(log2(size)*4)//40
    scaling = min(len(units)-1, scaling)
    return  size/(2**(10*scaling)), units[scaling]

for size in [2**10-1, 2**10-10, 2**10-100, 2**20-10000, 2**20-2**18, 2**20, 2**82-2**72, 2**80-2**76]:
    print(size, "bytes= %.3f %s" % datasize(size))

1023 bytes= 0.999 KiB
1014 bytes= 0.990 KiB
924 bytes= 924.000 B
1038576 bytes= 0.990 MiB
786432 bytes= 768.000 KiB
1048576 bytes= 1.000 MiB
4830980911975647053611008 bytes= 3.996 YiB
1133367955888714851287040 bytes= 0.938 YiB
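For comparison, here is a sketch of the two ways of choosing the unit index just below a boundary. The function names are made up for illustration; index_by_round uses the same expression as datasize above.

```python
from math import log2, trunc


def index_by_trunc(size):
    """The "correct" approach: exact power-of-1024 boundaries."""
    return trunc(log2(size) / 10)


def index_by_round(size):
    """The approach used above: round log2 to quarter steps before dividing."""
    return round(log2(size) * 4) // 40


size = 2**20 - 1
for pick in (index_by_trunc, index_by_round):
    index = pick(size)
    # Show the chosen unit index and the value formatted to two decimals.
    print(pick.__name__, index, "%.2f" % (size / 2**(10 * index)))
```

With trunc the value stays in KiB and formats as 1024.00, while the rounded variant moves to MiB and formats as 1.00. The price is that sizes somewhat below a boundary (such as 1014 bytes) are already shown in the next unit.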

score 1 · Accepted Answer

There is now a convenient DataSize package:

pip install datasize

import datasize
import sys

a = [i for i in range(1000000)]
s = sys.getsizeof(a)
print(f"{datasize.DataSize(s):MiB}")

Output:

8.2945556640625MiB

score 0 · Accepted Answer

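Do float(b) before doing the division, e.g. do float(b)/1000 instead of float(b/1000), because b and 1000 are both integers, so in Python 2 b/1000 is still an integer with no decimal part.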

score 0 · Accepted Answer

Here is one that converts bytes to kilo, mega, giga, tera.

# From bytes to kilo, mega, giga, tera
def get_(size):

    # 2**10 = 1024
    power = 2**10
    n = 1
    Dic_powerN = {1: 'kilobytes', 2: 'megabytes', 3: 'gigabytes', 4: 'Terabytes'}

    # Always convert to at least kilobytes.
    size /= power

    # Keep dividing by 1024 (not by power**n) while a larger unit is available.
    while size >= power and n < 4:
        n += 1
        size /= power

    return size, Dic_powerN[n]

score 0 · Accepted Answer

Output with no decimal places:

>>> format_file_size(12345678)
'11 MiB, 792 KiB, 334 bytes'

def format_file_size(fsize):
    result = []
    units = {s: u for s, u in zip(reversed([2 ** n for n in range(0, 40, 10)]), ['GiB', 'MiB', 'KiB', 'bytes'])}
    for s, u in units.items():
        t = fsize // s
        if t > 0:
            result.append('{} {}'.format(t, u))
        fsize = fsize % s
    return ', '.join(result) or '0 bytes'

score 0 · Accepted Answer

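Let me add mine, in which no variable is updated in a loop and there is no similar error-prone behavior. The logic implemented is straightforward. It has only been tested with Python 3.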

def format_bytes(size: int) -> str:
    power_labels = {40: "TB", 30: "GB", 20: "MB", 10: "KB"}
    for power, label in power_labels.items():
        if size >= 2 ** power:
            approx_size = size // 2 ** power
            return f"{approx_size} {label}"
    return f"{size} bytes"

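It has been tested, for example, at the KB/MB boundary:

1024*1024-1 returns "1023 KB"
1024*1024 returns "1 MB"
1024*1024+1 returns "1 MB"

You can easily change approx_size if you want a float instead of an integer that is rounded down.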
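A sketch of such a float variant follows. The name format_bytes_float is hypothetical; it swaps floor division for true division and prints one decimal, so values just under a boundary can still come out as something like "1024.0 KB".

```python
def format_bytes_float(size: int) -> str:
    # Same table as format_bytes; relies on dict insertion order (largest first).
    power_labels = {40: "TB", 30: "GB", 20: "MB", 10: "KB"}
    for power, label in power_labels.items():
        if size >= 2 ** power:
            approx_size = size / 2 ** power  # true division keeps the fraction
            return f"{approx_size:.1f} {label}"
    return f"{size} bytes"


print(format_bytes_float(1536))         # prints "1.5 KB"
print(format_bytes_float(5 * 1024**3))  # prints "5.0 GB"
```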

score 0 · Accepted Answer

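I know there are already a lot of answers and explanations here, but I tried this class-based approach and it worked perfectly for me. It may look large, but take a look at how I used the attributes and methods.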

class StorageUnits:
    b, Kb, Kib, Mb, Mib, Gb, Gib, Tb, Tib, Pb, Pib, Eb, Eib, Zb, Zib, Yb, Yib, B, KB, KiB, MB, MiB, GB, GiB, TB,\
        TiB, PB, PiB, EB, EiB, ZB, ZiB, YB, YiB = [0]*34


class DigitalStorageConverter:
    def __init__(self):
        self.storage = StorageUnits()
        self.bit_conversion_value_table = {
            'b': 1, 'Kb': 1000, 'Mb': 1000**2, 'Gb': 1000**3, 'Tb': 1000**4, 'Pb': 1000**5, 'Eb': 1000**6,
            'Zb': 1000**7, 'Yb': 1000**8, 'Kib': 1024, 'Mib': 1024**2, 'Gib': 1024**3, 'Tib': 1024**4, 'Pib': 1024**5,
            'Eib': 1024**6, 'Zib': 1024**7, 'Yib': 1024**8,
            'B': 8, 'KB': 8*1000, 'MB': 8*(1000**2), 'GB': 8*(1000**3), 'TB': 8*(1000**4), 'PB': 8*(1000**5),
            'EB': 8*(1000**6), 'ZB': 8*(1000**7), 'YB': 8*(1000**8), 'KiB': 8*1024, 'MiB': 8*(1024**2),
            'GiB': 8*(1024**3), 'TiB': 8*(1024**4), 'PiB': 8*(1024**5), 'EiB': 8*(1024**6), 'ZiB': 8*(1024**7),
            'YiB': 8*(1024**8)
        }
        "Values of all the units in bits"
        self.name_conversion_table = {
            'bit': 'b', 'kilobit': 'Kb', 'megabit': 'Mb', 'gigabit': 'Gb', 'terabit': 'Tb', 'petabit': 'Pb',
            'exabit': 'Eb', 'zettabit': 'Zb', 'yottabit': 'Yb', 'kibibit': 'Kib', 'mebibit': 'Mib', 'gibibit': 'Gib',
            'tebibit': 'Tib', 'pebibit': 'Pib', 'exbibit': 'Eib', 'zebibit': 'Zib', 'yobibit': 'Yib',
            'byte': 'B', 'kilobyte': 'KB', 'megabyte': 'MB', 'gigabyte': 'GB', 'terabyte': 'TB', 'petabyte': 'PB',
            'exabyte': 'EB', 'zettabyte': 'ZB', 'yottabyte': 'YB', 'kibibyte': 'KiB', 'mebibyte': 'MiB',
            'gibibyte': 'GiB', 'tebibyte': 'TiB', 'pebibyte': 'PiB', 'exbibyte': 'EiB', 'zebibyte': 'ZiB',
            'yobibyte': 'YiB'
        }
        self.storage_units = [u for u in list(StorageUnits.__dict__.keys()) if not u.startswith('__')]

    def get_conversion(self, value: float, from_type: str) -> StorageUnits:
        if from_type in list(self.name_conversion_table.values()):
            from_type_bit_value = self.bit_conversion_value_table[from_type]
        elif from_type in list(self.name_conversion_table.keys()):
            from_type = self.name_conversion_table[from_type]
            from_type_bit_value = self.bit_conversion_value_table[from_type]
        else:
            raise KeyError(f'Invalid storage unit type "{from_type}"')

        value = value * from_type_bit_value

        for i in self.storage_units:
            self.storage.__setattr__(i, value / self.bit_conversion_value_table[i])
        return self.storage


if __name__ == '__main__':
    c = DigitalStorageConverter()
    s = c.get_conversion(5000, 'KiB')
    print(s.KB, s.MB, s.TB)   # , ..., ..., etc till whatever you may want

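If the number is too large, this program will give you the answer in exponential form.

Note: please correct the names of the storage values if you find any of them incorrect.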
