python - Generating character images with fonts whose names cannot be decoded correctly

Question

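I am creating images of Chinese seal script characters. I have three TrueType fonts for this task (Jin_Wen_Da_Zhuan_Ti.7z, Zhong_Guo_Long_Jin_Shi_Zhuan.7z and Zhong_Yan_Yuan_Jin_Wen.7z, for testing purposes only). In Microsoft Word, all three render the Chinese character “我” (I/me) correctly.

Here is my Python script: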

import numpy as np
from PIL import Image, ImageFont, ImageDraw, ImageChops
import itertools
import os


def grey2binary(grey, white_value=1):
    grey[np.where(grey <= 127)] = 0
    grey[np.where(grey > 127)] = white_value
    return grey


def create_testing_images(characters,
                          font_path,
                          save_to_folder,
                          sub_folder=None,
                          image_size=64):
    font_size = image_size * 2
    if sub_folder is None:
        sub_folder = os.path.split(font_path)[-1]
        sub_folder = os.path.splitext(sub_folder)[0]
    sub_folder_full = os.path.join(save_to_folder, sub_folder)
    if not os.path.exists(sub_folder_full):
        os.mkdir(sub_folder_full)
    font = ImageFont.truetype(font_path,font_size)
    bg = Image.new('L',(font_size,font_size),'white')

    for char in characters:
        img = Image.new('L',(font_size,font_size),'white')
        draw = ImageDraw.Draw(img)
        draw.text((0,0), text=char, font=font)
        diff = ImageChops.difference(img, bg)
        bbox = diff.getbbox()
        if bbox:
            img = img.crop(bbox)
            img = img.resize((image_size, image_size), resample=Image.BILINEAR)

            img_array = np.array(img)
            img_array = grey2binary(img_array, white_value=255)

            edge_top = img_array[0, range(image_size)]
            edge_left = img_array[range(image_size), 0]
            edge_bottom = img_array[image_size - 1, range(image_size)]
            edge_right = img_array[range(image_size), image_size - 1]

            criterion = sum(itertools.chain(edge_top, edge_left, 
                                           edge_bottom, edge_right))

            if criterion > 255 * image_size * 2:
                img = Image.fromarray(np.uint8(img_array))
                img.save(os.path.join(sub_folder_full, char) + '.gif')

The core of the script is this snippet:

        font = ImageFont.truetype(font_path,font_size)
        img = Image.new('L',(font_size,font_size),'white')
        draw = ImageDraw.Draw(img)
        draw.text((0,0), text=char, font=font)

For example, if you put these fonts in the folder ./fonts and call

create_testing_images(['我'], 'fonts/金文大篆体.ttf', save_to_folder='test')

the script will create ./test/金文大篆体/我.gif in your file system.

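The problem is that, although this works for the first font, 金文大篆体.ttf (in Jin_Wen_Da_Zhuan_Ti.7z), the script does not work for the other two, even though both render correctly in Microsoft Word. For 中国龙金石篆.ttf (in Zhong_Guo_Long_Jin_Shi_Zhuan.7z), it draws nothing, so bbox is None. For 中研院金文.ttf (in Zhong_Yan_Yuan_Jin_Wen.7z), it draws a black frame with no character in it, which then fails the criterion test, whose purpose is to reject all-black output.

I inspected the fonts with FontForge and found that the first font, 金文大篆体.ttf, uses the UnicodeBmp encoding, while the other two use Big5hkscs. That is not the encoding scheme of my system, which may be why my system cannot recognize their font names: the font viewer shows them as garbled text.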

I also tried to work around this by selecting the fonts through their garbled names. After installing the fonts, I tried pycairo:

import cairo

# adapted from
# http://heuristically.wordpress.com/2011/01/31/pycairo-hello-world/

# setup a place to draw
surface = cairo.ImageSurface(cairo.FORMAT_ARGB32, 100, 100)
ctx = cairo.Context (surface)

# paint background
ctx.set_source_rgb(1, 1, 1)
ctx.rectangle(0, 0, 100, 100)
ctx.fill()

# draw text
ctx.select_font_face('金文大篆体')
ctx.set_font_size(80)
ctx.move_to(12,80)
ctx.set_source_rgb(0, 0, 0)
ctx.show_text('我')

# finish up
ctx.stroke() # commit to surface
surface.write_to_png('我.png')  # write_to_png always produces PNG data

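This again works for 金文大篆体.ttf (in Jin_Wen_Da_Zhuan_Ti.7z), but still not for the other two. For example, neither ctx.select_font_face('中國龍金石篆') (which reports _cairo_win32_scaled_font_ucs4_to_index:GetGlyphIndicesW) nor ctx.select_font_face('¤¤°êÀsª÷¥Û½f') (which draws with the default font) works. (The latter name is the garbled text that the font viewer displays. I obtained it with one line of Mathematica code, ToCharacterCode["中國龍金石篆", "CP950"] // FromCharacterCode, where CP950 is the code page for Big5.)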
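For readers without Mathematica, the same garbled name can be reproduced in Python. This is only a sketch: it assumes Python 3 with the standard cp950 codec, and the variable names are illustrative. It encodes the Traditional Chinese name to CP950 bytes and then reads each byte as one Latin-1 character, which should give the string the font viewer displays.

```python
# Reproduce the mojibake: Big5 (CP950) bytes read as if each byte were
# one Latin-1 character.
name = '中國龍金石篆'
big5_bytes = name.encode('cp950')        # two bytes per character here
garbled = big5_bytes.decode('latin-1')   # one character per byte
print(garbled)

# The mapping is lossless, so the real name can be recovered again.
recovered = garbled.encode('latin-1').decode('cp950')
print(recovered == name)
```

Since the round trip is lossless, a garbled name copied from a font viewer can be turned back into the real name the same way.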

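I think I have tried everything I can, but I still cannot solve this. I have also considered other approaches, such as renaming the fonts with FontForge or changing the system encoding to Big5, but I would prefer a Python-only solution so that users need fewer extra steps. Any hints would be greatly appreciated. Thank you.

To the stackoverflow moderators: this problem may seem "too localized" at first glance, but it can occur with other languages, other encodings and other fonts, and the solution can be generalized to those cases, so please do not close it for that reason. Thank you.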

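Update: Oddly, Mathematica can recognize the font name in CP936 (GBK, which can be regarded as my system encoding), for example for 中国龙金石篆.ttf (in Zhong_Guo_Long_Jin_Shi_Zhuan.7z).

However, setting ctx.select_font_face('ÖÐøý½ðÊ¯*') does not work either: it creates the character image with the default font.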

score 7 · Accepted Answer

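Silvia's comment on the OP...

You may want to consider specifying the encoding parameter, e.g. ImageFont.truetype(font_path,font_size,encoding="big5")

...gets you halfway there, but it looks like you also have to translate the Unicode characters manually if you are not using a Unicode font.

For fonts that use the "big5hkscs" encoding, I had to do this...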

>>> u = u'\u6211'      # Unicode for 我
>>> u.encode('big5hkscs')
'\xa7\xda'

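...and then use u'\ua7da' to get the right glyph, which is a bit odd, but it looks like the only way to pass a multibyte character to PIL.

The following code works for me on both Python 2.7.4 and Python 3.3.1 with PIL 1.1.7...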

from PIL import Image, ImageDraw, ImageFont


# Declare font files and encodings
FONT1 = ('Jin_Wen_Da_Zhuan_Ti.ttf',          'unicode')
FONT2 = ('Zhong_Guo_Long_Jin_Shi_Zhuan.ttf', 'big5hkscs')
FONT3 = ('Zhong_Yan_Yuan_Jin_Wen.ttf',       'big5hkscs')


# Declare a mapping from encodings used by str.encode() to encodings used by
# the FreeType library
ENCODING_MAP = {'unicode':   'unic',
                'big5':      'big5',
                'big5hkscs': 'big5',
                'shift-jis': 'sjis'}


# The glyphs we want to draw
GLYPHS = ((FONT1, u'\u6211'),
          (FONT2, u'\u6211'),
          (FONT3, u'\u6211'),
          (FONT3, u'\u66ce'),
          (FONT2, u'\u4e36'))


# Returns PIL Image object
def draw_glyph(font_file, font_encoding, unicode_char, glyph_size=128):

    # Translate unicode string if necessary
    if font_encoding != 'unicode':
        mb_string = unicode_char.encode(font_encoding)
        try:
            # Try using Python 2.x's unichr
            unicode_char = unichr(ord(mb_string[0]) << 8 | ord(mb_string[1]))
        except NameError:
            # Use Python 3.x-compatible code
            unicode_char = chr(mb_string[0] << 8 | mb_string[1])

    # Load font using mapped encoding
    font = ImageFont.truetype(font_file, glyph_size, encoding=ENCODING_MAP[font_encoding])

    # Now draw the glyph
    img = Image.new('L', (glyph_size, glyph_size), 'white')
    draw = ImageDraw.Draw(img)
    draw.text((0, 0), text=unicode_char, font=font)
    return img


# Save an image for each glyph we want to draw
for (font_file, font_encoding), unicode_char in GLYPHS:
    img = draw_glyph(font_file, font_encoding, unicode_char)
    filename = '%s-%s.png' % (font_file, hex(ord(unicode_char)))
    img.save(filename)

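Note that I renamed the font files to match the names of the 7zip files. I try to avoid non-ASCII characters in code examples, since they sometimes get mangled when copied and pasted.

This example should work for the encodings declared in ENCODING_MAP, which can be extended if needed (see the FreeType encoding strings for valid FreeType encodings), but you will need to change some of the code in cases where Python's str.encode() does not produce a multibyte string of length 2.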
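One way to lift the two-byte restriction might be to pack however many bytes str.encode() returns into a single integer before turning it into a character. The following is a minimal Python 3 sketch, not part of the code above: the helper name to_font_char is hypothetical, and it assumes the font's charmap expects the bytes most significant first, as in the two-byte Big5 case.

```python
def to_font_char(unicode_char, font_encoding):
    """Return the one-character string whose code point equals the
    character's code in font_encoding (hypothetical helper, Python 3 only)."""
    if font_encoding == 'unicode':
        return unicode_char
    mb_bytes = unicode_char.encode(font_encoding)
    # Pack 1, 2 or more bytes into one integer, most significant byte first.
    charcode = int.from_bytes(mb_bytes, 'big')
    return chr(charcode)


# Two-byte case from the interpreter session earlier in this answer.
print(hex(ord(to_font_char('\u6211', 'big5hkscs'))))
# Single-byte case: ASCII characters encode to one byte.
print(hex(ord(to_font_char('A', 'big5hkscs'))))
```

Because chr() only accepts values up to 0x10FFFF, this trick cannot represent every possible three- or four-byte code; such encodings would need a different way of handing the character code to FreeType.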

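Update

If the problem is in the ttf files, how did you find the answer in the PIL and FreeType source code? Above, you seem to be saying that PIL is to blame, but why should one have to go through unicode_char.encode(...).decode(...) when you just want unicode_char?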

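As I understand it, the TrueType font format was developed before Unicode was widely adopted, so if you wanted to create a Chinese font back then, you had to use one of the encodings in use at the time, and Big5 had been the dominant encoding for Traditional Chinese since the mid-1980s.

It therefore stands to reason that there must be a way to retrieve glyphs from a Big5-encoded TTF using Big5 character codes.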

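The C code that renders a string with PIL starts in the font_render() function and eventually calls FT_Get_Char_Index() to locate the correct glyph, given the character code as an unsigned long.

However, PIL's font_getchar() function, which produces that unsigned long, only accepts Python string and unicode types. Since it does not seem to do any translation of character encodings itself, the only way to get the correct value for the Big5 character set seemed to be to coerce a Python unicode character into the correct unsigned long value, exploiting the fact that u'\ua7da' is stored internally as the integer 0xa7da, in either 16 or 32 bits depending on how Python was compiled.

TBH, there was a fair amount of guesswork involved, since I did not bother to investigate exactly what the encoding parameter of ImageFont.truetype() does. By the looks of it, though, it is not supposed to translate character encodings; rather, it lets a single TTF file support multiple character encodings for the same glyphs, using the FT_Select_Charmap() function to switch between them.

So, as I understand it, the FreeType library's interaction with TTF files works something like this...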

#!/usr/bin/env python
# -*- coding: utf-8 -*-

class TTF(object):
    glyphs = {}
    encoding_maps = {}

    def __init__(self, encoding='unic'):
        self.set_encoding(encoding)

    def set_encoding(self, encoding):
        self.current_encoding = encoding

    def get_glyph(self, charcode):
        try:
            return self.glyphs[self.encoding_maps[self.current_encoding][charcode]]
        except KeyError:
            return ' '


class MyTTF(TTF):
    glyphs = {1: '我',
              2: '曎'}
    encoding_maps = {'unic': {0x6211: 1, 0x66ce: 2},
                     'big5': {0xa7da: 1, 0x93be: 2}}


font = MyTTF()
# print() with a single argument works on both Python 2 and Python 3
print('Get via Unicode map: %s' % font.get_glyph(0x6211))
font.set_encoding('big5')
print('Get via Big5 map: %s' % font.get_glyph(0xa7da))

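...but it is up to each TTF to provide the encoding_maps variable, and there is no requirement for a TTF to provide one for Unicode. Indeed, a font created before the adoption of Unicode is unlikely to have one.

Assuming all of that is correct, there is nothing wrong with the TTF - the problem is just that PIL makes it a bit awkward to access glyphs in fonts that have no Unicode mapping and whose required glyph has an unsigned long character code greater than 255.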

score 4 · Accepted Answer

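The problem is that the fonts do not strictly conform to the TrueType specification. A quick solution is to use FontForge (which you are already using) and let it clean up the fonts.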

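Open the font file
Go to Encoding, then select Reencode
Choose ISO 10646-1 (Unicode BMP)
Go to File, then Generate Fonts
Save as TTF
Run your script with the newly generated font
Voila! It prints 我 in a beautiful font!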
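The same steps can probably be scripted with FontForge's Python module instead of the GUI. This is an untested sketch, not something from the answer above: it assumes the fontforge module that ships with FontForge is importable, that assigning to font.encoding re-encodes the font the way the Reencode menu item does, and the function and file names are made up for illustration.

```python
import os


def unicode_copy_path(font_path):
    """Build the output name, e.g. 'fonts/a.ttf' -> 'fonts/a-unicode.ttf'."""
    root, ext = os.path.splitext(font_path)
    return root + '-unicode' + ext


def reencode_to_unicode(font_path):
    """Re-encode a font as Unicode BMP and write it next to the original."""
    # Imported lazily: only available where FontForge is installed.
    import fontforge

    font = fontforge.open(font_path)
    font.encoding = 'UnicodeBmp'   # assumed counterpart of Encoding > Reencode
    out_path = unicode_copy_path(font_path)
    font.generate(out_path)        # counterpart of File > Generate Fonts
    font.close()
    return out_path
```

If the re-encoding behaves like the GUI steps, calling reencode_to_unicode('fonts/Zhong_Yan_Yuan_Jin_Wen.ttf') would write fonts/Zhong_Yan_Yuan_Jin_Wen-unicode.ttf, which could then be passed to ImageFont.truetype() without an encoding argument.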
