unicode - Finding characters with similar glyphs in Unicode?

Question

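Say I have the characters Ú, Ù, Ü. All of them are glyphically similar to the English U.

Is there some list or algorithm that can do this:

Given Ú or Ù or Ü, return the English U
Given an English U, return a list of all U-like characters

I'm not sure whether the code point of a Unicode character is the same across all fonts? If it is, I suppose there might be some simple and efficient way to do this?

Update

If you're using Ruby, there is a gem available, unicode-confusable, which may help in some cases.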

score 31 · Accepted Answer

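It is not at all clear what you are asking to do here.

Characters whose canonical decompositions all start with the same base character: e, é, ê, ë, ē, ĕ, ė, ę, ě, ȅ, ȇ, ȩ, ḕ, ḗ, ḙ, ḛ, ḝ, ẹ, ẻ, ẽ, ế, ề, ể, ễ, ệ, e̳, ... or s, ś, ŝ, ş, š, ș, ṡ, ṣ, ṥ, ṧ, ṩ, ....
Characters whose compatibility decompositions all include a particular character: ᵉ, ₑ, ℯ, ⅇ, ⒠, ⓔ, ㋍, ㋎, e, ... or s, ſ, ˢ, ẛ, ₨, ℁, ⒮, ⓢ, ㎧, ㎨, ㎮, ㎯, ㎰, ㎱, ㎲, ㎳, ㏛, ﬅ, ﬆ, s, ... or R, ᴿ, ₨, ℛ, ℜ, ℝ, Ⓡ, ㏚, Ｒ, ....
Characters that just happen to look alike in some fonts: ß and β and ϐ, or 3 and Ʒ and Ȝ and ȝ and ʒ and ӡ and ᴣ, or ɣ and ɤ and γ, or F and Ϝ and ϝ, or B and Β and В, or ∅ and ○ and 0 and O and ০ and ੦ and ౦ and ૦, or 1 and l and I and Ⅰ and ᛁ and | and ǀ and ∣, ....
Characters that are the same case-insensitively, such as s and S and ſ, or ss and Ss and SS and ß and ẞ, ....
Characters that all have the same numeric value, such as all of these, which have the value 1: 1¹111111୧111111111፩11111111፩1៱៱᠑᥇᧑᧚᪁᪑᭑᮱᱁᱑₁⅟①ꛦ⒈⓵⓵ⅰⅰꛦ㆒㆒㆒㆒㈠㊀㈠㊀
Characters that all have the same primary collation strength, such as all of these, which are the same as d: DdÐðĎďĐđ◌ͩᴰᵈᶞ◌ᷘ◌ᷙḊḋḌḍḎḏḐḑḒḓⅅⅆⅮⅾ Ⓓ ⓓ ꝹＤd . Note that some of these cannot be reached through any kind of decomposition, only through the DUCET/UCA values; for example, the fairly common ð or the newer ꝺ can be equated to d only through a primary-strength UCA comparison; the same goes for ƶ and z, ȼ and c, and so on.
Characters that are the same in certain locales, such as æ and ae, or ä and ae, or ä and aa, or MacKinley and McKinley, .... Note that locale can make a big difference, because in some locales c and ç are the same character while in others they are not; the same applies to n and ñ, or a and á and ã, ....

Some of these can be handled. Some cannot. All of them require different approaches depending on what you need.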
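
To make a few of the categories above concrete, here is a minimal Python sketch using only the standard library. The helper names (canonical_base, compat_text, same_ignoring_case, numeric_value) are made up for illustration. It covers canonical decomposition, compatibility decomposition, case-insensitive comparison and numeric value; collation-strength and locale-specific equivalence need UCA data and are not attempted here.

```python
import unicodedata as ud

def canonical_base(ch):
    """First code point of the canonical (NFD) decomposition of ch."""
    return ud.normalize('NFD', ch)[0]

def compat_text(ch):
    """Compatibility (NFKD) decomposition of ch, as a string."""
    return ud.normalize('NFKD', ch)

def same_ignoring_case(a, b):
    """Compare two strings using full Unicode case folding."""
    return a.casefold() == b.casefold()

def numeric_value(ch):
    """Numeric value of ch as a float, or None if it has none."""
    return ud.numeric(ch, None)

if __name__ == '__main__':
    print(canonical_base('\u00e9'))          # LATIN SMALL LETTER E WITH ACUTE
    print(compat_text('\u24d4'))             # CIRCLED LATIN SMALL LETTER E
    print(same_ignoring_case('\u00df', 'SS'))  # sharp s against SS
    print(numeric_value('\u2460'))           # CIRCLED DIGIT ONE
    # U+00F0 (eth) has no decomposition at all, so neither form reaches d.
    print(compat_text('\u00f0') == '\u00f0')
```

The last line illustrates the point about ð: normalization leaves it unchanged, so relating it to d would take a collation-based comparison instead.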

What is your real goal?

score 12 · Accepted Answer

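This won't work in all cases, but one way to get rid of most accents is to convert the characters to their decomposed form and then throw away the combining accents: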

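import unicodedata as ud

s = 'U, Ù, Ú, Û, Ü, Ũ, Ū, Ŭ, Ů, Ű, Ų, Ư, Ǔ, Ǖ, Ǘ, Ǚ, Ǜ, Ụ, Ủ, Ứ, Ừ, Ử, Ữ, Ự'
# NFD splits each letter into a base letter plus combining marks;
# encoding to ASCII with 'ignore' then drops every non-ASCII code point.
print(ud.normalize('NFD', s).encode('ascii', 'ignore').decode('ascii'))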

Output

U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U
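
One side effect of encoding to ASCII is that any non-ASCII base letter disappears as well. A possible variation, sketched here as an assumption rather than as part of the original answer, filters on unicodedata.combining instead, so only the combining marks are dropped. The name strip_marks is made up for illustration.

```python
import unicodedata as ud

def strip_marks(text):
    """Decompose text (NFD) and drop code points with a nonzero combining class."""
    decomposed = ud.normalize('NFD', text)
    return ''.join(ch for ch in decomposed if not ud.combining(ch))

if __name__ == '__main__':
    print(strip_marks('\u00d9, \u00da, \u00dc'))  # U with grave, acute, diaeresis
    print(strip_marks('\u03ac'))                  # Greek alpha with tonos keeps its alpha
```

Letters that have no canonical decomposition, such as ø, pass through unchanged with this approach.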

To find the accented characters, use something like this:

import string
import unicodedata as ud

def asc(ch):
    # Decompose, then keep only the ASCII code points.
    return ud.normalize('NFD', ch).encode('ascii', 'ignore').decode('ascii')

# Every code point in the Basic Multilingual Plane.
U = ''.join(chr(i) for i in range(65536))
for c in string.ascii_letters:
    print(''.join(u for u in U if asc(u) == c))

Output

aàáâãäåāăąǎǟǡǻȁȃȧḁạảấầẩẫậắằẳẵặ
bḃḅḇ
cçćĉċčḉ
dďḋḍḏḑḓ
eèéêëēĕėęěȅȇȩḕḗḙḛḝẹẻẽếềểễệ
fḟ
 :
etc.
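
The loop above rescans the whole range once per letter. If that becomes slow, a single pass that builds a lookup table is one alternative. This is only a sketch under the same assumptions (NFD plus ASCII filtering); build_index and its limit parameter are names invented for the example.

```python
import string
import unicodedata as ud
from collections import defaultdict

def build_index(limit=0x10000):
    """Map each ASCII letter to the characters whose NFD form reduces to it."""
    index = defaultdict(list)
    for cp in range(limit):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # skip surrogates, which are not characters
        ch = chr(cp)
        base = ud.normalize('NFD', ch).encode('ascii', 'ignore').decode('ascii')
        if len(base) == 1 and base in string.ascii_letters:
            index[base].append(ch)
    return index

if __name__ == '__main__':
    index = build_index()
    print(''.join(index['U']))
```

Looking up index['U'] then answers the second half of the question directly, and the reverse direction is just the asc-style reduction of a single character.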

score 5 · Accepted Answer

Why not just compare the glyphs with something like this?

package similarglyphcharacterdetector;

import java.awt.Color;
import java.awt.Font;
import java.awt.Graphics2D;
import java.awt.Rectangle;
import java.awt.font.FontRenderContext;
import java.awt.image.BufferedImage;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class SimilarGlyphCharacterDetector {

    static char[] TEST_CHARS = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz1234567890".toCharArray();
    static BufferedImage[] SAMPLES = null;

    public static BufferedImage drawGlyph(Font font, String string) {
        FontRenderContext frc = ((Graphics2D) new BufferedImage(1, 1, BufferedImage.TYPE_BYTE_GRAY).getGraphics()).getFontRenderContext();

        Rectangle r= font.getMaxCharBounds(frc).getBounds();

        BufferedImage res = new BufferedImage(r.width, r.height, BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = (Graphics2D) res.getGraphics();
        g.setBackground(Color.WHITE);
        g.fillRect(0, 0, r.width, r.height);
        g.setPaint(Color.BLACK);
        g.setFont(font);
        g.drawString(string, 0, r.height - font.getLineMetrics(string, g.getFontRenderContext()).getDescent());
        return res;
    }

    private static void drawSamples(Font f) {
        SAMPLES = new BufferedImage[TEST_CHARS.length];
        for (int i = 0; i < TEST_CHARS.length; i++)
            SAMPLES[i] = drawGlyph(f, String.valueOf(TEST_CHARS[i]));
    }

    private static int compareImages(BufferedImage img1, BufferedImage img2) {
        if (img1.getWidth() != img2.getWidth() || img1.getHeight() != img2.getHeight())
            throw new IllegalArgumentException();
        int d = 0;
        for (int y = 0; y < img1.getHeight(); y++) {
            for (int x = 0; x < img1.getWidth(); x++) {
                if (img1.getRGB(x, y) != img2.getRGB(x, y))
                    d++;
            }
        }
        return d;
    }

    private static int nearestSampleIndex(BufferedImage image, int maxDistance) {
        int best = Integer.MAX_VALUE;
        int bestIdx = -1;
        for (int i = 0; i < SAMPLES.length; i++) {
            int diff = compareImages(image, SAMPLES[i]);
            if (diff < best) {
                best = diff;
                bestIdx = i;
            }
        }
        if (best > maxDistance)
            return -1;
        return bestIdx;
    }

    public static void main(String[] args) throws Exception {
        Font f = new Font("FreeMono", Font.PLAIN, 13);
        drawSamples(f);
        HashMap<Character, StringBuilder> res = new LinkedHashMap<Character, StringBuilder>();
        for (char c : TEST_CHARS)
            res.put(c, new StringBuilder(String.valueOf(c)));
        int maxDistance = 5;
        for (int i = 0x80; i <= 0xFFFF; i++) {
            char c = (char)i;
            if (f.canDisplay(c)) {
                int n = nearestSampleIndex(drawGlyph(f, String.valueOf(c)), maxDistance);
                if (n != -1) {
                    char nc = TEST_CHARS[n];
                    res.get(nc).append(c);
                }
            }
        }
        for (Map.Entry<Character, StringBuilder> entry : res.entrySet())
            if (entry.getValue().length() > 1)
                System.out.println(entry.getValue());
    }
}

Output:

AÀÁÂÃÄÅĀĂĄǍǞȀȦΆΑΛАѦӒẠẢἈἉᾸᾹᾺᾼ₳Å
BƁƂΒБВЬḂḄḆ
CĆĈĊČƇΓЄГСὉℂⅭ
...
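
The core of the comparison is just a count of differing pixels followed by a nearest-sample search with a cutoff. The sketch below restates that idea in Python on tiny hand-written bitmaps, so it runs without any font rendering. The bitmaps and the names pixel_distance and nearest_sample are invented for illustration and are not real glyph data.

```python
def pixel_distance(a, b):
    """Count positions where two equally sized bitmaps differ."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("bitmaps must have identical dimensions")
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))

def nearest_sample(bitmap, samples, max_distance):
    """Return the name of the closest sample, or None if it is too far away."""
    best_name, best = None, None
    for name, sample in samples.items():
        d = pixel_distance(bitmap, sample)
        if best is None or d < best:
            best_name, best = name, d
    if best is None or best > max_distance:
        return None
    return best_name

# Made-up 3x5 bitmaps standing in for rendered glyphs ('#' = ink).
SAMPLES = {
    "I": ["###", ".#.", ".#.", ".#.", "###"],
    "L": ["#..", "#..", "#..", "#..", "###"],
}

if __name__ == '__main__':
    candidate = ["###", ".#.", ".#.", ".#.", ".#."]
    print(nearest_sample(candidate, SAMPLES, max_distance=5))
```

As in the Java version, the threshold decides how strict the match is, and the result depends entirely on the font used to produce the bitmaps.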
