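How can I find information about a Unicode character, for example the character set (script) it belongs to, in JavaScript?
For example:
00e9 LATIN SMALL LETTER E WITH ACUTE
0bf2 TAMIL NUMBER ONE THOUSAND
I know a way to look up details about a Unicode code point in Python using the unicodedata library. Is there a way to find this information in JS?
PS: I am using this for Chrome extension development, so a solution that uses Chrome's APIs is fine too.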
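English text is dominated by code points from the Latin, Common, and Inherited scripts, and in some corpora from Greek as well.
For example, the PubMed Open Access collection is a very large body of all-English text that is full of non-ASCII code points. 90% of those are accounted for by just 36 distinct code points, shown here: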
rank percent cumulative code glyph GC=?? Name
---------------------------------------------------------------------
1 18.553% 18.553% U+02013 ‹–› GC=Pd EN DASH
2 7.422% 25.974% U+000A0 ‹ › GC=Zs NO-BREAK SPACE
3 7.033% 33.007% U+000B1 ‹±› GC=Sm PLUS-MINUS SIGN
4 5.461% 38.469% U+02212 ‹−› GC=Sm MINUS SIGN
5 4.196% 42.664% U+02003 ‹ › GC=Zs EM SPACE
6 3.682% 46.346% U+003BC ‹μ› GC=Ll GREEK SMALL LETTER MU
7 3.619% 49.965% U+003B2 ‹β› GC=Ll GREEK SMALL LETTER BETA
8 3.568% 53.534% U+003B1 ‹α› GC=Ll GREEK SMALL LETTER ALPHA
9 3.426% 56.959% U+0200A ‹ › GC=Zs HAIR SPACE
10 3.221% 60.181% U+000B0 ‹°› GC=So DEGREE SIGN
11 2.931% 63.112% U+02009 ‹ › GC=Zs THIN SPACE
12 2.620% 65.732% U+02019 ‹’› GC=Pf RIGHT SINGLE QUOTATION MARK
13 2.506% 68.238% U+02032 ‹′› GC=Po PRIME
14 2.441% 70.679% U+000D7 ‹×› GC=Sm MULTIPLICATION SIGN
15 2.042% 72.722% U+0201D ‹”› GC=Pf RIGHT DOUBLE QUOTATION MARK
16 2.039% 74.761% U+0201C ‹“› GC=Pi LEFT DOUBLE QUOTATION MARK
17 1.536% 76.296% U+00394 ‹Δ› GC=Lu GREEK CAPITAL LETTER DELTA
18 1.415% 77.712% U+000B5 ‹µ› GC=Ll MICRO SIGN
19 1.337% 79.049% U+003B3 ‹γ› GC=Ll GREEK SMALL LETTER GAMMA
20 1.210% 80.259% U+000E9 ‹é› GC=Ll LATIN SMALL LETTER E WITH ACUTE
21 1.152% 81.410% U+02014 ‹—› GC=Pd EM DASH
22 1.135% 82.546% U+02018 ‹‘› GC=Pi LEFT SINGLE QUOTATION MARK
23 0.998% 83.543% U+000A9 ‹©› GC=So COPYRIGHT SIGN
24 0.710% 84.253% U+02265 ‹≥› GC=Sm GREATER-THAN OR EQUAL TO
25 0.600% 84.853% U+000F6 ‹ö› GC=Ll LATIN SMALL LETTER O WITH DIAERESIS
26 0.599% 85.452% U+000B7 ‹·› GC=Po MIDDLE DOT
27 0.597% 86.049% U+02022 ‹•› GC=Po BULLET
28 0.594% 86.644% U+0223C ‹∼› GC=Sm TILDE OPERATOR
29 0.573% 87.217% U+003BA ‹κ› GC=Ll GREEK SMALL LETTER KAPPA
30 0.569% 87.785% U+000FC ‹ü› GC=Ll LATIN SMALL LETTER U WITH DIAERESIS
31 0.493% 88.278% U+02264 ‹≤› GC=Sm LESS-THAN OR EQUAL TO
32 0.440% 88.718% U+000AE ‹®› GC=So REGISTERED SIGN
33 0.433% 89.152% U+000E4 ‹ä› GC=Ll LATIN SMALL LETTER A WITH DIAERESIS
34 0.422% 89.573% U+02020 ‹†› GC=Po DAGGER
35 0.407% 89.980% U+003B4 ‹δ› GC=Ll GREEK SMALL LETTER DELTA
One way to detect these is to use a Unicode regular expression stating that every character must come from the Latin, Greek, Common, or Inherited script.
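A minimal sketch of such a check in JavaScript, assuming an engine that implements Unicode property escapes (standardized in ES2018) together with the `u` flag. The names `FOUR_SCRIPTS` and `usesOnlyCommonScripts` are illustrative and do not come from the original answer.

```javascript
// Sketch: test whether every code point in a string belongs to the
// Latin, Greek, Common, or Inherited script.
// Assumes ES2018 Unicode property escapes; the `u` flag is required.
const FOUR_SCRIPTS =
  /^[\p{Script=Latin}\p{Script=Greek}\p{Script=Common}\p{Script=Inherited}]*$/u;

// Hypothetical helper name, chosen for this example only.
function usesOnlyCommonScripts(text) {
  return FOUR_SCRIPTS.test(text);
}

console.log(usesOnlyCommonScripts("café ± 5 μm")); // true
console.log(usesOnlyCommonScripts("Фаза 1"));      // false (Cyrillic)
```

Note that the Script property, not the block a code point sits in, is what gets tested: spaces, digits, and most punctuation are Common, and combining marks such as U+0301 are Inherited.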
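In this corpus, those four scripts cover more than 99% of the code points. However, the dataset also contains a great many ultra-low-frequency code points that fall outside those four scripts (such as Cyrillic, Hangul, and Kana). If you restrict input to the four very common scripts listed above, you will discard these as false positives. There are 239 such distinct code points in this dataset, and the 50 most common of them are: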
rank percent cumulative code glyph GC=?? Name
---------------------------------------------------------------------
295 0.002% 99.828% U+00424 ‹Ф› GC=Lu CYRILLIC CAPITAL LETTER EF
381 0.001% 99.916% U+0043A ‹к› GC=Ll CYRILLIC SMALL LETTER KA
454 0.000% 99.949% U+00413 ‹Г› GC=Lu CYRILLIC CAPITAL LETTER GHE
491 0.000% 99.959% U+0AD6D ‹국› GC=Lo HANGUL SYLLABLE GUG
499 0.000% 99.961% U+003EC ‹Ϭ› GC=Lu COPTIC CAPITAL LETTER SHIMA
513 0.000% 99.965% U+00406 ‹І› GC=Lu CYRILLIC CAPITAL LETTER BYELORUSSIAN-UKRAINIAN I
528 0.000% 99.968% U+00416 ‹Ж› GC=Lu CYRILLIC CAPITAL LETTER ZHE
534 0.000% 99.969% U+00430 ‹а› GC=Ll CYRILLIC SMALL LETTER A
539 0.000% 99.970% U+0041F ‹П› GC=Lu CYRILLIC CAPITAL LETTER PE
545 0.000% 99.971% U+00421 ‹С› GC=Lu CYRILLIC CAPITAL LETTER ES
553 0.000% 99.972% U+0D55C ‹한› GC=Lo HANGUL SYLLABLE HAN
555 0.000% 99.972% U+00404 ‹Є› GC=Lu CYRILLIC CAPITAL LETTER UKRAINIAN IE
566 0.000% 99.974% U+0C5B4 ‹어› GC=Lo HANGUL SYLLABLE EO
567 0.000% 99.974% U+0041A ‹К› GC=Lu CYRILLIC CAPITAL LETTER KA
568 0.000% 99.974% U+0041B ‹Л› GC=Lu CYRILLIC CAPITAL LETTER EL
571 0.000% 99.975% U+0B2C8 ‹니› GC=Lo HANGUL SYLLABLE NI
575 0.000% 99.975% U+0AE4C ‹까› GC=Lo HANGUL SYLLABLE GGA
578 0.000% 99.976% U+00428 ‹Ш› GC=Lu CYRILLIC CAPITAL LETTER SHA
579 0.000% 99.976% U+00454 ‹є› GC=Ll CYRILLIC SMALL LETTER UKRAINIAN IE
585 0.000% 99.977% U+00418 ‹И› GC=Lu CYRILLIC CAPITAL LETTER I
587 0.000% 99.977% U+0B2E4 ‹다› GC=Lo HANGUL SYLLABLE DA
600 0.000% 99.978% U+00440 ‹р› GC=Ll CYRILLIC SMALL LETTER ER
610 0.000% 99.980% U+00457 ‹ї› GC=Ll CYRILLIC SMALL LETTER YI
614 0.000% 99.980% U+0C74C ‹음› GC=Lo HANGUL SYLLABLE EUM
623 0.000% 99.981% U+0BD80 ‹부› GC=Lo HANGUL SYLLABLE BU
624 0.000% 99.981% U+0C545 ‹악› GC=Lo HANGUL SYLLABLE AG
625 0.000% 99.981% U+0C778 ‹인› GC=Lo HANGUL SYLLABLE IN
640 0.000% 99.982% U+0C5D0 ‹에› GC=Lo HANGUL SYLLABLE E
641 0.000% 99.983% U+0C744 ‹을› GC=Lo HANGUL SYLLABLE EUL
645 0.000% 99.983% U+00438 ‹и› GC=Ll CYRILLIC SMALL LETTER I
664 0.000% 99.984% U+0041C ‹М› GC=Lu CYRILLIC CAPITAL LETTER EM
665 0.000% 99.984% U+00436 ‹ж› GC=Ll CYRILLIC SMALL LETTER ZHE
674 0.000% 99.985% U+0C774 ‹이› GC=Lo HANGUL SYLLABLE I
678 0.000% 99.985% U+00431 ‹б› GC=Ll CYRILLIC SMALL LETTER BE
679 0.000% 99.986% U+00435 ‹е› GC=Ll CYRILLIC SMALL LETTER IE
689 0.000% 99.986% U+0B300 ‹대› GC=Lo HANGUL SYLLABLE DAE
690 0.000% 99.986% U+0BD84 ‹분› GC=Lo HANGUL SYLLABLE BUN
691 0.000% 99.986% U+0C678 ‹외› GC=Lo HANGUL SYLLABLE OE
696 0.000% 99.987% U+005DB ‹כ› GC=Lo HEBREW LETTER KAF
703 0.000% 99.987% U+0B85C ‹로› GC=Lo HANGUL SYLLABLE RO
711 0.000% 99.988% U+0041D ‹Н› GC=Lu CYRILLIC CAPITAL LETTER EN
712 0.000% 99.988% U+004D9 ‹ә› GC=Ll CYRILLIC SMALL LETTER SCHWA
725 0.000% 99.988% U+0B294 ‹는› GC=Lo HANGUL SYLLABLE NEUN
726 0.000% 99.988% U+0B9CC ‹만› GC=Lo HANGUL SYLLABLE MAN
727 0.000% 99.988% U+0C11C ‹서› GC=Lo HANGUL SYLLABLE SEO
728 0.000% 99.989% U+0C2B5 ‹습› GC=Lo HANGUL SYLLABLE SEUB
729 0.000% 99.989% U+0C601 ‹영› GC=Lo HANGUL SYLLABLE YEONG
741 0.000% 99.989% U+00441 ‹с› GC=Ll CYRILLIC SMALL LETTER ES
742 0.000% 99.989% U+00444 ‹ф› GC=Ll CYRILLIC SMALL LETTER EF
743 0.000% 99.989% U+004B0 ‹Ұ› GC=Lu CYRILLIC CAPITAL LETTER STRAIGHT U WITH STROKE
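Of these 239 distinct trans-ASCII code points, 59 also lie outside Unicode's Basic Multilingual Plane, so any processing must be able to handle the full Unicode range. All but one of them are mathematical letters. Here are the top 20: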
rank percent cumulative code glyph GC=?? Name
---------------------------------------------------------------------
227 0.004% 99.660% U+1D49E ‹𝒞› GC=Lu MATHEMATICAL SCRIPT CAPITAL C
240 0.003% 99.704% U+1D4AF ‹𝒯› GC=Lu MATHEMATICAL SCRIPT CAPITAL T
252 0.003% 99.738% U+1D4AE ‹𝒮› GC=Lu MATHEMATICAL SCRIPT CAPITAL S
275 0.002% 99.791% U+1D49F ‹𝒟› GC=Lu MATHEMATICAL SCRIPT CAPITAL D
279 0.002% 99.799% U+1D4B3 ‹𝒳› GC=Lu MATHEMATICAL SCRIPT CAPITAL X
289 0.002% 99.818% U+1D4A9 ‹𝒩› GC=Lu MATHEMATICAL SCRIPT CAPITAL N
291 0.002% 99.821% U+1D4AB ‹𝒫› GC=Lu MATHEMATICAL SCRIPT CAPITAL P
292 0.002% 99.823% U+1D4A2 ‹𝒢› GC=Lu MATHEMATICAL SCRIPT CAPITAL G
313 0.001% 99.854% U+1D49C ‹𝒜› GC=Lu MATHEMATICAL SCRIPT CAPITAL A
316 0.001% 99.858% U+1D53C ‹𝔼› GC=Lu MATHEMATICAL DOUBLE-STRUCK CAPITAL E
341 0.001% 99.884% U+1D4AA ‹𝒪› GC=Lu MATHEMATICAL SCRIPT CAPITAL O
430 0.000% 99.941% U+1D4A5 ‹𝒥› GC=Lu MATHEMATICAL SCRIPT CAPITAL J
450 0.000% 99.948% U+1D4A6 ‹𝒦› GC=Lu MATHEMATICAL SCRIPT CAPITAL K
458 0.000% 99.950% U+1D4B1 ‹𝒱› GC=Lu MATHEMATICAL SCRIPT CAPITAL V
461 0.000% 99.951% U+1D4B2 ‹𝒲› GC=Lu MATHEMATICAL SCRIPT CAPITAL W
468 0.000% 99.953% U+1D4B4 ‹𝒴› GC=Lu MATHEMATICAL SCRIPT CAPITAL Y
469 0.000% 99.954% U+1D4B5 ‹𝒵› GC=Lu MATHEMATICAL SCRIPT CAPITAL Z
500 0.000% 99.962% U+1D4B0 ‹𝒰› GC=Lu MATHEMATICAL SCRIPT CAPITAL U
518 0.000% 99.966% U+1D4AC ‹𝒬› GC=Lu MATHEMATICAL SCRIPT CAPITAL Q
560 0.000% 99.973% U+1D54A ‹𝕊› GC=Lu MATHEMATICAL DOUBLE-STRUCK CAPITAL S
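Handling the full range matters in JavaScript because strings are sequences of UTF-16 code units, so each code point outside the Basic Multilingual Plane occupies two units. The following is a small sketch of code-point-aware iteration; `codePoints` and `isOutsideBMP` are names made up for this illustration.

```javascript
// Sketch: iterate by code point rather than by UTF-16 code unit.
// String indexing and .length count code units, so astral characters
// (above U+FFFF) take two units each.
function codePoints(text) {
  // The string iterator (used by spread and for...of) yields whole code points.
  return [...text].map((ch) => ch.codePointAt(0));
}

function isOutsideBMP(codePoint) {
  return codePoint > 0xffff;
}

const sample = "\u{1D49E}x"; // MATHEMATICAL SCRIPT CAPITAL C followed by "x"
console.log(sample.length);                         // 3 (UTF-16 code units)
console.log(codePoints(sample).length);             // 2 (code points)
console.log(codePoints(sample).map(isOutsideBMP));  // [ true, false ]
```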
Other corpora will differ. You have to know your dataset.
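One way to get to know a dataset is to tally its non-ASCII code points in the same rank, percent, and cumulative layout as the tables above. This is only a sketch under simple assumptions (the whole text fits in memory, and "non-ASCII" means anything above U+007F); `profileNonAscii` is a hypothetical name, not the tool that produced the tables in this answer.

```javascript
// Sketch: rank the non-ASCII code points of a text by frequency.
function profileNonAscii(text) {
  const counts = new Map();
  let total = 0;
  for (const ch of text) {               // for...of iterates by code point
    const cp = ch.codePointAt(0);
    if (cp <= 0x7f) continue;            // skip ASCII
    counts.set(cp, (counts.get(cp) || 0) + 1);
    total += 1;
  }
  let cumulative = 0;
  return [...counts.entries()]
    // Most frequent first; ties are broken by code point value.
    .sort((a, b) => b[1] - a[1] || a[0] - b[0])
    .map(([cp, count], i) => {
      const percent = (100 * count) / total;
      cumulative += percent;
      return {
        rank: i + 1,
        percent,
        cumulative,
        code: "U+" + cp.toString(16).toUpperCase().padStart(5, "0"),
        glyph: String.fromCodePoint(cp),
      };
    });
}

console.table(profileNonAscii("5–10 μm ± 2 μm"));
```

The character names in the tables are not available from the JavaScript standard library; they would have to come from a data file such as the Unicode Character Database.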
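Unfortunately, this is close to impossible. It is hard to define which characters each language uses. (For example, English certainly uses many characters outside \u0000 to \u007F, such as dashes and the "é" in many words of French origin. Where do you draw the line?) The CLDR database defines some character collections for languages, but the choices made there are open to question. For many languages the collections are so large and so sparse (in terms of the Unicode code space) that any regular expression for them would be very long.
So hard-coded ranges are not even enough; you need a set of ranges plus individual characters.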
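As a rough sketch of what "ranges plus individual characters" can look like in code, the membership test below combines inclusive ranges with a set of single code points. The particular values in `ALLOWED` are placeholders picked for the example; they are not taken from CLDR and are not a recommended definition of any language's character set.

```javascript
// Sketch: an allowed set described as inclusive ranges plus single code points.
// The values are illustrative only, not an authoritative "English" set.
const ALLOWED = {
  ranges: [
    [0x0020, 0x007e], // printable ASCII
    [0x00c0, 0x00ff], // Latin-1 letters (this range also includes × and ÷)
  ],
  singles: new Set([0x2013, 0x2014, 0x2018, 0x2019, 0x201c, 0x201d]),
};

function isAllowed(cp, { ranges, singles } = ALLOWED) {
  return singles.has(cp) || ranges.some(([lo, hi]) => cp >= lo && cp <= hi);
}

// Returns the code points of `text` that are not in the allowed set.
function disallowedCodePoints(text) {
  return [...text].map((ch) => ch.codePointAt(0)).filter((cp) => !isAllowed(cp));
}

console.log(disallowedCodePoints("naïve – café")); // []
console.log(disallowedCodePoints("β-test"));       // [ 946 ]
```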
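Perhaps the most important question is: what will you do with this? The techniques have to be evaluated against that. In general, JavaScript is quite primitive and limited when it comes to internationalization.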
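Regular expressions can have powerful Unicode support: http://www.regular-expressions.info/unicode.html
But JavaScript supports these features only in versions after ES6 (Unicode property escapes such as \p{Script=Greek} were standardized in ES2018). When this was written, even Chrome had not implemented them; perhaps they will be implemented by the time you finish your code.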
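Where the engine does implement Unicode property escapes, they can also recover part of what Python's unicodedata offers. The sketch below looks up the two-letter General_Category (the GC=?? column in the tables above) by probing each category in turn. `generalCategory` is a hypothetical helper, and this approach does not give character names, which JavaScript has no built-in way to obtain.

```javascript
// Sketch: find the General_Category of a single code point by trying each
// two-letter category. Assumes ES2018 property escapes and the `u` flag.
const CATEGORIES = [
  "Lu", "Ll", "Lt", "Lm", "Lo", "Mn", "Mc", "Me", "Nd", "Nl", "No",
  "Pc", "Pd", "Ps", "Pe", "Pi", "Pf", "Po", "Sm", "Sc", "Sk", "So",
  "Zs", "Zl", "Zp", "Cc", "Cf", "Cs", "Co", "Cn",
];

// One anchored regex per category, built once up front.
const MATCHERS = CATEGORIES.map((gc) => [
  gc,
  new RegExp(`^\\p{General_Category=${gc}}$`, "u"),
]);

// Returns e.g. "Ll" for "é"; undefined if `ch` is not exactly one code point.
function generalCategory(ch) {
  const hit = MATCHERS.find(([, re]) => re.test(ch));
  return hit ? hit[0] : undefined;
}

console.log(generalCategory("é"));         // Ll
console.log(generalCategory("–"));         // Pd
console.log(generalCategory("\u{1D49E}")); // Lu
```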
Besides, even for English things are not that simple: café, naïve, coördinator.