java - Why does Apache Commons consider '१२३' numeric?

Question

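According to the Apache Commons Lang documentation for StringUtils.isNumeric(), the string '१२३' is numeric.

Since I thought this might be a mistake in the documentation, I ran tests to verify the statement. I found that, according to Apache Commons, it is indeed numeric.

Why is this string numeric? What do those characters represent?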
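
For readers without Commons Lang on the classpath, the behaviour in question can be reproduced with the JDK alone. The sketch below is not the library's code: `isAllUnicodeDigits` is a hypothetical stand-in, written on the assumption that StringUtils.isNumeric() accepts a non-empty string whose characters all satisfy Character.isDigit.

```java
class IsNumericSketch {
    // Hypothetical stand-in for StringUtils.isNumeric (an assumption, not the
    // library source): true when the input is non-empty and every char is a
    // Unicode decimal digit according to Character.isDigit.
    public static boolean isAllUnicodeDigits(CharSequence cs) {
        if (cs == null || cs.length() == 0) {
            return false;
        }
        for (int i = 0; i < cs.length(); i++) {
            if (!Character.isDigit(cs.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file ASCII-only.
        String devanagari = "\u0967\u0968\u0969"; // Devanagari digits 1, 2, 3

        System.out.println(isAllUnicodeDigits("123"));      // true
        System.out.println(isAllUnicodeDigits(devanagari)); // true
        System.out.println(isAllUnicodeDigits("12.3"));     // false
    }
}
```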

score 199 · Accepted Answer

Answered 2016-10-20T08:03:33.737

score 59 · Accepted Answer

Answered 2016-10-20T08:01:40.887

score 26 · Accepted Answer

You can use Character#getType to check the character's general category:

System.out.println(Character.DECIMAL_DIGIT_NUMBER == Character.getType('१'));

This prints true, which is evidence that '१' is a decimal digit.

Now let's examine the Unicode code point of the '१' character:

System.out.println(Integer.toHexString('१'));
// 967

This value is in the range of the Devanagari digits, which is \u0966 through \u096F.

Also try:

Character.UnicodeBlock block = Character.UnicodeBlock.of('१');
System.out.println(block.toString());
// DEVANAGARI

Devanagari is "an abugida (alphasyllabary) alphabet of India and Nepal".

"१२३" is "123" written with Devanagari digits instead of the Basic Latin (ASCII) digits.
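
To see that the two spellings denote the same number, here is a small JDK-only sketch. Integer.parseInt accepts any character that Character.digit recognises as a decimal digit, so the Devanagari string parses to the same int as the ASCII one. The class and method names are made up for illustration.

```java
class DevanagariParse {
    // Parses a string of decimal digits (from any script the JDK knows) to an int.
    public static int toInt(String digits) {
        return Integer.parseInt(digits);
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file ASCII-only.
        String devanagari = "\u0967\u0968\u0969"; // Devanagari digits 1, 2, 3

        System.out.println(toInt(devanagari));                   // 123
        System.out.println(Character.getNumericValue('\u0967')); // 1
        System.out.println(toInt(devanagari) == toInt("123"));   // true
    }
}
```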


score 23 · Accepted Answer

If you ever want to know what properties a particular "character" has (and there are quite a few), go directly to the source: Unicode.org. They have research tools that can show you most anything you would care to know.

If you want to see all of the properties of a specific character, try the following:

http://unicode.org/cldr/utility/character.jsp?a=१

or:

http://unicode.org/cldr/utility/character.jsp?a=%E0%A5%A7
If you want to see all characters classified as "decimal digits" (i.e. with number values of 0 through 9), try the following:

http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:Numeric_Type=Decimal:]
(550 code points as of Unicode 9.0)

If you want to see all characters classified as "non-decimal digit numbers" (i.e. fractions, circled, etc), try the following:

http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:Numeric_Type=Numeric:]
(836 code points as of Unicode 9.0)

If you want to see all characters classified as "decimal digits" (i.e. with number values of 0 through 9), but only up through Unicode 6.0 (which .NET uses), try the following:

http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:Numeric_Type=Decimal:]%26[:Age=6.0:]
(420 code points, and that shouldn't change)

If you want to see all characters classified as "decimal digits" (i.e. with number values of 0 through 9), but only up through Unicode 6.0 (which .NET uses), and only in the Basic Multilingual Plane, with no supplementary characters (i.e. nothing above code point 65535 / U+FFFF), try the following:

http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:Numeric_Type=Decimal:]%26[:Age=6.0:]%26[:bmp=Yes:]
(350 code points, and that shouldn't change)
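
A similar BMP-only count can be approximated locally in Java. This is a rough sketch with made-up class and method names: it walks code points 0 through 0xFFFF and counts those whose general category is DECIMAL_DIGIT_NUMBER. The result depends on the Unicode version bundled with the running JDK, so it will not necessarily match the figures quoted above.

```java
class BmpDecimalDigits {
    // True when the code point's general category is Nd (decimal digit number).
    public static boolean isDecimalDigit(int codePoint) {
        return Character.getType(codePoint) == Character.DECIMAL_DIGIT_NUMBER;
    }

    // Counts decimal digits in the Basic Multilingual Plane (U+0000..U+FFFF).
    public static int countBmpDecimalDigits() {
        int count = 0;
        for (int cp = 0; cp <= 0xFFFF; cp++) {
            if (isDecimalDigit(cp)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // The printed number varies with the JDK's Unicode data.
        System.out.println("BMP decimal digits: " + countBmpDecimalDigits());
        System.out.println("U+0967 is a decimal digit: " + isDecimalDigit(0x0967));
    }
}
```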

KEEP IN MIND: The Unicode Consortium produces a specification, not software. This means that it is up to each software vendor to implement the specification as accurately as they can. So, just like HTML, JavaScript, CSS, SQL, etc., there is variation between different platforms, languages, and so on. For example, I found a bug in Microsoft's .NET Framework whereby the circled Latin letters A-Z and a-z (code points 0x24B6 through 0x24E9) do not properly register as letters, i.e. char.IsLetter does not return true for them (bug report here). That leads to unexpected behavior in related functionality, such as when calling the TextInfo.ToTitleCase() method (bug report here).

score 19 · Accepted Answer

The symbols '१२३' come from Hindi (and ultimately from Sanskrit, both written in the Devanagari script). They represent numeric values, for example:

१ represents 1

२ represents 2

and likewise for the other digits.
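
The digit-by-digit correspondence can be applied mechanically. The following sketch uses a hypothetical helper, `toAsciiDigits`, to rewrite every Unicode decimal digit in a string as its ASCII counterpart and leave all other characters untouched. It relies only on Character.isDigit and Character.digit from the JDK.

```java
class DigitNormalizer {
    // Replaces each Unicode decimal digit with the matching ASCII digit '0'..'9'.
    // Characters that are not decimal digits are copied through unchanged.
    public static String toAsciiDigits(String input) {
        StringBuilder out = new StringBuilder(input.length());
        input.codePoints().forEach(cp -> {
            if (Character.isDigit(cp)) {
                out.append((char) ('0' + Character.digit(cp, 10)));
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file ASCII-only.
        System.out.println(toAsciiDigits("\u0967\u0968\u0969")); // 123
        System.out.println(toAsciiDigits("a\u0968b"));           // a2b
    }
}
```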
