java - A cleaner way to check whether a String is an ISO language or ISO country in Java

Question

Suppose you have a two-character String that should represent an ISO 639 language code or an ISO 3166 country code.

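As you know, the Locale class has two methods, getISOLanguages and getISOCountries, which return a String array of all ISO languages and all ISO countries, respectively.

To check whether a specific String object is a valid ISO language or ISO country, I have to look for a matching String in those arrays. OK, I can do that with a binary search (e.g. Arrays.binarySearch) or with Apache Commons ArrayUtils.contains.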

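The question is: is there any utility (e.g. from the Guava or Apache Commons libraries) that provides a cleaner way, such as a function that returns a boolean to validate a String as a valid ISO 639 language or ISO 3166 country?

For example: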

public static boolean isValidISOLanguage(String s)
public static boolean isValidISOCountry(String s)

score 29 · Accepted Answer

I wouldn't bother with either a binary search or any third-party library - a hash-based Set is perfectly fine for this:

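import java.util.Locale;
import java.util.Set;

public final class IsoUtil {
    // Built once; Locale.getISOLanguages()/getISOCountries() allocate a new array on every call
    private static final Set<String> ISO_LANGUAGES = Set.of(Locale.getISOLanguages());
    private static final Set<String> ISO_COUNTRIES = Set.of(Locale.getISOCountries());

    private IsoUtil() {}

    public static boolean isValidISOLanguage(String s) {
        // Sets created by Set.of throw NullPointerException on contains(null), so guard first
        return s != null && ISO_LANGUAGES.contains(s);
    }

    public static boolean isValidISOCountry(String s) {
        return s != null && ISO_COUNTRIES.contains(s);
    }
}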

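You could check the string length first, but I'm not sure I'd bother - at least not unless you want to protect yourself against performance attacks where you're handed enormous strings that take a long time to hash.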
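If you did want that guard, a minimal sketch might look like the following. GuardedIsoUtil is a hypothetical name, and the sketch assumes that every code returned by Locale.getISOCountries() is exactly two characters long, as the Locale documentation describes.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical variant of the utility above with a cheap pre-check.
final class GuardedIsoUtil {
    private static final Set<String> ISO_COUNTRIES = Set.of(Locale.getISOCountries());

    private GuardedIsoUtil() {}

    static boolean isValidISOCountry(String s) {
        // Reject null and wrong-length input before the string is ever hashed.
        // Assumption: all ISO country codes from Locale are two characters long.
        return s != null && s.length() == 2 && ISO_COUNTRIES.contains(s);
    }
}
```

With this in place, a call such as GuardedIsoUtil.isValidISOCountry(hugeString) returns false after a length comparison, without computing the hash of the oversized input.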

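EDIT: If you do want to use a 3rd-party library, ICU4J is the most likely contender - but it may well have a more up-to-date list than the ones supported by Locale, so you'd presumably want to move to using ICU4J everywhere.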

score 0 · Answer

As far as I know there is no such method in any library, but at least you can declare it yourself:

import static java.util.Arrays.binarySearch;
import java.util.Locale;

/**
 * Validator for country codes.
 * Uses binary search over a sorted array of country codes.
 * A country code consists of two ASCII letters, so it needs at least two bytes to represent.
 * Two bytes correspond to Java's short type, which lets us use Arrays.binarySearch(short[] a, short key).
 * Each country code is converted to a short via the countryCodeNeedle() function.
 *
 * Average throughput of this method is 246.058 ops/ms, roughly half that of a HashSet lookup (523.678 ops/ms).
 * Complexity is O(log(N)) instead of O(1) for HashSet.
 * But it uses only 520 bytes of RAM for the list of country codes, instead of 22064 bytes (> 21 KB) for a HashSet of country codes.
 */
public class CountryValidator {
  /** Sorted array of country codes converted to short */
  private static final short[] COUNTRIES_SHORT = initShortArray(Locale.getISOCountries());

  public static boolean isValidCountryCode(String countryCode) {
    if (countryCode == null || countryCode.length() != 2 || countryCodeIsNotAlphaUppercase(countryCode)) {
      return false;
    }
    short needle = countryCodeNeedle(countryCode);
    return binarySearch(COUNTRIES_SHORT, needle) >= 0;
  }

  private static boolean countryCodeIsNotAlphaUppercase(String countryCode) {
    char c1 = countryCode.charAt(0);
    if (c1 < 'A' || c1 > 'Z') {
      return true;
    }
    char c2 = countryCode.charAt(1);
    return c2 < 'A' || c2 > 'Z';
  }

  /**
   * A country code has two ASCII letters, so we need at least two bytes to represent it.
   * Two bytes fit in Java's short type, so we convert the two characters of the code to a short.
   * We could use something like:
   * short val = (short)((hi << 8) | lo);
   * But very similar logic already runs inside String.hashCode().
   * More importantly, each String object caches its hash code once it has been computed,
   * so converting a two-letter country code to a short is practically free.
   * We can rely on String's hash code because its formula is specified in the String.hashCode() API documentation.
   */
  private static short countryCodeNeedle(String countryCode) {
    return (short) countryCode.hashCode();
  }

  private static short[] initShortArray(String[] isoCountries) {
    short[] countriesShortArray = new short[isoCountries.length];
    for (int i = 0; i < isoCountries.length; i++) {
      countriesShortArray[i] = countryCodeNeedle(isoCountries[i]);
    }
    // binarySearch requires a sorted array; sort explicitly rather than rely on the order Locale returns
    java.util.Arrays.sort(countriesShortArray);
    return countriesShortArray;
  }
}

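Locale.getISOCountries() always creates a new array, so we should store the result in a static field to avoid unnecessary allocations. At the same time, a HashSet or TreeSet consumes a lot of memory, so this validator uses binary search over an array. It is a trade-off between speed and memory.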
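To make the hash-code trick less opaque, here is a small sketch of why the cast to short is safe for two-letter uppercase codes. CountryHashDemo and toShort are hypothetical names used only for illustration. The sketch relies on the documented String.hashCode() formula, which for a two-character string reduces to s[0]*31 + s[1].

```java
// Hypothetical helper showing what (short) code.hashCode() evaluates to
// for a two-character string.
final class CountryHashDemo {
    private CountryHashDemo() {}

    static short toShort(String code) {
        // Documented String.hashCode() formula for length 2: s[0]*31 + s[1].
        // For 'A'..'Z' the largest value is 'Z'*31 + 'Z' = 2880, well within short range.
        return (short) (code.charAt(0) * 31 + code.charAt(1));
    }
}
```

Because the second letter spans only 26 values, which is fewer than the multiplier 31, two different uppercase codes never map to the same short, and alphabetical order of the codes matches numeric order of the shorts. For example, "US" maps to 85*31 + 83 = 2718.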
