From RFC 5646 / BCP 47:
    Language-Tag  = langtag             ; normal language tags
                  / privateuse          ; private use tag
                  / grandfathered       ; grandfathered tags
    langtag       = language
                    ["-" script]
                    ["-" region]
                    *("-" variant)
                    *("-" extension)
                    ["-" privateuse]
    language      = 2*3ALPHA            ; shortest ISO 639 code
                    ["-" extlang]       ; sometimes followed by
                                        ; extended language subtags
                  / 4ALPHA              ; or reserved for future use
                  / 5*8ALPHA            ; or registered language subtag
    privateuse    = "x" 1*("-" (1*8alphanum))
    grandfathered = irregular           ; non-redundant tags registered
                  / regular             ; during the RFC 3066 era
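It looks like the first segment of most BCP 47 codes should be a valid ISO 639 code, although it may not be the three-letter variant. Some BCP 47 language codes are not ISO 639 codes: namely those beginning with x- or i-, as well as a number of legacy codes that match the grandfathered portion of the grammar: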
    irregular     = "en-GB-oed"         ; irregular tags do not match
                  / "sgn-BE-FR"         ; also includes i- prefixed codes
                  / "sgn-BE-NL"
                  / "sgn-CH-DE"
    regular       = "art-lojban"        ; these tags match the 'langtag'
                  / "cel-gaulish"       ; production, but their subtags
                  / "no-bok"            ; are not extended language
                  / "no-nyn"            ; or variant subtags: their meaning
                  / "zh-guoyu"          ; is defined by their registration
                  / "zh-hakka"          ; and all of these are deprecated
                  / "zh-min"            ; in favor of a more modern
                  / "zh-min-nan"        ; subtag or sequence of subtags
                  / "zh-xiang"
A good start would be something like this:
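    def extract_iso_code(bcp_identifier):
        # partition() also copes with tags that contain no hyphen, such as 'en'
        language, _, _ = bcp_identifier.partition('-')
        if 2 <= len(language) <= 3:
            # this is a valid ISO-639 code or is grandfathered
            return language.lower()
        # handle non-ISO codes (x-, i-, and 4 to 8 letter subtags)
        raise ValueError(bcp_identifier)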
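The private-use and grandfathered cases could be screened out before the length check. The following is only a sketch: `classify_tag` and `GRANDFATHERED` are hypothetical names, and the set contains only the tags quoted in the grammar excerpt above, so the i- prefixed tags are recognised by prefix rather than listed one by one. Comparison is done in lowercase because BCP 47 tags are case-insensitive.

```python
# Tags from the 'irregular' and 'regular' productions quoted above, lowercased.
# Assumption: the excerpt abridges the i- prefixed tags, so they are not listed
# here and are matched by their prefix instead.
GRANDFATHERED = frozenset({
    "en-gb-oed", "sgn-be-fr", "sgn-be-nl", "sgn-ch-de",
    "art-lojban", "cel-gaulish", "no-bok", "no-nyn",
    "zh-guoyu", "zh-hakka", "zh-min", "zh-min-nan", "zh-xiang",
})


def classify_tag(tag):
    """Return 'privateuse', 'grandfathered', or 'langtag' for a BCP 47 tag."""
    lowered = tag.lower()
    first = lowered.partition("-")[0]
    if first == "x":
        return "privateuse"
    if first == "i" or lowered in GRANDFATHERED:
        return "grandfathered"
    return "langtag"
```

Only tags classified as 'langtag' would then be worth passing on to extract_iso_code; the other two kinds need their own handling.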
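Converting from the two-letter variant to the three-letter variant should be easy to handle, since the mapping between the two is well known.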
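As a rough illustration of that conversion, a lookup table is enough. The names `ISO_639_1_TO_3` and `to_three_letter` are made up for this sketch, and the table holds only a handful of entries; a real one would be loaded from the published ISO 639 code tables.

```python
# Illustrative subset of the two-letter (ISO 639-1) to three-letter
# (ISO 639-3) mapping. Assumption: a complete table comes from elsewhere.
ISO_639_1_TO_3 = {
    "en": "eng",
    "fr": "fra",
    "de": "deu",
    "es": "spa",
    "zh": "zho",
    "ja": "jpn",
}


def to_three_letter(code):
    """Return the three-letter code for a two- or three-letter language code."""
    code = code.lower()
    if len(code) == 3:
        # Already three letters; returned unchanged, not validated.
        return code
    try:
        return ISO_639_1_TO_3[code]
    except KeyError:
        raise ValueError(f"no three-letter mapping known for {code!r}") from None
```

For example, to_three_letter(extract_iso_code('en-US')) would give 'eng' with the table above.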