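I want to validate a domain URL in PHP that may be in internationalized domain name (IDN) format, such as the Greek domain http://παράδειγμα.δοκιμή. Is there any way to validate it using a regular expression?
3 Answers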
If you want to create your own library, you need to use the table of permitted codepoints (IANA — Repository of IDN Practices, IDN Character Validation Guidance, IDNA Parameters) and the table of Unicode Script properties (UNIDATA/Scripts.txt).
Gmail adopts the Unicode Consortium's "Highly Restricted" specification (http://www.unicode.org/reports/tr39/#Restriction_Level_Detection), as described in "Protecting Gmail in a global world". The following combinations of Unicode Scripts are permitted.
- Single script
- Latin + Han + Hiragana + Katakana
- Latin + Han + Bopomofo
- Latin + Han + Hangul
You may need to pay attention to the special script property values (Common, Inherited, Unknown), since some characters have multiple properties or wrong properties.
For example, U+3099 (COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK) has two properties ("Katakana" and "Hiragana"), and the PCRE functions classify it as "Inherited". Another example is U+2A708. Although the right script property of U+2A708 (a combination of U+30C8 KATAKANA LETTER TO and U+30E2 KATAKANA LETTER MO) is "Katakana", the Unicode specification misclassifies it as "Han".
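The script combinations listed above can be checked with PCRE script properties. The following is only a minimal sketch: the helper names `label_scripts` and `is_allowed_script_mix` are hypothetical, only a handful of scripts are probed, and it does not implement the full "Highly Restricted" level of UTS #39. Characters classified as Common or Inherited are ignored when counting scripts, and labels containing any script outside the probed set are rejected.

```php
<?php
// Hypothetical helpers for illustration; not part of any library named above.

// Scripts probed by this sketch. Labels using any other script are rejected.
const PROBED_SCRIPTS = ['Latin', 'Greek', 'Cyrillic', 'Han', 'Hiragana',
                        'Katakana', 'Bopomofo', 'Hangul'];

// Returns the probed scripts found in $label, or null when the label is not
// valid UTF-8 or contains characters outside the probed scripts.
function label_scripts(string $label): ?array
{
    $class = '\p{Common}\p{Inherited}';
    foreach (PROBED_SCRIPTS as $name) {
        $class .= '\p{' . $name . '}';
    }
    if (preg_match('/^[' . $class . ']*\z/u', $label) !== 1) {
        return null;
    }
    $found = [];
    foreach (PROBED_SCRIPTS as $name) {
        if (preg_match('/\p{' . $name . '}/u', $label) === 1) {
            $found[] = $name;
        }
    }
    return $found;
}

// True for a single script or a subset of one of the permitted combinations.
function is_allowed_script_mix(string $label): bool
{
    $scripts = label_scripts($label);
    if ($scripts === null) {
        return false;
    }
    if (count($scripts) <= 1) {
        return true;
    }
    $allowed = [
        ['Latin', 'Han', 'Hiragana', 'Katakana'],
        ['Latin', 'Han', 'Bopomofo'],
        ['Latin', 'Han', 'Hangul'],
    ];
    foreach ($allowed as $set) {
        if (array_diff($scripts, $set) === []) {
            return true;
        }
    }
    return false;
}

var_dump(is_allowed_script_mix('δοκιμή'));
var_dump(is_allowed_script_mix('exampleпример'));
```

Depending on the PCRE version, a character such as U+3099 may be reported under a different script than expected, so treat the result as a heuristic.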
You may also need to consider IDN homograph attacks. Google Chrome's IDN policy uses a blacklist of characters.
My recommendation is to use Zend\Validator\Hostname. This library uses the table of permitted code points for Japanese and Chinese.
If you use Symfony, consider upgrading the app to version 2.5, which adopts egulias/email-validator (Manual). You need an extra check of whether the string is a well-formed byte sequence. See my report for the details.
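A well-formedness check can be done without any extra library. This is a sketch assuming the input is expected to be UTF-8; the function name `is_well_formed_utf8` is hypothetical. It relies on the fact that `preg_match` with the `u` modifier fails on byte sequences that are not valid UTF-8.

```php
<?php
// Hypothetical helper: true when $s is a well-formed UTF-8 byte sequence.
// With the /u modifier, preg_match() returns false for invalid UTF-8,
// so an empty pattern is enough to test the subject string.
function is_well_formed_utf8(string $s): bool
{
    return preg_match('//u', $s) === 1;
}

var_dump(is_well_formed_utf8("t\u{00E4}st.de"));
var_dump(is_well_formed_utf8("\xC3\x28"));
```

If the mbstring extension is available, `mb_check_encoding($s, 'UTF-8')` is an alternative.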
Don't forget XSS and SQL injection. The following address is a valid email address according to RFC 5322.
// From Japanese tutorial
// http://blog.tokumaru.org/2013/11/xsssqlrfc5322.html
"><script>alert('or/**/1=1#')</script>"@example.jp
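To illustrate why validation alone is not enough, here is a minimal sketch of escaping such an address before writing it into HTML. The variable names are illustrative only. For SQL, the usual counterpart is a parameterized query rather than string concatenation.

```php
<?php
// The address from the example above: syntactically valid, but dangerous
// if echoed into a page without escaping.
$address = '"><script>alert(\'or/**/1=1#\')</script>"@example.jp';

// Escape for an HTML context; ENT_QUOTES also converts single quotes.
$safe = htmlspecialchars($address, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

echo $safe, PHP_EOL;
```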
I doubt that idn_to_ascii is suitable for validation, since it accepts almost all characters.
// Print every code point that idn_to_ascii() accepts as a single character.
for ($i = 0; $i < 0x110000; ++$i) {
    $c = utf8_chr($i);
    if ($c !== '' && false !== idn_to_ascii($c)) {
        $number = strtoupper(dechex($i));
        $length = strlen($number);
        if ($i < 0x10000) {
            // Pad to at least four hex digits (U+0041 rather than U+41).
            $number = str_repeat('0', 4 - $length).$number;
        }
        $idn = $c.'example.com';
        echo 'U+'.$number.' ';
        echo ' '.$idn.' '. idn_to_ascii($idn);
        echo PHP_EOL;
    }
}

// Encode a code point as UTF-8; returns '' for out-of-range values and surrogates.
function utf8_chr($code_point) {
    if ($code_point < 0 || 0x10FFFF < $code_point || (0xD800 <= $code_point && $code_point <= 0xDFFF)) {
        return '';
    }
    if ($code_point < 0x80) {
        $hex[0] = $code_point;
        $ret = chr($hex[0]);
    } else if ($code_point < 0x800) {
        // Two-byte sequences start with the bits 110xxxxx (0xC0).
        $hex[0] = 0xC0 | $code_point >> 6;
        $hex[1] = 0x80 | $code_point & 0x3F;
        $ret = chr($hex[0]).chr($hex[1]);
    } else if ($code_point < 0x10000) {
        $hex[0] = 0xE0 | $code_point >> 12;
        $hex[1] = 0x80 | $code_point >> 6 & 0x3F;
        $hex[2] = 0x80 | $code_point & 0x3F;
        $ret = chr($hex[0]).chr($hex[1]).chr($hex[2]);
    } else {
        $hex[0] = 0xF0 | $code_point >> 18;
        $hex[1] = 0x80 | $code_point >> 12 & 0x3F;
        $hex[2] = 0x80 | $code_point >> 6 & 0x3F;
        $hex[3] = 0x80 | $code_point & 0x3F;
        $ret = chr($hex[0]).chr($hex[1]).chr($hex[2]).chr($hex[3]);
    }
    return $ret;
}
If you want to validate a domain by Unicode Script properties, use the PCRE functions.
The following code shows how to get the name of a character's Unicode Script property. If you want the Unicode Script properties in JavaScript, use mathiasbynens/unicode-data.
function get_unicode_script_name($c) {
    // http://php.net/manual/regexp.reference.unicode.php
    $names = [
        'Arabic', 'Armenian', 'Avestan', 'Balinese', 'Bamum', 'Batak', 'Bengali',
        'Bopomofo', 'Brahmi', 'Braille', 'Buginese', 'Buhid', 'Canadian_Aboriginal',
        'Carian', 'Chakma', 'Cham', 'Cherokee', 'Common', 'Coptic', 'Cuneiform',
        'Cypriot', 'Cyrillic', 'Deseret', 'Devanagari', 'Egyptian_Hieroglyphs',
        'Ethiopic', 'Georgian', 'Glagolitic', 'Gothic', 'Greek', 'Gujarati',
        'Gurmukhi', 'Han', 'Hangul', 'Hanunoo', 'Hebrew', 'Hiragana', 'Imperial_Aramaic',
        'Inherited', 'Inscriptional_Pahlavi', 'Inscriptional_Parthian', 'Javanese',
        'Kaithi', 'Kannada', 'Katakana', 'Kayah_Li', 'Kharoshthi', 'Khmer', 'Lao', 'Latin',
        'Lepcha', 'Limbu', 'Linear_B', 'Lisu', 'Lycian', 'Lydian', 'Malayalam', 'Mandaic',
        'Meetei_Mayek', 'Meroitic_Cursive', 'Meroitic_Hieroglyphs', 'Miao', 'Mongolian',
        'Myanmar', 'New_Tai_Lue', 'Nko', 'Ogham', 'Old_Italic', 'Old_Persian',
        'Old_South_Arabian', 'Old_Turkic', 'Ol_Chiki', 'Oriya', 'Osmanya', 'Phags_Pa',
        'Phoenician', 'Rejang', 'Runic', 'Samaritan', 'Saurashtra', 'Sharada', 'Shavian',
        'Sinhala', 'Sora_Sompeng', 'Sundanese', 'Syloti_Nagri', 'Syriac', 'Tagalog',
        'Tagbanwa', 'Tai_Le', 'Tai_Tham', 'Tai_Viet', 'Takri', 'Tamil', 'Telugu', 'Thaana',
        'Thai', 'Tibetan', 'Tifinagh', 'Ugaritic', 'Vai', 'Yi'
    ];
    $ret = [];
    foreach ($names as $name) {
        $pattern = '/\p{'.$name.'}/u';
        if (preg_match($pattern, $c)) {
            return $name;
        }
    }
    return '';
}
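This is a so-called IDN domain. Clients that support IDN domains normalize them using the IDNA2008 standard specified in RFC 5890, and then replace the remaining Unicode characters using the Punycode encoding defined in RFC 3492 before submitting the name for DNS resolution.
According to the specification, practically every character in the Unicode character set may appear in an IDN domain, but each top-level domain authority can define which characters are valid, so a real regular expression is hard to create and maintain.
If you want to accept IDN domains in your application, you should use the encoded version internally. The PHP intl extension provides two functions to encode and decode IDN domain names.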
echo idn_to_ascii('täst.de');
xn--tst-qla.de
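The two functions are `idn_to_ascii` and `idn_to_utf8`. As a minimal sketch of using both, the hypothetical helper `idn_round_trip` below encodes a name and decodes it again. It assumes the intl extension is loaded and returns null otherwise, or when the conversion fails. The UTS #46 variant is passed explicitly.

```php
<?php
// Hypothetical helper: encode a domain to its ASCII form and decode it back.
// Returns null when the intl extension is missing or the conversion fails.
function idn_round_trip(string $domain): ?array
{
    if (!function_exists('idn_to_ascii') || !function_exists('idn_to_utf8')) {
        return null;
    }
    $ascii = idn_to_ascii($domain, IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46);
    if ($ascii === false) {
        return null;
    }
    $unicode = idn_to_utf8($ascii, IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46);
    if ($unicode === false) {
        return null;
    }
    return ['ascii' => $ascii, 'unicode' => $unicode];
}

var_dump(idn_round_trip("t\u{00E4}st.de"));
```

Storing the ASCII form and decoding only for display keeps comparisons and lookups on a single canonical representation.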
After encoding, the domain will pass any traditional regular expression check.
Simple validation:
$url = "http://example.com/";
if (preg_match('/^(http|https|ftp):\/\/([A-Z0-9][A-Z0-9_-]*(?:\.[A-Z0-9][A-Z0-9_-]*)+):?(\d+)?\/?/i', $url)) {
    echo 'OK';
} else {
    echo 'Invalid URL.';
}
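The pattern above accepts underscores and does not limit label length. As a stricter alternative for the host part alone, here is a sketch that applies the conventional letter-digit-hyphen rule to an already-encoded ASCII host name. The function name `is_valid_ascii_hostname` is hypothetical, and the limits (63 bytes per label, 253 bytes overall) are the commonly cited DNS limits rather than anything stated in this thread.

```php
<?php
// Hypothetical helper: checks an ASCII (already Punycode-encoded) host name.
// Each label must be 1 to 63 characters of letters, digits, or hyphens,
// and must not start or end with a hyphen.
function is_valid_ascii_hostname(string $host): bool
{
    if ($host === '' || strlen($host) > 253) {
        return false;
    }
    foreach (explode('.', $host) as $label) {
        // \z anchors at the very end, so a trailing newline is rejected.
        if (preg_match('/^[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\z/i', $label) !== 1) {
            return false;
        }
    }
    return true;
}

var_dump(is_valid_ascii_hostname('xn--tst-qla.de'));
var_dump(is_valid_ascii_hostname('-bad.example'));
```

This only checks syntax; it says nothing about whether the name exists or whether the registry for that top-level domain would allow it.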
Edit:
If you want real DNS validation, you can use dns_get_record (PHP 5) or gethostbyname.
For example:
$domain = 'ελληνικά.idn.icann.org';
$idnDomain = idn_to_ascii( $domain );
if ( $dnsResult = dns_get_record( $idnDomain, DNS_ANY ) )
{
    echo $idnDomain , "\n";
    print_r( $dnsResult );
}
else
{
    echo "failed to lookup domain\n";
}
Result:
xn--hxargifdar.idn.icann.org
Array
(
[0] => Array
(
[host] => xn--hxargifdar.idn.icann.org
[class] => IN
[ttl] => 21456
[type] => A
[ip] => 199.7.85.10
)
[1] => Array
(
[host] => xn--hxargifdar.idn.icann.org
[class] => IN
[ttl] => 21600
[type] => AAAA
[ipv6] => 2620::2830:230:0:0:0:10
)
)
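This is an IDN domain. I would first convert it to its Punycode version and then validate the domain.
However, if you really want to validate it with a regular expression: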
<?php
$domain = 'παράδειγμα.gr';
$regex = '#^([\w-]+://?|www[\.])?([^\-\s\,\;\:\+\/\\\?\^\`\=\&\%\"\'\*\#\<\>]*)\.[a-z]{2,7}$#';
if (preg_match($regex, $domain)) {
    echo "VALID";
}
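But this will produce false positives, because validating an IDN domain is really complicated. I tried to check that the name contains no invalid characters, but the list is incomplete.
It is better to convert it to Punycode beforehand: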
$regex = '#^([\w-]+://?|www[\.])?[a-z0-9]+[a-z0-9\-\.]*[a-z0-9]+\.[a-z]{2,7}$#';
if (preg_match($regex, idn_to_ascii($domain))) {
    echo "VALID";
}
If you also want to test whether the domain can be resolved, try:
$regex = '#^([\w-]+://?|www[\.])?[a-z0-9]+[a-z0-9\-\.]*[a-z0-9]+\.[a-z]{2,7}$#';
$punny_domain = idn_to_ascii($domain);
if (preg_match($regex, $punny_domain)) {
    if (gethostbyname($punny_domain) != $punny_domain) {
        echo "VALID";
    }
}