url - Get the subdomain from a URL

Question

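Getting the subdomain from a URL sounds easy at first.

http://www.domain.example

Scan for the first period, then return whatever comes after the "http://"...

Then you remember

http://super.duper.domain.example

Oh. So then you think: okay, find the last period, go back one label, and take everything before it!

Then you remember

http://super.duper.domain.co.uk

And you're back to square one. Does anyone have any great ideas besides storing a list of all TLDs?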

score 75 · Accepted Answer

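Does anyone have any great ideas besides storing a list of all TLDs?

No, because each TLD differs on what counts as a subdomain, a second-level domain, and so on.

Keep in mind that there are top-level domains, second-level domains, and subdomains. Technically speaking, everything except the TLD is a subdomain.

In the domain.com.uk example, "domain" is a subdomain, "com" is a second-level domain, and "uk" is the TLD.

So the question remains more complex than it looks at first glance, and it depends on how each TLD is managed. You'll need a database of all the TLDs that includes their particular partitioning, and what counts as a second-level domain and as a subdomain. There aren't too many TLDs, though, so the list is reasonably manageable, but collecting all that information isn't trivial. There may already be such a list available.

It looks like http://publicsuffix.org/ is one such list: all the common suffixes (.com, .co.uk, etc.) in a list suitable for searching. It still won't be easy to parse, but at least you don't have to maintain the list.

A "public suffix" is one under which Internet users can directly register names. Some examples of public suffixes are ".com", ".co.uk" and "pvt.k12.wy.us". The Public Suffix List is a list of all known public suffixes.

The Public Suffix List is an initiative of the Mozilla Foundation. It is available for use in any software, but was originally created to meet the needs of browser manufacturers. It allows browsers to, for example:

Avoid privacy-damaging "supercookies" being set for high-level domain name suffixes

Highlight the most important part of a domain name in the user interface

Accurately sort history entries by site

Looking through the list, you can see it's not a trivial problem. I think a list is the only correct way to accomplish this...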
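To make the list-based approach concrete, here is a minimal Python sketch of suffix matching in the style of the Public Suffix List (plain rules, `*.` wildcard rules, and `!` exception rules). The rule set `SAMPLE_RULES` and the function names are made up for illustration; a real implementation would load the full list from publicsuffix.org and also handle internationalized names.

```python
# Tiny illustrative rule set in Public Suffix List syntax (NOT the real list).
SAMPLE_RULES = {"com", "uk", "co.uk", "*.ck", "!www.ck"}


def public_suffix(hostname, rules=SAMPLE_RULES):
    """Return the public suffix of hostname according to rules."""
    labels = hostname.lower().strip(".").split(".")
    best = 1  # implicit default rule "*": the last label alone is the suffix
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        # An exception rule wins outright: the suffix is the rule minus its first label.
        if "!" + candidate in rules:
            return ".".join(labels[i + 1:])
        wildcard = ".".join(["*"] + labels[i + 1:])
        if candidate in rules or wildcard in rules:
            # Among ordinary rules, the one with the most labels wins.
            best = max(best, len(labels) - i)
    return ".".join(labels[-best:])


def split_host(hostname, rules=SAMPLE_RULES):
    """Split hostname into (subdomain, registrable domain, public suffix)."""
    labels = hostname.lower().strip(".").split(".")
    suffix = public_suffix(hostname, rules)
    n = len(suffix.split("."))
    if len(labels) <= n:
        return "", "", suffix  # the host is itself a public suffix
    domain = ".".join(labels[-(n + 1):])
    subdomain = ".".join(labels[:-(n + 1)])
    return subdomain, domain, suffix


print(split_host("super.duper.domain.co.uk"))
print(split_host("www.domain.example"))
```

With the sample rules, the first call separates "super.duper" from "domain.co.uk", while the second falls back to the default rule and treats "example" as the suffix.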

score 25 · Accepted Answer

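As Adam says, it's not easy, and currently the only practical way is to use a list.

Even then there are exceptions - for example, in .uk there are a handful of domains that are valid immediately at that level and are not in .co.uk, so these have to be added as exceptions.

This is currently how mainstream browsers do it - it's necessary to ensure that example.co.uk can't set a cookie for .co.uk, which would then be sent to any other website under .co.uk.

The good news is that there's already a list available at http://publicsuffix.org/.

There's also some work in the IETF to create some sort of standard that would allow TLDs to declare what their domain structure looks like. This is slightly complicated, though, by the likes of .uk.com, which is operated as if it were a public suffix but isn't sold by the .com registry.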

score 22 · Accepted Answer

Publicsuffix.org seems to be the way to go. There are plenty of implementations out there that parse the contents of the publicsuffix data file easily:

Perl: Domain::PublicSuffix
Java: http://sourceforge.net/projects/publicsuffix/
PHP: php-domain-parser
C#/.NET: https://github.com/danesparza/domainname-parser
Python: http://pypi.python.org/pypi/publicsuffix
Ruby: domainatrix, public_suffix

score 9 · Accepted Answer

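As Adam and John already said, publicsuffix.org is the correct way to go. But if for any reason you cannot use this approach, here's a heuristic based on an assumption that holds for 99% of all domains:

There is one property that distinguishes (not all, but nearly all) "real" domains from subdomains and TLDs, and that's the DNS MX record. You could create an algorithm that searches for this: remove the parts of the hostname one by one and query the DNS until you find an MX record. Example: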

super.duper.domain.co.uk => no MX record, proceed
duper.domain.co.uk       => no MX record, proceed
domain.co.uk             => MX record found! assume that's the domain

Here is an example in PHP:

function getDomainWithMX($url) {
    //parse hostname from URL 
    //http://www.example.co.uk/index.php => www.example.co.uk
    $urlParts = parse_url($url);
    if ($urlParts === false || empty($urlParts["host"])) 
        throw new InvalidArgumentException("Malformed URL");

    //find first partial name with MX record
    $hostnameParts = explode(".", $urlParts["host"]);
    do {
        $hostname = implode(".", $hostnameParts);
        if (checkdnsrr($hostname, "MX")) return $hostname;
    } while (array_shift($hostnameParts) !== null);

    throw new DomainException("No MX record found");
}
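The same label-stripping loop can be sketched in Python. To keep the example runnable without network access, the DNS lookup is passed in as a callable; `has_mx` and `FAKE_MX` are hypothetical stand-ins, and in practice the callable would wrap a real MX query made with a DNS library of your choice.

```python
from urllib.parse import urlsplit


def domain_with_mx(url, has_mx):
    """Strip leading labels from the URL's host until has_mx(name) is true.

    has_mx is a caller-supplied callable (an assumption of this sketch);
    a real one would perform a DNS query for MX records.
    """
    host = urlsplit(url).hostname
    if not host:
        raise ValueError("Malformed URL")
    labels = host.split(".")
    while labels:
        name = ".".join(labels)
        if has_mx(name):
            return name
        labels.pop(0)  # drop the leftmost label and try again
    raise LookupError("No MX record found")


# Stand-in for real DNS: pretend only this name has an MX record.
FAKE_MX = {"domain.co.uk"}
print(domain_with_mx("http://super.duper.domain.co.uk/index.php", FAKE_MX.__contains__))
```

Injecting the lookup also makes the heuristic easy to test, which matters because its result depends entirely on how the zone happens to be configured.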

score 2 · Accepted Answer

For a C library (with data table generation in Python), I wrote http://code.google.com/p/domain-registry-provider/ which is both fast and space efficient.

The library uses ~30kB for the data tables and ~10kB for the C code. There is no startup overhead since the tables are constructed at compile time. See http://code.google.com/p/domain-registry-provider/wiki/DesignDoc for more details.

To better understand the table generation code (Python), start here: http://code.google.com/p/domain-registry-provider/source/browse/trunk/src/registry_tables_generator/registry_tables_generator.py

To better understand the C API, see: http://code.google.com/p/domain-registry-provider/source/browse/trunk/src/domain_registry/domain_registry.h

score 2 · Accepted Answer

As already said, the Public Suffix List is the only way to parse a domain correctly. For PHP you can try TLDExtract. Here is sample code:

$extract = new LayerShifter\TLDExtract\Extract();

$result = $extract->parse('super.duper.domain.co.uk');
$result->getSubdomain(); // will return (string) 'super.duper'
$result->getSubdomains(); // will return (array) ['super', 'duper']
$result->getHostname(); // will return (string) 'domain'
$result->getSuffix(); // will return (string) 'co.uk'

score 1 · Accepted Answer

I just wrote a program for this in Clojure, based on the data from publicsuffix.org:

https://github.com/isaksky/url_dom

For example:

(parse "sub1.sub2.domain.co.uk") 
;=> {:public-suffix "co.uk", :domain "domain.co.uk", :rule-used "*.uk"}

score 1 · Accepted Answer

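Versions for shell and bash

In addition to Adam Davis's correct answer, I would like to post my own solution for this operation.

Because the list is large, here are three of the many solutions I tested...

First, prepare your TLD list this way: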

wget -O - https://publicsuffix.org/list/public_suffix_list.dat |
    grep '^[^/]' |
    tac > tld-list.txt

Note: tac reverses the list to ensure that .co.uk is tested before .uk.

POSIX shell version

splitDom() {
    local tld
    while read tld;do
        [ -z "${1##*.$tld}" ] &&
            printf "%s : %s\n" $tld ${1%.$tld} && return
    done <tld-list.txt
}

Tests:

splitDom super.duper.domain.co.uk
co.uk : super.duper.domain

splitDom super.duper.domain.com
com : super.duper.domain

Bash version

In order to reduce forks (avoiding the myvar=$(function..) syntax), I prefer to set variables in bash functions instead of dumping output to stdout:

tlds=($(<tld-list.txt))
splitDom() {
    local tld
    local -n result=${2:-domsplit}
    for tld in ${tlds[@]};do
        [ -z "${1##*.$tld}" ] &&
            result=($tld ${1%.$tld}) && return
    done
}

Then:

splitDom super.duper.domain.co.uk myvar
declare -p myvar
declare -a myvar=([0]="co.uk" [1]="super.duper.domain")

splitDom super.duper.domain.com
declare -p domsplit
declare -a domsplit=([0]="com" [1]="super.duper.domain")

Faster bash version:

With the same preparation, then:

declare -A TLDS='()'
while read tld ;do
    if [ "${tld##*.}" = "$tld" ];then
        TLDS[${tld##*.}]+="$tld"
      else
        TLDS[${tld##*.}]+="$tld|"
    fi
done <tld-list.txt

This step is a lot slower, but the splitDom function becomes much faster:

shopt -s extglob 
splitDom() {
    local domsub=${1%%.*(${TLDS[${1##*.}]%\|})}
    local -n result=${2:-domsplit}
    result=(${1#$domsub.} $domsub)
}

Tests on my Raspberry Pi:

Both bash scripts were tested with:

for dom in dom.sub.example.{,{co,adm,com}.}{com,ac,de,uk};do
    splitDom $dom myvar
    printf "%-40s %-12s %s\n" $dom ${myvar[@]}
done

The POSIX version was tested with a more detailed for loop, but all test scripts produce the same output:

dom.sub.example.com                      com          dom.sub.example
dom.sub.example.ac                       ac           dom.sub.example
dom.sub.example.de                       de           dom.sub.example
dom.sub.example.uk                       uk           dom.sub.example
dom.sub.example.co.com                   co.com       dom.sub.example
dom.sub.example.co.ac                    ac           dom.sub.example.co
dom.sub.example.co.de                    de           dom.sub.example.co
dom.sub.example.co.uk                    co.uk        dom.sub.example
dom.sub.example.adm.com                  com          dom.sub.example.adm
dom.sub.example.adm.ac                   ac           dom.sub.example.adm
dom.sub.example.adm.de                   de           dom.sub.example.adm
dom.sub.example.adm.uk                   uk           dom.sub.example.adm
dom.sub.example.com.com                  com          dom.sub.example.com
dom.sub.example.com.ac                   com.ac       dom.sub.example
dom.sub.example.com.de                   com.de       dom.sub.example
dom.sub.example.com.uk                   uk           dom.sub.example.com

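The full script, containing the file read and the splitDom loop, takes ~2m with the POSIX version and ~1m29s with the first bash script based on the $tlds array, but only ~22s with the last bash script based on the $TLDS associative array.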

                POSIX version     $tlds (array)      $TLDS (associative array)
File read   :       0.04164          0.55507           18.65262
Split loop  :     114.34360         88.33438            3.38366
Total       :     114.34360         88.88945           22.03628

So although populating the associative array is the heavier job, the splitDom function becomes much faster!
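The idea behind the associative-array variant (pay once to index the suffixes by their last label, then scan only the matching bucket per lookup) can be sketched in Python as follows. The function names and the short suffix list are assumptions made for illustration; the list is a small subset, not the real data.

```python
from collections import defaultdict


def build_index(suffixes):
    """Group suffixes by their last label, longest suffix first in each bucket."""
    index = defaultdict(list)
    for suffix in suffixes:
        index[suffix.rsplit(".", 1)[-1]].append(suffix)
    for bucket in index.values():
        bucket.sort(key=lambda s: s.count("."), reverse=True)
    return index


def split_dom(hostname, index):
    """Return (suffix, rest) using only the bucket for the host's last label."""
    for suffix in index.get(hostname.rsplit(".", 1)[-1], ()):
        if hostname.endswith("." + suffix):
            return suffix, hostname[: -len(suffix) - 1]
    return None  # unknown last label, or the host is a bare suffix


# Illustrative subset only, not the real list.
INDEX = build_index(["com", "uk", "co.uk", "co.com", "ac", "com.ac"])
print(split_dom("dom.sub.example.co.uk", INDEX))
print(split_dom("dom.sub.example.adm.com", INDEX))
```

Sorting each bucket longest-first plays the same role as tac in the shell versions: "co.uk" is tried before "uk".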

score 0 · Accepted Answer

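It doesn't solve the problem exactly, but you could maybe get a useful answer by fetching the domain piece by piece and checking the response, i.e., fetch "http://uk", then "http://co.uk", then "http://domain.co.uk". When you get a non-error response, you've got the domain, and the rest is the subdomain.

Sometimes you just gotta try it :)

Edit:

Tom Leys points out in the comments that some domains are set up only on the www subdomain, which would give us an incorrect answer in the above test. Good point! Maybe the best approach would be to check each part with "http://www" as well as "http://", and count a hit on either as a hit for that section of the domain name? We'd still miss some "alternative" arrangements such as "web.domain.com", but I haven't run into one of those for a while :)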

score 0 · Accepted Answer

Use the URIBuilder, then get the URIBuilder.host attribute and split it into an array on ".". You now have an array with the domain split out.
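As a rough Python equivalent of the step described above (the helper name `host_labels` is made up for this sketch), the standard library can extract the host and split it on dots. Note that this only yields the labels; deciding which of them form the subdomain still needs a suffix list.

```python
from urllib.parse import urlsplit


def host_labels(url):
    """Return the URL's host split on dots, or an empty list if there is no host."""
    host = urlsplit(url).hostname
    return host.split(".") if host else []


print(host_labels("http://super.duper.domain.co.uk/path"))
```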

score 0 · Accepted Answer

echo tld('http://www.example.co.uk/test?123'); // co.uk

/**
 * http://publicsuffix.org/
 * http://www.alandix.com/blog/code/public-suffix/
 * http://tobyinkster.co.uk/blog/2007/07/19/php-domain-class/
 */
function tld($url_or_domain = null)
{
    $domain = $url_or_domain ?: $_SERVER['HTTP_HOST'];
    preg_match('/^[a-z]+:\/\//i', $domain) and 
        $domain = parse_url($domain, PHP_URL_HOST);
    $domain = mb_strtolower($domain, 'UTF-8');
    if (strpos($domain, '.') === false) return null;

    $url = 'http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1';

    if (($rules = file($url)) !== false)
    {
        $rules = array_filter(array_map('trim', $rules));
        array_walk($rules, function($v, $k) use(&$rules) { 
            if (strpos($v, '//') !== false) unset($rules[$k]);
        });

        $segments = '';
        foreach (array_reverse(explode('.', $domain)) as $s)
        {
            $wildcard = rtrim('*.'.$segments, '.');
            $segments = rtrim($s.'.'.$segments, '.');

            if (in_array('!'.$segments, $rules))
            {
                $tld = substr($wildcard, 2);
                break;
            }
            elseif (in_array($wildcard, $rules) or 
                    in_array($segments, $rules))
            {
                $tld = $segments;
            }
        }

        if (isset($tld)) return $tld;
    }

    return false;
}

score 0 · Accepted Answer

You can use the tld.js library: a JavaScript API for working with complex domain names, subdomains and URIs.

tldjs.getDomain('mail.google.co.uk');
// -> 'google.co.uk'

If you need the root domain in a browser, you can use the AngusFu/browser-root-domain library.

var KEY = '__rT_dM__' + (+new Date());
var R = new RegExp('(^|;)\\s*' + KEY + '=1');
var Y1970 = (new Date(0)).toUTCString();

module.exports = function getRootDomain() {
  var domain = document.domain || location.hostname;
  var list = domain.split('.');
  var len = list.length;
  var temp = '';
  var temp2 = '';

  while (len--) {
    temp = list.slice(len).join('.');
    temp2 = KEY + '=1;domain=.' + temp;

    // try to set cookie
    document.cookie = temp2;

    if (R.test(document.cookie)) {
      // clear
      document.cookie = temp2 + ';expires=' + Y1970;
      return temp;
    }
  }
};

Using cookies is tricky.

score 0 · Accepted Answer

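If you're looking to extract subdomains and/or domains from an arbitrary list of URLs, this Python script may be helpful. Be careful though, it's not perfect. This is a tricky problem to solve in general, and it helps a lot if you have a whitelist of the domains you expect.

Get the top-level domains from publicsuffix.org

import requests

url = 'https://publicsuffix.org/list/public_suffix_list.dat'
page = requests.get(url)

domains = []
for line in page.text.splitlines():
    if line.startswith('//'):
        continue
    else:
        domain = line.strip()
        if domain:
            domains.append(domain)

domains = [d[2:] if d.startswith('*.') else d for d in domains]
print('found {} domains'.format(len(domains)))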

Build the regex

import re

# Escape each suffix and join with "|" (no trailing "|", which would add an empty alternative)
_regex = '|'.join(re.escape(domain) for domain in domains)

subdomain_regex = r'/([^/]*)\.[^/.]+\.({})/.*$'.format(_regex)
domain_regex = r'([^/.]+\.({}))/.*$'.format(_regex)

Use the regex on a list of URLs

FILE_NAME = ''   # put the CSV file name here
URL_COLNAME = '' # put the URL column name here

import pandas as pd

df = pd.read_csv(FILE_NAME)
urls = df[URL_COLNAME].astype(str) + '/' # note: adding / as a hack to help the regex

df['sub_domain_extracted'] = urls.str.extract(pat=subdomain_regex, expand=True)[0]
df['domain_extracted'] = urls.str.extract(pat=domain_regex, expand=True)[0]

df.to_csv('extracted_domains.csv', index=False)

score 0 · Accepted Answer

For this purpose I wrote a bash function that depends on the publicsuffix.org data and a simple regex.

Install the publicsuffix.org client on Ubuntu 18:

sudo apt install psl

Get the domain suffix (longest suffix):

domain=example.com.tr
output=$(psl --print-unreg-domain $domain)

The output is:

example.com.tr: com.tr

The rest is simple bash. Remove the suffix (com.tr) from the domain and test whether what remains still has more than one dot.

# split output by colon
arr=(${output//:/ })
# remove the suffix from the domain
name=${domain/${arr[1]}/}
# test
if [[ $name =~ \..*\. ]]; then
  echo "Yes, it is subdomain."
fi

Everything together in a bash function:

is_subdomain() {
  local output=$(psl --print-unreg-domain $1)
  local arr=(${output//:/ })
  local name=${1/${arr[1]}/}
  [[ $name =~ \..*\. ]]
}

Usage:

d=example.com.tr
if is_subdomain $d; then
  echo "Yes, it is."
fi

score 0 · Accepted Answer

private String getSubDomain(Uri url) throws Exception {
    String subDomain = url.getHost();
    String fial = subDomain.replace(".", "/");
    String[] arr_subDomain = fial.split("/");
    return arr_subDomain[0];
}

The first index will always be the subdomain.

score 0 · Accepted Answer

This snippet returns the correct domain name.

InternetDomainName foo = InternetDomainName.from("foo.item.shopatdoor.co.uk").topPrivateDomain();
System.out.println(foo.topPrivateDomain());

score -1 · Accepted Answer

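Keep a list of common suffixes (.co.uk, .com, et cetera) to strip out along with the http://, and then you'll only have "sub.domain" to work with instead of "http://sub.domain.suffix", or at least that's what I'd probably do.

The biggest problem is the list of possible suffixes. There are a lot, after all.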

score -3 · Accepted Answer

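Having taken a quick look at the publicsuffix.org list, it appears that you could make a reasonable approximation by removing the final three segments ("segment" here meaning a section between two dots) from domains where the final segment is two characters long, on the assumption that it's a country code and will be further subdivided. If the final segment is "us" and the second-to-last segment is also two characters, remove the last four segments. In all other cases, remove the final two segments. For example:

http://www.domain.example

"example" is not two characters, so remove "domain.example", leaving "www"

http://super.duper.domain.example

"example" is not two characters, so remove "domain.example", leaving "super.duper"

http://super.duper.domain.co.uk

"uk" is two characters (but not "us"), so remove "domain.co.uk", leaving "super.duper"

http://foo.pvt.k12.wy.us

"us" is two characters and is "us", plus "wy" is also two characters, so remove "pvt.k12.wy.us", leaving "foo".

Note that, although this works for all the examples I've seen in the responses so far, it remains only a reasonable approximation. It is not completely correct, although I suspect it's about as close as you're likely to get without making or obtaining an actual list to use for reference.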
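For reference, the rules above can be written down as a short Python sketch. The function name `guess_subdomain` is made up here, and the code is only the approximation described in this answer, not a correct parser.

```python
def guess_subdomain(hostname):
    """Approximate the subdomain by dropping 2, 3 or 4 trailing segments."""
    labels = hostname.lower().strip(".").split(".")
    last = labels[-1]
    if last == "us" and len(labels) >= 2 and len(labels[-2]) == 2:
        drop = 4  # e.g. pvt.k12.wy.us
    elif len(last) == 2:
        drop = 3  # assume a country code with a second-level category, e.g. co.uk
    else:
        drop = 2  # generic case, e.g. domain.example
    return ".".join(labels[:-drop])


for host in ("www.domain.example", "super.duper.domain.co.uk", "foo.pvt.k12.wy.us"):
    print(host, "->", guess_subdomain(host))
```

The approximation breaks down for names registered directly under a two-letter TLD: for "www.example.de" it drops all three segments and returns an empty string, although "www" is the subdomain.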