c# - Get only the prefix from a host name in a given URL

Question

I need to get the domain name, without the top-level domain suffix, from a given URL.

e.g.

Url :www.google.com then output=google
Url :http://www.google.co.uk/path1/path2 then output=google
Url :http://google.co.uk/path1/path2 then output=google
Url :http://google.com then output=google
Url :http://google.co.in then output=google
Url :http://mail.google.co.in then output=google

For that I tried this code:

 var uri = new Uri("http://www.google.co.uk/path1/path2");
 var sURL = uri.Host;
 string[] aa = sURL.Split('.');
 MessageBox.Show(aa[1]);

But it does not always give the correct output (especially for URLs without www). I then searched on Google and tried to solve it, but nothing helped. I also looked at the related questions on Stack Overflow, but they did not work for me.

score 1 · Accepted Answer

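This answer is here only for completeness, because I think it would be a valid approach if it weren't so complicated and didn't essentially abuse the DNS system. Note that it isn't 100% foolproof either (and it requires access to a DNS server).

1. Extract the full domain name from the URL. Let's take http://somepart.subdomain.example.org/some/files as an example. We'd get somepart.subdomain.example.org.
2. Split the domain name at the dots: {"somepart", "subdomain", "example", "org"}.
3. Take the rightmost part (org) and check whether it is a known (top-level) domain.
- If it is, the next part to the left is the domain name you're looking for.
- If it isn't, try to retrieve an IP for it.
- If there is an IP, the part added last is your domain name.
- If there is no IP either, add the next part to the left and repeat these checks (in this example you'd now test example.org).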
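
The steps above can be sketched roughly as follows. This is only an illustration under assumptions: the class name DnsDomainGuesser, the tiny KnownTlds list, and the injectable hasIp delegate are made up for this example and are not part of the answer. ResolvesToIp uses Dns.GetHostAddresses and treats a failed lookup as "no IP".

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Net.Sockets;

public static class DnsDomainGuesser
{
    // Hypothetical, deliberately tiny list; not a complete set of TLDs.
    private static readonly HashSet<string> KnownTlds =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "com", "org", "net" };

    // Returns true when the name resolves to at least one IP address.
    public static bool ResolvesToIp(string name)
    {
        try
        {
            return Dns.GetHostAddresses(name).Length > 0;
        }
        catch (SocketException)
        {
            return false; // lookup failed: treat as "no IP"
        }
    }

    // Walks the host labels from right to left, as in the steps above.
    // 'hasIp' is injectable so the logic can be tried without network access.
    public static string GuessDomain(string url, Func<string, bool> hasIp)
    {
        string[] parts = new Uri(url).Host.Split('.');
        string candidate = parts[parts.Length - 1];

        for (int i = parts.Length - 1; i >= 0; i--)
        {
            if (i < parts.Length - 1)
            {
                // Add the next part to the left and test again.
                candidate = parts[i] + "." + candidate;
            }
            if (KnownTlds.Contains(candidate))
            {
                // Known suffix: the next label to the left is the domain name.
                return i > 0 ? parts[i - 1] : null;
            }
            if (hasIp(candidate))
            {
                // The label added last made the name resolvable.
                return parts[i];
            }
        }
        return null; // nothing matched
    }
}
```

For a real lookup you would call DnsDomainGuesser.GuessDomain(url, DnsDomainGuesser.ResolvesToIp). As the answer says, the result depends on what happens to resolve, so it is not foolproof.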

score 1 · Accepted Answer

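The correct answer to your question is: no, you can't.

The only solution, and it can only be implemented in a dirty and hard-to-maintain way, is to keep a list of all existing top-level domains (you can find an incomplete list in this SO answer).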

// Requires: using System;
// Sample only: a real list of all TLDs is very large.
var allTld = new[] {".com", ".it", ".co.uk"};
string urlToCheck = "www.google.com"; // also try: sports-ak.espn.go.com/nfl/  http://www.google.co.uk/path1/path2
if (!urlToCheck.StartsWith("http", StringComparison.OrdinalIgnoreCase))
{
    urlToCheck = string.Concat("http://", urlToCheck);
}
var uri = new Uri(urlToCheck);

string domain = string.Empty;
for (int i = 0; i < allTld.Length; i++)
{
    // Match the suffix only at the end of the host, not anywhere inside it.
    if (uri.Host.EndsWith(allTld[i], StringComparison.OrdinalIgnoreCase))
    {
        domain = uri.Host.Substring(0, uri.Host.Length - allTld[i].Length);
        // Keep only the label directly to the left of the suffix.
        int index = domain.LastIndexOf(".", StringComparison.Ordinal);
        if (index > -1)
        {
            domain = domain.Substring(index + 1);
        }
        break; // stop at the first matching suffix
    }
}
if (string.IsNullOrEmpty(domain))
{
    throw new Exception(string.Format("TLD of url {0} is missing", urlToCheck));
}

IMHO, you should ask yourself: do I really need the name without the TLD?
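
If you do need it, one possible variation on the list-based idea is to compare whole labels and try the longest candidate suffix first, so that "co.uk" is preferred over "uk". This is a minimal sketch, not a definitive implementation: the class name SuffixListLookup and the six-entry Suffixes set are hypothetical, and a real list would be far larger.

```csharp
using System;
using System.Collections.Generic;

public static class SuffixListLookup
{
    // Hypothetical sample; a real list would contain thousands of entries.
    private static readonly HashSet<string> Suffixes =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        { "com", "it", "uk", "co.uk", "in", "co.in" };

    public static string GetName(string url)
    {
        // Uri needs a scheme, so add one when it is missing.
        if (!url.StartsWith("http", StringComparison.OrdinalIgnoreCase))
        {
            url = "http://" + url;
        }
        string[] labels = new Uri(url).Host.Split('.');

        // Try the longest possible suffix first, so "co.uk" wins over "uk".
        for (int start = 1; start < labels.Length; start++)
        {
            string suffix = string.Join(".", labels, start, labels.Length - start);
            if (Suffixes.Contains(suffix))
            {
                // The label directly left of the suffix is the name.
                return labels[start - 1];
            }
        }
        throw new ArgumentException("No known suffix in " + url);
    }
}
```

With this sample set, SuffixListLookup.GetName returns "google" for www.google.com, http://google.co.uk/path1/path2 and http://mail.google.co.in. Any host whose suffix is not in the set throws.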

score 0 · Accepted Answer

I have tested the following regular expression against all of your cases, and it works.

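string url = "http://www.google.co.uk/path1/path2";
// The dot after "www" is escaped so that it matches a literal dot only.
// Note: the pattern captures the first label after an optional "www.", so mail.google.co.in yields "mail".
Regex rgx = new Regex(@"(http(s?)://)?(www\.)?((?<content>.*?)\.){1}([\w]+\.?)+");
Match MatchResult = rgx.Match(url);
string result = MatchResult.Groups["content"].Value; //google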

score 0 · Accepted Answer

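This is the best you can get. It is neither a maintainable solution nor a "fast" one (GetDomain.GetDomainFromUrl should be optimized).

- Use GetDomain.GetDomainFromUrl
- Add "co.uk" to TldPatterns.EXACT (I don't know why it isn't there in the first place)
- Some other minor string manipulation

It should look like this: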

using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

        class TldPatterns
        {
            private TldPatterns()
            {
                // Prevent instantiation.
            }

            /**
             * If a hostname is contained in this set, it is a TLD.
             */
            static public string[] EXACT = new string[] {
             "gov.uk",
             "mil.uk",
             "co.uk",
             //...

    public class Program
    {

        static void Main(string[] args)
        {
            string[] urls = new[] {"www.google.com", "http://www.google.co.uk/path1/path2 ", "http://google.co.uk/path1/path2 ",
            "http://google.com", "http://google.co.in"};
            foreach (var item in urls)
            {
                string url = item;
                if (!Regex.IsMatch(item, "^\\w+://"))
                    url = "http://" + item;
                var domain = GetDomain.GetDomainFromUrl(url);
                Console.WriteLine("Original    : " + item);
                Console.WriteLine("URL         : " + url);
                Console.WriteLine("Domain      : " + domain);
                Console.WriteLine("Domain Part : " + domain.Substring(0, domain.IndexOf('.')));
                Console.WriteLine();
            }
        }
    }

Output:

Original    : www.google.com
URL         : http://www.google.com
Domain      : google.com
Domain Part : google

Original    : http://www.google.co.uk/path1/path2
URL         : http://www.google.co.uk/path1/path2
Domain      : google.co.uk
Domain Part : google

Original    : http://google.co.uk/path1/path2
URL         : http://google.co.uk/path1/path2
Domain      : google.co.uk
Domain Part : google

Original    : http://google.com
URL         : http://google.com
Domain      : google.com
Domain Part : google

Original    : http://google.co.in
URL         : http://google.co.in
Domain      : google.co.in
Domain Part : google
