java - How to parse only the domain name

Question

I only want to parse out the domain name in Java. For example,

http://facebook.com/bartsf
http://www.facebook.com/pages/Shine-Communications/169790283042195
http://graph.facebook.com/100002306245454/picture?width=150&height=150
http://maps.google.com/maps?hl=en&q=37.78353+-122.39579
http://www.google.com/url?sa=X&q=http://www.onlinehaendler-news.de/interviews/1303-abba24-im-spagat-zwischen-haendler-und-kaeuferinteressen.html&ct=ga&cad=CAEQARgAIAAoATABOAFAnqSQjwVIAVAAWABiAmRl&cd=xa_cHWHNG70&usg=AFQjCNFMgnkzqN0fNKMFKz1NTKK1n9Gg9A

This is the code I am writing for a MapReduce job.

 String[] whiteList={"www.facebook.com","www.google.com"};
 UrlValidator urlValidator=new UrlValidator(schemes);
 // Read the file line by line ("lines" is a placeholder for the lines of the input file)
 for (String line : lines)
{
            String sCurrentLine=line;
            if(sCurrentLine.length()>=3)
            {
                String tempString=sCurrentLine.substring(0,3);

                if(!tempString.equals("192") && !tempString.equals("172") && !tempString.equals("10."))
                {

                    sCurrentLine="http://"+sCurrentLine;
                    if(urlValidator.isValid(sCurrentLine))//domain filter should be here
                    {
                           System.out.println(sCurrentLine);
                    }
                }
                tempString="";
            }
 }

I want to filter on whether the domain is facebook.com or google.com; all of the URLs above should be filtered out.

score 8 · Accepted Answer

Use java.net.URI to parse the string into a URI. There is no need to reinvent the wheel here.

URI foo = new URI("http://facebook.com/bartsf");
String host = foo.getHost(); // "facebook.com"
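As a minimal sketch of how this could be wrapped for untrusted input lines: new URI(...) throws the checked URISyntaxException for malformed strings, and getHost() returns null when the URI has no host component. The class and method names (UriHostExample, hostOf) are made up for illustration and are not part of the original answer.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class UriHostExample {

    // Hypothetical helper: returns the host part of the given string, or null
    // if the string is not a valid URI or has no host component.
    static String hostOf(String s) {
        try {
            return new URI(s).getHost();
        } catch (URISyntaxException e) {
            return null; // not parseable as a URI
        }
    }

    public static void main(String[] args) {
        System.out.println(hostOf("http://facebook.com/bartsf"));
        System.out.println(hostOf("http://maps.google.com/maps?hl=en&q=37.78353+-122.39579"));
        System.out.println(hostOf("not a uri")); // spaces are illegal, so this yields null
    }
}
```

Returning null keeps the caller's loop simple: a line that does not parse is skipped the same way as a line without a host.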

score 2 · Accepted Answer

Alternatively, you can use the URL class:

URL url = new URL("http://www.facebook.com/pages/Shine-Communications/169790283042195");
String host = url.getHost();
// 'indexOf' is required since the root domain is all you care about. This handles
//  bob.facebook.com as well as facebook.com 
if (host.indexOf("facebook.com") >= 0 || host.indexOf("google.com") >= 0) {
    // ... got one of those ...
} else {
    // ... got something else ...
}

You will have to add some try ... catch handling to deal with passing the URL constructor a string that may not be a URL at all.
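A minimal sketch of that try ... catch handling, assuming a null return is an acceptable way to signal "not a URL". The names UrlHostExample and hostOrNull are hypothetical. Note that recent JDKs mark the URL(String) constructor as deprecated, so the sketch suppresses that warning.

```java
import java.net.MalformedURLException;
import java.net.URL;

final class UrlHostExample {

    // Hypothetical helper: returns the host of the given string, or null when
    // the URL constructor rejects it (for example, when no protocol is present).
    @SuppressWarnings("deprecation") // URL(String) is deprecated in recent JDKs
    static String hostOrNull(String s) {
        try {
            return new URL(s).getHost();
        } catch (MalformedURLException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(hostOrNull(
                "http://www.facebook.com/pages/Shine-Communications/169790283042195"));
        System.out.println(hostOrNull("no protocol here"));
    }
}
```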

Also, note that this may not do exactly what you want if you pass it a file:// or a mailto: URL, if that is a concern.

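The biggest potential problem I see with using this class is that nowhere in the javadocs are all the terms defined. For example, what is the path? It is returned by the getPath() method, whose javadoc says "Gets the path part of this URL". You might wonder exactly what that includes. I wondered whether it includes the last part of the URL, before the ? or # if any. (It does: the path runs from the first slash after the host up to the ? or #, or to the end of the URL if there is neither.)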
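One way to settle what each accessor returns is to print the pieces for a URL from the question. This is only an illustrative sketch: the class UrlPartsExample and its parts helper are made-up names, and the second URL (example.com) is an invented input used to show a fragment.

```java
import java.net.MalformedURLException;
import java.net.URL;

final class UrlPartsExample {

    // Hypothetical helper: returns { host, path, query, ref } for the given URL string.
    @SuppressWarnings("deprecation") // URL(String) is deprecated in recent JDKs
    static String[] parts(String s) {
        try {
            URL u = new URL(s);
            return new String[] { u.getHost(), u.getPath(), u.getQuery(), u.getRef() };
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("not a URL: " + s, e);
        }
    }

    public static void main(String[] args) {
        String[] p = parts(
                "http://graph.facebook.com/100002306245454/picture?width=150&height=150");
        System.out.println(p[0]); // graph.facebook.com
        System.out.println(p[1]); // /100002306245454/picture
        System.out.println(p[2]); // width=150&height=150
        System.out.println(p[3]); // null (no fragment)

        String[] q = parts("http://example.com/a/b.html?x=1#top");
        System.out.println(q[1]); // /a/b.html
        System.out.println(q[3]); // top
    }
}
```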

Continuing with the expanded question

I don't like these lines:

String tempString=sCurrentLine.substring(0,3);
if (!tempString.equals("192") && !tempString.equals("172") && !tempString.equals("10."))

But I do like this:

if(!sCurrentLine.startsWith("192.168.") && !sCurrentLine.startsWith("172.") && !sCurrentLine.startsWith("10."))

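I suspect you would be better off if your whitelist contained just "facebook.com" and "google.com", since the "www" is not that significant and both companies have lots of subdomains.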
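A whitelist of base domains could be checked along these lines. This is a sketch under assumptions, not the asker's actual filter: the class name DomainWhitelist and the method isWhitelisted are hypothetical, and it parses with java.net.URI. A host matches when it equals a whitelisted domain or ends with a dot followed by that domain.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

final class DomainWhitelist {

    // Base domains only; subdomains such as maps.google.com match by suffix.
    private static final String[] WHITELIST = { "facebook.com", "google.com" };

    // Hypothetical helper: true if the URL's host is a whitelisted domain
    // or a subdomain of one.
    static boolean isWhitelisted(String url) {
        String host;
        try {
            host = new URI(url).getHost();
        } catch (URISyntaxException e) {
            return false; // not a parseable URI
        }
        if (host == null) {
            return false; // e.g. a mailto: address has no host
        }
        host = host.toLowerCase(Locale.ROOT);
        for (String domain : WHITELIST) {
            if (host.equals(domain) || host.endsWith("." + domain)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isWhitelisted("http://facebook.com/bartsf"));                               // true
        System.out.println(isWhitelisted("http://maps.google.com/maps?hl=en&q=37.78353+-122.39579")); // true
        System.out.println(isWhitelisted("http://notfacebook.com/"));                                  // false
    }
}
```

Compared with an indexOf check, the dot-prefixed suffix test avoids matching hosts such as notfacebook.com or facebook.com.example.org, which merely contain the whitelisted string.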

The code above would go into your UrlValidator class.
