I'm trying to build a web crawler in Java, and I'm wondering whether there is a way to get the relative path from an absolute path, given the base URL. I want to replace every absolute path in the HTML that points to the same domain.

Because the HTTP URLs contain unsafe characters, I could not use Java's URI class as described in "How to construct a relative path in Java from two absolute paths (or URLs)?".

I'm using jsoup to parse the HTML. It seems able to produce an absolute path from a relative one, but not the other way round.

For example, take the following page:

    http://www.example.com/mysite/base.html

The page source of base.html can contain:

    <a href="http://www.example.com/myanothersite/new.html"> Another site of mine </a>

I am trying to cache base.html and edit it so that it contains this instead:

    <a href="../myanothersite/new.html">Another site of mine</a>

2 Answers

A different approach that does not need a given baseUrl and uses a slightly more advanced method.

    String sourceUrl = "http://www.example.com/mysite/whatever/somefolder/bar/unsecure!+?#whätyöühäv€it/site.html"; // the page being rewritten
    String targetUrl = "http://www.example.com/mysite/whatever/otherfolder/other.html"; // the link target
    String expectedTarget = "../../../otherfolder/other.html";
    String[] sourceElements = sourceUrl.split("/");
    String[] targetElements = targetUrl.split("/"); // keep in mind that the arrays can differ in length!
    StringBuilder uniquePart = new StringBuilder();   // target segments after the shared prefix
    StringBuilder relativePart = new StringBuilder(); // one "../" per source directory to climb
    boolean stillSame = true;
    for (int ii = 0; ii < sourceElements.length || ii < targetElements.length; ii++) {
        // skip the leading elements that both URLs share
        if (stillSame && ii < targetElements.length && ii < sourceElements.length
                && sourceElements[ii].equals(targetElements[ii])) continue;
        stillSame = false;
        if (targetElements.length > ii)
            uniquePart.append("/").append(targetElements[ii]);
        // the last source element is the file name, so it does not add a level
        if (sourceElements.length > ii + 1)
            relativePart.append("../");
    }

    // drop the leading "/" of uniquePart; relativePart already ends with "/" or is empty
    // (the original substring on relativePart threw when both pages shared a directory)
    String result = relativePart + uniquePart.substring(Math.min(1, uniquePart.length()));
    System.out.println("result: " + result);
    System.out.println("expectation: " + expectedTarget.equals(result));
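
If you want to call this for every link the crawler finds, the same segment-by-segment comparison can be wrapped in a method. The following is only a sketch under some assumptions: `RelativeLinks` and `relativize` are hypothetical names (not part of jsoup or the JDK), both URLs are treated as plain strings as above, and a target on a different scheme or host is returned unchanged. A "/" inside a query string or fragment would confuse it.

```java
import java.util.Arrays;

class RelativeLinks {

    /** Hypothetical helper: relative path from the page at sourceUrl to targetUrl. */
    static String relativize(String sourceUrl, String targetUrl) {
        // limit -1 keeps a trailing empty element, so "http://host/dir/" still ends in a directory
        String[] source = sourceUrl.split("/", -1);
        String[] target = targetUrl.split("/", -1);

        // elements 0..2 are the scheme ("http:"), an empty string and the host;
        // a link to another scheme or host cannot be made relative, so keep it absolute
        if (source.length < 3 || target.length < 3
                || !Arrays.equals(Arrays.copyOf(source, 3), Arrays.copyOf(target, 3))) {
            return targetUrl;
        }

        // count the leading directory segments both URLs share
        // (the last element of each array is the file name, not a directory)
        int common = 3;
        while (common < source.length - 1 && common < target.length - 1
                && source[common].equals(target[common])) {
            common++;
        }

        StringBuilder result = new StringBuilder();
        // climb one level for every source directory below the shared part
        for (int i = common; i < source.length - 1; i++) {
            result.append("../");
        }
        // then descend into the target segments below the shared part
        for (int i = common; i < target.length; i++) {
            result.append(target[i]);
            if (i < target.length - 1) {
                result.append('/');
            }
        }
        return result.toString();
    }
}
```

For the URLs in the question, `RelativeLinks.relativize("http://www.example.com/mysite/base.html", "http://www.example.com/myanothersite/new.html")` gives `../myanothersite/new.html`, and two pages in the same directory give just the file name.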
answered 2013-09-30T12:19:48.743

This should do it. Keep in mind that you can compute the baseUrl yourself by measuring how much of the source URL and the target URL is identical!

    String baseUrl = "http://www.example.com/mysite/whatever/"; // the base of your site
    String sourceUrl = "http://www.example.com/mysite/whatever/somefolder/bar/unsecure!+?#whätyöühäv€it/site.html"; // the page being rewritten
    String targetUrl = "http://www.example.com/mysite/whatever/otherfolder/other.html"; // the link target
    String expectedTarget = "../../../otherfolder/other.html";
    // cut away the base
    if (sourceUrl.startsWith(baseUrl))
        sourceUrl = sourceUrl.substring(baseUrl.length());
    if (!sourceUrl.startsWith("/"))
        sourceUrl = "/" + sourceUrl;

    // construct the relative levels up: one "../" per directory left in the source path
    StringBuilder bar = new StringBuilder();
    while (sourceUrl.indexOf("/", 1) > 0) {
        bar.append("../");
        sourceUrl = sourceUrl.substring(sourceUrl.indexOf("/", 1));
    }

    // add the unique part of the target; a target outside the base stays absolute
    String result = targetUrl.startsWith(baseUrl)
            ? bar + targetUrl.substring(baseUrl.length())
            : targetUrl;

    System.out.println("expectation: " + expectedTarget.equals(result));
    System.out.println("result: " + result);
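
The remark about computing the baseUrl could be sketched as follows: walk both URLs while their characters match and remember the last "/" seen. The names `CommonBase` and `commonBase` are hypothetical, and the sketch assumes both URLs are on the same host. For unrelated hosts it only returns the shared scheme prefix such as `http://`, so the caller should check the result before using it as a base.

```java
class CommonBase {

    /** Hypothetical helper: longest shared directory prefix of two URLs, ending in "/". */
    static String commonBase(String sourceUrl, String targetUrl) {
        int limit = Math.min(sourceUrl.length(), targetUrl.length());
        int lastSlash = -1;
        // advance while both URLs agree, tracking the most recent "/" in the shared part
        for (int i = 0; i < limit && sourceUrl.charAt(i) == targetUrl.charAt(i); i++) {
            if (sourceUrl.charAt(i) == '/') {
                lastSlash = i;
            }
        }
        // cut at the last shared "/" so a half-matching segment is not included;
        // returns "" when the URLs share no "/" at all
        return sourceUrl.substring(0, lastSlash + 1);
    }
}
```

Cutting at the last shared "/" rather than at the first differing character matters: `somefolder` and `someotherfolder` share the prefix `some`, but the base must still end at the parent directory.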
answered 2013-09-30T11:59:52.490