html - Resolving URIs with multiple slashes in the relative part

Question

I have to write a Perl script that parses URIs out of HTML. The real problem, though, is how to resolve relative URIs.

I have a base URI (the base href in the HTML), for example http://a/b/c/d;p?q (let's go by RFC 3986), and various other URIs:

/g, //g, ///g, ////g, h//g, g////h, h///g:f

Section 5.4.1 of that RFC only has an example for //g:

"//g" = "http://g"

What about all the other cases? As far as I understand from RFC 3986, section 3.3, multiple slashes are allowed. So, is the following resolution correct?

"///g" = "http://a/b/c///g"

Or what should it be? Can anyone explain it better and back it up with an RFC or documentation that is not obsolete?

Update #1: Try this working URL - https:///stackoverflow.com////////a////10161264/////6618577

What is going on here?

score 5 · Accepted Answer

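I'll start by confirming that all the URIs you provided are valid, and by showing the results of resolving the URIs you mentioned (plus a couple of my own):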

$ perl -MURI -e'
   for my $rel (qw( /g //g ///g ////g h//g g////h h///g:f )) {
      my $uri = URI->new($rel)->abs("http://a/b/c/d;p?q");
      printf "%-20s + %-7s = %-20s   host: %-4s   path: %s\n",
         "http://a/b/c/d;p?q", $rel, $uri, $uri->host, $uri->path;
   }

   for my $base (qw( http://host/a/b/c/d http://host/a/b/c//d )) {
      my $uri = URI->new("../../e")->abs($base);
      printf "%-20s + %-7s = %-20s   host: %-4s   path: %s\n",
         $base, "../../e", $uri, $uri->host, $uri->path;
   }
'
http://a/b/c/d;p?q   + /g      = http://a/g             host: a      path: /g
http://a/b/c/d;p?q   + //g     = http://g               host: g      path:
http://a/b/c/d;p?q   + ///g    = http:///g              host:        path: /g
http://a/b/c/d;p?q   + ////g   = http:////g             host:        path: //g
http://a/b/c/d;p?q   + h//g    = http://a/b/c/h//g      host: a      path: /b/c/h//g
http://a/b/c/d;p?q   + g////h  = http://a/b/c/g////h    host: a      path: /b/c/g////h
http://a/b/c/d;p?q   + h///g:f = http://a/b/c/h///g:f   host: a      path: /b/c/h///g:f
http://host/a/b/c/d  + ../../e = http://host/a/e        host: host   path: /a/e
http://host/a/b/c//d + ../../e = http://host/a/b/e      host: host   path: /a/b/e

Next, we'll look at the syntax of relative URIs, since that's what your question revolves around.

relative-ref  = relative-part [ "?" query ] [ "#" fragment ]

relative-part = "//" authority path-abempty
              / path-absolute
              / path-noscheme
              / path-empty

path-abempty  = *( "/" segment )
path-absolute = "/" [ segment-nz *( "/" segment ) ]
path-noscheme = segment-nz-nc *( "/" segment )
path-rootless = segment-nz *( "/" segment )

segment       = *pchar         ; 0 or more <pchar>
segment-nz    = 1*pchar        ; 1 or more <pchar>   nz = non-zero

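The key points in these rules for answering your question:

An absolute path (path-absolute) can't start with //. The first segment, if provided, must be non-zero in length. If the relative URI starts with //, what follows must be an authority.
// can otherwise occur in a path, because segments can have zero length.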
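To make the distinction concrete, here is a minimal sketch in plain Perl that splits each reference with the regular expression from RFC 3986, Appendix B. The helper name parse_uri_ref is made up for this illustration; the point is that an absent authority comes back as undef, while a reference starting with // always has a defined (possibly empty) authority.

```perl
use strict;
use warnings;

# Split a URI reference into its five components using the regex from
# RFC 3986, Appendix B. A component that is absent is undef, which is
# distinct from a component that is present but empty ("").
# parse_uri_ref is a hypothetical helper name, not part of any module.
sub parse_uri_ref {
    my ($ref) = @_;
    $ref =~ m{^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?}s
        or die "unreachable: the pattern matches any string";
    return {
        scheme    => $2,
        authority => $4,
        path      => $5,
        query     => $7,
        fragment  => $9,
    };
}

for my $ref (qw( /g //g ///g ////g h//g g////h h///g:f )) {
    my $r = parse_uri_ref($ref);
    printf "%-8s authority: %-6s path: %s\n",
        $ref,
        defined $r->{authority} ? qq("$r->{authority}") : 'undef',
        qq("$r->{path}");
}
```

Only the references that begin with // report a defined authority; in h//g and g////h the double slashes are just empty path segments.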

Now, let's look at each of the resolutions you provided in turn.

/g is an absolute path (path-absolute), and thus a valid relative reference (relative-ref), and thus a valid URI (URI-reference).

Parsing the URIs (say, using the regex in Appendix B) gives us the following:

Base.scheme:    "http"       R.scheme:    undef
Base.authority: "a"          R.authority: undef
Base.path:      "/b/c/d;p"   R.path:      "/g"
Base.query:     "q"          R.query:     undef
Base.fragment:  undef        R.fragment:  undef

Following the algorithm in §5.2.2, we get:

T.path:         "/g"      ; remove_dot_segments(R.path)
T.query:        undef     ; R.query
T.authority:    "a"       ; Base.authority
T.scheme:       "http"    ; Base.scheme
T.fragment:     undef     ; R.fragment

Following the algorithm in §5.3, we get:
```
http://a/g
```
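The steps used in this walkthrough (Appendix B parsing, the §5.2.2 transformation, the §5.2.3 merge, §5.2.4 dot-segment removal and §5.3 recomposition) can be sketched in plain Perl as follows. This is an illustrative sketch of the strict algorithm, not the URI module's implementation; all sub names are invented for the example.

```perl
use strict;
use warnings;

# Appendix B: absent components are undef; present-but-empty ones are "".
sub parse_uri_ref {
    my ($ref) = @_;
    $ref =~ m{^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?}s;
    return { scheme => $2, authority => $4, path => $5,
             query  => $7, fragment  => $9 };
}

# Section 5.2.4: interpret and remove "." and ".." segments.
sub remove_dot_segments {
    my ($in) = @_;
    my $out = '';
    while (length $in) {
        if    ($in =~ s{^\.\.?/}{})         { }                         # A: leading "./" or "../"
        elsif ($in =~ s{^/\.(?:/|\z)}{/})   { }                         # B: "/./" or "/."
        elsif ($in =~ s{^/\.\.(?:/|\z)}{/}) { $out =~ s{/?[^/]*\z}{} }  # C: drop last output segment
        elsif ($in =~ /^\.\.?\z/)           { $in = '' }                # D: lone "." or ".."
        else  { $in =~ s{^(/?[^/]*)}{}; $out .= $1 }                    # E: move one segment over
    }
    return $out;
}

# Section 5.2.3: merge a relative path with the base path.
sub merge_paths {
    my ($base, $ref_path) = @_;
    return "/$ref_path" if defined $base->{authority} && $base->{path} eq '';
    (my $dir = $base->{path}) =~ s{[^/]*\z}{};   # keep everything up to the last "/"
    return $dir . $ref_path;
}

# Sections 5.2.2 (transform) and 5.3 (recompose), strict parsing.
sub resolve {
    my ($base, $ref) = @_;
    my $B = parse_uri_ref($base);
    my $R = parse_uri_ref($ref);
    my %T = (scheme   => $B->{scheme}, authority => $B->{authority},
             fragment => $R->{fragment});
    if (defined $R->{scheme}) {
        @T{qw(scheme authority query)} = @$R{qw(scheme authority query)};
        $T{path} = remove_dot_segments($R->{path});
    }
    elsif (defined $R->{authority}) {
        @T{qw(authority query)} = @$R{qw(authority query)};
        $T{path} = remove_dot_segments($R->{path});
    }
    elsif ($R->{path} eq '') {
        $T{path}  = $B->{path};
        $T{query} = defined $R->{query} ? $R->{query} : $B->{query};
    }
    else {
        $T{query} = $R->{query};
        $T{path}  = $R->{path} =~ m{^/}
            ? remove_dot_segments($R->{path})
            : remove_dot_segments(merge_paths($B, $R->{path}));
    }
    my $uri = '';
    $uri .= "$T{scheme}:"     if defined $T{scheme};
    $uri .= "//$T{authority}" if defined $T{authority};
    $uri .= $T{path};
    $uri .= "?$T{query}"      if defined $T{query};
    $uri .= "#$T{fragment}"   if defined $T{fragment};
    return $uri;
}

for my $ref (qw( /g //g ///g ////g h//g g////h h///g:f )) {
    printf "%-8s => %s\n", $ref, resolve('http://a/b/c/d;p?q', $ref);
}
```

For the seven references in the question, this sketch gives the same resolutions as the URI module output at the top of this answer.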

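//g is different. //g is not an absolute path (path-absolute), because an absolute path can't start with an empty segment ("/" [ segment-nz *( "/" segment ) ]).

Instead, it follows this pattern:

"//" authority path-abempty

Parsing the URIs (say, using the regex in Appendix B) gives us the following: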

Base.scheme:    "http"       R.scheme:    undef
Base.authority: "a"          R.authority: "g"
Base.path:      "/b/c/d;p"   R.path:      ""
Base.query:     "q"          R.query:     undef
Base.fragment:  undef        R.fragment:  undef

Following the algorithm in §5.2.2, we get the following:

T.authority:    "g"           ; R.authority
T.path:         ""            ; remove_dot_segments(R.path)
T.query:        undef         ; R.query
T.scheme:       "http"        ; Base.scheme
T.fragment:     undef         ; R.fragment

Following the algorithm in §5.3, we get the following:
```
http://g
```

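Note: This contacts server g!

///g is similar to //g, except that the authority is blank! Surprisingly, this is valid.

Parsing the URIs (say, using the regex in Appendix B) gives us the following: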

Base.scheme:    "http"       R.scheme:    undef
Base.authority: "a"          R.authority: ""
Base.path:      "/b/c/d;p"   R.path:      "/g"
Base.query:     "q"          R.query:     undef
Base.fragment:  undef        R.fragment:  undef

Following the algorithm in §5.2.2, we get the following:

T.authority:    ""        ; R.authority
T.path:         "/g"      ; remove_dot_segments(R.path)
T.query:        undef     ; R.query
T.scheme:       "http"    ; Base.scheme
T.fragment:     undef     ; R.fragment

Following the algorithm in §5.3, we get the following:
```
http:///g
```
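The three slashes in that result come from the recomposition step testing whether the authority is defined, not whether it is non-empty. A small sketch of §5.3 in Perl (the sub name recompose and its hash-style arguments are choices made for this example) shows the difference between an empty authority and an undefined one:

```perl
use strict;
use warnings;

# Component recomposition from RFC 3986, section 5.3. Each delimiter is
# emitted when its component is *defined*, even if it is the empty string.
# recompose is a hypothetical helper name used only for this illustration.
sub recompose {
    my (%t) = @_;
    my $uri = '';
    $uri .= "$t{scheme}:"     if defined $t{scheme};
    $uri .= "//$t{authority}" if defined $t{authority};
    $uri .= $t{path};
    $uri .= "?$t{query}"      if defined $t{query};
    $uri .= "#$t{fragment}"   if defined $t{fragment};
    return $uri;
}

print recompose(scheme => 'http', authority => '',  path => '/g'), "\n";  # empty authority
print recompose(scheme => 'http', authority => 'a', path => '/g'), "\n";  # authority "a"
print recompose(scheme => 'http',                   path => '/g'), "\n";  # no authority at all
```

The first call prints http:///g, the second http://a/g, and the third http:/g, so an empty authority still contributes its // delimiter.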

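Note: While valid, this URI is useless because the server name (T.authority) is blank!

////g is the same as ///g except that R.path is //g, so we get

    http:////g

Note: While valid, this URI is useless because the server name (T.authority) is blank!

The final three (h//g, g////h, h///g:f) are all relative paths (path-noscheme).

Parsing the URIs (say, using the regex in Appendix B) gives us the following: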

Base.scheme:    "http"       R.scheme:    undef
Base.authority: "a"          R.authority: undef
Base.path:      "/b/c/d;p"   R.path:      "h//g"
Base.query:     "q"          R.query:     undef
Base.fragment:  undef        R.fragment:  undef

Following the algorithm in §5.2.2, we get the following:

T.path:         "/b/c/h//g"    ; remove_dot_segments(merge(Base.path, R.path))
T.query:        undef          ; R.query
T.authority:    "a"            ; Base.authority
T.scheme:       "http"         ; Base.scheme
T.fragment:     undef          ; R.fragment

Following the algorithm in §5.3, we get the following:

http://a/b/c/h//g         # For h//g
http://a/b/c/g////h       # For g////h
http://a/b/c/h///g:f      # For h///g:f

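I don't think these examples are suitable for answering what I think you really want to know, though.

Take a look at the following two URIs. They aren't equivalent.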

http://host/a/b/c/d     # Path has 4 segments: "a", "b", "c", "d"

and

http://host/a/b/c//d    # Path has 5 segments: "a", "b", "c", "", "d"

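Most servers will treat them the same way, which is fine since servers are free to interpret paths in any way they wish, but it makes a difference when applying relative paths. For example, if these were the base URI for ../../e, you'd get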

http://host/a/b/c/d + ../../e = http://host/a/e

and

http://host/a/b/c//d + ../../e = http://host/a/b/e
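The difference is easier to see when the path is treated as a list of segments. The sketch below is a segment-stack simplification, not the RFC's string-buffer algorithm: it assumes an absolute base path and a relative path with no query or fragment, and resolve_path is a name invented for the example. Each .. removes exactly one segment, and the empty segment between c and d counts as one.

```perl
use strict;
use warnings;

# Simplified path resolution on a stack of segments.
# Assumptions (not from the RFC text): the base path starts with "/",
# and $rel_path is a bare relative path without query or fragment.
sub resolve_path {
    my ($base_path, $rel_path) = @_;
    (my $dir = $base_path) =~ s{[^/]*\z}{};     # merge: keep up to the last "/"
    my @in = split m{/}, $dir . $rel_path, -1;  # -1 keeps empty segments
    shift @in;                                  # drop the field before the leading "/"
    my @out;
    for my $i (0 .. $#in) {
        my $seg = $in[$i];
        if ($seg eq '.' || $seg eq '..') {
            pop @out if $seg eq '..';           # ".." removes one segment, even an empty one
            push @out, '' if $i == $#in;        # a trailing dot segment leaves a trailing "/"
        }
        else {
            push @out, $seg;
        }
    }
    return '/' . join '/', @out;
}

for my $base_path (qw( /a/b/c/d /a/b/c//d )) {
    printf "%-10s + ../../e = %s\n", $base_path, resolve_path($base_path, '../../e');
}
```

With /a/b/c/d as the base path the two .. segments remove c and b, giving /a/e; with /a/b/c//d they remove the empty segment and c, giving /a/b/e.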

score 1 · Answer

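I was curious what Mojo::URL would do, so I checked. There's a big caveat, because it doesn't claim strict compliance:

Mojo::URL implements a subset of RFC 3986, RFC 3987 and the URL Living Standard for Uniform Resource Locators with support for IDNA and IRIs.

Here's the program.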

use v5.10;        # for say
use strict;
use warnings;
use Mojo::URL;

my @urls = qw(/g //g ///g ////g h//g g////h h///g:f
    https:///stackoverflow.com////////a/////10161264/////6618577
    );
my @parts = qw(scheme host port path query);
my $template = join "\n", map { "$_: %s" } @parts;

my $base_url = Mojo::URL->new( 'http://a/b/c/d;p?q' );

foreach my $u ( @urls ) {
    # Resolve each reference against the base URL
    my $url = Mojo::URL->new( $u )->base( $base_url )->to_abs;

    no warnings qw(uninitialized);
    say '-' x 40;
    # The template has no trailing newline, so the next separator
    # is printed right after "query: "
    printf "%s\n$template", $u, map { $url->$_() } @parts
}

Here's the output:

----------------------------------------
/g
scheme: http
host: a
port:
path: /g
query: ----------------------------------------
//g
scheme: http
host: g
port:
path:
query: ----------------------------------------
///g
scheme: http
host: a
port:
path: /g
query: ----------------------------------------
////g
scheme: http
host: a
port:
path: //g
query: ----------------------------------------
h//g
scheme: http
host: a
port:
path: /b/c/h/g
query: ----------------------------------------
g////h
scheme: http
host: a
port:
path: /b/c/g/h
query: ----------------------------------------
h///g:f
scheme: http
host: a
port:
path: /b/c/h/g:f
query: ----------------------------------------
https:///stackoverflow.com////////a/////10161264/////6618577
scheme: https
host:
port:
path: /stackoverflow.com////////a/////10161264/////6618577
query:

score -1 · Answer

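No - ///g would seem more equivalent to /g. The "dot-segments" (. and ..) are what is used to navigate up and down the hierarchy in http URLs. See also the URI module for dealing with paths in URIs.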
