parsing - How to decode reserved escaped characters in a request URI on a web server?

Question

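Clearly, a web server must decode any escaped unreserved characters (alphanumerics, for example) before comparing URIs. For example, http://www.example.com/~user/index.htm should be treated as the same as http://www.example.com/%7Euser/index.htm.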

My question is: how should we handle escaped reserved characters?

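One example is %2F, i.e. /. If there is a %2F in the request URI, should the web server's parser replace it with /? In the example above, that would mean http://www.example.com/~user%2Findex.htm is the same as http://www.example.com/~user/index.htm. However, when I tried it on an Apache server (2.2.17, Unix), it appeared to return a "404 Not Found" error.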

So does this mean that %2F and other escaped reserved characters should be left alone (at least until after URI comparison)?

Background:

RFC 2616 (HTTP 1.1) mentions the escape-decoding issue in two places:

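The Request-URI is transmitted in the format specified in section 3.2.1. If the Request-URI is encoded using the "% HEX HEX" encoding [42], the origin server MUST decode the Request-URI in order to properly interpret the request. Servers SHOULD respond to invalid Request-URIs with an appropriate status code.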

and

Characters other than those in the "reserved" and "unsafe" sets (see RFC 2396 [42]) are equivalent to their ""%" HEX HEX" encoding.

(According to http://trac.tools.ietf.org/wg/httpbis/trac/ticket/2, "unsafe" is a mistake and should be removed from the spec, so only "reserved" matters here.)

For reference, RFC 2396 defines these character classes as follows:

reserved = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" | "$" | ","

unreserved = alphanum | mark

mark = "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")"

score 3 · Accepted Answer

tl;dr:

Decode percent-encoded unreserved characters,
keep percent-encoded reserved characters.

The URI standard is STD 66, which currently is RFC 3986.

Section 6 is about Normalization and Comparison, where section 6.2.2.2 explains what to do with percent-encoded octets:

These URIs should be normalized by decoding any percent-encoded octet that corresponds to an unreserved character […]
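A minimal Python sketch of that normalization step may make the rule concrete. The function name `normalize_percent_encoding` is hypothetical, and the unreserved set used is the one from RFC 3986 (ALPHA, DIGIT, "-", ".", "_", "~"), which is narrower than the RFC 2396 "mark" set quoted in the question. Escapes that are left in place are upper-cased, on the assumption that the case normalization from section 6.2.2.1 is wanted as well.

```python
import re

# Unreserved characters per RFC 3986: ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)

# Matches one percent-encoded octet such as %7E or %2f
_PCT = re.compile(r"%([0-9A-Fa-f]{2})")


def normalize_percent_encoding(uri):
    """Decode only the escapes that stand for unreserved characters.

    Every other escape (reserved characters, non-ASCII octets, "%25")
    is kept, with its hex digits upper-cased.
    """
    def _replace(match):
        char = chr(int(match.group(1), 16))
        if char in UNRESERVED:
            return char                          # e.g. %7E -> ~
        return "%" + match.group(1).upper()      # e.g. %2f -> %2F

    # re.sub scans once, left to right, so "%257E" stays "%257E"
    # instead of being decoded twice.
    return _PCT.sub(_replace, uri)
```

With this sketch, `http://www.example.com/%7Euser/index.htm` normalizes to `http://www.example.com/~user/index.htm`, while `http://www.example.com/~user%2Findex.htm` is returned unchanged, so the two URIs from the question still compare as different.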

As explicitly stated in section 2 (bold emphasis mine):

Unreserved characters:

URIs that differ in the replacement of an unreserved character with its corresponding percent-encoded US-ASCII octet are equivalent

Reserved characters:

URIs that differ in the replacement of a reserved character with its corresponding percent-encoded octet are not equivalent.
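To see why an escaped reserved character is not interchangeable with the literal one, it can help to split the path on its real "/" delimiters first and only then decode each segment. The sketch below does this with Python's standard `urllib.parse`; the helper name `path_segments` is hypothetical, and this is an illustration of the idea, not a description of how Apache or any other server resolves paths.

```python
from urllib.parse import unquote, urlsplit


def path_segments(uri):
    """Return the decoded path segments of a URI.

    The raw path is split on literal "/" before any decoding, so an
    escaped "%2F" stays inside its segment as data and is never
    treated as a delimiter.
    """
    raw_path = urlsplit(uri).path          # urlsplit leaves escapes untouched
    return [unquote(segment) for segment in raw_path.split("/")[1:]]


print(path_segments("http://www.example.com/~user/index.htm"))
# ['~user', 'index.htm']
print(path_segments("http://www.example.com/~user%2Findex.htm"))
# ['~user/index.htm']
```

The first URI has two segments, while the second has a single segment whose name happens to contain a slash, so the two identify different resources. Decoding the whole URI before splitting would erase that distinction.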
