java - HTTP headers encoding/decoding in Java

Question

A custom HTTP header is being passed to a Servlet application for authentication purposes. The header value must be able to contain accents and other non-ASCII characters, so must be in a certain encoding (ideally UTF-8).

I am provided with this piece of Java code by the developers who control the authentication environment:

String firstName = request.getHeader("my-custom-header"); 
String decodedFirstName = new String(firstName.getBytes(),"UTF-8");

But this code doesn't look right to me: it presupposes the encoding of the header value, when it seemed to me that there was a proper way of specifying an encoding for header values (from MIME I believe).

Here is my question: what is the right way (tm) of dealing with custom header values that need to support a UTF-8 encoding:

on the wire (what the header looks like over the wire)
from the decoding point of view (how to decode it using the Java Servlet API, and whether we can assume that request.getHeader() already does the decoding properly)

Here is an environment independent code sample to treat headers as UTF-8 in case you can't change your service:

String valueAsISO = request.getHeader("my-custom-header");
String valueAsUTF8 = new String(valueAsISO.getBytes("ISO-8859-1"), "UTF-8");

score 7 · Accepted Answer

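Once again: RFC 2047 is not implemented in practice. The next revision of HTTP/1.1 is going to remove any mention of it.

So, if you need to transport non-ASCII characters, the safest way is to encode them into a sequence of ASCII, as the "Slug" header in the Atom Publishing Protocol does.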
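To illustrate the ASCII-only approach, here is a minimal sketch of percent-encoding the UTF-8 bytes of a value, which is the general idea behind the Slug header. The class name HeaderPercentCodec and the choice of which bytes to escape are assumptions made for this example, not something taken from a specification or from the Servlet API.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: keeps printable ASCII as-is and escapes everything else as %XX.
final class HeaderPercentCodec {
    private static final String HEX = "0123456789ABCDEF";

    private HeaderPercentCodec() {}

    /** Percent-encodes each UTF-8 byte that is not printable ASCII, and '%' itself. */
    static String encode(String value) {
        StringBuilder out = new StringBuilder();
        for (byte b : value.getBytes(StandardCharsets.UTF_8)) {
            int c = b & 0xFF;
            if (c >= 0x20 && c <= 0x7E && c != '%') {
                out.append((char) c);
            } else {
                out.append('%').append(HEX.charAt(c >> 4)).append(HEX.charAt(c & 0x0F));
            }
        }
        return out.toString();
    }

    /** Reverses encode(): turns %XX back into bytes and decodes the result as UTF-8. */
    static String decode(String headerValue) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        int i = 0;
        while (i < headerValue.length()) {
            char ch = headerValue.charAt(i);
            if (ch > 0x7F) {
                throw new IllegalArgumentException("expected an ASCII-only header value");
            }
            if (ch == '%' && i + 2 < headerValue.length()) {
                int hi = Character.digit(headerValue.charAt(i + 1), 16);
                int lo = Character.digit(headerValue.charAt(i + 2), 16);
                if (hi >= 0 && lo >= 0) {
                    bytes.write((hi << 4) | lo);
                    i += 3;
                    continue;
                }
            }
            bytes.write(ch); // literal ASCII character (or a stray '%')
            i++;
        }
        return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String wire = encode("Jos\u00e9");
        System.out.println(wire);                              // Jos%C3%A9
        System.out.println(decode(wire).equals("Jos\u00e9"));  // true
    }
}
```

Because the value on the wire is plain ASCII, it does not matter which charset the container assumes when it builds the String returned by request.getHeader().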

score 6 · Accepted Answer

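As mentioned already, the first place to look should always be the HTTP 1.1 spec (RFC 2616). It says that text in header values must use the MIME encoding defined in RFC 2047 if it contains characters from character sets other than ISO-8859-1.

So here's a plus for you. If your requirements are covered by the ISO-8859-1 charset, then you just put your characters into your request/response messages. Otherwise MIME encoding is the only alternative.

As long as the user agent sends the values of your custom headers according to these rules, you won't have to worry about decoding them. That's what the Servlet API should do.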

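However, there's a more basic reason why your code snippet doesn't do what it's supposed to. The first line fetches the header value as a Java String. A String holds already-decoded characters (internally UTF-16), not the raw header bytes, so at this point the parsing of the HTTP request message is already done and finished.

The next line fetches the byte array of this string. Since no encoding was specified (IMHO this no-argument method should have been deprecated long ago), the current system default encoding is used, which is often not UTF-8, and then the array is converted again as if it were UTF-8 encoded. Ouch.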
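The effect described above can be reproduced without a servlet container. The sketch below assumes a sender that writes raw UTF-8 bytes into the header and a container that decodes header bytes as ISO-8859-1; the class and method names are made up for the illustration, and US-ASCII merely stands in for an unlucky platform default charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it only simulates what a container might do.
final class HeaderCharsetDemo {
    private HeaderCharsetDemo() {}

    /** Simulates a container that decodes the raw header bytes as ISO-8859-1. */
    static String asContainerSeesIt(byte[] wireBytes) {
        return new String(wireBytes, StandardCharsets.ISO_8859_1);
    }

    /** Turns the String back into bytes with the given charset, then decodes them as UTF-8. */
    static String reinterpretAsUtf8(String headerValue, Charset bytesCharset) {
        return new String(headerValue.getBytes(bytesCharset), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] wire = "Jos\u00e9".getBytes(StandardCharsets.UTF_8); // what the sender wrote
        String header = asContainerSeesIt(wire);                    // two junk characters replace the accent

        // ISO-8859-1 maps every byte to one character, so the original bytes survive the round trip.
        String ok = reinterpretAsUtf8(header, StandardCharsets.ISO_8859_1);
        // A charset that cannot represent those characters replaces them with '?'.
        String lossy = reinterpretAsUtf8(header, StandardCharsets.US_ASCII);

        System.out.println(ok.equals("Jos\u00e9")); // true
        System.out.println(lossy);                  // Jos??
    }
}
```

The round trip only works when the charset passed to getBytes() is the same one the container used to build the String, which is why relying on the platform default is fragile.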

score 5 · Accepted Answer

The HTTPbis working group is aware of the issue, and the latest drafts get rid of all the language with respect to TEXT and RFC 2047 encoding -- it is not used in practice over HTTP.

See http://trac.tools.ietf.org/wg/httpbis/trac/ticket/74 for the whole story.

score 4 · Accepted Answer

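See the HTTP spec for the rules, which states in section 2.2:

The TEXT rule is only used for descriptive field contents and values that are not intended to be interpreted by the message parser. Words of *TEXT MAY contain characters from character sets other than ISO-8859-1 [22] only when encoded according to the rules of RFC 2047 [14].

The code above will not correctly decode an RFC 2047 encoded string, which leads me to believe that the service doesn't follow the spec correctly and simply embeds raw UTF-8 data in the header.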

score 3 · Accepted Answer

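Thanks for the answers. It seems that the ideal would be to follow the proper HTTP header encoding as per RFC 2047. A header value in UTF-8 on the wire would look something like this: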

=?UTF-8?Q?...?=

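Now here is the funny thing: it seems that neither Tomcat 5.5 nor Tomcat 6 properly decodes HTTP headers as per RFC 2047! The Tomcat code assumes everywhere that header values use ISO-8859-1.

So for Tomcat specifically, I will work around this by writing a filter that handles proper decoding of the header values.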
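The decoding step such a filter would need can be sketched roughly as follows. This is a simplified illustration of RFC 2047 encoded-word decoding, not Tomcat code and not a complete implementation: the class name Rfc2047Sketch is hypothetical, an unknown charset name makes Charset.forName throw, and whitespace between adjacent encoded words is not collapsed as the RFC requires.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.util.Base64;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: decodes =?charset?Q?...?= and =?charset?B?...?= words in a header value.
final class Rfc2047Sketch {
    private static final Pattern ENCODED_WORD =
            Pattern.compile("=\\?([^?\\s]+)\\?([bBqQ])\\?([^?\\s]*)\\?=");

    private Rfc2047Sketch() {}

    /** Replaces each encoded word with its decoded text; other text is left untouched. */
    static String decode(String headerValue) {
        Matcher m = ENCODED_WORD.matcher(headerValue);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            Charset charset = Charset.forName(m.group(1));
            byte[] raw = "B".equalsIgnoreCase(m.group(2))
                    ? Base64.getDecoder().decode(m.group(3))
                    : decodeQ(m.group(3));
            m.appendReplacement(out, Matcher.quoteReplacement(new String(raw, charset)));
        }
        m.appendTail(out);
        return out.toString();
    }

    /** Q encoding: '_' stands for a space, =XX is a hex-encoded byte, the rest is literal ASCII. */
    private static byte[] decodeQ(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            if (ch == '_') {
                bytes.write(' ');
            } else if (ch == '=' && i + 2 < text.length()
                    && Character.digit(text.charAt(i + 1), 16) >= 0
                    && Character.digit(text.charAt(i + 2), 16) >= 0) {
                int hi = Character.digit(text.charAt(i + 1), 16);
                int lo = Character.digit(text.charAt(i + 2), 16);
                bytes.write((hi << 4) | lo);
                i += 2; // skip the two hex digits just consumed
            } else {
                bytes.write(ch);
            }
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        String decoded = decode("=?UTF-8?Q?Jos=C3=A9?=");
        System.out.println(decoded.equals("Jos\u00e9")); // true
    }
}
```

In a servlet filter, a method like this could be applied inside an HttpServletRequestWrapper that overrides getHeader(), so the rest of the application sees decoded values.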
