html - Why is "&reg" being rendered as "®" without the bounding semicolon

Question

I've been running into a problem that was revealed through our Google adwords-driven marketing campaign. One of the standard parameters used is "region". When a user searches and clicks on a sponsored link, Google generates a long URL to track the click and sends a bunch of stuff along in the referrer. We capture this for our records, and we've noticed that the "Region" parameter is coming through incorrectly. What should be

http://ravercats.com/meow?foo=bar&region=catnip

is instead coming through as:

http://ravercats.com/meow?foo=bar®ion=catnip

I've verified that this occurs in all browsers. It's my understanding that HTML entity syntax is defined as follows:

&VALUE;

where the leading boundary is the ampersand and the closing boundary is the semicolon. Seems straightforward enough. The problem is that this isn't being respected for the &reg; entity, and it's wreaking all kinds of havoc throughout our system.

Does anyone know why this is occurring? Is it a bug in the DTD? (I'm looking for the current HTML DTD to see if I can make sense of it) I'm trying to figure out what would be common across browsers to make this happen, thus my looking for the DTD.

Here is a proof you can use. Take this code, make an HTML file out of it and render it in a browser:

<html>
<a href="http://foo.com/bar?foo=bar&region=US&register=lowpass&reg_test=fail&trademark=correct">http://foo.com/bar?foo=bar&region=US&register=lowpass&reg_test=fail&trademark=correct</a>
</html>

EDIT: To everyone who's suggesting that I need to escape the entire URL, the example URLs above are exactly that, examples. The real URL is coming directly from Google and I have no control over how it is constructed. These suggestions, while valid, don't answer the question: "Why is this happening".

score 43 · Accepted Answer

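Although valid character references always end with a semicolon, modern browsers' HTML parsers still recognize some invalid named character references without the semicolon, for backward-compatibility reasons.

Either you know what that entire list is, or you follow the HTML5 rules for when & is valid without being escaped (e.g. when it is followed by a space), or you always escape & as &amp; whenever in doubt.

For reference, the full list of named character references that are recognized without a semicolon is:

AElig, AMP, Aacute, Acirc, Agrave, Aring, Atilde, Auml, COPY, Ccedil, ETH, Eacute, Ecirc, Egrave, Euml, GT, Iacute, Icirc, Igrave, Iuml, LT, Ntilde, Oacute, Ocirc, Ograve, Oslash, Otilde, Ouml, QUOT, REG, THORN, Uacute, Ucirc, Ugrave, Uuml, Yacute, aacute, acirc, acute, aelig, agrave, amp, aring, atilde, auml, brvbar, ccedil, cedil, cent, copy, curren, deg, divide, eacute, ecirc, egrave, eth, euml, frac12, frac14, frac34, gt, iacute, icirc, iexcl, igrave, iquest, iuml, laquo, lt, macr, micro, middot, nbsp, not, ntilde, oacute, ocirc, ograve, ordf, ordm, oslash, otilde, ouml, para, plusmn, pound, quot, raquo, reg, sect, shy, sup1, sup2, sup3, szlig, thorn, times, uacute, ucirc, ugrave, uml, uuml, yacute, yen, yuml

It should be noted, however, that in attribute values only, conforming HTML5 parsers do not treat such a semicolon-less named character reference as a reference if the next character is an = or an alphanumeric ASCII character.

For the full list of named character references, with or without the ending semicolon, see the named character references table in the HTML specification.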
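The effect can be reproduced outside a browser. The sketch below assumes Python's standard `html.unescape`, which is documented as following the HTML5 rules for valid and invalid character references as they apply to text content (it does not model the attribute-value exception). The URL is the test case from the question.

```python
import html

# Test URL from the question, decoded the way text content is decoded.
url = ("http://foo.com/bar?foo=bar&region=US&register=lowpass"
       "&reg_test=fail&trademark=correct")

decoded = html.unescape(url)

# "&reg" is on the semicolon-less list, so every parameter that starts
# with "reg" is mangled; "&trademark" matches no listed name and survives.
print(decoded)
# → http://foo.com/bar?foo=bar®ion=US®ister=lowpass®_test=fail&trademark=correct
```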

score 13 · Accepted Answer

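This is a very messy business, and it depends on the context (text content vs. attribute value).

Formally, according to the HTML specifications up to and including HTML 4.01, an entity reference may appear without the trailing semicolon if the next character is not a name character. So, for example, &region= would be syntactically correct but undefined, since the entity region has not been defined. XHTML makes the trailing semicolon required.

Browsers have traditionally played by other rules, though. Because of the common syntax of query URLs, they parse e.g. href="http://ravercats.com/meow?foo=bar&region=catnip" so that &region is not treated as an entity reference but as plain text data. Authors have mostly used such constructs, even though they are formally incorrect.

Contrary to what the question seems to be saying, href="http://ravercats.com/meow?foo=bar&region=catnip" actually works well. Problems arise when the string is not in an attribute value but in text content, which is rather uncommon: we don't normally write URLs in text. In text, &region= is processed so that &reg is recognized as an entity reference (for "®") and the rest is just character data. This odd behavior is being made official in HTML5 CR, where clause 8.2.4.69 Tokenizing character references describes the "double standard":

If the character reference is being consumed as part of an attribute, and the last character matched is not a ";" (U+003B) character, and the next character is either a "=" (U+003D) character or in the range ASCII digits, uppercase ASCII letters, or lowercase ASCII letters, then, for historical reasons, all the characters that were matched after the U+0026 AMPERSAND character (&) must be unconsumed, and nothing is returned.

Thus, in an attribute value, even &reg= would not be treated as containing a character reference, and still less would &region=. (But reg_test= is a different case, because of the underscore character.)

In text content, other rules apply. There, the construct &region= causes a parse error (by HTML5 CR rules), but one with well-defined error handling: &reg is recognized as a character reference.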
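The "double standard" can be sketched in a few lines. This is a simplified illustration, not a full tokenizer: it handles only named references made of ASCII letters and digits, and the function name `decode_refs` and its `in_attribute` flag are made up for this example. It relies on Python's `html.entities.html5` table, in which the keys that lack a trailing ";" are the legacy semicolon-less names.

```python
import re
from html.entities import html5

# "&" followed by a run of ASCII letters/digits and an optional ";".
_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*)(;?)")

def decode_refs(text: str, in_attribute: bool) -> str:
    """Decode named character references; hypothetical helper, not a real API."""
    def replace(match: re.Match) -> str:
        run, semi = match.group(1), match.group(2)
        if semi and run + ";" in html5:
            return html5[run + ";"]          # properly terminated reference
        # Otherwise look for the longest legacy (semicolon-less) name
        # that prefixes the run.
        for end in range(len(run), 0, -1):
            name = run[:end]
            if name in html5:
                rest = run[end:] + semi
                following = rest[:1] or text[match.end():match.end() + 1]
                is_ascii_alnum = following.isascii() and following.isalnum()
                if in_attribute and (following == "=" or is_ascii_alnum):
                    return match.group(0)    # historical exception: leave as text
                return html5[name] + rest
        return match.group(0)                # no known name: leave as text
    return _REF.sub(replace, text)

query = "?foo=bar&region=catnip"
print(decode_refs(query, in_attribute=False))  # text content: "&reg" is decoded
print(decode_refs(query, in_attribute=True))   # attribute value: left untouched
```

With `in_attribute=True`, `&reg_test=fail` is still decoded, because the character after `&reg` is an underscore rather than "=" or an ASCII alphanumeric.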

score 9 · Accepted Answer

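Maybe try replacing your & with &amp;? Ampersands are also characters that must be escaped in HTML, because they are reserved for use as part of entities.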

score 4 · Accepted Answer

1: First of all, the following markup is invalid (checked with the W3C Markup Validation Service):

<a href="http://foo.com/bar?foo=bar&region=US&register=lowpass&reg_test=fail&trademark=correct"></a>

In the above example, the & character should be encoded as &amp;, like so:

<a href="http://foo.com/bar?foo=bar&amp;region=US&amp;register=lowpass&amp;reg_test=fail&amp;trademark=correct"></a>

2: Browsers are tolerant; they try to make sense of broken HTML. In your case, everything that could be a valid HTML entity is converted to that entity.
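The escaping described in point 1 does not have to be done by hand. As a minimal sketch, assuming the page is generated from Python, the standard `html.escape` function turns each & into &amp; before the URL is written into markup. The helper name `link_markup` is made up for this example.

```python
import html

def link_markup(url: str) -> str:
    """Build an anchor whose href and link text are both HTML-escaped."""
    safe = html.escape(url, quote=True)   # & -> &amp;, " -> &quot;, etc.
    return f'<a href="{safe}">{safe}</a>'

url = "http://ravercats.com/meow?foo=bar&region=catnip"
markup = link_markup(url)
print(markup)

# Decoding the escaped form gives back the original URL unchanged.
round_trip = html.unescape(html.escape(url))
```

Escaping both the attribute value and the link text matters here, because the text-content case is the one where &region is otherwise read as "®ion".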

score 4 · Accepted Answer

Here is a simple solution, though it may not work in every case.

So go from this:

http://ravercats.com/meow?status=Online&region=Atlantis

To this:

http://ravercats.com/meow?region=Atlantis&status=Online

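Because &reg, as we know, triggers the special character ®.

Caveat: if you have no control over the order of your URL query string parameters, you will have to change your variable name to something else.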
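To see which parameter names are affected by their position, a small check can help. This is a sketch under the assumption that the query string ends up unescaped in text content; the function name `risky_params` is made up, and the legacy names are taken from Python's `html.entities.html5` table (the keys without a trailing ";").

```python
from html.entities import html5
from urllib.parse import parse_qsl

# Names that are recognized without a trailing semicolon.
LEGACY = tuple(name for name in html5 if not name.endswith(";"))

def risky_params(query: str) -> list[str]:
    """Return parameter names that follow an '&' and start with a legacy name."""
    pairs = parse_qsl(query, keep_blank_values=True)
    # The first parameter follows '?', not '&', so it cannot be misread.
    return [name for name, _ in pairs[1:] if name.startswith(LEGACY)]

print(risky_params("status=Online&region=Atlantis"))   # ['region']
print(risky_params("region=Atlantis&status=Online"))   # []
```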

score 1 · Accepted Answer

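Escape your output!

Simply put, you need to encode the URL into HTML format for it to be represented accurately (ideally with a template engine's variable-escaping function, but failing that, with htmlspecialchars($url) or htmlentities($url) in PHP).

See your test case, and then the correctly encoded HTML, in this jsfiddle: http://jsfiddle.net/tchalvakspam/Fp3W6/

Inactive code here: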

<div>
Unescaped:
<br>
<a href="">http://foo.com/bar?foo=bar&region=US&register=lowpass&reg_test=fail&trademark=correct</a>
</div>

<div>
Correctly escaped:
<br>
http://foo.com/bar?foo=bar&amp;region=US&amp;register=lowpass&amp;reg_test=fail&amp;trademark=correct
</div>

score 1 · Accepted Answer

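It seems to me that what you receive from Google is not an actual URL but a variable that refers to a URL (the query string). That is why it gets parsed as the registered-trademark sign when it is rendered.

I would say you should URL-encode it, and decode it whenever you process it, just like any other variable containing special entities.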

score -4 · Accepted Answer

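To prevent this from happening, you should URL-encode your URLs, which replaces characters such as the ampersand with a % followed by the character's hexadecimal code.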
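As a rough illustration of what percent-encoding does here, the sketch below uses Python's standard `urllib.parse.quote` and `unquote` on the example URL from the question. It assumes the whole URL is being stored or passed along as a single value; percent-encoding the separators of a URL that is meant to be used as a link would change its meaning.

```python
import html
from urllib.parse import quote, unquote

url = "http://ravercats.com/meow?foo=bar&region=catnip"

# safe="" percent-encodes every reserved character, including '&' (%26).
encoded = quote(url, safe="")
print(encoded)

# No '&' is left, so HTML character-reference decoding has nothing to act on.
unchanged = html.unescape(encoded)

# Decoding when the value is processed restores the original URL.
restored = unquote(encoded)
```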
