By default, everything in XML is parsed character data (#PCDATA), so why do we need to specify #PCDATA in the DTD? Could someone please explain? Thanks.

1 Answer

I'm not sure which of the following questions you are asking.

Question 1: What is the point of having a #PCDATA keyword in content models?

As @mzjin has already pointed out, the #PCDATA keyword is used when declaring mixed content; it (or something logically equivalent to it) is needed in order to allow declarations to distinguish between elements which can contain character data, like

<!ELEMENT a (#PCDATA) >
<!ELEMENT p (#PCDATA | emph | term | list)* >

and elements which contain other elements, optionally separated by insignificant whitespace, but not character data, like

<!ELEMENT text (front?, body, back?) >
<!ELEMENT a (x | y | z)* >
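
To make the distinction concrete, here is a minimal sketch that asks Python's bundled expat parser how it classifies each content model. The document string and the element name b (used in place of the second declaration of a, so that no name is declared twice) are assumptions made for illustration. expat is a non-validating parser, so this only reports the declarations; it does not check the document against them.

```python
from xml.parsers import expat
from xml.parsers.expat import model

# Illustrative document: the declarations mirror the ones above,
# except that the element-only choice is named b (hypothetical name).
DOC = """<!DOCTYPE text [
  <!ELEMENT a (#PCDATA)>
  <!ELEMENT p (#PCDATA | emph | term | list)*>
  <!ELEMENT text (front?, body, back?)>
  <!ELEMENT b (x | y | z)*>
]>
<text><body/></text>"""

# Human-readable labels for the content-model types expat reports.
LABELS = {
    model.XML_CTYPE_MIXED: "mixed",
    model.XML_CTYPE_SEQ: "sequence",
    model.XML_CTYPE_CHOICE: "choice",
}

kinds = {}

def on_element_decl(name, content_model):
    # content_model is a tuple: (type, quantifier, name, children)
    kinds[name] = LABELS.get(content_model[0], "other")

parser = expat.ParserCreate()
parser.ElementDeclHandler = on_element_decl
parser.Parse(DOC, True)

for name, kind in kinds.items():
    print(name, kind)
```

Only the declarations that mention #PCDATA are reported as mixed content; the other two are reported as element-only sequence and choice groups.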

When you say "by default everything in XML is parsed character data", what do you mean? There is no 'default' declaration defined in XML for elements not declared in the DTD. Some processors may assume a declaration of that form for undeclared elements, in order to attempt to keep going while reading an invalid document, and that can be useful. But it's not a rule defined by XML.

Question 2: Why is it called 'parsed' character data, when all character data in an XML document passes through a parser and is thus necessarily 'parsed'?

The keyword PCDATA, inherited from ISO 8879 (which defines SGML), does indeed stand for 'parsed character data', but its denotation is narrower than you appear to be thinking. It means character data in which all potential delimiters will be recognized, including

  • <! for comments and CDATA sections (and, in SGML, also for conditional sections)
  • < for start-tags and sole-tags
  • </ for end-tags
  • &# for numeric character references
  • & for entity references

This property distinguishes parsed character data (in the technical sense) from two other kinds of character data, denoted by the keywords RCDATA (replaceable character data) and CDATA (just character data), in which different sets of delimiters are recognized. (RCDATA is part of SGML, but not of XML.)

In a CDATA marked section, for example, the only delimiter recognized is the end of the marked section, ]]>.
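
The difference is easy to observe with any XML parser. The following is a small sketch using Python's standard xml.etree.ElementTree; the element names p and emph are only illustrative. The same characters are placed once in ordinary parsed character data and once inside a CDATA marked section.

```python
import xml.etree.ElementTree as ET

# Same characters, two contexts (element names are illustrative).
pcdata = ET.fromstring("<p>a <emph>b</emph> &amp; &#65;</p>")
cdata = ET.fromstring("<p><![CDATA[a <emph>b</emph> &amp; &#65;]]></p>")

# In parsed character data the tag and the references are recognized:
pcdata_children = [child.tag for child in pcdata]
pcdata_tail = pcdata[0].tail

# In the CDATA section nothing is recognized until ]]>:
cdata_children = [child.tag for child in cdata]
cdata_text = cdata.text

print(pcdata_children, repr(pcdata_tail))
print(cdata_children, repr(cdata_text))
```

In the first case the parser produces a child element and resolves both references; in the second the whole string arrives as literal text with no child elements.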

In an attribute declared CDATA, the only delimiters recognized are &, &#, and the closing quotation mark of the attribute-value specification (either " or ').
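
A similar sketch for attribute values, again with xml.etree.ElementTree and with made-up element and attribute names: the references are resolved, while the apostrophe inside a double-quoted value is just data, because only the matching closing quotation mark ends the value.

```python
import xml.etree.ElementTree as ET

# Hypothetical element and attribute names, chosen only for illustration.
elem = ET.fromstring('<a title="it\'s &amp; &#65;"/>')
title = elem.get("title")

print(title)
```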

In an SGML document, marked sections can occur with the keyword RCDATA; in them, entity references (&), numeric character references (&#), and the marked-section end delimiter (]]>) will be recognized, but not start- and end-tag open delimiters (and, if I'm reading 8879 right, also not the marked-section open delimiter <![).

You may make the case that the terminology chosen in 8879 is perhaps not as clear as it might be, and that clearer terminology might have been possible and helpful. If so, you would not be the first to say so.

Answered 2016-04-10T18:52:37.713