java - XML Character was found in the element content of the document

Question

I am generating PDFs from a SQL query in a Java app. I have 4M PDFs to print.

On the 15092nd PDF I am encountering this error:

Invalid xml character (unicode 0xc) was found in the element content of the document

I tried the replacements that other blogs suggest:

    html = html.replaceAll("\000"," ");
    html = html.replaceAll("/\u000c+/g", "");

I don't know which one is right, so I just applied both to my html.

Anyone with an idea?

Thanks!

score 1 · Accepted Answer

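There are several ways to do the replacement. I describe them in some detail because I think understanding them matters more than just copying code.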

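A simple character-for-character replacement. This works in your case because you only want to replace occurrences of one specific character. As your character is a control character, you (usually) cannot insert it directly, but only in one of the following ways:
- Unicode reference: html=html.replace('\u000c', ' '); // hexadecimal value == 0xc
- Octal reference: html=html.replace('\14', ' '); // 0xc == 014
- By its meaning: html=html.replace('\f', ' '); // character 0xc is a form-feed
Unicode references are a bit tricky because they are processed before the Java parser sees the source, so they do not work for characters that have a special meaning to the Java language. For example, '\u000a' does not compile, because the resulting line feed ends the source line in the middle of the literal. With the form feed it works.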
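
The three character-literal variants above can be compared side by side. This is only a minimal sketch: the class name CharReplaceDemo, the method names, and the sample string are made up for illustration and do not come from the question.

```java
// Hypothetical demo class; names and sample input are invented for illustration.
final class CharReplaceDemo {

    // Replaces every form feed (0xc) with a space, written as a Unicode escape.
    static String viaUnicodeEscape(String html) {
        return html.replace('\u000c', ' ');
    }

    // Same replacement written as an octal escape: octal 14 == 0xc.
    static String viaOctalEscape(String html) {
        return html.replace('\14', ' ');
    }

    // Same replacement written with the named escape for form feed.
    static String viaNamedEscape(String html) {
        return html.replace('\f', ' ');
    }

    public static void main(String[] args) {
        // Assumed sample input containing one form feed.
        String html = "<p>page one\fpage two</p>";
        System.out.println(viaUnicodeEscape(html));
        System.out.println(viaOctalEscape(html));
        System.out.println(viaNamedEscape(html));
        // Caution: a line feed written as a backslash-u escape would not compile
        // in a char literal; for that character the named escape '\n' is needed.
    }
}
```

All three methods compile to the same character constant, so the choice between them is purely about readability.
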
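Using regular expressions. This is an oversized solution for this task, but it works because an exact match of a single character is a valid subset of the regex grammar. So you can take any of the variants above and build a regex-based solution by replacing the method name with replaceAll and changing the parameters to Strings, e.g. html=html.replaceAll("\14", " "); In this case the character reference is still resolved by the compiler and has no special meaning to the regular expression engine. When you actively use the regular expression engine, you have choices similar to the character references of the Java language:
- Unicode reference: html=html.replaceAll("\\u000c", " ");
- Hexadecimal reference: html=html.replaceAll("\\x0c", " "); // no Java equivalent
- Octal reference: html=html.replaceAll("\\014", " "); // note the subtle difference: the regex form needs the leading zero
- By its meaning: html=html.replaceAll("\\f", " ");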

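The difference is that these sequences insert a backslash at the Java language level (via the doubled backslash), forming a regular expression escape that is processed by the regex engine. Therefore the Unicode reference works for all characters here. The whole syntax is described here: http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html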
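
To see that the four regex-level escapes behave the same, they can be run against one input. This is a rough sketch under assumptions: the class name RegexReplaceDemo, the helper replaceWith, and the sample string are hypothetical and only serve the illustration.

```java
// Hypothetical demo class; names and sample input are invented for illustration.
final class RegexReplaceDemo {

    // The Java compiler only collapses each doubled backslash into one, so the
    // regex engine receives the escape sequence itself and resolves it to 0xc.
    static final String[] PATTERNS = {
        "\\u000c", // regex Unicode escape
        "\\x0c",   // regex hexadecimal escape
        "\\014",   // regex octal escape (leading zero required)
        "\\f"      // regex form-feed escape
    };

    // Replaces every match of the given regex with a single space.
    static String replaceWith(String pattern, String html) {
        return html.replaceAll(pattern, " ");
    }

    public static void main(String[] args) {
        // Assumed sample input containing one form feed.
        String html = "<p>page one\fpage two</p>";
        for (String pattern : PATTERNS) {
            System.out.println(pattern + " -> " + replaceWith(pattern, html));
        }
    }
}
```

Each pattern string is longer than one character at runtime (for example, the last one is a backslash followed by f), which is the visible sign that the regex engine, not the compiler, interprets it.
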

But as already said, a simple character match is sufficient for your task.

So why did your examples not work?

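    html = html.replaceAll("\000"," ");

The sequence \000 is a single octal reference to the control character 0x0 (NUL). So this call looks for NUL characters, not for the character 0xc named in your error message, and the form feed stays in the text.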

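    html = html.replaceAll("/\u000c+/g", "");

This pattern consists of the characters '/' '\f' (defined via a correct Unicode sequence) '+' '/' 'g'. Java has no /.../g delimiter syntax, so the slashes and the 'g' are ordinary characters. Only the plus sign has a special meaning in Java's regular expressions: it means "at least one" and "as many as possible". Therefore this code looks for sequences of your character 0xc, but only if they are framed by slashes and followed by 'g'.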
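
The behaviour of both attempts can be checked on small inputs. The following is a sketch only: the class name FailedAttemptsDemo, its method names, and the sample strings are assumptions made for the demonstration, while the two patterns are copied from the question.

```java
// Hypothetical demo class; names and sample inputs are invented for illustration.
final class FailedAttemptsDemo {

    // First attempt from the question: the pattern is the NUL character (0x0).
    static String firstAttempt(String html) {
        return html.replaceAll("\000", " ");
    }

    // Second attempt from the question: '/', form feed, '+', '/', 'g'.
    static String secondAttempt(String html) {
        return html.replaceAll("/\u000c+/g", "");
    }

    // Plain character replacement of the form feed (0xc).
    static String fixed(String html) {
        return html.replace('\f', ' ');
    }

    public static void main(String[] args) {
        // Assumed sample input containing one form feed.
        String html = "<p>page one\fpage two</p>";
        // Both attempts leave the form feed in place.
        System.out.println(firstAttempt(html).indexOf('\f') >= 0);
        System.out.println(secondAttempt(html).indexOf('\f') >= 0);
        // The second pattern only matches when slashes and a 'g' surround the run.
        System.out.println(secondAttempt("a/\f\f/gb"));
        // The plain replacement removes the offending character.
        System.out.println(fixed(html).indexOf('\f') >= 0);
    }
}
```

Under these assumptions, only the plain replace call changes a string that contains a bare form feed.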
