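I have a situation where I need to read several XML files and build a model from them. Sadly, these files are generated by a legacy system that I absolutely cannot change.
One of the XML files giving me trouble looks more or less like this (altered to remove proprietary data):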
<resource lang="en" dataId="900">
numbered content here, 900-919 ...
<string name="920-name">Document Shredder</string>
<string name="920-desc">A machine ideal for destroying documents that deserve it. It can cross-shred anything from tissue paper to small netbooks with minimal noise. Remember, hackers can't access the documents if you've shredded the drives.</string>
<string name="920-cat">office,appliance</string>
<string name="921-name">Plastic Ladle</string>
<string name="921-desc">This is a big plastic ladle, ideal for soups and sauces.</string>
<string name="921-cat">kitchen,utensils</string>
... similar numbered content here, 922-934 ...
<string name="935-name">Green Laser Pointer</string>
<string name="935-desc">A High-Powered green laser pointer, ideal for irritating cats.</string>
<string name="935-cat">office,tool</string>
<string name="936-name">Black Metal Filing Cabinet</string>
<string name="936-desc">A large, metal cabinet (black) built to store hanging file folders.</string>
<string name="936-cat">office,storage</string>
... similar numbered content here, 937-994
</resource>
I parse this into a List<CString>, where CString.java is:
public class CString {
    public String name;
    public String body; // toString() and the printed output below both refer to "body"

    @Override
    public String toString() {
        return "CString {!name: " + name + " !body: " + body + "}\n";
    }
}
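I have tried using DocumentBuilder and, when that did not work properly, a plain SaxParser. No matter what I do, when I look back over my CStrings, some of the bodies contain unparsed tags from a different part of the document. For example, printing the List<CString> mentioned above might produce something like: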
[ CStrings for 900-919 ...
, CString {!name: 920-name !body: Document Shredder}
, CString {!name: 920-desc !body: irritating cats.</string>
<string name="935-cat">office,tool</string>
<string name="936-name">Black Metal Filing Cabinet</e. Remember, hackers can't access the documents if you've shredded the drives.}
, CString {!name: 920-cat !body: office,appliance}
, CString {!name: 921-name !body: Plastic Ladle}
, CString {!name: 921-desc !body: This is a big plastic ladle, ideal for soups and sauces.}
, CString {!name: 921-cat !body: kitchen,utensils}
... CStrings for 922-934 ...
, CString {!name: 935-name !body: Green Laser Pointer}
, CString {!name: 935-desc !body: A High-Powered green laser pointer, ideal for irritating cats.}
, CString {!name: 935-cat !body: office,tool}
, CString {!name: 936-name !body: Black Metal Filing Cabinet}
, CString {!name: 936-desc !body: A large, metal cabinet (black) built to store hanging file folders.}
, CString {!name: 936-cat !body: office,storage}
... CStrings for 937-994
]
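For comparison, here is a minimal sketch of what a DOM-based read of this structure could look like. It is not the code from my project: the class name DomStringReader and the {name, body} pair representation are assumptions made for illustration. It relies only on the standard DocumentBuilder API and on getTextContent(), which returns the complete text of an element in one call.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical DOM-based reader (illustrative only, not the original code).
class DomStringReader {
    // Returns one {name, body} pair for every <string> element in the XML text.
    static List<String[]> read(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            NodeList nodes = doc.getElementsByTagName("string");
            List<String[]> out = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element e = (Element) nodes.item(i);
                // getTextContent() concatenates all text under the element.
                out.add(new String[] { e.getAttribute("name"), e.getTextContent().trim() });
            }
            return out;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }
}
```

If a reader along these lines still showed the corrupted bodies, that would suggest the problem lies in the input stream or the file rather than in the handler code.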
In the SaxParser version of my code, I have the following characters method in my DefaultHandler:
public void characters(char ch[], int start, int length) throws SAXException {
    String value = new String(ch, start, length).trim();
    switch (currentQName.toString()) { // currentQName is a StringBuilder that holds just the current XML element's name
        case "string":
            if (value.contains("</string")) {
                System.err.println("!!! Parse Error !!! " + value);
            }
            break;
    }
}
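One thing worth ruling out: the SAX contract allows the parser to deliver an element's text across several characters() calls, and the char array is only guaranteed valid for the duration of each call. Below is a minimal sketch of a handler that copies every chunk into its own StringBuilder and emits the value in endElement. The class name AccumulatingHandler and the "name=body" entry format are hypothetical, and I have not confirmed that chunking is the cause of the output shown here.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler; class and field names are illustrative only.
class AccumulatingHandler extends DefaultHandler {
    final List<String> entries = new ArrayList<>();      // one "name=body" per <string>
    private final StringBuilder text = new StringBuilder();
    private String currentName;                           // null while outside <string>

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        if ("string".equals(qName)) {
            currentName = attrs.getValue("name");
            text.setLength(0);                            // start a fresh buffer
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Copy the chunk right away; the parser may reuse ch after this call returns.
        if (currentName != null) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("string".equals(qName)) {
            entries.add(currentName + "=" + text.toString().trim());
            currentName = null;
        }
    }

    // Convenience wrapper: parses XML text and returns the collected entries.
    static List<String> parse(String xml) {
        try {
            AccumulatingHandler handler = new AccumulatingHandler();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
            return handler.entries;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }
}
```

The design choice is that nothing from the ch array is kept by reference: only copied characters survive past the callback, so later reuse of the parser's buffer cannot leak into an earlier value.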
As you might have guessed, this produces:
!!! Parse Error !!! irritating cats.</string>
<string name="935-cat">office,tool</string>
<string name="936-name">Black Metal Filing Cabinet</e. Remember, hackers can't access the documents if you've shredded the drives.
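I don't usually ask questions this esoteric, especially when I can't provide concrete data and code, but Googling has not turned up anything I could pin down, and the code is certainly not throwing (or suppressing) any exceptions.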
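One thing I have noticed is that when there is erroneous data, as shown above in the CString for 920-desc, the erroneous data is 138 characters long in this case and, not coincidentally, the good data picks up exactly 139 characters into the content, where it should be. This makes me think it is some kind of buffering issue. However, whether I let DocumentBuilder manage the buffers or try to manage them more manually by using SaxParser directly, I still get exactly the same erroneous text in the same place every time. Finally, I have never noticed any erroneous text when dealing with the shorter strings, name and cat, which I think also points to a char buffer problem.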
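To test the buffering hypothesis directly, one option is to record how the parser splits each value across characters() calls. The sketch below is an assumption-laden diagnostic, not part of my project: ChunkLogger and chunkLengths are hypothetical names. It stores the length of every chunk delivered inside a string element, so the chunk boundaries can be compared against the 138/139 offset described above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical diagnostic handler: logs the size of each characters() chunk.
class ChunkLogger extends DefaultHandler {
    final List<Integer> lengths = new ArrayList<>();
    private boolean inString;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        inString = "string".equals(qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inString) {
            lengths.add(length);   // one entry per callback, in document order
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        inString = false;
    }

    // Parses XML text and returns the chunk lengths seen inside <string> elements.
    static List<Integer> chunkLengths(String xml) {
        try {
            ChunkLogger logger = new ChunkLogger();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), logger);
            return logger.lengths;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }
}
```

How many chunks a given value is split into depends on the parser implementation and its internal buffer size, so only the sum of the lengths is predictable. If a long desc value arrives in more than one chunk while the short name and cat values arrive in one, that would be consistent with the pattern described above.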
Any ideas would be helpful!