.net - Extracting text from a Lotus Notes XML rich text element

Question

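I am migrating the contents of a Lotus Notes database to SharePoint. The whole database has been exported as XML files (this requirement cannot be changed), and I have to parse those XML files and insert the data into SharePoint.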

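What is giving me trouble are the elements that contain rich text. Each of these XML elements holds an XML representation of the exact rich text formatting used in the Lotus Notes field, expressed in DXL, as described at http://publib.boulder.ibm.com/infocenter/domhelp/v8r0/index.jsp?topic=%2Fcom.ibm.designer.domino.main.doc%2FH_PARAGRAPH_DEFINITIONS_ELEMENT_XML.html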

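I do not need to keep the actual formatting of the text (unless that is just as easy as retrieving plain text), but if I simply read the value of the XML element that contains the rich text (using LinqToXML), I get the plain text without any line breaks, which is not acceptable. In addition, embedded images show up in the retrieved text as base64-encoded strings (they are embedded in the XML).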

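Can anyone point me to a way of extracting the text from the XML element, either as properly formatted RTF that can be inserted into an RTF file, or as plain text that contains the correct line breaks and no embedded images?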

score 1 · Accepted Answer

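Evidently, the XML you are processing is DXL. A more elegant approach is to convert it to HTML with an XSL transformation. You may find the XSLT stylesheets you need among those supplied with the PD4ML tool. From HTML, the document can be converted to PDF, RTF, or an image with PD4ML (or possibly to another format with other tools).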
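As a rough illustration of the XSLT route in .NET, the sketch below runs a DXL fragment through `XslCompiledTransform`. The stylesheet is a hypothetical, minimal one written for this example (it is not the PD4ML stylesheet) and produces plain text instead of HTML: it assumes that the rich text uses the DXL namespace `http://www.lotus.com/dxl`, that each `par` element is one paragraph, and that `picture`, `attachmentref`, `formula` and `pardef` are the only elements to drop. The class and method names are made up for the example.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Xsl;

public static class DxlXsltSketch
{
    // Hypothetical stylesheet: writes the text of each <par> followed by a
    // newline and suppresses elements that carry no readable text.
    private const string Stylesheet = @"<?xml version='1.0'?>
<xsl:stylesheet version='1.0'
    xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
    xmlns:dxl='http://www.lotus.com/dxl'>
  <xsl:output method='text'/>
  <xsl:strip-space elements='*'/>
  <xsl:template match='dxl:par'>
    <xsl:apply-templates/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
  <xsl:template match='dxl:picture | dxl:attachmentref | dxl:formula | dxl:pardef'/>
</xsl:stylesheet>";

    public static string ToPlainText(string dxl)
    {
        var transform = new XslCompiledTransform();
        using (var xsltReader = XmlReader.Create(new StringReader(Stylesheet)))
        {
            transform.Load(xsltReader);
        }

        using (var input = XmlReader.Create(new StringReader(dxl)))
        using (var output = new StringWriter())
        {
            // No XSLT parameters are needed, hence the null argument list.
            transform.Transform(input, null, output);
            return output.ToString();
        }
    }
}
```

Calling `DxlXsltSketch.ToPlainText(richTextElement.ToString())` on the exported rich text element should then return one line per paragraph. A stylesheet that emits HTML, as suggested above, would be loaded and applied in the same way.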

score 0 · Accepted Answer

You could convert the rich text item content to HTML/MIME, which is another supported format for rich text items.

Alternatively, you could create an XPage or a form that displays the rich text content at an HTTP URL, and reference that content in the exported XML.

Panu

score 0 · Accepted Answer

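For now, I have simply stripped all XML tags and unwanted embedded elements from the rich text XML element, using regular expressions with the following patterns: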

        //Removes all attachmentref elements (lazy match, so text between two separate elements is kept)
        newString = new Regex(@"<attachmentref.*?</attachmentref>", RegexOptions.Singleline).Replace(newString, "");
        //Removes all formula elements
        newString = new Regex(@"<formula.*?</formula>", RegexOptions.Singleline).Replace(newString, "");
        //Removes all xml tags (<par>, <pardef>, <table> etc). Be aware that this also removes any content in the table,
        //because each greedy match runs from the first '<' to the last '>' on a line
        newString = new Regex("<(.)*/>").Replace(newString, "");
        newString = new Regex("<(.)*>").Replace(newString, "");
        newString = new Regex("</(.)*>").Replace(newString, "");

        //Trims the text to tidy up the many \n, \r and white-spaces introduced by removing the xml tags. 
        newString = new Regex(@"\r").Replace(newString, "\n");
        newString = new Regex(@"[ \f\r\t\v]+\n").Replace(newString, "\n");
        newString = new Regex(@"\n{2,}").Replace(newString, "\n");

        //makes < and > appear correctly in the text.
        newString = newString.Replace("&lt;", "<").Replace("&gt;", ">");

It isn't pretty, but at least the text is readable and some of the line breaks are preserved.
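Since the question mentions LinqToXML, a tree walk is a possible alternative to the regular expressions. The following is only a sketch under assumptions: it treats every `par` element as a paragraph that ends with a line break, treats `break` as a line break inside a paragraph, and skips `picture`, `attachmentref`, `formula` and `pardef` entirely. That list of element names, and the `DxlPlainText` class itself, are made up for this example and would need checking against the real export.

```csharp
using System.Collections.Generic;
using System.Text;
using System.Xml.Linq;

public static class DxlPlainText
{
    // Elements whose content should never reach the output (assumed list).
    private static readonly HashSet<string> Skipped = new HashSet<string>
    {
        "picture", "attachmentref", "formula", "pardef"
    };

    public static string Extract(XElement richText)
    {
        var sb = new StringBuilder();
        Append(richText, sb);
        // Drop the line break that follows the last paragraph.
        return sb.ToString().TrimEnd('\n');
    }

    private static void Append(XElement element, StringBuilder sb)
    {
        foreach (XNode node in element.Nodes())
        {
            if (node is XText text)
            {
                sb.Append(text.Value);
            }
            else if (node is XElement child)
            {
                // Compare local names so the DXL namespace does not matter.
                string name = child.Name.LocalName;
                if (Skipped.Contains(name)) continue;
                if (name == "break") { sb.Append('\n'); continue; }

                Append(child, sb);
                if (name == "par") sb.Append('\n');
            }
        }
    }
}
```

Because the walk recurses into every element that is not skipped, text inside table cells is kept rather than dropped. Character entities such as `&lt;` are already decoded by the XML parser, so no manual replacement is needed.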
