I'm trying to read the contents of a PDF using Apache PDFBox and encode them in Base64 so I can stream them elsewhere. To encode them I use the Apache Commons Codec Base64OutputStream class, like so:

ByteArrayOutputStream byteOutput = new ByteArrayOutputStream();
Base64OutputStream base64Output = new Base64OutputStream(byteOutput);
List pages = pdfDocument.getDocumentCatalog().getAllPages();
Iterator iter = pages.iterator();
while (iter.hasNext()) {
  PDPage page = (PDPage) iter.next();
  PDResources resources = page.getResources();
  Map<String, PDXObjectImage> pageImages = resources.getImages();
  if (pageImages != null) {
    Iterator imageIter = pageImages.keySet().iterator();
    while (imageIter.hasNext()) {
      String key = (String) imageIter.next();
      PDXObjectImage image = (PDXObjectImage) pageImages
          .get(key);
      image.write2OutputStream(base64Output);
    }
  }
}
String base64 = new String(byteOutput.toByteArray());

It seems to be encoding it, but I need to verify that by writing a JUnit test that validates the Base64 string. The following assertion doesn't pass. Any thoughts?

assertTrue(content
        .matches("^([A-Za-z0-9+/]{4})*([A-Za-z0-9+/]{4}|[A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{2}==)$"));

Thanks in advance

1 Answer

By default, Base64OutputStream uses CHUNK_SIZE = 76 and CHUNK_SEPARATOR = {'\r', '\n'}. The regular expression you use to test whether a given string is Base64-encoded does not take this into account.
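To see the effect in isolation, here is a minimal sketch. It assumes java.util.Base64's MIME encoder as a stand-in for the chunked stream (it also emits 76-character lines separated by \r\n), so it is not the commons-codec class itself; the class and method names are made up for illustration. As far as I know, commons-codec may additionally append a separator after the last line, which the stripping step below would also cover.

```java
import java.util.Base64;

public class ChunkedBase64Demo {
    // The single-line pattern from the question.
    static final String SINGLE_LINE =
        "^([A-Za-z0-9+/]{4})*([A-Za-z0-9+/]{4}|[A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{2}==)$";

    // Assumption: the JDK MIME encoder stands in for the chunked
    // Base64OutputStream (76-character lines, "\r\n" between lines).
    static String encodeChunked(byte[] data) {
        return Base64.getMimeEncoder().encodeToString(data);
    }

    // Remove the line separators before applying the single-line pattern.
    static boolean matchesAfterUnchunking(String encoded) {
        return encoded.replace("\r\n", "").matches(SINGLE_LINE);
    }

    public static void main(String[] args) {
        // 100 bytes encode to 136 Base64 characters, i.e. more than one line.
        String chunked = encodeChunked(new byte[100]);
        System.out.println(chunked.matches(SINGLE_LINE));    // false: contains "\r\n"
        System.out.println(matchesAfterUnchunking(chunked)); // true
    }
}
```

In a JUnit test, the same idea would be to strip the separators from the encoded string before calling assertTrue on the match.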

A regular expression that matches a chunked Base64 string (assuming a chunk size of 76 and the separator \r\n) might look like this:

"^(([\\w+/]{4}){19}\r\n)*(([\\w+/]{4})*([\\w+/]{4}|[\\w+/]{3}=|[\\w+/]{2}==))$"
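Note that \w also admits the underscore, which is not part of the standard Base64 alphabet, and that the pattern above does not allow a separator after the last line. A stricter variant could be sketched as follows. This is only an illustration under assumptions: the class and method names are hypothetical, and java.util.Base64's MIME encoder again stands in for the chunked stream.

```java
import java.util.Base64;
import java.util.regex.Pattern;

public class ChunkedBase64Pattern {
    // Standard Base64 alphabet, without the underscore that \w would allow.
    private static final String B64 = "[A-Za-z0-9+/]";

    // Any number of full 76-character lines, then a final line of at most
    // 76 characters with optional padding, then an optional trailing "\r\n".
    static final Pattern CHUNKED = Pattern.compile(
        "^(" + B64 + "{76}\r\n)*"
        + "(" + B64 + "{4}){0,18}"
        + "(" + B64 + "{4}|" + B64 + "{3}=|" + B64 + "{2}==)"
        + "(\r\n)?$");

    static boolean isChunkedBase64(String s) {
        return CHUNKED.matcher(s).matches();
    }

    public static void main(String[] args) {
        // Stand-in for the chunked output (76-character lines, "\r\n" between them).
        String chunked = Base64.getMimeEncoder().encodeToString(new byte[100]);
        System.out.println(isChunkedBase64(chunked));          // true
        System.out.println(isChunkedBase64(chunked + "\r\n")); // true: trailing separator allowed
        System.out.println(isChunkedBase64("____"));           // false: '_' is not in the alphabet
    }
}
```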
answered 2013-05-06T15:10:21.673