itext - iText PDFSweep RegexBasedCleanupStrategy does not work in some cases

Question

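I am trying to use the iText PDFSweep RegexBasedCleanupStrategy to redact some words in a PDF, but I only want to redact the word when it stands alone, not when it appears inside other words. For example, I want to redact "al" as a single word, but not the "al" in "mineral". So I added word boundaries ("\b") to the regex passed to RegexBasedCleanupStrategy: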

  new RegexBasedCleanupStrategy("\\bal\\b")

However, pdfAutoSweep.cleanUp does not work if the word is at the end of a line.

score 1 · Accepted Answer

In short

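The cause of this issue is that the routine which flattens the extracted text chunks into a single String for applying the regular expression does not insert any indicator for a line break. Thus, in that String the last letter of one line is immediately followed by the first letter of the next line, which hides the word boundary. The behavior can be fixed by adding an appropriate character to the String wherever a line break occurs.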
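The effect can be reproduced with plain java.util.regex, without iText. The following is a minimal sketch; the two sample lines and the class name WordBoundaryDemo are made up for illustration and do not come from the question's PDF.

```java
import java.util.regex.Pattern;

public class WordBoundaryDemo {
    // The pattern from the question: "al" as a standalone word.
    static final Pattern AL = Pattern.compile("\\bal\\b");

    static boolean containsStandaloneAl(String flattened) {
        return AL.matcher(flattened).find();
    }

    public static void main(String[] args) {
        // Hypothetical page content: the first line ends in "al",
        // the next line starts with another word.
        String line1 = "the mineral al";
        String line2 = "ready now";

        // Joined with nothing in between: "al" and "ready" fuse into "already",
        // so there is no word boundary after "al".
        System.out.println(containsStandaloneAl(line1 + line2));        // false

        // Joined with a line break: the boundary is visible again.
        System.out.println(containsStandaloneAl(line1 + "\n" + line2)); // true
    }
}
```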

The offending code

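The routine that flattens the extracted text chunks into a single String is CharacterRenderInfo.mapString(List<CharacterRenderInfo>) in the package com.itextpdf.kernel.pdf.canvas.parser.listener. For a merely horizontal gap this routine inserts a space character, but for a vertical offset, i.e. a line break, it adds nothing extra to the StringBuilder in which the String representation is generated: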

if (chunk.sameLine(lastChunk)) {
    // we only insert a blank space if the trailing character of the previous string wasn't a space, and the leading character of the current string isn't a space
    if (chunk.getLocation().isAtWordBoundary(lastChunk.getLocation()) && !chunk.getText().startsWith(" ") && !chunk.getText().endsWith(" ")) {
        sb.append(' ');
    }
    indexMap.put(sb.length(), i);
    sb.append(chunk.getText());
} else {
    indexMap.put(sb.length(), i);
    sb.append(chunk.getText());
}

A possible fix

One can extend the code above to insert a newline character in case of a line break:

if (chunk.sameLine(lastChunk)) {
    // we only insert a blank space if the trailing character of the previous string wasn't a space, and the leading character of the current string isn't a space
    if (chunk.getLocation().isAtWordBoundary(lastChunk.getLocation()) && !chunk.getText().startsWith(" ") && !chunk.getText().endsWith(" ")) {
        sb.append(' ');
    }
    indexMap.put(sb.length(), i);
    sb.append(chunk.getText());
} else {
    sb.append('\n');
    indexMap.put(sb.length(), i);
    sb.append(chunk.getText());
}

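This CharacterRenderInfo.mapString method is only called from the RegexBasedLocationExtractionStrategy method getResultantLocations() (package com.itextpdf.kernel.pdf.canvas.parser.listener), and only for the task mentioned, i.e. applying the regular expression in question. Thus, enabling it to recognize word boundaries properly should not break anything and should indeed be considered a fix.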

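One might consider adding a different character for a line break, e.g. a plain space ' ', if one does not want to treat vertical gaps differently from horizontal ones. For a generic fix one might, therefore, consider making this character a settable property of the strategy.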
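To illustrate the idea of a settable line-break character, here is a simplified stand-in for the flattening routine. It is only a sketch under stated assumptions: the Chunk class, the flatten method and its lineBreak parameter are hypothetical and not part of iText, and the horizontal-gap handling and the index map of the real mapString are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class FlattenSketch {
    /** Hypothetical stand-in for a text chunk: its text and the line it sits on. */
    static final class Chunk {
        final String text;
        final int line;
        Chunk(String text, int line) { this.text = text; this.line = line; }
    }

    /**
     * Flattens chunks into one String, inserting lineBreak whenever the line changes.
     * Passing an empty String mimics the behaviour described as the cause of the issue.
     */
    static String flatten(List<Chunk> chunks, String lineBreak) {
        StringBuilder sb = new StringBuilder();
        Chunk last = null;
        for (Chunk chunk : chunks) {
            if (last != null && chunk.line != last.line) {
                sb.append(lineBreak);
            }
            sb.append(chunk.text);
            last = chunk;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up sample content: "al" is the last word on line 0.
        List<Chunk> chunks = new ArrayList<>();
        chunks.add(new Chunk("the mineral ", 0));
        chunks.add(new Chunk("al", 0));
        chunks.add(new Chunk("ready now", 1));

        Pattern p = Pattern.compile("\\bal\\b");
        // No separator, a plain space, and a newline as the line-break character.
        for (String sep : new String[] {"", " ", "\n"}) {
            System.out.println(p.matcher(flatten(chunks, sep)).find());
        }
    }
}
```

With the empty separator the pattern finds nothing; with either a space or a newline it finds the standalone "al". Which separator to prefer depends on whether the regex should be able to tell line breaks apart from ordinary gaps.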

Versions

I tested with iText 7.1.4-SNAPSHOT and PDFSweep 2.0.3-SNAPSHOT.
