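I'm seeing behavior in the Google DLP library that confuses me, and I'm hoping for some clarification. I'm using the Java wrapper library, google-cloud-dlp version 0.34.0-beta. Given the input: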
Collection<String> input = Lists.newArrayList("Jenny Tutone  2665 Agua Vista Dr Los Gatos CA 95030 (408) 867-5309 or 408.867.5309x100");
I see the output:
███  █ ████ or █
If I pass in the same string split into a collection of substrings:
Collection<String> input = Lists.newArrayList("Jenny Tutone", "2665 Agua Vista Dr", "Los Gatos", "CA 95030", "(408) 867-5309", "or", "408.867.5309x100");
I see very different results:
███, 2665 █, █ Gatos, █ 95030, █, or, █
I'm using every InfoType I could find, 67 in total. Am I doing something wrong here? This is the core of the code that calls the Google DLP library:
private Collection<String> redactContent(Collection<String> input,
                                String replacement,
                                Likelihood minLikelihood,
                                List<InfoType> infoTypes) {
    // Replace select info types with chosen replacement string
    final Collection<RedactContentRequest.ReplaceConfig> replaceConfigs = infoTypes.stream()
            .map(it -> RedactContentRequest.ReplaceConfig.newBuilder().setInfoType(it).setReplaceWith(replacement).build())
            .collect(Collectors.toCollection(LinkedList::new));
    final InspectConfig inspectConfig =
            InspectConfig.newBuilder()
                    .addAllInfoTypes(infoTypes)
                    .setMinLikelihood(minLikelihood)
                    .build();
    long itemCount = 0;
    try (DlpServiceClient dlpClient = DlpServiceClient.create(settings)) {
        // Google's DLP library is limited to 100 items per request, so the requests need to be chunked if the
        // number of input items is greater.
        Stream.Builder<Stream<ContentItem>> streamBuilder = Stream.builder();
        for (long processed = 0; processed < input.size(); processed += maxItemsPerRequest) {
            Collection<ContentItem> items =
                    input.stream()
                            .skip(processed)
                            .limit(maxItemsPerRequest)
                            .filter(item -> item != null && !item.isEmpty())
                            .map(item ->
                                    ContentItem.newBuilder()
                                            .setType(MediaType.PLAIN_TEXT_UTF_8.toString())
                                            .setData(ByteString.copyFrom(item.getBytes(Charset.forName("UTF-8"))))
                                            .build()
                            )
                            .collect(Collectors.toCollection(LinkedList::new));
            RedactContentRequest request = RedactContentRequest.newBuilder()
                    .setInspectConfig(inspectConfig)
                    .addAllItems(Collections.unmodifiableCollection(items))
                    .addAllReplaceConfigs(replaceConfigs)
                    .build();
            RedactContentResponse contentResponse = dlpClient.redactContent(request);
            itemCount += contentResponse.getItemsCount();
            streamBuilder.add(contentResponse.getItemsList().stream());
        }
        return streamBuilder.build()
                        .flatMap(stream -> stream.map(item -> item.getData().toStringUtf8()))
                        .collect(Collectors.toCollection(LinkedList::new));
    }
}
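
In case the batching matters, here is the chunking step in isolation, as a minimal sketch with no DLP calls. The names `ChunkSketch` and `chunk` are hypothetical and exist only for illustration; they are not part of google-cloud-dlp. It assumes the goal is just to split the input into consecutive batches of at most `maxPerChunk` items while preserving order, which is what the `skip`/`limit` loop above is meant to do before filtering.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Hypothetical helper for illustration only; not part of google-cloud-dlp.
final class ChunkSketch {
    private ChunkSketch() {}

    /**
     * Splits the input into consecutive batches of at most maxPerChunk
     * elements, preserving iteration order. The last batch may be smaller.
     */
    static <T> List<List<T>> chunk(Collection<T> input, int maxPerChunk) {
        if (maxPerChunk <= 0) {
            throw new IllegalArgumentException("maxPerChunk must be positive");
        }
        List<List<T>> chunks = new ArrayList<>();
        List<T> current = new ArrayList<>();
        for (T item : input) {
            current.add(item);
            // Close the batch as soon as it reaches the size limit.
            if (current.size() == maxPerChunk) {
                chunks.add(current);
                current = new ArrayList<>();
            }
        }
        // Keep any trailing partial batch.
        if (!current.isEmpty()) {
            chunks.add(current);
        }
        return chunks;
    }
}
```

With this, each batch would be turned into one request, for example by looping over `ChunkSketch.chunk(input, 100)`. Unlike the loop above, it walks the input once instead of re-streaming and skipping for every batch. Neither of my two examples gets anywhere near 100 items, so both go out as a single request.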