google-cloud-platform - DLP API returns different results depending on whether the input is sent as a single string or as a collection of substrings

Question

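I am seeing behavior in the Google DLP library that confuses me, and I would like some clarification. I am using the Java wrapper library, google-cloud-dlp version 0.34.0-beta. Given the input: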

Collection<String> input = Lists.newArrayList("Jenny Tutone  2665 Agua Vista Dr Los Gatos CA 95030 (408) 867-5309 or 408.867.5309x100");

I see the output:

███  █ ████ or █

If I pass in the same string as a collection of substrings:

Collection<String> input = Lists.newArrayList("Jenny Tutone", "2665 Agua Vista Dr", "Los Gatos", "CA 95030", "(408) 867-5309", "or", "408.867.5309x100");

I see a very different result:

███, 2665 █, █ Gatos, █ 95030, █, or, █

I am using every InfoType I could find, 67 in total. Am I doing something wrong here? This is the core of the code that calls the Google DLP library:

private Collection<String> redactContent(Collection<String> input,
                                String replacement,
                                Likelihood minLikelihood,
                                List<InfoType> infoTypes) {
    // Replace select info types with chosen replacement string
    final Collection<RedactContentRequest.ReplaceConfig> replaceConfigs = infoTypes.stream()
            .map(it -> RedactContentRequest.ReplaceConfig.newBuilder().setInfoType(it).setReplaceWith(replacement).build())
            .collect(Collectors.toCollection(LinkedList::new));

    final InspectConfig inspectConfig =
            InspectConfig.newBuilder()
                    .addAllInfoTypes(infoTypes)
                    .setMinLikelihood(minLikelihood)
                    .build();

    long itemCount = 0;

    try (DlpServiceClient dlpClient = DlpServiceClient.create(settings)) {
        // Google's DLP library is limited to 100 items per request, so the requests need to be chunked if the
        // number of input items is greater.

        Stream.Builder<Stream<ContentItem>> streamBuilder = Stream.builder();

        for (long processed = 0; processed < input.size(); processed += maxItemsPerRequest) {
            Collection<ContentItem> items =
                    input.stream()
                            .skip(processed)
                            .limit(maxItemsPerRequest)
                            .filter(item -> item != null && !item.isEmpty())
                            .map(item ->
                                    ContentItem.newBuilder()
                                            .setType(MediaType.PLAIN_TEXT_UTF_8.toString())
                                            .setData(ByteString.copyFrom(item.getBytes(Charset.forName("UTF-8"))))
                                            .build()
                            )
                            .collect(Collectors.toCollection(LinkedList::new));
            RedactContentRequest request = RedactContentRequest.newBuilder()
                    .setInspectConfig(inspectConfig)
                    .addAllItems(Collections.unmodifiableCollection(items))
                    .addAllReplaceConfigs(replaceConfigs)
                    .build();

            RedactContentResponse contentResponse = dlpClient.redactContent(request);
            itemCount += contentResponse.getItemsCount();
            streamBuilder.add(contentResponse.getItemsList().stream());
        }

        return streamBuilder.build()
                        .flatMap(stream -> stream.map(item -> item.getData().toStringUtf8()))
                        .collect(Collectors.toCollection(LinkedList::new));
    }
}

score 2 · Accepted Answer

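Context can affect findings. In the case of addresses in particular, one part of the address can influence how the other parts are classified. For example, "Mountain View CA 94043" may match as LOCATION, but "94043" on its own may not. When running this analysis, we do not cross cell boundaries when determining context, so in your second ArrayList example each string is examined individually (in its own context).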

Note: I am the PM for the DLP API.
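
A minimal sketch of one way to act on this, assuming the fragments belong to the same record and that it is acceptable to send them as one item: join them into a single string before building the ContentItem, so that the whole address is analyzed in one context. The class and method names (ContextJoiner, joinForContext) are hypothetical and not part of the DLP library, and the single-space separator is an assumption as well.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.Objects;
import java.util.stream.Collectors;

// Hypothetical helper, not part of google-cloud-dlp.
final class ContextJoiner {
    private ContextJoiner() {}

    // Merges the fragments of one record into a single string so that
    // they can be sent as one item and share one context.
    // Null and blank fragments are skipped; the separator (a single
    // space) is an assumption made for this sketch.
    static Collection<String> joinForContext(Collection<String> fragments) {
        String joined = fragments.stream()
                .filter(Objects::nonNull)
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .collect(Collectors.joining(" "));
        return joined.isEmpty()
                ? Collections.<String>emptyList()
                : Collections.singletonList(joined);
    }

    public static void main(String[] args) {
        Collection<String> fragments = Arrays.asList(
                "Jenny Tutone", "2665 Agua Vista Dr", "Los Gatos",
                "CA 95030", "(408) 867-5309", "or", "408.867.5309x100");
        // The returned collection holds one element containing the full record.
        System.out.println(joinForContext(fragments));
    }
}
```

The returned collection could be passed as the input argument of the redactContent method from the question. The trade-off is that the redacted text then comes back as one string rather than one result per fragment, so this only fits cases where the per-fragment boundaries are not needed afterwards.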
