google-cloud-platform - Data Loss Prevention finds extra entities when masking email addresses

Question

I am calling the DLP API to mask person names and email addresses in text, using the following request:

Request

{
  "item": {
    "value": "Eleanor Rigby\nPharmacist\neleanor.rigby@example.com"
  },
  "deidentifyConfig": {
    "infoTypeTransformations": {
      "transformations": [
        {
          "infoTypes": [ { "name": "EMAIL_ADDRESS" } ],
          "primitiveTransformation": {
            "characterMaskConfig": {
              "maskingCharacter": "#",
              "reverseOrder": false,
              "charactersToIgnore": [
                {
                  "charactersToSkip": ".@"
                }
              ]
            }
          }
        },
        {
          "infoTypes": [ { "name": "PERSON_NAME" } ],
          "primitiveTransformation": {
            "replaceConfig": {
              "newValue": {
                "stringValue": "(person)"
              }
            }
          }
        }
      ]
    }
  },
  "inspectConfig": {
    "infoTypes": [ { "name": "EMAIL_ADDRESS" }, { "name": "PERSON_NAME" } ]
  }
}

API call

curl -s \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  https://dlp.googleapis.com/v2/projects/$PROJECT_ID/content:deidentify \
  -d @gcp-dlp/input/text-request.json

{
  "item": {
    "value": "(person)\nPharmacist\n(person)#######.#####@#######.###(person)"
  },
  "overview": {
    "transformedBytes": "50",
    "transformationSummaries": [
      {
        "infoType": {
          "name": "EMAIL_ADDRESS"
        },
        "transformation": {
          "characterMaskConfig": {
            "maskingCharacter": "#",
            "charactersToIgnore": [
              {
                "charactersToSkip": ".@"
              }
            ]
          }
        },
        "results": [
          {
            "count": "1",
            "code": "SUCCESS"
          }
        ],
        "transformedBytes": "25"
      },
      {
        "infoType": {
          "name": "PERSON_NAME"
        },
        "transformation": {
          "replaceConfig": {
            "newValue": {
              "stringValue": "(person)"
            }
          }
        },
        "results": [
          {
            "count": "3",
            "code": "SUCCESS"
          }
        ],
        "transformedBytes": "25"
      }
    ]
  }
}

Request (text only)

Eleanor Rigby
Pharmacist
eleanor.rigby@example.com

Response (text only)

(person)
Pharmacist
(person)#######.#####@#######.###(person)

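The input text contains a person name and an email address. Both are detected and masked as expected. However, an extra (person) tag is added before and after the masked email address.

This is a very simple example, but I observe the same behavior in every document I process this way.

Why is the person entity detected multiple times?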

score 1 · Accepted Answer

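This issue was reported on the Google Public Issue Tracker. Requests of this kind are not indexed, but the tracker is a good way to report problems or request new features. Follow that issue for updates.

Google suggested a workaround:

In this case, we have some undefined behavior when an overlap is found. The (person) comes from the user's configuration, which replaces person names with (person).

They can exclude the overlap.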
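
To make "exclude the overlap" concrete, here is a minimal local sketch of the idea in Python. It is not how Cloud DLP is implemented. The `Finding` class, the helper names, and the two PERSON_NAME spans inside the email address are hypothetical, chosen only to resemble the three PERSON_NAME results reported in the response above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """Hypothetical stand-in for a DLP finding: an infoType plus a text span."""
    info_type: str
    start: int  # inclusive offset
    end: int    # exclusive offset


def overlaps(a: Finding, b: Finding) -> bool:
    """True when the two spans share at least one character."""
    return a.start < b.end and b.start < a.end


def drop_overlapping(findings, target, excluding):
    """Drop `target` findings that overlap any `excluding` finding."""
    blockers = [f for f in findings if f.info_type == excluding]
    return [
        f for f in findings
        if not (f.info_type == target and any(overlaps(f, b) for b in blockers))
    ]


text = "Eleanor Rigby\nPharmacist\neleanor.rigby@example.com"
email_start = text.index("eleanor.rigby@")

# Assumed findings: one real name, plus two name-like spans inside the email.
findings = [
    Finding("PERSON_NAME", 0, 13),
    Finding("PERSON_NAME", email_start, email_start + 7),       # "eleanor"
    Finding("PERSON_NAME", email_start + 8, email_start + 13),  # "rigby"
    Finding("EMAIL_ADDRESS", email_start, len(text)),
]

kept = drop_overlapping(findings, "PERSON_NAME", "EMAIL_ADDRESS")
for f in kept:
    print(f.info_type, repr(text[f.start:f.end]))
```

Under these assumptions, only the standalone name and the email address survive, so each span of text is transformed by exactly one rule.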

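For more information, see the documentation page "Modifying infoType detectors to refine scan results", section "Omit matches on PERSON_NAME detector if also matched by EMAIL_ADDRESS detector":

The following JSON snippet and code in several languages illustrate how to indicate to Cloud DLP, using an InspectConfig, that it should return only one match when matches for the PERSON_NAME detector overlap with matches for the EMAIL_ADDRESS detector. This avoids the situation where an email address such as "james@example.com" matches on both the PERSON_NAME and EMAIL_ADDRESS detectors.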
...
    "inspectConfig":{
      "ruleSet":[
        {
          "infoTypes":[
            {
              "name":"PERSON_NAME"
            }
          ],
          "rules":[
            {
              "exclusionRule":{
                "excludeInfoTypes":{
                  "infoTypes":[
                    {
                      "name":"EMAIL_ADDRESS"
                    }
                  ]
                },
                "matchingType": "MATCHING_TYPE_PARTIAL_MATCH"
              }
            }
          ]
        }
      ]
    } 
...
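
For completeness, here is a rough sketch of how the exclusion rule might be merged into the original request from the question. The `build_request` helper is hypothetical. The field names are copied from the question's request and from the snippet above. I have not verified the resulting API response, so treat this as a starting point.

```python
import json


def build_request(text: str) -> dict:
    """Build a content:deidentify body with the PERSON_NAME exclusion rule."""
    return {
        "item": {"value": text},
        "deidentifyConfig": {
            "infoTypeTransformations": {
                "transformations": [
                    {
                        # Mask email addresses, keeping "." and "@" readable.
                        "infoTypes": [{"name": "EMAIL_ADDRESS"}],
                        "primitiveTransformation": {
                            "characterMaskConfig": {
                                "maskingCharacter": "#",
                                "reverseOrder": False,
                                "charactersToIgnore": [
                                    {"charactersToSkip": ".@"}
                                ],
                            }
                        },
                    },
                    {
                        # Replace person names with a fixed label.
                        "infoTypes": [{"name": "PERSON_NAME"}],
                        "primitiveTransformation": {
                            "replaceConfig": {
                                "newValue": {"stringValue": "(person)"}
                            }
                        },
                    },
                ]
            }
        },
        "inspectConfig": {
            "infoTypes": [{"name": "EMAIL_ADDRESS"}, {"name": "PERSON_NAME"}],
            # Added: ignore PERSON_NAME findings that overlap an EMAIL_ADDRESS.
            "ruleSet": [
                {
                    "infoTypes": [{"name": "PERSON_NAME"}],
                    "rules": [
                        {
                            "exclusionRule": {
                                "excludeInfoTypes": {
                                    "infoTypes": [{"name": "EMAIL_ADDRESS"}]
                                },
                                "matchingType": "MATCHING_TYPE_PARTIAL_MATCH",
                            }
                        }
                    ],
                }
            ],
        },
    }


request = build_request("Eleanor Rigby\nPharmacist\neleanor.rigby@example.com")
print(json.dumps(request, indent=2))
```

The printed JSON can be saved as gcp-dlp/input/text-request.json and sent with the same curl command shown in the question.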
