google-cloud-platform - How do I get the location of scanned files when using the Google Cloud DLP API?

Question

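I'm scanning nested directories in a Cloud Storage bucket. Although I turned on include_quote, the results don't include the matched values (quotes). Also, how do I get the names of the files that had matches, along with the matched values? I'm using Python. This is what I have so far. As you can see, the API found matches, but I'm not getting details on which words (and files) were flagged.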

inspect_job = {
    'inspect_config': {
        'info_types': info_types,
        'min_likelihood': MIN_LIKELIHOOD,
        # Ask DLP to return the matched text with each finding.
        'include_quote': True,
        'limits': {
            'max_findings_per_request': MAX_FINDINGS
        },
    },
    'storage_config': {
        'cloud_storage_options': {
            'file_set': {
                # '**' matches every object under the directory, recursively.
                'url':
                    'gs://{bucket_name}/{dir_name}/**'.format(
                        bucket_name=STAGING_BUCKET, dir_name=DIR_NAME)
            }
        }
    }
}

operation = dlp.create_dlp_job(parent, inspect_job)
dlp.get_dlp_job(operation.name)

The result is as follows:

result {
processed_bytes: 64
total_estimated_bytes: 64
info_type_stats {
  info_type {
    name: "EMAIL_ADDRESS"
  }
  count: 1
}
info_type_stats {
  info_type {
    name: "PHONE_NUMBER"
  }
  count: 1
}
info_type_stats {
  info_type {
    name: "FIRST_NAME"
  }
  count: 2
}

score 1 · Accepted Answer

You need to follow the "Retrieving inspection results" section of https://cloud.google.com/dlp/docs/inspecting-storage and specify a SaveFindings action: https://cloud.google.com/dlp/docs/reference/rest/v2/InspectJobConfig#SaveFindings

score 0 · Accepted Answer

I think you're not getting quote values because your inspectConfig isn't quite right: according to the documentation at https://cloud.google.com/dlp/docs/reference/rest/v2/InspectConfig you should set

  "includeQuote": true

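Edit: adding information about getting the file. Follow this example: https://cloud.google.com/solutions/automating-classification-of-data-uploaded-to-cloud-storage

The code for the Cloud Function resolve_DLP gets the file name from the job details, as shown here: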

def resolve_DLP(data, context):
    ...
    job = dlp.get_dlp_job(job_name)
    ...
    file_path = (
        job.inspect_details.requested_options.job_config.storage_config
        .cloud_storage_options.file_set.url)
    file_name = os.path.basename(file_path)
    ...

Edit 2: I now see that the latest Python API client uses 'include_quote' as the dict key, so that's not it...

Edit 3: From the Python API code:

message Finding {
  // The content that was found. Even if the content is not textual, it
  // may be converted to a textual representation here.
  // Provided if `include_quote` is true and the finding is
  // less than or equal to 4096 bytes long. If the finding exceeds 4096 bytes
  // in length, the quote may be omitted.
  string quote = 1;

So perhaps smaller files will produce quotes.

score 0 · Accepted Answer

Rondo, thanks for your input. I believe the Cloud Storage example you mentioned only scans one file per job. It doesn't use the SaveFindings object.
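A small sketch of why the basename approach from that example does not carry over to a wildcard scan. The bucket and object names here are made up for illustration; the point is only that a job configured with a `**` URL has no single file name to extract from its config.

```python
import os.path

# Hypothetical URLs, for illustration only.
single_file_url = "gs://example-bucket/uploads/report.txt"
wildcard_url = "gs://example-bucket/nested-dir/**"


def file_name_from_url(url):
    """Return the last path component of a gs:// URL."""
    return os.path.basename(url)


print(file_name_from_url(single_file_url))  # report.txt
print(file_name_from_url(wildcard_url))     # **
```

With one file per job the job config is enough to identify the file; with a recursive scan the per-finding location has to come from the saved findings instead.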

Josh, you're right. It seems the output needs to be directed to BigQuery or Pub/Sub to see the full results.

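From https://cloud.google.com/dlp/docs/inspecting-storage#retrieving-inspection-results:

For complete inspection job results, you have two options. Depending on the Action you've chosen, inspection jobs are:

Saved to BigQuery (the SaveFindings object) in the table specified. Before viewing or analyzing the results, first ensure that the job has completed by using the projects.dlpJobs.get method, described below. Note that you can specify a schema for storing findings using the OutputSchema object.

Published to a Cloud Pub/Sub topic (the PublishToPubSub object). The topic must have given publishing access rights to the Cloud DLP service account that runs the DlpJob sending the notifications.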
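Once findings are saved to BigQuery, the file and the matched text can be read back with a query. The following is a rough sketch, not a verified recipe: it assumes the findings table uses the default output schema, with a `quote` column, an `info_type.name` field, and a repeated `location.content_locations` record whose `container_name` holds the object path. The helper names `build_findings_query` and `fetch_findings` are made up for this example. Check the column names against your own table before relying on it.

```python
def build_findings_query(project_id, dataset_id, table_id, limit=100):
    """Build a query listing each finding with its file path and matched text.

    Assumption: the table follows the default DLP findings schema.
    """
    table = "`{}.{}.{}`".format(project_id, dataset_id, table_id)
    return (
        "SELECT\n"
        "  content_location.container_name AS file_path,\n"
        "  info_type.name AS info_type,\n"
        "  likelihood,\n"
        "  quote\n"
        "FROM {table},\n"
        "  UNNEST(location.content_locations) AS content_location\n"
        "LIMIT {limit}"
    ).format(table=table, limit=int(limit))


def fetch_findings(project_id, dataset_id, table_id):
    """Run the query and return the rows as a list of dicts."""
    # Imported here so the query builder works without the library installed.
    from google.cloud import bigquery

    client = bigquery.Client(project=project_id)
    query = build_findings_query(project_id, dataset_id, table_id)
    return [dict(row.items()) for row in client.query(query).result()]
```

The `quote` column is only expected to be populated when `include_quote` was set on the job's inspect config.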

I got it working by modifying the solution from "How to scan a BigQuery table with DLP to find sensitive data?".

Here is my final working script:

import google.cloud.dlp

dlp = google.cloud.dlp.DlpServiceClient()

inspect_job_data = {
    'storage_config': {
        'cloud_storage_options': {
            'file_set': {
                # '**' matches every object under the directory, recursively.
                'url':
                    'gs://{bucket_name}/{dir_name}/**'.format(
                        bucket_name=STAGING_BUCKET, dir_name=DIR_NAME)
            }
        }
    },
    'inspect_config': {
        'include_quote': include_quote,
        'info_types': [
            {'name': 'ALL_BASIC'},
        ],
    },
    'actions': [
        {
            # Write the full findings to a BigQuery table.
            'save_findings': {
                'output_config': {
                    'table': {
                        'project_id': GCP_PROJECT_ID,
                        'dataset_id': DATASET_ID,
                        'table_id': '{}_DLP'.format(TABLE_ID)
                    }
                }
            },
        },
    ]
}

operation = dlp.create_dlp_job(
    parent=dlp.project_path(GCP_PROJECT_ID),
    inspect_job=inspect_job_data)
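The quoted documentation says to make sure the job has completed before reading the results. A minimal polling sketch under stated assumptions: `wait_for_job` is a hypothetical helper, not part of the DLP client, and it assumes the terminal job states are named DONE, CANCELED and FAILED. The caller supplies a function that returns the current state name, so the helper does not depend on a particular client library version.

```python
import time

# Assumption: names of the terminal states of a DLP job.
TERMINAL_STATES = {"DONE", "CANCELED", "FAILED"}


def wait_for_job(get_state_name, poll_seconds=30, timeout_seconds=3600,
                 sleep=time.sleep):
    """Poll until get_state_name() returns a terminal state name.

    Raises TimeoutError if no terminal state is seen within timeout_seconds.
    """
    waited = 0
    while True:
        state = get_state_name()
        if state in TERMINAL_STATES:
            return state
        if waited >= timeout_seconds:
            raise TimeoutError(
                "job still in state {} after {}s".format(state, waited))
        sleep(poll_seconds)
        waited += poll_seconds
```

With the client from the script, the state would come from `dlp.get_dlp_job(operation.name).state`. Whether that value is an integer or an enum depends on the client library version, so converting it to a name is left to the caller.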
