
I am new to Google Cloud DLP. I ran POST https://dlp.googleapis.com/v2beta1/inspect/operations to scan a .parquet file in a Google Cloud Storage directory, and used cloudStorageOptions to save the output as .csv.

The .parquet file is 53.93 MB.

When I make the API call on the .parquet file, I get:

"processedBytes": "102308122",
"infoTypeStats": [{
   "infoType": {
      "name": "AMERICAN_BANKERS_CUSIP_ID"
   },
   "count": "1"
}, {
   "infoType": {
      "name": "IP_ADDRESS"
   },
   "count": "17"
}, {
   "infoType": {
      "name": "US_TOLLFREE_PHONE_NUMBER"
   },
   "count": "148"
}, {
   "infoType": {
      "name": "EMAIL_ADDRESS"
   },
   "count": "30"
}, {
   "infoType": {
      "name": "US_STATE"
   },
   "count": "22"
}]

When I convert the .parquet file to .csv, the result is a 360.58 MB file. If I then make the API call on the .csv file, I get:

"processedBytes": "377530307",
"infoTypeStats": [{
   "infoType": {
      "name": "CREDIT_CARD_NUMBER"
   },
   "count": "56546"
}, {
   "infoType": {
      "name": "EMAIL_ADDRESS"
   },
   "count": "372527"
}, {
   "infoType": {
      "name": "NETHERLANDS_BSN_NUMBER"
   },
   "count": "5"
}, {
   "infoType": {
      "name": "US_TOLLFREE_PHONE_NUMBER"
   },
   "count": "1331321"
}, {
   "infoType": {
      "name": "AUSTRALIA_TAX_FILE_NUMBER"
   },
   "count": "52269"
}, {
   "infoType": {
      "name": "PHONE_NUMBER"
   },
   "count": "28"
}, {
   "infoType": {
      "name": "US_DRIVERS_LICENSE_NUMBER"
   },
   "count": "114"
}, {
   "infoType": {
      "name": "US_STATE"
   },
   "count": "141383"
}, {
   "infoType": {
      "name": "KOREA_RRN"
   },
   "count": "56144"
}],

Clearly, when I scan the .parquet file, not all infoTypes are detected, compared with the scan of the .csv file, where I verified that all EmailAddresses were detected.
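
To make the gap easier to see, the two results can be lined up per infoType. The following is only a rough sketch: the helper names `info_type_counts` and `compare_counts` are made up for illustration, and the JSON strings are abbreviated copies of the responses above, wrapped in braces so that they parse as complete objects.

```python
import json


def info_type_counts(response_json):
    """Map each infoType name to its count (the API returns counts as strings)."""
    stats = json.loads(response_json)["infoTypeStats"]
    return {s["infoType"]["name"]: int(s["count"]) for s in stats}


def compare_counts(parquet_counts, csv_counts):
    """Return (name, parquet_count, csv_count) for every infoType seen in either scan."""
    names = sorted(set(parquet_counts) | set(csv_counts))
    return [(n, parquet_counts.get(n, 0), csv_counts.get(n, 0)) for n in names]


# Abbreviated responses: only a few infoTypes from each scan are included.
parquet_response = """{
  "processedBytes": "102308122",
  "infoTypeStats": [
    {"infoType": {"name": "IP_ADDRESS"}, "count": "17"},
    {"infoType": {"name": "EMAIL_ADDRESS"}, "count": "30"},
    {"infoType": {"name": "US_STATE"}, "count": "22"}
  ]
}"""

csv_response = """{
  "processedBytes": "377530307",
  "infoTypeStats": [
    {"infoType": {"name": "CREDIT_CARD_NUMBER"}, "count": "56546"},
    {"infoType": {"name": "EMAIL_ADDRESS"}, "count": "372527"},
    {"infoType": {"name": "US_STATE"}, "count": "141383"}
  ]
}"""

rows = compare_counts(info_type_counts(parquet_response),
                      info_type_counts(csv_response))
for name, in_parquet, in_csv in rows:
    print(f"{name:22} parquet={in_parquet:>6} csv={in_csv:>8}")
```

An infoType missing from one scan shows up with a count of 0, so types found only in the .csv scan stand out immediately.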

I could not find any documentation about compressed files (such as Parquet), so I assume Google Cloud DLP does not support them.

Any help would be greatly appreciated.


1 Answer


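Parquet files are currently scanned as binary objects, because the system does not yet parse them intelligently. In the V2 API, the supported file types are listed here: https://cloud.google.com/dlp/docs/reference/rpc/google.privacy.dlp.v2#filetype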
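
As a rough illustration of what that could look like in a V2 request, here is a sketch that builds a job body restricted to files the service treats as text, which would cover the converted .csv. The helper `build_inspect_job` and the bucket path are placeholders, and the field names (`inspectJob`, `storageConfig`, `cloudStorageOptions`, `fileSet`, `fileTypes`) are my reading of the V2 REST reference, so verify them against the documentation before relying on this. The snippet only builds and prints the body; it does not call the API.

```python
import json


def build_inspect_job(gcs_url, info_type_names):
    """Build a request body for a Cloud Storage inspection job (hypothetical helper)."""
    return {
        "inspectJob": {
            "storageConfig": {
                "cloudStorageOptions": {
                    # Placeholder location; a trailing wildcard selects a directory.
                    "fileSet": {"url": gcs_url},
                    # Assumption: TEXT_FILE limits the scan to text-like files.
                    "fileTypes": ["TEXT_FILE"],
                }
            },
            "inspectConfig": {
                "infoTypes": [{"name": name} for name in info_type_names],
            },
        }
    }


body = build_inspect_job("gs://example-bucket/exports/*",
                         ["EMAIL_ADDRESS", "US_STATE"])
print(json.dumps(body, indent=2))
```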

Answered 2018-04-18T22:12:42.870