google-bigquery - Seeing duplicated data from load operations on September 27, 2012

Question

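We just noticed that, around September 27, 2012, data we uploaded from CSV files (using the Java API) was duplicated. The logs show no errors during the upload, but we have confirmed that most of that day's rows are duplicated (every row carries its own timestamp, with microsecond resolution). Were there any known failures that day? We are not sure how to prevent this from happening again.

Thanks for any feedback.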

score 1 · Accepted Answer

First, check the load job history to make sure you did not actually end up running the load job twice. If you are using the bq command-line client:

# Show all jobs for your selected project
bq ls -j

# Will result in a list such as:
...
job_d8fc9d7eefb2e9243b1ffde484b3ab8a   load      FAILURE   29 Sep 00:35:26   0:00:00   
job_4704a91875d9e0c64f7aaa8de0458696   load      SUCCESS   29 Sep 00:28:45   0:00:05   
...

# Find the load jobs pertaining to the time of data loading. To show detailed information
# about which files you ingested in the load job, run a command on the individual jobs
# that might have been repeats:
bq --format prettyjson show -j job_d8fc9d7eefb2e9243b1ffde484b3ab8a
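With many jobs to inspect, the comparison could be scripted. The following is only a rough sketch in Python, under these assumptions: the JSON printed by the show command for each job has been collected into a list of dicts, and those dicts follow the BigQuery job resource layout (jobReference.jobId, configuration.load.sourceUris, configuration.load.destinationTable, status). The function name find_repeated_loads is made up for this illustration. Jobs with no sourceUris, which may be the case for direct file uploads, are skipped because there is nothing to compare.

```python
from collections import defaultdict


def find_repeated_loads(jobs):
    """Return groups of successful load jobs that share a table and source URIs.

    `jobs` is assumed to be a list of dicts shaped like BigQuery job resources.
    The result maps (project, dataset, table, source URIs) to the job IDs
    that loaded that combination more than once.
    """
    groups = defaultdict(list)
    for job in jobs:
        load = job.get("configuration", {}).get("load")
        if load is None:
            continue  # not a load job (e.g. a query job)

        status = job.get("status", {})
        if status.get("state") != "DONE" or "errorResult" in status:
            continue  # unfinished or failed jobs should not have changed the table

        uris = tuple(sorted(load.get("sourceUris", [])))
        if not uris:
            continue  # nothing to compare against (assumed: direct upload)

        dest = load.get("destinationTable", {})
        key = (
            dest.get("projectId"),
            dest.get("datasetId"),
            dest.get("tableId"),
            uris,
        )
        groups[key].append(job["jobReference"]["jobId"])

    # Keep only combinations that were loaded by more than one job.
    return {key: ids for key, ids in groups.items() if len(ids) > 1}
```

Any group this returns is a candidate for an accidental repeat and still needs a manual look, since loading the same file twice can also be intentional.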

score 1 · Accepted Answer

Thanks for looking into this for us. It is hard (almost impossible) to believe that data got duplicated on the BigQuery side. That said, nothing we can see indicates otherwise. As mentioned, we have a microsecond timestamp value on every row. For the two job IDs referenced, I picked a row at random and confirmed that its timestamp is unique across all of the data we have ever imported. When I run the same query against our BigQuery table, I get two (identical) rows.

score 0 · Accepted Answer

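We are not aware of any reason why data would be duplicated during import. If you can give us more information, such as your job ID and project ID, it would help us diagnose the problem.

In general, as Michael mentioned in his answer, people who see duplicated data have usually run the same job twice. (Note that if a job fails, it should not modify the table in any way.)

One way to prevent this kind of collision is to name your jobs, since we enforce job name uniqueness at the per-project level. For example, if you load once a day, you might give your job an ID like "job_2012_10_08_load1". That way, if you try to run the same job twice, the second one will fail at startup.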
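The naming idea can be sketched in a few lines of Python. This is an illustration only and makes no BigQuery calls: daily_load_job_id and JobIdRegistry are hypothetical names, and the registry is just a local stand-in for the per-project uniqueness check the service performs. In real use the generated ID would be passed as the job ID when the load job is submitted.

```python
import datetime


def daily_load_job_id(day, sequence=1):
    """Build a deterministic job ID such as job_2012_10_08_load1.

    The same day and sequence number always yield the same ID, so an
    accidental second submission reuses the ID instead of creating a new one.
    """
    return "job_{:%Y_%m_%d}_load{}".format(day, sequence)


class JobIdRegistry:
    """Local stand-in (assumption) for the service's job ID uniqueness check."""

    def __init__(self):
        self._seen = set()

    def submit(self, job_id):
        # Reject any ID that has already been used in this "project".
        if job_id in self._seen:
            raise ValueError("job ID already used: " + job_id)
        self._seen.add(job_id)
        return job_id


if __name__ == "__main__":
    registry = JobIdRegistry()
    job_id = daily_load_job_id(datetime.date(2012, 10, 8))
    registry.submit(job_id)      # first run is accepted
    try:
        registry.submit(job_id)  # accidental rerun is rejected
    except ValueError as err:
        print(err)
```

The point of deriving the ID from the load date rather than generating a random one is that a retry or a double-scheduled run collides with the first submission instead of silently loading the same data again.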
