Dear all,

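This month I started using BigQuery to analyze data in the GAE datastore. First, I exported the data to Google Cloud Storage through the "Datastore Admin" page of the GAE console. Then I imported the data from Google Cloud Storage into BigQuery. It worked very smoothly, except for repeated structured properties. I expected the imported records to have the following format: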

    parent:"James",
    children: [{
        name: "name1",
        age: 5,
        gender: "M"
      }, {
        name: "name2",
        age: 50,
        gender: "F"
      }, {
        name: "name3",
        age: 33,
        gender: "M"
      },
    ]

I know how to flatten data in the format above. However, the actual data in BigQuery appears to be stored in the following format:

    parent: "James",
    children.name:["name1", "name2", "name3"],
    children.age:[5, 50, 33],
    children.gender:["M", "F", "M"],    

I would like to know whether the data above can be flattened in BigQuery for further analysis. The ideal result table format I have in mind is:

    parentName, children.name, children.age, children.gender
    James, name1, 5, "M"
    James, name2, 50, "F"
    James, name3, 33, "M"      

Cheers!

2 Answers

With the recent introduction of BigQuery Standard SQL, things have become much better!
Try the query below (make sure to uncheck the Use Legacy SQL checkbox under Show Options):

WITH parents AS (
  SELECT 
    "James" AS parentName,
    STRUCT(
      ["name1", "name2", "name3"] AS name,
      [5, 50, 33] AS age,
      ["M", "F", "M"] AS gender
    ) AS children    
)
SELECT 
  parentName, childrenName, childrenAge, childrenGender
FROM 
  parents, 
  UNNEST(children.name) AS childrenName WITH OFFSET AS pos_name,
  UNNEST(children.age) AS childrenAge WITH OFFSET AS pos_age, 
  UNNEST(children.gender) AS childrenGender WITH OFFSET AS pos_gender
WHERE
  pos_name = pos_age AND pos_name = pos_gender

Here the original table, parents, holds the following data, shown as JSON so that its schema is visible:

[{
    "parentName": "James",
    "children": {
      "name": ["name1", "name2", "name3"],
      "age": ["5", "50", "33" ],
      "gender": ["M", "F", "M"]
    }
}]

The output contains one row per child: (James, name1, 5, M), (James, name2, 50, F) and (James, name3, 33, M).

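Note: the above is based only on what I see in the original question and will most likely need adjusting to your specific needs.
I hope this helps with the direction to take and where to start!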

Added:

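The query above uses row-based CROSS JOINs, which means that all combinations for the same parent are assembled first, and then the WHERE clause filters out the "wrong" ones.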
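To make the cross-join-then-filter behaviour concrete, here is a minimal Python sketch of the same logic. It only models the semantics of the query and says nothing about how BigQuery actually executes it; the function name `flatten_by_cross_join` and the dictionary layout are assumptions made for illustration.

```python
from itertools import product


def flatten_by_cross_join(parent_name, children):
    """Model of the first query: cross join three offset-tagged arrays, then filter."""
    # UNNEST(...) WITH OFFSET: pair every element with its position.
    names = list(enumerate(children["name"]))      # (pos_name, childrenName)
    ages = list(enumerate(children["age"]))        # (pos_age, childrenAge)
    genders = list(enumerate(children["gender"]))  # (pos_gender, childrenGender)

    # CROSS JOIN: every combination of one name, one age and one gender.
    combos = list(product(names, ages, genders))

    # WHERE pos_name = pos_age AND pos_name = pos_gender
    rows = [
        (parent_name, name, age, gender)
        for (pos_name, name), (pos_age, age), (pos_gender, gender) in combos
        if pos_name == pos_age and pos_name == pos_gender
    ]
    return rows, len(combos)


children = {
    "name": ["name1", "name2", "name3"],
    "age": [5, 50, 33],
    "gender": ["M", "F", "M"],
}
rows, intermediate = flatten_by_cross_join("James", children)
print(intermediate)  # 27 combinations are built for 3 children
for row in rows:
    print(row)
```

For three children the model builds 3 x 3 x 3 = 27 combinations and keeps only 3 of them, which is the "side effect" mentioned above.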

In contrast, the version below uses INNER JOIN to eliminate this "side effect":

WITH parents AS (
  SELECT 
    "James" AS parentName,
    STRUCT(
      ["name1", "name2", "name3"] AS name,
      [5, 50, 33] AS age,
      ["M", "F", "M"] AS gender
    ) AS children   
)
SELECT 
  parentName, childrenName, childrenAge, childrenGender
FROM 
  parents, UNNEST(children.name) AS childrenName WITH OFFSET AS pos_name
JOIN UNNEST(children.age) AS childrenAge WITH OFFSET AS pos_age 
  ON pos_name = pos_age
JOIN UNNEST(children.gender) AS childrenGender WITH OFFSET AS pos_gender 
  ON pos_age = pos_gender

Intuitively, I would expect the second version to be more efficient for larger tables.
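As a rough illustration of why joining on position can avoid the large intermediate result, the sketch below looks up each age and gender by offset instead of enumerating every combination. This is a conceptual model only: whether BigQuery's planner behaves this way is not confirmed here, and `flatten_by_position_join` is a hypothetical helper.

```python
def flatten_by_position_join(parent_name, children):
    """Model of the second query: match elements on their offsets directly."""
    # Index the other two arrays by offset, as WITH OFFSET would number them.
    age_at = dict(enumerate(children["age"]))
    gender_at = dict(enumerate(children["gender"]))

    rows = []
    for pos_name, name in enumerate(children["name"]):
        # JOIN ... ON pos_name = pos_age, then JOIN ... ON pos_age = pos_gender.
        # As with an INNER JOIN, positions missing from either array are dropped.
        if pos_name in age_at and pos_name in gender_at:
            rows.append((parent_name, name, age_at[pos_name], gender_at[pos_name]))
    return rows


children = {
    "name": ["name1", "name2", "name3"],
    "age": [5, 50, 33],
    "gender": ["M", "F", "M"],
}
for row in flatten_by_position_join("James", children):
    print(row)
```

In this model the work grows linearly with the number of children rather than with the cube of it.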

answered 2016-05-19T00:37:25.290

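You should be able to use the "large query results" feature to produce a new flattened table. Unfortunately, the syntax is terrifying. The basic principle is that you flatten each field and record its position, then keep only the rows where the positions match. Try something like: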

SELECT parentName, children.name, children.age, children.gender,
  POSITION(children.name) AS name_pos,
  POSITION(children.age) AS age_pos,
  POSITION(children.gender) AS gender_pos
FROM table

SELECT
  parent,
  children.name,
  children.age,
  children.gender,
  pos
FROM (
  SELECT
    parent,
    children.name,
    children.age,
    children.gender,
    gender_pos,
    pos
  FROM (
      FLATTEN((
        SELECT
          parent,
          children.name,
          children.age,
          children.gender,
          pos,
          POSITION(children.gender) as gender_pos
        FROM (
          SELECT
            parent,
            children.name,
            children.age,
            children.gender,
            pos,              
          FROM (
              FLATTEN((
                SELECT
                  parent,
                  children.name,
                  children.age,
                  children.gender,
                  pos,
                  POSITION(children.age) AS age_pos
                FROM (
                    FLATTEN((
                      SELECT
                        parent,     
                        children.name,
                        children.age,
                        children.gender,
                        POSITION(children.name) AS pos
                      FROM table
                        ),
                      children.name))),
                children.age))
          WHERE
            age_pos = pos)),
        children.gender)))
WHERE
  gender_pos = pos;
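
The nested query is easier to follow as three stages: flatten one repeated field while recording its position, then flatten the next one and keep only the rows whose positions agree. The Python sketch below mirrors that structure under the assumption that a record looks like the one in the question; `flatten_field` and `flatten_children` are hypothetical helpers, not BigQuery functions, and positions start at 1 to match POSITION.

```python
def flatten_field(rows, field, pos_key):
    """One row per element of row[field], with its 1-based position under pos_key.

    Plays the role of FLATTEN(..., field) combined with POSITION(field).
    """
    out = []
    for row in rows:
        for pos, value in enumerate(row[field], start=1):
            new_row = dict(row)       # other repeated fields stay as arrays
            new_row[field] = value
            new_row[pos_key] = pos
            out.append(new_row)
    return out


def flatten_children(record):
    # Innermost stage: flatten children.name and remember its position as pos.
    rows = flatten_field([record], "children.name", "pos")
    # Middle stage: flatten children.age, then WHERE age_pos = pos.
    rows = [r for r in flatten_field(rows, "children.age", "age_pos")
            if r["age_pos"] == r["pos"]]
    # Outer stage: flatten children.gender, then WHERE gender_pos = pos.
    rows = [r for r in flatten_field(rows, "children.gender", "gender_pos")
            if r["gender_pos"] == r["pos"]]
    return [(r["parent"], r["children.name"], r["children.age"], r["children.gender"])
            for r in rows]


record = {
    "parent": "James",
    "children.name": ["name1", "name2", "name3"],
    "children.age": [5, 50, 33],
    "children.gender": ["M", "F", "M"],
}
for row in flatten_children(record):
    print(row)
```

Filtering after each stage keeps the intermediate row count small, which is why the filters sit inside the nesting rather than all at the end.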

To allow large results in the BigQuery web UI, click the "Advanced Options" button, specify a destination table, and check the "Allow Large Results" flag.

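Note that if your data is stored as entities with nested records such as {name, age, gender}, we should be converting it into nested records in BigQuery rather than into parallel arrays. I will look into why this is happening.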
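
To show the difference between the two layouts, here is a small Python sketch that turns the parallel arrays from the question into nested {name, age, gender} records. It only illustrates the target shape and is not what the Datastore import does; `to_nested_records` is a hypothetical helper.

```python
def to_nested_records(row):
    """Turn parallel children.* arrays into a list of {name, age, gender} records."""
    names = row["children.name"]
    ages = row["children.age"]
    genders = row["children.gender"]
    # Parallel arrays only make sense when they line up element by element.
    if not (len(names) == len(ages) == len(genders)):
        raise ValueError("parallel arrays must have the same length")
    return {
        "parent": row["parent"],
        "children": [
            {"name": n, "age": a, "gender": g}
            for n, a, g in zip(names, ages, genders)
        ],
    }


imported = {
    "parent": "James",
    "children.name": ["name1", "name2", "name3"],
    "children.age": [5, 50, 33],
    "children.gender": ["M", "F", "M"],
}
print(to_nested_records(imported))
```

With the nested layout each child is a single unit, so no position matching is needed when flattening.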

answered 2013-06-21T15:12:11.943