1

我正在尝试在 Dask 数据帧中展平 JSON 数组对象(没有文件 .json),因为我有很多数据并且我的 RAM 被不断运行的进程消耗,所以我需要一个并行形式的解决方案。

这就是我拥有的 JSON:

[ {
        "id": "0001",
        "name": "Stiven",
        "location": [{
                "country": "Colombia",
                "department": "Choco",
                "city": "Quibdo"
            }, {
                "country": "Colombia",
                "department": "Antioquia",
                "city": "Medellin"
            }, {
                "country": "Colombia",
                "department": "Cundinamarca",
                "city": "Bogota"
            }
        ]
    }, {
        "id": "0002",
        "name": "Jhon Jaime",
        "location": [{
                "country": "Colombia",
                "department": "Valle del Cauca",
                "city": "Cali"
            }, {
                "country": "Colombia",
                "department": "Putumayo",
                "city": "Mocoa"
            }, {
                "country": "Colombia",
                "department": "Arauca",
                "city": "Arauca"
            }
        ]
    }, {
        "id": "0003",
        "name": "Francisco",
        "location": [{
                "country": "Colombia",
                "department": "Atlantico",
                "city": "Barranquilla"
            }, {
                "country": "Colombia",
                "department": "Bolivar",
                "city": "Cartagena"
            }, {
                "country": "Colombia",
                "department": "La Guajira",
                "city": "Riohacha"
            }
        ]
    }
]

这就是我拥有的数据框:

index   id    name         location
0       0001  Stiven       [{'country':'Colombia', 'department': 'Choco', 'city': 'Quibdo'}, {'country':'Colombia', 'department': 'Antioquia', 'city': 'Medellin'}, {'country':'Colombia', 'department': 'Cundinamarca', 'city': 'Bogota'}]
1       0002  Jhon Jaime   [{'country':'Colombia', 'department': 'Valle del Cauca', 'city': 'Cali'}, {'country':'Colombia', 'department': 'Putumayo', 'city': 'Mocoa'}, {'country':'Colombia', 'department': 'Arauca', 'city': 'Arauca'}]
2       0003  Francisco    [{'country':'Colombia', 'department': 'Atlantico', 'city': 'Barranquilla'}, {'country':'Colombia', 'department': 'Bolivar', 'city': 'Cartagena'}, {'country':'Colombia', 'department': 'La Guajira', 'city': 'Riohacha'}] 

我需要将每个 id 转换为数据框,如下所示:

index   id    name         country   department       city
0       0001  Stiven       Colombia  Choco            Quibdo
1       0001  Stiven       Colombia  Antioquia        Medellin
2       0001  Stiven       Colombia  Cundinamarca     Bogota
3       0002  Jhon Jaime   Colombia  Valle del Cauca  Cali
4       0002  Jhon Jaime   Colombia  Putumayo         Mocoa
5       0002  Jhon Jaime   Colombia  Arauca           Arauca
6       0003  Francisco    Colombia  Atlantico        Barranquilla
7       0003  Francisco    Colombia  Bolivar          Cartagena 
8       0003  Francisco    Colombia  La Guajira       Riohacha   

所有进程必须与 Dask 并行。有什么推荐吗?

提前致谢。

4

1 回答 1

1

我建议首先使用 Pandas 数据帧解决这个问题,然后使用该.map_partitions函数将该函数应用于 Dask 数据帧中的所有 Pandas 分区。

于 2019-11-03T15:20:57.153 回答