python - Fastest way to append a large number of Excel files in Pandas

Question

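I have a lot of Excel files (around 30K), each holding the attributes of items (each file has about 3K rows). Some columns are present in all of the Excel files. In addition, there are more columns that may not be present in every file. I want to merge them into a single data frame.

I tried reading each file into a data frame with pandas.read_excel and then combining them with pandas.append. This was not only very slow, it also failed for some files.

Code used: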

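import glob

import pandas as pd

all_data = pd.DataFrame()
dfs = []
for f in glob.glob("sheets_*.xlsx"):
    # Current pandas spells this keyword sheet_name (older releases used sheetname).
    temp = pd.read_excel(f, sheet_name='ItemDetail', skiprows=[0, 2], index_col=0)
    temp = clean_data(temp)  # Do some cleaning here (clean_data is my own function).
    dfs.append(temp)

# DataFrame.append as originally used; it was removed in pandas 2.0.
all_data = all_data.append(dfs, ignore_index=True)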
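
For readers on a current pandas release, here is a minimal sketch of the same loop using pd.concat once at the end, since DataFrame.append no longer exists in pandas 2.0 and later. The helper names read_item_detail and load_all are hypothetical, and the per-file try/except is an assumption about how one might find out which files fail. It is not a claim about the fastest possible approach.

```python
from typing import Callable, Iterable, List, Tuple

import pandas as pd


def read_item_detail(path: str) -> pd.DataFrame:
    # Hypothetical helper: same arguments as the loop above,
    # using the current keyword name sheet_name.
    return pd.read_excel(path, sheet_name="ItemDetail", skiprows=[0, 2], index_col=0)


def load_all(
    paths: Iterable[str],
    reader: Callable[[str], pd.DataFrame] = read_item_detail,
) -> Tuple[pd.DataFrame, List[Tuple[str, Exception]]]:
    """Read every path, collect the frames, and concatenate them once."""
    frames = []
    failed = []
    for path in paths:
        try:
            frames.append(reader(path))
        except Exception as exc:
            # Remember the failing file instead of aborting the whole run.
            failed.append((path, exc))
    if not frames:
        return pd.DataFrame(), failed
    # A single concat builds the result in one step; columns missing
    # from a file are filled with NaN.
    combined = pd.concat(frames, ignore_index=True, sort=False)
    return combined, failed
```

It could be called as `all_data, failed = load_all(glob.glob("sheets_*.xlsx"))` after `import glob`. The list `failed` then holds the path and exception for each file that could not be read.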

Example:

Excel 1

| Item Id   | Source  | country   | Item Name   |Item Weight   |Cost |
|--------------------------------------------------------------------|
|   1       |   x     | India     |   Pen       |     10       | 100 |
|   2       |   y     | Australia |   Pencil    |     15       | 50  |
|   3       |   x     | Germany   |   Eraser    |      5       | 20  |
|   4       |   y     | India     |   Box       |     80       | 200 |
|--------------------------------------------------------------------|

Excel 2

| Item Id   | Source  | country   | Item Name   |Item Weight   |Length| Width |
|-----------------------------------------------------------------------------|
|   1       |   x     | Australia |   chair     |     100      | 20   |   26  |
|   2       |   y     | Australia |   cupboard  |     150      | 30   |   40  |
|   3       |   x     | Germany   |   Table     |      500     | 60   |   50  |
|   4       |   y     | Germany   |   Tool      |     360      | 20   |   80  |
|-----------------------------------------------------------------------------|

Final merged data:

| Item Id   | Source  | country   | Item Name   |Item Weight   |Length| Width | Cost |
|------------------------------------------------------------------------------------|
|   10      |   x     | Australia |   chair     |     100      | 20   |   26  |  NaN |
|   26      |   y     | Australia |   cupboard  |     150      | 30   |   40  |  NaN |
|   38      |   x     | Germany   |   Table     |     500      | 60   |   50  |  NaN |
|   41      |   y     | Germany   |   Tool      |     360      | 20   |   80  |  NaN |
|   1       |   x     | India     |   Pen       |      10      | NaN  |  NaN  |  100 |
|   2       |   y     | Australia |   Pencil    |      15      | NaN  |  NaN  |  50  |
|   3       |   x     | Germany   |   Eraser    |       5      | NaN  |  NaN  |  20  |
|   4       |   y     | India     |   Box       |      80      | NaN  |  NaN  |  200 |
|------------------------------------------------------------------------------------|

Note that in this example, columns such as Item Id, Source and country are present in every file, while other columns may be missing from some files.
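
To make the desired result concrete, here is a small sketch that rebuilds the two example tables by hand and combines them with pd.concat, which keeps the union of the columns and fills the gaps with NaN. The variable names are illustrative, and the Item Id values are taken from the Excel 1 and Excel 2 tables as shown.

```python
import pandas as pd

# The two example sheets, typed in by hand for illustration.
excel_1 = pd.DataFrame({
    "Item Id": [1, 2, 3, 4],
    "Source": ["x", "y", "x", "y"],
    "country": ["India", "Australia", "Germany", "India"],
    "Item Name": ["Pen", "Pencil", "Eraser", "Box"],
    "Item Weight": [10, 15, 5, 80],
    "Cost": [100, 50, 20, 200],
})
excel_2 = pd.DataFrame({
    "Item Id": [1, 2, 3, 4],
    "Source": ["x", "y", "x", "y"],
    "country": ["Australia", "Australia", "Germany", "Germany"],
    "Item Name": ["chair", "cupboard", "Table", "Tool"],
    "Item Weight": [100, 150, 500, 360],
    "Length": [20, 30, 60, 20],
    "Width": [26, 40, 50, 80],
})

# Outer join on columns (the default): Cost is NaN for the Excel 2 rows,
# Length and Width are NaN for the Excel 1 rows.
merged = pd.concat([excel_2, excel_1], ignore_index=True, sort=False)
print(merged)
```

Columns that receive NaN are upcast to float, so Cost, Length and Width come out as floating-point values in the merged frame.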

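The raw data also has about 150 columns. Each sheet has about 3,000 rows, and I have about 35K such sheets. So I am looking for the best way to load all of this data into pandas.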
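
For a sense of scale, a rough back-of-the-envelope calculation from these numbers follows. The figure of 8 bytes per cell is an assumption (a float64 column); string columns stored as Python objects would typically need more, so the result should be read as a lower bound rather than a measurement.

```python
n_files = 35_000       # "about 35K" sheets
rows_per_file = 3_000  # "about 3,000" rows each
n_columns = 150        # "about 150" columns
bytes_per_cell = 8     # assumption: every cell stored as float64

total_rows = n_files * rows_per_file
total_cells = total_rows * n_columns
total_bytes = total_cells * bytes_per_cell

print(f"{total_rows:,} rows, {total_cells:,} cells")
print(f"about {total_bytes / 1e9:.0f} GB in memory")
```

Under that assumption the combined frame has about 105 million rows and would occupy roughly 126 GB, which is worth keeping in mind when deciding whether everything should live in a single in-memory data frame.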
