My machine learning script produces a lot of data (millions of BTrees contained in one root BTree) and stores it in ZODB's FileStorage, mainly because all of it does not fit in RAM. The script also frequently modifies previously added data.
When I increased the complexity of the problem, and therefore the amount of data to store, I noticed performance issues: the script now computes on average two to even ten times slower (the only thing that changed is the amount of data being stored and later retrieved for modification).
I tried setting cache_size to various values between 1000 and 50000. Honestly, the differences in speed were negligible.
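For reference, here is roughly how I passed it (a minimal sketch; the value shown is just one of the ones I tried, and it relies on ZODB.connection forwarding keyword arguments such as cache_size to ZODB.DB):

import ZODB

# cache_size limits the number of objects (not bytes) kept in each
# connection's pickle cache; 50000 is just one of the values I tried
connection = ZODB.connection('zodb.fs', cache_size=50000)
dbroot = connection.root()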
I thought about switching to RelStorage, but unfortunately the documentation only covers how to configure frameworks such as Zope or Plone. I am using ZODB on its own.
I wonder whether RelStorage would be faster in my case.
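As far as I can tell, RelStorage can be wired up without Zope or Plone by constructing the storage directly. A sketch of what I have in mind, assuming RelStorage 1.x's API with a MySQL backend (the db/user/passwd values are placeholders):

from relstorage.adapters.mysql import MySQLAdapter
from relstorage.storage import RelStorage
from ZODB import DB

# MySQLAdapter passes its keyword arguments on to the MySQL driver;
# the credentials below are placeholders
adapter = MySQLAdapter(db='zodb', user='zodb', passwd='secret')
storage = RelStorage(adapter)
db = DB(storage)
connection = db.open()
dbroot = connection.root()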
Here is how I currently set up the ZODB connection:
import ZODB
connection = ZODB.connection('zodb.fs', ...)
dbroot = connection.root()
I am well aware that ZODB is currently the bottleneck of my script. I am looking for advice on how to solve this problem.
I chose ZODB because I thought a NoSQL database would fit my case better, and I liked the idea of an interface similar to Python's dict.
Code and data structures:

Root data structures:
if not hasattr(dbroot, 'actions_values'):
    dbroot.actions_values = BTree()

if not hasattr(dbroot, 'games_played'):
    dbroot.games_played = 0
actions_values is conceptually structured as follows:

actions_values = { # BTree
    str(state): { # BTree
        # contains actions (the column to pick, to be exact, as I'm working
        # on an agent playing Connect 4) and their values (only actions
        # previously taken by the agent are present here), e.g.:
        1: 0.4356,
        5: 0.3456,
    },
    # other states
}
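To make the nesting concrete, here is a small standalone sketch of writing and reading one such entry. It assumes the BTree above is BTrees.OOBTree.OOBTree, which takes string keys and arbitrary values:

from BTrees.OOBTree import OOBTree

actions_values = OOBTree()

state_key = 'some-board-string'  # in the real code this is str(state)
actions_values[state_key] = OOBTree()
actions_values[state_key][1] = 0.4356
actions_values[state_key][5] = 0.3456

print(actions_values[state_key].get(3, 0))  # dict-style .get with default -> 0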
state is a simple two-dimensional array representing the game board. The possible values of its fields are 1, 2 or None:

board = [ [ None ] * cols for _ in xrange(rows) ]
(in my case rows = 6 and cols = 7)
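As a side note on key size: assuming str(state) renders the nested list the way Python renders a plain list of lists, every BTree key is a fairly long string. A quick sketch:

rows, cols = 6, 7
board = [[None] * cols for _ in xrange(rows)]
key = str(board)
print(len(key))   # 264 characters for an empty 6x7 board
print(key[:34])   # [[None, None, None, None, None, No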
Main loop:
should_play = 10000000
transactions_freq = 10000
packing_freq = 50000

player = ReinforcementPlayer(dbroot.actions_values, config)

while dbroot.games_played < should_play:
    # max_epsilon at start, then linearly drops to min_epsilon:
    epsilon = max_epsilon - (max_epsilon - min_epsilon) * dbroot.games_played / (should_play - 1)

    dbroot.games_played += 1
    sys.stdout.write('\rPlaying game %d of %d' % (dbroot.games_played, should_play))
    sys.stdout.flush()

    board_state = player.play_game(epsilon)

    if dbroot.games_played % transactions_freq == 0:
        print('Committing...')
        transaction.commit()
    if dbroot.games_played % packing_freq == 0:
        print('Packing DB...')
        connection.db().pack()
(packing also takes a lot of time, but it's not the main problem; I could pack the database after the program finishes)
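(For completeness, ZODB's DB.pack() also accepts a time cutoff, so packing can be made less aggressive or deferred; a minimal sketch, with the days value purely illustrative:)

# keep the last day of history instead of packing everything now;
# days is a parameter of DB.pack(), the value 1 is illustrative
connection.db().pack(days=1)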
Code operating on dbroot (inside ReinforcementPlayer):
def get_actions_with_values(self, player_id, state):
    if player_id == 1:
        lookup_state = state
    else:
        lookup_state = state.switch_players()
    lookup_state_str = str(lookup_state)
    if lookup_state_str in self.actions_values:
        return self.actions_values[lookup_state_str]
    mirror_lookup_state_str = str(lookup_state.mirror())
    if mirror_lookup_state_str in self.actions_values:
        return self.mirror_actions(self.actions_values[mirror_lookup_state_str])
    return None

def get_value_of_action(self, player_id, state, action, default=0):
    actions = self.get_actions_with_values(player_id, state)
    if actions is None:
        return default
    return actions.get(action, default)

def set_value_of_action(self, player_id, state, action, value):
    if player_id == 1:
        lookup_state = state
    else:
        lookup_state = state.switch_players()
    lookup_state_str = str(lookup_state)
    if lookup_state_str in self.actions_values:
        self.actions_values[lookup_state_str][action] = value
        return
    mirror_lookup_state_str = str(lookup_state.mirror())
    if mirror_lookup_state_str in self.actions_values:
        self.actions_values[mirror_lookup_state_str][self.mirror_action(action)] = value
        return
    self.actions_values[lookup_state_str] = BTree()
    self.actions_values[lookup_state_str][action] = value
(The functions with mirror in their names just reverse the columns (actions). This is done because Connect 4 boards that are vertical reflections of each other are equivalent; a sketch of what I mean follows below.)
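To spell that out, the mirroring is just index arithmetic. A sketch of mirror_action, assuming 0-based column indices; self.cols is a hypothetical attribute holding the board width:

def mirror_action(self, action):
    # reflect the column index around the board's vertical axis;
    # with cols = 7, column 0 maps to 6, 1 to 5, and so on
    return (self.cols - 1) - action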
After 550000 games, len(dbroot.actions_values) is 6018450.
According to iotop, IO operations take 90% of the time.