python - 一会儿用 Python 用 np.save 写一个大文件 True Loop

Question

我正在使用 while True 循环抓取网站，然后将所有数据保存到带有 np.savez 的文件中。我想处理 npz 文件，但文件更新速度比我复制它的速度快。这是我的代码：

while True:
  time.sleep(1.5)
  for post in new:
    all_posts.append(post)
  np.savez('records.npz', posts)
  new = other_site.get_next()

最初为了处理我正在抓取的数据，我只是复制文件，但现在文件太大而且每次都会损坏。我可以从头开始重新启动这个过程并减少保存的频率，这样我就有更多的时间来复制，但我想知道是否有办法恢复我写入的数据。我的另一个想法是截断文件的末尾，使其看起来仍然像一个 npz 文件，python 可以读取它，但我不知道这是否可能。

score 0 · Accepted Answer

为了避免您的文件被践踏或覆盖，为什么不编写一些 python 代码来避免这种情况呢？例如，您可以为每个站点保存到一个新文件，并将这些文件收集到一个目录中；

import os

os.mkdir('scraped_sites')

while True:
   time.sleep(1.5)
   for post in new:
      all_posts.append(post)

   # create a unique file path
   save_file = os.path.join('scraped_sites', 'records_%s.npz' % other_site)
   np.savez(save_file, all_posts)

   new = other_site.get_next()

这样您的文件将永远不会被践踏，因此您不必担心在再次写入之前对其进行处理。如果您不喜欢为文件命名的想法，请查看tempfile

此外，while True可能很危险，因为您的循环永远不会退出 - 我假设您只是为了简洁而编写了这个，但是最好有一个breakorwhile <conditional这样您就不会在中间文件写入时意外强制循环退出.

python - 一会儿用 Python 用 np.save 写一个大文件 True Loop

1 回答 1

Related

Reference