python - os.walk() caching/speeding up

Question

I have a prototype server[0] that's doing an os.walk()[1] for each query a client[0] makes.

I'm currently looking into ways of:

caching this data in memory,
speeding up queries, and
hopefully allowing for expansion into storing metadata and data persistence later on.

I find SQL complicated for tree structures, so I thought I would get some advice before actually committing to SQLite

Are there any cross-platform, embeddable or bundle-able non-SQL databases that might be able to handle this kind of data?

I have a small (10k-100k files) list.
I have an extremely small amount of connections (maybe 10-20).
I want to be able to scale to handling metadata as well.

[0] the server and client are actually the same piece of software, this is a P2P application, that's designed to share files over a local trusted network with out a main server, using zeroconf for discovery, and twisted for pretty much everything else

[1] query time is currently 1.2s with os.walk() on 10,000 files

Here is the related function in my Python code that does the walking:

def populate(self, string):
    for name, sharedir in self.sharedirs.items():
        for root, dirs, files, in os.walk(sharedir):
            for dir in dirs:
                if fnmatch.fnmatch(dir, string):
                    yield os.path.join(name, *os.path.join(root, dir)[len(sharedir):].split("/"))
            for file in files:
                if fnmatch.fnmatch(file, string): 
                    yield os.path.join(name, *os.path.join(root, ile)[len(sharedir):].split("/"))

score 3 · Accepted Answer

你不需要持久化一个树形结构——事实上，你的代码正忙着把目录树的自然树形结构拆成一个线性序列，那为什么下次还要从树形结构重新开始呢？

看起来你需要的只是一个有序的序列：

i   X    result of os.path.join for X

其中 X 是一个字符串，命名一个文件或目录（您将它们视为相同），i 是一个递增的整数（以保持顺序），结果列也是一个字符串，是os.path.join(name, *os.path.join(root,&c 的结果。

当然，这很容易放入 SQL 表中！

要第一次创建表，只需从您的填充函数中删除守卫if fnmatch.fnmatch（和参数），在 os.path.join 结果之前生成目录或文件，然后使用 a保存调用（或者，使用自增列，任君挑选）。要使用该表，本质上变为：stringcursor.executemanyenumeratepopulate

select result from thetable where X LIKE '%foo%' order by i

string在哪里foo。

score 3 · Accepted Answer

起初我误解了这个问题，但我认为我现在有了一个解决方案（并且与我的其他答案完全不同，需要一个新的答案）。基本上，您第一次在目录上运行 walk 时执行正常查询，但您存储产生的值。第二次，您只需生成那些存储的值。我已经包装了 os.walk() 调用，因为它很短，但是您可以轻松地将生成器包装为一个整体。

cache = {}
def os_walk_cache( dir ):
   if dir in cache:
      for x in cache[ dir ]:
         yield x
   else:
      cache[ dir ]    = []
      for x in os.walk( dir ):
         cache[ dir ].append( x )
         yield x
   raise StopIteration()

我不确定您的内存要求，但您可能需要考虑定期清理cache.

score 0 · Accepted Answer

你看过MongoDB吗？怎么样mod_python？ mod_python应该允许你做你的os.walk()，只是将数据存储在 Python 数据结构中，因为脚本在连接之间是持久的。

python - os.walk() caching/speeding up

3 回答 3

Related

Reference