3

I have a prototype server[0] that's doing an os.walk()[1] for each query a client[0] makes.

I'm currently looking into ways of:

  • caching this data in memory,
  • speeding up queries, and
  • hopefully allowing for expansion into storing metadata and data persistence later on.

I find SQL complicated for tree structures, so I thought I would get some advice before actually committing to SQLite

Are there any cross-platform, embeddable or bundle-able non-SQL databases that might be able to handle this kind of data?

  • I have a small (10k-100k files) list.
  • I have an extremely small amount of connections (maybe 10-20).
  • I want to be able to scale to handling metadata as well.

[0] the server and client are actually the same piece of software, this is a P2P application, that's designed to share files over a local trusted network with out a main server, using zeroconf for discovery, and twisted for pretty much everything else

[1] query time is currently 1.2s with os.walk() on 10,000 files

Here is the related function in my Python code that does the walking:

def populate(self, string):
    for name, sharedir in self.sharedirs.items():
        for root, dirs, files, in os.walk(sharedir):
            for dir in dirs:
                if fnmatch.fnmatch(dir, string):
                    yield os.path.join(name, *os.path.join(root, dir)[len(sharedir):].split("/"))
            for file in files:
                if fnmatch.fnmatch(file, string): 
                    yield os.path.join(name, *os.path.join(root, ile)[len(sharedir):].split("/"))
4

3 回答 3

3

你不需要持久化一个树形结构——事实上,你的代码正忙着把目录树的自然树形结构成一个线性序列,那为什么下次还要从树形结构重新开始呢?

看起来你需要的只是一个有序的序列:

i   X    result of os.path.join for X

其中 X 是一个字符串,命名一个文件或目录(您将它们视为相同),i 是一个递增的整数(以保持顺序),结果列也是一个字符串,是os.path.join(name, *os.path.join(root,&c 的结果。

当然,这很容易放入 SQL 表中!

要第一次创建表,只需从您的填充函数中删除守卫if fnmatch.fnmatch(和参数),在 os.path.join 结果之前生成目录或文件,然后使用 a保存调用(或者,使用自增列,任君挑选)。要使用该表,本质上变为:stringcursor.executemanyenumeratepopulate

select result from thetable where X LIKE '%foo%' order by i

string在哪里foo

于 2010-08-21T15:02:51.123 回答
3

起初我误解了这个问题,但我认为我现在有了一个解决方案(并且与我的其他答案完全不同,需要一个新的答案)。基本上,您第一次在目录上运行 walk 时执行正常查询,但您存储产生的值。第二次,您只需生成那些存储的值。我已经包装了 os.walk() 调用,因为它很短,但是您可以轻松地将生成器包装为一个整体。

cache = {}
def os_walk_cache( dir ):
   if dir in cache:
      for x in cache[ dir ]:
         yield x
   else:
      cache[ dir ]    = []
      for x in os.walk( dir ):
         cache[ dir ].append( x )
         yield x
   raise StopIteration()

我不确定您的内存要求,但您可能需要考虑定期清理cache.

于 2010-08-22T11:05:21.497 回答
0

你看过MongoDB吗?怎么样mod_pythonmod_python应该允许你做你的os.walk(),只是将数据存储在 Python 数据结构中,因为脚本在连接之间是持久的。

于 2010-08-21T10:53:47.097 回答