
I have managed to write a simple indexer script for MongoDB using pymongo, but I can't figure out why indexing, adding documents, and querying takes up 96 GB of RAM on my server.

Is it because my query is not optimised? How can I optimise a query like database.find_one({"eng":src})?

How else can I optimise my indexer script?

So my input is like this (the actual data input has 2+ million lines of sentences of varying lengths):

#src file

You will be aware from the press and television that there have been a number of bomb explosions and killings in Sri Lanka.
One of the people assassinated very recently in Sri Lanka was Mr Kumar Ponnambalam, who had visited the European Parliament just a few months ago.
Would it be appropriate for you, Madam President, to write a letter to the Sri Lankan President expressing Parliament's regret at his and the other violent deaths in Sri Lanka and urging her to do everything she possibly can to seek a peaceful reconciliation to a very difficult situation?
Yes, Mr Evans, I feel an initiative of the type you have just suggested would be entirely appropriate.
If the House agrees, I shall do as Mr Evans has suggested.

#trg file

Wie Sie sicher aus der Presse und dem Fernsehen wissen, gab es in Sri Lanka mehrere Bombenexplosionen mit zahlreichen Toten.
Zu den Attentatsopfern, die es in jüngster Zeit in Sri Lanka zu beklagen gab, zählt auch Herr Kumar Ponnambalam, der dem Europäischen Parlament erst vor wenigen Monaten einen Besuch abgestattet hatte.
Wäre es angemessen, wenn Sie, Frau Präsidentin, der Präsidentin von Sri Lanka in einem Schreiben das Bedauern des Parlaments zum gewaltsamen Tod von Herrn Ponnambalam und anderen Bürgern von Sri Lanka übermitteln und sie auffordern würden, alles in ihrem Kräften stehende zu tun, um nach einer friedlichen Lösung dieser sehr schwierigen Situation zu suchen?
Ja, Herr Evans, ich denke, daß eine derartige Initiative durchaus angebracht ist.
Wenn das Haus damit einverstanden ist, werde ich dem Vorschlag von Herrn Evans folgen.

A sample document looks like this:

{ 
    "_id" : ObjectId("50f5fe8916174763f6217994"), 
    "deu" : "Wie Sie sicher aus der Presse und dem Fernsehen wissen, gab es in Sri Lanka mehrere Bombenexplosionen mit zahlreichen Toten.\n", 
    "uid" : 13, 
    "eng" : "You will be aware from the press and television that there have been a number of bomb explosions and killings in Sri Lanka." 
}

My code:

# -*- coding: utf8 -*-
import codecs, glob, os
from pymongo import MongoClient
from itertools import izip
from bson.code import Code

import sys
reload(sys)
sys.setdefaultencoding("utf-8")

# Returns the keys in dic whose value equals the given value.
def getKey(dic, value):
  return [k for k,v in dic.items() if v == value]

def langiso (lang, isochar=3):
  languages = {"en":"eng",
               "da":"dan","de":"deu",
               "es":"spa",
               "fi":"fin","fr":"fre",
               "it":"ita",
               "nl":"nld",
               "zh":"mcn"}
  if len(lang) == 2 and isochar == 3:
    return languages[lang]
  if len(lang) == 3 and isochar == 2:
    return getKey(languages, lang)

def txtPairs (bitextDir):
  txtpairs = {}
  for infile in glob.glob(os.path.join(bitextDir, '*')):
    #print infile
    k = infile[-8:-3]; lang = infile[-2:]
    try:
      txtpairs[k] = (txtpairs[k],infile) if lang == "en" else (infile,txtpairs[k]) 
    except KeyError:
      txtpairs[k] = infile
  # .keys() returns a list in Python 2, so it is safe to delete entries while iterating
  for i in txtpairs.keys():
    if len(txtpairs[i]) != 2:
      del txtpairs[i]
  return txtpairs

def indexEuroparl(sfile, tfile, database):   
  trglang = langiso(tfile[-2:]) #; srclang = langiso(sfile[-2:]) 

  maxdoc = database.find().sort("uid",-1).limit(1)
  uid = 1 if maxdoc.count() == 0 else maxdoc[0]["uid"] + 1

  counter = 0
  for src, trg in izip(codecs.open(sfile,"r","utf8"), \
                       codecs.open(tfile,"r","utf8")):
    quid = database.find_one({"eng":src})
    # If sentence already exist in db
    if quid != None:
      if quid.get(trglang) is not None:
        print "Sentence uniqID",quid["uid"],"already exist."
        continue
      else:
        print "Reindexing uniqID",quid["uid"],"..."
        database.update({"uid":quid["uid"]}, {"$push":{trglang:trg}})
    else:
      print "Indexing uniqID",uid,"..."
      doc = {"uid":uid,"eng":src,trglang:trg}
      database.insert(doc)
      uid+=1
    if counter == 1000:
      for i in database.find():
        print i
      counter = 0
    counter+=1

connection = MongoClient()
db = connection["europarl"]
v7 = db["v7"]

srcfile = "eng-deu.en"; trgfile = "eng-deu.de"
indexEuroparl(srcfile,trgfile,v7)

# After indexing the English-German pair, I'll perform the same indexing on other language pairs
srcfile = "eng-spa.en"; trgfile = "eng-spa.es"
indexEuroparl(srcfile,trgfile,v7)

1 Answer


After several rounds of profiling the code, I found where the RAM was leaking.

First, if I am going to query the "eng" field often, I should create an index on that field by doing this:

v7.ensure_index([("eng", 1)], unique=True)

"eng"这解决了跨未索引字段的串行搜索所花费的时间。

Second, the RAM bleed came from this costly call:

doc = {"uid":uid,"eng":src,trglang:trg}
if counter == 1000:
  for i in database.find():
    print i
  counter = 0
counter+=1

As @Sammaye pointed out, what MongoDB does is store the results in RAM. Every time I call database.find(), it keeps a full set of the documents I have added to the collection in RAM, and that is how I burnt up 96 GB. The snippet above had to be changed to:

doc = {"uid":uid,"eng":src,trglang:trg}
if counter == 1000:
  print doc
  counter = 0
counter+=1

By eliminating the database.find() calls and creating an index on the "eng" field, I used only 25 GB and finished indexing 2 million sentences in less than an hour.
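
If it has to go faster still, batching the inserts should help as well. The following is only a sketch of the idea, not the script I ran: it assumes a PyMongo 3+ driver (for insert_many) and the same eng-deu.en / eng-deu.de input files, and it leans on the unique "eng" index above to throw out duplicates:

# -*- coding: utf8 -*-
import codecs
from itertools import izip

from pymongo import MongoClient
from pymongo.errors import BulkWriteError

v7 = MongoClient()["europarl"]["v7"]

def flush(buf):
  # One round trip per batch instead of one insert per document.
  if not buf:
    return
  try:
    v7.insert_many(buf, ordered=False)
  except BulkWriteError:
    # The unique index on "eng" rejects sentences that were already indexed;
    # ordered=False lets the rest of the batch go through first.
    pass
  del buf[:]

buf = []
uid = 1
for src, trg in izip(codecs.open("eng-deu.en", "r", "utf8"),
                     codecs.open("eng-deu.de", "r", "utf8")):
  buf.append({"uid": uid, "eng": src, "deu": trg})
  uid += 1
  if len(buf) == 1000:
    flush(buf)
flush(buf)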

Answered 2013-01-18T11:42:44.737