
I'm working on an information retrieval task. I've built a simple search engine. InvertedIndex is a Python dictionary object which is serialized (pickled, in Python terminology) to a file. The size of this file, the InvertedIndex, is just 6.5 MB.

So my code just unpickles it, searches it for the query, and ranks the matching documents according to their TF-IDF scores. Doesn't sound like a big deal, right?

It started running 30 minutes ago and is still running. The private bytes and virtual size of the pythonw.exe running my 100-line Python script are 88 MB and 168 MB respectively.

When I try it with a smaller index, it's fast. So is it Python, or is it my code? Why is it so slow?

stopwords = ['a' , 'a\'s' , 'able' , 'about' , 'above' , 'according' , 'accordingly' , 'across' , 'actually' , 'after' , 'afterwards' , 'again' , 'against' , 'ain\'t' , 'all' , 'allow' , 'allows' , 'almost' , 'alone' , 'along' , 'already' , 'also' , 'although' , 'always' , 'am' , 'among' , 'amongst' , 'an' , 'and' , 'another' , 'any' , 'anybody' , 'anyhow' , 'anyone' , 'anything' , 'anyway' , 'anyways' , 'anywhere' , 'apart' , 'appear' , 'appreciate' , 'appropriate' , 'are' , 'aren\'t' , 'around' , 'as' , 'aside' , 'ask' , 'asking' , 'associated' , 'at' , 'available' , 'away' , 'awfully' , 'b' , 'be' , 'became' , 'because' , 'become' , 'becomes' , 'becoming' , 'been' , 'before' , 'beforehand' , 'behind' , 'being' , 'believe' , 'below' , 'beside' , 'besides' , 'best' , 'better' , 'between' , 'beyond' , 'both' , 'brief' , 'but' , 'by' , 'c' , 'c\'mon' , 'c\'s' , 'came' , 'can' , 'can\'t' , 'cannot' , 'cant' , 'cause' , 'causes' , 'certain' , 'certainly' , 'changes' , 'clearly' , 'co' , 'com' , 'come' , 'comes' , 'concerning' , 'consequently' , 'consider' , 'considering' , 'contain' , 'containing' , 'contains' , 'corresponding' , 'could' , 'couldn\'t' , 'course' , 'currently' , 'd' , 'definitely' , 'described' , 'despite' , 'did' , 'didn\'t' , 'different' , 'do' , 'does' , 'doesn\'t' , 'doing' , 'don\'t' , 'done' , 'down' , 'downwards' , 'during' , 'e' , 'each' , 'edu' , 'eg' , 'eight' , 'either' , 'else' , 'elsewhere' , 'enough' , 'entirely' , 'especially' , 'et' , 'etc' , 'even' , 'ever' , 'every' , 'everybody' , 'everyone' , 'everything' , 'everywhere' , 'ex' , 'exactly' , 'example' , 'except' , 'f' , 'far' , 'few' , 'fifth' , 'first' , 'five' , 'followed' , 'following' , 'follows' , 'for' , 'former' , 'formerly' , 'forth' , 'four' , 'from' , 'further' , 'furthermore' , 'g' , 'get' , 'gets' , 'getting' , 'given' , 'gives' , 'go' , 'goes' , 'going' , 'gone' , 'got' , 'gotten' , 'greetings' , 'h' , 'had' , 'hadn\'t' , 'happens' , 'hardly' , 'has' , 'hasn\'t' , 'have' , 'haven\'t' , 'having' , 'he' , 'he\'s' , 'hello' , 'help' , 'hence' , 'her' , 'here' , 'here\'s' , 'hereafter' , 'hereby' , 'herein' , 'hereupon' , 'hers' , 'herself' , 'hi' , 'him' , 'himself' , 'his' , 'hither' , 'hopefully' , 'how' , 'howbeit' , 'however' , 'i' , 'i\'d' , 'i\'ll' , 'i\'m' , 'i\'ve' , 'ie' , 'if' , 'ignored' , 'immediate' , 'in' , 'inasmuch' , 'inc' , 'indeed' , 'indicate' , 'indicated' , 'indicates' , 'inner' , 'insofar' , 'instead' , 'into' , 'inward' , 'is' , 'isn\'t' , 'it' , 'it\'d' , 'it\'ll' , 'it\'s' , 'its' , 'itself' , 'j' , 'just' , 'k' , 'keep' , 'keeps' , 'kept' , 'know' , 'knows' , 'known' , 'l' , 'last' , 'lately' , 'later' , 'latter' , 'latterly' , 'least' , 'less' , 'lest' , 'let' , 'let\'s' , 'like' , 'liked' , 'likely' , 'little' , 'look' , 'looking' , 'looks' , 'ltd' , 'm' , 'mainly' , 'many' , 'may' , 'maybe' , 'me' , 'mean' , 'meanwhile' , 'merely' , 'might' , 'more' , 'moreover' , 'most' , 'mostly' , 'much' , 'must' , 'my' , 'myself' , 'n' , 'name' , 'namely' , 'nd' , 'near' , 'nearly' , 'necessary' , 'need' , 'needs' , 'neither' , 'never' , 'nevertheless' , 'new' , 'next' , 'nine' , 'no' , 'nobody' , 'non' , 'none' , 'noone' , 'nor' , 'normally' , 'not' , 'nothing' , 'novel' , 'now' , 'nowhere' , 'o' , 'obviously' , 'of' , 'off' , 'often' , 'oh' , 'ok' , 'okay' , 'old' , 'on' , 'once' , 'one' , 'ones' , 'only' , 'onto' , 'or' , 'other' , 'others' , 'otherwise' , 'ought' , 'our' , 'ours' , 'ourselves' , 'out' , 'outside' , 'over' , 'overall' , 'own' , 'p' , 'particular' , 
'particularly' , 'per' , 'perhaps' , 'placed' , 'please' , 'plus' , 'possible' , 'presumably' , 'probably' , 'provides' , 'q' , 'que' , 'quite' , 'qv' , 'r' , 'rather' , 'rd' , 're' , 'really' , 'reasonably' , 'regarding' , 'regardless' , 'regards' , 'relatively' , 'respectively' , 'right' , 's' , 'said' , 'same' , 'saw' , 'say' , 'saying' , 'says' , 'second' , 'secondly' , 'see' , 'seeing' , 'seem' , 'seemed' , 'seeming' , 'seems' , 'seen' , 'self' , 'selves' , 'sensible' , 'sent' , 'serious' , 'seriously' , 'seven' , 'several' , 'shall' , 'she' , 'should' , 'shouldn\'t' , 'since' , 'six' , 'so' , 'some' , 'somebody' , 'somehow' , 'someone' , 'something' , 'sometime' , 'sometimes' , 'somewhat' , 'somewhere' , 'soon' , 'sorry' , 'specified' , 'specify' , 'specifying' , 'still' , 'sub' , 'such' , 'sup' , 'sure' , 't' , 't\'s' , 'take' , 'taken' , 'tell' , 'tends' , 'th' , 'than' , 'thank' , 'thanks' , 'thanx' , 'that' , 'that\'s' , 'thats' , 'the' , 'their' , 'theirs' , 'them' , 'themselves' , 'then' , 'thence' , 'there' , 'there\'s' , 'thereafter' , 'thereby' , 'therefore' , 'therein' , 'theres' , 'thereupon' , 'these' , 'they' , 'they\'d' , 'they\'ll' , 'they\'re' , 'they\'ve' , 'think' , 'third' , 'this' , 'thorough' , 'thoroughly' , 'those' , 'though' , 'three' , 'through' , 'throughout' , 'thru' , 'thus' , 'to' , 'together' , 'too' , 'took' , 'toward' , 'towards' , 'tried' , 'tries' , 'truly' , 'try' , 'trying' , 'twice' , 'two' , 'u' , 'un' , 'under' , 'unfortunately' , 'unless' , 'unlikely' , 'until' , 'unto' , 'up' , 'upon' , 'us' , 'use' , 'used' , 'useful' , 'uses' , 'using' , 'usually' , 'uucp' , 'v' , 'value' , 'various' , 'very' , 'via' , 'viz' , 'vs' , 'w' , 'want' , 'wants' , 'was' , 'wasn\'t' , 'way' , 'we' , 'we\'d' , 'we\'ll' , 'we\'re' , 'we\'ve' , 'welcome' , 'well' , 'went' , 'were' , 'weren\'t' , 'what' , 'what\'s' , 'whatever' , 'when' , 'whence' , 'whenever' , 'where' , 'where\'s' , 'whereafter' , 'whereas' , 'whereby' , 'wherein' , 'whereupon' , 'wherever' , 'whether' , 'which' , 'while' , 'whither' , 'who' , 'who\'s' , 'whoever' , 'whole' , 'whom' , 'whose' , 'why' , 'will' , 'willing' , 'wish' , 'with' , 'within' , 'without' , 'won\'t' , 'wonder' , 'would' , 'would' , 'wouldn\'t' , 'x' , 'y' , 'yes' , 'yet' , 'you' , 'you\'d' , 'you\'ll' , 'you\'re' , 'you\'ve' , 'your' , 'yours' , 'yourself' , 'yourselves' , 'z' , 'zero']
import PorterStemmer
import math
import pickle

def TF(term,doc):
    #Term Frequency: No. of times `term` occurred in `doc`
    global InvertedIndex
    idx = InvertedIndex[term].index(doc)
    count = 0
    while (idx < len(InvertedIndex[term])) and InvertedIndex[term][idx] == doc:
        count= count+1
        idx = idx+1
    return count

def DF(term):
    #Document Frequency: No. of documents containing `term`
    global InvertedIndex
    return len(set(InvertedIndex[term]))

def avgTF(term, doc):
    global docs
    TFs = []
    for term in docs[doc]:
        TFs.append(TF(term,doc))
    return sum(TFs)/len(TFs)

def maxTF(term, doc):
    global docs
    TFs = []    
    for term in docs[doc]:
        TFs.append(TF(term,doc))
    return max(TFs)

def getValues4Term(term, doc):
    TermFrequency = {}
    TermFrequency['natural'] = TF(term,doc)
    TermFrequency['log'] = 1+math.log( TF(term,doc) )
    TermFrequency['aug'] = 0.5+float(0.5*TF(term,doc)/maxTF(term,doc))
    TermFrequency['bool'] = 1 if TF(term,doc)>0 else 0
    TermFrequency['log_avg'] = float(1+math.log( TF(term,doc) ))/(1+math.log( avgTF(term,doc) ))

    DocumentFrequency = {}
    DocumentFrequency['no'] = 1
    DocumentFrequency['idf'] = math.log( len(docs)/DF(term) )
    DocumentFrequency['probIDF'] = max( [0, math.log( float(len(docs)-DF(term))/DF(term) )] )
    return [TermFrequency, DocumentFrequency]

def Cosine(resultDocVector, qVector, doc):
    #`doc` parameter is the document number corresponding to resultDocVector
    global qterms,docs
    # Defining Cosine similarity : cos(a) = A.B/|A||B|

    dotProduct = 0
    commonTerms_q_d = set(qterms).intersection(docs[doc]) #commonTerms in both query & document
    for cmnTerm in commonTerms_q_d:
       dotProduct =  dotProduct + resultDocVector[docs[doc].index(cmnTerm)] * qVector[qterms.index(cmnTerm)]

    resultSquares = []
    for k in resultDocVector:
        resultSquares.append(k*k)

    qSquares = []
    for k in qVector:
        qSquares.append(k*k)

    denominator = math.sqrt(sum(resultSquares)) * math.sqrt(sum(qSquares))
    return dotProduct/denominator

def load():
    #load index from a file
    global InvertedIndex, docIDs, docs
    PICKLE_InvertedIndex_FILE = open("InvertedIndex.db", 'rb')
    InvertedIndex = pickle.load(PICKLE_InvertedIndex_FILE)
    PICKLE_InvertedIndex_FILE.close()

    PICKLE_docIDs_FILE = open("docIDs.db", 'rb')
    docIDs = pickle.load(PICKLE_docIDs_FILE)
    PICKLE_docIDs_FILE.close()

    PICKLE_docs_FILE = open("docs.db", 'rb')
    docs = pickle.load(PICKLE_docs_FILE)
    PICKLE_docs_FILE.close()
########################
docs = []
docIDs = []
InvertedIndex = {}
load()

stemmer = PorterStemmer.PorterStemmer()
#<getting results for a query
query = 'Antarctica exploration'
qwords = query.strip().split()
qterms = []
qterms1 = []
for qword in qwords:
    qword = qword.lower()
    if qword in stopwords:
        continue
    qterm = stemmer.stem(qword,0,len(qword)-1) 
    qterms1.append(qterm)
qterms = list(set(qterms1))


#getting posting lists for each qterms & merging them
prev = set()
i = 0
for qterm in qterms:
    if InvertedIndex.has_key(qterm):
        if i == 0:
            prev = set(InvertedIndex[qterm])
            i = i+1
            continue
        prev = prev.intersection(set(InvertedIndex[qterm]))

results = list(prev)
#</getting results for a query

#We've got the results. Now lets rank them using Cosine similarity.
i = 0
docComponents = []
for doc in results:
        docComponents.append([])

i = 0    
for doc in results:
    for term in docs[doc]:
        vals = getValues4Term(term,doc)#[TermFrequency, DocumentFrequency]
        docComponents[i].append(vals)
    i = i+1
#Normalization = {}

# forming vectors for each document in the result
i = 0 #document iterator
j = 0 #term iterator
resultDocVectors = []#contains document vector for each result.

for doc in results:
        resultDocVectors.append([])

for i in range(0,len(results)):
    for j in range(0,len(docs[doc])):
        tf = docComponents[i][j][0]['natural']#0:TermFrequency
        idf = docComponents[i][j][1]['idf'] #1:DocumentFrequency        
        resultDocVectors[i].append(tf*idf)

#forming vector for query
qVector = []
qTF = []
qDF = []
for qterm in qterms:
    count = 0
    idx = qterms1.index(qterm)
    while idx < len(qterms1) and qterms1[idx] == qterm:
        count= count+1
        idx = idx+1
    qTF.append(count)
qVector = qTF    


#computing Cosine similarities of all resultDocVectors w.r.t. qVector
i = 0
CosineVals = []
for resultDocVector in resultDocVectors:
    doc = results[i]
    CosineVals.append(Cosine(resultDocVector, qVector, doc))
    i = i+1

#ranking as per Cosine Similarities
#this is not "perfect" sorting, as it may not give 100% correct results when multiple docs have the same cosine similarity.
CosineValsCopy = CosineVals
CosineVals.sort()
sortedCosineVals = CosineVals
CosineVals = CosineValsCopy
rankedResults = []
for cval in sortedCosineVals:    
    rankedResults.append(results[CosineVals.index(cval)])
rankedResults.reverse()

#<Evaluation of the system:>

#parsing qrels.txt & getting relevances
# qrels.txt contains columns of the form:
#       qid  iter  docno  rel
#2nd column `iter` can be ignored.
relevances = {}
fh = open("qrels.txt")
lines = fh.readlines()
for line in lines:
    cols = line.strip().split()
    if relevances.has_key(cols[0]):#queryID
        relevances[cols[0]].append(cols[2])#docID
    else:
        relevances[cols[0]] = [cols[2]]
fh.close()

#precision = no. of relevant docs retrieved/total no. of docs retrieved
no_of_relevant_docs_retrieved = set(rankedResults).intersection( set(relevances[queryID]) )
Precision = no_of_relevant_docs_retrieved/len(rankedResults)

#recall = no. of relevant docs retrieved/ total no. of relevant docs
Recall = no_of_relevant_docs_retrieved/len(relevances[queryID])

5 Answers


It's definitely your code, but since you chose to hide it from us, there's no further help we can offer. From the very scarce information you did choose to supply, all I can tell you is that unpickling a dict (the right way) is much faster than you'd think, and that indexing into it (assuming that's what you mean by "searching it for a query") is amazingly fast. From this data I deduce that the cause of your slowdown must be something else you're doing, or doing wrong, in your code.

Edit: now that you have posted the code, one thing jumps out at a glance: a lot of nontrivial code is running at module top level. Really, really horrible practice, and bad for performance: put all of that nontrivial code inside a function and call that function — that by itself will give you a speedup of tens of percent, at zero cost in complexity. I must have mentioned this crucial fact at least 20 times in my Stack Overflow posts, not to mention "Python in a Nutshell" and so on — surely, if you care about performance, you can't blithely ignore information that is this easy to find and this widely spread?!
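For example, the skeleton of that restructuring might look like this (a minimal sketch; the loop body is just a stand-in for the script's real top-level work):

def main():
    # Everything that used to run at module top level goes here.
    # CPython resolves local names much faster than module globals,
    # which are dict lookups on every access.
    total = 0
    for i in range(1000000):
        total += i
    return total

if __name__ == '__main__':
    main()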

More easily fixed wastes of runtime:

import pickle

use cPickle instead (if you're not on Python 2.6 or 2.7 but on 3.1, then there may be other causes for the performance problem — I don't know how finely tuned 3.1 is at this point compared with the splendid performance of 2.6 and 2.7).
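A common way to pick up cPickle on Python 2 while staying compatible with Python 3 (where the C accelerator is folded into pickle automatically) is the usual guarded import; the rest of the script keeps calling pickle.load as before:

try:
    import cPickle as pickle   # C implementation: much faster on Python 2
except ImportError:
    import pickle              # Python 3 already uses the fast C version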

all of the global statements except the one in load are useless (not a serious performance hit, but redundant and useless code should be eliminated on principle). You only need global if you want to bind a module-level global variable from within a function, and load is the only place you do that.
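To illustrate with hypothetical names — global is needed only where a function rebinds the module-level name, never where it merely reads it:

counter = 0

def bump():
    global counter       # rebinding the module global: global is required
    counter = counter + 1

def report():
    print counter        # only reading it: no global statement needed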

More edits

Now for the more important stuff: the values in InvertedIndex appear to be lists of docs, so to know how many times a doc appears in one of them you have to loop through the whole list. Why not make each value a dict from doc to number of occurrences instead? No more looping (and no more looping where you currently do len(set(...)) of a value of InvertedIndex — a plain len would be equivalent, and you save the set(...) operation, which implicitly has to loop to do its job). This is a big-O optimization, not "just" the kind of 20%-ish speedup that the things I've mentioned so far may account for — that is, this is the more important stuff, as I said. Get the right data structures and algorithms, and lots of minor inefficiencies may become relatively unimportant; get them wrong, and there's no saving the code's performance as input sizes grow, no matter how cleverly you micro-optimize the wrong data structures and algorithms ;-).
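A sketch of that restructuring, assuming InvertedIndex currently maps each term to a list of document numbers with one entry per occurrence (which is how TF and DF above read it):

def to_counting_index(InvertedIndex):
    # {term: [doc, doc, doc2, ...]}  ->  {term: {doc: count}}
    new_index = {}
    for term, doc_list in InvertedIndex.items():
        counts = {}
        for doc in doc_list:
            counts[doc] = counts.get(doc, 0) + 1
        new_index[term] = counts
    return new_index

# With this in place:
#   TF(term, doc)  ->  new_index[term].get(doc, 0)   # O(1), no list scanning
#   DF(term)       ->  len(new_index[term])          # O(1), no set() built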

And more: you repeat your computations "from scratch" every time — for example, look at how many times you call TF(term, doc) for any given term and doc (and every single call has the crucial inefficiency I just explained). As the quickest way to fix this huge inefficiency, use memoization — for example, with a memoized decorator.
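A minimal memoizing decorator in that spirit (a sketch, not the exact one the original answer linked to; it assumes the arguments are hashable, which strings and document numbers are):

def memoized(fn):
    cache = {}
    def wrapper(*args):
        if args not in cache:
            cache[args] = fn(*args)   # computed once per distinct argument tuple
        return cache[args]
    return wrapper

# usage:
# @memoized
# def TF(term, doc):
#     ...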

OK, it's gotten quite late here and I'd better head off to bed — I hope some or all of the above suggestions are of some use to you!

Answered 2010-09-27T04:21:31.427

Alex gave you good suggestions for algorithmic change. I'm just going to address writing fast python code. You should do both. If you were to just incorporate my changes, you would still (based on what Alex said) have a broken program, but I'm just not up for understanding your whole program right now and I am in the mood to micro-optimize. Even if you end up throwing a lot of these functions away, comparing a slow implementation with a fast implementation will help you write fast implementations of your new functions.

Take the following function:

def TF(term,doc):
    #Term Frequency: No. of times `term` occurred in `doc`
    global InvertedIndex
    idx = InvertedIndex[term].index(doc)
    count = 0
    while (idx < len(InvertedIndex[term])) and InvertedIndex[term][idx] == doc:
        count= count+1
        idx = idx+1
    return count

rewrite it as

# Struck out -- the count method does the same thing, and there was a bug
# in this implementation anyway:
#
# def TF(term, doc):
#     idx = InvertedIndex[term].index(doc)
#     return next(i + 1 for i, item in enumerate(InvertedIndex[term][idx:])
#                 if item != doc)

InvertedIndex[term].count(doc)

(The struck-out version created a generator expression yielding, in order, the indexes of items that occur after the first index of doc and are not equal to it, then used the next function to pull out the first element as the count — but list.count does that job directly.)

Some functions that you definitely want to look up in the docs: next, enumerate, and list.count (all used above).

some syntax that you want

  • generator expressions (just like list comprehensions but mo'bettah (unless you need a list ;))
  • list comprehensions

last but certainly not least, the most important (IMNAHAIPSBO) python module: itertools.

Here's another function

def maxTF(term, doc):
    global docs
    TFs = []    
    for term in docs[doc]:
        TFs.append(TF(term,doc))
    return max(TFs)

you can rewrite this using a generator expression:

def maxTF(term, doc):
    return max(TF(term, doc) for term in docs[doc])

generator expressions will often run close to twice as fast as a for loop.

Finally, here's your Cosine function:

def Cosine(resultDocVector, qVector, doc):
    #`doc` parameter is the document number corresponding to resultDocVector
    global qterms,docs
    # Defining Cosine similarity : cos(a) = A.B/|A||B|

    dotProduct = 0
    commonTerms_q_d = set(qterms).intersection(docs[doc]) #commonTerms in both query & document
    for cmnTerm in commonTerms_q_d:
       dotProduct =  dotProduct + resultDocVector[docs[doc].index(cmnTerm)] * qVector[qterms.index(cmnTerm)]

    resultSquares = []
    for k in resultDocVector:
        resultSquares.append(k*k)

    qSquares = []
    for k in qVector:
        qSquares.append(k*k)

let's rewrite this as:

def Cosine(resultDocVector, qVector, doc):
    doc = docs[doc]
    commonTerms_q_d = set(qterms).intersection(doc)
    dotProduct = sum(resultDocVector[doc.index(cmnTerm)] *qVector[qterms.index(cmnTerm)]
                     for cmnTerm in commonTerms_q_d)

    denominator = sum(k**2 for k in resultDocVector)
    denominator *= sum(k**2 for k in qVector)
    denominator = math.sqrt(denominator) 

    return dotProduct/denominator

Here, we've tossed out every for loop in sight. Code of the form

lst = []
for item in other_lst:
    lst.append(somefunc(item))

is the slowest way to build a list. First off, for/while loops are slow to begin with, and appending to a list is slow, so you have the worst of both worlds. A good attitude is to code like there's a tax on loops (performance-wise, there is). Only pay it if you just can't do something with map or a comprehension, or if it makes your code more readable and you know it's not a bottleneck. Comprehensions are very readable once you get used to them.
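For comparison, the same list built the fast way (somefunc and other_lst are the placeholders from the pattern above):

def somefunc(item):
    return item * item

other_lst = range(10)

lst = [somefunc(item) for item in other_lst]   # one pass, no repeated .append lookups
# or: lst = map(somefunc, other_lst)           # Python 2's map returns a list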

Answered 2010-09-27T04:26:35.533

In the spirit of @aaronasterling above, here are some more micro-optimizations. I think these observations are worth considering all the same.

Use the appropriate data types

stopwords should be a set. You can't repeatedly search a list and expect it to be fast.

Use more sets in general. They're iterable just like lists, but when you have to search them, they're much faster than lists.
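The fix is a single line after the list is defined (the benchmark at the end of this answer shows the payoff):

stopwords = set(stopwords)   # O(1) membership tests instead of O(n) list scans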


List comprehensions

resultSquares = [k*k for k in resultDocVector]
qSquares = [k*k for k in qVector]
TFs = [TF(term,doc) for term in docs[doc]]

Generators

Turn this:

for qword in qwords:
    qword = qword.lower()
    if qword in stopwords:
        continue
    qterm = stemmer.stem(qword,0,len(qword)-1) 
    qterms1.append(qterm)
qterms = list(set(qterms1))

into this:

qworditer = (qword.lower() for qword in qwords)
qtermiter = (stemmer.stem(qword,0,len(qword)-1) for qword in qworditer
             if qword not in stopwords)
qterms1 = set(qtermiter)

Use generators and reduce()

Turn this:

prev = set()
i = 0
for qterm in qterms:
    if InvertedIndex.has_key(qterm):
        if i == 0:
            prev = set(InvertedIndex[qterm])
            i = i+1
            continue
        prev = prev.intersection(set(InvertedIndex[qterm]))

results = list(prev)

into this:

qtermiter = (set(InvertedIndex[qterm]) for qterm in qterms if qterm in InvertedIndex)
results = reduce(set.intersection, qtermiter)
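One caveat worth noting: if none of the query terms appear in InvertedIndex, the generator is empty and reduce() raises a TypeError, so a guarded variant may be safer:

qtermiter = (set(InvertedIndex[qterm]) for qterm in qterms if qterm in InvertedIndex)
try:
    results = reduce(set.intersection, qtermiter)
except TypeError:    # empty iterator: no query term occurs in the index at all
    results = set()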

Use a list comprehension

Instead of this:

i = 0
docComponents = []
for doc in results:
        docComponents.append([])

i = 0    
for doc in results:
    for term in docs[doc]:
        vals = getValues4Term(term,doc)#[TermFrequency, DocumentFrequency]
        docComponents[i].append(vals)
    i = i+1

write this:

docComponents = [[getValues4Term(term,doc) for term in docs[doc]]
                 for doc in results]

(Note the nested brackets: the code further down indexes docComponents[i][j], so each result document needs its own inner list; a flat comprehension would lose that grouping.)

This code makes no sense:

for doc in results:
        resultDocVectors.append([])

for i in range(0,len(results)):
    for j in range(0,len(docs[doc])):
        tf = docComponents[i][j][0]['natural']#0:TermFrequency
        idf = docComponents[i][j][1]['idf'] #1:DocumentFrequency        
        resultDocVectors[i].append(tf*idf)

The value of len(docs[doc]) depends on doc, and doc is simply whatever value the loop for doc in results happened to leave behind when it finished.
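A version that presumably does what was intended — one vector per result document, assuming docComponents is built per document as in the nested comprehension above:

resultDocVectors = []
for i in range(len(results)):
    vec = [vals[0]['natural'] * vals[1]['idf']   # tf * idf for each term
           for vals in docComponents[i]]
    resultDocVectors.append(vec)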


Use collections.defaultdict

Instead of this:

relevances = {}
fh = open("qrels.txt")
lines = fh.readlines()
for line in lines:
    cols = line.strip().split()
    if relevances.has_key(cols[0]):#queryID
        relevances[cols[0]].append(cols[2])#docID
    else:
        relevances[cols[0]] = [cols[2]]

write this (assuming your file really does have exactly three fields per line):

from collections import defaultdict
relevances = defaultdict(list)
with open("qrels.txt") as fh:
    lineiter = (line.strip().split() for line in fh)
    for queryID, _, docID in lineiter:
        relevances[queryID].append(docID)

And, as many others have said, memoize your computations.


2010-10-21: an update on the stopwords business.

from datetime import datetime

stopwords = [ ... ]  # the same stopword list as in the question above, repeated verbatim
print len(stopwords)
dictfile = '/usr/share/dict/american-english-huge'
with open(dictfile) as f:
    words = [line.strip() for line in f]

print len(words)

s = datetime.now()
total = sum(1 for word in words if word in stopwords)
e = datetime.now()
elapsed = e - s
print elapsed, total

s = datetime.now()
stopwords_set = set(stopwords)
total = sum(1 for word in words if word in stopwords_set)
e = datetime.now()
elapsed = e - s
print elapsed, total

I get these results:

# Using list
>>> print elapsed, total
0:00:06.902529 542

# Using set
>>> print elapsed, total
0:00:00.050676 542

The same number of results, but running almost 140 times faster. Granted, you probably aren't looking up that many words against your stopwords. Still, it does emphasize how much using the appropriate data structure can speed up your code.

Answered 2010-10-21T03:42:13.293

It's nice that you posted code in response to people's requests, but unless they run it and profile it, the best they can do is guess. I can guess too, but even when guesses are "good" or "educated", they are not a good way to find performance problems.

I would rather point you toward a technique for pinpointing the problem. That works better than guessing, or than having others guess for you. Once you've found out for yourself exactly where the problem lies, you can decide whether to fix it with memoization or anything else.

There is usually more than one problem. If you repeat the process of finding and removing performance problems, you will converge on a true optimum.

Answered 2010-09-27T16:50:53.470

Does Python cache function results? I don't think so. In that case, repeatedly calling functions like TF(term,doc), as getValues4Term() does, is probably bad already. Just putting the results into variables may well give you a huge speed gain. That, combined with

for doc in results:
    for term in docs[doc]:
        vals = getValues4Term(term,doc)

is probably the biggest speed problem you have.
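For instance, getValues4Term() could compute each quantity once, along these lines (a sketch; the float() calls are an added guard against Python 2 integer division):

def getValues4Term(term, doc):
    tf = TF(term, doc)        # previously recomputed for five dictionary entries
    mtf = maxTF(term, doc)
    atf = avgTF(term, doc)
    df = DF(term)
    TermFrequency = {
        'natural': tf,
        'log': 1 + math.log(tf),
        'aug': 0.5 + 0.5 * tf / mtf,
        'bool': 1 if tf > 0 else 0,
        'log_avg': (1 + math.log(tf)) / (1 + math.log(atf)),
    }
    DocumentFrequency = {
        'no': 1,
        'idf': math.log(float(len(docs)) / df),
        'probIDF': max(0, math.log(float(len(docs) - df) / df)),
    }
    return [TermFrequency, DocumentFrequency]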

Answered 2010-09-27T04:45:35.053