I'm writing a Python program to work some things out from the Wikipedia dumps.

However, one thing I've noticed when working with large data sets and heavy disk usage is that performance almost always degrades over time.

My computer is a Core i7 at 2.6 GHz with 16 GB of RAM (usage gets up to about 5 GB) and a 1 TB 7200 RPM hard drive.

Note: in both cases, the output is printed at 10-second intervals.

This is using Redis and Python 2.7:

[ ] T:251.02 articles/second R: 4628 A: 15474
[ ] T:247.13 articles/second R: 5111 A: 17151
[ ] T:246.41 articles/second R: 5487 A: 19177
[ ] T:258.10 articles/second R: 6200 A: 22217
[ ] T:259.90 articles/second R: 6833 A: 24382
[ ] T:265.22 articles/second R: 7685 A: 26864
[ ] T:274.25 articles/second R: 8981 A: 29488
[ ] T:281.50 articles/second R: 10094 A: 32209
[ ] T:286.51 articles/second R: 11283 A: 34639
[ ] T:296.26 articles/second R: 13033 A: 37414
[ ] T:301.68 articles/second R: 14484 A: 39906
[ ] T:289.22 articles/second R: 14704 A: 40333
[ ] T:277.45 articles/second R: 14940 A: 40634
[ ] T:267.82 articles/second R: 15243 A: 41083
[ ] T:259.04 articles/second R: 15502 A: 41570
[ ] T:250.92 articles/second R: 15778 A: 42014
[ ] T:243.67 articles/second R: 16075 A: 42486
[ ] T:236.79 articles/second R: 16356 A: 42924
[ ] T:230.48 articles/second R: 16649 A: 43358
[ ] T:223.89 articles/second R: 16826 A: 43705
[ ] T:218.44 articles/second R: 17039 A: 44205
[ ] T:213.30 articles/second R: 17234 A: 44705
[ ] T:208.41 articles/second R: 17354 A: 45253
[ ] T:203.60 articles/second R: 17473 A: 45725
[ ] T:199.61 articles/second R: 17627 A: 46329
[ ] T:195.65 articles/second R: 17807 A: 46872
[ ] T:191.64 articles/second R: 17875 A: 47398
[ ] T:188.28 articles/second R: 18003 A: 48008
[ ] T:185.11 articles/second R: 18233 A: 48517

Since Redis could obviously be my problem, here are some results without Redis:

[ ] T:1636.31 articles/second R:3938 A:12949
[ ] T:3716.77 articles/second R:19834 A:61210
[ ] T:2776.43 articles/second R:20213 A:68211
[ ] T:2128.70 articles/second R:20228 A:68867
[ ] T:1729.78 articles/second R:20251 A:69586
[ ] T:1462.91 articles/second R:20289 A:70338
[ ] T:1270.07 articles/second R:20309 A:71107
[ ] T:1124.34 articles/second R:20330 A:71857
[ ] T:1011.18 articles/second R:20376 A:72669
[ ] T:919.88 articles/second R:20391 A:73464
[ ] T:845.36 articles/second R:20406 A:74304
[ ] T:783.06 articles/second R:20417 A:75158
[ ] T:730.05 articles/second R:20427 A:75984
[ ] T:684.37 articles/second R:20436 A:76798
[ ] T:645.07 articles/second R:20451 A:77661
[ ] T:610.67 articles/second R:20475 A:78518

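This isn't "real" performance, since I'm not storing the data anywhere (I'm only incrementing the article and redirect counts). But we can see the same degradation over time.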

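Is the performance right after the program starts the real figure, or has it just not stabilized yet? Since I'm not writing to any log files or anything else, it seems like performance should be relatively stable, because I'm constantly reading from the hard drive (granted, it jumps around to access all the files).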
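One thing worth noting when reading these logs: according to the print statement in the code, the T column is the total number of items counted divided by the total time since start, so it is a running average rather than the rate over the last 10 seconds. The following is only a minimal sketch of how both figures could be computed side by side; `throughput_report` and its input format are hypothetical names introduced for illustration, not part of the program above.

```python
def throughput_report(samples):
    """Compute cumulative and per-interval rates.

    samples: list of (elapsed_seconds, total_processed) pairs in time order,
             where elapsed_seconds is measured from program start.
    Returns a list of (cumulative_rate, interval_rate) pairs.
    """
    report = []
    prev_t, prev_n = 0.0, 0
    for t, n in samples:
        # Running average since start (what the T column reports)
        cumulative = n / float(t)
        # Rate over just the most recent interval
        interval = (n - prev_n) / float(t - prev_t)
        report.append((cumulative, interval))
        prev_t, prev_n = t, n
    return report
```

With made-up numbers, `throughput_report([(10.0, 1000), (20.0, 1500)])` gives a cumulative rate of 75 items/second for the second sample while the interval rate has already fallen to 50, which shows how the running average lags behind a change in the actual rate.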

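I know that putting a lot of data into a queue is probably bad form, but I figured that distributing the files to the other 7 processes to read, rather than having a single process handle the reads, would cause a seek storm. I tried it both ways (putting the file paths in the queue, and putting the actual data in the queue), and putting the data in the queue was slightly faster.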

from redis import Redis
import time
import re
from multiprocessing import Process, Queue

r = Redis()
r.flushdb()

doubleBrackets = re.compile(r"\[\[(.*?)\]\]")

def findLinks(q, oq):
    while True:
        if not q.empty():
            title, lines = q.get()
            links = []
            for line in lines:
                for l in doubleBrackets.findall(line):
                    l = l.split('|')[0]
                    l = l.strip('|')
                    links.append(l)
                    #r.rpush(title, l)

            if len(links) == 1:
                oq.put(0)
                #r.incr('Redirects')
            else:
                oq.put(1)
                #r.incr('Articles')

numArticles = 0
numRedirects = 0

print 'Starting'

# This is a 1 Gb file with the paths to all the files I am accessing
linkFile = '/home/andrew/Wikipedia/logFileAll'

q = Queue()
oq = Queue()
processes = []

for i in range(7):
    p = Process(target=findLinks, args=(q,oq))
    processes.append(p)
    p.start()

startTime = time.time()
timer = time.time()

with open(linkFile, 'rb') as f:
    while True:
        line = f.readline()
        if not line:
            # readline() returns an empty string at end of file
            break

        # The data is formatted so the title and path are separated by a single space
        title, path = line.split(' ')

        with open(path.strip(), 'rb') as fi:
            # Here we read the article
            lines = fi.readlines()

        # We put the title and the article content in the queue
        q.put((title, lines))

        if time.time() - timer > 10:
            # If using Redis
            #print '[ ] T:%.2f articles/second R: %s A: %s' %((int(r.get('Redirects'))+int(r.get('Articles')))/(time.time()-startTime), r.get('Redirects'), r.get('Articles'))

            # Test for redis dependent performance
            while not oq.empty():
                response = oq.get()
                if response:
                    numArticles += 1
                else:
                    numRedirects += 1    
            print '[ ] T:%.2f articles/second R:%s A:%s' %((numArticles+numRedirects)/(time.time()-startTime), numRedirects, numArticles)
            timer = time.time()

# When we run through the 1 Gb file, we will still have a couple more items to chew through
while True:
    if time.time() - timer > 10:
        print '[ ] R: %s A: %s C:%s' %(r.get('Redirects'), r.get('Articles'), title)
        timer = time.time()

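Edit: Following JF Sebastian's comment, I added a sentinel value in place of the q.empty() check. It seems some processes were getting stuck somewhere without throwing an exception (it's a bit odd that this would happen). Either way, it's a performance boost! Thanks!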
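The edited code is not shown here, so the following is only a sketch of what a sentinel-based worker might look like, not the exact change that was made. It assumes `None` as the sentinel, one sentinel per worker, and that each worker forwards the sentinel to the output queue so the parent can tell when it has finished; `SENTINEL`, `extract_links` and `run_workers` are hypothetical names. It is written to run on Python 3 as well as 2.7.

```python
import re
from multiprocessing import Process, Queue

SENTINEL = None  # assumed marker meaning "no more work"
DOUBLE_BRACKETS = re.compile(r"\[\[(.*?)\]\]")

def extract_links(lines):
    """Return link targets found in [[...]] markup, dropping any |label part."""
    links = []
    for line in lines:
        for match in DOUBLE_BRACKETS.findall(line):
            links.append(match.split("|")[0])
    return links

def find_links(q, oq):
    """Worker: block on q.get() until the sentinel arrives, then exit."""
    while True:
        item = q.get()          # blocks, so there is no q.empty() polling loop
        if item is SENTINEL:
            oq.put(SENTINEL)    # tell the parent this worker is done
            break
        title, lines = item
        # Same rule as the original code: exactly one link counts as a redirect (0)
        oq.put(0 if len(extract_links(lines)) == 1 else 1)

def run_workers(articles, num_workers=7):
    """Feed (title, lines) pairs to the workers and tally their results."""
    q, oq = Queue(), Queue()
    workers = [Process(target=find_links, args=(q, oq)) for _ in range(num_workers)]
    for w in workers:
        w.start()
    for article in articles:
        q.put(article)
    for _ in workers:
        q.put(SENTINEL)         # one sentinel per worker
    num_redirects = num_articles = finished = 0
    while finished < num_workers:
        result = oq.get()
        if result is SENTINEL:
            finished += 1
        elif result:
            num_articles += 1
        else:
            num_redirects += 1
    for w in workers:
        w.join()
    return num_redirects, num_articles
```

Because `q.get()` blocks until an item is available, the workers sleep while the queue is empty instead of spinning on `q.empty()`, and they exit cleanly once the reader has run out of files. `run_workers` should be called from under an `if __name__ == '__main__':` guard on platforms that spawn rather than fork worker processes.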

[ ] T:250.88 articles/second R:663 A:1850 Proc:7
[ ] T:257.17 articles/second R:1216 A:3940 Proc:7
[ ] T:259.92 articles/second R:1820 A:6000 Proc:7
[ ] T:251.81 articles/second R:2337 A:7762 Proc:7
[ ] T:250.04 articles/second R:2943 A:9590 Proc:7
[ ] T:248.24 articles/second R:3543 A:11389 Proc:7
[ ] T:246.83 articles/second R:4060 A:13260 Proc:7
[ ] T:247.59 articles/second R:4583 A:15271 Proc:7
[ ] T:243.97 articles/second R:5074 A:16938 Proc:7
[ ] T:242.01 articles/second R:5440 A:18819 Proc:7
[ ] T:252.34 articles/second R:6086 A:21741 Proc:7
[ ] T:255.94 articles/second R:6738 A:24053 Proc:7
[ ] T:261.38 articles/second R:7547 A:26518 Proc:7
[ ] T:268.01 articles/second R:8617 A:29000 Proc:7
[ ] T:276.48 articles/second R:9933 A:31648 Proc:7
[ ] T:283.45 articles/second R:11114 A:34358 Proc:7
[ ] T:293.25 articles/second R:12836 A:37148 Proc:7
[ ] T:302.41 articles/second R:14567 A:40015 Proc:7
[ ] T:313.33 articles/second R:16553 A:43147 Proc:7
[ ] T:320.35 articles/second R:17699 A:46551 Proc:7
[ ] T:328.72 articles/second R:18966 A:50261 Proc:7
[ ] T:337.07 articles/second R:19645 A:54724 Proc:7
[ ] T:349.34 articles/second R:19820 A:60768 Proc:7
[ ] T:364.98 articles/second R:20190 A:67674 Proc:7
[ ] T:373.08 articles/second R:20384 A:73183 Proc:7
[ ] T:381.27 articles/second R:20495 A:78957 Proc:7
[ ] T:391.39 articles/second R:20960 A:85070 Proc:7
[ ] T:394.74 articles/second R:22194 A:88710 Proc:7
[ ] T:397.37 articles/second R:23525 A:92105 Proc:7
[ ] T:397.76 articles/second R:24882 A:94855 Proc:7
[ ] T:397.11 articles/second R:26138 A:97387 Proc:7