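I need some advice on improving the performance of my code.
I have two files (Keyword.txt, description.txt). Keyword.txt contains a list of keywords (11,000+ of them), and description.txt contains very large text descriptions (9,000+ of them).
I read one keyword at a time from Keyword.txt and check whether that keyword occurs in each description. If it does, I write the match to a new file. So this is effectively a many-to-many comparison (11,000 × 9,000 checks).
Sample keywords: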
Xerox
VMWARE CLOUD
Sample description (these are large):
Planning and implementing entire IT Infrastructure. Cyberoam firewall implementation and administration in head office and branch office. Report generation and analysis. Including band width conception, internet traffic and application performance. Windows 2003/2008 Server Domain controller implementation and managing. VERITAS Backup for Clients backup, Daily backup of applications and database. Verify the backed up database for data integrity. Send backup tapes to remote location for safe storage Installing and configuring various network devices; Routers, Modems, Access Points, Wireless ADSL+ modems / Routers Monitoring, managing & optimizing Network. Maintaining Network Infrastructure for various clients. Creating Users and maintaining the Linux Proxy servers for clients. Trouble shooting, diagnosing, isolating & resolving Windows / Network Problems. Configuring CCTV camera, Biometrics attendance machine, Access Control System Kaspersky Internet Security / ESET NOD32
Below is the code I wrote:
import csv
import nltk
import re

wr = open(OUTPUTFILENAME,'w')

def match():
    c = 0
    ft = open('DESCRIPTION.TXT','r')
    ky2 = open('KEYWORD.TXT','r')
    reader = csv.reader(ft)
    keywords = []
    keyword_reader2 = csv.reader(ky2)
    for x in keyword_reader2: # Storing all the keywords to a list
        keywords.append(x[1].lower())
    string = ' '
    c = 0
    for row in reader:
        sentence = row[1].lower()
        id = row[0]
        for word in keywords:
            if re.search(r'\b{}\b'.format(re.escape(word.lower())),sentence):
                string = string + id+'$'+word.lower()+'$'+sentence+ '\n'
                c = c + 1
        if c > 5000: # I am writing 5000 lines at a time.
            print("Batch printed")
            c = 0
            wr.write(string)
            string = ' '
    wr.write(string)
    ky2.close()
    ft.close()
    wr.close()

match()
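Right now this code takes about 120 minutes to finish. I have tried a couple of things to speed it up.
- At first I wrote one line at a time, then changed it to write 5000 lines at a time, since the output is small and I can afford to keep everything in memory. I did not see much improvement.
- I pushed everything to stdout and used a pipe in the console to append it all to a file. That was even slower.
I would like to know whether there is a better way to do this, since I may be doing something wrong in the code.
My PC specs: RAM: 15 GB, Processor: i7 (4th gen)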
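Since changing how the output is written made little difference, the time may be going into the inner loop rather than into I/O. There, `re.search` receives a freshly formatted pattern string for every keyword/description pair, and with 11,000+ distinct patterns the `re` module's internal cache of compiled patterns is unlikely to help. Below is a minimal sketch of compiling each keyword once up front, assuming the keywords are already loaded into a list. The helper names `compile_keywords` and `find_keywords` are hypothetical and not part of the code above.

```python
import re

def compile_keywords(keywords):
    """Compile one whole-word pattern per keyword, once, before the main loop."""
    compiled = []
    for word in keywords:
        word = word.lower()
        pattern = re.compile(r'\b{}\b'.format(re.escape(word)))
        compiled.append((word, pattern))
    return compiled

def find_keywords(sentence, compiled):
    """Return the keywords whose pattern occurs in the lower-cased sentence."""
    sentence = sentence.lower()
    return [word for word, pattern in compiled if pattern.search(sentence)]

# Hypothetical sample data, taken from the examples in this post.
compiled = compile_keywords(['Xerox', 'VMWARE CLOUD', 'Linux'])
matches = find_keywords('Maintaining the Linux Proxy servers for clients.', compiled)
print(matches)  # ['linux']
```

This keeps the same matching rule (`\b...\b` around the escaped keyword) and only moves the pattern construction out of the 11,000 × 9,000 loop, so the output should be unchanged. How much time it saves on the real data would have to be measured.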
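One direction that might avoid the per-keyword regex searches altogether is to split each description into word tokens and look up every run of 1 to N consecutive tokens in a dict built from the keywords, where N is the length in words of the longest keyword. The following is only a rough sketch under stated assumptions: it assumes keywords are made of ordinary words, that tokenizing with `\w+` is acceptable, and the names `build_index` and `find_keywords` are hypothetical.

```python
import re

TOKEN = re.compile(r'\w+')

def build_index(keywords):
    """Map each keyword's token tuple to the lower-cased keyword text."""
    index = {}
    for word in keywords:
        tokens = tuple(TOKEN.findall(word.lower()))
        if tokens:
            index[tokens] = word.lower()
    # Longest keyword, measured in tokens; 0 if there are no keywords.
    max_len = max((len(tokens) for tokens in index), default=0)
    return index, max_len

def find_keywords(sentence, index, max_len):
    """Return the set of keywords found as runs of consecutive tokens."""
    tokens = TOKEN.findall(sentence.lower())
    found = set()
    for start in range(len(tokens)):
        for size in range(1, max_len + 1):
            if start + size > len(tokens):
                break
            key = tuple(tokens[start:start + size])
            if key in index:
                found.add(index[key])
    return found

# Hypothetical sample data, based on the examples in this post.
index, max_len = build_index(['Xerox', 'VMWARE CLOUD', 'Linux', 'Access Control'])
text = 'Maintaining the Linux Proxy servers. Configuring Access Control System'
print(sorted(find_keywords(text, index, max_len)))
```

The work per description then depends on its length and on N, not on the number of keywords. The matching rule is not identical to the `\b...\b` regex, though: punctuation and extra whitespace between words are ignored, so a two-word keyword would also match across a comma or a sentence boundary, and keywords containing symbols are reduced to their word tokens. The results would need to be compared against the existing output before relying on it.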