python - How can I speed up file parsing in Python?

Question

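Below is part of an application I have been developing. This part is meant to update a text file with addValue. At first I thought it was working, but it seems to add extra lines, and it is very, very slow.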

trakt_shows_seen is a dictionary of shows; the entry for one show looks like

{'episodes': [{'season': 1, 'playcount': 0, 'episode': 1}, {'season': 1, 'playcount': 0, 'episode': 2}, {'season': 1, 'playcount': 0, 'episode': 3}], 'title': 'The Ice Cream Girls'}

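This part should search the file for each title, season and episode. When a line is found, it should check whether the line has a watched marker (checkValue). If it does, the marker is changed to addValue; if it does not, addValue should be appended to the end of the line.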

A line from the file

_F  /share/Storage/NAS/Videos/Tv/The Ice Cream Girls/Season 01/The Ice Cream Girls - S01E01 - Episode 1.mkv _ai Episode 1   _e  1   _r  6.5 _Y  71  _s  1   _DT 714d861 _et Episode 1   _A  4379,4376,4382,4383 _id 2551    _FT 714d861 _v  c0=h264,f0=25,h0=576,w0=768 _C  T   _IT 717ac9d _R  GB: _m  1250    _ad 2013-04-19  _T  The Ice Cream Girls _G  d   _U   thetvdb:268910 imdb:tt2372806  _V  HDTV

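So my question is: is there a better, faster way to do this? Can I load the file into memory (it is about 1 MB), change the required lines and then save the file, or can anyone suggest another approach that would speed things up?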

Thanks for taking the time to look.

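Edit: I have changed the code quite a bit, and it does run faster, but the output is not as expected. For some reason it writes lines_of_interest to the file, even though there is no code that does that?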

I have not added any encoding options yet, but since the file is utf-8 I suspect accented titles will be a problem.

    if trakt_shows_seen:
        addValue = "\t_w\t1\t"
        replacevalue = "\t_w\t0\t"
        with open(OversightFile, 'rb') as infile:
            p = '\t_C\tT\t'
            for line in infile:
                if p in line:
                    tv_offset = infile.tell() - len(line) - 1#Find first TV in file, search from here
                    break

            lines_of_interest = set()
            for show_dict in trakt_shows_seen:
                for episode in show_dict['episodes']:
                    p = re.compile(r'\t_s\t('+str(episode["season"])+')\t.*\t_T\t('+show_dict["title"]+')\t.*\t_e\t('+str(episode["episode"])+')\t')
                    infile.seek(tv_offset)#search from first Tv show
                    for line in infile:
                        if p.findall(line):
                            search_offset = infile.tell() - len(line) - 1
                            lines_of_interest.add(search_offset)#all lines that need to be changed
        with open(OversightFile, 'rb+') as outfile:
            for lines in lines_of_interest:
                for change_this in outfile:
                    outfile.seek(lines)
                    if replacevalue in change_this:
                        change_this = change_this.replace(replacevalue, addValue)
                        outfile.write(change_this)
                        break#Only check 1 line
                    elif not addValue in change_this:
                        #change_this.extend(('_w', '1'))
                        change_this = change_this.replace("\t\n", addValue+"\n")
                        outfile.write(change_this)
                        break#Only check 1 line

score 0 · Accepted Answer

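Ahem: you are opening, reading and rewriting your file on every iteration of the for loop, once for every episode of every show. Few things in the whole multiverse could be slower than that.

You can stay along the same lines: just read the whole file once, before the for loops, iterate over the list you read, and write everything back to disk just once.

More or less: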

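import re

if trakt_shows_seen:
    addValue = "\t_w\t1\t"
    checkvalue = "\t_w\t0\t"
    print '  %s TV shows episodes playcount will be updated on Oversight' % len(trakt_shows_seen)
    # Read the whole file into memory once, before any loop
    # (OversightFile is the path variable from the question)
    with open(OversightFile) as infile:
        myfile_list = infile.readlines()
    for show in trakt_shows_seen:
        print '    --> ' + show['title'].encode('utf-8')
        for episode in show['episodes']:
            print '     Season %i - Episode %i' % (episode['season'], episode['episode'])
            # re.escape keeps characters such as '.' or '(' in a title from being read as regex syntax
            p = re.compile(r'\t_s\t(' + str(episode["season"]) + r')\t.*\t_T\t(' + re.escape(show["title"]) + r')\t.*\t_e\t(' + str(episode["episode"]) + r')\t')
            newList = []

            for line in myfile_list:
                if p.search(line):
                    if checkvalue in line:
                        line = line.replace(checkvalue, addValue)
                    elif addValue not in line:
                        line = line.rstrip("\t\n") + addValue + "\n"
                newList.append(line)
            # Carry the updated lines over to the next episode's pass
            myfile_list = newList

    # Write everything back to disk once, after all shows are processed
    with open(OversightFile, 'w') as outref:
        outref.writelines(myfile_list)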

This is still far from optimal, but it is the smallest change to your code that removes what was making it so slow.

score 0 · Accepted Answer

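You are re-reading and rewriting your entire file for every episode of every show you are tracking; of course that is slow. Don't do that. Instead, read the file once. Parse the show title, season and episode number from each line (probably using the built-in csv library with delimiter='\t'), and see whether they are in the set you are tracking. If they are, make the replacement, and write the line out either way.

It would look something like this: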

import csv

# Placeholders: set these to the real column positions in your file
title_index = None    # whatever column number has the show title
season_index = None   # whatever column number has the season number
episode_index = None  # whatever column number has the episode number

# Assumes trakt_shows_seen maps each show title to its list of episode dicts
with open('somefile', 'rb') as infile:
    reader = csv.reader(infile, delimiter='\t')
    modified_lines = []
    for line in reader:
        showtitle = line[title_index]
        if showtitle in trakt_shows_seen:
            season_number = int(line[season_index])
            episode_number = int(line[episode_index])
            if any(x['season'] == season_number and x['episode'] == episode_number
                   for x in trakt_shows_seen[showtitle]):
                # line matches a tracked episode
                if '_w' in line:  # list.index raises ValueError when missing, so test first
                    watch_count_index = line.index('_w')
                    # possible check value found - you may be able to skip straight to assigning the next element to '1'
                    if line[watch_count_index + 1] == '0':
                        # check value found, replace
                        line[watch_count_index + 1] = '1'
                    elif line[watch_count_index + 1] != '1':
                        # not sure what you want to do if something like \t_w\t2\t is present
                        line[watch_count_index + 1] = '1'
                else:
                    line.extend(('_w', '1'))
        modified_lines.append(line)
with open('somefile', 'wb') as outfile:
    writer = csv.writer(outfile, delimiter='\t')
    writer.writerows(modified_lines)

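The exact details will depend on how strict your file format is: the more you know in advance about the structure of the lines, the better. If the indexes of the title, season and episode fields vary, the best approach is probably to iterate over the list that represents the line, looking for the relevant markers.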
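One way to look up the markers when their positions vary, as a minimal sketch. It assumes each line is a tab-separated sequence of alternating markers and values (`_T`, then the title, and so on), which is how the sample line in the question appears to be laid out. The helper name `fields_by_marker` is hypothetical, and the code is written for Python 3.

```python
def fields_by_marker(line):
    """Map each marker (e.g. '_T') to the value that follows it.

    Assumption: the line alternates marker, value, marker, value, ...
    separated by tabs. A trailing empty cell or a marker without a
    value is silently dropped by zip().
    """
    parts = line.rstrip("\n").split("\t")
    # Even positions are markers, odd positions are their values
    return dict(zip(parts[0::2], parts[1::2]))


# Shortened, made-up line in the same layout as the question's sample
sample = "_F\t/path/show.mkv\t_e\t1\t_s\t1\t_T\tThe Ice Cream Girls\t\n"
fields = fields_by_marker(sample)
print(fields["_T"], fields["_s"], fields["_e"])
```

With a dictionary like this, the title, season and episode can be read by marker name instead of by a fixed column index.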

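I skipped the error checking. Depending on how much you trust the original file, you may want to make sure the season and episode numbers can be converted to integers, or string-ify the values in trakt_shows_seen instead. The csv reader returns encoded bytestrings, so if the show names in trakt_shows_seen are Unicode objects (which they do not appear to be in the code you pasted), you should either decode the results from the csv reader or encode the dictionary values.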
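The integer check mentioned above could be as small as the following sketch. The helper name `to_int_or_none` is hypothetical; it simply returns None for anything that is not a whole number, so the caller can skip that line instead of crashing.

```python
def to_int_or_none(text):
    """Return int(text), or None if text is missing or not a whole number."""
    try:
        return int(text)
    except (TypeError, ValueError):
        return None


# A line would only be treated as an episode when both values parse
season = to_int_or_none("1")
episode = to_int_or_none("Episode 1")
is_episode_line = season is not None and episode is not None
```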

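Personally, I would probably convert trakt_shows_seen into a set of (title, season, episode) tuples, to make it more convenient to check whether a line is of interest. At least, I would if the field numbers for title, season and episode are fixed. I would also write to my output file (under a different filename) as I read the input file, rather than keeping a list of lines in memory; that allows some sanity checks with the shell's diff utility before overwriting the original input.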

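As for creating a set from your existing dictionary: to some extent it depends on what format trakt_shows_seen uses. Your example shows the entry for one show, but does not say how more than one show is represented. For now, based on the code you attempted, I will assume it is a list of such dictionaries.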

shows_of_interest = set()
for show_dict in trakt_shows_seen:
    title = show_dict['title']
    for episode_dict in show_dict['episodes']:
        shows_of_interest.add((title, episode_dict['season'], episode_dict['episode']))

Then, in the loop that reads the file:

        # the rest as shown above              
        season_number = int(line[season_index])
        episode_number = int(line[episode_index])
        if (showtitle, season_number, episode_number) in shows_of_interest:
            # line matches a tracked episode
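
Putting the pieces together, here is one possible end-to-end sketch. It is not the answerer's exact code. It assumes Python 3, that `trakt_shows_seen` is a list of dictionaries like the one in the question, and that each line alternates tab-separated markers and values. The function names and the output path are hypothetical. It writes to a separate file while reading, so the result can be compared with diff before the original is replaced.

```python
def build_episode_set(shows):
    """Collect (title, season, episode) tuples from a list of show dicts."""
    return {
        (show["title"], ep["season"], ep["episode"])
        for show in shows
        for ep in show["episodes"]
    }


def mark_watched(line, wanted):
    """Return the line with _w set to 1 if it describes a wanted episode."""
    body = line.rstrip("\r\n")
    ending = line[len(body):]          # keep the original line ending
    parts = body.split("\t")
    trailing_tab = parts[-1] == ""     # lines in the question end with a tab
    if trailing_tab:
        parts.pop()
    # Assumption: even positions are markers, odd positions are values
    fields = dict(zip(parts[0::2], parts[1::2]))
    try:
        key = (fields["_T"], int(fields["_s"]), int(fields["_e"]))
    except (KeyError, ValueError):
        return line                    # not an episode line: leave untouched
    if key not in wanted:
        return line
    for i in range(0, len(parts) - 1, 2):
        if parts[i] == "_w":
            parts[i + 1] = "1"         # existing marker: overwrite its value
            break
    else:
        parts.extend(["_w", "1"])      # no marker yet: append one
    return "\t".join(parts) + ("\t" if trailing_tab else "") + ending


def update_file(in_path, out_path, shows):
    """Stream in_path to out_path in one pass, marking watched episodes."""
    wanted = build_episode_set(shows)
    # newline="" leaves the file's own line endings untouched
    with open(in_path, encoding="utf-8", newline="") as infile, \
         open(out_path, "w", encoding="utf-8", newline="") as outfile:
        for line in infile:
            outfile.write(mark_watched(line, wanted))
```

The file is read once and each line costs a single set lookup, instead of one regular expression search per tracked episode. A call might look like `update_file(OversightFile, OversightFile + ".new", trakt_shows_seen)`, followed by a diff of the two files.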
