This time I'm trying to use Python's xlsxwriter module to write data from an .srt file into Excel.

The subtitle file looks like this in Sublime Text:

But I want to write the data into Excel so that it looks like this:

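This is my first time writing Python code for this, so I'm still at the trial-and-error stage... I tried writing the code as shown below,

but I don't think it makes sense...

I'll keep trying, but if you know how to do it, please let me know. I'll read your code and try to understand it! Thanks! :)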

1 Answer

The following breaks the problem into a few parts:

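  • Parsing the input file. parse_subtitles is a generator that takes a source of lines and yields a sequence of records of the form {'index': 'N', 'timestamp': 'NN:NN:NN,NNN --> NN:NN:NN,NNN', 'subtitles': ['TEXT', ...]}. The approach I took is to track which of three distinct states we're in:
    1. seeking to next entry: we are looking for the next index number, which should match the regex ^\d*$ (nothing but a bunch of digits),
    2. looking for timestamp: an index has been found and we expect a timestamp on the next line, which should match the regex ^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}$ (HH:MM:SS,mmm --> HH:MM:SS,mmm), and
    3. reading subtitles: we are consuming the actual subtitle text, with blank lines and EOF interpreted as subtitle-termination points.
  • Writing the above records to rows in a worksheet. write_dict_to_worksheet accepts a row and a worksheet, plus a record and a dictionary defining the 0-indexed Excel column number for each of the record's keys, and then writes the data accordingly.
  • Organizing the overall conversion. convert takes an input file name (e.g. 'Wildlife.srt'), which is opened and passed to the parse_subtitles function, and an output file name (e.g. 'Subtitle.xlsx'), which is created using xlsxwriter. It then writes a header row and each parsed record to the XLSX file.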
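
To make the three states concrete, here is a minimal sketch that only traces which state consumes each line; it does not build records. The helper name trace_states and the sample subtitle text are made up for illustration, while the regexes and state names are the ones described above.

```python
import re

# Same patterns as described above, written as raw strings.
INDEX = re.compile(r'^\d*$')
TIMESTAMP = re.compile(r'^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}$')
BLANK = re.compile(r'^\s*$')

def trace_states(lines):
    """Hypothetical helper: return (state, line) pairs showing which state saw each line."""
    state = 'seeking to next entry'
    trace = []
    for line in lines:
        line = line.strip('\n')
        trace.append((state, line))
        if state == 'seeking to next entry' and INDEX.match(line):
            state = 'looking for timestamp'
        elif state == 'looking for timestamp' and TIMESTAMP.match(line):
            state = 'reading subtitles'
        elif state == 'reading subtitles' and BLANK.match(line):
            state = 'seeking to next entry'
    return trace

# Made-up sample entry, not taken from the asker's file.
sample = [
    '1\n',
    '00:00:01,000 --> 00:00:04,000\n',
    'Hello there.\n',
    'Second line.\n',
    '\n',
    '2\n',
]
for state, line in trace_states(sample):
    print('{s:<24} {l!r}'.format(s=state, l=line))
```

Each printed row pairs a line with the state the parser was in when it read that line, which can help when checking why a particular entry is not being recognized.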

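Logging statements are left in for self-commenting purposes, and because when I reproduced your input file I mistyped a : as a ; in a timestamp, which made it unrecognizable, and having the error pop up was handy for debugging!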
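
As a rough illustration of that debugging aid: by default Python's root logger only shows WARNING and above, so the debug and info messages stay hidden unless logging is configured. The sketch below assumes you want everything printed; check_timestamp is a hypothetical helper and the timestamps are made-up values, with the second one containing the kind of ; typo described above.

```python
import logging
import re

# Without this call only WARNING and above are shown,
# so the parser's debug/info messages would stay hidden.
logging.basicConfig(level=logging.DEBUG, format='%(levelname)s: %(message)s')

TIMESTAMP = re.compile(r'^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}$')

def check_timestamp(line):
    """Hypothetical helper: log and report whether `line` is a valid timestamp line."""
    if TIMESTAMP.match(line):
        logging.debug('Found timestamp: {t}'.format(t=line))
        return True
    logging.error('HUH: Expected to find a timestamp, but instead found: [{d}]'.format(d=line))
    return False

check_timestamp('00:00:01,000 --> 00:00:04,000')  # well-formed, logged at DEBUG
check_timestamp('00:00:01,000 --> 00;00:04,000')  # ';' typo, logged at ERROR
```

Raising the level to logging.ERROR would keep only the "HUH" messages, which is probably enough once the input file is known to be clean.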

I've put a text version of your source file, along with the code below, in this Gist.

import xlsxwriter
import re
import logging

def parse_subtitles(lines):
    # Raw strings keep backslashes such as \d and \s from being read as string escapes
    line_index = re.compile(r'^\d*$')
    line_timestamp = re.compile(r'^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}$')
    line_seperator = re.compile(r'^\s*$')

    current_record = {'index':None, 'timestamp':None, 'subtitles':[]}
    state = 'seeking to next entry'

    for line in lines:
        line = line.strip('\n')
        if state == 'seeking to next entry':
            if line_index.match(line):
                logging.debug('Found index: {i}'.format(i=line))
                current_record['index'] = line
                state = 'looking for timestamp'
            else:
                logging.error('HUH: Expected to find an index, but instead found: [{d}]'.format(d=line))

        elif state == 'looking for timestamp':
            if line_timestamp.match(line):
                logging.debug('Found timestamp: {t}'.format(t=line))
                current_record['timestamp'] = line
                state = 'reading subtitles'
            else:
                logging.error('HUH: Expected to find a timestamp, but instead found: [{d}]'.format(d=line))

        elif state == 'reading subtitles':
            if line_seperator.match(line):
                logging.info('Blank line reached, yielding record: {r}'.format(r=current_record))
                yield current_record
                state = 'seeking to next entry'
                current_record = {'index':None, 'timestamp':None, 'subtitles':[]}
            else:
                logging.debug('Appending to subtitle: {s}'.format(s=line))
                current_record['subtitles'].append(line)

        else:
            logging.error('HUH: Fell into an unknown state: `{s}`'.format(s=state))
    if state == 'reading subtitles':
        # We must have finished the file without encountering a blank line. Dump the last record
        yield current_record

def write_dict_to_worksheet(columns_for_keys, keyed_data, worksheet, row):
    """
    Write a subtitle-record to a worksheet. 
    Return the row number after those that were written (since this may write multiple rows)
    """
    current_row = row
    #First, horizontally write the entry and timecode
    for (colname, colindex) in columns_for_keys.items():
        if colname != 'subtitles': 
            worksheet.write(current_row, colindex, keyed_data[colname])

    #Next, vertically write the subtitle data
    subtitle_column = columns_for_keys['subtitles']
    for morelines in keyed_data['subtitles']:
        worksheet.write(current_row, subtitle_column, morelines)
        current_row+=1

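    # Always advance at least one row so that an entry with no subtitle text
    # is not overwritten by the next record
    return max(current_row, row + 1)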

def convert(input_filename, output_filename):
    workbook = xlsxwriter.Workbook(output_filename)
    worksheet = workbook.add_worksheet('subtitles')
    columns = {'index':0, 'timestamp':1, 'subtitles':2}

    next_available_row = 0
    records_processed = 0
    headings = {'index':"Entries", 'timestamp':"Timecodes", 'subtitles':["Subtitles"]}
    next_available_row=write_dict_to_worksheet(columns, headings, worksheet, next_available_row)

    with open(input_filename) as textfile:
        for record in parse_subtitles(textfile):
            next_available_row = write_dict_to_worksheet(columns, record, worksheet, next_available_row)
            records_processed += 1

    print('Done converting {inp} to {outp}. {n} subtitle entries found. {m} rows written'.format(inp=input_filename, outp=output_filename, n=records_processed, m=next_available_row))
    workbook.close()

convert(input_filename='Wildlife.srt', output_filename='Subtitle.xlsx')

Edit: Updated to split multi-line subtitles into multiple rows in the output.
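
To show what that row layout looks like without opening a workbook, here is a small sketch that computes the (row, column, value) cells one record would occupy. cells_for_record is a hypothetical helper that mirrors the logic of write_dict_to_worksheet but returns the cells instead of calling worksheet.write; the record contents are made up.

```python
def cells_for_record(columns_for_keys, record, row):
    """Hypothetical helper: list the (row, column, value) cells for one record."""
    cells = []
    # Index and timestamp go on the record's first row only.
    for key, col in columns_for_keys.items():
        if key != 'subtitles':
            cells.append((row, col, record[key]))
    # Each subtitle line gets its own row in the subtitle column.
    current_row = row
    for text in record['subtitles']:
        cells.append((current_row, columns_for_keys['subtitles'], text))
        current_row += 1
    return cells, current_row

columns = {'index': 0, 'timestamp': 1, 'subtitles': 2}
# Made-up record in the shape that parse_subtitles yields.
record = {
    'index': '1',
    'timestamp': '00:00:01,000 --> 00:00:04,000',
    'subtitles': ['Hello there.', 'Second line.'],
}
cells, next_row = cells_for_record(columns, record, row=1)
for cell in cells:
    print(cell)
print('next free row:', next_row)
```

With a two-line subtitle starting at row 1, the second line lands on row 2 in the subtitle column, and the next record starts at row 3.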

Answered 2015-08-30T05:30:28.757