
I am converting text directly to epub, and I am having trouble automatically splitting the HTML book file into separate header/chapter files. The code below partially works at the moment, but it only creates every other chapter file, so half of the header/chapter files are missing from the output. Here is the code:

def splitHeaderstoFiles(fpath):

    infp = open(fpath, 'rt', encoding='utf-8')
    for line in infp:

        # format and split headers to files
        if '<h1' in line:

            #-----------format header file names and other stuff ------------#

            # create a new file for the header/chapter section
            path = os.getcwd() + os.sep + header
            with open(path, 'wt', encoding='utf-8') as outfp:

                # write html top meta headers
                outfp = addMetaHeaders(outfp)
                # add the header
                outfp.write(line)

                # add the chapter/header bodytext
                for line in infp:
                    if '<h1' not in line:
                        outfp.write(line)
                    else:
                        outfp.write('</body>\n</html>')
                        break
        else:
            continue

    infp.close()

The problem occurs in the second for loop near the bottom of the code, when I look for the next h1 tag in order to stop the split. I cannot use seek() or tell() to rewind or step back one line so the program can find the next header/chapter on the next iteration. Apparently you cannot use these in Python inside a for loop over an object with an implicit iter or next; it just raises a "can't do nonzero cur-relative seeks" error.
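
For reference, here is a minimal reproduction of that limitation (using a hypothetical book.html in the working directory):

# Text-mode files only accept a current-relative seek with a zero offset,
# and tell() is disabled while the file is being iterated, so stepping back
# one line inside the for loop is not possible.
with open('book.html', 'rt', encoding='utf-8') as infp:
    for line in infp:
        if '<h1' in line:
            infp.seek(-len(line), 1)  # io.UnsupportedOperation: can't do nonzero cur-relative seeks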

I also tried a while line != '' with readline() combination in the code, and it gave the same error as above.

Does anyone know a simple way in Python to split HTML headers/chapters of varying lengths into separate files? Is there any special Python module (e.g. pickle) that could help simplify this task?

I am using Python 3.4.

Thanks in advance for any solution to this problem...


3 Answers


I ran into a similar problem a while ago; here is a simplified solution:

from itertools import count

chapter_number = count(1)
output_file = open('000-intro.html', 'wt')   # everything before the first <h1>

with open('index.html', 'rt') as input_file:
    for line in input_file:
        if '<h1' in line:
            # a new chapter starts here: close the previous file, open the next one
            output_file.close()
            output_file = open('{:03}-chapter.html'.format(next(chapter_number)), 'wt')
        output_file.write(line)

output_file.close()

With this approach, the text leading up to the first h1 is written to 000-intro.html, the first chapter is written to 001-chapter.html, and so on. Adjust to taste.

The solution comes down to this: as soon as an h1 tag is encountered, close the last output file and open a new one.

answered 2015-11-23T00:38:03.907

I eventually found an answer to the question above. The code below does quite a bit more than just split out the file headers: it also loads two parallel lists, one with the formatted file name data (with extensions) and one with the plain header name data, so that I can later use those lists to fill in the formatted file references in these html files in a single loop. The code now works well and is shown below.

import os
import re
import shutil
from string import capwords

# addMainHeaders(), logmsg27() and DEBUG_FLAG are defined elsewhere in the program

def splitHeaderstoFiles(dir, inpath):
    count = 1
    t_count = 0
    out_path = ''
    header = ''
    write_bodytext = False
    file_path_names = []
    pure_header_names = []

    inpath = dir + os.sep + inpath
    with open(inpath, 'rt', encoding='utf-8') as infp:

        for line in infp:

            if '<h1' in line:
                # strip html tags, convert to start caps
                p = re.compile(r'<.*?>')
                header = p.sub('', line)
                header = capwords(header)
                line_save = header

                # add 0 for count below 10
                if count < 10:
                    header = '0' + str(count) + '_' + header
                else:
                    header = str(count) + '_' + header

                # remove all spaces + add extension in header
                header = header.replace(' ', '_')
                header = header + '.xhtml'
                count = count + 1

                # create two parallel lists used later
                out_path = dir + os.sep + header
                outfp = open(out_path, 'wt', encoding='utf-8')
                file_path_names.insert(t_count, out_path)
                pure_header_names.insert(t_count, line_save)
                t_count = t_count + 1

                # add html meta headers and write it
                outfp = addMainHeaders(outfp)
                outfp.write(line)
                write_bodytext = True

            # add header bodytext
            elif write_bodytext == True:
                outfp.write(line)

    # now add html titles and close the html tails on all files
    max_num_files = len(file_path_names)
    tmp = dir + os.sep + 'temp1.tmp'
    i = 0

    while i < max_num_files:
        outfp = open(tmp, 'wt', encoding='utf-8')
        infp = open(file_path_names[i], 'rt', encoding='utf-8')

        for line in infp:
            if '<title>' in line:
                line = line.strip(' ')
                line = line.replace('<title></title>',
                                    '<title>' + pure_header_names[i] + '</title>')
                outfp.write(line)
            else:
                outfp.write(line)

        # add the html tail if the last line did not already close the document
        if '</body>' in line or '</html>' in line:
            pass
        else:
            outfp.write('  </body>' + '\n</html>')

        # clean up
        infp.close()
        outfp.close()
        shutil.copy2(tmp, file_path_names[i])
        os.remove(tmp)
        i = i + 1

    # now rename just the title page
    if os.path.isfile(file_path_names[0]):
        title_page_name = file_path_names[0]
        new_title_page_name = dir + os.sep + '01_Title.xhtml'
        os.rename(title_page_name, new_title_page_name)
        file_path_names[0] = '01_Title.xhtml'
    else:
        logmsg27(DEBUG_FLAG)
        os._exit(0)

    # xhtml file is no longer needed
    if os.path.isfile(inpath):
        os.remove(inpath)

    # returned list values are also used
    # later to create epub opf and ncx files
    return (file_path_names, pure_header_names)
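
For illustration, here is a minimal, hypothetical sketch of how the two returned lists might later feed the opf manifest and the ncx navigation map (build_manifest_and_nav is not part of the code above, and the surrounding package boilerplate is omitted):

import os

def build_manifest_and_nav(file_path_names, pure_header_names):
    # hypothetical helper: turn the two parallel lists into OPF manifest
    # entries and NCX navPoint entries
    manifest_items = []
    nav_points = []
    for num, (path, title) in enumerate(zip(file_path_names, pure_header_names), start=1):
        item_id = 'chap{:03}'.format(num)
        href = os.path.basename(path)
        # manifest entry for this chapter file
        manifest_items.append('<item id="{0}" href="{1}" '
                              'media-type="application/xhtml+xml"/>'.format(item_id, href))
        # navPoint entry pointing at the same file, labelled with the plain header text
        nav_points.append('<navPoint id="nav{0}" playOrder="{0}">'
                          '<navLabel><text>{1}</text></navLabel>'
                          '<content src="{2}"/></navPoint>'.format(num, title, href))
    return manifest_items, nav_points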

@Hai Vu and @Seth -- thank you for all your help.

answered 2015-11-25T02:57:57.107

You loop over the input file twice, which is likely the cause of your problem:

for line in infp:
    ...
    with open(path, 'wt', encoding=('utf-8')) as outfp:            
        ...
        for line in infp:
            ...

Both for loops advance the same file iterator, so when the inner loop breaks it has already consumed the line containing the next <h1>; the outer loop never sees that header, which is why every other chapter file is missing.

You could try converting the for loops into while loops so that you are not nesting two loops over the same iterator:

while True:
    line = infp.readline()
    if not line:               # readline() returns '' at end of file
        break
    if '<h1' in line:
        with open(...) as outfp:
            while True:
                line = infp.readline()
                if not line or '<h1' in line:
                    break
                outfp.write(line)

Alternatively, you may want to use an HTML parser (e.g. BeautifulSoup). Then you can do what is described here: https://stackoverflow.com/a/8735688/65295.
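
For instance, here is a minimal sketch of that approach (it assumes beautifulsoup4 is installed and that every <h1> is a direct child of <body>, with its chapter content following it as sibling elements; it is not the exact code from the linked answer):

from bs4 import BeautifulSoup

with open('index.html', 'rt', encoding='utf-8') as fp:
    soup = BeautifulSoup(fp, 'html.parser')

# group the children of <body> into chapters, starting a new group at each <h1>
chapters = []
current = None
for element in soup.body.children:
    if getattr(element, 'name', None) == 'h1':
        current = [element]
        chapters.append(current)
    elif current is not None:
        current.append(element)

# write each chapter to its own (very bare) html file
for number, elements in enumerate(chapters, start=1):
    with open('{:03}-chapter.html'.format(number), 'wt', encoding='utf-8') as outfp:
        outfp.write('<html>\n<body>\n')
        outfp.write(''.join(str(e) for e in elements))
        outfp.write('\n</body>\n</html>')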


Update from the comments: basically, read the entire file in at once so you can move back and forth through it as needed. Unless you have a very large file (or very little memory), this is unlikely to become a performance problem.

lines = infp.readlines() # read the entire file
i = 0
while i < len(lines):
    if '<h1' in lines[i]:
        with open(...) as outfp:
            j = i + 1
            while j < len(lines):
                if '<h1' in lines[j]:
                    break
                outfp.write(lines[j])
                j += 1
        # line j has an <h1> (or is past the end), so set i to j and detect it
        # at the top of the next loop iteration.
        i = j
    else:
        i += 1
answered 2015-11-23T00:12:15.420