python - Converting tab-indented text to an HTML unordered list?

Question

I'm a beginner programmer, so this question may sound trivial: I have some text files containing tab-indented text, for example:

A
	B
	C
		D
		E

Now I want to generate an unordered HTML list from it, with the following structure:

<ul>
<li>A
<ul><li>B</li>
<li>C
<ul><li>D</li>
<li>E</li></ul></li></ul></li>
</ul>

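My idea is to write a Python script, but if there is a simpler (automatic) way, that would be fine too. To identify the indentation level and the item name, I would try something like the following code: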

import sys
indent = 0
last = []
for line in sys.stdin:
    count = 0
    while line.startswith("\t"):
        count += 1
        line = line[1:]
    if count > indent:
        indent += 1
        last.append(last[-1])
    elif count < indent:
        indent -= 1
        last = last[:-1]
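
The tab-counting step in the loop above can be isolated into a small helper. This is only a sketch; the name split_indent is hypothetical and not part of the question's code. It returns the depth (number of leading tabs) together with the item name.

```python
def split_indent(line):
    """Return (depth, name); depth is the number of leading tab characters."""
    body = line.rstrip("\r\n")      # drop the line ending only
    name = body.lstrip("\t")        # remove the leading tabs
    return len(body) - len(name), name.strip()


for sample in ["A\n", "\tB\n", "\t\tD\n"]:
    print(split_indent(sample))
```

Keeping this step separate means the code that emits HTML only has to compare depths.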

score 5 · Accepted Answer

Try this (it works for your test case):

import itertools

def listify(filepath):
    depth = 0
    print("<ul>" * (depth + 1))
    with open(filepath) as f:
        for line in f:
            line = line.rstrip()
            # Count the leading tabs to get the nesting depth of this line.
            newDepth = sum(1 for i in itertools.takewhile(lambda c: c == '\t', line))
            if newDepth > depth:
                print("<ul>" * (newDepth - depth))
            elif depth > newDepth:
                print("</ul>" * (depth - newDepth))
            print("<li>%s</li>" % line.strip())
            depth = newDepth
    print("</ul>" * (depth + 1))

Hope this helps.

score 2 · Accepted Answer

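The tokenize module understands your input format: the lines contain valid Python identifiers, and the indentation level of the statements is significant. The ElementTree module lets you manipulate tree structures in memory, so it may be more flexible to separate building the tree from rendering it as HTML: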

from tokenize import NAME, INDENT, DEDENT, ENDMARKER, NEWLINE, generate_tokens
from xml.etree import ElementTree as etree

def parse(file, TreeBuilder=etree.TreeBuilder):
    tb = TreeBuilder()
    tb.start('ul', {})
    for type_, text, start, end, line in generate_tokens(file.readline):
        if type_ == NAME: # convert name to <li> item
            tb.start('li', {})
            tb.data(text)
            tb.end('li')
        elif type_ == NEWLINE:
            continue
        elif type_ == INDENT: # start <ul>
            tb.start('ul', {})
        elif type_ == DEDENT: # end </ul>
            tb.end('ul')
        elif type_ == ENDMARKER: # done
            tb.end('ul') # end parent list
            break
        else: # unexpected token
            assert 0, (type_, text, start, end, line)
    return tb.close() # return root element

Any class that provides .start(), .end(), .data(), and .close() methods can be used as the TreeBuilder. For example, you could write the HTML on the fly instead of building a tree.
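
As an illustration of writing HTML on the fly, here is a minimal sketch of such a class. The name HtmlWriter and its constructor argument are assumptions made for this example; only the four method names come from the TreeBuilder interface described above.

```python
import io
from html import escape


class HtmlWriter:
    """Hypothetical TreeBuilder stand-in that writes each tag immediately."""

    def __init__(self, out):
        self.out = out  # any text-mode file object

    def start(self, tag, attrs):
        attr_text = "".join(
            ' %s="%s"' % (key, escape(str(value), quote=True))
            for key, value in attrs.items()
        )
        self.out.write("<%s%s>" % (tag, attr_text))

    def data(self, text):
        self.out.write(escape(text))  # escape &, < and > in item text

    def end(self, tag):
        self.out.write("</%s>" % tag)

    def close(self):
        return self.out  # nothing to build; hand back the stream


buf = io.StringIO()
writer = HtmlWriter(buf)
writer.start("ul", {})
writer.start("li", {})
writer.data("A & B")
writer.end("li")
writer.end("ul")
print(writer.close().getvalue())
```

Since parse() calls TreeBuilder() without arguments, a writer like this could presumably be passed as parse(sys.stdin, TreeBuilder=lambda: HtmlWriter(sys.stdout)).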

To parse standard input and write the HTML to standard output, you can use ElementTree.write():

import sys

etree.ElementTree(parse(sys.stdin)).write(sys.stdout, method='html')

Output:

<ul><li>A</li><ul><li>B</li><li>C</li><ul><li>D</li><li>E</li></ul></ul></ul>

You can use any files, not just sys.stdin/sys.stdout.

Note: to write to standard output on Python 3, use sys.stdout.buffer or pass encoding="unicode", because of the bytes/Unicode distinction.
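
A minimal Python 3 sketch of the encoding="unicode" variant follows. The helper name write_html and the small hand-built tree are assumptions for illustration; in practice the root element would come from parse().

```python
import io
import sys
from xml.etree import ElementTree as etree


def write_html(root, out=None):
    """Serialize an element as HTML to a text-mode file object."""
    if out is None:
        out = sys.stdout
    # encoding="unicode" makes write() emit str rather than bytes,
    # so a text stream such as sys.stdout accepts it.
    etree.ElementTree(root).write(out, encoding="unicode", method="html")


# Small stand-in tree; normally this would be the result of parse().
root = etree.Element("ul")
etree.SubElement(root, "li").text = "A"
write_html(root)
```

Passing sys.stdout.buffer without the encoding argument is the bytes-based alternative mentioned in the note.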

score 0 · Accepted Answer

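I think the algorithm goes like this:

Keep track of the current indentation level (by counting the tabs on each line)
If the indentation level increases: emit <ul> <li>current item</li>
If the indentation level decreases: emit </ul> (one per level dropped), then <li>current item</li>
If the indentation level stays the same: emit <li>current item</li>

Putting this into code is left as an exercise for the OP.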
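
The steps above can be sketched roughly as follows. This is one possible reading, not the answerer's code: it additionally nests each child <ul> inside its parent <li> (the structure the question asks for), closes one list per level dropped, and escapes the item text. The function name tabs_to_ul is hypothetical.

```python
from html import escape


def tabs_to_ul(text):
    """Convert tab-indented lines into a nested <ul> string."""
    out = []
    depth = -1  # -1 means no list has been opened yet
    for raw in text.splitlines():
        if not raw.strip():
            continue  # ignore blank lines
        name = raw.lstrip("\t")
        # Leading tabs give the level; never descend more than one level at once.
        level = min(len(raw) - len(name), depth + 1)
        if level > depth:
            out.append("<ul>")  # deeper: open a list inside the current item
            depth += 1
        else:
            out.append("</li>")  # same or shallower: close the previous item
            while depth > level:
                out.append("</ul></li>")  # close one list per level dropped
                depth -= 1
        out.append("<li>" + escape(name.strip()))
    while depth >= 0:
        out.append("</li></ul>")  # close whatever is still open
        depth -= 1
    return "".join(out)


print(tabs_to_ul("A\n\tB\n\tC\n\t\tD\n\t\tE\n"))
```

For the five-item sample this prints <ul><li>A<ul><li>B</li><li>C<ul><li>D</li><li>E</li></ul></li></ul></li></ul>, which is the question's target structure without the line breaks.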

score -1 · Accepted Answer

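The algorithm is simple. You take the depth level of each line, as indicated by its tab characters (\t), and shift the next bullet to the right (\t+\t), shift it to the left (\t\t-\t), or keep it at the same level (\t).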

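Make sure your "in.txt" contains tabs; if you copy the sample from here, replace the indentation with tabs. If the indentation consists of spaces, nothing will happen. The separator is an empty line at the end. You can change that in the code if needed.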
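
If the input was copied with space indentation, it could be normalized to tabs before running the script. The following is a minimal sketch, assuming four spaces per indentation level; both that width and the helper name spaces_to_tabs are assumptions, not part of the answer's script.

```python
import re


def spaces_to_tabs(text, width=4):
    """Replace each group of `width` leading spaces with one tab, line by line."""

    def convert(match):
        spaces = match.group(0)
        # Whole groups become tabs; any leftover spaces are kept unchanged.
        return "\t" * (len(spaces) // width) + " " * (len(spaces) % width)

    return re.sub(r"^ +", convert, text, flags=re.MULTILINE)


sample = "qqq\n    www\n        яяя\n"
print(repr(spaces_to_tabs(sample)))
```

Only spaces at the very start of a line are touched, so spaces inside item names are left alone.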

JF Sebastian's solution is good, but it does not handle Unicode.

Create a text file "in.txt" in UTF-8 encoding:

qqq
    www
    www
        яяя
        яяя
    ыыы
    ыыы
qqq
qqq

Then run the script "ul.py". The script will create "out.html" and open it in Firefox.

#!/usr/bin/python
# -*- coding: utf-8 -*-

# The script exports a tabbed list from string into a HTML unordered list.

import io, subprocess, sys

f=io.open('in.txt', 'r',  encoding='utf8')
s=f.read()
f.close()

#---------------------------------------------

def ul(s):

    L=s.split('\n\n')

    s='<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">\n\
<html><head><meta content="text/html; charset=UTF-8" http-equiv="content-type"><title>List Out</title></head><body>'

    for p in L:
        e=''
        if p.find('\t') != -1:

            l=p.split('\n')
            depth=0
            e='<ul>'
            i=0

            for line in l:
                if len(line) >0:
                    a=line.split('\t')
                    d=len(a)-1

                    if depth==d:
                        e=e+'<li>'+line+'</li>'


                    elif depth < d:
                        i=i+1
                        e=e+'<ul><li>'+line+'</li>'
                        depth=d


                    elif depth > d:
                        e=e+'</ul>'*(depth-d)+'<li>'+line+'</li>'
                        depth=d
                        i=depth


            e=e+'</ul>'*i+'</ul>'
            p=e.replace('\t','')

            l=e.split('<ul>')
            n1= len(l)-1

            l=e.split('</ul>')
            n2= len(l)-1

            if n1 != n2:
                msg='<div style="color: red;">Wrong bullets position.<br>&lt;ul&gt;: '+str(n1)+'<br>&lt;&frasl;ul&gt;: '+str(n2)+'<br> Correct your source.</div>'
                p=p+msg

        s=s+p+'\n\n'

    return s

#-------------------------------------      

def detach(cmd):
    process = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True)
    sys.exit()

s=ul(s)

f=io.open('out.html', 'w',  encoding='utf8')
f.write(s)
f.close()

cmd='firefox out.html'
detach(cmd)

The HTML will be:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html><head><meta content="text/html; charset=UTF-8" http-equiv="content-type"><title>List Out</title></head><body><ul><li>qqq</li><ul><li>www</li><li>www</li><ul><li>яяя</li><li>яяя</li></ul><li>ыыы</li><li>ыыы</li></ul><li>qqq</li><li>qqq</li></ul>
