2

我有一些代码在一个更大的程序中独立工作,但现在在更大的程序中它似乎没有工作——即它没有执行所需的操作。

问题发生在第 4 步(见下文),经过反思,我在字符类中的预期逻辑(即“除回车之外的所有内容”)似乎没有正确编码(但我不知道还有什么“表达”逻辑)。

我的目标只是用段落标签包装每一行或段落。

Python代码

import re

# 1.  open the html file in read mode
html_file = open('test.html', 'r')

# 2.  convert to string
html_file_as_string = html_file.read()

# 3.  close the html file
html_file.close()

# 4.  replace carriage returns with closing and opening paragraph tags
html_file_as_string = re.sub('([^\r]*)\r', r'\1</p>\n<p>', html_file_as_string)

# 5.  remove time and date
html_file_as_string = re.sub(r'(Lorem ipsum \d*/\d*/\d*, \d*:\d* [a-z]{2})', r"", html_file_as_string)

# 6.  remove the white space after the opening paragraph tags
html_file_as_string = re.sub('<p>\n*\s*', r"<p>", html_file_as_string)

# 7.  remove the white space before the closing paragraph tags
html_file_as_string = re.sub('\s*</p>', r"</p>", html_file_as_string)

# 8.  open the file in write mode to clear
html_file = open('test.html', 'w')

# 9.  write the new contents to file
html_file.write(html_file_as_string)

# 10.  print to screen so we can see what is happening
print html_file_as_string

# 11.  close the html file
html_file.close()

这是 HTML 文件的内容:

<p>
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
   Lorem ipsum..consectetur adipiscing elit.
   Lorem ipsum dolor sit amet, consectetur adipiscing elit. Lorem ipsum dolor sit amet, consectetur adipiscing elit.
   Lorem ipsum dolor sit amet, consectetur adipiscing elit.Lorem ipsum dolor sit amet, consectetur adipiscing elit."Lorem ipsum dolor sit amet", consectetur adipisc'ing elit.Lorem ipsum dolor...sit amet, consectetur adipiscing elit..
   Lorem ipsum dolor sit amet, consectetur adipiscing elit.Lorem ipsum dolor sit amet, consectetur adipiscing elit.
   Lorem ipsum dolor sit amet, consectetur adipiscing elit..
   .....Lorem ipsum dolor sit amet, consectetur adipiscing elit.Lorem ipsum dolor sit amet, consectetur adipiscing elit.Lorem ipsum dolor sit amet, consectetur adipiscing elit.Lorem ipsum dolor sit amet, consectetur adipiscing elit.Lorem ipsum dolor sit amet, consectetur adipiscing elit.Lorem ipsum 01/01/05, 05:00 am</p>

这是在 SciTE 编辑器中查看的文件内容(因此可以看到空格、回车和换行符)。

在此处输入图像描述

编辑:

我根据以下建议更改了正则表达式,然后将替换加倍(从第 4 步中可见的原始代码更改和第 4 步之前的第 6 步复制)。

工作代码:

import re

# 1.  open the html file in read mode
html_file = open('test.html', 'r')

# 2.  convert to string
html_file_as_string = html_file.read()

# 3.  close the html file
html_file.close()

# 6(added).  remove the white space after the opening paragraph tags
html_file_as_string = re.sub('<p>\n*\s*', r"<p>", html_file_as_string)

# 4(changed).  replace carriage returns with closing and opening paragraph tags
html_file_as_string = re.sub('([^\r\n]*)(\r\n?|\n)', r'\1</p>\2<p>', html_file_as_string)

# 5.  remove time and date
html_file_as_string = re.sub(r'(Lorem ipsum \d*/\d*/\d*, \d*:\d* [a-z]{2})', r"", html_file_as_string)

# 6.  remove the white space after the opening paragraph tags
html_file_as_string = re.sub('<p>\n*\s*', r"<p>", html_file_as_string)

# 7.  remove the white space before the closing paragraph tags
html_file_as_string = re.sub('\s*</p>', r"</p>", html_file_as_string)

# 8.  open the file in write mode to clear
html_file = open('test.html', 'w')

# 9.  write the new contents to file
html_file.write(html_file_as_string)

# 10.  print to screen so we can see what is happening
print html_file_as_string

# 11.  close the html file
html_file.close()

编辑2:

上面的代码在其他部分代码过于激进,做了太多的修改,回到绘图板上。

4

1 回答 1

1

我认为您最好遍历字符串,而不是尝试使用正则表达式作为 Python 来做一些事情,您可以通过几个步骤来做到这一点

import re
parsed_html = []

# Open the file and close it after being read
with open('test.html', 'r') as html_file:
    lines = html_file.readlines()

# Iterate through each line using \r\n, \r or \n as separators 
for line in lines:
    # Remove whitespace chars before and after the content (6 & 7)
    line = line.strip()

    # Skip empty lines (you could merge this and latter if)
    if not line:
         continue

    # Skip lines with only <p> or </p> (check output and remove if not needed)
    if re.match("^</?p>$", line):
         continue

    # Remove the date line (using + instead of * as * matches 0 or more entries)
    line = re.sub(r'Lorem ipsum \d+/\d+/\d+, \d+:\d+ (am|pm)', '', line)

    # Make sure we add p tag and CL/LR to the end of each line
    line = "<p>{0}</p>\r\n".format(line)

    # Append current line to a list that we will use to write the file
    parsed_html.append(line)

# Write contents of the parsed html to the file
with open('test.html', 'w') as html_file:
    html_file.writelines(parsed_html)
于 2013-07-09T06:31:45.383 回答