python - Python XML: removing line breaks inside tags

Question

The problem is that some of the XML files I scraped from the SEC contain line breaks inside tags. As a result, these XML files are not well-formed.

<footnote id="F4">Shares sold on the open market are reported as an average sell price per share of $56.87; breakdown of shares sold and per share sale prices are as follows; 100 at $56.31; 200 at $56.32; 100 at $56.33; 198 at $56.39; 600 at $56.40; 100 at $56.41; 102 at $56.42; 600 at $56.44; 320 at $56.45; 100 at $56.46; 900 at $56.47; 480 at $56.48; 300 at $56.49; 1,200 at $56.50; 400 at $56.51; 1,130 at $56.52; 600 at $56.53; 100 at $56.54; 1,500 at $56.55; 600 at $56.56; 644 at $56.57; 1,656 at $56.58; 1,070 at $56.59; 2069 at $56.60; 1,831 at $56.61; 1,000 at $56.62; 1,000 at $56.63; 492 at $56.64; 1,400 at $56.65; 920 at $56.66; 1,000 at $56.67; 600 at $56.68; 500 at $56.69; 1,200 at $56.70; 500 at $56.71; 582 at $56.72; 400 at $56.73; 1,108 at $56.74; 37 at $56.75; 710 at $56.76; 630 at $56.77; 1,600 at $56.78; 400 at $56.79; 400 at $56.80; 1,500 at $56.81; 1,100 at $56.82; 100 at $56.83; 800 at $56.84; 200 at $56.85; 1,300 at $56.87; additional shares sold continued on Footnote (5).</footnot
e>

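My first thought was that this was caused by an encoding mismatch between UTF-8 and ISO-8859-1, but the problem persisted after I changed the encoding. My next attempt was a regular expression that detects those line breaks inside tags, but since they can appear anywhere, that solution is not very reliable.

Do you have any ideas on how to solve this?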

score 0 · Accepted Answer

For this txt file with an embedded XML section, it can be done as follows:

import re

# open the txt file
with open("0001112679-10-000086.txt", "r", encoding="utf8") as f:
    txt = f.read()

# cut out the xml part from the txt file
start = txt.find("<XML>")
end = txt.find("</XML>") + 6
xml = txt[start:end]

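# process the xml part: drop the newline that follows any run of
# 1023 non-newline characters, i.e. join lines that were hard-wrapped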
xml = re.sub(r"([^\n]{1023})\n", r"\1", xml)

# combine a new txt back from the parts
new_txt = txt[:start] + xml + txt[end:]

# save the new txt in file
with open("0001112679-10-000086_output.txt", "w", encoding="utf8") as f:
    f.write(new_txt)
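
To check that the substitution really repairs a split tag, here is a minimal sketch on synthetic data. It assumes, as the regex above does, that the stray newline appears after exactly 1023 characters on a line; the names `WRAP_WIDTH` and `unwrap_long_lines` are made up for this illustration, and the footnote text is just padding, not real filing data.

```python
import re
import xml.etree.ElementTree as ET

WRAP_WIDTH = 1023  # assumed wrap width, taken from the regex in the answer


def unwrap_long_lines(xml, width=WRAP_WIDTH):
    """Remove a newline that follows a run of `width` non-newline characters."""
    pattern = r"([^\n]{%d})\n" % width
    return re.sub(pattern, r"\1", xml)


# Build a synthetic footnote whose closing tag is split by the wrap,
# mirroring the "</footnot" + newline + "e>" case from the question.
prefix = '<footnote id="F4">'
suffix = "</footnote>"
body = "x" * (WRAP_WIDTH - len(prefix) - len("</footnot"))
line = prefix + body + suffix
broken = line[:WRAP_WIDTH] + "\n" + line[WRAP_WIDTH:]

# The wrapped text is rejected by the parser.
try:
    ET.fromstring(broken)
    broken_parses = True
except ET.ParseError:
    broken_parses = False

# After unwrapping, the element parses normally.
repaired = unwrap_long_lines(broken)
root = ET.fromstring(repaired)
print(broken_parses, root.tag, root.get("id"))
```

Note that the pattern only targets newlines that follow a full-width line, so ordinary short lines in the XML are left alone. If a particular file is wrapped at a different width, the `width` argument would need to be adjusted.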
