The problem is that some of the XML files I scraped from the SEC contain line breaks inside tags, so these XML files are not well-formed.

<footnote id="F4">Shares sold on the open market are reported as an average sell price per share of $56.87; breakdown of shares sold and per share sale prices are as follows; 100 at $56.31; 200 at $56.32; 100 at $56.33; 198 at $56.39; 600 at $56.40; 100 at $56.41; 102 at $56.42; 600 at $56.44; 320 at $56.45; 100 at $56.46; 900 at $56.47; 480 at $56.48; 300 at $56.49; 1,200 at $56.50; 400 at $56.51; 1,130 at $56.52; 600 at $56.53; 100 at $56.54; 1,500 at $56.55; 600 at $56.56; 644 at $56.57; 1,656 at $56.58; 1,070 at $56.59; 2069 at $56.60; 1,831 at $56.61; 1,000 at $56.62; 1,000 at $56.63; 492 at $56.64; 1,400 at $56.65; 920 at $56.66; 1,000 at $56.67; 600 at $56.68; 500 at $56.69; 1,200 at $56.70; 500 at $56.71; 582 at $56.72; 400 at $56.73; 1,108 at $56.74; 37 at $56.75; 710 at $56.76; 630 at $56.77; 1,600 at $56.78; 400 at $56.79; 400 at $56.80; 1,500 at $56.81; 1,100 at $56.82; 100 at $56.83; 800 at $56.84; 200 at $56.85; 1,300 at $56.87; additional shares sold continued on Footnote (5).</footnot
e>

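My first thought was that this was caused by an encoding mismatch between UTF-8 and ISO-8859-1, but the problem persisted after I changed the encoding. My next idea was a regular expression that detects those line breaks inside tags, but since they can appear anywhere, that solution is not very reliable.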

Does anyone have an idea how to solve this?

1 Answer

For this txt file with an embedded XML section, it can be done as follows:

import re

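# read the whole txt file
with open("0001112679-10-000086.txt", "r", encoding="utf8") as f:
    txt = f.read()

# cut out the xml part from the txt file
# (str.index raises ValueError if a marker is missing, unlike str.find, which returns -1)
start = txt.index("<XML>")
end = txt.index("</XML>", start) + len("</XML>")
xml = txt[start:end]

# remove every newline that follows a run of 1023 non-newline characters,
# i.e. rejoin the lines that were wrapped at that width
xml = re.sub(r"([^\n]{1023})\n", r"\1", xml)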

# combine a new txt back from the parts
new_txt = txt[:start] + xml + txt[end:]

# save the new txt in file
with open("0001112679-10-000086_output.txt", "w", encoding="utf8") as f:
    f.write(new_txt)
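
The same idea can be sketched as a reusable function together with a well-formedness check, so the result can be verified before it is saved. This is only a sketch: the names `rejoin_wrapped_lines`, `is_well_formed` and `WRAP_WIDTH` are made up for illustration, and the width of 1023 is taken from the regex above, on the assumption that the file was wrapped at that fixed width.

```python
import xml.etree.ElementTree as ET

# Assumption: lines were wrapped at this width (the value used in the regex above).
WRAP_WIDTH = 1023


def rejoin_wrapped_lines(text: str, width: int = WRAP_WIDTH) -> str:
    """Drop the newline after every line that is exactly `width` characters long."""
    lines = text.split("\n")
    parts = []
    for i, line in enumerate(lines):
        parts.append(line)
        is_last = i == len(lines) - 1
        # Keep the newline unless the line looks like it was cut at the wrap width.
        if not is_last and len(line) != width:
            parts.append("\n")
    return "".join(parts)


def is_well_formed(xml_text: str) -> bool:
    """Return True if the standard-library parser accepts the text as XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    # Small demonstration with an artificial wrap width of 10 characters.
    original = "<root><footnote>abcdefgh</footnote></root>"
    wrapped = "\n".join(original[i:i + 10] for i in range(0, len(original), 10))
    repaired = rejoin_wrapped_lines(wrapped, width=10)
    print(is_well_formed(wrapped), is_well_formed(repaired))
```

One caveat: a line that is legitimately exactly 1023 characters long would also be joined with the next one, so running `is_well_formed` on the extracted XML part before and after the repair is a cheap sanity check.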
answered 2021-05-24T16:18:56.210