python - SEC EDGAR Form 20-F - how to process text that contains HTML tags

Question

I downloaded the following Form 20-F from SEC EDGAR:

https://www.sec.gov/Archives/edgar/data/1729089/000121390019021541/0001213900-19-021541.txt

As you can see, the .txt file contains many HTML tags, for example:

</HEAD>
<BODY STYLE="font: 10pt Times New Roman, Times, Serif">

<P STYLE="margin-top: 0; text-align: center; margin-bottom: 0; font: 10pt Times New Roman, Times, Serif"></P>

<!-- Field: Rule-Page --><DIV STYLE="width: 100%"><DIV STYLE="font-size: 1pt; border-top: Black 2pt solid; border-bottom: Black 1pt solid">&nbsp;</DIV></DIV><!-- Field: /Rule-Page -->

<P STYLE="margin-top: 0; text-align: center; margin-bottom: 0; font: 10pt Times New Roman, Times, Serif">&nbsp;</P>

<P STYLE="margin-top: 0; text-align: center; margin-bottom: 0; font: 10pt Times New Roman, Times, Serif"><FONT STYLE="font-family: Times New Roman, Times, Serif; font-size: 10pt"><B>UNITED
STATES</B></FONT></P>

<P STYLE="font: 10pt Times New Roman, Times, Serif; margin-top: 0; margin-bottom: 0; text-align: center"><FONT STYLE="font-family: Times New Roman, Times, Serif; font-size: 10pt"><B>SECURITIES
AND EXCHANGE COMMISSION</B></FONT></P>

<P STYLE="font: 10pt Times New Roman, Times, Serif; margin-top: 0; margin-bottom: 0; text-align: center"><FONT STYLE="font-family: Times New Roman, Times, Serif; font-size: 10pt"><B>WASHINGTON,
D.C. 20549</B></FONT></P>

<P STYLE="font: 10pt Times New Roman, Times, Serif; margin-top: 0; margin-bottom: 0; text-align: center"><FONT STYLE="font-family: Times New Roman, Times, Serif; font-size: 10pt">&nbsp;</FONT></P>

<P STYLE="font: 10pt Times New Roman, Times, Serif; margin-top: 0; margin-bottom: 0; text-align: center"><FONT STYLE="font-family: Times New Roman, Times, Serif; font-size: 10pt"><B>FORM
20-F</B></FONT></P>

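Since I want to run natural language processing (NLP) text analysis on this filing, do I need to get rid of all these HTML tags first? How can I do that: with regular expressions, or with a package such as BeautifulSoup?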
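
For illustration, here is a minimal sketch of the tag-stripping step using only the standard library's `html.parser`, which is usually more robust than regular expressions on markup like the excerpt above. The names `TextExtractor`, `html_to_text`, and `sample` are hypothetical and the list of block-level tags is an assumption. `BeautifulSoup(markup, "html.parser").get_text()` would be a comparable third-party route.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text; hypothetical helper, not part of any library."""

    SKIP = {"script", "style"}                     # content that is never prose
    BLOCK = {"p", "div", "br", "tr", "li", "table",
             "h1", "h2", "h3", "h4", "h5", "h6"}   # assumed block-level tags

    def __init__(self):
        super().__init__(convert_charrefs=True)    # &nbsp; etc. become characters
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <P> and <p> are treated alike.
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_data(self, data):
        if not self._skip_depth:
            # Line breaks inside a paragraph are layout only; fold them to spaces.
            self.parts.append(re.sub(r"\s+", " ", data))


def html_to_text(markup):
    """Return the text of `markup`, one non-empty line per block element."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    lines = (" ".join(line.split()) for line in "".join(parser.parts).split("\n"))
    return "\n".join(line for line in lines if line)


# Shortened excerpt modelled on the filing's markup.
sample = """<P STYLE="text-align: center"><FONT STYLE="font-size: 10pt"><B>UNITED
STATES</B></FONT></P>
<!-- Field: Rule-Page --><DIV STYLE="width: 100%">&nbsp;</DIV>
<P STYLE="text-align: center"><B>FORM
20-F</B></P>"""

print(html_to_text(sample))
```

Comments, inline tags such as `<FONT>` and `<B>`, and `&nbsp;`-only blocks drop out, leaving plain lines that a tokenizer can consume. A full EDGAR submission `.txt` typically bundles several documents, so it may be worth isolating the main 20-F document before applying something like this.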

