I downloaded the following Form 20-F from SEC EDGAR:
https://www.sec.gov/Archives/edgar/data/1729089/000121390019021541/0001213900-19-021541.txt
As you can see, the .txt file contains many HTML tags, for example:
</HEAD>
<BODY STYLE="font: 10pt Times New Roman, Times, Serif">
<P STYLE="margin-top: 0; text-align: center; margin-bottom: 0; font: 10pt Times New Roman, Times, Serif"></P>
<!-- Field: Rule-Page --><DIV STYLE="width: 100%"><DIV STYLE="font-size: 1pt; border-top: Black 2pt solid; border-bottom: Black 1pt solid"> </DIV></DIV><!-- Field: /Rule-Page -->
<P STYLE="margin-top: 0; text-align: center; margin-bottom: 0; font: 10pt Times New Roman, Times, Serif"> </P>
<P STYLE="margin-top: 0; text-align: center; margin-bottom: 0; font: 10pt Times New Roman, Times, Serif"><FONT STYLE="font-family: Times New Roman, Times, Serif; font-size: 10pt"><B>UNITED
STATES</B></FONT></P>
<P STYLE="font: 10pt Times New Roman, Times, Serif; margin-top: 0; margin-bottom: 0; text-align: center"><FONT STYLE="font-family: Times New Roman, Times, Serif; font-size: 10pt"><B>SECURITIES
AND EXCHANGE COMMISSION</B></FONT></P>
<P STYLE="font: 10pt Times New Roman, Times, Serif; margin-top: 0; margin-bottom: 0; text-align: center"><FONT STYLE="font-family: Times New Roman, Times, Serif; font-size: 10pt"><B>WASHINGTON,
D.C. 20549</B></FONT></P>
<P STYLE="font: 10pt Times New Roman, Times, Serif; margin-top: 0; margin-bottom: 0; text-align: center"><FONT STYLE="font-family: Times New Roman, Times, Serif; font-size: 10pt"> </FONT></P>
<P STYLE="font: 10pt Times New Roman, Times, Serif; margin-top: 0; margin-bottom: 0; text-align: center"><FONT STYLE="font-family: Times New Roman, Times, Serif; font-size: 10pt"><B>FORM
20-F</B></FONT></P>
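Since I want to run natural language processing (NLP) text analysis on this filing, do I need to strip out all of these HTML tags and similar markup first? How can I do that: with regular expressions, or with a package such as BeautifulSoup?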
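
For reference, here is a minimal sketch of the parser-based route, using only Python's standard-library `html.parser` so that it runs without extra installs. The names `TextExtractor`, `html_to_text`, and `sample` are hypothetical and chosen for illustration; the sample string is a shortened version of the excerpt above, not the full filing. It assumes that dropping tags, comments, and `<style>`/`<script>` bodies and collapsing whitespace is enough for the intended analysis.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes; tags, comments, and style/script bodies are dropped."""

    SKIP = {"script", "style"}  # HTMLParser reports tag names in lowercase

    def __init__(self):
        super().__init__(convert_charrefs=True)  # turns &amp;, &nbsp; etc. into characters
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_text(markup: str) -> str:
    """Return the visible text of `markup` with runs of whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    # Assumption: joining with a space is acceptable; it can split a word
    # that is broken up by inline tags, e.g. <B>U</B>NITED.
    text = " ".join(parser.parts)
    return re.sub(r"\s+", " ", text).strip()


# Shortened, illustrative version of the excerpt in the question.
sample = """<P STYLE="text-align: center"><FONT STYLE="font-size: 10pt"><B>UNITED
STATES</B></FONT></P>
<!-- Field: Rule-Page --><DIV STYLE="width: 100%"> </DIV><!-- Field: /Rule-Page -->
<P STYLE="text-align: center"><B>FORM
20-F</B></P>"""

print(html_to_text(sample))
```

A parser is generally more robust than a hand-written regular expression here, because the attribute values, comments, and tags that wrap across lines in this excerpt are awkward to match reliably with a pattern. If installing a package is acceptable, BeautifulSoup's `get_text()` is intended to do a similar job, for example `BeautifulSoup(markup, "html.parser").get_text()`.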