I downloaded the following Form 20-F from SEC EDGAR:

https://www.sec.gov/Archives/edgar/data/1729089/000121390019021541/0001213900-19-021541.txt

As you can see, the .txt file contains many HTML tags, for example:

</HEAD>
<BODY STYLE="font: 10pt Times New Roman, Times, Serif">

<P STYLE="margin-top: 0; text-align: center; margin-bottom: 0; font: 10pt Times New Roman, Times, Serif"></P>

<!-- Field: Rule-Page --><DIV STYLE="width: 100%"><DIV STYLE="font-size: 1pt; border-top: Black 2pt solid; border-bottom: Black 1pt solid">&nbsp;</DIV></DIV><!-- Field: /Rule-Page -->

<P STYLE="margin-top: 0; text-align: center; margin-bottom: 0; font: 10pt Times New Roman, Times, Serif">&nbsp;</P>

<P STYLE="margin-top: 0; text-align: center; margin-bottom: 0; font: 10pt Times New Roman, Times, Serif"><FONT STYLE="font-family: Times New Roman, Times, Serif; font-size: 10pt"><B>UNITED
STATES</B></FONT></P>

<P STYLE="font: 10pt Times New Roman, Times, Serif; margin-top: 0; margin-bottom: 0; text-align: center"><FONT STYLE="font-family: Times New Roman, Times, Serif; font-size: 10pt"><B>SECURITIES
AND EXCHANGE COMMISSION</B></FONT></P>

<P STYLE="font: 10pt Times New Roman, Times, Serif; margin-top: 0; margin-bottom: 0; text-align: center"><FONT STYLE="font-family: Times New Roman, Times, Serif; font-size: 10pt"><B>WASHINGTON,
D.C. 20549</B></FONT></P>

<P STYLE="font: 10pt Times New Roman, Times, Serif; margin-top: 0; margin-bottom: 0; text-align: center"><FONT STYLE="font-family: Times New Roman, Times, Serif; font-size: 10pt">&nbsp;</FONT></P>

<P STYLE="font: 10pt Times New Roman, Times, Serif; margin-top: 0; margin-bottom: 0; text-align: center"><FONT STYLE="font-family: Times New Roman, Times, Serif; font-size: 10pt"><B>FORM
20-F</B></FONT></P>

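Since I want to run natural language processing (NLP) text analysis on this filing, do I need to strip out all of these HTML tags first? How can I do that: with regular expressions, or with a package such as BeautifulSoup?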
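
For reference, here is a minimal sketch of the parser-based route, using only Python's standard-library `html.parser` rather than regular expressions. The names `TextExtractor` and `html_to_text` are hypothetical, and the set of tags treated as block-level is an assumption chosen to fit the excerpt above. It drops tags and comments, decodes entities such as `&nbsp;`, and collapses whitespace. It does not separate the individual documents bundled inside a full EDGAR submission file.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes; drop tags, comments, and script/style bodies."""

    # Assumption: tags whose boundaries should separate words in the output.
    BLOCK = {"p", "div", "br", "tr", "td", "th", "li", "table", "title"}
    SKIP = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True decodes entities such as &nbsp; into characters.
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <P> arrives here as "p".
        if tag in self.SKIP:
            self._skip_depth += 1
        if tag in self.BLOCK:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        if tag in self.BLOCK:
            self.parts.append(" ")

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(markup: str) -> str:
    """Return the visible text of an HTML fragment as one normalized string."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    # \s also matches the non-breaking space produced by &nbsp;.
    return re.sub(r"\s+", " ", "".join(parser.parts)).strip()


if __name__ == "__main__":
    sample = (
        '<P STYLE="text-align: center"><FONT STYLE="font-size: 10pt">'
        "<B>UNITED\nSTATES</B></FONT></P>\n\n"
        "<!-- Field: Rule-Page --><DIV>&nbsp;</DIV>"
        "<P><B>FORM\n20-F</B></P>"
    )
    print(html_to_text(sample))
```

If BeautifulSoup is installed, `BeautifulSoup(markup, "html.parser").get_text()` is a comparable option. Either way, a real parser tends to cope better than a regular expression with multi-line tags, comments, and entities like the ones in the excerpt.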
