Suppose I have a database containing the SEC filings I need (initially Form 10-Ks). Most of each filing is HTML markup; they look like this:

<DOCUMENT>
<TYPE>10-K
<SEQUENCE>1
<FILENAME>d445434d10k.htm
<DESCRIPTION>FORM 10-K
<TEXT>
<HTML><HEAD>
<TITLE>Form 10-K</TITLE>
</HEAD>
 <BODY BGCOLOR="WHITE">
<h5 align="left"><a href="#toc">Table of Contents</a></h5>
<div style="line-height:120%;font-size:8pt;"><font style="font-family:inherit;font-size:8pt;">&#160;</font></div><div style="line-height:120%;text-indent:32px;font-size:10pt;"><font style="font-family:inherit;font-size:10pt;font-style:italic;">All references in this Form 10-K to the &#8220;Company&#8221;, &#8220;Contango&#8221;, &#8220;we&#8221;, &#8220;us&#8221; or &#8220;our&#8221; are to Contango Oil&#160;&amp; Gas Company and wholly-owned Subsidiaries. Unless otherwise noted, all information in this Form 10-K relating to natural gas and oil reserves and the estimated future net cash flows attributable to those reserves are based on estimates prepared by independent engineers and are net to our interest.</font></div>

Eventually I'd like to store each filing in the database broken up into its individual sections.

For example, turning this:

Overview</font></div><div style="line-height:120%;font-size:10pt;"><font style="font-family:inherit;font-size:10pt;"><br></font></div><div style="line-height:120%;text-indent:48px;font-size:10pt;"><font style="font-family:inherit;font-size:10pt;">Contango is a Houston, Texas based, independent natural gas and oil company.&#160; The Company's core business is to explore, develop, produce and acquire natural gas and oil properties offshore in the shallow waters of the Gulf of Mexico.&#160; Contango Operators, Inc. (&#8220;COI&#8221;), our wholly-owned subsidiary, acts as operator of our offshore properties.  Contango has additional onshore investments in i) Alta Resources Investments, LLC ("Alta"), whose primary area of focus is the liquids-rich Kaybob Duvernay in Alberta, Canada; ii) Exaro Energy III LLC ("Exaro"), which is primarily focused on the development of proved natural gas reserves.

into:

Overview Contango is a Houston, Texas based, independent natural gas and oil company.&#160; The Company's core business is to explore, develop, produce and acquire natural gas and oil properties offshore in the shallow waters of the Gulf of Mexico.&#160; Contango Operators, Inc. (&#8220;COI&#8221;), our wholly-owned subsidiary, acts as operator of our offshore properties.  Contango has additional onshore investments in i) Alta Resources Investments, LLC ("Alta"), whose primary area of focus is the liquids-rich Kaybob Duvernay in Alberta, Canada; ii) Exaro Energy III LLC ("Exaro"), which is primarily focused on the development of proved natural gas reserves
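To make the transformation above concrete, here is a minimal sketch of the tag-stripping step. The helper name `stripTags` is hypothetical, and the regex is only a rough approximation that assumes no `>` appears inside attribute values; a real HTML parser would be more robust.

```javascript
// Hypothetical helper: remove HTML tags from a fragment and collapse whitespace.
// Assumption: tags never contain a literal '>' inside an attribute value.
function stripTags(html) {
  return html
    .replace(/<[^>]*>/g, ' ') // replace every tag with a single space
    .replace(/\s+/g, ' ')     // collapse runs of whitespace
    .trim();
}

// Shortened version of the fragment from the question.
const fragment =
  'Overview</font></div><div style="font-size:10pt;"><font><br></font></div>' +
  '<div><font>Contango is a Houston, Texas based, independent natural gas and oil company.</font></div>';

console.log(stripTags(fragment));
```

Character references such as `&#160;` are deliberately left untouched here, as in the desired output above; decoding them would be a separate step.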

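...and be able to pull each section into a custom view (building customized, trimmed-down versions; say, only Item 1. Business and the segment information), getting rid of the boilerplate. My model will include the type, filename, and certain other metadata from this document.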
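For the metadata part, a rough sketch of what I have in mind follows. It assumes the header always consists of one-line `<TYPE>`, `<SEQUENCE>`, `<FILENAME>`, and `<DESCRIPTION>` entries before `<TEXT>`, as in the sample above; `parseHeader` is a hypothetical name.

```javascript
// Hypothetical helper: read the one-line header fields that precede <TEXT>.
// Assumption: each field sits on its own line in the form <NAME>value.
function parseHeader(raw) {
  const meta = {};
  for (const rawLine of raw.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line.startsWith('<TEXT>')) break; // the document body starts here
    const m = /^<(TYPE|SEQUENCE|FILENAME|DESCRIPTION)>(.*)$/.exec(line);
    if (m) meta[m[1].toLowerCase()] = m[2].trim();
  }
  return meta;
}

const sample = [
  '<DOCUMENT>',
  '<TYPE>10-K',
  '<SEQUENCE>1',
  '<FILENAME>d445434d10k.htm',
  '<DESCRIPTION>FORM 10-K',
  '<TEXT>',
  '<HTML><HEAD>',
].join('\n');

console.log(parseHeader(sample));
```

The resulting object could then be stored alongside the extracted sections.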

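How would you parse this so the documents are stored the way I want? Ideally each paragraph would be stored in a separate section according to its topic.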
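One naive starting point I can think of is splitting the already tag-stripped text on "Item N." headings. This is only a sketch under that assumption; `splitByItems` is a hypothetical name, and the pattern would also match table-of-contents entries and in-text cross-references, so it would need refinement.

```javascript
// Hypothetical helper: cut plain text into chunks at "Item 1.", "Item 1A.", etc.
// Assumption: the text has already had its HTML tags removed.
function splitByItems(text) {
  const heading = /\bItem\s+(\d{1,2}[A-Z]?)\./gi;
  const marks = [];
  let m;
  while ((m = heading.exec(text)) !== null) {
    marks.push({ item: m[1].toUpperCase(), start: m.index });
  }
  // Each section runs from its heading to the next heading (or end of text).
  return marks.map((mark, i) => ({
    item: mark.item,
    text: text
      .slice(mark.start, i + 1 < marks.length ? marks[i + 1].start : text.length)
      .trim(),
  }));
}

const plain =
  'Cover page. Item 1. Business Overview Contango is a company. ' +
  'Item 1A. Risk Factors We face risks. Item 2. Properties Offshore.';

for (const section of splitByItems(plain)) {
  console.log(section.item, '=>', section.text);
}
```

Anything before the first heading (the cover page here) is dropped, which may or may not be what is wanted.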

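Note that most of these filings are not exactly alike, but they have a lot in common. Also, this question is not about XBRL or any quantitative data/tables, just the plain text. I'm using NodeJS for this.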

Any help is appreciated.
