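I am trying to scrape filings from the SEC EDGAR database. I can fetch the text file with requests, but when I try to parse it with the code below, I get a parse error. The same code works when I request the .xml URL instead of the .txt URL. The .txt URL returns the following content: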
<SEC-HEADER>0001752724-20-203989.hdr.sgml : 20201001
<ACCEPTANCE-DATETIME>20201001132951
ACCESSION NUMBER: 0001752724-20-203989
CONFORMED SUBMISSION TYPE: NPORT-P
PUBLIC DOCUMENT COUNT: 2
CONFORMED PERIOD OF REPORT: 20200831
FILED AS OF DATE: 20201001
PERIOD START: 20201130
-------------
**
-------------
FORMER COMPANY:
FORMER CONFORMED NAME: ASA LTD
DATE OF NAME CHANGE: 20070301
FORMER COMPANY:
FORMER CONFORMED NAME: ASA BERMUDA LTD
DATE OF NAME CHANGE: 20030505
</SEC-HEADER>
<DOCUMENT>
<TYPE>NPORT-P
<SEQUENCE>1
<FILENAME>primary_doc.xml
<TEXT>
<XML>
<?xml version="1.0" encoding="UTF-8"?><edgarSubmission xmlns="http://www.sec.gov/edgar/nport" xmlns:com="http://www.sec.gov/edgar/common" xmlns:ncom="http://www.sec.gov/edgar/nportcommon" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.sec.gov/edgar/nport eis_NPORT_Filer.xsd">
<headerData>
<submissionType>NPORT-P</submissionType>
<isConfidential>false</isConfidential>
<filerInfo>
<filer>
<issuerCredentials>
<cik>0001230869</cik>
<ccc>XXXXXXXX</ccc>
My code:
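# Imports, assumed to have been run in an earlier cell
import requests
import xml.etree.ElementTree as ET

url = 'https://www.sec.gov/Archives/edgar/data/1230869/0001752724-20-203989.txt'
response = requests.get(url)
root = ET.fromstring(response.content)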
Error:
Traceback (most recent call last):
File "/usr/local/anaconda/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 3326, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-83-cd4e6ed59b34>", line 3, in <module>
root = ET.fromstring(response.content)
File "/usr/local/anaconda/lib/python3.6/xml/etree/ElementTree.py", line 1314, in XML
parser.feed(text)
File "<string>", line unknown
ParseError: not well-formed (invalid token): line 14, column 38
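Judging from the content shown above, the .txt file appears to be an SGML-style wrapper (header lines and unclosed tags such as <ACCEPTANCE-DATETIME>) around the actual XML document, which is likely why ElementTree rejects it as a whole. A minimal sketch of one workaround, assuming each embedded document sits between <XML> and </XML> as in the excerpt, is to cut out those blocks and parse only them. The names XML_BLOCK and extract_xml_roots are hypothetical helpers, not part of requests or ElementTree:

```python
import re
import xml.etree.ElementTree as ET

# Matches each embedded payload between the <XML> and </XML> wrapper tags.
# Assumption: the wrapper tags are upper-case, as in the excerpt above.
XML_BLOCK = re.compile(r"<XML>\s*(.*?)\s*</XML>", re.DOTALL)


def extract_xml_roots(submission_text):
    """Return one parsed root element per <XML>...</XML> block in the text."""
    roots = []
    for match in XML_BLOCK.finditer(submission_text):
        payload = match.group(1)
        # Pass bytes so the encoding declaration in the XML prolog is honored.
        roots.append(ET.fromstring(payload.encode("utf-8")))
    return roots
```

With the request from the question, this would be called as extract_xml_roots(response.text). The elements in the excerpt live in the http://www.sec.gov/edgar/nport namespace, so lookups with find or findall need a namespace mapping, for example {"n": "http://www.sec.gov/edgar/nport"} together with a path such as "n:headerData/n:submissionType".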