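- I have HTML files on my local hard drive, and I am trying to open them in a web page by sending HTTP requests.
- After creating the HTTP request, I try to parse each stored HTML file by passing its URL. (Parsing succeeds when I pass one file at a time, but I want to do this dynamically for every file in a directory, hence the for loop. This does not work.)
Once parsing is done, I save the data to a JSON file. (This works fine.) I have pasted the code here: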

    import json
    import os
    from newspaper import Article
    import newspaper

    # initiating the server
    server_start = os.system('start "HTTP Server on port 8000" cmd.exe /c {python -m http.server}')
    http_server = 'http://localhost:8000/'
    links = ''
    path = "<path>"
    for f in os.listdir(path):
        if f.endswith('.html'):
            links = http_server + path + f
            blog_post = newspaper.build(links)
            for article in blog_post.articles:
                print(article.url)
            article = Article(links)
            article.download('')
            article.parse()
            data = {"HTML": article.html, "author": article.authors, "title": article.title,
                    "text": article.text, "date": str(article.publish_date)}
            json_data = json.dumps(data)
            with open('data.json', 'w') as outfile:
                json.dump(data, outfile)

Error message:

    ...\newspaper\Scripts\python.exe ".../parsing_newspaper/test1.py"
    [Source parse ERR] http://localhost:8000/.../cnnpolitics-russian.html
    Traceback (most recent call last):
      File "...\newspaper\lib\site-packages\newspaper\parsers.py", line 68, in fromstring
        cls.doc = lxml.html.fromstring(html)
      File "...\newspaper\lib\site-packages\lxml\html\__init__.py", line 876, in fromstring
        doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
      File "...\newspaper\lib\site-packages\lxml\html\__init__.py", line 762, in document_fromstring
        value = etree.fromstring(html, parser, **kw)
      File "src\lxml\lxml.etree.pyx", line 3213, in lxml.etree.fromstring (src\lxml\lxml.etree.c:78994)
      File "src\lxml\parser.pxi", line 1848, in lxml.etree._parseMemoryDocument (src\lxml\lxml.etree.c:118325)
      File "src\lxml\parser.pxi", line 1729, in lxml.etree._parseDoc (src\lxml\lxml.etree.c:116883)
      File "src\lxml\parser.pxi", line 1063, in lxml.etree._BaseParser._parseUnicodeDoc (src\lxml\lxml.etree.c:110870)
      File "src\lxml\parser.pxi", line 595, in lxml.etree._ParserContext._handleParseResultDoc (src\lxml\lxml.etree.c:105093)
      File "src\lxml\parser.pxi", line 706, in lxml.etree._handleParseResult (src\lxml\lxml.etree.c:106801)
      File "src\lxml\parser.pxi", line 646, in lxml.etree._raiseParseError (src\lxml\lxml.etree.c:105947)
      File "<string>", line 0
    lxml.etree.XMLSyntaxError:
    You must `download()` an article before calling `parse()` on it!
    Traceback (most recent call last):
      File ".../test1.py", line 26, in <module>
        article.parse()
      File "...\newspaper\lib\site-packages\newspaper\article.py", line 168, in parse
        raise ArticleException()
    newspaper.article.ArticleException
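
One thing that may be worth checking is how each URL is built inside the loop. The sketch below is only an illustration under stated assumptions, not a confirmed fix: it assumes the HTTP server was started inside the directory that holds the files, so each file is reachable by its name alone under `http://localhost:8000/`. The helper names `html_file_urls` and `json_path_for` are hypothetical and use only the standard library. The second helper gives each page its own output file, since opening `data.json` with mode `'w'` on every pass would keep only the last article.

```python
import os
from urllib.parse import quote


def html_file_urls(directory, base_url="http://localhost:8000/"):
    """Return (file_name, url) pairs for every .html file in `directory`.

    Assumption: the server's working directory is `directory`, so the URL
    needs only the file name, not the local filesystem path.
    """
    pairs = []
    for name in sorted(os.listdir(directory)):
        if name.lower().endswith(".html"):
            # Percent-encode the name so spaces and similar characters are URL-safe.
            pairs.append((name, base_url.rstrip("/") + "/" + quote(name)))
    return pairs


def json_path_for(file_name, out_dir="."):
    """Map 'page.html' to '<out_dir>/page.json' so each article gets its own file."""
    stem, _ext = os.path.splitext(file_name)
    return os.path.join(out_dir, stem + ".json")
```

Each URL returned this way could then be passed to `Article(url)`, followed by `download()` called without arguments and then `parse()`, with the result written to the path from `json_path_for`.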