python - Unable to parse multiple files in a directory

Question

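I have HTML files on my local disk, and I am trying to open them as web pages by sending HTTP requests.
After creating the HTTP request, I try to parse the stored HTML files by passing their URLs. (Parsing succeeds when I pass one file at a time, but I want to do this dynamically for every file in a directory, hence the for loop. This does not work.)

After parsing, I save the data to a JSON file (this part works fine). I have pasted the code here: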

import json
import os
from newspaper import Article
import newspaper

# initiating the server
server_start = os.system('start "HTTP Server on port 8000" cmd.exe /c {python -m http.server}')
http_server = 'http://localhost:8000/'
links = ''
path = "<path>"
for f in os.listdir(path):
    if f.endswith('.html'):
        links = http_server + path + f

    blog_post = newspaper.build(links)

    for article in blog_post.articles:
        print(article.url)

    article = Article(links)
    article.download('')
    article.parse()
    data = {"HTML": article.html, "author": article.authors, "title": article.title, "text": article.text, "date": str(article.publish_date)}

    json_data = json.dumps(data)
    with open('data.json', 'w') as outfile:
        json.dump(data, outfile)

Error message:

...\newspaper\Scripts\python.exe ".../parsing_newspaper/test1.py"
[Source parse ERR] http://localhost:8000/.../cnnpolitics-russian.html
Traceback (most recent call last):
  File "...\newspaper\lib\site-packages\newspaper\parsers.py", line 68, in fromstring
    cls.doc = lxml.html.fromstring(html)
  File "...\newspaper\lib\site-packages\lxml\html\__init__.py", line 876, in fromstring
    doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
  File "...\newspaper\lib\site-packages\lxml\html\__init__.py", line 762, in document_fromstring
    value = etree.fromstring(html, parser, **kw)
  File "src\lxml\lxml.etree.pyx", line 3213, in lxml.etree.fromstring (src\lxml\lxml.etree.c:78994)
  File "src\lxml\parser.pxi", line 1848, in lxml.etree._parseMemoryDocument (src\lxml\lxml.etree.c:118325)
  File "src\lxml\parser.pxi", line 1729, in lxml.etree._parseDoc (src\lxml\lxml.etree.c:116883)
  File "src\lxml\parser.pxi", line 1063, in lxml.etree._BaseParser._parseUnicodeDoc (src\lxml\lxml.etree.c:110870)
  File "src\lxml\parser.pxi", line 595, in lxml.etree._ParserContext._handleParseResultDoc (src\lxml\lxml.etree.c:105093)
  File "src\lxml\parser.pxi", line 706, in lxml.etree._handleParseResult (src\lxml\lxml.etree.c:106801)
  File "src\lxml\parser.pxi", line 646, in lxml.etree._raiseParseError (src\lxml\lxml.etree.c:105947)
  File "", line 0
lxml.etree.XMLSyntaxError:

You must download() an article before calling parse() on it!

Traceback (most recent call last):
  File ".../test1.py", line 26, in <module>
    article.parse()
  File "...\newspaper\lib\site-packages\newspaper\article.py", line 168, in parse
    raise ArticleException()
newspaper.article.ArticleException

score 1 · Accepted Answer

I am not sure whether this helps, but try this:

import json
import os
from newspaper import Article
import newspaper

# initiating the server
server_start = os.system('start "HTTP Server on port 8000" cmd.exe /c {python -m http.server}')
http_server = 'http://localhost:8000/'
links = ''
path = "<path>"
for f in os.listdir(path):
    # Everything below runs only for .html files, so `links` is never
    # empty and never left over from a previous iteration.
    if f.endswith('.html'):
        links = http_server + path + f

        blog_post = newspaper.build(links)

        for article in blog_post.articles:
            print(article.url)

        article = Article(links)
        article.download('')
        article.parse()
        data = {"HTML": article.html, "author": article.authors, "title": article.title, "text": article.text, "date": str(article.publish_date)}

        json_data = json.dumps(data)
        with open('data.json', 'w') as outfile:
            json.dump(data, outfile)

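Otherwise, if the first file does not have an .html extension, you would try to build from an empty string.

And if the first file has an .html extension but the second one does not, you would build the same file (at least) twice.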
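The filtering idea in this answer can also be isolated into a small helper, shown here as a minimal sketch using only the standard library. The function name `html_urls` and its `base_url` parameter are hypothetical, and the sketch assumes the HTTP server was started inside the directory being listed, so that each file is served at the base URL plus its file name. That assumption is not stated in the question.

```python
import os
from urllib.parse import quote, urljoin


def html_urls(directory, base_url="http://localhost:8000/"):
    """Return one URL per .html file in `directory`, in sorted order.

    Assumption: the server's root is `directory`, so a file named
    page.html is reachable at base_url + "page.html".
    """
    urls = []
    for name in sorted(os.listdir(directory)):
        # Skip non-HTML files before building any URL, so an empty or
        # stale link is never handed to the parser.
        if not name.endswith(".html"):
            continue
        # quote() escapes spaces and other characters unsafe in a URL path.
        urls.append(urljoin(base_url, quote(name)))
    return urls
```

Each returned URL could then be passed to `Article(url)` and processed as in the code above. Because the list only ever contains .html files, the two failure cases described in this answer cannot occur.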

score 0 · Accepted Answer

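A checklist to go through before debugging in depth:

Check that the HTML is not empty
Check that the HTML is "well-formed"
Check that the article is not empty
Check that the article was actually downloaded (parse() already does this, but checking it yourself helps you isolate the "problematic" article)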
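As a rough illustration, the checklist could be turned into a helper that reports which check failed for a given article. This is only a sketch: the name `check_article` is hypothetical, and counting start tags with the standard library's `html.parser` is a crude stand-in for a real well-formedness check, not what lxml or newspaper does internally.

```python
from html.parser import HTMLParser


class _TagCounter(HTMLParser):
    """Counts start tags so markup can be told apart from plain text."""

    def __init__(self):
        super().__init__()
        self.start_tags = 0

    def handle_starttag(self, tag, attrs):
        self.start_tags += 1


def check_article(html, text):
    """Return a list of problems; an empty list means all checks passed."""
    # Empty HTML usually means the download step did not produce content.
    if not html or not html.strip():
        return ["html is empty"]

    problems = []
    counter = _TagCounter()
    counter.feed(html)
    counter.close()
    # Crude proxy for "well-formed": at least one tag was recognised.
    if counter.start_tags == 0:
        problems.append("no HTML tags found")
    if not text or not text.strip():
        problems.append("article text is empty")
    return problems
```

Calling something like `check_article(article.html, article.text)` for each file and printing the URL alongside any reported problems would show which file in the directory is the faulty one.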
