python - Is there an .endswith that can open files with suffix variations after ".html?" (e.g. ".html?p=1209401", ".html?p=92030", etc.)

Question

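I am trying to build an HTML parser that takes all the .html and .htm files in a folder and its subfolders, extracts all the HTML tags, and exports CSV and TXT files. I have a folder with subfolders containing many files whose names end in ".html?p=39200" or "index.html?replytocom=5467".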

I want to tell Python to open every file whose name has ".html?" followed by anything (any variation after it), not only plain ".html" files.

I have tried Googling, reading the documentation, and searching Stack Overflow, but could not find a way to solve this. Here is my code so far:

with os.scandir(directory) as it:
    for entry in it:
        if entry.name.endswith(".html") or entry.name.endswith("htm"):

Disclaimer: I am a beginner.

score 1 · Accepted Answer

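You can use str.split() to get the part before the question mark (or the whole file name if it contains no question mark), and match that part against ".html" and ".htm":

import os

with os.scandir(directory) as it:
    for entry in it:
        # Keep only the part before the first '?' (the whole name if there is none).
        name = entry.name.split('?')[0]
        # Note the leading dot in ".htm"; a bare "htm" would also match names like "xhtm".
        if name.endswith((".html", ".htm")):
            print(entry.name)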
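
The question mentions subfolders, and os.scandir only lists a single directory level. As a minimal sketch of the same split-on-'?' idea applied recursively with os.walk (the helper names is_html_name and find_html_files are made up for this illustration, not taken from the answer):

```python
import os


def is_html_name(name):
    """Return True if the part of `name` before the first '?' ends in .html or .htm."""
    base = name.partition("?")[0]  # whole name when there is no '?'
    return base.endswith((".html", ".htm"))


def find_html_files(directory):
    """Yield full paths of matching files under `directory`, including subfolders."""
    for root, _dirs, files in os.walk(directory):
        for name in files:
            if is_html_name(name):
                yield os.path.join(root, name)
```

Each yielded path can then be passed to open() and on to whatever parsing step follows. The check is case-sensitive, so a name such as "PAGE.HTML" would not match unless the name is lowercased first.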

score 1 · Answer

You can check whether the string ".html" appears anywhere in the file name, not only at the end:

import os

with os.scandir(directory) as it:
    for entry in it:
        if ".html" in entry.name:
            print(entry.name)
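
One caveat with a plain substring check, shown here on a few made-up file names (they are illustrative only, not from the question): it also accepts names where ".html" is followed by something other than a query string, and it misses ".htm" files.

```python
# Hypothetical file names used only to illustrate the substring check.
names = [
    "index.html?replytocom=5467",
    "page.htm?p=92030",
    "page.html.bak",
]

# Keep every name that contains ".html" anywhere.
substring_matches = [n for n in names if ".html" in n]
```

Here substring_matches keeps "index.html?replytocom=5467" and "page.html.bak" but drops "page.htm?p=92030", which may or may not be what you want.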

score 0 · Answer

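It can also be done with a regular expression.

import os
import re

with os.scandir(directory) as it:
    for entry in it:
        # Matches names ending in .htm/.html, or containing .htm?/.html? followed by anything.
        if re.match(r'.*?(?:\.html?$|\.html?\?.*)', entry.name) is not None:
            print(entry.name)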
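
For readability, the same idea can be written as a shorter precompiled pattern used with search(). This is a sketch of an equivalent check rather than the answer's exact expression, and the names HTML_NAME and looks_like_html are hypothetical:

```python
import re

# "\.html?" matches ".htm" or ".html" (the "l" is optional);
# "(?:\?|$)" requires a literal "?" or the end of the name right after it.
HTML_NAME = re.compile(r"\.html?(?:\?|$)")


def looks_like_html(name):
    """Return True for names like 'a.html', 'a.htm' or 'a.html?p=1'."""
    return HTML_NAME.search(name) is not None
```

Because the extension must be followed by "?" or the end of the name, a name such as "page.html.bak" is rejected.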
