python - How do I find file names in a block of text using Python?

Question

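I have fetched the HTML of a web page using Python, and now I want to find all of the .CSS files linked to in the header and save each one as its own variable (I know how to do that last part). I tried partitioning, as shown below, but when I run it I get the error "IndexError: string index out of range".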

style = src.partition(".css")
style = style[0].partition('<link href=')
print style[2]
c = 1

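I don't think this is the right way to go about it, so I would appreciate some advice. Thanks in advance. Here is part of the text I need to extract the .CSS files from.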

    <meta name="viewport" content="width=device-width, initial-scale=1.0, minimum-scale=1.0" />

<!--[if gte IE 7]><!-->
<link href="/stylesheets/master.css?1342791430" media="screen, projection" rel="stylesheet" type="text/css" />

<link href="/stylesheets/adapt.css?1342791413" media="screen, projection" rel="stylesheet" type="text/css" />
<!-- <![endif]-->
<link href="/stylesheets/print.css?1342791421" media="print" rel="stylesheet" type="text/css" />
<link href="/apple-touch-icon-precomposed.png" rel="apple-touch-icon-precomposed" />
<link href="http://dribbble.com/shots/popular.rss" rel="alternate" title="RSS" type="application/rss+xml" />

score 4 · Accepted Answer

You should use a regular expression for this. Try the following:

/href="([^"]*\.css[^"]*)/g

Edit

import re

# html is the page source as a string
matches = re.findall(r'href="([^"]*\.css[^"]*)', html)
print(matches)
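As a rough illustration of what this kind of pattern returns, here is a minimal sketch run against three of the `<link>` tags from the question. The names `sample`, `CSS_HREF`, `css_links` and `css_files` are made up for the example, and using `[^"]*` so that a match cannot run past the closing quote is my own choice rather than something the asker's page requires.

```python
import re

# Three of the <link> tags from the question, joined into one string.
sample = '''
<link href="/stylesheets/master.css?1342791430" media="screen, projection" rel="stylesheet" type="text/css" />
<link href="/stylesheets/print.css?1342791421" media="print" rel="stylesheet" type="text/css" />
<link href="/apple-touch-icon-precomposed.png" rel="apple-touch-icon-precomposed" />
'''

# [^"]* keeps each match inside a single double-quoted attribute value.
CSS_HREF = re.compile(r'href="([^"]*\.css[^"]*)"')

css_links = CSS_HREF.findall(sample)

# Drop the cache-busting query string to get the bare file path.
css_files = [link.split('?', 1)[0] for link in css_links]

print(css_links)  # ['/stylesheets/master.css?1342791430', '/stylesheets/print.css?1342791421']
print(css_files)  # ['/stylesheets/master.css', '/stylesheets/print.css']
```

Note that this only recognises lowercase `href="..."` with double quotes and a literal `.css` in the value.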

score 2 · Accepted Answer

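My answer is the same as Jon Clements's answer, but I tested mine and added some explanation.

You should not use a regular expression. You cannot parse HTML with regex. The regex answer may work, but writing a robust solution with lxml is very easy. This approach is guaranteed to return the full href attribute of all <link rel="stylesheet"> tags, and of no other tags.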

from lxml import html

def extract_stylesheets(page_content):
    doc = html.fromstring(page_content)                        # Parse
    return doc.xpath('//head/link[@rel="stylesheet"]/@href')   # Search

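There is no need to check the file name, because the results of the xpath search are already known to be stylesheet links, and there is no guarantee that the file name will have a .css extension anyway. The simple regex only catches one very specific form, whereas the general HTML parser solution also does the right thing in cases like the following, where the regex fails miserably: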

<link REL="stylesheet" hREf = 

     '/stylesheets/print?1342791421'
  media="print"
><!-- link href="/css/stylesheet.css" -->
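To make that failure concrete, here is a small sketch that runs the regex from the accepted answer (written as a raw string) over the snippet above. The variable names `tricky` and `found` are only for illustration.

```python
import re

# The awkward but valid markup from the example above.
tricky = '''<link REL="stylesheet" hREf = 

     '/stylesheets/print?1342791421'
  media="print"
><!-- link href="/css/stylesheet.css" -->'''

# Pattern from the regex answer.
found = re.findall(r'href="(.*\.css[^"]*)', tricky)

# The regex reports the commented-out link and misses the real stylesheet,
# whose attribute is spelled hREf, uses single quotes and has no .css suffix.
print(found)  # ['/css/stylesheet.css']
```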

The lxml approach can also easily be extended to select only stylesheets for a particular medium.
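One way such a media filter might look is sketched below. To keep the example free of third-party dependencies it uses the standard library's `html.parser` instead of lxml, so it is an alternative illustration rather than the code from this answer. `StylesheetCollector`, the `media` parameter and the sample `page` are hypothetical names introduced here.

```python
from html.parser import HTMLParser


class StylesheetCollector(HTMLParser):
    """Collect href values of <link rel="stylesheet"> tags (hypothetical helper)."""

    def __init__(self, media=None):
        super().__init__()
        self.media = media  # e.g. "print"; None accepts every stylesheet
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names and never calls
        # this method for markup inside comments.
        if tag != "link":
            return
        attr = dict(attrs)
        if "stylesheet" not in (attr.get("rel") or "").lower().split():
            return
        if self.media is not None:
            # Simplification: a link without a media attribute is skipped
            # whenever a filter is given.
            media_list = [m.strip().lower() for m in (attr.get("media") or "").split(",")]
            if self.media.lower() not in media_list:
                return
        if attr.get("href"):
            self.hrefs.append(attr["href"])


def extract_stylesheets(page_content, media=None):
    collector = StylesheetCollector(media)
    collector.feed(page_content)
    collector.close()
    return collector.hrefs


page = '''<head>
<link href="/stylesheets/master.css?1342791430" media="screen, projection" rel="stylesheet" type="text/css" />
<link href="/stylesheets/print.css?1342791421" media="print" rel="stylesheet" type="text/css" />
<link href="/apple-touch-icon-precomposed.png" rel="apple-touch-icon-precomposed" />
<!-- <link href="/css/old.css" rel="stylesheet" /> -->
</head>'''

print(extract_stylesheets(page))
print(extract_stylesheets(page, media="print"))
```

With lxml itself, the same idea would presumably be expressed as an extra predicate on the `media` attribute in the xpath expression.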

score 1 · Accepted Answer

For what it's worth, here is a version that uses lxml.html as the parsing library.

Untested:

import lxml.html
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2

sample_html = """<meta name="viewport" content="width=device-width, initial-scale=1.0, minimum-scale=1.0" />

<!--[if gte IE 7]><!-->
<link href="/stylesheets/master.css?1342791430" media="screen, projection" rel="stylesheet" type="text/css" />

<link href="/stylesheets/adapt.css?1342791413" media="screen, projection" rel="stylesheet" type="text/css" />
<!-- <![endif]-->
<link href="/stylesheets/print.css?1342791421" media="print" rel="stylesheet" type="text/css" />
<link href="/apple-touch-icon-precomposed.png" rel="apple-touch-icon-precomposed" />
<link href="http://dribbble.com/shots/popular.rss" rel="alternate" title="RSS" type="application/rss+xml" />
"""

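page = lxml.html.fromstring(sample_html)
# Keep only the path component of each href, which drops query strings such as ?1342791430
link_hrefs = (p.path for p in map(urlparse, page.xpath('//head/link/@href')))
for href in link_hrefs:
    # Compare the text after the last dot with 'css'
    if href.rsplit('.', 1)[-1].lower() == 'css':  # implement smarter error handling here
        pass  # do whatever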
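As one possible take on the "smarter" check, the sketch below uses `posixpath.splitext` on the URL path, so that a path with no dot at all (for example a bare `css`) is not mistaken for a CSS file. `is_css_link` and the `hrefs` list are hypothetical names, and the hrefs are copied from the question's sample.

```python
import posixpath
from urllib.parse import urlparse


def is_css_link(href):
    """Return True when the path part of href ends in .css (hypothetical helper)."""
    path = urlparse(href).path               # drops ?query and #fragment
    extension = posixpath.splitext(path)[1]  # '' when the path has no extension
    return extension.lower() == ".css"


hrefs = [
    "/stylesheets/master.css?1342791430",
    "/apple-touch-icon-precomposed.png",
    "http://dribbble.com/shots/popular.rss",
]

css_hrefs = [h for h in hrefs if is_css_link(h)]
print(css_hrefs)  # ['/stylesheets/master.css?1342791430']
```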
