python - Manually extracting parts of strings contained in a list (parsing)

Question

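I know there are modules that would greatly simplify this, but say I'm running a basic Python install (standard modules only). How would I go about extracting the following?

I have a list. The list holds the contents of a web page, line by line. Here is a mock (unformatted) version of the list for illustration: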

<script>
    link = "/scripts/playlists/1/" + a.id + "/0-5417069212.asx";
<script>

"<a href="/apps/audio/?feedId=11065"><span class="px13">Eastern Metro Area Fire</span>"

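From the strings above I need to extract the following: the feedId (11065), which incidentally is a.id in the code above, plus "/scripts/playlists/1/" and "/0-5417069212.asx". Keeping in mind that each of these lines is just the content of one item in a list, how would I extract this data?

Here is the full list: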

contents = urllib2.urlopen("http://www.radioreference.com/apps/audio/?ctid=5586")

Pseudocode:

from urllib2 import urlopen as getpage
page_contents = getpage("http://www.radioreference.com/apps/audio/?ctid=5586")

feedID        = % in (page_contents.search() for "/apps/audio/?feedId=%")
titleID       = % in (page_contents.search() for "<span class="px13">%</span>")
playlistID    = % in (page_contents.search() for "link = "%" + a.id + "*.asx";")
asxID         = * in (page_contents.search() for "link = "*" + a.id + "%.asx";")

streamURL     = "http://www.radioreference.com/" + playlistID + feedID + asxID + ".asx"

I plan to format it so that streamURL ends up equal to:

http://www.radioreference.com/scripts/playlists/1/11065/0-5417067072.asx

score 0 · Accepted Answer

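I would do this with regular expressions. Python's re module is great!

However, it is easier (and faster) to search a single string containing all of the page text than to search repeatedly, line by line. If you can, call read() on the file-like object you get when opening the URL, rather than readlines() (or iterating over the file object directly). If you can't do that, you can use "\n".join(list_of_strings) to put the lines back together into a single string.

Here is some code that works for your example URL: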
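As a minimal sketch of the joining step, the snippet below rebuilds one string from a list of lines and searches it once. The sample lines are copied from the question rather than fetched from the site, and the names lines, page_text and feed_id are only illustrative.

```python
import re

# Sample lines copied from the question (no trailing newlines).
# Note: lines returned by readlines() already end in "\n", so for
# those "".join(lines) would be the closer equivalent.
lines = [
    '<script>',
    '    link = "/scripts/playlists/1/" + a.id + "/0-5417069212.asx";',
    '<script>',
    '<a href="/apps/audio/?feedId=11065"><span class="px13">Eastern Metro Area Fire</span>',
]

# Rebuild a single string so one search can span the whole page.
page_text = "\n".join(lines)

# Pull out the digits that follow "feedId=".
match = re.search(r'feedId=(\d+)', page_text)
feed_id = match.group(1) if match else None
print(feed_id)
```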

from urllib2 import urlopen
import re

contents = urlopen("http://www.radioreference.com/apps/audio/?ctid=5586").read()

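# Group 1: the playlist path before a.id; group 2: the .asx file name after it.
playlist_pattern = r'link = "([^"]+)" \+ a\.id \+ "([^"]+\.asx)'
# Group 1: the numeric feedId; group 2: the feed title inside the span.
feed_pattern = r'href="/apps/audio/\?feedId=(\d+)"><span class="px13">([^<]+)'
# ".*" (with re.DOTALL below) bridges the lines between the two parts.
pattern = playlist_pattern + ".*" + feed_pattern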

playlist, asx, feed, title = re.search(pattern, contents, re.DOTALL).groups()

streamURL = "http://www.radioreference.com" + playlist + feed + asx

print title
print streamURL

Output:

Eastern Metro Area Fire
http://www.radioreference.com/scripts/playlists/1/11065/0-5417090148.asx
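The code above is written for Python 2 (urllib2 and the print statement). A rough Python 3 sketch of the same approach might look like the following. The helper names fetch_page and extract_stream are hypothetical, decoding the page as UTF-8 is an assumption about the site, and the None check is an addition so that a page without a match does not raise AttributeError.

```python
import re
from urllib.request import urlopen

PLAYLIST_PATTERN = r'link = "([^"]+)" \+ a\.id \+ "([^"]+\.asx)'
FEED_PATTERN = r'href="/apps/audio/\?feedId=(\d+)"><span class="px13">([^<]+)'
COMBINED = re.compile(PLAYLIST_PATTERN + ".*" + FEED_PATTERN, re.DOTALL)


def fetch_page(url):
    """Return the page body as text (UTF-8 is assumed, not verified)."""
    return urlopen(url).read().decode("utf-8", errors="replace")


def extract_stream(page_text):
    """Return (title, stream_url), or None when the patterns do not match."""
    match = COMBINED.search(page_text)
    if match is None:
        return None
    playlist, asx, feed, title = match.groups()
    return title, "http://www.radioreference.com" + playlist + feed + asx


# Usage against the live page would be:
#   extract_stream(fetch_page("http://www.radioreference.com/apps/audio/?ctid=5586"))
# Here the mock lines from the question are used instead of a network call.
sample = "\n".join([
    '<script>',
    '    link = "/scripts/playlists/1/" + a.id + "/0-5417069212.asx";',
    '<script>',
    '<a href="/apps/audio/?feedId=11065"><span class="px13">Eastern Metro Area Fire</span>',
])
result = extract_stream(sample)
print(result)
```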

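It is not strictly necessary to do all of the matching in one go. If you want, you can use playlist_pattern and feed_pattern separately to get the two parts. Splitting either of those parts up any further would be a bit harder, though, because you would start running into extra matches for some of the pieces (there are several identical link = "stuff" sections, for example).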
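Matching the two parts separately could be sketched as follows. This runs against the mock lines from the question instead of the live page, so the .asx number comes from that sample. The variable names are illustrative, and the explicit checks for a failed match are an addition that the original code does not have.

```python
import re

# Mock page text built from the sample lines in the question.
page_text = "\n".join([
    '<script>',
    '    link = "/scripts/playlists/1/" + a.id + "/0-5417069212.asx";',
    '<script>',
    '<a href="/apps/audio/?feedId=11065"><span class="px13">Eastern Metro Area Fire</span>',
])

playlist_pattern = r'link = "([^"]+)" \+ a\.id \+ "([^"]+\.asx)'
feed_pattern = r'href="/apps/audio/\?feedId=(\d+)"><span class="px13">([^<]+)'

# Each search returns the first match for its own pattern, or None.
playlist_match = re.search(playlist_pattern, page_text)
feed_match = re.search(feed_pattern, page_text)

stream_url = None
if playlist_match and feed_match:
    playlist, asx = playlist_match.groups()
    feed, title = feed_match.groups()
    stream_url = "http://www.radioreference.com" + playlist + feed + asx
    print(title)
    print(stream_url)
```

Because neither pattern needs to span lines on its own, re.DOTALL is not required here; it only matters for the ".*" that bridges the two parts in the combined pattern.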
