python - Python: grab all links from HTML and display only the links

Question

I am trying to get the title from a web page using the following statement:

titl1 = re.findall(r'<title>(.*?)</title>',the_webpage)

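Using it, I get ['random webpage example1']. How do I remove the quotes and brackets?

I am also trying to get a set of links that change every hour (which is why I need the wildcard), using links = re.findall(r'(file=(.*?).mp3)',the_webpage).

I get: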

[('file=http://media.kickstatic.com/kickapps/images/3380/audios/944521.mp3', 
  'http://media.kickstatic.com/kickapps/images/3380/audios/944521'), 
 ('file=http://media.kickstatic.com/kickapps/images/3380/audios/944521.mp3', 
  'http://media.kickstatic.com/kickapps/images/3380/audios/944521'), 
 ('file=http://media.kickstatic.com/kickapps/images/3380/audios/944521.mp3', 
  'http://media.kickstatic.com/kickapps/images/3380/audios/944521')]

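How do I get the mp3 links without the file= prefix?

I would also like to download the mp3 files and name them after the website's title, so that the file would be called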

random webpage example1.mp3

How would I do that? I am still learning Python and regular expressions, and this has me a bit stumped.

score 0 · Accepted Answer

For part 1 at least, you can do

>>> mytitle = titl1[0]
>>> print mytitle
random webpage example1

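re.findall returns a list of matched strings, so you just need to take the first item in the list.

Similarly, for the second part, re.findall returns a list of tuples, one tuple per match. You can do this: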

>>> download_links = [href for (discard, href) in links]
>>> print download_links
['http://media.kickstatic.com/kickapps/images/3380/audios/944521', 'http://media.kickstatic.com/kickapps/images/3380/audios/944521', 'http://media.kickstatic.com/kickapps/images/3380/audios/944521']

As for downloading the files, use urllib2 (at least for Python 2.x; I am not sure about Python 3.x). See this question for details.

score 0 · Accepted Answer

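For the first part, titl1 = re.findall(r'<title>(.*?)</title>',the_webpage) returns a list, and when you print a list it is shown with brackets and quotes. So try print titl1[0], if you are sure there will only ever be one match. (You could also try re.search instead.)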
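The re.search alternative mentioned above could look roughly like this in Python 3 syntax. The helper name extract_title and the sample HTML are made up for illustration. Since re.search returns either a match object or None, the sketch handles the missing-title case explicitly.

```python
import re

def extract_title(html):
    """Return the text of the first <title> element, or None if there is none."""
    # re.search returns a match object (or None), not a list.
    # DOTALL lets the title span several lines; IGNORECASE accepts <TITLE>.
    match = re.search(r'<title>(.*?)</title>', html, re.DOTALL | re.IGNORECASE)
    if match is None:
        return None
    return match.group(1).strip()

# Made-up page source for the demo.
sample = '<html><head><title>random webpage example1</title></head></html>'
print(extract_title(sample))
```

Because the function returns a plain string, printing it shows no brackets or quotes.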

For the second part, if you change the re pattern from "(file=(.*?)\.mp3)" to "file=(.*?)\.mp3", you will get only the 'http://linkInThisPart/path/etc/etc' part, to which you then need to add the .mp3 extension.

i.e.

audio_links = [x +'.mp3' for x in re.findall(r'file=(.*?)\.mp3',web_page)]
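To see why dropping the outer parentheses helps: re.findall returns a list of tuples when the pattern contains more than one group, and a list of plain strings when it contains exactly one. A small Python 3 illustration follows, using a made-up page fragment and example.com URLs rather than the real page.

```python
import re

# Made-up fragment standing in for the real page source.
page = 'x file=http://example.com/a/1.mp3&y file=http://example.com/a/2.mp3&z'

# Two groups: each result is a tuple (outer group, inner group).
nested = re.findall(r'(file=(.*?)\.mp3)', page)

# One group: each result is just the captured string.
single = re.findall(r'file=(.*?)\.mp3', page)

# Put the extension back, since it sits outside the group.
audio_links = [link + '.mp3' for link in single]

print(nested[0])
print(audio_links)
```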

To download the files, you may want to look at urllib and urllib2:

import urllib2
url='http://media.kickstatic.com/kickapps/images/3380/audios/944521.mp3'
req=urllib2.Request(url)
# open the output file in binary mode, since mp3 data is not text
temp_file=open('random webpage example1.mp3','wb')
data=urllib2.urlopen(req).read()
temp_file.write(data)
temp_file.close()
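In Python 3, urllib2 no longer exists and its functionality lives in urllib.request, so a roughly equivalent sketch is shown below. The function name download and the file names are illustrative assumptions. The demo uses a local file:// URL so that it runs without network access; an http:// URL would be passed in the same way.

```python
import os
import shutil
import tempfile
import urllib.request
from pathlib import Path

def download(url, dest_path):
    """Stream the resource at url into dest_path (opened in binary mode)."""
    with urllib.request.urlopen(url) as response, open(dest_path, 'wb') as out:
        # Copy in chunks rather than reading the whole file into memory.
        shutil.copyfileobj(response, out)

# Demo against a local file so that no network is needed.
workdir = tempfile.mkdtemp()
source = Path(workdir) / 'source.mp3'
source.write_bytes(b'fake mp3 bytes')

target = os.path.join(workdir, 'random webpage example1.mp3')
download(source.as_uri(), target)
print(os.path.getsize(target))
```

The with statement closes both the response and the output file even if the copy fails partway through.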

score 0 · Accepted Answer

Code:

#!/usr/bin/env python

import re,urllib,urllib2

Url = "http://www.ihiphopmusic.com/music/rick-ross-sixteen-feat-andre-3000"
print Url
print 'test .............'
req = urllib2.Request(Url)
print "1"
response = urllib2.urlopen(req)
print "2"
the_webpage = response.read()
print "3"
titl1 = re.findall(r'<title>(.*?)</title>',the_webpage)
print "4"
a2 = [x +'.mp3' for x in re.findall(r'file=(.*?)\.mp3',the_webpage)]
print "5"
a2 = [x[0][5:] for x in a2]
print "6"
ti = titl1[0]
print ti
print "7"
print a2
print "8"

print "9"
#print the_page
print "10"

req=urllib2.Request(a2)
print "11"
temp_file=open(ti)
print "12"
buffer=urllib2.urlopen(req).read()
print "13"
temp_file.write(buff)
print "14"
temp_file.close()
print "15"
print "16"

Result

http://www.ihiphopmusic.com/music/rick-ross-sixteen-feat-andre-3000
test .............
1
2
3
4
5
6
Rick Ross - Sixteen (feat. Andre 3000)
7
['', '', '']
8
9
10
Traceback (most recent call last):
  File "grub.py", line 29, in <module>
    req=urllib2.Request(a2)
  File "/usr/lib/python2.7/urllib2.py", line 198, in __init__
    self.__original = unwrap(url)
  File "/usr/lib/python2.7/urllib.py", line 1056, in unwrap
    url = url.strip()
AttributeError: 'list' object has no attribute 'strip'
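Reading the traceback together with the printed ['', '', ''] points to several problems in the script. x[0][5:] slices a single character and so always yields an empty string. urllib2.Request is given the whole list a2 rather than one URL string. The output file is opened without 'wb', and buff is never defined (the data was stored in buffer). A hedged Python 3 sketch of the parsing and file-naming part follows. The helper names, the sample HTML and the numbering scheme for multiple files are assumptions for illustration, not part of the original script.

```python
import re

def find_mp3_links(html):
    """Return the mp3 URLs that follow 'file=' in the page, without duplicates."""
    links = [link + '.mp3' for link in re.findall(r'file=(.*?)\.mp3', html)]
    # dict.fromkeys removes repeats while keeping the original order.
    return list(dict.fromkeys(links))

def output_names(title, count):
    """Build one file name per link from the page title."""
    # Replace characters that are not allowed in file names on common systems.
    safe = re.sub(r'[\\/:*?"<>|]', '_', title).strip()
    if count == 1:
        return [safe + '.mp3']
    return ['%s (%d).mp3' % (safe, i) for i in range(1, count + 1)]

# Made-up page source that mimics the repeated links seen in the question.
html = ('<title>Rick Ross - Sixteen (feat. Andre 3000)</title>'
        'file=http://example.com/audios/944521.mp3&a=1 '
        'file=http://example.com/audios/944521.mp3&a=2')

title = re.findall(r'<title>(.*?)</title>', html)[0]
links = find_mp3_links(html)
for url, name in zip(links, output_names(title, len(links))):
    # Each url is a single string, so it can be passed to a download call
    # one at a time, with the output file opened as open(name, 'wb').
    print(url, '->', name)
```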

score 0 · Accepted Answer

Python 3:

import requests
import re
from urllib.request import urlretrieve

- First, get the HTML text

# 'url' is a placeholder for the page address; .text gives the body as a string
html_text=requests.get('url').text

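- Use a regular expression to find the URLs

The call has the form re.match('pattern', 'text', flags).

In the pattern, '()' groups the part you want. In this case we group 'http://*****.mp3', and on a match object you can refer to it with .group(1) or .groups(). re.findall returns the grouped part directly.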
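A small Python 3 illustration of the grouping described above, using a made-up string: on a match object, group(0) is the whole match, group(1) is the first parenthesized part, and groups() returns all groups as a tuple.

```python
import re

# Made-up text standing in for a fragment of the page source.
text = 'player?file=http://media.example.com/audios/1.mp3&autostart=1'

match = re.search(r'file=(http://.*?\.mp3)', text)

whole = match.group(0)       # everything the pattern matched, including 'file='
url = match.group(1)         # only the part inside the parentheses
all_groups = match.groups()  # tuple of every group, here a 1-tuple

print(url)
```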

url_matches=re.findall(r'file=(http://media.*?\.mp3)',html_text)
index=0
for url_match in url_matches:
    index += 1
    print(url_match)
    # the target directory ./graber/mp3/ must already exist
    urlretrieve(url_match, './graber/mp3/user' + str(index) + '.mp3')

That is how I did it; I hope it helps. (There are several ways to download things; in this case I used urlretrieve.)
