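I want to extract all anchor tags from an HTML page. I am using this on Linux:
lynx --source http://www.imdb.com | egrep "<a[^>]*>"
But it does not work as expected, because the output contains things I don't want, such as
<a class="amazon-affiliate-site-name" href="http://www.fabric.com">Fabric</a><br>
I only want
<a href >...</a>
Is there a good way to do this?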
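If your grep has a -P option that accepts PCRE patterns, you should be able to use a better regex. Sometimes a minimal quantifier such as *? helps. Also, you are getting the whole input line, not just the match itself; if your grep has the -o option, it will print only the part that matched.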
grep -Po '<a[^<>]*>'
If your grep has neither of those options, try
perl -00 -nle 'print $1 while /(<a[^<>]*>)/gi'
The perl version also matches tags that span line boundaries.
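For comparison, here is a minimal Python sketch of the same "print only the match" idea. The `\b` after `<a` is my own addition, not part of the patterns above: without it, `<a[^<>]*>` also matches tags such as `<abbr>` and `<address>`. The function name and the sample string are made up for illustration.

```python
import re

# Same idea as the grep/perl one-liners: match opening <a ...> tags only.
# \b stops <abbr> and <address> from matching; [^<>]* may span newlines.
ANCHOR_OPEN = re.compile(r"<a\b[^<>]*>", re.IGNORECASE)


def anchor_open_tags(html: str) -> list[str]:
    """Return every opening anchor tag found in html, in document order."""
    return ANCHOR_OPEN.findall(html)


# Hypothetical input: one <abbr>, one anchor split over two lines, one normal anchor.
sample = '<abbr title="x">X</abbr> <A\n  href="/a">one</A><br><a class="c" href="/b">two</a>'

for tag in anchor_open_tags(sample):
    print(repr(tag))
```

This is still a regex, so it inherits the limits discussed in the rest of this thread.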
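Doing real parsing of HTML takes regexes far more complex than anything you would want to type on the command line. Here is one example, and here is another. They may not convince you to try a non-regex approach, but they should at least show you how much harder the general case is than a specific one.
This answer explains why all things are possible, but not all things are expedient.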
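To make the "general case is harder" point concrete, here is a small sketch that runs a simple tag regex and Python's standard-library `html.parser` over the same input. The input string is made up: it contains a commented-out link and a `>` inside a quoted attribute value. The class name `AnchorHrefs` is hypothetical.

```python
import re
from html.parser import HTMLParser


class AnchorHrefs(HTMLParser):
    """Collect the href value of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        if tag == "a":
            self.hrefs.extend(v for k, v in attrs if k == "href" and v is not None)


# Hypothetical input: a link inside a comment, then a tag with '>' in an attribute.
tricky = '<!-- <a href="/old">gone</a> --><a title="a > b" href="/real">x</a>'

regex_hits = re.findall(r"<a\b[^<>]*>", tricky)
print(regex_hits)

parser = AnchorHrefs()
parser.feed(tricky)
parser.close()
print(parser.hrefs)
```

The regex reports the commented-out link and cuts the second tag off at the `>` inside `title`, so its `href` is lost; the parser skips the comment and returns only `/real`.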
Why can't you use an option like --dump?
lynx --dump --listonly http://www.imdb.com
Here are some examples of why you shouldn't use regular expressions to parse HTML.
To extract the 'href' attribute values of the anchor tags, run:
$ python3 -c'import sys, lxml.html as h
> root = h.parse(sys.argv[1]).getroot()
> root.make_links_absolute(base_url=sys.argv[1])
> print("\n".join(root.xpath("//a/@href")))' http://imdb.com | sort -u
Install the lxml module if needed: $ sudo apt-get install python3-lxml
Output:
http://askville.amazon.com http://idfilm.blogspot.com/2011/02/another-class.html http://imdb.com http://imdb.com/ http://imdb.com/a2z http://imdb.com/a2z/ http://imdb.com/advertising/ http://imdb.com/boards/ http://imdb.com/chart/ http://imdb.com/chart/top http://imdb.com/czone/ http://imdb.com/features/hdgallery http://imdb.com/features/oscars/2011/ http://imdb.com/features/sundance/2011/ http://imdb.com/features/video/ http://imdb.com/features/video/browse/ http://imdb.com/features/video/trailers/ http://imdb.com/features/video/tv/ http://imdb.com/features/yearinreview/2010/ http://imdb.com/genre http://imdb.com/help/ http://imdb.com/helpdesk/contact http://imdb.com/help/show_article?conditions http://imdb.com/help/show_article?rssavailable http://imdb.com/jobs http://imdb.com/lists http://imdb.com/media/index/rg2392693248 http://imdb.com/media/rm3467688448/rg2392693248 http://imdb.com/media/rm3484465664/rg2392693248 http://imdb.com/media/rm3719346688/rg2392693248 http://imdb.com/mymovies/list http://imdb.com/name/nm0000207/ http://imdb.com/name/nm0000234/ http://imdb.com/name/nm0000631/ http://imdb.com/name/nm0000982/ http://imdb.com/name/nm0001392/ http://imdb.com/name/nm0004716/ http://imdb.com/name/nm0531546/ http://imdb.com/name/nm0626362/ http://imdb.com/name/nm0742146/ http://imdb.com/name/nm0817980/ http://imdb.com/name/nm2059117/ http://imdb.com/news/ http://imdb.com/news/celebrity http://imdb.com/news/movie http://imdb.com/news/ni7650335/ http://imdb.com/news/ni7653135/ http://imdb.com/news/ni7654375/ http://imdb.com/news/ni7654598/ http://imdb.com/news/ni7654810/ http://imdb.com/news/ni7655320/ http://imdb.com/news/ni7656816/ http://imdb.com/news/ni7660987/ http://imdb.com/news/ni7662397/ http://imdb.com/news/ni7665028/ http://imdb.com/news/ni7668639/ http://imdb.com/news/ni7669396/ http://imdb.com/news/ni7676733/ http://imdb.com/news/ni7677253/ http://imdb.com/news/ni7677366/ http://imdb.com/news/ni7677639/ http://imdb.com/news/ni7677944/ 
http://imdb.com/news/ni7678014/ http://imdb.com/news/ni7678103/ http://imdb.com/news/ni7678225/ http://imdb.com/news/ns0000003/ http://imdb.com/news/ns0000018/ http://imdb.com/news/ns0000023/ http://imdb.com/news/ns0000031/ http://imdb.com/news/ns0000128/ http://imdb.com/news/ns0000136/ http://imdb.com/news/ns0000141/ http://imdb.com/news/ns0000195/ http://imdb.com/news/ns0000236/ http://imdb.com/news/ns0000344/ http://imdb.com/news/ns0000345/ http://imdb.com/news/ns0004913/ http://imdb.com/news/top http://imdb.com/news/tv http://imdb.com/nowplaying/ http://imdb.com/photo_galleries/new_photos/2010/ http://imdb.com/poll http://imdb.com/privacy http://imdb.com/register/login http://imdb.com/register/?why=footer http://imdb.com/register/?why=mymovies_footer http://imdb.com/register/?why=personalize http://imdb.com/rg/NAV_TWITTER/NAV_EXTRA/http://www.twitter.com/imdb http://imdb.com/ri/TRAILERS_HPPIRATESVID/TOP_BUCKET/102785/video/imdb/vi161323033/ http://imdb.com/search http://imdb.com/search/ http://imdb.com/search/name?birth_monthday=02-12 http://imdb.com/search/title?sort=num_votes,desc&title_type=feature&my_ratings=exclude http://imdb.com/sections/dvd/ http://imdb.com/sections/horror/ http://imdb.com/sections/indie/ http://imdb.com/sections/tv/ http://imdb.com/showtimes/ http://imdb.com/tiger_redirect?FT_LIC&licensing/ http://imdb.com/title/tt0078748/ http://imdb.com/title/tt0279600/ http://imdb.com/title/tt0377981/ http://imdb.com/title/tt0881320/ http://imdb.com/title/tt0990407/ http://imdb.com/title/tt1034389/ http://imdb.com/title/tt1265990/ http://imdb.com/title/tt1401152/ http://imdb.com/title/tt1411238/ http://imdb.com/title/tt1411238/trivia http://imdb.com/title/tt1446714/ http://imdb.com/title/tt1452628/ http://imdb.com/title/tt1464174/ http://imdb.com/title/tt1464540/ http://imdb.com/title/tt1477837/ http://imdb.com/title/tt1502404/ http://imdb.com/title/tt1504320/ http://imdb.com/title/tt1563069/ http://imdb.com/title/tt1564367/ 
http://imdb.com/title/tt1702443/ http://imdb.com/tvgrid/ http://m.imdb.com http://pro.imdb.com/r/IMDbTabNB/ http://resume.imdb.com http://resume.imdb.com/ https://secure.imdb.com/register/subscribe?c=a394d4442664f6f6475627 http://twitter.com/imdb http://wireless.amazon.com http://www.3news.co.nz/The-Hobbit-media-conference--full-video/tabid/312/articleID/198020/Default.aspx http://www.amazon.com/exec/obidos/redirect-home/internetmoviedat http://www.audible.com http://www.boxofficemojo.com http://www.dpreview.com http://www.endless.com http://www.fabric.com http://www.imdb.com/board/bd0000089/threads/ http://www.imdb.com/licensing/ http://www.imdb.com/media/rm1037220352/rg261921280 http://www.imdb.com/media/rm2695346688/tt1449283 http://www.imdb.com/media/rm3987585536/tt1092026 http://www.imdb.com/name/nm0000092/ http://www.imdb.com/photo_galleries/new_photos/2010/index http://www.imdb.com/search/title?sort=num_votes,desc&title_type=tv_series&my_ratings=exclude http://www.imdb.com/sections/indie/ http://www.imdb.com/title/tt0079470/ http://www.imdb.com/title/tt0079470/quotes?qt0471997 http://www.imdb.com/title/tt1542852/ http://www.imdb.com/title/tt1606392/ http://www.imdb.de http://www.imdb.es http://www.imdb.fr http://www.imdb.it http://www.imdb.pt http://www.movieline.com/2011/02/watch-jon-hamm-talk-butthole-surfers-paul-rudd-impersonate-jay-leno-at-book-reading-1.php http://www.movingimagesource.us/articles/un-tv-20110210 http://www.npr.org/blogs/monkeysee/2011/02/10/133629395/james-franco-recites-byron-to-the-worlds-luckiest-middle-school-journalist http://www.nytimes.com/2011/02/06/books/review/Brubach-t.html http://www.shopbop.com/welcome http://www.smallparts.com http://www.twinpeaks20.com/details/ http://www.twitter.com/imdb http://www.vanityfair.com/hollywood/features/2011/03/lauren-bacall-201103 http://www.warehousedeals.com http://www.withoutabox.com http://www.zappos.com
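If installing lxml is not an option, roughly the same result can be sketched with the standard library alone. This is an approximation, not a drop-in replacement: it resolves relative links with `urllib.parse.urljoin` against a base URL you pass in, ignores any `<base>` element, and leaves fetching the page out. The names `LinkCollector` and `absolute_links` and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute href values from <a> tags (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()  # a set gives the de-duplication that `sort -u` did

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                self.links.add(urljoin(self.base_url, value))


def absolute_links(html, base_url):
    """Return the sorted, unique, absolute links found in html."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    collector.close()
    return sorted(collector.links)


# Made-up page fragment; in practice html would come from the fetched page.
page = (
    '<a href="/chart/top">Top</a> '
    '<a class="x" href="http://www.fabric.com">Fabric</a> '
    '<a href="/chart/top">again</a> '
    '<a name="top">no href</a>'
)

for link in absolute_links(page, "http://imdb.com"):
    print(link)
```

For the fragment above this prints `http://imdb.com/chart/top` and `http://www.fabric.com`, one per line.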
Try grep -Eo:
$ echo '<a class="amazon-affiliate-site-name" href="http://www.fabric.com">Fabric</a><br>' | grep -Eo '<a[^>]*>'
<a class="amazon-affiliate-site-name" href="http://www.fabric.com">
But do read the answer that MAK linked to.
To extract the 'href' attribute values of the anchor tags, you can also use xmlstarlet after converting the HTML to XHTML with HTML Tidy (the Mac OS X version released 25 March 2009):
curl -s www.imdb.com |
tidy -q -c -wrap 0 -numeric -asxml -utf8 --merge-divs yes --merge-spans yes 2>/dev/null |
xmlstarlet sel -N x="http://www.w3.org/1999/xhtml" -t -m "//x:a/@href" -v '.' -n |
grep '^[[:space:]]*http://' | sort -u | nl
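The last stage of that pipeline (`grep '^[[:space:]]*http://' | sort -u | nl`) keeps only links starting with `http://`, de-duplicates them, and numbers them. A rough Python sketch of just that stage, assuming you already have the href values in a list; the function name and the sample values are made up. Note that, like the grep pattern, it drops `https://` and relative links, and that Python sorts by code point, which may differ from `sort` under some locales.

```python
def number_http_links(hrefs):
    """Roughly mimic: grep '^[[:space:]]*http://' | sort -u | nl"""
    # Keep values that start with http:// after optional leading whitespace.
    kept = {h for h in hrefs if h.lstrip().startswith("http://")}
    # nl's default format: number right-aligned in 6 columns, then a tab.
    return [f"{i:6d}\t{h}" for i, h in enumerate(sorted(kept), start=1)]


# Hypothetical href values for illustration.
hrefs = [
    "/chart/top",
    "http://www.imdb.com/b",
    "https://secure.imdb.com/x",
    "http://imdb.com/a",
    "http://www.imdb.com/b",
]

for line in number_http_links(hrefs):
    print(line)
```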
On Mac OS X, you can also use the command-line tool linkscraper:
linkscraper http://www.imdb.com