3

抱歉,如果这是重复的..我不清楚 SO 上已有的内容如何执行此特定任务..

我的目标是在一些 html 代码中找到压缩文件的文件名。文件名位于<a href=...>html 块内,因此很容易被人找到。

这是一些代码来重现我正在查看的内容:

# character vector with two strings from my html file
string.examples <-
    c("ANES Time Series Cumulative Data File</b><br /><a href=\"../cdf/cdf.htm\"> Study Page</a>&nbsp; | &nbsp;<a href=\"../cdf/cdf_errata.htm\">Errata</a>&nbsp; | &nbsp;<a href=\"../data/cdf/anes_cdf.zip\" onClick=\"javascript: _gaq.push(['_trackPageview','/downloads/cdf-ascii']);\">Download ascii data files  <img src=\"../../images/zip.jpg\" border=\"0\" width=\"23\" height=\"13\" /></a>&nbsp; | &nbsp;<a href=\"../data/cdf/anes_cdfpor.zip\" onClick=\"javascript: _gaq.push(['_trackPageview','/downloads/cdf-por']);\">Download .por file  <img src=\"../../images/zip.jpg\" border=\"0\" width=\"23\" height=\"13\" /></a>&nbsp; | &nbsp;<a href=\"../data/cdf/anes_cdfdta.zip\" onClick=\"javascript: _gaq.push(['_trackPageview','/downloads/cdf-dta']);\">Download .dta file <img src=\"../../images/zip.jpg\" border=\"0\" width=\"23\" height=\"13\" /></a>&nbsp; | &nbsp;August 25, 2011 version </td></tr>", 
    "ANES 2012 Time Series Study</b><br /><a href=\"../anes_timeseries_2012/anes_timeseries_2012.htm\">Study Page</a>&nbsp; | &nbsp;<a href=\"../anes_timeseries_2012/anes_timeseries_2012_errata.htm\">Errata</a>&nbsp; |  &nbsp;<a href=\"../data/anes_timeseries_2012/anes2012TS.zip\" onClick=\"javascript: _gaq.push(['_trackPageview','/downloads/2012TS-ascii']);\">Download ascii data files <img src=\"../../images/zip.jpg\" border=\"0\" width=\"23\" height=\"13\" /></a>&nbsp; | &nbsp;<a href=\"../data/anes_timeseries_2012/anes2012TS_sav.zip\" onClick=\"javascript: _gaq.push(['_trackPageview','/downloads/2012TS-sav']);\">Download .sav file</a> <a href=\"../data/anes_timeseries_2012/anes2012TS_sav.zip\"><img src=\"../../images/zip.jpg\" border=\"0\" width=\"23\" height=\"13\" /></a>&nbsp; | &nbsp;<a href=\"../data/anes_timeseries_2012/anes2012TS_dta.zip\" onClick=\"javascript: _gaq.push(['_trackPageview','/downloads/2012TS-dta']);\">Download .dta file <img src=\"../../images/zip.jpg\" border=\"0\" width=\"23\" height=\"13\" /></a>&nbsp; | &nbsp;July 1, 2013 version<br />"
)

深埋在第一行,有文字<a href=\"../data/cdf/anes_cdfdta.zip\" ,在第二行,有文字<a href=\"../data/anes_timeseries_2012/anes2012TS_dta.zip\"

从这两行中,我想提取../data/cdf/anes_cdfdta.zip并且 ../data/anes_timeseries_2012/anes2012TS_dta.zip因为它们包含文本dta.zip并且因为它们以开头<a href=\"然后以结尾\"

我想要一些东西:

x <- some.regex.function( string.examples )

产生一个长度为 2 的字符向量。

> x
[1] "../data/cdf/anes_cdfdta.zip"                     "../data/anes_timeseries_2012/anes2012TS_dta.zip"
4

2 回答 2

3

在这里,我假设您要查找的模式a href=\"dta.zip. 所以想法是使用贪婪搜索来遍历所有a href直到dta.zip. 此外,我们捕获每个部分并将搜索到的字符串替换为所需的捕获。

gsub("(.*a href=\\\")(.*dta\\.zip)(.*)$", "\\2", string.examples)

.*a href=\\\"如前所述,“贪婪”搜索模式(必须转义 \ 和“)。然后通过执行,.*data\\.zip我们将贪婪搜索限制为不超出我们需要的点。这也是我们感兴趣的模式。所以,我们确保也捕获它。然后剩下的就很明显了。替换模式是第二个捕获。

于 2013-07-21T17:48:27.080 回答
2

描述

这个正则表达式将:

  • 找到值以结尾的锚标记href值dta.zip
  • 避免有问题的边缘情况

<a(?=\s|>)(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\shref=\\(['"]?)((?:(?!\1(?:\s|\/>|>)).)*dta\.zip)\\)(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*>.*?<\/a>

例子

示例文本

注意第一行有一些困难的边缘情况

<a onmouseup="" onmouseover=' href=\"../data/anes_timeseries_2012/DontFindMe_dta.zip\" ; if (6 > x) { funRotate(href); } ' href=\"../data/anes_timeseries_2012/DifficultToFind_dta.zip\" onClick=\"javascript: _gaq.push(['_trackPageview','/downloads/2012TS-dta']);\">

"ANES Time Series Cumulative Data File</b><br /><a href=\"../cdf/cdf.htm\"> Study Page</a>&nbsp; | &nbsp;<a href=\"../cdf/cdf_errata.htm\">Errata</a>&nbsp; | &nbsp;<a href=\"../data/cdf/anes_cdf.zip\" onClick=\"javascript: _gaq.push(['_trackPageview','/downloads/cdf-ascii']);\">Download ascii data files  <img src=\"../../images/zip.jpg\" border=\"0\" width=\"23\" height=\"13\" /></a>&nbsp; | &nbsp;<a href=\"../data/cdf/anes_cdfpor.zip\" onClick=\"javascript: _gaq.push(['_trackPageview','/downloads/cdf-por']);\">Download .por file  <img src=\"../../images/zip.jpg\" border=\"0\" width=\"23\" height=\"13\" /></a>&nbsp; | &nbsp;<a href=\"../data/cdf/anes_cdfdta.zip\" onClick=\"javascript: _gaq.push(['_trackPageview','/downloads/cdf-dta']);\">Download .dta file <img src=\"../../images/zip.jpg\" border=\"0\" width=\"23\" height=\"13\" /></a>&nbsp; | &nbsp;August 25, 2011 version </td></tr>", 
    "ANES 2012 Time Series Study</b><br /><a href=\"../anes_timeseries_2012/anes_timeseries_2012.htm\">Study Page</a>&nbsp; | &nbsp;<a href=\"../anes_timeseries_2012/anes_timeseries_2012_errata.htm\">Errata</a>&nbsp; |  &nbsp;<a href=\"../data/anes_timeseries_2012/anes2012TS.zip\" onClick=\"javascript: _gaq.push(['_trackPageview','/downloads/2012TS-ascii']);\">Download ascii data files <img src=\"../../images/zip.jpg\" border=\"0\" width=\"23\" height=\"13\" /></a>&nbsp; | &nbsp;<a href=\"../data/anes_timeseries_2012/anes2012TS_sav.zip\" onClick=\"javascript: _gaq.push(['_trackPageview','/downloads/2012TS-sav']);\">Download .sav file</a> <a href=\"../data/anes_timeseries_2012/anes2012TS_sav.zip\"><img src=\"../../images/zip.jpg\" border=\"0\" width=\"23\" height=\"13\" /></a>&nbsp; | &nbsp;<a href=\"../data/anes_timeseries_2012/anes2012TS_dta.zip\" onClick=\"javascript: _gaq.push(['_trackPageview','/downloads/2012TS-dta']);\">Download .dta file <img src=\"../../images/zip.jpg\" border=\"0\" width=\"23\" height=\"13\" /></a>&nbsp; | &nbsp;July 1, 2013 version<br />ac

火柴

[0][0] = <a onmouseup="" onmouseover=' href=\"../data/anes_timeseries_2012/DontFindMe_dta.zip\" ; if (6 > x) { funRotate(href); } ' href=\"../data/anes_timeseries_2012/DifficultToFind_dta.zip\" onClick=\"javascript: _gaq.push(['_trackPageview','/downloads/2012TS-dta']);\">

"ANES Time Series Cumulative Data File</b><br /><a href=\"../cdf/cdf.htm\"> Study Page</a>
[0][1] = "
[0][2] = ../data/anes_timeseries_2012/DifficultToFind_dta.zip


[1][0] = <a href=\"../data/cdf/anes_cdfdta.zip\" onClick=\"javascript: _gaq.push(['_trackPageview','/downloads/cdf-dta']);\">Download .dta file <img src=\"../../images/zip.jpg\" border=\"0\" width=\"23\" height=\"13\" /></a>
[1][1] = "
[1][2] = ../data/cdf/anes_cdfdta.zip


[2][0] = <a href=\"../data/anes_timeseries_2012/anes2012TS_dta.zip\" onClick=\"javascript: _gaq.push(['_trackPageview','/downloads/2012TS-dta']);\">Download .dta file <img src=\"../../images/zip.jpg\" border=\"0\" width=\"23\" height=\"13\" /></a>
[2][1] = "
[2][2] = ../data/anes_timeseries_2012/anes2012TS_dta.zip
于 2013-07-21T17:59:08.573 回答