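Apologies if this is a duplicate. I'm not clear on how the answers already on SO can be applied to this specific task.
My goal is to find the file names of zipped files within some HTML code. The file names sit inside <a href=...>
HTML blocks, so they should be easy to locate.
Here is some code to reproduce what I'm looking at: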
# character vector with two strings from my html file
string.examples <-
c("ANES Time Series Cumulative Data File</b><br /><a href=\"../cdf/cdf.htm\"> Study Page</a> | <a href=\"../cdf/cdf_errata.htm\">Errata</a> | <a href=\"../data/cdf/anes_cdf.zip\" onClick=\"javascript: _gaq.push(['_trackPageview','/downloads/cdf-ascii']);\">Download ascii data files <img src=\"../../images/zip.jpg\" border=\"0\" width=\"23\" height=\"13\" /></a> | <a href=\"../data/cdf/anes_cdfpor.zip\" onClick=\"javascript: _gaq.push(['_trackPageview','/downloads/cdf-por']);\">Download .por file <img src=\"../../images/zip.jpg\" border=\"0\" width=\"23\" height=\"13\" /></a> | <a href=\"../data/cdf/anes_cdfdta.zip\" onClick=\"javascript: _gaq.push(['_trackPageview','/downloads/cdf-dta']);\">Download .dta file <img src=\"../../images/zip.jpg\" border=\"0\" width=\"23\" height=\"13\" /></a> | August 25, 2011 version </td></tr>",
"ANES 2012 Time Series Study</b><br /><a href=\"../anes_timeseries_2012/anes_timeseries_2012.htm\">Study Page</a> | <a href=\"../anes_timeseries_2012/anes_timeseries_2012_errata.htm\">Errata</a> | <a href=\"../data/anes_timeseries_2012/anes2012TS.zip\" onClick=\"javascript: _gaq.push(['_trackPageview','/downloads/2012TS-ascii']);\">Download ascii data files <img src=\"../../images/zip.jpg\" border=\"0\" width=\"23\" height=\"13\" /></a> | <a href=\"../data/anes_timeseries_2012/anes2012TS_sav.zip\" onClick=\"javascript: _gaq.push(['_trackPageview','/downloads/2012TS-sav']);\">Download .sav file</a> <a href=\"../data/anes_timeseries_2012/anes2012TS_sav.zip\"><img src=\"../../images/zip.jpg\" border=\"0\" width=\"23\" height=\"13\" /></a> | <a href=\"../data/anes_timeseries_2012/anes2012TS_dta.zip\" onClick=\"javascript: _gaq.push(['_trackPageview','/downloads/2012TS-dta']);\">Download .dta file <img src=\"../../images/zip.jpg\" border=\"0\" width=\"23\" height=\"13\" /></a> | July 1, 2013 version<br />"
)
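Buried deep in the first line is the text <a href=\"../data/cdf/anes_cdfdta.zip\"
and in the second line, the text <a href=\"../data/anes_timeseries_2012/anes2012TS_dta.zip\"
From these two lines, I'd like to extract ../data/cdf/anes_cdfdta.zip
and ../data/anes_timeseries_2012/anes2012TS_dta.zip
because they contain the text dta.zip
and because they start with <a href=\"
and then end with \"
I'd like something like: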
x <- some.regex.function( string.examples )
that produces a character vector of length 2:
> x
[1] "../data/cdf/anes_cdfdta.zip" "../data/anes_timeseries_2012/anes2012TS_dta.zip"
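For what it's worth, the three criteria above (starts with <a href=", contains dta.zip, ends at the closing ") could probably be written as a single base-R pattern. The following is only a minimal sketch under those assumptions, not a tested solution for the full page: `extract.dta.zip` is a hypothetical helper name, and `demo` is a shortened stand-in for `string.examples` so that the block runs on its own.

```r
# hypothetical helper (not from any package): pull href values ending in dta.zip
extract.dta.zip <- function(x) {
  # [^"]* cannot cross a double quote, so each match stays inside one href value
  m <- regmatches(x, gregexpr('<a href="[^"]*dta\\.zip"', x))
  hits <- unlist(m)
  # drop the leading <a href=" and the trailing "
  sub('^<a href="', '', sub('"$', '', hits))
}

# shortened stand-in for string.examples (assumption: same link structure)
demo <- c(
  '<a href="../cdf/cdf.htm"> Study Page</a> | <a href="../data/cdf/anes_cdfpor.zip" onClick="x">por</a> | <a href="../data/cdf/anes_cdfdta.zip" onClick="y">dta</a>',
  '<a href="../data/anes_timeseries_2012/anes2012TS_sav.zip">sav</a> | <a href="../data/anes_timeseries_2012/anes2012TS_dta.zip" onClick="z">dta</a>'
)

x <- extract.dta.zip(demo)
print(x)
```

Note that the result is flattened with unlist(), so an input string with no matching link contributes nothing and a string with several matching links contributes several elements; the length-2 result relies on each string holding exactly one dta.zip link.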