
I'm using wget (from Perl) to fetch web pages from a site. I'm really only interested in the html, htm, php, asp and aspx file types. However, at least one site has links whose file names have no extension/suffix. I need those too.

My command:

wget -A html,htm,php,asp,aspx

works great, except for the links with no suffix.

I've tried a number of regex strings to get the no-suffix pages, but to no avail: wget returns just the main page. So far, the only way to get these files is to open it up to all files (which isn't terrible for this website, but would be terrible for others).

Is there either a regex or a regular option to tell wget I want links with no suffix?


1 Answer


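wget 1.14 appears to support an --accept-regex option that is matched against the complete URL, so in theory something like the following should work (untested). The pattern sticks to POSIX extended syntax, which is wget's default --regex-type, so it uses plain groups rather than (?:...):

wget --accept-regex '/[^.]+(\.(html?|php|aspx?))?$'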
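Since the command is untested, one way to sanity-check the pattern without crawling anything is to run it through grep -E on a few sample URLs. This is only a sketch: it uses a POSIX extended form of the pattern (plain groups instead of `(?:...)`) so that grep -E can run it, the example.com URLs are made up, and `filter_urls` is a hypothetical helper name. grep's matching may also differ from wget's in edge cases.

```shell
#!/bin/sh
# POSIX ERE form of the accept pattern (no non-capturing groups).
PATTERN='/[^.]+(\.(html?|php|aspx?))?$'

# Hypothetical helper: print only the URLs from stdin that the pattern matches.
filter_urls() {
    grep -E "$PATTERN"
}

# Made-up sample URLs: with and without suffix, plus unwanted file types.
printf '%s\n' \
    'http://example.com/about' \
    'http://example.com/docs/index.html' \
    'http://example.com/img/logo.png' \
    'http://example.com/page.aspx' \
    'http://example.com/style.css' | filter_urls
# prints:
#   http://example.com/about
#   http://example.com/docs/index.html
#   http://example.com/page.aspx
```

URLs with query strings (for example `page.php?id=1`) would not match this pattern as written, so it may need loosening for sites that use them.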

Or perhaps it would be easier to reject the extensions you don't want?
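A rough sketch of that reject-based approach, assuming wget's --reject-regex option (the counterpart of --accept-regex). The extension list and the `keep_urls` helper are illustrative assumptions, not a complete list; the wget call is left as a comment so the snippet can be checked offline with grep.

```shell
#!/bin/sh
# Assumed list of unwanted suffixes; adjust for the site being crawled.
REJECT='\.(jpe?g|png|gif|css|js|pdf|zip)$'

# Hypothetical helper: print the URLs from stdin that would NOT be rejected.
keep_urls() {
    grep -Ev "$REJECT"
}

# Made-up sample URLs to preview what survives the reject pattern.
printf '%s\n' \
    'http://example.com/about' \
    'http://example.com/photo.jpeg' \
    'http://example.com/index.php' \
    'http://example.com/site.css' | keep_urls
# prints:
#   http://example.com/about
#   http://example.com/index.php

# The corresponding crawl would look something like this (untested):
#   wget -r --reject-regex "$REJECT" http://example.com/
```

The trade-off is that anything not on the reject list gets downloaded, so this works best when the unwanted file types on a site are few and known.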

answered 2013-09-23T07:24:00.767