mediawiki - Full image URLs for a given Wikipedia page (only the images I see on the page)

Question

I want to extract the full URLs of all images on the Wikipedia page for "Google".

I tried:

http://en.wikipedia.org/w/api.php?action=query&titles=Google&generator=images&gimlimit=10&prop=imageinfo&iiprop=url|dimensions|mime&format=json

However, this also returns images that have nothing to do with Google, for example:

http://upload.wikimedia.org/wikipedia/en/a/a4/Flag_of_the_United_States.svg
http://upload.wikimedia.org/wikipedia/en/4/4a/Commons-logo.svg
http://upload.wikimedia.org/wikipedia/en/4/4a/Commons-logo.svg
http://upload.wikimedia.org/wikipedia/commons/f/fe/Crystal_Clear_app_browser.png

How can I extract only the images that I actually see on the Google page?

score 5 · Accepted Answer

1. Retrieve the page source: https://en.wikipedia.org/w/index.php?title=Google&action=raw
2. Scan it for substrings such as [[File:Google web search.png|thumb|left|On February 14, 2012, Google updated its homepage with a minor twist. There are no red lines above the options in the black bar, and there is a tab space before the "+You". The sign-in button has also changed, it is no longer in the black bar, instead under it as a button.]]
3. Ask the API for all images on the page: http://en.wikipedia.org/w/api.php?action=query&titles=Google&generator=images&gimlimit=10&prop=imageinfo&iiprop=url|dimensions|mime&format=json
4. Keep only the URLs whose file names match the image names found in step 2.
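
As a rough illustration of steps 1 and 3, here is a minimal Python sketch that builds the two request URLs and pulls the full image URLs out of the API response. The helper names (raw_url, images_api_url, fetch, image_urls) and the User-Agent string are my own assumptions, not part of any library, and the JSON layout assumed is the default format=json output, where query.pages is keyed by page id.

```python
import json
import urllib.parse
import urllib.request

BASE = "https://en.wikipedia.org/w"
# Hypothetical User-Agent; replace it with one that identifies your own tool.
HEADERS = {"User-Agent": "image-url-example/0.1 (illustrative sketch)"}


def raw_url(title):
    """URL of the page's wikitext (step 1)."""
    query = urllib.parse.urlencode({"title": title, "action": "raw"})
    return BASE + "/index.php?" + query


def images_api_url(title, limit=10):
    """URL of the API query that lists every image on the page (step 3)."""
    params = {
        "action": "query",
        "titles": title,
        "generator": "images",
        "gimlimit": limit,
        "prop": "imageinfo",
        "iiprop": "url|dimensions|mime",
        "format": "json",
    }
    return BASE + "/api.php?" + urllib.parse.urlencode(params)


def fetch(url):
    """Download a URL and return its body decoded as UTF-8."""
    request = urllib.request.Request(url, headers=HEADERS)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


def image_urls(api_json):
    """Map each file title (e.g. 'File:X.png') to its full URL."""
    pages = json.loads(api_json).get("query", {}).get("pages", {})
    return {
        page["title"]: page["imageinfo"][0]["url"]
        for page in pages.values()
        if page.get("imageinfo")  # skip entries without image info
    }
```

Calling fetch(raw_url("Google")) would give the wikitext to scan in step 2, and image_urls(fetch(images_api_url("Google"))) the candidate URLs for step 4. Note that gimlimit=10 returns at most ten files per request, so a page with more images needs a higher limit or continuation.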

Steps 2 and 4 need more explanation.

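@2. The regular expression /\b(File|Image):[^]|\n\r]+/ should be enough. In Ruby's regular expressions, \b denotes a word boundary, which your language of choice may not support. The regexp matches all the cases I could think of: [[File:something.jpg]]; gallery tags: <gallery>\nFile:one.jpg\nFile:two.jpg\n</gallery>; and templates: {{Infobox|pic = File:something.jpg}}. However, it will not match file names that contain ]. I am not sure whether such names are legal, but if they are, they must be very rare, so this should not be a big deal.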
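
For those not using Ruby, here is a minimal sketch of the same scan in Python. The function name referenced_files and the sample wikitext are made up for illustration. The ] inside the character class is escaped so the pattern reads the same in engines that treat [^] specially (JavaScript, for one).

```python
import re

# Python spelling of /\b(File|Image):[^]|\n\r]+/ with the "]" escaped.
FILE_REF = re.compile(r"\b(?:File|Image):[^\]|\n\r]+")


def referenced_files(wikitext):
    """Return every File:/Image: reference found in the wikitext."""
    return [match.group(0).strip() for match in FILE_REF.finditer(wikitext)]


# Made-up wikitext covering a link, a gallery tag and a template.
sample = (
    "[[File:Google web search.png|thumb|left|Caption]]\n"
    "<gallery>\nFile:one.jpg\nFile:two.jpg\n</gallery>\n"
    "{{Infobox|pic = File:something.jpg}}\n"
)
print(referenced_files(sample))
```

On this sample the result is File:Google web search.png, File:one.jpg, File:two.jpg and File:something.jpg}}. The trailing }} on the template match is expected, because } is not in the excluded set; the character stripping described for step 4 takes care of it.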

If you only want to match constructs like [[File:something.jpg|thumb|description]], the following regexp works better: /\[\[(File|Image):[^]|]+/
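
A short Python sketch of this narrower pattern, under the same caveats: linked_files and the sample text are hypothetical, and I added a capture group so that only the file name comes back, without the leading [[.

```python
import re

# Only explicit [[File:...]] / [[Image:...]] links; the group captures the name.
FILE_LINK = re.compile(r"\[\[((?:File|Image):[^\]|]+)")


def linked_files(wikitext):
    """Return the names used in direct [[File:...]] links only."""
    return FILE_LINK.findall(wikitext)


# Made-up wikitext: one direct link, one gallery entry, one template parameter.
text = (
    "[[File:Google web search.png|thumb|left|Caption]]\n"
    "<gallery>\nFile:one.jpg\n</gallery>\n"
    "{{Infobox|pic = File:something.jpg}}\n"
)
print(linked_files(text))
```

Here only File:Google web search.png is returned; the gallery entry and the template parameter are ignored, which is the point of this variant.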

@4. I would remove every character matching /[^A-Za-z0-9]/ from the matched names. That is easier than escaping them and, in most cases, good enough.
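
One possible way to do that comparison, sketched in Python. This is my reading of the step rather than the only way to do it: both the names from the wikitext and the last path segment of each URL are reduced to letters and digits, then compared. The helper names, the dropping of the File:/Image: prefix, and the upload.example URL are assumptions for illustration.

```python
import re
from urllib.parse import unquote


def normalize(name):
    """Drop the File:/Image: prefix and every non-alphanumeric character."""
    name = re.sub(r"^(?:File|Image):", "", name)
    return re.sub(r"[^A-Za-z0-9]", "", name)


def keep_referenced(urls, names):
    """Keep the URLs whose file name matches a name found in the wikitext."""
    wanted = {normalize(name) for name in names}
    kept = []
    for url in urls:
        # The last path segment of the URL is the percent-encoded file name.
        filename = unquote(url.rsplit("/", 1)[-1])
        if normalize(filename) in wanted:
            kept.append(url)
    return kept


urls = [
    "http://upload.wikimedia.org/wikipedia/en/4/4a/Commons-logo.svg",
    "https://upload.example/Google_web_search.png",  # hypothetical URL
]
names = ["File:Google web search.png", "File:something.jpg}}"]
print(keep_referenced(urls, names))
```

Stripping the characters makes "Google web search.png" and "Google_web_search.png" compare equal, and it also discards stray characters such as a trailing }} picked up by the regexp. The price is that two different names can collapse to the same string, which fits the "good enough in most cases" spirit of this step.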

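Icons are most often included through templates, while pictures related to the article's topic are usually included directly ([[File:…]]). There are exceptions, though: in some articles, pictures are included with the {{Gallery}} template. There is also the <gallery> tag, which introduces a special syntax for galleries. You will have to adjust my solution to your needs, and even then it will not be perfect, but it should be good enough.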
