php - Accessing the main picture of a Wikipedia page via the API

Question

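Is there any way to access the thumbnail picture of any Wikipedia page using an API? I mean the image at the top right of the page, in the box. Is there an API for that?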

score 66 · Accepted Answer

You can use prop=pageimages. For example:

http://en.wikipedia.org/w/api.php?action=query&titles=Al-Farabi&prop=pageimages&format=json&pithumbsize=100

You will get the full URL of the thumbnail.
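As a rough illustration, here is a minimal PHP sketch of building that query and reading the thumbnail URL out of the response. The function names and the sample response (including its page ID and image URL) are made up for the example; only the query parameters come from the answer. It assumes the default JSON layout, where pages are keyed by page ID.

```php
<?php
// Build a pageimages query URL for a title (parameters as in the example above).
function buildPageImageUrl(string $title, int $thumbSize = 100): string
{
    $params = [
        'action'      => 'query',
        'titles'      => $title,
        'prop'        => 'pageimages',
        'format'      => 'json',
        'pithumbsize' => $thumbSize,
    ];
    return 'https://en.wikipedia.org/w/api.php?' . http_build_query($params);
}

// Return the thumbnail URL from a JSON response, or null if no page has one.
function extractThumbnail(string $json): ?string
{
    $data = json_decode($json, true);
    // In the default layout, "pages" is an object keyed by page ID.
    foreach ($data['query']['pages'] ?? [] as $page) {
        if (isset($page['thumbnail']['source'])) {
            return $page['thumbnail']['source'];
        }
    }
    return null;
}

// Hand-written sample response; the ID and URL are placeholders, not real data.
$sample = '{"query":{"pages":{"1":{"pageid":1,"title":"Al-Farabi",'
        . '"thumbnail":{"source":"https://upload.wikimedia.org/example/100px-Example.jpg",'
        . '"width":100,"height":141}}}}}';

echo buildPageImageUrl('Al-Farabi'), "\n";
echo extractThumbnail($sample), "\n";
```

In real use you would fetch the built URL (for example with cURL) and pass the response body to extractThumbnail().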

score 60 · Accepted Answer

http://en.wikipedia.org/w/api.php

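See prop=images.

It returns an array of the image file names used on the parsed page. You then have the option of making another API call to find out the full image URL, e.g.: action=query&titles=Image:INSERT_EXAMPLE_FILE_NAME_HERE.jpg&prop=imageinfo&iiprop=url

or of calculating the URL from the file name's hash.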
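For the hash route, a sketch along these lines should work, assuming the file is hosted on Wikimedia Commons: the two directory levels are the first one and first two hex digits of the MD5 of the file name (with underscores instead of spaces). The function name is invented for the example. Files stored on the local wiki live under a different base path, so treat the result as a guess to verify, not a guarantee.

```php
<?php
// Sketch: derive an upload URL from a file name via its MD5 hash.
// Assumption: the file is on Wikimedia Commons and the name is already in the
// canonical form the API returns (apart from spaces vs. underscores).
function commonsUrlFromFilename(string $filename): string
{
    // Drop a namespace prefix such as "File:" or "Image:".
    $name = preg_replace('/^(File|Image):/i', '', $filename);
    // Stored names use underscores instead of spaces.
    $name = str_replace(' ', '_', $name);
    $hash = md5($name);
    // Directory layout: /<first hex digit>/<first two hex digits>/<name>
    return 'https://upload.wikimedia.org/wikipedia/commons/'
         . $hash[0] . '/' . substr($hash, 0, 2) . '/' . rawurlencode($name);
}

echo commonsUrlFromFilename('File:Example file.jpg'), "\n";
```

The imageinfo call is still the safer option, since it tells you where the file actually lives.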

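Unfortunately, while the array of images returned by prop=images is in the order they are found on the page, the first one is not guaranteed to be the image in the infobox, because a page will sometimes include an image before the infobox (most of the time icons for metadata about the page, e.g. "this article is locked").

Searching the array of images for the first image that includes the page title is probably the best guess for the infobox image.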
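That title-matching guess could look something like the following sketch. The function name and the normalization rule (compare on lowercase ASCII letters and digits only) are my own assumptions, chosen so that a title like "Hans-Ulrich Rudel" can match a file like "File:HansUlrichRudel.jpeg".

```php
<?php
// Heuristic sketch: return the first image title that contains the page title.
function guessInfoboxImage(array $imageTitles, string $pageTitle): ?string
{
    // Compare on lowercase letters and digits only, ignoring spaces,
    // hyphens, underscores and other punctuation.
    $normalize = function (string $s): string {
        return strtolower(preg_replace('/[^A-Za-z0-9]/', '', $s));
    };
    $needle = $normalize($pageTitle);
    if ($needle === '') {
        return null; // nothing usable to compare (e.g. a non-Latin title)
    }
    foreach ($imageTitles as $title) {
        if (strpos($normalize($title), $needle) !== false) {
            return $title;
        }
    }
    return null;
}

$images = [
    'File:BmRKEL.jpg',
    'File:Bundeswehr Kreuz Black.svg',
    'File:HansUlrichRudel.jpeg',
];
var_dump(guessInfoboxImage($images, 'Hans-Ulrich Rudel'));
```

This is only a guess: plenty of infobox images do not carry the article title in their file name, in which case the function returns null.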

score 27 · Accepted Answer

This is a good way to get the main image of a page on Wikipedia:

http://en.wikipedia.org/w/api.php?action=query&prop=pageimages&format=json&piprop=original&titles=India

score 13 · Accepted Answer

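Check out the MediaWiki API example for getting the main picture of a Wikipedia page: https://www.mediawiki.org/wiki/API:Page_info_in_search_results.

As others have mentioned, you would use prop=pageimages in your API query.

If you also want the image description, use prop=pageimages|pageterms in your API query instead.

You can get the original image using piprop=original. Or you can get a thumbnail with a specified width/height. For a thumbnail with width/height=600, use piprop=thumbnail&pithumbsize=600. If you omit either, the image returned by the API will default to a thumbnail with a width/height of 50px.

If you are requesting results in JSON format, you should always use formatversion=2 in your API query (i.e. format=json&formatversion=2), because it makes retrieving the image from the query easier.

Original size image: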

https://en.wikipedia.org/w/api.php?action=query&format=json&formatversion=2&prop=pageimages|pageterms&piprop=original&titles=Albert Einstein

Thumbnail size (600px width/height) image:

https://en.wikipedia.org/w/api.php?action=query&format=json&formatversion=2&prop=pageimages|pageterms&piprop=thumbnail&pithumbsize=600&titles=Albert Einstein
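To show why formatversion=2 is easier to work with, here is a small PHP sketch: with it, "pages" is a plain list, so the first page is simply index 0 and you do not need to know the page ID. The function name and the sample response are invented for the example, and I am assuming the image appears under "original" or "thumbnail" and the description under "terms".

```php
<?php
// Sketch: read the main image (and description, if present) from a
// formatversion=2 response, where "pages" is a list rather than an ID-keyed map.
function mainImageFromV2(string $json): ?array
{
    $data = json_decode($json, true);
    $page = $data['query']['pages'][0] ?? null;
    if ($page === null) {
        return null;
    }
    // piprop=original yields "original"; piprop=thumbnail yields "thumbnail".
    $image = $page['original'] ?? $page['thumbnail'] ?? null;
    if ($image === null) {
        return null; // page has no page image
    }
    return [
        'url'         => $image['source'],
        'description' => $page['terms']['description'][0] ?? null,
    ];
}

// Hand-written sample; the URL and description are placeholders.
$sample = '{"query":{"pages":[{"pageid":1,"title":"Albert Einstein",'
        . '"original":{"source":"https://upload.wikimedia.org/example/Einstein.jpg",'
        . '"width":1000,"height":1300},'
        . '"terms":{"description":["example description"]}}]}}';

print_r(mainImageFromV2($sample));
```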

score 6 · Accepted Answer

Sorry for not answering your question about the main image specifically. But here is some code to get a list of all images:

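function makeCall($url) {
    $curl = curl_init();
    curl_setopt($curl, CURLOPT_URL, $url);
    // Return the response body as a string instead of printing it.
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
    // Follow redirects (e.g. from http to https).
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);
    // Wikimedia asks API clients to identify themselves; this value is a placeholder.
    curl_setopt($curl, CURLOPT_USERAGENT, 'ExampleImageFetcher/0.1 (contact@example.com)');
    $result = curl_exec($curl);
    curl_close($curl);
    return $result;
}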

function wikipediaImageUrls($url) {
    $imageUrls = array();
    $pathComponents = explode('/', parse_url($url, PHP_URL_PATH));
    $pageTitle = array_pop($pathComponents);
    $imagesQuery = "http://en.wikipedia.org/w/api.php?action=query&titles={$pageTitle}&prop=images&format=json";
    $jsonResponse = makeCall($imagesQuery);
    $response = json_decode($jsonResponse, true);
    $imagesKey = key($response['query']['pages']);
    foreach($response['query']['pages'][$imagesKey]['images'] as $imageArray) {
        if($imageArray['title'] != 'File:Commons-logo.svg' && $imageArray['title'] != 'File:P vip.svg') {
            $title = str_replace('File:', '', $imageArray['title']);
            $title = str_replace(' ', '_', $title);
            $imageUrlQuery = "http://en.wikipedia.org/w/api.php?action=query&titles=Image:{$title}&prop=imageinfo&iiprop=url&format=json";
            $jsonUrlQuery = makeCall($imageUrlQuery);
            $urlResponse = json_decode($jsonUrlQuery, true);
            $imageKey = key($urlResponse['query']['pages']);
            $imageUrls[] = $urlResponse['query']['pages'][$imageKey]['imageinfo'][0]['url'];
        }
    }
    return $imageUrls;
}
print_r(wikipediaImageUrls('http://en.wikipedia.org/wiki/Saturn_%28mythology%29'));
print_r(wikipediaImageUrls('http://en.wikipedia.org/wiki/Hans-Ulrich_Rudel'));

I got this for http://en.wikipedia.org/wiki/Saturn_%28mythology%29:

Array
(
    [0] => http://upload.wikimedia.org/wikipedia/commons/1/10/Arch_of_SeptimiusSeverus.jpg
    [1] => http://upload.wikimedia.org/wikipedia/commons/8/81/Ivan_Akimov_Saturn_.jpg
    [2] => http://upload.wikimedia.org/wikipedia/commons/d/d7/Lucius_Appuleius_Saturninus.jpg
    [3] => http://upload.wikimedia.org/wikipedia/commons/2/2c/Polidoro_da_Caravaggio_-_Saturnus-thumb.jpg
    [4] => http://upload.wikimedia.org/wikipedia/commons/b/bd/Porta_Maggiore_Alatri.jpg
    [5] => http://upload.wikimedia.org/wikipedia/commons/6/6a/She-wolf_suckles_Romulus_and_Remus.jpg
    [6] => http://upload.wikimedia.org/wikipedia/commons/4/45/Throne_of_Saturn_Louvre_Ma1662.jpg
)

And for the second URL (http://en.wikipedia.org/wiki/Hans-Ulrich_Rudel):

Array
(
    [0] => http://upload.wikimedia.org/wikipedia/commons/e/e9/BmRKEL.jpg
    [1] => http://upload.wikimedia.org/wikipedia/commons/3/3f/BmRKELS.jpg
    [2] => http://upload.wikimedia.org/wikipedia/commons/2/2c/Bundesarchiv_Bild_101I-655-5976-04%2C_Russland%2C_Sturzkampfbomber_Junkers_Ju_87_G.jpg
    [3] => http://upload.wikimedia.org/wikipedia/commons/6/62/Bundeswehr_Kreuz_Black.svg
    [4] => http://upload.wikimedia.org/wikipedia/commons/9/99/Flag_of_German_Reich_%281935%E2%80%931945%29.svg
    [5] => http://upload.wikimedia.org/wikipedia/en/6/64/HansUlrichRudel.jpeg
    [6] => http://upload.wikimedia.org/wikipedia/commons/8/82/Heinkel_He_111_during_the_Battle_of_Britain.jpg
    [7] => http://upload.wikimedia.org/wikipedia/commons/6/66/Regulation_WW_II_Underwing_Balkenkreuz.png
)

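Note that the URL changes a bit at the 6th element of the second array (it points to /wikipedia/en/ instead of /wikipedia/commons/). This is what @JosephJaber was warning about in the comments above.

Hope this helps someone.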

score 6 · Accepted Answer

Way 1: You can try a query like this:

http://en.wikipedia.org/w/api.php?action=opensearch&limit=5&format=xml&search=italy&namespace=0

In the response, you can see the Image tag.

<Item>
<Text xml:space="preserve">Italy national rugby union team</Text>
<Description xml:space="preserve">
The Italy national rugby union team represent the nation of Italy in the sport of rugby union.
</Description>
<Url xml:space="preserve">
http://en.wikipedia.org/wiki/Italy_national_rugby_union_team
</Url>
<Image source="http://upload.wikimedia.org/wikipedia/en/thumb/4/46/Italy_rugby.png/43px-Italy_rugby.png" width="43" height="50"/>
</Item>

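Way 2: use the query http://en.wikipedia.org/w/index.php?action=render&title=italy

Then you get the raw HTML, from which you can extract the image using something like PHP Simple HTML DOM Parser http://simplehtmldom.sourceforge.net

I don't have time to write it out for you. Just giving you some advice, thanks.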
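A minimal sketch of the HTML-parsing route, using PHP's built-in DOM extension instead of the parser linked above. It assumes the main image sits inside a table whose class includes "infobox"; that class name and the function name are my assumptions, and page markup varies, so expect misses.

```php
<?php
// Sketch: pull the first <img> inside an "infobox" table out of rendered HTML.
function firstInfoboxImage(string $html): ?string
{
    if ($html === '') {
        return null;
    }
    $doc = new DOMDocument();
    // Real-world HTML triggers parser warnings; collect them silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Match tables whose class attribute contains the word "infobox".
    $query = '//table[contains(concat(" ", normalize-space(@class), " "), " infobox ")]//img/@src';
    $nodes = $xpath->query($query);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    $src = $nodes->item(0)->nodeValue;
    // Protocol-relative URLs ("//upload...") need a scheme to be usable.
    return strpos($src, '//') === 0 ? 'https:' . $src : $src;
}

// Hand-written fragment standing in for the action=render output.
$html = '<div><img src="//upload.wikimedia.org/example/icon.png">'
      . '<table class="infobox vcard"><tr><td><a href="#">'
      . '<img src="//upload.wikimedia.org/example/Italy.png"></a></td></tr></table></div>';

echo firstInfoboxImage($html), "\n";
```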

score 6 · Accepted Answer

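I've written some code that gets the main image (full URL) by Wikipedia article title. It's not perfect, but overall I'm very pleased with the results.

The challenge was that when queried for a specific title, Wikipedia returns multiple image file names (without paths). Furthermore, the secondary search (I used the code varatis posted in this thread - thanks!) returns URLs for all images found based on the image file name that was searched, regardless of the original article title. After all this, we may end up with a generic image that is irrelevant to the search, so we filter those out. The code iterates over file names and URLs until it finds (hopefully the best) match... a bit complicated, but it works :)

Note on the generic filter: I've been compiling a list of generic image strings for the isGeneric() function, but the list just keeps growing. I am considering maintaining it as a public list - if there is any interest, let me know.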

protected static $baseurl = "http://en.wikipedia.org/w/api.php";

Main function - get the image URL from a title:

public static function getImageURL($title)
{
    $images = self::getImageFilenameObj($title); // returns JSON object
    if (!$images) return '';

    foreach ($images as $image)
    {
        // get object of image URL for given filename
        $imgjson = self::getFileURLObj($image->title);

        // return first image match
        foreach ($imgjson as $img)
        {
            // get URL for image
            $url = $img->imageinfo[0]->url;

            // no image found               
            if (!$url) continue;

            // filter generic images
            if (self::isGeneric($url)) continue;

            // match found
            return $url;
        }
    }
    // match not found
    return '';          
}

== The following functions are called by the main function above ==

Get JSON object (file names) by title:

public static function getImageFilenameObj($title)
{
    try     // see if page has images
    {
        // get image file name
        $json = json_decode(
            self::retrieveInfo(
                self::$baseurl . '?action=query&titles=' .
                urlencode($title) . '&prop=images&format=json'
            ))->query->pages;

        /** The foreach is only to get around
         *  the fact that we don't have the id.
         */
        foreach ($json as $id) { return $id->images; }
    }
    catch(exception $e) // no images
    {
        return NULL;
    }
}

Get JSON object (URLs) by file name:

public static function getFileURLObj($filename)
{
    try                     // resolve URL from filename
    {
        return json_decode(
            self::retrieveInfo(
                self::$baseurl . '?action=query&titles=' .
                urlencode($filename) . '&prop=imageinfo&iiprop=url&format=json'
            ))->query->pages;
    }
    catch(exception $e)     // no URLs
    {
        return NULL;
    }
}

Filter out generic images:

public static function isGeneric($url)
{
    $generic_strings = array(
        '_gray.svg',
        'icon',
        'Commons-logo.svg',
        'Ambox',
        'Text_document_with_red_question_mark.svg',
        'Question_book-new.svg',
        'Canadese_kano',
        'Wiki_letter_',
        'Edit-clear.svg',
        'WPanthroponymy',
        'Compass_rose_pale',
        'Us-actor.svg',
        'voting_box',
        'Crystal_',
        'transportation_inv',
        'arrow.svg',
        'Quill_and_ink-US.svg',
        'Decrease2.svg',
        'Rating-',
        'template',
        'Nuvola_apps_',
        'Mergefrom.svg',
        'Portal-',
        'Translation_to_',
        '/School.svg',
        'arrow',
        'Symbol_',
        'stub',
        'Unbalanced_scales.svg',
        '-logo.',
        'P_vip.svg',
        'Books-aj.svg_aj_ashton_01.svg',
        'Film',
        '/Gnome-',
        'cap.svg',
        'Missing',
        'silhouette',
        'Star_empty.svg',
        'Music_film_clapperboard.svg',
        'IPA_Unicode',
        'symbol',
        '_highlighting_',
        'pictogram',
        'Red_pog.svg',
        '_medal_with_cup',
        '_balloon',
        'Feature',
        'Aiga_'
    );

    foreach ($generic_strings as $str)
    {
        if (stripos($url, $str) !== false) return true;
    }

    return false;
}

Comments are welcome.

score 3 · Accepted Answer

Let's take the page http://en.wikipedia.org/wiki/index.html?curid=57570 as an example for getting the main picture.

Check out

prop=pageprops

action=query&pageids=57570&prop=pageprops&format=json

The resulting page data:

{ "query" : { "pages" : { "57570":{
                    "pageid":57570,
                    "ns":0,
                    "title":"Sachin Tendulkar",
                    "pageprops" : {
                         "defaultsort":"Tendulkar,Sachin",
                         "page_image":"Sachin_at_Castrol_Golden_Spanner_Awards_(crop).jpg",
                         "wikibase_item":"Q9488"
                    }
            }
          }
 }}

From this result we get the main picture's file name:

** (wikiId).pageprops.page_image = Sachin_at_Castrol_Golden_Spanner_Awards_(crop).jpg**

Now that we have the image file name, we have to make another API call to get the full image path from the file name, as follows:

action=query&titles=Image:INSERT_EXAMPLE_FILE_NAME_HERE.jpg&prop=imageinfo&iiprop=url

E.g.

action=query&titles=Image:Sachin_at_Castrol_Golden_Spanner_Awards_(crop).jpg&prop=imageinfo&iiprop=url

This returns an array of image data containing the URL http://upload.wikimedia.org/wikipedia/commons/3/35/Sachin_at_Castrol_Golden_Spanner_Awards_%28crop%29.jpg
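The two steps could be sketched in PHP roughly as follows. The function names are hypothetical, the sample JSON is a trimmed copy of the response shown above, and I use the "File:" prefix for the second call (the same prefix the API uses in image titles elsewhere in this thread).

```php
<?php
// Step 1 (sketch): read page_image from a prop=pageprops response.
function pageImageFilename(string $json): ?string
{
    $data = json_decode($json, true);
    // Pages are keyed by page ID, so just take the first one that has the property.
    foreach ($data['query']['pages'] ?? [] as $page) {
        if (isset($page['pageprops']['page_image'])) {
            return $page['pageprops']['page_image'];
        }
    }
    return null;
}

// Step 2 (sketch): build the imageinfo query that resolves the file name to a URL.
function imageInfoQueryUrl(string $filename): string
{
    $params = [
        'action' => 'query',
        'titles' => 'File:' . $filename,
        'prop'   => 'imageinfo',
        'iiprop' => 'url',
        'format' => 'json',
    ];
    return 'https://en.wikipedia.org/w/api.php?' . http_build_query($params);
}

// Trimmed copy of the pageprops response shown above.
$sample = '{"query":{"pages":{"57570":{"pageid":57570,"ns":0,"title":"Sachin Tendulkar",'
        . '"pageprops":{"page_image":"Sachin_at_Castrol_Golden_Spanner_Awards_(crop).jpg"}}}}}';

$file = pageImageFilename($sample);
echo $file, "\n";
echo imageInfoQueryUrl($file), "\n";
```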

score 3 · Accepted Answer

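There is a way to reliably get the main image for a Wikipedia page: the extension called PageImages.

The PageImages extension collects information about images used on a page.

Its aim is to return the single most appropriate thumbnail associated with an article, attempting to return only meaningful images, e.g. not those from maintenance templates, stubs or flag icons. Currently it uses the first non-meaningless image used in the page.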

https://www.mediawiki.org/wiki/Extension:PageImages

Just add the prop pageimages to your API query:

/w/api.php?action=query&prop=pageimages&titles=Somepage&format=xml

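This reliably filters out annoying default images and saves you from having to filter them yourself! The extension is installed on all the main Wikipedia sites...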

score 3 · Accepted Answer

Like Anuraj mentioned, the pageimages parameter is it. Look at the following URL, which returns some nifty stuff:

https://en.wikipedia.org/w/api.php?action=query&prop=info|extracts|pageimages|images&inprop=url&exsentences=1&titles=india

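Here are some interesting parameters:

extracts and exsentences: these two parameters give you a short description you can use (exsentences is the number of sentences you want to include in the excerpt)
info and inprop=url: these parameters give you the URL of the page
The prop parameter takes multiple values, separated by a bar symbol (|)
And if you insert format=json in there, it is even better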
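If you are assembling that query in PHP, something like this sketch keeps the bar-separated prop list readable. The function name is made up; http_build_query() percent-encodes the bars, which the API should accept as ordinary URL encoding.

```php
<?php
// Sketch: build the combined query, joining the prop values with "|".
function buildSummaryQueryUrl(string $title, array $props, int $sentences = 1): string
{
    $params = [
        'action'      => 'query',
        'prop'        => implode('|', $props), // multiple values, bar-separated
        'inprop'      => 'url',                // include the page URL (used by info)
        'exsentences' => $sentences,           // number of sentences in the extract
        'titles'      => $title,
        'format'      => 'json',
    ];
    return 'https://en.wikipedia.org/w/api.php?' . http_build_query($params);
}

$url = buildSummaryQueryUrl('india', ['info', 'extracts', 'pageimages', 'images']);
echo $url, "\n";
```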

score 1 · Accepted Answer

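See the related question on an API for Wikipedia. However, I don't know whether it is possible to retrieve the thumbnail image through an API.

You could also consider just parsing the web page to find the image URL, and retrieving the image that way.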

score 1 · Accepted Answer

You can also use the CocoaPod called SDWebImage.

Code example (remember to also add import SDWebImage):

func requestInfo(flowerName: String) {

        let parameters : [String:String] = [
            "format" : "json",
            "action" : "query",
            "prop" : "extracts|pageimages",//pageimages allows fetch imagePath
            "exintro" : "",
            "explaintext" : "",
            "titles" : flowerName,
            "indexpageids" : "",
            "redirects" : "1",
            "pithumbsize" : "500"//specify image size in px
        ]


        AF.request(wikipediaURL, method: .get, parameters: parameters).responseJSON { (response) in
            switch response.result {
            case .success(let value):
                print("Got the wikipedia info.")
                print(response)

                let flowerJSON : JSON = JSON(response.value!)
                let pageid = flowerJSON["query"]["pageids"][0].stringValue

                let flowerDescription = flowerJSON["query"]["pages"][pageid]["extract"].stringValue

                let flowerImageURL = flowerJSON["query"]["pages"][pageid]["thumbnail"]["source"].stringValue //fetching Image URL

                self.wikiInfoLabel.text = flowerDescription
                self.imageView.sd_setImage(with: URL(string : flowerImageURL))//imageView updated with Wiki Image

            case .failure(let error):
                print(error)
            }
        }
    }

score 1 · Accepted Answer

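Here is the list of XPaths I have found to work for 95 percent of articles. The main ones are 1, 2, 3 and 4. A lot of articles are not formatted correctly, and these would be edge cases:

You can use a DOM parsing library to fetch the image using XPath.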

static NSString   *kWikipediaImageXPath2    =   @"//*[@id=\"mw-content-text\"]/div[1]/div/table/tr[2]/td/a/img";
static NSString   *kWikipediaImageXPath3    =   @"//*[@id=\"mw-content-text\"]/div[1]/table/tr[1]/td/a/img";
static NSString   *kWikipediaImageXPath1    =   @"//*[@id=\"mw-content-text\"]/div[1]/table/tr[2]/td/a/img";
static NSString   *kWikipediaImageXPath4    =   @"//*[@id=\"mw-content-text\"]/div[2]/table/tr[2]/td/a/img";
static NSString   *kWikipediaImageXPath5    =   @"//*[@id=\"mw-content-text\"]/div[1]/table/tr[2]/td/p/a/img";
static NSString   *kWikipediaImageXPath6    =   @"//*[@id=\"mw-content-text\"]/div[1]/table/tr[2]/td/div/div/a/img";
static NSString   *kWikipediaImageXPath7    =   @"//*[@id=\"mw-content-text\"]/div[1]/table/tr[1]/td/div/div/a/img";

I used an ObjC wrapper called Hpple around libxml2.2 to pull out the image URL. Hope this helps.

score 0 · Accepted Answer

I think not, but you can capture the image by running a link parser over the HTML document.

answered 2011-12-02T22:38:30.710
