
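I'm trying to fetch four or five events that happened on this day in history and add their plain-text form to an array in PHP.

So far, I'm using the following code: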

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://en.wikipedia.org/w/api.php?action=featuredfeed&feed=onthisday&feedformat=rss');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_TIMEOUT, 3);
curl_setopt($ch, CURLOPT_USERAGENT, 'My random user agent'); // Needed for Wikipedia to prevent IP blocking
$contents = trim(curl_exec($ch));
curl_close($ch);

$xml = simplexml_load_string($contents);
$json = json_encode($xml);
$array = json_decode($json, true);


$noOfDays = count($array['channel']['item']);
$r = $noOfDays - 1;
$input = $array['channel']['item'][$r]['description'];

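I know this isn't very dynamic or efficient, but one person will call this page once a day, so that doesn't matter much.

At this point, $input contains a block of HTML that looks like this: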

<p><b><a href="/wiki/April_6" title="April 6">April 6</a></b>: <b><a href="/wiki/Good_Friday" title="Good Friday">Good Friday</a></b> (Western Christianity, 2012); <b><a href="/wiki/Fast_of_the_Firstborn" title="Fast of the Firstborn">Fast of the Firstborn</a></b> begins at dawn and <b><a href="/wiki/Passover" title="Passover">Passover</a></b> begins at sunset (Judaism, 2012)
</p>
<div style="float:right;margin-left:0.5em">
<p><a href="/wiki/File:Sir_Arthur_Wellesley,_1st_Duke_of_Wellington.png" class="image" title="Arthur Wellesley, the Earl of Wellington"><img alt="Arthur Wellesley, the Earl of Wellington" src="//upload.wikimedia.org/wikipedia/commons/thumb/8/83/Sir_Arthur_Wellesley%2C_1st_Duke_of_Wellington.png/78px-Sir_Arthur_Wellesley%2C_1st_Duke_of_Wellington.png" width="78" height="100" /></a>
</p>
</div>
<li style="-moz-float-edge: content-box">
<a href="/wiki/1250" title="1250">1250</a> – <a href="/wiki/Seventh_Crusade" title="Seventh Crusade">Seventh Crusade</a>: Egyptian <a href="/wiki/Ayyubid" title="Ayyubid" class="mw-redirect">Ayyubids</a> <b><a href="/wiki/Battle_of_Fariskur" title="Battle of Fariskur">annihilated the crusader army</a></b> and captured King <a href="/wiki/Louis_IX_of_France" title="Louis IX of France">Louis&#160;IX of France</a> as a hostage.
<li style="-moz-float-edge: content-box">
<a href="/wiki/1320" title="1320">1320</a> – The <b><a href="/wiki/Declaration_of_Arbroath" title="Declaration of Arbroath">Declaration of Arbroath</a></b>, a declaration of <a href="/wiki/Scottish_independence" title="Scottish independence">Scottish independence</a>, was adopted.
<li style="-moz-float-edge: content-box">
<a href="/wiki/1812" title="1812">1812</a> – <a href="/wiki/Peninsular_War" title="Peninsular War">Peninsular War</a>: After a <b><a href="/wiki/Siege_of_Badajoz_(1812)" title="Siege of Badajoz (1812)">three-week siege</a></b>, the <a href="/wiki/Anglo-Portuguese_Army" title="Anglo-Portuguese Army">Anglo-Portuguese Army</a>, under the <a href="/wiki/Arthur_Wellesley,_1st_Duke_of_Wellington" title="Arthur Wellesley, 1st Duke of Wellington">Earl of Wellington</a> <i>(pictured)</i>, captured <a href="/wiki/Badajoz" title="Badajoz">Badajoz</a>, Spain and forced the surrender of the French garrison.
<li style="-moz-float-edge: content-box">
<a href="/wiki/1947" title="1947">1947</a> – The <a href="/wiki/1st_Tony_Awards" title="1st Tony Awards">first</a> <b><a href="/wiki/Tony_Award" title="Tony Award">Tony Awards</a></b>, recognizing achievement in live American <a href="/wiki/Theatre" title="Theatre">theatre</a>, were handed out at the <a href="/wiki/Waldorf-Astoria_Hotel" title="Waldorf-Astoria Hotel">Waldorf-Astoria Hotel</a> in <a href="/wiki/New_York_City" title="New York City">New York City</a>.
<li style="-moz-float-edge: content-box">
<a href="/wiki/2008" title="2008">2008</a> – Egyptian workers staged <b><a href="/wiki/2008_Egyptian_general_strike" title="2008 Egyptian general strike">an illegal general strike</a></b>, two days before <a href="/wiki/Egyptian_municipal_elections,_2008" title="Egyptian municipal elections, 2008">key municipal elections</a>.
</li>
</ul>
<p>More anniversaries: <span class="nowrap"><a href="/wiki/April_5" title="April 5">April 5</a> &#8211;</span> <span class="nowrap"><b><a href="/wiki/April_6" title="April 6">April 6</a></b> &#8211;</span> <span class="nowrap"><a href="/wiki/April_7" title="April 7">April 7</a></span>
</p>
<div style="text-align: right;" class="noprint"><span class="nowrap"><b><a href="/wiki/Wikipedia:Selected_anniversaries/April" title="Wikipedia:Selected anniversaries/April">Archive</a></b> &#8211;</span> <span class="nowrap"><b><a href="https://lists.wikimedia.org/mailman/listinfo/daily-article-l" class="extiw" title="mail:daily-article-l">By email</a></b> &#8211;</span> <span class="nowrap"><b><a href="/wiki/List_of_historical_anniversaries" title="List of historical anniversaries">List of historical anniversaries</a></b></span></div>
<div style="text-align: right;"><small>It is now <span class="nowrap">April 6, 2012</span> (<a href="/wiki/Coordinated_Universal_Time" title="Coordinated Universal Time">UTC</a>) &#8211; <span class="plainlinks" id="purgelink"><span class="nowrap"><a class="external text" href="//en.wikipedia.org/w/index.php?title=MediaWiki:Ffeed-onthisday-transcludeme&amp;action=purge">Refresh this page</a></span></span></small></div>

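The only parts I'm interested in are the bits between each <li style="-moz-float-edge: content-box">

I don't know why they haven't closed these <li> tags properly, but there you go.

So essentially I want to get the actual information, strip the links, and add each event to an array, which should look like this: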

Array (
    [0] => 1250 – Seventh Crusade: Egyptian Ayyubids annihilated the crusader army and captured King Louis&#160;IX of France as a hostage.
    [1] => Next one...
    [2] => And another...
)

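There's also the small matter of the &#160; near the end of that line. How do I convert it to plain text? I have a feeling HTML parsing may be the answer.

I've tried both regular expressions and HTML parsing, but because the tags aren't closed I've had some difficulty with either approach.

Any suggestions?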

1 Answer

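As @zzzzBov pointed out, closing tags for some elements, such as <li>, are optional in HTML (but not in XHTML). Unfortunately, that is one of several things that make HTML incompatible with XML (and XML parsers). For your task, I'd suggest parsing the DOM with a library such as phpQuery or PHP Simple HTML DOM Parser.

In phpQuery, your code would look like this: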

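$doc   = phpQuery::newDocumentHTML( $input );
$items = $doc->find('li');

// Print the plain text of each <li>
foreach($items as $item) {
  echo pq($item)->text();
}

// Or, to collect the text into an array... (PHP 5.3+)
// find() returns an iterable object rather than an array,
// so convert it before handing it to array_map()

$items = array_map( function( $item ) {
  return pq( $item )->text();
}, iterator_to_array( $doc->find('li') ) );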
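
If adding a library is not an option, PHP's built-in DOM extension also copes with unclosed <li> tags. The following is only a minimal sketch of that alternative, not part of the phpQuery approach above: the helper name extractListItems() and the shortened $sample markup are made up for illustration, and the XML-declaration prefix is a commonly used workaround to make loadHTML() read the input as UTF-8.

```php
<?php
// Hypothetical helper (an assumption, not from the original answer):
// returns the plain text of every <li> in an HTML fragment.
function extractListItems($html)
{
    $dom = new DOMDocument();

    // Collect parser warnings about sloppy markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    // Prefixing an XML declaration is a common trick to force UTF-8 input.
    $dom->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $events = array();
    foreach ($dom->getElementsByTagName('li') as $li) {
        // textContent drops the tags and decodes entities such as &#160;,
        // so the non-breaking space (bytes C2 A0) is swapped for a plain space.
        $text = str_replace("\xC2\xA0", ' ', $li->textContent);
        $events[] = trim($text);
    }
    return $events;
}

// Shortened sample in the same shape as the feed: the first <li> is never closed.
$sample = '<ul><li style="-moz-float-edge: content-box">'
        . '<a href="/wiki/1250">1250</a> – captured King '
        . '<a href="/wiki/Louis_IX_of_France">Louis&#160;IX of France</a> as a hostage.'
        . '<li style="-moz-float-edge: content-box">'
        . '<a href="/wiki/1320">1320</a> – The <b>Declaration of Arbroath</b> was adopted.'
        . '</li></ul>';

$events = extractListItems($sample);
print_r($events);
```

With the real feed, you would pass $input instead of $sample.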

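As for the &#160;, try html_entity_decode(), which converts HTML entities back into the characters they stand for (here, a non-breaking space).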
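
As a rough illustration of that suggestion, assuming the text is handled as UTF-8, the entity can be decoded and the resulting non-breaking space replaced with an ordinary one. The variable names here are arbitrary.

```php
<?php
// One of the expected array entries from the question, entity still encoded.
$line = '1250 – Seventh Crusade: Egyptian Ayyubids annihilated the crusader army '
      . 'and captured King Louis&#160;IX of France as a hostage.';

// Decode entities into UTF-8 characters; &#160; becomes U+00A0 (bytes C2 A0).
$decoded = html_entity_decode($line, ENT_QUOTES, 'UTF-8');

// Optional: swap the non-breaking space for a regular space.
$plain = str_replace("\xC2\xA0", ' ', $decoded);

echo $plain, PHP_EOL;
```

Passing the charset explicitly avoids depending on the default, which differs between PHP versions.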

answered 2012-04-06T16:08:35.483