php - Help with regular expressions in PHP (parsing Wikipedia markup)

Question

I have this text, which I want to remove from a page I fetched from Wikipedia.

{{Historical populations|type=USA
| 1698|4937
| 1712|5840
| 1723|7248
| 1737|10664
| 1746|11717
| 1756|13046
| 1771|21863
| 1790|33131
| 1800|60515
| 1810|96373
| 1820|123706
| 1830|202589
| 1840|312710
| 1850|515547
| 1860|813669
| 1870|942292
| 1880|1206299
| 1890|1515301
| 1900|3437202
| 1910|4766883
| 1920|5620048
| 1930|6930446
| 1940|7454995
| 1950|7891957
| 1960|7781984
| 1970|7894862
| 1980|7071639
| 1990|7322564
| 2000|8008288
| 2008*|8363710
|footnote=Beginning 1900, figures are for consolidated city of five boroughs. Sources: 1698–1771,{{cite book|last=Greene and Harrington|first=|title=American Population Before the Federal Census of 1790|publisher=|location=New York|year=1932|isbn=|pages=}}, as cited in: {{cite book|last=Rosenwaike|first=Ira|title=Population History of New York City|publisher=Syracuse University Press|location=Syracuse, N.Y.|year=1972|isbn=0815621558|page=8}} 1790–1990,Gibson, Campbell.[http://www.census.gov/population/www/documentation/twps0027.html Population of the 100 Largest Cities and Other Urban Places in the United States:1790 to 1990], [[United States Census Bureau]], June 1998. Retrieved June 12, 2007. *2008 est[http://factfinder.census.gov/servlet/SAFFPopulation?_event=Search&geo_id=16000US3403940&_geoContext=01000US%7C04000US34%7C16000US3403940&_street=&_county=new+york+city&_cityTown=new+york+city&_state=04000US36&_zip=&_lang=en&_sse=on&ActiveGeoDiv=geoSelect&_useEV=&pctxt=fph&pgsl=160&_submenuId=population_0&ds_name=null&_ci_nbr=null&qr_name=null&reg=null%3Anull&_keyword=&_industry=Census Data for New York city, New York], [[United States Census Bureau]]. Retrieved June 12, 2007.
}}

I would also like to keep the following part as plain text (but without the parts wrapped in "{{" and "}}"):

New York is the most populous city in the United States, with an estimated 2008 population of 8,363,710(up from 7.3 million in 1990). This amounts to about 40.0% of New York State's population and a similar percentage of the metropolitan regional population. Over the last decade the city's population has been increasing and demographers estimate New York's population will reach between 9.2 and 9.5 million by 2030.{{cite web |title=New York City Population Projections by Age/Sex and Borough, 2000-2030 |publisher=[[New York City Department of City Planning]] |month=December | year=2006 |url=http://www.nyc.gov/html/dcp/pdf/census/projections_report.pdf |format=PDF |accessdate=2008-09-01}} See also {{cite news |last=Roberts, Sam |title=By 2025, Planners See a Million New Stories in the Crowded City |publisher=New York Times |date=February 19, 2006 |url=http://www.nytimes.com/2006/02/19/nyregion/19population.html?ex=1298005200&en=c586d38abbd16541&ei=5090&partner=rssuserland&emc=rss |accessdate=2008-09-01}}

Thanks.

score 3 · Accepted Answer

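The code I currently use to clean up wiki pages is shown below. Take this page as an example:

http://en.wikipedia.org/wiki/Tel_Aviv (you can see the markup by clicking "edit this page")

I get this back: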

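"and gave way to the reputation of a "city that never sleeps". Haaretz editorial It is the country's financial capital and a major performing arts and business center. Tel Aviv's urban area is the second-largest city economy in the Middle East, and is ranked 42nd among global cities by Foreign Policy's 2008 Global Cities Index. It is also the most expensive city in the region, and the 17th most expensive city in the world. The cost of living in Israel is high, with Tel Aviv being its most expensive city to live in According to the New York-based human resources consulting firm Mercer, as of 2008 Tel Aviv is the most expensive city in the Middle East and the 14th most expensive in the world, just behind Singapore and Paris. Sydney and Dublin in this respect. By comparison, New York City is ranked 22nd."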

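This is incorrect. The expected result should be:

Tel Aviv-Yafo (Hebrew: תֵּל־אָבִיב-יָפוֹ; Arabic: تل أبيب‎, Tall ʼAbīb), usually called Tel Aviv, is the second-largest city in Israel, with an estimated population of 393,900. The city is situated on the Israeli Mediterranean coastline, with a land area of 51.8 square kilometres (20.0 sq mi). It is the largest and most populous city in the Gush Dan metropolitan area, home to 3.15 million people as of 2008. The city is governed by the Tel Aviv-Yafo municipality, headed by Ron Huldai.

This is the PHP code: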

function clean_wiki_text($text)
  {
    // first get rid of UGC HTML tags
    $text = strip_tags($text);

    // keep convert tag
    $text = preg_replace("/\{\{convert\|([^\|]+)\|([^\|]+)\|[^\}]+\}\}/", "$1$2", $text);

    // remove large blocks (treat as tags)
    $text = preg_replace("/(<![^>]+>)/", '', $text);
    $text = preg_replace('/\{\{\s?/', '<', $text);
    $text = str_replace('}}', ' />', $text);

    $text = str_replace('<! />', '', $text);

    // more wiki formatting
    $text = preg_replace("/'{2,6}/", '', $text);
    $text = preg_replace("/[=\s]+External [lL]inks[\s=]+/", '', $text);
    $text = preg_replace("/[=\s]+See [aA]lso[\s=]+/", '', $text);
    $text = preg_replace("/[=\s]+References[\s=]+/", '', $text);
    $text = preg_replace("/[=\s]+Notes[\s=]+/", '', $text);
    $text = preg_replace('/\{\{([^\}]+)\}\}/', '', $text);

    // drop page link text
    $text = preg_replace('/\[\[([^:\|\]]+)\|([^:\]]+)\]\]/', "$2", $text);
    // or keep it with preg_replace('/\[\[([^:\|\]]+)\|([^:\]]+)\]\]/', "$1 ($2)", $text);

    $text = preg_replace('/\(\[[^\]]+\]\)/', '', $text);
    $text = preg_replace('/\[\[([^:\]]+)\]\]/', "$1", $text);
    $text = preg_replace('/\*?\s?\[\[([^\]]+)\]\]/', '', $text);
    $text = preg_replace('/\*\s?\[([^\s]+)\s([^\]]+)\]/', "$2", $text);
    $text = preg_replace('/\n(\*+\s?)/', '', $text);
    $text = preg_replace('/\n{3,}/', "\n\n", $text);
    $text = preg_replace('/<ref[^>]?>[^>]+>/', '', $text);
    $text = preg_replace('/<cite[^>]?>[^>]+>/', '', $text);

    $text = preg_replace('/={2,}/', '', $text);
    $text = preg_replace('/{?class="[^"]+"/', "", $text);
    $text = preg_replace('/!?\s?width="[^"]+"/', "", $text);
    $text = preg_replace('/!?\s?height="[^"]+"/', "", $text);
    $text = preg_replace('/!?\s?style="[^"]+"/', "", $text);
    $text = preg_replace('/!?\s?rowspan="[^"]+"/', "", $text);
    $text = preg_replace('/!?\s?bgcolor="[^"]+"/', "", $text);

    $text = trim($text);

    $text = preg_replace('/\n\n/', "<br />\n<br />\n", $text);
    $text = preg_replace('/\r\n\r\n/', "<br />\r\n<br />\r\n", $text);
/*
    $config = array(
      'show-body-only' => true,
      'clean'          => false, 
      'wrap'           => 0, 
      'show-warnings'  => 0,
      'show-errors'    => 0,
      'enclose-block-text'   => false,
      'vertical-space' => true,
      'output-html'    => true
    );

    // Tidy
    $tidy = new tidy;
    $tidy->parseString($text, $config, 'utf8');
    $tidy->cleanRepair();

    $text = $tidy->value;
*/
    $extras = array(
  //  "/\((.*?)\)/is" => "",
      "/\[(.*?)\]/is" => ""
    );
    $text = preg_replace(array_keys($extras), array_values($extras), $text);

    $text = str_replace(" ,", ',', $text);
    $text = str_replace(", ", ',', $text);
    $text = str_replace(",", ', ', $text);
    $text = str_replace("(, ", '(', $text);
    $text = str_replace(";,", ',', $text);

    // lets keep it plain plain plain
    $text = strip_tags($text);
//    $text = preg_replace('/\s\s+/', ' ', $text);

    $text = str_replace("|-", '', $text);
    $text = str_replace("|}", '', $text);
    $text = str_replace("|", '', $text);
    $text = str_replace('()', '', $text);
    $text = str_replace('&nbsp;', ' ', $text);

    $text = trim($text);

    $text_arr = preg_split('/[\r\n]+/', $text, -1, PREG_SPLIT_NO_EMPTY);
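    // collect paragraphs longer than 30 characters; $result must start as an
    // array, since appending with [] to a string is a fatal error in PHP 7.1+
    $result = array();
    foreach ($text_arr as $paragraph) {
      if ( mb_strlen(trim($paragraph)) > 30 ) {
        $result[] = $paragraph;
      }
    }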
    return $result;
  }

score 2 · Accepted Answer

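Just guessing here, but wouldn't it be easier and safer to use Wikipedia's markup library (bundled with MediaWiki) to convert the markup to HTML, and then parse that with whatever XML library you happen to be familiar with?

The API documentation can be found at http://svn.wikimedia.org/doc/ (in the Parser module), and it doesn't look complicated. Basically, all you have to do is something like this: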

<?php

require_once '/path/to/mediawiki/Parser.php';
// also include whatever classes Parser depends on or use Mediawiki's autoload
// mechanism if it has any

// retrieve the content of your page in $content

$parser = new Parser();
$html   = $parser->parse($content);

$simplexml = simplexml_load_string($html);

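Now you have a very convenient SimpleXML object to work with. Of course, this only works if MediaWiki's parser generates valid XML (which I bet it does).

Also, if MediaWiki includes some kind of autoload mechanism, you can easily find it by searching MediaWiki's codebase for __autoload or spl_autoload_register.

Hope this helps!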
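As a rough sketch of that second step, here is one way to pull plain paragraph text out of the HTML once you have it. The `$html` string is a hard-coded stand-in for whatever the parser returns, and `paragraphs_as_text` is a hypothetical helper name, not part of MediaWiki. It assumes the input is well-formed XML; undeclared HTML entities such as `&nbsp;` would make `simplexml_load_string` fail.

```php
<?php
// Assumption: $html stands in for well-formed XHTML produced by a wiki parser;
// it is hard-coded here so the sketch runs on its own.
$html = '<div><p>Tel Aviv is a <b>city</b> in <a href="/wiki/Israel">Israel</a>.</p>'
      . '<table><tr><td>infobox</td></tr></table>'
      . '<p>It lies on the Mediterranean coast.</p></div>';

// Hypothetical helper: returns the text of every <p> element, without inline tags.
function paragraphs_as_text(string $html): array
{
    $xml = simplexml_load_string($html);
    if ($xml === false) {
        return array(); // input was not well-formed XML
    }

    $paragraphs = array();
    foreach ($xml->xpath('//p') as $p) {
        // asXML() returns the element including its inline markup;
        // strip_tags() then flattens it to plain text
        $paragraphs[] = trim(strip_tags($p->asXML()));
    }
    return $paragraphs;
}

print_r(paragraphs_as_text($html));
```

Because only `<p>` elements are selected, tables and infoboxes drop out without needing a regex for each kind of markup.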

score 0 · Accepted Answer

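It is really hard to craft a regex when only one example is given. From my own experience cleaning up Wikipedia pages, I know that other pages will most likely look a bit different. Just to match your example: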

{{.+?}}\n

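This only works if there is a newline after the part to be removed and you specify DOTALL and MULTILINE (the s and m modifiers in PHP). To match all pairs of double curly braces and whatever is inside them: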

{{[^}]+}}

You could try doing multiple runs, each one removing another unwanted part. I doubt it is feasible to match everything you need in a single regex.
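The multiple-runs idea also helps with nested templates such as the `{{cite book ...}}` entries inside `{{Historical populations ...}}` in the question. A minimal sketch, assuming a hypothetical helper named `strip_templates`: each pass removes only the innermost `{{...}}` pairs, and the loop repeats until nothing is left to remove. The pattern is a variant of the one above that also excludes `{`, so a match can never span a nested template.

```php
<?php
// Hypothetical helper (not from the original answers): strips {{...}} templates,
// including nested ones, by repeatedly removing the innermost pairs.
function strip_templates(string $text): string
{
    do {
        // [^{}]* cannot cross a brace, so each pass matches only templates
        // that contain no other template
        $text = preg_replace('/\{\{[^{}]*\}\}/', '', $text, -1, $count);
    } while ($count > 0);

    return $text;
}

$sample = "Population will reach 9.5 million by 2030."
        . "{{cite web |title=Projections |publisher=[[Dept]] {{small|x}} }}"
        . " See also {{cite news |title=Story}}";

echo strip_templates($sample), "\n";
```

A template that contains a single `{` or `}` somewhere inside it is left untouched by this sketch, so table markup (`{|` ... `|}`) inside a template would still need its own pass.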
