php - Parsing Wikipedia lists and descriptions with regex

Question

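I'm not very familiar with regular expressions, and I need a way to parse a list of items from Wikipedia. I pulled the content with Wikipedia's api.php, and the data I'm left with looks like this: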

    ==Formal fallacies==
    A [[formal fallacy]] is an error in logic that...

    * [[Appeal to probability]] –  takes something for granted because...
    * [[Argument from fallacy]] –  assumes that if an argument ...
    * [[Base rate fallacy]] –  making a probability judgement...
    * [[Conjunction fallacy]] –  assumption that an outcome simultaneously...
    * [[Masked man fallacy]] –  ...

    ===Propositional fallacies===

    * [[Affirming a disjunct]] –  concluded that ...
    * [[Affirming the consequent]] –  the [[antecedent...
    * [[Denying the antecedent]] –  the [[consequent]] in...

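So I need a way to extract the data such that:

We only pay attention to lines starting with * [[
Anything between * [[ and ]] is the name
Whatever remains after the - is the description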

score 1 · Accepted Answer

This does the job:

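    // Match "* [[Name]] – description" lines; the named groups hold the two parts.
    preg_match_all('~^\h*+\*\h*\[\[(?<name>[a-z ]++)]]\h*+[-–]\h*+(?<description>.++)$~imu', $text, $results, PREG_SET_ORDER);
    // Drop the numeric keys so only 'name' and 'description' remain.
    foreach ($results as &$result) {
        foreach ($result as $key => $value) {
            if (is_numeric($key)) unset($result[$key]);
        }
    }
    unset($result); // break the reference left over from the foreach
    echo '<pre>' . print_r($results, true) . '</pre>';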
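
To see what the pattern above yields, here is a minimal, self-contained sketch run against a shortened copy of the sample from the question. The variable names `$pattern`, `$matches`, and `$items` are my own, and the `array_filter` call is just an alternative way to drop the numeric keys, not part of the original answer. It assumes PHP 7.4 or later (for arrow functions) and a UTF-8 source file, since the pattern contains a literal en dash. Note that the `[a-z ]` class only accepts letters and spaces, so a name containing an apostrophe or a digit would presumably be skipped.

```php
<?php
// Sample input, shortened from the question ($text is an assumed variable name).
$text = <<<'WIKI'
==Formal fallacies==
A [[formal fallacy]] is an error in logic that...

* [[Appeal to probability]] –  takes something for granted because...
* [[Argument from fallacy]] –  assumes that if an argument ...

===Propositional fallacies===

* [[Affirming a disjunct]] –  concluded that ...
WIKI;

// Same pattern as in the answer: i = case-insensitive, m = multiline, u = UTF-8.
$pattern = '~^\h*+\*\h*\[\[(?<name>[a-z ]++)]]\h*+[-–]\h*+(?<description>.++)$~imu';
preg_match_all($pattern, $text, $matches, PREG_SET_ORDER);

// Keep only the string keys ('name', 'description') of every match.
$items = array_map(
    fn(array $m): array => array_filter($m, 'is_string', ARRAY_FILTER_USE_KEY),
    $matches
);

print_r($items);
```

Each entry of `$items` is then an array with just a `name` and a `description` key, one entry per list line.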

score 0 · Accepted Answer

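First, replace

    ^((?!\*\s\[\[).)*$

with an empty string. This removes the lines that do not contain * [[.

To remove the leftover line breaks, replace

    ^\n|\r$

with an empty string.

This is the regex to get the title and description:

    ^\s+\*\s\[\[([^\]\]]*)\]\]\s–(.*)
    Title: "$1", Description: "$2"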
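
The three steps could be strung together in PHP roughly as follows. This is only a sketch under a few assumptions: the `m` (multiline) flag is added so that `^` and `$` work per line, the leading `\s+` of the last pattern is relaxed to `\s*` so that unindented lines also match, and the names `$text` and `$rows` are hypothetical. The source file is assumed to be UTF-8 because of the literal en dash.

```php
<?php
// Sample input, shortened from the question ($text is an assumed variable name).
$text = <<<'WIKI'
==Formal fallacies==
A [[formal fallacy]] is an error in logic that...

* [[Appeal to probability]] –  takes something for granted because...
* [[Base rate fallacy]] –  making a probability judgement...

===Propositional fallacies===

* [[Affirming a disjunct]] –  concluded that ...
WIKI;

// Step 1: blank out every line that does not contain "* [[".
$text = preg_replace('/^((?!\*\s\[\[).)*$/m', '', $text);

// Step 2: drop the now-empty lines and any trailing carriage returns.
$text = preg_replace('/^\n|\r$/m', '', $text);

// Step 3: capture title and description.
// Assumption: \s+ from the answer is relaxed to \s* for unindented input.
preg_match_all('/^\s*\*\s\[\[([^\]\]]*)\]\]\s–(.*)/mu', $text, $rows, PREG_SET_ORDER);

foreach ($rows as $row) {
    // trim() removes the spaces that follow the dash in the raw capture.
    printf("Title: \"%s\", Description: \"%s\"\n", $row[1], trim($row[2]));
}
```

With the last pattern already anchored on `* [[`, steps 1 and 2 are arguably optional when `preg_match_all` is used. They matter mainly when the replacements are run in a text editor.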
