objective-c - How to parse some wiki markup

Question

Hi everyone. Given a plain-text data set like the following:

==Events==
* [[312]] &ndash; [[Constantine the Great]] is said to have received his famous [[Battle of Milvian Bridge#Vision of Constantine|Vision of the Cross]].
* [[710]] &ndash; [[Saracen]] invasion of [[Sardinia]].
* [[939]] &ndash; [[Edmund I of England|Edmund I]] succeeds [[Athelstan of England|Athelstan]] as [[King of England]].
*[[1275]] &ndash; Traditional founding of the city of [[Amsterdam]].
*[[1524]] &ndash; [[Italian Wars]]: The French troops lay siege to [[Pavia]].
*[[1553]] &ndash; Condemned as a [[Heresy|heretic]], [[Michael Servetus]] is [[burned at the stake]] just outside [[Geneva]].
*[[1644]] &ndash; [[Second Battle of Newbury]] in the [[English Civil War]].
*[[1682]] &ndash; [[Philadelphia]], [[Pennsylvania]] is founded.

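I would like to end up with an NSDictionary, or some other kind of collection, so that I can map the year (the number on the left) to the excerpt (the text on the right). So this is what the "template" looks like: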

*[[YEAR]] &ndash; THE_TEXT

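I would want the excerpt to be plain text, though, that is, no wiki markup, so no [[ ]] sets. In fact, for aliased links such as [[Edmund I of England|Edmund I]], I would want only the alias (Edmund I).

I don't have much experience with regular expressions, so I have a few questions. Should I try to "clean up" the data first? For example, remove the first line, which will always be ==Events==, and remove the [[ and ]] occurrences?

Or perhaps a better solution: should I do this in passes? For example, in the first pass I could split each line into * [[710]] and [[Saracen]] invasion of [[Sardinia]], and store the two halves in different NSArrays.

Then I would go through the first NSArray, the years, and keep only the text inside the [[ ]] (I say text rather than number because it could be 530 BC), so * [[710]] becomes 710.

Then, for the excerpt NSArray, I would go through it and, if a [[some_article|alias]] is found, somehow reduce it to just [[alias]], and then remove all the [[ and ]] sets?

Is this possible? Should I use regular expressions? Do you have any ideas for regular expressions that might help?

Thanks! I really appreciate it.

Edit: Sorry for the confusion, but I only want to parse the data above. Assume that it is the only kind of markup I will encounter. I'm not necessarily looking to parse wiki markup in general, unless there is already a pre-existing library that does this. Thanks again!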

score 3 · Accepted Answer

This code assumes you are using RegexKitLite:

// Requires RegexKitLite: #import "RegexKitLite.h"
NSString *data = @"* [[312]] &ndash; [[Constantine the Great]] is said to have received his famous [[Battle of Milvian Bridge#Vision of Constantine|Vision of the Cross]].\n"
    @"* [[710]] &ndash; [[Saracen]] invasion of [[Sardinia]].\n"
    @"* [[939]] &ndash; [[Edmund I of England|Edmund I]] succeeds [[Athelstan of England|Athelstan]] as [[King of England]].\n"
    @"*[[1275]] &ndash; Traditional founding of the city of [[Amsterdam]].";

// Capture 1 is the year; capture 2 is the rest of the line.
NSString *captureRegex = @"(?i)(?:\\* *\\[\\[)([0-9]*)(?:\\]\\] \\&ndash; )(.*)";

// Search window: starts as the whole string and shrinks past each match.
NSRange captureRange;
NSRange stringRange = NSMakeRange(0, data.length);

do
{
    captureRange = [data rangeOfRegex:captureRegex inRange:stringRange];
    if ( captureRange.location != NSNotFound )
    {
        NSString *year = [data stringByMatching:captureRegex options:RKLNoOptions inRange:stringRange capture:1 error:NULL];
        NSString *textStuff = [data stringByMatching:captureRegex options:RKLNoOptions inRange:stringRange capture:2 error:NULL];
        // Move the window to just after this match so it is not matched again.
        stringRange.location = captureRange.location + captureRange.length;
        stringRange.length = data.length - stringRange.location;
        NSLog(@"Year:%@, Stuff:%@", year, textStuff);
    }
}
while ( captureRange.location != NSNotFound );

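Note that you really need to learn regular expressions to build these well, but here is what mine says:

(?i)

Ignore case. Since I am not matching any letters, I could have left this out.

(?:\* *\[\[)

?: means do not capture this group. I escape the * to match it literally, then allow zero or more spaces (" *"), then escape the two opening brackets (brackets are also special characters in a regex).

([0-9]*)

Capture any run of digits. Because this matches digits only, a year such as 530 BC would not match; a wider class such as [^\]]+ would be needed for that.

(?:\]\] \&ndash; )

This is where we ignore stuff again, basically matching "]] &ndash; ". Note that for every "\" in the regex I had to add another one in the Objective-C string above, because "\" is also a special character in string literals. And yes, that means matching a single literal "\" (written "\\" in the regex) ends up as "\\\\" in an Obj-C string.

(.*)

Just grab everything else. By default "." does not match line breaks, so the match stops at the end of the line, which is why it does not swallow everything that follows. You will have to add code to strip the [[LINK]] markup from the text.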
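For the link-stripping step that is left open here, a minimal sketch might look like the following. It assumes Foundation's NSRegularExpression instead of RegexKitLite, and the helper name StripWikiLinks is hypothetical. It replaces [[Target|Alias]] with Alias and [[Target]] with Target, and handles nothing beyond the sample data's link forms.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not part of RegexKitLite or Foundation):
// turns [[Target|Alias]] into Alias and [[Target]] into Target.
static NSString *StripWikiLinks(NSString *text) {
    // \[\[            opening brackets
    // (?:[^\]|]*\|)?  optional "Target|" prefix, not captured
    // ([^\]]*)        the display text, captured as $1
    // \]\]            closing brackets
    NSRegularExpression *linkRegex =
        [NSRegularExpression regularExpressionWithPattern:@"\\[\\[(?:[^\\]|]*\\|)?([^\\]]*)\\]\\]"
                                                  options:0
                                                    error:NULL];
    // Replace every link with just its display text.
    return [linkRegex stringByReplacingMatchesInString:text
                                               options:0
                                                 range:NSMakeRange(0, text.length)
                                          withTemplate:@"$1"];
}
```

Passing the excerpt of the [[939]] line through StripWikiLinks should give "Edmund I succeeds Athelstan as King of England."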

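The NSRange variables are used to keep matching through the file without re-matching what has already been matched, so to speak.

Don't forget that after adding the RegexKitLite class files you also need to add a special linker flag (-licucore), otherwise you will get a lot of link errors (the RegexKitLite site has installation instructions).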
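If pulling in RegexKitLite is not an option, the same walk over the data could probably be done with Foundation alone. The sketch below is an assumption-laden variant, not the code above: it uses NSRegularExpression, the helper name ParseEvents is hypothetical, and the year capture is widened to [^\]]+ so that a value like 530 BC would also be accepted. It collects the year-to-excerpt pairs into the NSDictionary the question asks for, leaving link markup in the excerpts.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: maps each year string to its raw excerpt.
// Link markup such as [[Sardinia]] is left in place.
static NSDictionary<NSString *, NSString *> *ParseEvents(NSString *data) {
    // Capture 1: text inside the first [[ ]] (assumed wider than digits only).
    // Capture 2: the rest of the line; "." stops at line breaks by default.
    NSRegularExpression *lineRegex =
        [NSRegularExpression regularExpressionWithPattern:@"\\* *\\[\\[([^\\]]+)\\]\\] &ndash; (.*)"
                                                  options:0
                                                    error:NULL];
    NSMutableDictionary<NSString *, NSString *> *events = [NSMutableDictionary dictionary];
    // The enumeration advances through the string itself, so no manual
    // NSRange bookkeeping is needed.
    [lineRegex enumerateMatchesInString:data
                                options:0
                                  range:NSMakeRange(0, data.length)
                             usingBlock:^(NSTextCheckingResult *match, NSMatchingFlags flags, BOOL *stop) {
        NSString *year = [data substringWithRange:[match rangeAtIndex:1]];
        NSString *excerpt = [data substringWithRange:[match rangeAtIndex:2]];
        events[year] = excerpt;
    }];
    return events;
}
```

One design caveat: a dictionary keyed by year keeps only the last excerpt when the same year appears on more than one line, so mapping each year to an NSArray of excerpts may be a better fit if that can happen in the data.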

score 0 · Answer

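I'm not good with regular expressions, but this sounds like a job for them. I imagine a regex would solve this for you quite easily.

Take a look at the RegexKitLite library.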

score 0 · Answer

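If you want to be able to parse Wikitext in general, you have a lot of work ahead of you. Templates alone are one complicating factor. How much effort do you want to put into handling those?

If you are serious about this, you should probably look for an existing library that parses Wikitext. A quick search turned up this CPAN library, but I have not used it, so I cannot offer it as a personal recommendation.

Alternatively, you may want to take a simpler approach and decide which specific parts of Wikitext you are going to handle. That might be, for example, links and headings, but not lists. Then you have to focus on each of those and turn the Wikitext into whatever you want it to look like. And yes, regular expressions will help a great deal here, so read up on them, and come back and ask if you have specific questions.

Good luck!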
