1

I am using the Wikipedia JSON API. For example, I can retrieve the content of a page with this request:

https://en.wikipedia.org/w/api.php?action=query&format=json&titles=May_21&prop=revisions&rvprop=content&rvsection=1

The returned wikitext looks like this:

[[293]] &ndash; Roman Emperors [[Diocletian]] and [[Maximian]] appoint [[Galerius]] as [[Caesar (title)|''Caesar'']] to Diocletian, beginning the period of four rulers known as the [[Tetrarchy]].

&ndash; should be replaced with -

[[Caesar (title)|''Caesar'']] should become Caesar

I am using Objective-C.

How can I retrieve the same page content, but without the link markup?

Thanks!

4 Answers

2

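Use an HTML-to-text converter (for example the Links text browser, or a browser emulator such as PhantomJS). That is much less painful than converting wikitext to text, in which case you would have to deal with templates.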
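
As a rough illustration of this route: the API can return rendered HTML instead of wikitext (action=parse with prop=text), and that HTML can then be reduced to plain text. The sketch below shows only the second step on a hand-written sample. The htmlToText helper and the sample markup are assumptions made for illustration; a real converter such as the ones named above handles far more cases (nested tables, all entities, block spacing).

```javascript
// Naive HTML-to-text reduction. `htmlToText` is a hypothetical helper,
// not part of any library, and only covers a handful of entities.
function htmlToText(html) {
  const entities = {
    '&amp;': '&', '&lt;': '<', '&gt;': '>', '&quot;': '"',
    '&nbsp;': ' ', '&#160;': ' ', '&ndash;': '-',
  };
  return html
    .replace(/<(script|style)\b[\s\S]*?<\/\1>/gi, '') // drop script/style bodies
    .replace(/<[^>]+>/g, '')                          // drop every remaining tag
    .replace(/&(?:amp|lt|gt|quot|nbsp|ndash|#160);/g, (e) => entities[e])
    .replace(/[ \t]+/g, ' ')                          // collapse runs of spaces
    .trim();
}

// Hand-written sample resembling rendered output (assumption, not real API data).
const sample =
  '<ul><li><a href="/wiki/293" title="293">293</a> &ndash; Roman Emperors ' +
  '<a href="/wiki/Diocletian">Diocletian</a> and <a href="/wiki/Maximian">Maximian</a> ' +
  'appoint <a href="/wiki/Galerius">Galerius</a> as ' +
  '<i><a href="/wiki/Caesar_(title)">Caesar</a></i>.</li></ul>';

console.log(htmlToText(sample));
```

Because the server has already expanded templates into HTML, no template handling is needed on the client side, which is the main point of this answer.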

answered 2012-05-22T06:52:51.163
1

This should do it :-)

NSString *stringToParse = @"{\"query\":{\"normalized\":[{\"from\":\"May_21\",\"to\":\"May 21\"}],\"pages\":{\"19684\":{\"pageid\":19684,\"ns\":0,\"title\":\"May 21\",\"revisions\":[{\"*\":\"==Events==\\n* [[293]] &ndash; Roman Emperors [[Diocletian]] and [[Maximian]] appoint [[Galerius]] as [[Caesar (title)|''Caesar'']] to Diocletian, beginning the period of four rulers known as the [[Tetrarchy]].\\n* [[878]] &ndash; [[Syracuse, Italy]], is [[Muslim conquest of Sicily|captured]] by the ...";

// Replace the &ndash; entity with -
// (match the trailing semicolon too, otherwise a stray ";" is left behind)
stringToParse = [stringToParse stringByReplacingOccurrencesOfString:@"&ndash;" withString:@"-"];

// [[Caesar (title)|''Caesar'']] should become Caesar,
// [[Maximian]] should become Maximian,
// and likewise [[1972]] -> 1972
NSString *regexToReplaceWikiLinks = @"\\[\\[([A-Za-z0-9_ ()]+?\\|)?(\\'\\')?(.+?)(\\'\\')?\\]\\]";

NSError *error = nil;
NSRegularExpression *regex = [NSRegularExpression regularExpressionWithPattern:regexToReplaceWikiLinks
                                                                       options:NSRegularExpressionCaseInsensitive
                                                                         error:&error];

// Note: each match is replaced with the third capture group ($3), i.e. the visible link text
NSString *modifiedString = [regex stringByReplacingMatchesInString:stringToParse
                                                           options:0
                                                             range:NSMakeRange(0, [stringToParse length])
                                                      withTemplate:@"$3"];

NSLog(@"%@", modifiedString);

The result is:

{"query":{"normalized":[{"from":"May_21","to":"May 21"}],"pages":{"19684":{"pageid":19684,"ns":0,"title":"May 21","revisions":[{"*":"==Events==\n* 293 - Roman Emperors Diocletian and Maximian appoint Galerius as Caesar to Diocletian, beginning the period of four rulers known as the Tetrarchy.\n* 878 - Syracuse, Italy, is captured by the ...
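
The code above runs its replacements over the raw JSON string, escapes included. A possibly more robust variant is to decode the JSON first and clean only the wikitext. The sketch below illustrates that idea in JavaScript rather than Objective-C (where NSJSONSerialization would play the role of JSON.parse). The response object is a shortened stand-in for the real API output, and extractWikitext is a hypothetical helper name.

```javascript
// Shortened stand-in for the API response (assumption: same shape as above).
const responseText = JSON.stringify({
  query: {
    pages: {
      '19684': {
        pageid: 19684,
        ns: 0,
        title: 'May 21',
        revisions: [
          { '*': "==Events==\n* [[293]] &ndash; Roman Emperors [[Diocletian]] and [[Maximian]] appoint [[Galerius]]." },
        ],
      },
    },
  },
});

// Hypothetical helper: return the wikitext of the first page in the response.
// The page id is not known in advance, so take the first value of `pages`.
function extractWikitext(jsonText) {
  const pages = JSON.parse(jsonText).query.pages;
  const firstPage = Object.values(pages)[0];
  return firstPage.revisions[0]['*'];
}

const wikitext = extractWikitext(responseText);

// Entity cleanup now touches only page text, never JSON keys or escapes.
const cleaned = wikitext.replace(/&ndash;/g, '-');

console.log(cleaned);
```

Once decoded, the "\n" escapes are real line breaks, so the text can be split per event line before the link pattern is applied.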
answered 2012-05-27T15:14:15.070
0

Regular expressions are the way to solve this. Here is an example in JavaScript (the same approach works in any language that has regular expressions):

<dl>
    <script type="text/javascript">

        var source = "[[293]] &ndash; Roman Emperors [[Diocletian]] and [[Maximian]] appoint [[Galerius]] as [[Caesar (title)|''Caesar'']] to Diocletian, beginning the period of four rulers known as the [[Tetrarchy]].";

        document.writeln('<dt> Original </dt>');
        document.writeln('<dd>' + source + '</dd>');

        // Replace piped links with their label, with or without '' italics around it
        var matchTitles = /\[\[([^\]|]+?)\|(?:'')?(.+?)(?:'')?\]\]/ig; /* <- Answer */
        source = source.replace(matchTitles, '$2');

        document.writeln('<dt> First Pass </dt>');
        document.writeln('<dd style="color: green;">' + source + '</dd>');

        // Replace links with contents
        var matchLinks = /\[\[(.+?)\]\]/ig;
        source = source.replace(matchLinks, '$1');

        document.writeln('<dt> Second Pass </dt>');
        document.writeln('<dd>' + source + '</dd>');
    </script>
</dl>

You can also see this working here: http://jsfiddle.net/NujmB/
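
The two passes can also be folded into a single replace with a callback. This is a minimal sketch and not the code from the fiddle: stripWikiLinks is a hypothetical helper that keeps the label of a piped link and the target of a plain link, then trims surrounding '' or ''' quote markup. It makes no attempt at templates, file links, or nested markup.

```javascript
// Hypothetical helper: reduce [[target]] and [[target|label]] to visible text.
function stripWikiLinks(wikitext) {
  // Group 1 = link target, group 2 = label after the pipe (may be absent).
  const link = /\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]/g;
  return wikitext.replace(link, (match, target, label) => {
    // Fall back to the target when there is no label (or an empty one).
    const text = label || target;
    // Strip leading/trailing runs of two or more apostrophes (italic/bold).
    return text.replace(/^'{2,}|'{2,}$/g, '');
  });
}

const source =
  "[[293]] &ndash; Roman Emperors [[Diocletian]] and [[Maximian]] appoint " +
  "[[Galerius]] as [[Caesar (title)|''Caesar'']] to Diocletian, beginning " +
  "the period of four rulers known as the [[Tetrarchy]].";

console.log(stripWikiLinks(source));
```

Using a callback keeps the pipe and apostrophe handling in ordinary code, which tends to be easier to extend than a longer pattern.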

answered 2012-05-24T08:40:06.627
0

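I don't know Objective-C, but here is the JavaScript code I use for the same purpose
(it can serve as pseudocode for you and may help other users working in JavaScript):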

var url = 'http://en.wikipedia.org/w/api.php?callback=?&action=parse&page=facebook&prop=text&format=json&section=0';
// section=0 requests only the first section of the wiki page, i.e. the introduction
$.getJSON(url, function (response) {
    // Take only the first paragraph of the introduction
    var intro = $(response.parse.text['*']).filter('p:eq(0)').html();
    var wikiBox = $('#wikipediaBox .wikipedia div.overview');
    wikiBox.empty().html(intro);
    // Convert relative links into absolute ones and make links open in a new window
    wikiBox.find("a:not(.references a)").attr("href", function () { return "http://www.wikipedia.org" + $(this).attr("href"); });
    wikiBox.find("a").attr("target", "_blank");
    // Remove the footnote reference markers
    wikiBox.find('sup.reference').remove();
});
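
For readers not using jQuery, roughly the same request URL can be assembled with the standard URL and URLSearchParams classes. This is a sketch under assumptions: buildParseUrl is a hypothetical helper, and origin=* (the MediaWiki API parameter for anonymous cross-origin requests) stands in for the JSONP callback=? used above.

```javascript
// Hypothetical helper: build an action=parse request for one section of a page.
function buildParseUrl(page, section) {
  const url = new URL('https://en.wikipedia.org/w/api.php');
  url.search = new URLSearchParams({
    action: 'parse',
    page: page,
    prop: 'text',             // rendered HTML rather than wikitext
    section: String(section), // 0 = introduction
    format: 'json',
    origin: '*',              // assumption: anonymous CORS instead of JSONP
  }).toString();
  return url.toString();
}

console.log(buildParseUrl('facebook', 0));
```

The resulting URL could be passed to fetch, after which the HTML would be read from response.parse.text['*'] exactly as in the jQuery version.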
answered 2012-05-25T15:49:20.480