
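This is a question about Objective-C. I wrote a program that downloads the entire HTML of a web page and searches it with a regular expression. I have uploaded the program to GitHub. However, an exception occurs.

The purpose of the program is to extract the "og:image" value by regular-expression matching. This is the image that Facebook displays when a URL is posted. To set this image, a page includes HTML like the following: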

<meta property="og:image"
content="http://business.nikkeibp.co.jp/article/NBD/20120727/235043/zu1.jpg">

So I wrote a program that fetches the whole HTML and finds the og:image part. The code is as follows:

// Web page address
NSURL *url = [NSURL URLWithString:textField.text];

// Get the web page HTML
NSString *string = 
[NSString stringWithContentsOfURL:url encoding:NSUTF8StringEncoding error:nil];

// prepare regular expression to find text
NSError *error   = nil;
NSRegularExpression *regexp =
[NSRegularExpression regularExpressionWithPattern:
 @"<meta property=\"og:image\" content=\".+\""
                                          options:0
                                            error:&error];

@try {
    // find by regular expression
    NSTextCheckingResult *match =
    [regexp firstMatchInString:string options:0 range:NSMakeRange(0, string.length)];

    // get the first result
    NSRange resultRange = [match rangeAtIndex:0];
    NSLog(@"match=%@", [string substringWithRange:resultRange]); 

    if (match) {

        // get the og:image URL from the find result
        NSRange urlRange = NSMakeRange(resultRange.location + 35, resultRange.length - 35 - 1);
        NSURL *urlOgImage = [NSURL URLWithString:[string substringWithRange:urlRange]];
        imageView.image = [UIImage imageWithData:[NSData dataWithContentsOfURL:urlOgImage]];
    }
}

The whole code is on GitHub:

https://github.com/weed/p120728_GetOgImage/blob/master/GetOgImage/ViewController.m

However, this program sometimes throws an exception.

  • Success case: http://www.nicovideo.jp/watch/1343369790

  • Failure case: http://business.nikkeibp.co.jp/article/NBD/20120727/235043/?ST=pc

Screenshots are here: https://github.com/weed/p120728_GetOgImage/blob/master/readme.md

Why does the exception occur? Thank you for your help.


2 Answers


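A friend kindly pointed out that I needed to consider the character encoding. The page at the first URL is encoded in UTF-8, while the page at the second URL is encoded in EUC-JP. Because the original code always decodes with NSUTF8StringEncoding, stringWithContentsOfURL:encoding:error: returns nil for the EUC-JP page, and handing that nil string to the regular expression is most likely what raises the exception.

With the code below I was able to get the og:image of the second URL shown above.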

- (NSString *)encodedStringWithContentsOfURL:(NSURL *)url
{
    // Download the raw bytes of the web page
    NSData *data = [NSData dataWithContentsOfURL:url];
    if (data == nil) {
        return nil;     // download failed, nothing to decode
    }

    // Candidate encodings, tried in this order
    NSStringEncoding enc_arr[] = {
        NSUTF8StringEncoding,           // UTF-8
        NSShiftJISStringEncoding,       // Shift_JIS
        NSJapaneseEUCStringEncoding,    // EUC-JP
        NSISO2022JPStringEncoding,      // JIS
        NSUnicodeStringEncoding,        // Unicode
        NSASCIIStringEncoding           // ASCII
    };
    NSString *data_str = nil;
    NSUInteger max = sizeof(enc_arr) / sizeof(enc_arr[0]);
    for (NSUInteger i = 0; i < max; i++) {
        data_str = [[NSString alloc] initWithData:data encoding:enc_arr[i]];
        if (data_str != nil) {
            break;      // the first encoding that decodes the data wins
        }
    }
    return data_str;    // nil if none of the encodings fit
}
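
To see why a fallback like the one above is needed, here is a minimal sketch of the failure mode. It assumes ARC, and the helper names CanDecode and DemonstrateEncodingMismatch are hypothetical, made up for this illustration. The sketch encodes a short Japanese string as EUC-JP and then tries to decode those bytes as UTF-8, which yields nil, and as EUC-JP, which succeeds.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper for illustration only:
// YES when `data` can be decoded with `encoding`.
static BOOL CanDecode(NSData *data, NSStringEncoding encoding)
{
    // initWithData:encoding: returns nil when the bytes are not valid
    // in the requested encoding.
    return [[NSString alloc] initWithData:data encoding:encoding] != nil;
}

// Hypothetical demo: the same bytes decode under one encoding but not the other.
static void DemonstrateEncodingMismatch(void)
{
    // Bytes of a Japanese string as an EUC-JP page would deliver them
    NSData *eucData = [@"日本語" dataUsingEncoding:NSJapaneseEUCStringEncoding];

    NSLog(@"decodes as UTF-8:  %d", CanDecode(eucData, NSUTF8StringEncoding));
    NSLog(@"decodes as EUC-JP: %d", CanDecode(eucData, NSJapaneseEUCStringEncoding));
}
```

A nil result is the signal to try the next candidate encoding, so any code that consumes the decoded string should check for nil before using it.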

I packaged this character-encoding check as a small library named NSString+Encode. The whole code is on GitHub:

https://github.com/weed/p120728_OgImageLibrary

Answered 2012-07-28T10:00:02.060

It looks like your regular expression is not matching anything on the second page. Have you tested that page's HTML source against your regular expression in a regex tester?

Something like this should do the trick: http://regexpal.com/
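
Independently of which tester you use, it may help to make the code tolerate a failed match. The following is only a sketch under a few assumptions: the function name OGImageURLStringFromHTML is hypothetical, the pattern allows any whitespace between the attributes, and a capture group is used in place of fixed character offsets. It returns nil, without raising, when the HTML string is nil or when nothing matches.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not part of the original project): returns the
// og:image URL string found in `html`, or nil when there is no match.
static NSString *OGImageURLStringFromHTML(NSString *html)
{
    // A nil string usually means the download or decoding failed.
    // NSRegularExpression raises on a nil string, so bail out early.
    if (html == nil) {
        return nil;
    }

    NSError *error = nil;
    NSRegularExpression *regexp =
    [NSRegularExpression regularExpressionWithPattern:
     @"<meta\\s+property=\"og:image\"\\s+content=\"([^\"]+)\""
                                              options:NSRegularExpressionCaseInsensitive
                                                error:&error];
    if (regexp == nil) {
        NSLog(@"Invalid pattern: %@", error);
        return nil;
    }

    NSTextCheckingResult *match =
    [regexp firstMatchInString:html options:0 range:NSMakeRange(0, html.length)];
    if (match == nil) {
        return nil;     // no og:image tag in the form this pattern expects
    }

    // Capture group 1 holds the text between the quotes of content="..."
    return [html substringWithRange:[match rangeAtIndex:1]];
}
```

Using `[^\"]+` rather than `.+` stops the match at the first closing quote, so the result cannot run on into later attributes on the same line.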

Answered 2012-07-28T09:07:43.220