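I'm thinking about building an app that displays a monthly journal. The journal has no XML feed; each month they simply change the PDF's title and URL. These are always stored in the same place in the page source, so I want to find the
<div class="entry clearfix post"></div>
tag and then extract the first URL inside it. I have worked with XML parsing before, but I have never parsed HTML. What is my best option?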
Update:
The page says "To Download the PDF, click here" at only one point in its source, so I set up the following scanner:
NSURL *url = [NSURL URLWithString:@"http://www.thejenkinsinstitute.com/Journal/"];
NSString *content = [NSString stringWithContentsOfURL:url];
NSString * aString = content;
NSMutableArray *substrings = [NSMutableArray new];
NSScanner *scanner = [NSScanner scannerWithString:aString];
[scanner scanUpToString:@"<p>To Download the PDF, <a href=\"http://michaelwhitworth.com/wp-content/HE22.pdf\">" intoString:nil]; // Scan all characters before #
while(![scanner isAtEnd]) {
    NSString *substring = nil;
    [scanner scanString:@"<p>To Download the PDF, <a href=\"" intoString:nil]; // Scan the # character
    if([scanner scanUpToString:@"\"" intoString:&substring]) {
        // If the space immediately followed the #, this will be skipped
        [substrings addObject:substring];
    }
    [scanner scanUpToString:@"#" intoString:nil]; // Scan all characters before next #
}
NSLog(@"Here is the Substring%@", substrings);
// do something with substrings
[substrings release];
In the console, the first item returned is the PDF's URL, but the array contains a lot more than that. Here is a short excerpt:
"2012-11-23 15:33:36.383 Jenkins[8306:c07] Here is the Substring(
"http://michaelwhitworth.com/wp-content/HE22.pdf",
"#8220;As the Bible School Goes So Goes the Congregation” by Ira North</a></p>\n<p style=","
What am I doing wrong that keeps it from giving me just the URL and nothing else?
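
One possible reading of the output above: the while loop keeps running after the first match. Each pass skips ahead to the next "#" (here the one inside "&#8220;"), the scanString: call for the link prefix fails silently, and the following scanUpToString: collects everything up to the next double quote. Below is a minimal sketch of a single-pass alternative, assuming the marker text appears only once on the page. The helper name FirstPDFLink and the sample HTML string are hypothetical and used only for illustration.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper (not from the original code): returns the first href
// value that follows the marker text, or nil if the marker is not found.
static NSString *FirstPDFLink(NSString *html) {
    NSString *marker = @"To Download the PDF, <a href=\"";
    NSScanner *scanner = [NSScanner scannerWithString:html];
    scanner.charactersToBeSkipped = nil; // treat whitespace as significant

    // Advance to the marker. If the scanner ends up at the end of the
    // string, the marker was never found.
    [scanner scanUpToString:marker intoString:NULL];
    if ([scanner isAtEnd]) {
        return nil;
    }

    // Consume the marker itself, then read up to the closing quote.
    NSString *link = nil;
    if ([scanner scanString:marker intoString:NULL] &&
        [scanner scanUpToString:@"\"" intoString:&link]) {
        return link;
    }
    return nil;
}

int main(void) {
    @autoreleasepool {
        // Made-up sample markup that mimics the structure described above.
        NSString *sample =
            @"<p>To Download the PDF, <a href=\"http://michaelwhitworth.com/wp-content/HE22.pdf\">"
            @"click here</a></p><p>&#8220;Some title&#8221;</p>";
        NSLog(@"First link: %@", FirstPDFLink(sample));
    }
    return 0;
}
```

Because the scan happens once and stops at the first closing quote, the later "#" characters in the page are never visited. If the wording of the sentence on the page changes, the marker string would need to change with it.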