objective-c - Iteratively parsing inner HTML with the Hpple parser and NSXMLParser

Question

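I have been developing a school newspaper app for the iPad. I am using NSXMLParser to get each article's title, short description, and link. To get the HTML content behind each parsed link, I decided to use the Hpple parser. I believe I am parsing and storing the RSS items correctly, but when I use a for loop to parse the HTML from each parsed link, I am told that my array of RSS items is empty. However, I can print the contents of the RSS item holder to the console, so it is not empty. I am posting part of my code and the console output below. Please help; the deadline for this project is soon. Thanks in advance.

Here is how I start loading my RSS parser (articleParser):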

- (void)loadData {
    [self loadInitData];

    //[self loadDataWithLink];

}

- (void)loadInitData {
    if (sections == nil) {
        [activityIndicator startAnimating];

        NSLog(@"STARTING ARTICLE PARSER FROM MAIN URL!!!");

        Parser *articleParser = [[Parser alloc] init];
        [articleParser parseRssFeed:@"http://theaggie.org/rss/headlines.xml" withDelegate:self];
        [articleParser release];
    } else {

    }

}

Here is how I store the received article items in an NSMutableArray called sections. I then use a for loop to go through the link of each parsed article.

- (void)receivedArticleItems:(Article *)theArticle {
    if (sections == nil) {
        sections = [[NSMutableArray alloc] init];
    }
    [sections addObject:theArticle];

    NSLog(@"We recieved the article!");
    NSLog(@"Article: %@", theArticle);
    NSLog(@"What is in sections: %@", sections);

for (int i = 1; i < 5; i++) {
        NSLog(@"articleItems: %@",[sections objectAtIndex:0]);
        NSLog(@"articleItems at index 0: %@",[[[sections objectAtIndex:0] articleItems] objectAtIndex:0]);

        [self loadDataWithLink:[[[[sections objectAtIndex:0] articleItems] objectAtIndex:0] objectForKey:@"link"]];
    }
    [activityIndicator stopAnimating];
}

Here is how I use the TFHpple parser to get the HTML content from each parsed link:

- (void)loadDataWithLink:(NSString *)urlString{

 NSData *htmlData = [NSData dataWithContentsOfURL:[NSURL URLWithString:urlString]];

 // Create parser
 TFHpple *xpathParser = [[TFHpple alloc] initWithHTMLData:htmlData];

 //Get all the cells main body
 htmlElements  = [xpathParser search:@"//div[@id='main']/div[@id='mainCol1']/div[@id='main-body']"];

 // Access the first cell
 TFHppleElement *htmlElement = [htmlElements objectAtIndex:0];

 // NSString *title = [htmlElement content];

 NSLog(@"What is in element: %@", htmlElement);

 [xpathParser release];
 //[htmlData release];
}

This is what I get on the console:

2011-05-02 22:58:35.355 TheCalAggie[2443:207] Parsing started for article!
2011-05-02 22:58:35.356 TheCalAggie[2443:207] Adding story title: Students say, 'No time for books'
2011-05-02 22:58:35.356 TheCalAggie[2443:207] From the link: http://theaggie.org/article/2011/05/03/students-say-no-time-for-books
2011-05-02 22:58:35.357 TheCalAggie[2443:207] Summary: The last book managerial economics major Kiyan Parsa read for fun was The Lord of the Rings. That was in high school.
2011-05-02 22:58:35.358 TheCalAggie[2443:207] Published on: Tue, 03 May 2011 00:00:00 -0700
2011-05-02 22:58:35.359 TheCalAggie[2443:207] Parsing started for article!
2011-05-02 22:58:35.360 TheCalAggie[2443:207] Adding story title: UC Davis craft center one of largest college crafting centers
2011-05-02 22:58:35.360 TheCalAggie[2443:207] From the link: http://theaggie.org/article/2011/05/02/uc-davis-craft-center-one-of-largest-college-crafting-centers
2011-05-02 22:58:35.361 TheCalAggie[2443:207] Summary: Hidden away in the South Silo, the UC Davis Craft Center offers 10 craft studios and more than a hundred classes for students looking to learn or perfect their crafting skills.
2011-05-02 22:58:35.362 TheCalAggie[2443:207] Published on: Mon, 02 May 2011 00:00:00 -0700
2011-05-02 22:58:35.362 TheCalAggie[2443:207] We recieved the article!
2011-05-02 22:58:35.363 TheCalAggie[2443:207] Article: *nil description*
2011-05-02 22:58:35.364 TheCalAggie[2443:207] What is in sections: (
    (null)
)
2011-05-02 22:58:35.374 TheCalAggie[2443:207] articleItems: *nil description*
2011-05-02 22:58:35.375 TheCalAggie[2443:207] articleItems at index 0: {
    link = "http://theaggie.org/article/2011/05/03/peaceful-rally-held-on-campus-after-killing-of-bin-laden\n";
    pubDate = "Tue, 03 May 2011 00:00:00 -0700";
    summary = "The announcement of Osama bin Laden's death sent a wave of patriotism across the nation and UC Davis. Bin Laden was the leader of al-Qaeda - the organization allegedly behind the Sept. 11, 2001 attacks that killed over 3,000 Americans.\n";
    title = "Peaceful rally held on campus after killing of bin Laden \n";
}
2011-05-02 22:59:35.376 TheCalAggie[2443:207] Unable to parse.
2011-05-02 22:59:35.379 TheCalAggie[2443:207] *** Terminating app due to uncaught exception 'NSRangeException', reason: '*** -[NSMutableArray objectAtIndex:]: index 0 beyond bounds for empty array'
*** Call stack at first throw:

Any help would be greatly appreciated. Thanks again.

score 3 · Accepted Answer

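2011-05-02 22:59:35.376 TheCalAggie[2443:207] Unable to parse.

The parser is struggling to parse the HTML. This parser is not perfect at handling HTML, and running XPath queries over a document that may be broken or invalid is a complicated business.

Passing the link you are trying to parse through the W3C validator raises a number of errors, so the page is not entirely valid HTML. Whether it is too broken for this parser to handle is something you will have to debug and find out. To really get to the bottom of it, set breakpoints inside the TFHpple parser you are using and see what it is doing.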
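
While debugging, it may help to make the code fail gracefully instead of crashing. The log above ends in an NSRangeException from objectAtIndex: on an empty array, and the one link it prints ends with "\n". The sketch below uses only Foundation and shows two small guards: trimming the link before building an NSURL, and checking that a result array is non-empty before reading index 0. The helper names CleanedURLFromLink and FirstObjectOrNil are hypothetical and not part of Hpple or the question's code, and whether the trailing newline actually contributes to the failed parse is an assumption that would need checking.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: strips surrounding whitespace and newlines from a
// link taken from the RSS feed, then builds an NSURL from it.
// Returns nil when the link is nil or empty after trimming.
static NSURL *CleanedURLFromLink(NSString *rawLink) {
    NSCharacterSet *blanks = [NSCharacterSet whitespaceAndNewlineCharacterSet];
    NSString *trimmed = [rawLink stringByTrimmingCharactersInSet:blanks];
    if ([trimmed length] == 0) {
        return nil;
    }
    return [NSURL URLWithString:trimmed];
}

// Hypothetical helper: returns the first object of an array, or nil when
// the array is nil or empty, instead of raising NSRangeException.
static id FirstObjectOrNil(NSArray *items) {
    if ([items count] == 0) {
        return nil;
    }
    return [items objectAtIndex:0];
}
```

In loadDataWithLink:, the URL could then come from CleanedURLFromLink(urlString), and the result of the XPath search could be passed through FirstObjectOrNil, logging and returning early when either one yields nil.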

score 0 · Accepted Answer

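Damian is right. You first have to fix the HTML for your code to work. The data it parses differs from run to run, which shows that the HTML is faulty. So the code may work in some cases. Try running it a few times and you will see that it works occasionally.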
