OK, so I'm building a web crawler that fetches web pages and converts them into paragraphs of text. To strip the tags themselves, I found this on Stack Overflow:

- (NSString *) stripTags:(NSString *)str
{
    NSMutableString *ms = [NSMutableString stringWithCapacity:[str length]];

    NSScanner *scanner = [NSScanner scannerWithString:str];
    [scanner setCharactersToBeSkipped:nil];
    NSString *s = nil;
    while (![scanner isAtEnd])
    {
        [scanner scanUpToString:@"<" intoString:&s];
        if (s != nil)
            [ms appendString:s];
        [scanner scanUpToString:@">" intoString:NULL];
        if (![scanner isAtEnd])
            [scanner setScanLocation:[scanner scanLocation]+1];
        s = nil;
    }

    return ms;
}

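It works, but it only removes the tags themselves, not the content between script and style tags (obviously I don't want to remove the content between all tags, since that would leave an empty string).

Is there any way to cut out the script and style elements specifically?

Thanks very much in advance.

Edit:

I tried changing my code to: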

- (NSString *) stripTags:(NSString *)str
{
    NSMutableString *ms = [NSMutableString stringWithCapacity:[str length]];

    NSScanner *scanner = [NSScanner scannerWithString:str];
    [scanner setCharactersToBeSkipped:nil];
    NSString *s = nil;
    while (![scanner isAtEnd])
    {
        [scanner scanUpToString:@"<script" intoString:&s];
        if (s != nil)
            [ms appendString:s];
        [scanner scanUpToString:@"script>" intoString:NULL];
        if (![scanner isAtEnd])
            [scanner setScanLocation:[scanner scanLocation]+1];
        [scanner scanUpToString:@"<" intoString:&s];
        if (s != nil)
            [ms appendString:s];
        [scanner scanUpToString:@">" intoString:NULL];
        if (![scanner isAtEnd])
            [scanner setScanLocation:[scanner scanLocation]+1];
        s = nil;
    }

    return ms;
}

But the output still contains the scripts and CSS.

1 Answer

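You can edit the scanner code so that it inspects each tag. If the tag is one you want to remove, scan ahead to its closing tag and discard the string. If it isn't, append the string as before.

Read the tag opening ("<"), then read the tag itself so you can check what it is. Then read up to the tag close and either drop the content or keep it.

Start with something like this (typed inline and not tested in any way):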

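NSString *t = nil;   // tag contents; s, ms and scanner are as in the question
while (![scanner isAtEnd])
{
    // Copy the text that precedes the next tag.
    s = nil;
    [scanner scanUpToString:@"<" intoString:&s];
    if (s != nil)
        [ms appendString:s];
    if ([scanner isAtEnd])
        break;
    // Step over "<" so that t starts at the tag name, then read up to ">".
    [scanner setScanLocation:[scanner scanLocation]+1];
    t = nil;
    [scanner scanUpToString:@">" intoString:&t];
    if (![scanner isAtEnd])
        [scanner setScanLocation:[scanner scanLocation]+1];
    // hasPrefix: also matches tags that carry attributes.
    if ([[t lowercaseString] hasPrefix:@"tagtoignore"]) {
        // Discard everything up to the closing tag, then the closing tag itself.
        [scanner scanUpToString:@"</tagToIgnore" intoString:NULL];
        [scanner scanUpToString:@">" intoString:NULL];
        if (![scanner isAtEnd])
            [scanner setScanLocation:[scanner scanLocation]+1];
    }
}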
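
For the script/style case in the question, the same idea can be sketched as a self-contained function. This is only a rough sketch, not a real HTML parser: the names StripTagsIgnoring and ignoredTags are made up for illustration, it assumes NSScanner's default case-insensitive matching when looking for the closing tag, and it will be confused by a literal "<" in the text or a ">" inside a comment or attribute value.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not from the original post): removes all tags from
// `html` and also drops the contents of any element listed in `ignoredTags`
// (lowercase names, e.g. @"script").
static NSString *StripTagsIgnoring(NSString *html, NSArray *ignoredTags)
{
    NSMutableString *result = [NSMutableString stringWithCapacity:[html length]];
    NSScanner *scanner = [NSScanner scannerWithString:html];
    [scanner setCharactersToBeSkipped:nil];
    // Characters that can end a tag name.
    NSCharacterSet *nameEnd =
        [NSCharacterSet characterSetWithCharactersInString:@" \t\r\n/>"];

    while (![scanner isAtEnd]) {
        // Keep the text in front of the next tag.
        NSString *text = nil;
        if ([scanner scanUpToString:@"<" intoString:&text])
            [result appendString:text];
        if ([scanner isAtEnd])
            break;

        // Step over "<" and read the tag name (nil for closing tags like "</p>").
        [scanner setScanLocation:[scanner scanLocation] + 1];
        NSString *name = nil;
        [scanner scanUpToCharactersFromSet:nameEnd intoString:&name];
        name = [name lowercaseString];

        // Skip the rest of the tag, including ">".
        [scanner scanUpToString:@">" intoString:NULL];
        if (![scanner isAtEnd])
            [scanner setScanLocation:[scanner scanLocation] + 1];

        if (name != nil && [ignoredTags containsObject:name]) {
            // Discard the element body, then its closing tag.
            NSString *closing = [NSString stringWithFormat:@"</%@", name];
            [scanner scanUpToString:closing intoString:NULL];
            [scanner scanUpToString:@">" intoString:NULL];
            if (![scanner isAtEnd])
                [scanner setScanLocation:[scanner scanLocation] + 1];
        }
    }
    return result;
}
```

Calling StripTagsIgnoring(@"<p>Hello</p><script>var x = 1;</script><style>p{}</style> world", @[@"script", @"style"]) should give "Hello world". Passing the tag names as an array keeps the loop the same if you later want to drop other elements as well.
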
answered 2013-07-21T21:03:17.567