ios - XPath to extract text with multiple tags using libxml2 in iOS

Question

In an iOS app that uses libxml2, I am parsing this HTML fragment (it is part of a larger page):

...
<span class="ingredient">
    <span class="amount">
        <span class="value">500 </span> 
        <span class="type">g</span>
    </span>    
    <a href="...">bread flour</a> 
    or 
    <span class="ingredient">
        <span class="amount">
            <span class="value">500 </span> 
            <span class="type">g</span>
        </span>  
        <span class="name">
            <a href="...">all-purpose flour</a>
        </span>
    </span>
</span>
...

I need to extract just the text: "500 g bread flour or 500 g all-purpose flour".

The XPath query //span[@class="ingredient"] returns the following parsed NSDictionary:

{
    nodeAttributeArray =     (
                {
            attributeName = class;
            nodeContent = ingredient;
        }
    );
    nodeChildArray =     (
                {
            nodeAttributeArray =             (
                                {
                    attributeName = class;
                    nodeContent = amount;
                }
            );
            nodeChildArray =             (
                                {
                    nodeAttributeArray =                     (
                                                {
                            attributeName = class;
                            nodeContent = value;
                        }
                    );
                    nodeContent = 500;
                    nodeName = span;
                },
                                {
                    nodeAttributeArray =                     (
                                                {
                            attributeName = class;
                            nodeContent = type;
                        }
                    );
                    nodeContent = g;
                    nodeName = span;
                }
            );
            nodeContent = "";
            nodeName = span;
        },
                {
            nodeAttributeArray =             (
                                {
                    attributeName = href;
                    nodeContent = "http://www.food.com/library/flour-64";
                }
            );
            nodeContent = "bread flour";
            nodeName = a;
        },
                {
            nodeAttributeArray =             (
                                {
                    attributeName = class;
                    nodeContent = ingredient;
                }
            );
            nodeChildArray =             (
                                {
                    nodeAttributeArray =                     (
                                                {
                            attributeName = class;
                            nodeContent = amount;
                        }
                    );
                    nodeChildArray =                     (
                                                {
                            nodeAttributeArray =                             (
                                                                {
                                    attributeName = class;
                                    nodeContent = value;
                                }
                            );
                            nodeContent = 500;
                            nodeName = span;
                        },
                                                {
                            nodeAttributeArray =                             (
                                                                {
                                    attributeName = class;
                                    nodeContent = type;
                                }
                            );
                            nodeContent = g;
                            nodeName = span;
                        }
                    );
                    nodeContent = "";
                    nodeName = span;
                },
                                {
                    nodeAttributeArray =                     (
                                                {
                            attributeName = class;
                            nodeContent = name;
                        }
                    );
                    nodeChildArray =                     (
                                                {
                            nodeAttributeArray =                             (
                                                                {
                                    attributeName = href;
                                    nodeContent = "http://www.food.com/library/flour-64";
                                }
                            );
                            nodeContent = "all-purpose flour";
                            nodeName = a;
                        }
                    );
                    nodeContent = "";
                    nodeName = span;
                }
            );
            nodeContent = "";
            nodeName = span;
        }
    );
    nodeContent = or;
    nodeName = span;
}

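The problem is that the "nodeContent" of the root dictionary is the text "or", and all the tags are children of the root node, so the order of the fragments is lost: I cannot tell that "or" actually sits in the middle. When I concatenate all the text, I get the following string: "or 500 g bread flour 500 g all-purpose flour".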
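The loss of ordering can be reproduced with a small sketch. Python is used here only for illustration. The `flattened` structure is a hypothetical, abbreviated copy of the dictionary above (attributes and node names dropped), and `concat` is a hypothetical helper, not part of any parser wrapper.

```python
# Hypothetical, abbreviated copy of the dictionary above: attributes and
# node names are dropped, only nodeContent and nodeChildArray are kept.
flattened = {
    "nodeContent": "or",
    "nodeChildArray": [
        {"nodeContent": "", "nodeChildArray": [
            {"nodeContent": "500"},
            {"nodeContent": "g"},
        ]},
        {"nodeContent": "bread flour"},
        {"nodeContent": "", "nodeChildArray": [
            {"nodeContent": "", "nodeChildArray": [
                {"nodeContent": "500"},
                {"nodeContent": "g"},
            ]},
            {"nodeContent": "", "nodeChildArray": [
                {"nodeContent": "all-purpose flour"},
            ]},
        ]},
    ],
}


def concat(node):
    """Join a node's own text, then the text of each child, depth-first."""
    parts = [node.get("nodeContent", "")]
    parts.extend(concat(child) for child in node.get("nodeChildArray", []))
    return " ".join(part for part in parts if part)


print(concat(flattened))  # or 500 g bread flour 500 g all-purpose flour
```

Because the element's own text is stored in one field, separate from its children, nothing in the structure records that "or" came after the link and before the nested span.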

Can anyone think of a way to extract the plain text in a single XPath query, or to use the XPath engine to read an ordered list of elements?

score 0 · Accepted Answer

Since you want all the text nodes, this is easily done with

//text()

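This returns all text nodes, in document order. Your content contains a lot of whitespace, which causes some trouble. With an XPath 2.0 engine you can omit all whitespace-only nodes using

//text()[not(matches(., '^\s+$'))]

Note that matches() is an XPath 2.0 function, while libxml2 implements only XPath 1.0. The XPath 1.0 equivalent is

//text()[normalize-space()]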

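Afterwards you will still need to do some trimming in Objective-C (for example around "g"), but you should get an ordered result set of all text nodes that contain printable characters.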
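The approach above (visit text nodes in document order, drop the whitespace-only ones, trim the rest) can be sketched as follows. This is an illustration in Python with the standard-library `xml.etree.ElementTree` rather than libxml2, and it assumes the fragment is well-formed enough to parse as XML. `FRAGMENT` and `ordered_text` are names made up for this sketch.

```python
import xml.etree.ElementTree as ET

# The fragment from the question; the elided href values are kept as "...".
FRAGMENT = """
<span class="ingredient">
    <span class="amount">
        <span class="value">500 </span>
        <span class="type">g</span>
    </span>
    <a href="...">bread flour</a>
    or
    <span class="ingredient">
        <span class="amount">
            <span class="value">500 </span>
            <span class="type">g</span>
        </span>
        <span class="name">
            <a href="...">all-purpose flour</a>
        </span>
    </span>
</span>
"""


def ordered_text(fragment):
    """Return the trimmed, non-blank text pieces in document order."""
    root = ET.fromstring(fragment)
    # itertext() walks the element's text and tail strings in document
    # order, which plays the role of the text() node set here.
    return [piece.strip() for piece in root.itertext() if piece.strip()]


pieces = ordered_text(FRAGMENT)
print(" ".join(pieces))  # 500 g bread flour or 500 g all-purpose flour
```

The filter `if piece.strip()` corresponds to the whitespace-only predicate, and the `strip()` on each kept piece corresponds to the trimming step that would otherwise be done in Objective-C.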
