xml - Web-Harvest: removing unusual characters

Question

I'm trying to scrape a page that has some non-breaking spaces after an anchor:

</a>&nbsp;&nbsp;|&nbsp;&nbsp;

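I can't seem to find a way to specify that text: I either trigger a processor error or fail to detect the string itself. Everything after it makes the html-to-xml conversion fail, because the XML is not well-formed when those characters are included. So I need to remove everything after it (note that in other parts of the document there are div tags or other content after it).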

My code:

<xpath expression="/">
     <regexp replace="true">
            <regexp-pattern>(nbsp;)</regexp-pattern>
                <regexp-source>
                    <html-to-xml omitcomments="true" advancedxmlescape="true" prunetags="head,script,meta,meta ,p,base,br,link,img,image,input,option,nbsp;">
                       <http url="http://mysite.org/map/aindex/" method="get" />
                    </html-to-xml>
                </regexp-source>
                <regexp-result>
                    <template></template>
                </regexp-result>
      </regexp>
</xpath>

I think my problem is with the regex pattern. I've tried:



 &nbsp;  
    \& nbsp;  (without the space in between; SO doesn't display that correctly)
    \s+\|\s+

among other things. I even tried putting the expression in a CDATA element, but I couldn't get that to work either.

Any ideas?

score 2 · Accepted Answer

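For the regex pattern, you can try using \u00A0, the Unicode escape for the non-breaking space character (U+00A0) that &nbsp; stands for.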
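
As a rough sketch of how that suggestion might slot into the configuration from the question: the structure and URL below are copied from the question, and only the pattern is swapped. The `prunetags` list is trimmed of the `nbsp;` and duplicate `meta ` entries, since neither is a tag name. This assumes the pattern is handed to a Java-style regex engine that understands `\uXXXX` escapes, and that the entity has already been decoded to the U+00A0 character by the time the regexp processor sees the text. Neither assumption is confirmed by the question.

```xml
<!-- Sketch only: same layout as the question's config, with the pattern changed -->
<xpath expression="/">
    <regexp replace="true">
        <!-- \u00A0 is the regex escape for U+00A0, the non-breaking space -->
        <regexp-pattern>\u00A0</regexp-pattern>
        <regexp-source>
            <html-to-xml omitcomments="true" advancedxmlescape="true" prunetags="head,script,meta,p,base,br,link,img,image,input,option">
                <http url="http://mysite.org/map/aindex/" method="get" />
            </html-to-xml>
        </regexp-source>
        <regexp-result>
            <!-- Empty template: every match is replaced with nothing -->
            <template></template>
        </regexp-result>
    </regexp>
</xpath>
```

Writing the character as a regex escape keeps the literal `&nbsp;` out of the configuration file, where an XML parser would likely reject it as an undeclared entity.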

Answered 2012-12-08T22:21:01.500
