ruby - How to get the text between two strings with special characters in Ruby?

Question

I have a string containing HTML code (@description), and I want to extract the content between two elements. It looks like this:

<b>Content title</b><br/>
*All the content I want to extract*
<a href="javascript:print()">

I have managed to do something like this:

@want = @description.match(/Content title(.*?)javascript:print()/m)[1].strip

But obviously this solution is far from perfect, because there are some unwanted characters in my @want string.

Thanks for your help.

Edit:

As requested in the comments, here is the full code:

I am already parsing an HTML document, and I have the following code:

@description = @doc.at_css(".entry-content").to_s
puts @description

which returns:

<div class="post-body entry-content">
<a href="http://www.photourl"><img alt="Photo title" height="333"     src="http://photourl.com" width="500"></a><br><br><div style="text-align: justify;">
Some text</div>
<b>More text</b><br><b>More text</b><br><br><ul>
<li>Numered item</li>
<li>Numered item</li>
<li>Numered item</li>
</ul>
<br><b>Content Title</b><br>
Some text<br><br>
Some text(with links and images)<br>
Some text(with links and images)<br>
Some text(with links and images)<br>
<br><br><a href="javascript:print()"><img src="http://url.com/photo.jpg"></a>
<div style="clear: both;"></div>
</div>

The text can contain more paragraphs, links, images, etc., but it always starts with the "Content Title" part and ends with the javascript reference.

score 1 · Accepted Answer

This XPath expression selects all (sibling) nodes between the nodes $vStart and $vEnd:

  $vStart/following-sibling::node()
           [count(.|$vEnd/preceding-sibling::node())
           =
            count($vEnd/preceding-sibling::node())
           ]

To obtain the full XPath expression to use in your particular case, simply replace $vStart with:

/*/b[. = 'Content Title']

and replace $vEnd with:

/*/a[@href = 'javascript:print()']

The final XPath expression after the substitutions is:

/*/b[. = 'Content Title']/following-sibling::node()
         [count(.|/*/a[@href = 'javascript:print()']/preceding-sibling::node())
         =
          count(/*/a[@href = 'javascript:print()']/preceding-sibling::node())
         ]
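
For readers who want to stay in Ruby, here is a minimal sketch of the same "everything between the two nodes" selection, done by walking siblings rather than by evaluating the XPath expression above. It assumes REXML and a shortened, well-formed stand-in for the question's markup; the `at_css` call in the question suggests Nokogiri, whose node API differs, so treat the method names below as specific to REXML. The variable names are illustrative only.

```ruby
require "rexml/document"

# Hypothetical, well-formed stand-in for the markup in the question.
xml = <<~XML
  <div class="entry-content">
    <b>More text</b><br/>
    <b>Content Title</b><br/>
    Some text<br/>
    Some text (with links)<br/>
    <a href="javascript:print()">print</a>
    <div style="clear: both;"></div>
  </div>
XML

root = REXML::Document.new(xml).root

# Plain-Ruby counterparts of the two boundary predicates.
v_start = root.elements.find { |e| e.name == "b" && e.text == "Content Title" }
v_end   = root.elements.find { |e| e.name == "a" && e.attributes["href"] == "javascript:print()" }

# Collect every sibling after v_start, stopping just before v_end.
between = []
node = v_start.next_sibling_node
until node.nil? || node.equal?(v_end)
  between << node
  node = node.next_sibling_node
end

# Keep only the non-blank text nodes.
texts = between.grep(REXML::Text).map { |t| t.value.strip }.reject(&:empty?)
puts texts
```

The `between` array still holds the `<br/>` elements, so the markup can be kept instead of the bare text if that is what is needed.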

Explanation:

This is a simple corollary of the Kayessian formula for the intersection of two node-sets $ns1 and $ns2:

$ns1[count(.|$ns2) = count($ns2)]

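In our case, the set of all nodes between the nodes $vStart and $vEnd is the intersection of two node-sets: all following siblings of $vStart and all preceding siblings of $vEnd.

XSLT-based verification: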
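
To make the counting trick concrete, here is a small sketch that mimics the formula with plain Ruby arrays. The symbols are made-up labels standing in for the child nodes of the div (an assumption for illustration, not a real DOM), and Array#| plays the role of the XPath union operator.

```ruby
# Hypothetical sibling list; each symbol stands for one child node of the <div>.
siblings = %i[b_more br1 b_title br2 text1 br3 text2 br4 a_print div_clear]

start_index = siblings.index(:b_title)   # position of $vStart
end_index   = siblings.index(:a_print)   # position of $vEnd

ns1 = siblings.drop(start_index + 1)     # following siblings of $vStart
ns2 = siblings.take(end_index)           # preceding siblings of $vEnd

# $ns1[count(.|$ns2) = count($ns2)]: keep a node of ns1 only if adding it
# to ns2 leaves the size of ns2 unchanged, i.e. the node is already in ns2.
between = ns1.select { |node| ([node] | ns2).size == ns2.size }

p between  # => [:br2, :text1, :br3, :text2, :br4]
```

The result is the same as `ns1 & ns2`; XPath 1.0 has no intersection operator, which is why the formula falls back on counting.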

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <xsl:variable name="vStart" select="/*/b[. = 'Content Title']"/>
 <xsl:variable name="vEnd" select="/*/a[@href = 'javascript:print()']"/>

 <xsl:template match="/">
     <xsl:copy-of select=
     "$vStart/following-sibling::node()
               [count(.|$vEnd/preceding-sibling::node())
               =
                count($vEnd/preceding-sibling::node())
               ]
     "/>
==============

     <xsl:copy-of select=
     "/*/b[. = 'Content Title']/following-sibling::node()
               [count(.|/*/a[@href = 'javascript:print()']/preceding-sibling::node())
               =
                count(/*/a[@href = 'javascript:print()']/preceding-sibling::node())
               ]
     "/>
 </xsl:template>
</xsl:stylesheet>

When this transformation is applied to the provided document (converted to well-formed XML):

<div class="post-body entry-content">
    <a href="http://www.photourl">
        <img alt="Photo title" height="333"     src="http://photourl.com" width="500"/>
    </a>
    <br />
    <br />
    <div style="text-align: justify;">
    Some text</div>
    <b>More text</b>
    <br />
    <b>More text</b>
    <br />
    <br />
    <ul>
        <li>Numered item</li>
        <li>Numered item</li>
        <li>Numered item</li>
    </ul>
    <br />
    <b>Content Title</b>
    <br />
    Some text
    <br />
    <br />
    Some text(with links and images)
    <br />
    Some text(with links and images)
    <br />
    Some text(with links and images)
    <br />
    <br />
    <br />
    <a href="javascript:print()">
        <img src="http://url.com/photo.jpg"/>
    </a>
    <div style="clear: both;"></div>
</div>

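the two XPath expressions (with and without variable references) are evaluated, and the nodes selected in each case, conveniently delimited, are copied to the output: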

<br/>
    Some text
    <br/>
<br/>
    Some text(with links and images)
    <br/>
    Some text(with links and images)
    <br/>
    Some text(with links and images)
    <br/>
<br/>
<br/>
==============

     <br/>
    Some text
    <br/>
<br/>
    Some text(with links and images)
    <br/>
    Some text(with links and images)
    <br/>
    Some text(with links and images)
    <br/>
<br/>
<br/>

score 0 · Accepted Answer

To test your HTML, I added tags around your code and then pasted it into a file:

xmllint --html --xpath '/html/body/div/text()' /tmp/l.html

Output:

Some text
Some text
Some text
Some text

Now you can use an XPath module in Ruby and reuse the XPath expression.

You will find many examples by searching Stack Overflow.
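
As a rough illustration of what that could look like, here is a sketch that uses REXML as the XPath module (my choice for the example; other libraries would work too). The markup is a shortened, well-formed stand-in for the question's HTML, and because it has no html/body wrapper the path is /div/text() rather than /html/body/div/text().

```ruby
require "rexml/document"

# Hypothetical, well-formed stand-in for the markup in the question.
xml = <<~XML
  <div class="entry-content">
    <b>Content Title</b><br/>
    Some text<br/>
    More text<br/>
    <a href="javascript:print()">print</a>
  </div>
XML

doc = REXML::Document.new(xml)

# Select the text nodes that are direct children of the root <div>,
# then drop the whitespace-only ones left between the tags.
texts = REXML::XPath.match(doc, "/div/text()")
                    .map { |t| t.value.strip }
                    .reject(&:empty?)

puts texts
```

Note that this selects every text node directly under the div, not only the ones between the title and the print link, so it is only suitable when the surrounding markup carries no other loose text.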
