ruby - I can't remove whitespace from a string parsed by Nokogiri

Question

I can't remove whitespace from a string.

My HTML is:

<p class='your-price'>
Cena pro Vás: <strong>139&nbsp;<small>Kč</small></strong>
</p>

My code is:

#encoding: utf-8
require 'rubygems'
require 'mechanize'

agent = Mechanize.new
site  = agent.get("http://www.astratex.cz/podlozky-pod-raminka/doplnky")
price = site.search("//p[@class='your-price']/strong/text()")

val = price.first.text  # => "139 "
val.strip               # => "139 "
val.gsub(" ", "")       # => "139 "

gsub, strip, and so on don't work. Why, and how can I fix this?

val.class      # => String
val.dump       # => "\"139\\u{a0}\""      !
val.encoding   # => #<Encoding:UTF-8>

__ENCODING__               # => #<Encoding:UTF-8>
Encoding.default_external  # => #<Encoding:UTF-8>

I'm using Ruby 1.9.3, so Unicode shouldn't be a problem.

score 23 · Accepted Answer

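strip only removes ASCII whitespace, and the character you have here is a Unicode non-breaking space (U+00A0).

Removing the character is easy. You can use gsub with a regular expression that names the character by its code point: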

gsub(/\u00a0/, '')

You could also call

gsub(/[[:space:]]/, '')

to remove all Unicode whitespace. For details, check the Regexp documentation.
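As a minimal sketch of the difference, assuming a UTF-8 string that ends in a non-breaking space like the one in the question (the variable names are only for illustration):

```ruby
# encoding: utf-8

# A string ending in U+00A0 (non-breaking space), as in the question.
val = "139\u00a0"

stripped = val.strip                    # the NBSP is left in place
by_code  = val.gsub(/\u00a0/, '')       # => "139" (removes only U+00A0)
by_class = val.gsub(/[[:space:]]/, '')  # => "139" (removes any Unicode whitespace)
ascii_ws = val =~ /\s/                  # => nil (\s matches ASCII whitespace only)

puts stripped.length  # 4, because the NBSP is still there
puts by_code.length   # 3
puts by_class.length  # 3
```

Note that `[[:space:]]` also removes ordinary spaces, tabs, and newlines anywhere in the string, so the narrower `/\u00a0/` may be the safer choice when only the non-breaking space should go.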

score 0 · Accepted Answer

If I wanted to remove the non-breaking space "\u00A0", AKA &nbsp;, I'd do the following:

require 'nokogiri'

doc = Nokogiri::HTML("&nbsp;")

s = doc.text # => " "

# s is the NBSP
s.ord.to_s(16)                   # => "a0"

# and here's the translate changing the NBSP to a SPACE
s.tr("\u00A0", ' ').ord.to_s(16) # => "20"

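So tr("\u00A0", ' ') gets you where you want to be; at this point, the NBSP is now a regular space.

tr is very fast and easy to use.

An alternative is to preprocess the encoded entity "&nbsp;" before the text is extracted from the HTML. This is simplified, but it works for an entire HTML file as well as for a single entity in a string: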
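Applied to the value from the question, this might look like the following sketch. It assumes the string is "139" followed by U+00A0; the variable names are hypothetical, and String#delete is shown as an alternative for when the character should be dropped rather than converted:

```ruby
# encoding: utf-8

val = "139\u00A0"

# Translate the NBSP to an ordinary space, which strip can then remove.
as_space = val.tr("\u00A0", ' ')  # => "139 "
cleaned  = as_space.strip         # => "139"

# Or drop the NBSP outright, wherever it occurs in the string.
deleted = val.delete("\u00A0")    # => "139"

puts cleaned
puts deleted
```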

s = "&nbsp;"

s.gsub('&nbsp;', ' ') # => " "

Using a fixed string for the target is faster than using a regular expression:

s = "&nbsp;" * 10000

require 'fruity'

compare do
  fixed { s.gsub('&nbsp;', ' ') }
  regex { s.gsub(/&nbsp;/, ' ') }
end

# >> Running each test 4 times. Test will take about 1 second.
# >> fixed is faster than regex by 2x ± 0.1

Regular expressions are useful when you need their power, but they can slow your code down considerably.
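If the fruity gem isn't available, a rough comparison can be sketched with core Ruby only. This is not the benchmark from the answer above: `time_it` is a hypothetical helper, the run count is arbitrary, and the timings will vary with the Ruby version and the machine, so treat the numbers as indicative at best.

```ruby
# Hypothetical helper: runs the block `runs` times and returns elapsed seconds.
def time_it(runs = 100)
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  runs.times { yield }
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
end

s = "&nbsp;" * 10_000

# Both forms should produce the same result; only the speed may differ.
fixed_result = s.gsub('&nbsp;', ' ')
regex_result = s.gsub(/&nbsp;/, ' ')

fixed_time = time_it { s.gsub('&nbsp;', ' ') }
regex_time = time_it { s.gsub(/&nbsp;/, ' ') }

puts format('fixed: %.4fs', fixed_time)
puts format('regex: %.4fs', regex_time)
```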
