Nokogiri should meet your needs. It is an excellent gem for web scraping.
require 'nokogiri'
require 'open-uri'
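# Fetch and parse the page. URI.open is needed on Ruby 3.0+, where Kernel#open no longer accepts URLs.
doc = Nokogiri::HTML(URI.open('http://www.nytimes.com/2013/06/20/sports/baseball/for-the-mets-an-afterglow-then-realitys-harsh-light.html?ref=sports&_r=1&'))
# Take the text of the first paragraph in the article body.
str = doc.at_css('div.articleBody > nyt_text > p').text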
puts str
# >> ATLANTA — From the sublime emotional high provided by Matt Harvey and Zack Wheeler, the Mets’ young, hard-throwing right-handers, the team on Wednesday descended back to the realities of its everyday existence.
str.scan(/\w+/)
# => ["ATLANTA",
# "From",
# "the",
# "sublime",
# "emotional",
# "high",
# "provided",
# "by",
# "Matt",
# "Harvey",
# "and",
# "Zack",
# "Wheeler",
# "the",
# "Mets",
# "young",
# "hard",
# "throwing",
# "right",
# "handers",
# "the",
# "team",
# "on",
# "Wednesday",
# "descended",
# "back",
# "to",
# "the",
# "realities",
# "of",
# "its",
# "everyday",
# "existence"]
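Note that `\w+` treats hyphens and apostrophes as separators, which is why "hard-throwing" appears as two entries in the result above. If hyphenated words should count as one word, a different pattern is needed. The following is only a sketch of one possible variant, not part of the original approach; the `words` helper and the sample string are hypothetical names chosen for illustration.

```ruby
# Hypothetical helper: split text into words, keeping hyphenated and
# apostrophized words (e.g. "hard-throwing", "don't") as single tokens.
def words(text)
  text.scan(/[[:alnum:]]+(?:[-'’][[:alnum:]]+)*/)
end

sample = "the Mets’ young, hard-throwing right-handers"
p words(sample)
# => ["the", "Mets", "young", "hard-throwing", "right-handers"]
```

Whether hyphenated compounds count as one word or two depends on your requirements, so choose the pattern that matches the definition of "word" you need.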
We can confirm that this text has more than 15 words:
str.scan(/\w+/).size > 15 # => true
To check which of the symbols ' ', ',', '-', ':' and '.' appear in it:
[' ', ',', '-', ':', '.'].map { |sym| str.include?(sym) }
# => [true, true, true, false, true]
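The two checks above can be combined into a single method. This is a minimal sketch under stated assumptions: the method name `valid_text?`, the `min_words` parameter, the `SEPARATORS` constant, and the reading that at least one of the listed symbols must be present are all hypothetical choices for illustration.

```ruby
# Symbols the text may be joined with (taken from the check above).
SEPARATORS = [' ', ',', '-', ':', '.'].freeze

# Hypothetical helper: true when the text has more than min_words words
# and contains at least one of the listed separators.
def valid_text?(text, min_words: 15)
  text.scan(/\w+/).size > min_words &&
    SEPARATORS.any? { |sep| text.include?(sep) }
end

puts valid_text?("one two three")        # false (too few words)
puts valid_text?("a b c", min_words: 2)  # true
```

With the scraped paragraph you would call `valid_text?(str)`. If every listed symbol must be present rather than any one of them, `any?` could be swapped for `all?`.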