ruby - How do I parse an article from a web page using regular expressions?

Question

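I have an article on a page, and I need to parse all of its text.

I know that an article has more than 15 words, joined by the symbols ' ', ',', '-', ':' or '.'.

How can I write a regular expression in Ruby to find the article on the page and parse it?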

For example: http://www.nytimes.com/2013/06/20/sports/baseball/for-the-mets-an-afterglow-then-realitys-harsh-light.html?ref=sports&_r=0

I need to parse the body text: ATLANTA — From the sublime emotional high provided by Matt Harvey and Zack Wheeler, the Mets’ young, hard-throwing right-handers, the team on Wednesday descended back to the realities of its everyday existence...

I know how to fetch and parse the content of the page, but I don't know how to express this as a Regexp. To find the parent HTML tag that holds the text I want, I need a Regexp that checks the rule: the article has more than 15 words, joined only by the symbols ' ', ',', '-', ':' or '.'.

score 1 · Accepted Answer

Nokogiri should meet your needs. It is an excellent gem for web scraping.

require 'nokogiri'
require 'open-uri'

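# Use URI.open: on Ruby 3.0+ Kernel#open no longer accepts URLs via open-uri.
doc = Nokogiri::HTML(URI.open('http://www.nytimes.com/2013/06/20/sports/baseball/for-the-mets-an-afterglow-then-realitys-harsh-light.html?ref=sports&_r=1&'))
# Take the text of the first paragraph of the article body.
str = doc.at_css('div.articleBody > nyt_text > p').text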

puts str
# >> ATLANTA — From the sublime emotional high provided by Matt Harvey and Zack Wheeler, the Mets’ young, hard-throwing right-handers, the team on Wednesday descended back to the realities of its everyday existence.  

str.scan(/\w+/)
# => ["ATLANTA",
#     "From",
#     "the",
#     "sublime",
#     "emotional",
#     "high",
#     "provided",
#     "by",
#     "Matt",
#     "Harvey",
#     "and",
#     "Zack",
#     "Wheeler",
#     "the",
#     "Mets",
#     "young",
#     "hard",
#     "throwing",
#     "right",
#     "handers",
#     "the",
#     "team",
#     "on",
#     "Wednesday",
#     "descended",
#     "back",
#     "to",
#     "the",
#     "realities",
#     "of",
#     "its",
#     "everyday",
#     "existence"]

Check that the article has more than 15 words:

str.scan(/\w+/).size > 15 # => true

Check which of the symbols ' ', ',', '-', ':' and '.' appear in it:

[' ',',','-',':','.'].map{|i| str.include? i}
# => [true, true, true, false, false]
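
If you want both checks as a single Regexp-based test, as the question asks, here is a minimal sketch. The method name `article_like?` and its `min_words` parameter are my own illustration, not part of Nokogiri, and it assumes "joined only by" means that no character other than word characters and the five listed symbols may appear in the text.

```ruby
# Hypothetical helper (not a Nokogiri method): returns true when the text has
# more than min_words words and contains nothing but word characters and the
# joiners ' ', ',', '-', ':' and '.'.
def article_like?(text, min_words: 15)
  text = text.strip
  # Rule 1: more than min_words words (\w+ matches ASCII word characters).
  return false unless text.scan(/\w+/).size > min_words
  # Rule 2: every character is either a word character or an allowed joiner.
  text.match?(/\A[\w ,\-:.]+\z/)
end

long = "one two three, four five-six seven: eight nine ten eleven " \
       "twelve thirteen fourteen fifteen sixteen."

puts article_like?(long)          # 16 words, only allowed joiners
puts article_like?("Too short.")  # too few words
```

You could then filter candidate paragraphs with something like `doc.css('p').map(&:text).select { |t| article_like?(t) }`. Note that the sample paragraph from the question contains a long dash (U+2014) and a curly apostrophe (U+2019), so the strict character class above would reject it; you would probably need to add those characters to the class for real pages.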
