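This is my first attempt at parsing a web page with Nokogiri.
I'm trying to extract addresses from a web page and store them in a CSV file. So far I've only managed to extract the City, State, and Zip fields.
I can't figure out how to extract the facility name, street address, phone numbers, and company information. The address may contain one or two street components.
There may be one or more phone numbers per entry. A number can be a regular phone or a fax, but that distinction appears only in the surrounding text, not in the tag. For the company, I'd like to extract both the URL and the name.
Each address on the page is laid out as follows: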
<!-- address entry -->
<div id='1234' class='address'>
<div class='address_header'>
<h1 class='header_name'>
<strong><a href='{URL}'>Facility Name</a></strong>
</h1>
<h2 class='header_city'>
New York
</h2>
</div>
<div class='address_details'>
<div class='info'>
<p class='address'>
<span class='street'>123 ABC St</span><br />
<span class='street'>Unit 1</span><br />
<span class='city'>New York</span>,
<span class='state'>NY</span>
<span class='zip'>10022</span>
</p>
<p class='phone'>
Phone: <span class='tel'>999.999.9999</span>
</p>
<p class='phone'>
Fax: <span class='tel'>888.888.8888</span>
</p>
<p class='company'>
Company: <a href='{URL}'>Company Name</a>
</p>
</div>
</div>
</div>
<!-- address entry -->
<!-- address entry -->
<div id='4567' class='address'>
<div class='address_header'>
<h1 class='header_name'>
<strong><a href='{URL}'>Facility Name</a></strong>
</h1>
<h2 class='header_city'>
New York
</h2>
</div>
<div class='address_details'>
<div class='info'>
<p class='address'>
<span class='street'>456 DEF Rd</span><br />
<span class='city'>New York</span>,
<span class='state'>NY</span>
<span class='zip'>10022</span>
</p>
<p class='phone'>
Phone: <span class='tel'>555.555.5555</span>
</p>
<p class='company'>
Company: <a href='{URL}'>Company Name</a>
</p>
</div>
</div>
</div>
<!-- address entry -->
Here is my very basic setup.
require 'nokogiri'
require 'open-uri'
require 'csv'

# URI.open is provided by open-uri; a bare open() no longer accepts URLs on Ruby 3.0+.
doc = Nokogiri::HTML(URI.open('[URL]'))

# One array per field, filled in document order.
cities = []
states = []
zips = []

doc.css("p[class='address']").css("span[class='city']").each do |city|
  cities << city.content
end
doc.css("p[class='address']").css("span[class='state']").each do |state|
  states << state.content
end
doc.css("p[class='address']").css("span[class='zip']").each do |zip|
  zips << zip.content
end

# Write a header row, then one row per index across the three arrays.
CSV.open("myCSV.csv", "wb") do |csv|
  csv << ["City", "State", "Zip"]
  cities.each_index do |index|
    csv << [cities[index], states[index], zips[index]]
  end
end
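Storing the information in separate arrays like this seems very clunky. What I really want is to create one row in the CSV for each occurrence of an address node in the source document, and then fill in whichever fields exist: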
Facility  St_1   St_2   City   State   Zip   Phone   Fax   URL   Company
========  =====  =====  =====  ======  ====  ======  ====  ====  ============
xxxxxxxx  xxxx          xxxx   xxxxx   xxxx  xxxxx         xxxx  xxxxxxxx
xxxxxxxx  xxxx   xxxxx  xxxx   xxxxx   xxxx  xxxxx   xxxx  xxxx  xxxxxxxx
Can anyone help me?