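This really isn't a good problem to learn Ruby on. The data is awkward and unreliable, which would make it a useful programming challenge in a language you already know reasonably well. When you are learning a language, you need tasks that are relatively straightforward but still test your knowledge of the language itself.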
I have written this, which at least conforms to your example for chapter 07.
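It works by selecting the (only) table on the page that has more than one row. It then iterates over the rows, extracting an array of fields, converting non-breaking spaces to ordinary spaces, and stripping leading and trailing whitespace. All empty fields are discarded, and the whole row is ignored if it contains no data.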
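The field clean-up step can be tried on its own, without fetching the page. This is a minimal sketch that uses plain strings in place of Nokogiri cell content; the helper name `clean_fields` and the sample rows are assumptions made for illustration. The non-breaking spaces are replaced before `strip` is called because Ruby's `String#strip` does not remove U+00A0.

```ruby
# Non-breaking space, which String#strip leaves untouched
NBSP = "\u00A0"

# Hypothetical helper: normalise the text of one row's cells.
# Replaces non-breaking spaces, trims whitespace, drops empty fields.
def clean_fields(raw_cells)
  raw_cells
    .map { |text| text.gsub(NBSP, ' ').strip }
    .reject(&:empty?)
end

# Made-up rows standing in for the text of each <td> in a <tr>
rows = [
  ['07', "LIFTING AND SHORING\u00A0"],
  ["\u00A0", '  '],
  ['-10', "\u00A0JACKING", '']
]

rows.each do |raw|
  fields = clean_fields(raw)
  next if fields.empty? # a row with no data is ignored entirely
  p fields
end
```

The second sample row contains only whitespace, so it is skipped and only two arrays are printed.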
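A row whose first column starts with a decimal digit marks the first line of a chapter. If the digits are preceded by a hyphen, the row holds section information for the current chapter.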
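The row test can be sketched in isolation as follows. The method name `classify_row` and the symbols it returns are assumptions made for illustration, and the sketch anchors the patterns with `\A` (start of string) rather than `^` (start of line), which behaves the same for single-line fields.

```ruby
# Hypothetical helper: decide what kind of row a cleaned field array is.
# Returns [:chapter, number], [:section, number] or [:other, nil].
def classify_row(fields)
  case fields[0]
  when /\A\d+/    then [:chapter, fields[0]]
  when /\A-(\d+)/ then [:section, Regexp.last_match(1)]
  else [:other, nil]
  end
end

p classify_row(['07', 'LIFTING AND SHORING']) # => [:chapter, "07"]
p classify_row(['-10', 'JACKING'])            # => [:section, "10"]
p classify_row(['Note', 'some text'])         # => [:other, nil]
```

For a section row, the capture group drops the hyphen, so only the digits are kept as the section number.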
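I would normally choose to omit a field from the intermediate data if it is absent from the source (this applies to description fields and section titles). However, I have defaulted these fields to the empty string to conform to your example of the expected JSON output. (There is a difference between a hash element that doesn't exist and one whose value is `nil`.)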
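To make that difference concrete, here is a small sketch using made-up section hashes (the constant names are illustrative only). Indexing returns `nil` in both cases, but `key?` and the generated JSON tell them apart.

```ruby
require 'json'

# Three ways a section with no description could be represented
WITH_NIL  = { 'title' => 'GENERAL', 'description' => nil }.freeze
OMITTED   = { 'title' => 'GENERAL' }.freeze
DEFAULTED = { 'title' => 'GENERAL', 'description' => nil || '' }.freeze

# Plain indexing cannot distinguish a nil value from a missing key
p WITH_NIL['description'] # => nil
p OMITTED['description']  # => nil

# Hash#key? can
p WITH_NIL.key?('description') # => true
p OMITTED.key?('description')  # => false

# The JSON output differs in each case
puts JSON.generate(WITH_NIL)  # {"title":"GENERAL","description":null}
puts JSON.generate(OMITTED)   # {"title":"GENERAL"}
puts JSON.generate(DEFAULTED) # {"title":"GENERAL","description":""}
```

The `|| ''` default in the third hash is the same idiom the script uses to produce empty strings instead of `null`.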
I hope this helps.
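require 'open-uri'
require 'nokogiri'
require 'json'

# URI.open is needed on Ruby 3.0+, where Kernel#open no longer accepts URLs
URI.open('http://www.s-techent.com/ATA100.htm') do |f|
  doc = Nokogiri::HTML(f)

  # Select the table that has more than one row
  table = doc.at_xpath('//table[count(tr) > 1]')

  chapters = []
  chapter  = nil

  table.xpath('tr').each do |tr|
    # Cell text with non-breaking spaces replaced, trimmed, and blanks dropped
    fields = tr.xpath('td').map { |td| td.content.gsub("\u00A0", ' ').strip }
    fields = fields.reject(&:empty?)
    next if fields.empty?

    if fields[0] =~ /^\d+/
      # A leading digit starts a new chapter; store the previous one first
      chapters << chapter if chapter
      chapter = {
        'chapter'     => fields[0],
        'title'       => fields[1],
        'description' => fields[2] || ''
      }
    elsif fields[0] =~ /^-(\d+)/
      # A leading hyphen marks a section of the current chapter
      section = {
        'number'      => $1,
        'title'       => fields[1] || '',
        'description' => fields[2] || ''
      }
      chapter['section'] ||= []
      chapter['section'] << section
    end
  end

  # Store the final chapter, which no later chapter row will flush
  chapters << chapter if chapter
  puts JSON.pretty_generate(chapters)
end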
(partial) output
  {
    "chapter": "07",
    "title": "LIFTING AND SHORING",
    "description": "This chapter shall include the necessary procedures to lift and shore aircraft in any of the conditions to which it may be subjected. Includes lifting and shoring procedures that may be employed during aircraft maintenance and repair.",
    "section": [
      {
        "number": "00",
        "title": "GENERAL",
        "description": ""
      },
      {
        "number": "10",
        "title": "JACKING",
        "description": "Provides information relative to jack points, adapters, tail supports, balance weights, jacks and jacking procedures utilized during aircraft maintenance and repair."
      },
      {
        "number": "20",
        "title": "SHORING",
        "description": "Those instructions necessary to support the aircraft during maintenance and repair. Includes information on shoring materials and equipment, contour dimensions, shoring locations, etc."
      }
    ]
  },