
I'm trying to get the table of MMEL codes from this site, and I'd like to do it with CSS selectors.

What I've got so far is:

require_relative 'sources/Downloader'
require 'nokogiri'

html_content = Downloader.download_page('http://www.s-techent.com/ATA100.htm')
parsed_html = Nokogiri::HTML(html_content)

tmp = parsed_html.css("tr[*]")

puts tmp.text

I get an error when I try to select the tr elements by attribute. How can I extract this table in a simple form? I want to convert it to JSON. It would be nice to get it in sections and iterate over them in an .each block.


EDIT: It'd be nice if I could get things in blocks like this (look at the page's source):

<TR><TD WIDTH="10%" VALIGN="TOP" ROWSPAN=5>
<B><FONT FACE="Arial" SIZE=2><P ALIGN="CENTER">11</B></FONT></TD>
<TD WIDTH="40%" VALIGN="TOP"  COLSPAN=2>
<B><FONT FACE="Arial" SIZE=2><P>PLACARDS AND MARKINGS</B></FONT></TD>
<TD WIDTH="50%" VALIGN="TOP">
<FONT FACE="Arial" SIZE=2><P ALIGN="LEFT">All procurable placards, labels, etc., shall be included in the illustrated Parts Catalog.  They shall be illustrated, showing the part number, Legend and Location.  The Maintenance Manual shall provide the approximate Location (i.e., FWD -UPPER -RH) and illustrate each placard, label, marking, self -illuminating sign, etc., required for safety information, maintenance significant information or by government regulations.  Those required by government regulations shall be so identified.</FONT></TD>
</TR>

2 Answers

You can also do the same thing with XPath.

Here is the content of the first table on the web page the OP gave in the post:

require 'nokogiri'
require 'open-uri'

# URI.open is needed on Ruby 3.0+, where Kernel#open no longer fetches URLs
doc = Nokogiri::HTML(URI.open('http://www.s-techent.com/ATA100.htm'))
doc.xpath('(//table)[1]/tr').each do |tr|
  puts tr.to_html(:encoding => 'utf-8')
end

Output:

  <tr>
  <td width="33%" valign="MIDDLE" colspan="2">
  <p><img src="S-Tech-Logo-Blue2.gif" width="274" height="127"></p>
  </td>
  <td width="67%" valign="MIDDLE">
  <b><i><font face="Arial" color="#0000ff">
  <p align="CENTER"><big>AIRCRAFT PARTS MANUFACTURING ASSISTANCE (PMA)</big><br><big>DAR SERVICES</big></p></font></i></b>
  </td>
  </tr>

Now, if you want to collect the rows of the last table, do the following:

require 'nokogiri'
require 'open-uri'

# URI.open is needed on Ruby 3.0+, where Kernel#open no longer fetches URLs
doc = Nokogiri::HTML(URI.open('http://www.s-techent.com/ATA100.htm'))
p doc.xpath('(//table)[3]/tr').to_a.size # => 1
doc.xpath('(//table)[3]/tr').each do |tr|
  puts tr.to_html(:encoding => 'utf-8')
end

Output:

<tr>
<td width="40%" valign="TOP" height="10">
<p align="CENTER"><b><font face="Arial" size="2" color="#0000ff">149 AZALEA CIRCLE • LIMERICK, PA 19468-1330</font></b></p>
</td>
<td width="30%" valign="TOP" height="10">
<p align="CENTER"><b><font face="Arial" size="2" color="#0000ff">610-495-6898 (Office) • 484-680-0507 (Cell)</font></b></p>
</td>
<td width="110%" valign="TOP" height="10">
<p align="CENTER"><a href="Contact.htm"><b><font face="Arial" size="2">E-mail S-Tech</font></b></a></p>
</td>
</tr>
answered 2013-09-07T22:10:04.283

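This should print all those TRs, starting from line 96 of the page source. There are three tables on that page, and table[1] contains all the text you need: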

require 'nokogiri'
require 'open-uri' # needed for URI.open; the original snippet omitted it

doc = Nokogiri::HTML(URI.open('http://www.s-techent.com/ATA100.htm'))
doc.css("table")[1].css("tr").each do |tr|
  puts tr      #=> prints the row's HTML, including the <tr> tags
  puts tr.text #=> prints only the text inside the row
end

For example:

puts doc.css("table")[1].css("tr")[2] 

prints the following:

<tr>
<td valign="TOP" colspan="3">
<b><font face="Arial" size="2"><p align="CENTER">GROUP DEFINITION - AIRCRAFT</p></font></b>
</td>
<td valign="TOP">
<font face="Arial" size="2"><p align="LEFT">The complete operational unit.  Includes dimensions and
areas, lifting and shoring,    leveling and weighing, towing and taxiing, parking and mooring, required placards, servicing.</p></font>
</td>
</tr>
answered 2013-09-07T21:27:47.273