I'm using Ruby and Nokogiri to parse HTML source that lists items in a recognizable pattern, in the following format:

<span class="x">junk</span>
<small class="y">ParameterA</small>
<span class="z">
    <b>ParameterB</b>
    <i>Possible ParameterC</i>
</span>
<script type="text/javascript">useless stuff</script>
<object><noscript>other useless stuff</noscript></object>

<span class="x">junk</span>
<small class="y">ParameterA</small>
<span class="z">
    <b>ParameterB</b>
    <i>Possible ParameterC</i>
</span>
<script type="text/javascript">useless stuff</script>
<object><noscript>other useless stuff</noscript></object>

...and so on, many times over.

How can I build a multidimensional array holding the parameters I want, using a structure like the following?

myarray = []
mystuff = Struct.new(:ParameterA, :ParameterB, :ParameterC)

I can't figure out what kind of loop to run here, or how to avoid parsing the useless parts.

2 Answers

I was able to solve this with a single regular expression, which gives me the correct multidimensional array as output:

[["ParameterA", "ParameterB", "Possible ParameterC"], ["ParameterA", "ParameterB", "Possible ParameterC"]]

Working code:

str = <<EOF
<span class="x">junk</span>
<small class="y">ParameterA</small>
<span class="z">
    <b>ParameterB</b>
    <i>Possible ParameterC</i>
</span>
<script type="text/javascript">useless stuff</script>
<object><noscript>other useless stuff</noscript></object>

<span class="x">junk</span>
<small class="y">ParameterA</small>
<span class="z">
    <b>ParameterB</b>
    <i>Possible ParameterC</i>
</span>
<script type="text/javascript">useless stuff</script>
<object><noscript>other useless stuff</noscript></object>
EOF

m = str.scan(/<small [^>]+>([^<]+)<.*?<b>([^<]+)<\/b>\s+<i>([^<]+)<\/i>/m)
puts m.inspect
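
One caveat with the pattern above: it requires an <i> element, so a record without the optional ParameterC can be skipped or paired with values from a later record. Below is a minimal sketch of a variant that tolerates a missing <i>, assuming the markup otherwise keeps the same shape. The sample data and the names html, pattern, and rows are made up for illustration.

```ruby
# Hypothetical sample: the second record has no <i> element.
html = <<~HTML
  <small class="y">A1</small>
  <span class="z">
      <b>B1</b>
      <i>C1</i>
  </span>
  <small class="y">A2</small>
  <span class="z">
      <b>B2</b>
  </span>
HTML

# The trailing (?:...)? group makes the <i> element optional;
# when it is absent, the third capture is nil.
pattern = /<small [^>]+>([^<]+)<.*?<b>([^<]+)<\/b>(?:\s*<i>([^<]+)<\/i>)?/m

rows = html.scan(pattern)
p rows
# => [["A1", "B1", "C1"], ["A2", "B2", nil]]
```

This still relies on every record containing a <b>; if that cannot be guaranteed, a real HTML parser is the safer choice.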
answered 2012-09-08T05:42:26.403

I would use something like this:

require 'nokogiri'
require 'pp' # pp is called below; it must be required explicitly on Ruby < 2.5

doc = Nokogiri::HTML(<<EOT)
<span class="x">junk</span>
<small class="y">ParameterA</small>
<span class="z">
    <b>ParameterB</b>
    <i>Possible ParameterC</i>
</span>
<script type="text/javascript">useless stuff</script>
<object><noscript>other useless stuff</noscript></object>

<span class="x">junk</span>
<small class="y">ParameterA</small>
<span class="z">
    <b>ParameterB</b>
    <i>Possible ParameterC</i>
</span>
<script type="text/javascript">useless stuff</script>
<object><noscript>other useless stuff</noscript></object>
EOT

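# Each <small class="y"> starts a record; the <span class="z"> that follows
# it holds <b> (ParameterB) and an optional <i> (ParameterC).
mystuff = doc.search('small.y').map { |small_y|
  span_z = small_y.next_element
  [
    small_y.content,
    span_z.at('b').content,
    span_z.at('i') ? span_z.at('i').content : nil
  ]
}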

pp mystuff

Which looks like:

[
  [
    "ParameterA",
    "ParameterB",
    "Possible ParameterC"
  ],
  [
    "ParameterA",
    "ParameterB",
    "Possible ParameterC"
  ]
]
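
Since the question asked for a Struct, an array of arrays in this shape can be mapped into one afterwards. This is only a sketch: the lower-case member names and the variable names item_struct, rows, and items are assumptions for illustration, not part of the original code.

```ruby
# Rows in the same shape as the output shown above:
# [ParameterA, ParameterB, ParameterC or nil].
rows = [
  ["ParameterA", "ParameterB", "Possible ParameterC"],
  ["ParameterA", "ParameterB", nil]
]

# Hypothetical Struct with snake_case members in place of :ParameterA etc.
item_struct = Struct.new(:parameter_a, :parameter_b, :parameter_c)

# Splat each inner array into the Struct constructor, one field per element.
items = rows.map { |row| item_struct.new(*row) }

items.each do |item|
  puts "#{item.parameter_a} / #{item.parameter_b} / #{item.parameter_c || '(none)'}"
end
```

Each element of items then answers to parameter_a, parameter_b, and parameter_c, with nil where the optional value was missing.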
answered 2012-09-10T05:30:24.713