I have an HTML page with many nested tables:

<html>
<table>
  POINTER_TEXT
  some other stuff
  <table that i want START>
  </table that i want END>
  some other stuff
  <table bad>
  </table bad>
</table>
</html>

I want to extract the first table that comes after a specific marker text. I can get as far as this:

curl --silent http://xyz.com/1.htm | sed -n '/POINTER_TEXT/,$p'

This gives me

  POINTER_TEXT
  some other stuff
  <table that i want START>
  </table that i want END>
  some other stuff
  <table bad>
  </table bad>
</table>
</html>

Then I add this:

curl --silent http://xyz.com/1.htm | sed -n '/POINTER_TEXT/,$p' | sed -n '/<table/,/<\/table>/p'

which gives me this:

  <table that i want START>
  </table that i want END>
  <table bad>
  </table bad>

My problem is that I need only the first table:

  <table that i want START>
  </table that i want END>

Help me please guys!


Add

| sed '\=</table={p;Q}'

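at the end of the pipeline. It prints everything up to and including the first closing table tag and then quits, discarding the rest. Note that Q (quit without printing) is a GNU sed extension.

But what will your script do if the HTML contains no newlines? It is far more robust to use a real parser to process HTML.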
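
As a rough sketch of how the full pipeline fits together, the snippet below runs the three sed stages on a local sample instead of the curl output. The `sample` and `first_table` variables are hypothetical names introduced for illustration, the sample uses plain `<table>` tags in place of the question's placeholders, and GNU sed is assumed because of `Q`.

```shell
# Hypothetical sample standing in for the curl output; real pages will differ.
sample='<html>
<table>
  POINTER_TEXT
  some other stuff
  <table>
  Wanted data
  </table>
  some other stuff
  <table>
  Not wanted
  </table>
</table>
</html>'

# Stage 1: keep everything from POINTER_TEXT to the end of input.
# Stage 2: keep only the lines inside <table> ... </table> ranges.
# Stage 3: print through the first closing tag, then quit (Q is GNU sed only).
first_table=$(printf '%s\n' "$sample" |
  sed -n '/POINTER_TEXT/,$p' |
  sed -n '/<table/,/<\/table>/p' |
  sed '\=</table={p;Q}')

printf '%s\n' "$first_table"
```

This only behaves as intended when each tag sits on its own line, which is exactly the fragility the answer warns about.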

Answered 2012-07-27T12:04:12.033

Here is the guide you will need. The relevant excerpt:

(1) The general solution is to use GNU sed or ssed, with one of these range expressions. The first script ("print only the first match") works with any version of sed:

 sed -n '/RE/{p;q;}' file       # print only the first match
 sed '0,/RE/{//d;}' file        # delete only the first match
 sed '0,/RE/s//to_that/' file   # change only the first match
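
One possible way to apply the `0,/RE/` range to the question is to delete every line that falls outside "start of input through the first closing table tag". This is only a sketch: `sample` and `first_table` are hypothetical names, the sample uses plain `<table>` tags, and the `0,` address requires GNU sed or ssed.

```shell
# Hypothetical sample standing in for the downloaded page.
sample='<html>
<table>
  POINTER_TEXT
  some other stuff
  <table>
  Wanted data
  </table>
  some other stuff
  <table>
  Not wanted
  </table>
</table>
</html>'

# The first two stages come from the question. The third keeps only the
# lines from the start through the first </table> and deletes the rest.
first_table=$(printf '%s\n' "$sample" |
  sed -n '/POINTER_TEXT/,$p' |
  sed -n '/<table/,/<\/table>/p' |
  sed '0,/<\/table>/!d')

printf '%s\n' "$first_table"
```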
Answered 2012-07-27T12:05:17.120

This might work for you (GNU sed):

sed '/POINTER_TEXT/,${/<table/,/<\/table/{/<\/table/!b;q}};d' file
Answered 2012-07-27T12:47:24.030

Depending on what you're trying to do, you might fare better with a real parser, as choroba suggested. Conveniently, the W3C already provides one that accepts CSS3 selectors: hxselect, part of its html-xml-utils package.

Example input "infile":

<html>
<table>
  POINTER_TEXT
  some other stuff
  <table>
  Wanted data
  </table>
  some other stuff
  <table>
  Not wanted
  </table>
</table>
</html>

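To extract the <table> that is the first child element of the outer <table>, use hxselect like this (it expects well-formed input, so real-world HTML may first need to be piped through hxnormalize -x):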

hxselect 'table > table:first-child' < infile
Answered 2012-07-27T13:39:33.630