I need your help with a problem I cannot find a solution to...

I have an HTML table made of tr and td elements:

For example:

<table border="0" cellpadding="0" cellspacing="0">
    <tr>
     <td>
     </td>
    </tr>
    <tr>
     <td colspan="2">
      <br />
      <h2>
       Macros
      </h2>
     </td>
    </tr>
    <tr>
     <td>
      #define&nbsp;
     </td>
     <td>
      <a class="el" href="#g3e3da223d2db3b49a9b6e3ee6f49f745">
       SND_LSTINDIC
      </a>
     </td>
    </tr>
    <tr>
     <td class="mdescLeft">
      &nbsp;
     </td>
     <td class="mdescRight">
      liste sons indication
      <br />
     </td>
    </tr>
    <tr>
     <td colspan="2">
      <br />
      <h2>
       Définition de type
      </h2>
     </td>
    </tr>
    <tr>
     <td class="memItemLeft" nowrap="nowrap" align="right" valign="top">
      typedef void(*&nbsp;
     </td>
     <td class="memItemRight" valign="bottom">
      <a class="el" href="#g73cba8bd62d629eb05495a5c1a7b2844">
       f_sndChangeFunc
      </a>
      )(
      <a class="el" href="#g4ab7db37a42f244764583a63997489a8">
       e_sndSound
      </a>
      i_eSound,
    aBOOL
    i_bStart,
    aBYTE
    i_byDisableModule)
     </td>
    </tr>
    <tr>
     <td class="mdescLeft">
      &nbsp;
     </td>
     <td class="mdescRight">
      Fonction rappel sur départ/arrêt bip.
      <a href="#g73cba8bd62d629eb05495a5c1a7b2844">
      </a>
      <br />
     </td>
    </tr>
    <tr>
     <td colspan="2">
      <br />
      <h2>
       Énumérations
      </h2>
     </td>
    </tr>
    <tr>
     <td class="memItemLeft" nowrap="nowrap" align="right" valign="top">
      enum &nbsp;
     </td>
     <td class="memItemRight" valign="bottom">
      <a class="el" href="#g4ab7db37a42f244764583a63997489a8">
       e_sndSound
      </a>
      {
      }
     </td>
    </tr>
    <tr>
     <td class="mdescLeft">
      &nbsp;
     </td>
     <td class="mdescRight">
      identificateurs sons
      <a href="group__Sound.html#g4ab7db37a42f244764583a63997489a8">
       Plus de détails...
      </a>
      <br />
     </td>
    </tr>
</table>

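I am trying to split this table into several tables. I want to pull each title out of the table and build a new table from the rows that follow it.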

For example, the expected result here would look like this:

<h2>
  Macros
</h2>
<table border="0" cellpadding="0" cellspacing="0">
    <tr>
     <td>
     </td>
    </tr>
    <tr>
     <td colspan="2">
      <br />
     </td>
    </tr>
    <tr>
     <td>
      #define&nbsp;
     </td>
     <td>
      <a class="el" href="#g3e3da223d2db3b49a9b6e3ee6f49f745">
       SND_LSTINDIC
      </a>
     </td>
    </tr>
    <tr>
     <td class="mdescLeft">
      &nbsp;
     </td>
     <td class="mdescRight">
      liste sons indication
      <br />
     </td>
    </tr>
  </table>

  <h2>
    Définition de type
  </h2>
  <table>
    <tr>
     <td class="memItemLeft" nowrap="nowrap" align="right" valign="top">
      typedef void(*&nbsp;
     </td>
     <td class="memItemRight" valign="bottom">
      <a class="el" href="#g73cba8bd62d629eb05495a5c1a7b2844">
       f_sndChangeFunc
      </a>
      )(
      <a class="el" href="#g4ab7db37a42f244764583a63997489a8">
       e_sndSound
      </a>
      i_eSound,
    aBOOL
    i_bStart,
    aBYTE
    i_byDisableModule)
     </td>
    </tr>
    <tr>
     <td class="mdescLeft">
      &nbsp;
     </td>
     <td class="mdescRight">
      Fonction rappel sur départ/arrêt bip.
      <a href="#g73cba8bd62d629eb05495a5c1a7b2844">
      </a>
      <br />
     </td>
    </tr>
  </table>

  <h2>
    Énumérations
  </h2>
  <table>
    <tr>
     <td class="memItemLeft" nowrap="nowrap" align="right" valign="top">
      enum &nbsp;
     </td>
     <td class="memItemRight" valign="bottom">
      <a class="el" href="#g4ab7db37a42f244764583a63997489a8">
       e_sndSound
      </a>
      {
      }
     </td>
    </tr>
    <tr>
     <td class="mdescLeft">
      &nbsp;
     </td>
     <td class="mdescRight">
      identificateurs sons
      <a href="group__Sound.html#g4ab7db37a42f244764583a63997489a8">
       Plus de détails...
      </a>
      <br />
     </td>
    </tr>
</table>

I use Python and BeautifulSoup to parse my HTML code. I first tried this:

from BeautifulSoup import BeautifulSoup, NavigableString
import sys
import os

soup = BeautifulSoup(allHtml)  # allHtml: the HTML source as a string

for table in soup.findAll("table"):
   h2s = table.findAll("h2")
   if h2s:  # at least one <h2> in this table
      FirstH2 = True
      LastH2 = False
      for i, h2 in enumerate(h2s):
         LastH2 = ( i == len(h2s) - 1 )

         h2.parent.replaceWithChildren() # <td> deleted
         h2.parent.replaceWithChildren() # <tr> deleted
         print h2.parent
         if FirstH2:
            h2.replaceWith( h2.prettify() + '<table>' )
            #h2_tag_idx = h2.parent.contents.index(h2) # other method to add Tags
            #h2.parent.insert(h2_tag_idx + 1, '<b>OK</b>')
         else:
            h2.replaceWith( '</table>' + h2.prettify() + '<table>' )

         FirstH2 = False

print soup.prettify()

But it does not work: it replaces my tags with their HTML-escaped equivalents...
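The escaping described above is what happens when a plain string is inserted into the tree: the string becomes a text node, so its `<` and `>` are escaped on output. The sketch below illustrates this under the assumption that the newer `bs4` package (BeautifulSoup 4) is installed, which is not the module used in the question; the sample HTML and variable names are illustrative only.

```python
from bs4 import BeautifulSoup

html = "<table><tr><td><h2>Macros</h2></td></tr></table>"

# Case 1: replacing a tag with a plain string.
soup = BeautifulSoup(html, "html.parser")
h2 = soup.find("h2")
# The string is stored as text, not parsed as markup,
# so the angle brackets are escaped when the soup is serialized.
h2.replace_with("</table><h2>Macros</h2><table>")
escaped = str(soup)

# Case 2: inserting a real Tag object.
soup2 = BeautifulSoup(html, "html.parser")
h2 = soup2.find("h2")
new_table = soup2.new_tag("table")  # a genuine element node
h2.insert_after(new_table)
built = str(soup2)

print(escaped)
print(built)
```

In other words, markup has to be added as tag objects rather than as strings containing half-open tags such as `</table>`.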

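I also tried to grab everything inside the table, rebuild several tables from it, and put them back into the soup, but that failed...

I also tried getting the table as a string, splitting that string on a delimiter, and putting all the sub-tables back into the soup, but that failed too...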

If anyone has an idea, that would be great!

Thanks in advance!

1 Answer

I wrote this function, and it works...

from BeautifulSoup import BeautifulSoup, Tag

def getOutTitleFromTable(htmlSoup):
   for ii, table in enumerate(htmlSoup.findAll("table")):
      h2s = table.findAll("h2") # find every <h2></h2> in the table
      if len(h2s) > 0: # the table contains at least one <h2>
         newTables = BeautifulSoup() # will hold the rebuilt tables
         for i, h2 in enumerate(h2s):
            h2.parent.replaceWithChildren() # remove the <td>
            h2.parent.replaceWithChildren() # remove the <tr>

            idT = "table"+str(ii)+str(i) # give each new table an id for readability
            wrapTable = Tag(htmlSoup, "table")
            wrapTable["id"]=idT
            wrapTable["border"]=0
            wrapTable["cellpadding"]=0
            wrapTable["cellspacing"]=0
            table.insert(h2.parent.contents.index(h2)+1, wrapTable) # add <table></table> after each <h2>"title"</h2>
            newTable = table.find(name="table", attrs={"id" : idT})
            fillTable = False
            for tr in table.findAll(["h2","tr"]):
               if fillTable:
                  if tr in h2s:
                     # next title reached: the new table is complete
                     fillTable = False
                     break
                  else:
                     if tr.find("h2") not in h2s:
                        newTable.contents.append(tr) # add the row to the new table

               if str(tr) == str(h2):
                  # current title found: start filling the new table
                  fillTable = True

            newTables.append(h2)
            newTables.append(newTable)

         table.contents = newTables
         table.name = "div" # turn the <table> tag into a <div>... it is cheating, but I could not find a way to remove the wrapping <table></table>

If anyone has a better solution, I would be happy to see it.
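One possible alternative is sketched below. It assumes the newer `bs4` package (BeautifulSoup 4) rather than the BeautifulSoup 3 module used above, and it assumes the table contains no nested tables. The function name `split_table_on_h2` is hypothetical. As a simplification compared with the expected output in the question, the remainder of each heading row (the `<br />` cell) is dropped, and rows that appear before the first heading are moved into the first new table.

```python
from bs4 import BeautifulSoup


def split_table_on_h2(soup, table):
    """Replace `table` with alternating <h2> and <table> siblings."""
    if table.find("h2") is None:
        return  # no headings: leave the table untouched

    pieces = []     # new top-level nodes, in document order
    current = None  # table that is currently receiving rows
    leading = []    # rows seen before the first heading

    for tr in table.find_all("tr"):
        h2 = tr.find("h2")
        if h2 is not None:
            # A heading row opens a new table that reuses the original attributes.
            current = soup.new_tag("table")
            current.attrs.update(table.attrs)
            for row in leading:
                current.append(row)
            leading = []
            pieces += [h2.extract(), current]
            tr.decompose()  # the rest of the heading row is dropped
        elif current is None:
            leading.append(tr.extract())
        else:
            current.append(tr.extract())

    # Insert in reverse so the pieces end up in their original order.
    for piece in reversed(pieces):
        table.insert_after(piece)
    table.decompose()  # remove the now-empty original table


html = """
<table border="0">
 <tr><td></td></tr>
 <tr><td colspan="2"><br /><h2>Macros</h2></td></tr>
 <tr><td>#define</td><td>SND_LSTINDIC</td></tr>
 <tr><td colspan="2"><br /><h2>Enumerations</h2></td></tr>
 <tr><td>enum</td><td>e_sndSound</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
split_table_on_h2(soup, soup.find("table"))
print(soup.prettify())
```

Because every new `<table>` is a real tag object and rows are moved rather than re-serialized, nothing gets escaped and no wrapping `<div>` is needed.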

Bye

answered 2013-08-08T10:36:42.567