haskell - 需要使用 HXT 转换 Microsoft Office XML

Question

我有 Microsoft Office 生成的 HTML，如下所示：

      <p class="MsoListParagraph" style="text-indent:-.25in;mso-list:l0 level1 lfo1"><span style="font-family:Symbol">
          <span style="mso-list:Ignore">·<span style="font:7.0pt &quot;Times New Roman&quot;">        
</span></span>
        </span>It’s a media conglomerate, need to understand the parts<o:p/></p>
      <p class="MsoListParagraph" style="text-indent:-.25in;mso-list:l0 level1 lfo1"><![if !supportLists]><span style="font-family:Symbol">
          <span style="mso-list:Ignore">·<span style="font:7.0pt &quot;Times New Roman&quot;">        
</span></span>
        </span><![endif]>Largest TV broadcaster in Mexico<o:p/></p>
      <p class="MsoListParagraph" style="margin-left:1.0in;text-indent:-.25in;mso-list:l0 level2 lfo1">
<![if !supportLists]><span style="font-family:&quot;Courier New&quot;">
          <span style="mso-list:Ignore">o<span style="font:7.0pt &quot;Times New Roman&quot;">  
</span></span>
        </span><![endif]>There’s 7 free air channels in Mexico and they have 4<o:p/></p>
      <p class="MsoListParagraph" style="margin-left:1.0in;text-indent:-.25in;mso-list:l0 level2 lfo1">
<![if !supportLists]><span style="font-family:&quot;Courier New&quot;">
          <span style="mso-list:Ignore">o<span style="font:7.0pt &quot;Times New Roman&quot;">  
</span></span>
        </span><![endif]>70% of citizens watch their channels<o:p/></p>

我想使用 HXT 来转换 DOM 结构，以便

我将所有<p>样式为“mso-list:l0 level1”的内容变为a <ul><li class="level1">，并将<p>样式为“mso-list:l0 level2”的内容变为a<ul><li class="level2">
在它们之前的第一个 level1 项目内嵌套连续的 level2 项目。

Control.Arrow.ArrowNavigatableTree我已经使用函数和getXPathTreesfrom尝试了各种 HXT 实验Text.XML.HXT.XPath.Arrows，但 4 小时后没有运气。

有什么建议么？我怀疑该解决方案涉及折叠兄弟<p>XmlTrees 的列表。

编辑

这是我到目前为止提出的解决方案：

GitHub Gist 版本

{-# LANGUAGE Arrows, NoMonomorphismRestriction #-}
module Main where
import Text.XML.HXT.Core
import System.Environment
import Data.List
import Control.Arrow.ArrowNavigatableTree
import Text.XML.HXT.XPath.Arrows
import Data.Tree.NTree.TypeDefs
import Data.Tree.NavigatableTree.Class

isListPara = isElem >>> hasName "p"  >>> hasAttrValue "class" (== "MsoListParagraph")

isListParaLevel n = isListPara >>> hasAttrValue "style" (("level"++(show n)) `isInfixOf`)

-- Nests a flat list of <p> tags into parent <p> tags by "leveln" style infix

fixLevel n = 
    withoutNav (isListParaLevel n <+> (neg isListPara))
    >>>
    replaceChildren (
      getChildren 
      <+>
      (listA 
        (followingSiblingAxis >>> filterAxis isListPara) 
        >>> 
        arr (takeFollowingSibsAtLevel (n + 1)) 
        >>> 
        unlistA 
        >>> fixLevel (n+1)
      )
    )
  where takeFollowingSibsAtLevel n ys@(x:_) = takeWhile (isLevel n . toTree) ys
        takeFollowingSibsAtLevel _ _  = []
        isLevel n t@(NTree (XTag _ xs) _ )
          | (not.null) 
            [ x | 
              x@(NTree (XAttr q) [NTree (XText s)_]) <- xs,
              localPart q == "style",
              ("mso-list:l0 level" ++ (show n ++) " lfo1") `isInfixOf` s
            ] 
            = True
        isLevel _ _ = False

-- this converts nested <p> tags into <ul><li>

pToUL = isListPara 
    >>> 
    changeElemName (const $ mkName "ul") 
    >>> 
    removeAttr "style"  -- remove MSO styling
    >>> 
    replaceChildren (
      (selem "li" [
        (getChildren >>> (neg isListPara))
        <+>
        (getChildren >>> isListPara >>> pToUL)
      ])
    )

main = do  
  args <- getArgs
  input <- case args of
            [infile] -> readFile infile
            otherwise -> getContents
  res <- runX (readString [ 
            withValidate no,withWarnings no ,withParseHTML yes
            ,withInputEncoding utf8
          ] input
          >>>
          {-
          processTopDown (deep none `when` (hasText ("![endif]" `isInfixOf`)))
          >>>
          processTopDown (deep none `when` (hasText (" `isInfixOf`)))
          >>>
          -}

          processTopDown 
          ( 
            (
              deep isElem 
              >>> addNav >>> getChildren 
              >>> fixLevel 1 >>> remNav
              >>> pToUL 
            )  
            `when` 
            (hasAttrValue "class" ("WordSection" `isPrefixOf`))
          )
          >>>
          processTopDown (deep none `when` (isElem >>> hasName "style"))
          >>>
          processTopDown  (deep none `when` (isAttr >>> hasQName (mkName "style")))
          >>>
          processTopDown (deep none `when` (isElem >>> hasAttrValue "style" (== "mso-list:Ignore")))
          >>>
          processTopDown (deep none `when` (isElem >>> hasAttrValue "style" (== "font-family:Symbol")))
          >>>
          writeDocument [
                         -- withShowTree yes,
                         withIndent yes
                        ,withOutputEncoding usAscii
                        ] "-"
          )
  return ()

haskell - 需要使用 HXT 转换 Microsoft Office XML

0 回答 0

Related

Reference