html - Parsing tags with TagSoup in Haskell

Question

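I've been trying to learn how to extract data from HTML files in Haskell, and I've hit a wall. I have no real experience with Haskell; my prior knowledge comes from Python (and BeautifulSoup for HTML parsing).

I'm using TagSoup to look through my HTML (it seems to be the recommended choice) and I have a basic idea of how it works. Here is the basic part of the code I'm having trouble with (self-contained, and it prints information for testing):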

import System.IO
import Network.HTTP
import Text.HTML.TagSoup
import Data.List

main :: IO ()
main = do
    http <- simpleHTTP (getRequest "http://www.cbssports.com/nba/scoreboard/20130310") >>= getResponseBody
    let tags = dropWhile (~/= TagOpen "div" []) (parseTags http)
    done tags where
        done xs = case xs of
            [] -> putStrLn $ "\n"
            _ -> do
                putStrLn $ show $ head xs
                done (tail xs)

However, I don't want to stop at just any "div" tag. I want to drop everything before a tag of the following form:

TagOpen "div" [("id","scores-1997830"),("class","scoreBox spanCol2")]
TagOpen "div" [("id","scores-1997831"),("class","scoreBox spanCol2 lastCol")]

I tried writing:

let tags = dropWhile (~/= TagOpen "div" [("id", "scores-[0-9]+"), ("class", "scoreBox( spanCol[0-9]?)+( lastCol)?")]) (parseTags http)

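But then it tries to match the literal text [0-9]+. I haven't found a way around this with the Text.Regex.Posix module, and escaping the characters doesn't work. What is the solution here?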

score 4 · Accepted Answer

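~== does not do regular expressions; it compares the tag name and attribute values literally. You have to write a matcher yourself, something like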

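import Data.Maybe (isJust)
import Text.HTML.TagSoup
import Text.Regex (mkRegex, matchRegex)  -- from the regex-compat package

-- True for an opening <div> whose id attribute matches the regex.
-- The && short-circuits, so fromAttrib is only applied to TagOpen values.
goodTag :: Tag String -> Bool
goodTag tag = tag ~== TagOpen "div" []
    && fromAttrib "id" tag `matches` "scores-[0-9]+"

-- Just a wrapper around Text.Regex.matchRegex
matches :: String -> String -> Bool
matches string regex = isJust $ mkRegex regex `matchRegex` string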
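
A minimal sketch of how such a matcher might be plugged into the question's dropWhile pipeline, assuming the tagsoup and regex-compat packages are installed. The names sampleHtml, fromFirstScore and scoreIds are hypothetical and exist only for illustration; the sample markup stands in for the downloaded page, and goodTag is repeated (typed as Tag String -> Bool) so the sketch compiles on its own.

```haskell
import Data.Maybe (isJust)
import Text.HTML.TagSoup
import Text.Regex (matchRegex, mkRegex) -- from the regex-compat package

-- Matcher from the answer: an opening <div> whose id matches the regex.
goodTag :: Tag String -> Bool
goodTag tag = tag ~== TagOpen "div" []
    && fromAttrib "id" tag `matches` "scores-[0-9]+"

-- Thin wrapper around Text.Regex.matchRegex.
matches :: String -> String -> Bool
matches string regex = isJust (matchRegex (mkRegex regex) string)

-- Hypothetical sample input standing in for the real scoreboard page.
sampleHtml :: String
sampleHtml = concat
    [ "<div id=\"header\">CBS</div>"
    , "<div id=\"scores-1997830\" class=\"scoreBox spanCol2\">A</div>"
    , "<div id=\"scores-1997831\" class=\"scoreBox spanCol2 lastCol\">B</div>"
    ]

-- Drop everything before the first score box (replaces the ~/= predicate).
fromFirstScore :: [Tag String] -> [Tag String]
fromFirstScore = dropWhile (not . goodTag)

-- Alternatively, collect the id of every score box in the document.
scoreIds :: [Tag String] -> [String]
scoreIds tags = [fromAttrib "id" t | t <- tags, goodTag t]

main :: IO ()
main = do
    let tags = parseTags sampleHtml
    mapM_ print (fromFirstScore tags)
    print (scoreIds tags)
```

In the original program, the same idea would amount to writing dropWhile (not . goodTag) (parseTags http) in place of the ~/= version. Note that the regex is unanchored, so it also accepts ids that merely contain "scores-" followed by digits; wrapping it in ^ and $ would require the whole id to match.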
