0

这实际上是我几天前提出的问题的延续。我采用了应用函子路线并制作了自己的实例。

我需要逐行解析文件中的大量 json 语句。一个示例 json 语句是这样的 -

{"question_text": "How can NBC defend tape delaying the Olympics when everyone has
Twitter?", "context_topic": {"followers": 21, "name": "NBC Coverage of the London
Olympics (July & August 2012)"}, "topics": [{"followers": 2705,
"name": "NBC"},{"followers": 21, "name": "NBC Coverage of the London 
Olympics (July & August 2012)"},
{"followers": 17828, "name": "Olympic Games"},
{"followers": 11955, "name": "2012 Summer Olympics in London"}], 
"question_key": "AAEAABORnPCiXO94q0oSDqfCuMJ2jh0ThsH2dHy4ATgigZ5J",
"__ans__": true, "anonymous": false}

对 json 格式感到抱歉。变坏了

我有大约 10000 个这样的 json 语句,我需要解析它们。我写的代码是这样的 -

parseToRecord :: B.ByteString -> Question
parseToRecord bstr = (\(Ok x) -> x) decodedObj where decodedObj = decode (B.unpack bstr) :: Result Question

main :: IO()
main = do
  -- my first line in the file tells how many json statements
  -- are there followed by a lot of other irrelevant info...
  ts <- B.getContents >>= return . fst . fromJust . B.readInteger . head . B.lines
  json_text <- B.getContents >>= return . tail . B.lines
  let training_data = take (fromIntegral ts) json_text
  let questions = map parseToRecord training_data
  print $ questions !! 8922

这段代码给了我一个运行时错误Non-exhaustive patterns in lambda。代码中引用的错误\(Ok x) -> x。通过反复试验,我得出的结论是,该程序在第 8921 个索引之前工作正常,但在第 8922 个迭代时失败。

我检查了相应的 json 语句,并尝试通过调用它的函数来独立解析它,它可以工作。但是,当我调用map时它不起作用。我真的不明白发生了什么。在“学习haskell for a great good”中学习了一点haskell后,我想潜入一个真实世界的编程项目,但似乎被困在这里。

编辑::完整代码如下

{-# LANGUAGE BangPatterns #-}
{-# OPTIONS_GHC -O2 -optc-O2 #-}
{-# OPTIONS_GHC -fno-warn-incomplete-uni-patterns #-}
import qualified Data.ByteString.Lazy.Char8 as B
import Data.Maybe
import NLP.Tokenize

import           Control.Applicative
import           Control.Monad
import           Text.JSON

data Topic = Topic
  { followers :: Integer,
    name :: String
  } deriving (Show)

data Question = Question
  { question_text :: String,
    context_topic :: Topic,
    topics :: [Topic],
    question_key :: String,
    __ans__ :: Bool,
    anonymous :: Bool
  } deriving (Show)

(!) :: (JSON a) => JSObject JSValue -> String -> Result a
(!) = flip valFromObj

instance JSON Topic where
  -- Keep the compiler quiet
  showJSON = undefined

  readJSON (JSObject obj) =
    Topic       <$>
    obj ! "followers" <*>
    obj ! "name"
  readJSON _ = mzero

instance JSON Question where
  -- Keep the compiler quiet
  showJSON = undefined

  readJSON (JSObject obj) =
    Question      <$>
    obj ! "question_text"   <*>
    obj ! "context_topic" <*>
    obj ! "topics" <*>
    obj ! "question_key" <*>
    obj ! "__ans__" <*>
    obj ! "anonymous"
  readJSON _ = mzero

isAnswered (Question _ _ _ _ status _) = status
isAnonymous (Question _ _ _ _ _ status) = status

parseToRecord :: B.ByteString -> Question
parseToRecord bstr = handle decodedObj
                        where handle (Ok k)     = k
                              handle (Error e)  = error (e ++ "\n" ++ show bstr)
                              decodedObj = decode (B.unpack bstr) :: Result Question
--parseToRecord bstr = (\(Ok x) -> x) decodedObj where decodedObj = decode (B.unpack bstr) :: Result Question

main :: IO()
main = do
  ts <- B.getContents >>= return . fst . fromJust . B.readInteger . head . B.lines
  json_text <- B.getContents >>= return . tail . B.lines
  let training_data = take (fromIntegral ts) json_text
  let questions = map parseToRecord training_data
  let correlation = foldr (\x acc -> if (isAnonymous x == isAnswered x) then (fst acc + 1, snd acc + 1) else (fst acc, snd acc + 1)) (0,0) questions
  print $ fst correlation

这是可以作为可执行文件输入的数据。我正在使用 ghc 7.6.3。如果程序名称是 ans.hs,我按照这些步骤操作。

$ ghc --make ans.hs
$ ./ans < path/to/the/file/sample/answered_data_10k.in

多谢!

4

2 回答 2

2

lambda 函数(\(Ok x) -> x)是部分的,因为它只能匹配成功解码的对象。如果您遇到这种情况,则表明您的JSON解析器由于某种原因无法解析记录。

使parseToRecord函数提供更多信息将帮助您找到错误。尝试实际报告错误,而不是报告失败的模式匹配。

parseToRecord :: B.ByteString -> Question
parseToRecord bstr = handle decodedObj 
    where handle (Ok k)    = k
          handle (Error e) = error e
          decodedObj = decode (B.unpack bstr) :: Result Question

如果您需要更多帮助,包含解析器代码可能会很有用。

更新

根据您的代码和示例JSON,您的代码nullcontext_topic您的JSON. 您当前的代码无法处理 a null,因此无法解析。我的修复将类似于以下内容,但您可以想出其他方法来处理它。

data Nullable a = Null
                | Full a
    deriving (Show)

instance JSON a => JSON (Nullable a) where
    showJSON Null     = JSNull
    showJSON (Full a) = showJSON a

    readJSON JSNull   = Ok Null
    readJSON c        = Full `fmap` readJSON c

data Question = Question
  { question_text :: String,
    context_topic :: Nullable Topic,
    topics :: [Topic],
    question_key :: String,
    __ans__ :: Bool,
    anonymous :: Bool
  } deriving (Show)

它似乎在第 9002 行也失败了,该行有一个“1000”的裸值,并且该行之后的几个值似乎JSON缺少该 '__ans__'字段。

于 2013-07-29T22:41:31.233 回答
0

我建议使用Maybe它来解析空值:

data Question = Question
  { question_text :: String
  , context_topic :: Maybe Topic
  , topics :: [Topic]
  , question_key :: String
  , __ans__ :: Bool
  , anonymous :: Bool
  } deriving (Show)

readJSON然后按如下方式更改函数(此外,可以通过返回不成功的解析尝试来修复缺少的ans字段):False

instance JSON Question where
  -- Keep the compiler quiet
  showJSON = undefined

  readJSON (JSObject obj) = Question <$>
    obj ! "question_text"   <*>
    (fmap Just (obj ! "context_topic") <|> return Nothing) <*>
    obj ! "topics" <*>
    obj ! "question_key" <*>
    (obj ! "__ans__" <|> return False) <*>
    obj ! "anonymous"
  readJSON _ = mzero

在摆脱了10009000 行(如提到的 sabauma)之后,我得到了4358结果。那么也许这些细微的变化就足够了?

于 2013-07-30T18:40:04.460 回答