haskell - Advice on defining data structures in Haskell

Question

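I'm having trouble modeling data structures in Haskell. Suppose I'm running an animal research facility and I want to keep track of my rats. I want to track which cages and experiments the rats are assigned to. I also want to record the weight of my rats and the volume of my cages, and to keep notes on my experiments.

In SQL, I'd probably do: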

create table cages (id integer primary key, volume double);
create table experiments (id integer primary key, notes text);
create table rats (
    id integer primary key,
    weight double,
    cage_id integer references cages (id),
    experiment_id integer references experiments (id)
);

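(I realize this allows me to assign two rats from different experiments to the same cage. That is intentional. I don't actually run an animal research facility.)

Two operations must be possible: (1) given a rat, find the volume of its cage, and (2) given a rat, get the notes for the experiment it belongs to.

In SQL, those would be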

select cages.volume from rats
  inner join cages on cages.id = rats.cage_id
  where rats.id = ...; -- (1)
select experiments.notes from rats
  inner join experiments on experiments.id = rats.experiment_id
  where rats.id = ...; -- (2)

How might I model this data structure in Haskell?

One way to do it is

type Weight = Double
type Volume = Double

data Rat = Rat Cage Experiment Weight
data Cage = Cage Volume
data Experiment = Experiment String

data ResearchFacility = ResearchFacility [Rat]

ratCageVolume :: Rat -> Volume
ratCageVolume (Rat (Cage volume) _ _) = volume

ratExperimentNotes :: Rat -> String
ratExperimentNotes (Rat _ (Experiment notes) _) = notes

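But won't this structure introduce a bunch of copies of the Cages and Experiments? Or should I just not worry about it and hope the optimizer takes care of it?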

score 16 · Accepted Answer

Here's a short file I used for testing:

type Weight = Double
type Volume = Double

data Rat = Rat Cage Experiment Weight deriving (Eq, Ord, Show, Read)
data Cage = Cage Volume               deriving (Eq, Ord, Show, Read)
data Experiment = Experiment String   deriving (Eq, Ord, Show, Read)

volume     = 30
name       = "foo"
weight     = 15
cage       = Cage volume
experiment = Experiment name
rat        = Rat cage experiment weight

Then I started ghci and imported System.Vacuum.Cairo, which is available from the delightful vacuum-cairo package.

*Main System.Vacuum.Cairo> view (rat, Rat (Cage 30) (Experiment "foo") 15)

No sharing

*Main System.Vacuum.Cairo> view (rat, Rat (Cage 30) experiment 15)

Sharing the experiment

(I'm not quite sure why the arrows are doubled in this one, but you can ignore or collapse them.)

*Main System.Vacuum.Cairo> view (rat, Rat cage experiment weight)

Sharing the arguments

*Main System.Vacuum.Cairo> view (rat, rat)

Sharing everything

*Main System.Vacuum.Cairo> view (rat, Rat cage experiment (weight+1))

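Sharing with a modification

As the views above demonstrate, the rule of thumb is that a new object is created exactly when you call a constructor. If you merely name an object that has already been created, no new object is created. Sharing like this is safe in Haskell because values are immutable.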
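The rule of thumb can be sketched without vacuum-cairo. The names `sharedCage`, `sharedExperiment`, `ratA`, `ratB`, and `ratC` below are hypothetical, and the comments describe what the source code asks for; an optimizing compiler is free to share more than this.

```haskell
type Weight = Double
type Volume = Double

data Cage       = Cage Volume                deriving (Eq, Show)
data Experiment = Experiment String          deriving (Eq, Show)
data Rat        = Rat Cage Experiment Weight deriving (Eq, Show)

-- One Cage and one Experiment are constructed here, once.
sharedCage :: Cage
sharedCage = Cage 30

sharedExperiment :: Experiment
sharedExperiment = Experiment "foo"

-- Both rats merely name the existing values, so the only new
-- constructor applications are the two outer Rat ones.
ratA, ratB :: Rat
ratA = Rat sharedCage sharedExperiment 15
ratB = Rat sharedCage sharedExperiment 17

-- Calling the Cage constructor again requests a fresh Cage that is
-- equal to sharedCage but not necessarily the same heap object.
ratC :: Rat
ratC = Rat (Cage 30) sharedExperiment 15

ratCageVolume :: Rat -> Volume
ratCageVolume (Rat (Cage v) _ _) = v
```

Note that `ratA == ratC` holds even though `ratC` was built with a second `Cage` constructor call: derived `Eq` compares structure, so sharing is invisible to ordinary code and only affects memory use.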

score 6 · Accepted Answer

A more natural Haskell representation of your model is to have the cages contain the actual rat objects rather than their ids:

data Rat = Rat RatId Weight
data Cage = Cage [Rat] Volume
data Experiment = Experiment [Rat] String

You would then create ResearchFacility objects with a smart constructor to make sure they follow the rules. It could look like:

research_facility :: [Rat] -> Map Rat Cage -> Map Rat Experiment -> ResearchFacility
research_facility rats cage_assign experiment_assign = ...

where cage_assign and experiment_assign are maps that contain the same information as the cage_id and experiment_id foreign keys in SQL.
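One possible shape for such a smart constructor is sketched below. It is a simplified variant, not the answer's exact signature: it handles cages only (experiments would follow the same pattern), and `RatId = Int`, `CageId`, the `Map CageId Volume` argument, and the `Maybe` result are all assumptions made for illustration.

```haskell
import qualified Data.Map as Map
import Data.Map (Map)

type RatId  = Int   -- assumption: the answer leaves RatId undefined
type CageId = Int   -- hypothetical key for cages
type Weight = Double
type Volume = Double

data Rat  = Rat RatId Weight  deriving (Eq, Show)
data Cage = Cage [Rat] Volume deriving (Eq, Show)

newtype ResearchFacility = ResearchFacility (Map CageId Cage)
  deriving (Eq, Show)

-- Smart constructor: returns Nothing if any rat has no cage assignment
-- or is assigned to a cage that is not in the volume table.
researchFacility
  :: [Rat] -> Map RatId CageId -> Map CageId Volume -> Maybe ResearchFacility
researchFacility rats cageAssign volumes = do
  placed <- mapM place rats
  let -- Group rats by cage, keeping the input order within each cage.
      ratsByCage = Map.fromListWith (flip (++)) [(c, [r]) | (c, r) <- placed]
      -- Every known cage starts out empty ...
      emptyCages = Map.map (Cage []) volumes
      -- ... and cages that received rats override the empty ones.
      filled     = Map.intersectionWith Cage ratsByCage volumes
  return (ResearchFacility (Map.union filled emptyCages))
  where
    place rat@(Rat rid _) = do
      cid <- Map.lookup rid cageAssign
      _   <- Map.lookup cid volumes
      return (cid, rat)
```

Because the only way to obtain a `ResearchFacility` would be through `researchFacility` (assuming the data constructor is not exported from the module), every facility value is guaranteed to place each rat in exactly one existing cage.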

score 4 · Accepted Answer

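First observation: you should learn to use records. Record field names in Haskell are treated as functions, so these definitions will at the very least save you some typing: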

data Rat = Rat { getCage       :: Cage
               , getExperiment :: Experiment
               , getWeight     :: Weight }

data Cage = Cage { getVolume :: Volume }

-- Now this function is so trivial to define that you might actually not bother:
ratCageVolume :: Rat -> Volume
ratCageVolume = getVolume . getCage

As for the data representation, I would probably go along these lines:

type Weight = Double
type Volume = Double

-- Rats and Cages have identity that goes beyond their properties;
-- two distinct rats of the same weight can be in the same cage, and
-- two cages can have same volume.
-- 
-- So should we give each Rat and Cage an additional field to
-- represent its key?  We could do that, or we could abstract that out
-- into this:

data Identity i a = Identity { getId  :: i
                             , getVal :: a }
            deriving Show

instance Eq i => Eq (Identity i a) where
    a == b = getId a == getId b

instance Ord i => Ord (Identity i a) where
    a `compare` b = getId a `compare` getId b


-- And to simplify a common case:
type Id a = Identity Int a


-- Rats' only real intrinsic property is their weight.  Cage and Experiment?
-- Situational, I say.
data Rat = Rat { getWeight :: Weight  }

data Cage = Cage { getVolume :: Volume }

data Experiment = Experiment { getNotes :: String }
                  deriving (Eq, Show)

-- The data that you're manipulating is really this:
type RatData = (Id Rat, Id Cage, Id Experiment)

type ResearchFacility = [RatData]
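With this representation, the two operations from the question can be written as lookups over the list of rows. The sketch below repeats the relevant definitions so that it stands alone; `lookupRat`, `ratCageVolume`, and `ratExperimentNotes` taking an `Int` key are hypothetical helpers, and the `Maybe` results reflect the assumption that a key may not be present.

```haskell
import Data.List (find)

type Weight = Double
type Volume = Double

data Identity i a = Identity { getId :: i, getVal :: a }

type Id a = Identity Int a

data Rat        = Rat        { getWeight :: Weight }
data Cage       = Cage       { getVolume :: Volume }
data Experiment = Experiment { getNotes  :: String }

type RatData          = (Id Rat, Id Cage, Id Experiment)
type ResearchFacility = [RatData]

-- Hypothetical helper: find the row whose rat has the given key.
lookupRat :: Int -> ResearchFacility -> Maybe RatData
lookupRat ratKey = find (\(rat, _, _) -> getId rat == ratKey)

-- (1) Given a rat's key, the volume of its cage.
ratCageVolume :: Int -> ResearchFacility -> Maybe Volume
ratCageVolume ratKey facility = do
  (_, cage, _) <- lookupRat ratKey facility
  return (getVolume (getVal cage))

-- (2) Given a rat's key, the notes of its experiment.
ratExperimentNotes :: Int -> ResearchFacility -> Maybe String
ratExperimentNotes ratKey facility = do
  (_, _, experiment) <- lookupRat ratKey facility
  return (getNotes (getVal experiment))
```

A linear scan with `find` is fine for a sketch; for a large facility a `Data.Map` keyed on the rat's id would make the lookup logarithmic.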

score 3 · Accepted Answer

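I use Haskell for most of my day job, and I have run into this problem. In my experience, the issue is not so much how many copies of a data structure get created as the data dependencies involved. We use similar data structures to interface with a relational database that stores the actual data, which means we have queries like these.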

getCageById       :: IdType -> IO (Maybe Cage)
getRatById        :: IdType -> IO (Maybe Rat)
getExperimentById :: IdType -> IO (Maybe Experiment)

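We started out with data structures built like yours, with the linked data structures contained inside them. That turned out to be a huge mistake. The problem is that if you use the following definition for Rat...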

data Rat = Rat Cage Experiment Weight

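...then the getRatById function has to run three database queries to return its result. At first this seemed like a nice convenience, but it ended up being a huge performance problem, especially when we wanted a query to return a bunch of results. The data structure forces us to perform the joins even when we only want the rows from the rat table. The extra database queries are the problem, not the possibility of extra objects in RAM.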

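Our policy now is that when we create data structures that correspond to database tables, we always keep them flat in the same way the tables are, referring to related rows by id rather than embedding them. So your example would become something like this: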

type IdType = Int
type Weight = Double
type Volume = Double

data Rat = Rat
    { ratId        :: IdType
    , cageId       :: IdType
    , experimentId :: IdType
    , weight       :: Weight
    }
data Cage = Cage IdType Volume
data Experiment = Experiment IdType String

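(You might even want to use newtypes to distinguish the different ids.) Getting the whole structure takes more work, but this lets you fetch just parts of the structure efficiently. Of course, if you never need to fetch individual pieces of the structure, my suggestion may not be appropriate. But in my experience partial queries are common, and I don't want to make them artificially slow. If you want the convenience of a function that performs the joins for you, you can certainly write one. Just don't use a data model that locks you into that usage pattern.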
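The newtype suggestion and the optional join function can be sketched as follows. This is an illustration under stated assumptions: the `Tables` record is a hypothetical in-memory stand-in for the database, `ratWithDetails` is a made-up name, and a real version would run in `IO` like the `get...ById` queries above.

```haskell
import qualified Data.Map as Map
import Data.Map (Map)

type Weight = Double
type Volume = Double

-- Distinct newtypes stop a cage id from being passed where a rat id is expected.
newtype RatId        = RatId Int        deriving (Eq, Ord, Show)
newtype CageId       = CageId Int       deriving (Eq, Ord, Show)
newtype ExperimentId = ExperimentId Int deriving (Eq, Ord, Show)

data Rat = Rat
  { ratId        :: RatId
  , cageId       :: CageId
  , experimentId :: ExperimentId
  , weight       :: Weight
  } deriving (Eq, Show)

data Cage       = Cage CageId Volume             deriving (Eq, Show)
data Experiment = Experiment ExperimentId String deriving (Eq, Show)

-- Hypothetical in-memory stand-in for the cages and experiments tables.
data Tables = Tables
  { cages       :: Map CageId Cage
  , experiments :: Map ExperimentId Experiment
  }

-- Optional convenience join: callers that only need the Rat row never pay for it.
ratWithDetails :: Tables -> Rat -> Maybe (Rat, Cage, Experiment)
ratWithDetails tables rat = do
  cage       <- Map.lookup (cageId rat) (cages tables)
  experiment <- Map.lookup (experimentId rat) (experiments tables)
  return (rat, cage, experiment)
```

The join lives in a function rather than in the `Rat` type, so fetching a rat alone stays a single lookup and the extra lookups happen only when a caller asks for them.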
