scala - Scala 的理解效率？

Question

在“Programming In Scala”一书第 23 章中，作者举了一个例子：

case class Book(title: String, authors: String*)
val books: List[Book] = // list of books, omitted here
// ﬁnd all authors who have published at least two books

for (b1 <- books; b2 <- books if b1 != b2;
    a1 <- b1.authors; a2 <- b2.authors if a1 == a2)
yield a1

作者说，这会翻译成：

books flatMap (b1 =>
   books filter (b2 => b1 != b2) flatMap (b2 =>
      b1.authors flatMap (a1 =>
        b2.authors filter (a2 => a1 == a2) map (a2 =>
           a1))))

但是，如果您查看 map 和 flatmap 方法定义（TraversableLike.scala），您可能会发现，它们被定义为 for 循环：

   def map[B, That](f: A => B)(implicit bf: CanBuildFrom[Repr, B, That]): That = {
    val b = bf(repr)
    b.sizeHint(this) 
    for (x <- this) b += f(x)
    b.result
  }

  def flatMap[B, That](f: A => Traversable[B])(implicit bf: CanBuildFrom[Repr, B, That]): That = {
    val b = bf(repr)
    for (x <- this) b ++= f(x)
    b.result
  }

好吧，我猜这个 for 将不断被翻译成 foreach 然后翻译成 while 语句，这是一个构造而不是表达式，scala 没有 for 构造，因为它希望 for 总是产生一些东西。

那么，我想和你讨论的是，为什么 Scala 会这样做“为了翻译”？作者的例子使用了4个生成器，最后会被翻译成4级嵌套的for循环，我想当它books很大时它的性能会非常糟糕。

Scala 鼓励人们使用这种“语法糖”，你总是可以看到大量使用 filter、map 和 flatmap 的代码，这似乎程序员忘记了他们真正做的是将一个循环嵌套在另一个循环中，而实现的只是使代码看起来更短。你的想法是什么？

score 7 · Accepted Answer

因为推导式是单子转换的语法糖，因此在各种地方都很有用。在那一点上，它们在 Scala 中比等效的 Haskell 构造更冗长（当然，Haskell 默认情况下是非严格的，因此不能像在 Scala 中那样谈论构造的性能）。

同样重要的是，这种结构可以使正在做的事情保持清晰，并避免快速升级缩进或不必要的私有方法嵌套。

至于最后的考虑，无论是否隐藏了复杂性，我都会假设：

for {
  b1 <- books
  b2 <- books
  if b1 != b2
  a1 <- b1.authors
  a2 <- b2.authors 
  if a1 == a2
} yield a1

很容易看到正在做什么，复杂性很明显：b^2 * a^2（过滤器不会改变复杂性），用于书籍数量和作者数量。现在，用 Java 编写相同的代码，无论是深度缩进还是私有方法，并尝试快速确定代码的复杂性。

所以，恕我直言，这并没有隐藏复杂性，相反，它清楚地表明了这一点。

至于你提到的map//定义，它们不属于flatMap或任何其他类，所以它们不会被应用。基本上，filterList

for(x <- List(1, 2, 3)) yield x * 2

被翻译成

List(1, 2, 3) map (x => x * 2)

那不是一回事

map(List(1, 2, 3), ((x: Int) => x * 2)))

这就是您传递的定义的调用方式。map记录一下， on的实际实现List是：

def map[B, That](f: A => B)(implicit bf: CanBuildFrom[Repr, B, That]): That = {
  val b = bf(repr)
  b.sizeHint(this) 
  for (x <- this) b += f(x)
  b.result
}

score 6 · Accepted Answer

我编写代码以便于理解和维护。然后我简介。如果有瓶颈，那就是我关注的地方。如果它像您所描述的那样，我将以不同的方式解决问题。在那之前，我爱上了“糖”。它省去了我写东西或认真思考的麻烦。

score 4 · Accepted Answer

There are actually 6 loops. One loop for each filter/flatMap/map

The filter->map pairs can be done in one loop by using lazy views of the collections (iterator method)

In general, tt is running 2 nested loops for books to find all book pairs and then two nested loops to find if the author of one book is in the list of authors of the other.

Using simple data structures, you would do the same when coding explicitly.

And of course, the example here is to show a complex 'for' loop, not to write the most efficient code. E.g., instead of a sequence of authors, one could use a Set and then find if the intersection is non empty:

for (b1 <- books; b2 <- books; a <- (b1.authors & b2.authors)) yield a

score 2 · Accepted Answer

请注意，在 2.8 中，filter调用已更改withFilter为惰性调用，并且会避免构造中间结构。请参阅从过滤器移动到 withFilter 的指南？.

我相信for翻译为map, flatMapand的原因withFilter（以及如果存在的值定义）是为了使 monad 的使用更容易。

一般来说，我认为如果您正在进行的计算涉及循环 4 次，那么使用for循环就可以了。如果计算可以更有效地完成并且性能很重要，那么您应该使用更有效的算法。

score 0 · Accepted Answer

@IttayD 关于算法效率的回答的后续行动。值得注意的是，原始帖子（和书中）中的算法是嵌套循环连接。实际上，对于大型数据集，这不是一种有效的算法，大多数数据库会在这里使用哈希聚合。在 Scala 中，哈希聚合看起来像：

(for (book <- books;
      author <- book.authors) yield (book, author)
).groupBy(_._2).filter(_._2.size > 1).keys

scala - Scala 的理解效率？

5 回答 5

Related

Reference