I am trying to understand the "add" and "extract" methods of the FPTree class (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/fpm/FPGrowth.scala).
- What is the purpose of the "summaries" variable?
- Where is the group list? I assume it is the following; am I correct?

    val numParts = if (numPartitions > 0) numPartitions else data.partitions.length
    val partitioner = new HashPartitioner(numParts)

- For the three transactions {a,b,c}, {a,b}, and {b,c}, all frequent, what will "summaries" contain?

    def add(t: Iterable[T], count: Long = 1L): FPTree[T] = {
      require(count > 0)
      var curr = root
      curr.count += count
      t.foreach { item =>
        val summary = summaries.getOrElseUpdate(item, new Summary)
        summary.count += count
        val child = curr.children.getOrElseUpdate(item, {
          val newNode = new Node(curr)
          newNode.item = item
          summary.nodes += newNode
          newNode
        })
        child.count += count
        curr = child
      }
      this
    }

    def extract(
        minCount: Long,
        validateSuffix: T => Boolean = _ => true): Iterator[(List[T], Long)] = {
      summaries.iterator.flatMap { case (item, summary) =>
        if (validateSuffix(item) && summary.count >= minCount) {
          Iterator.single((item :: Nil, summary.count)) ++
            project(item).extract(minCount).map { case (t, c) =>
              (item :: t, c)
            }
        } else {
          Iterator.empty
        }
      }
    }
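
To make the third question concrete, here is a minimal sketch I put together that mirrors the "add" logic above without any Spark dependency. The names MiniFPTree and SummariesDemo are hypothetical, and Node and Summary are simplified stand-ins rather than Spark's actual classes. It also assumes the items are inserted in the order written in the question; as far as I can tell, Spark first reorders each transaction by item frequency rank, which would change which nodes each summary points to but not the per-item counts.

```scala
import scala.collection.mutable

// Simplified stand-in for a tree node: the item it holds (None for the root),
// its parent, a running count, and its children keyed by item.
final class Node[T](val item: Option[T], val parent: Option[Node[T]]) {
  var count: Long = 0L
  val children: mutable.Map[T, Node[T]] = mutable.Map.empty
}

// Per-item summary: total count of the item across the whole tree,
// plus every tree node that carries this item.
final class Summary[T] {
  var count: Long = 0L
  val nodes: mutable.ListBuffer[Node[T]] = mutable.ListBuffer.empty
}

// Hypothetical, reduced tree that only supports `add`.
final class MiniFPTree[T] {
  val root: Node[T] = new Node[T](None, None)
  val summaries: mutable.Map[T, Summary[T]] = mutable.Map.empty

  def add(t: Iterable[T], count: Long = 1L): this.type = {
    require(count > 0)
    var curr = root
    curr.count += count
    t.foreach { item =>
      // One summary per distinct item, created on first sight.
      val summary = summaries.getOrElseUpdate(item, new Summary[T])
      summary.count += count
      // Reuse the child on a shared prefix, otherwise create a new node
      // and register it in the item's summary.
      val child = curr.children.getOrElseUpdate(item, {
        val newNode = new Node[T](Some(item), Some(curr))
        summary.nodes += newNode
        newNode
      })
      child.count += count
      curr = child
    }
    this
  }
}

object SummariesDemo {
  // Assumption: items are inserted in the order given in the question.
  def build(): MiniFPTree[String] = {
    val tree = new MiniFPTree[String]
    tree.add(Seq("a", "b", "c")).add(Seq("a", "b")).add(Seq("b", "c"))
  }

  def main(args: Array[String]): Unit = {
    val tree = build()
    tree.summaries.toSeq.sortBy(_._1).foreach { case (item, s) =>
      println(s"$item: count=${s.count}, nodes=${s.nodes.size}")
    }
    // a: count=2, nodes=1
    // b: count=3, nodes=2
    // c: count=2, nodes=2
  }
}
```

Under these assumptions, "summaries" ends up with one entry per distinct item. Each entry holds the item's total count and the list of tree nodes carrying that item, which looks like what "extract" relies on for the minCount check and for projecting the tree onto a suffix item.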