scala - How do I read input from a file and convert the file's data lines to List[Map[Int,String]] using Scala?

Question

My question is how to read input from a file and convert the file's data lines to List[Map[Int,String]] using Scala. Here I give a data set as input. My code is:

  def id3(attrs: Attributes,
      examples: List[Example],
      label: Symbol
       ) : Node = {
level = level+1


  // if all the examples have the same label, return a new node with that label

  if(examples.forall( x => x(label) == examples(0)(label))){
  new Leaf(examples(0)(label))
  } else {
  for(a <- attrs.keySet-label){          //except label, take all attrs
    ("Information gain for %s is %f".format(a,
      informationGain(a,attrs,examples,label)))
  }


  // find the best splitting attribute - this is an argmax on a function over the list

  var bestAttr:Symbol = argmax(attrs.keySet-label, (x:Symbol) =>
    informationGain(x,attrs,examples,label))




  // now we produce a new branch, which splits on that node, and recurse down the nodes.

  var branch = new Branch(bestAttr)

  for(v <- attrs(bestAttr)){


    val subset = examples.filter(x=> x(bestAttr)==v)



    if(subset.size == 0){
      // println(levstr+"Tiny subset!")
      // zero subset, we replace with a leaf labelled with the most common label in
      // the examples
      val m = examples.map(_(label))
      val mostCommonLabel = m.toSet.map((x:Symbol) => (x,m.count(_==x))).maxBy(_._2)._1
      branch.add(v,new Leaf(mostCommonLabel))

    }
    else {
      // println(levstr+"Branch on %s=%s!".format(bestAttr,v))

      branch.add(v,id3(attrs,subset,label))
    }
   }
  level = level-1
  branch
  }
  }
  }
object samplet {
def main(args: Array[String]){

var attrs: sample.Attributes = Map()
attrs += ('0 -> Set('abc,'nbv,'zxc))
attrs += ('1 -> Set('def,'ftr,'tyh))
attrs += ('2 -> Set('ghi,'azxc))
attrs += ('3 -> Set('jkl,'fds))
attrs += ('4 -> Set('mno,'nbh))



val examples: List[sample.Example] = List(
  Map(
    '0 -> 'abc,
    '1 -> 'def,
    '2 -> 'ghi,
    '3 -> 'jkl,
    '4 -> 'mno
  ),
  ........................
  )


// obviously we can't use the label as an attribute, that would be silly!
val label = 'play

println(sample.id3(attrs,examples,label).getStr(0))

}
}

But how do I change this code so that it accepts input from a .csv file?

score 4 · Accepted Answer

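I suggest using Java's io / nio standard library to read your CSV file. I do not see any relevant drawback to doing so.

But the first question we need to answer is where in the code to read the file. The parsed input appears to replace the value of examples. This also tells us what type the parsed CSV input must have, namely List[Map[Symbol, Symbol]]. So let us declare a new class: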

class InputFromCsvLoader(charset: Charset = Charset.defaultCharset()) {
  def getInput(file: Path): List[Map[Symbol, Symbol]] = ???
}

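Note that the Charset is only needed if we have to distinguish between differently encoded CSV files.

OK, so how do we implement the method? It should do the following:

Create an appropriate input reader
Read all lines
Split each line at the comma separator
Convert each substring into the symbol it represents
Build a map from the list of symbols, using the attributes as keys
Create and return the list of maps

Or, expressed in code: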

import java.io.BufferedReader
import java.nio.charset.Charset
import java.nio.file.{Files, Path}
import scala.collection.JavaConverters._

class InputFromCsvLoader(charset: Charset = Charset.defaultCharset()) {
  val Attributes = List('outlook, 'temperature, 'humidity, 'wind, 'play)
  val Separator = ","

  /** Get the desired input from the CSV file. Does not perform any checks, i.e., there are no guarantees on what happens if the input is malformed. */
  def getInput(file: Path): List[Map[Symbol, Symbol]] = {
    val reader = Files.newBufferedReader(file, charset)
    /* Read the whole file, discard the first line (the header), and always close the reader */
    try inputWithHeader(reader).tail
    finally reader.close()
  }

  /** Reads all lines in the CSV file using [[java.io.BufferedReader]]. There are many ways to do this and this is probably not the prettiest. */
  private def inputWithHeader(reader: BufferedReader): List[Map[Symbol, Symbol]] = {
    reader.lines().iterator().asScala.foldLeft(List.empty[Map[Symbol, Symbol]]) {
      (accumulator, nextLine) =>
        parseLine(nextLine) :: accumulator
    }.reverse
  }

  /** Parse an entry. Does not verify the input: if there are fewer attributes than columns or vice versa, zip creates a list of the size of the shorter list */
  private def parseLine(line: String): Map[Symbol, Symbol] = (Attributes zip (line split Separator map parseSymbol)).toMap

  /** Create a symbol from a String... we could also check whether the string represents a valid symbol */
  private def parseSymbol(symbolAsString: String): Symbol = Symbol(symbolAsString)
}

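Caveat: this expects valid input only, and it assumes that the individual symbol representations do not contain the comma separator character. If that cannot be assumed, the code as written will split some valid input strings incorrectly.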
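
To illustrate the caveat, here is a minimal sketch of how a naive split on the separator misreads a quoted field that itself contains a comma. The sample line and the object name NaiveSplitDemo are made up for illustration and do not come from the asker's data set.

```scala
object NaiveSplitDemo {
  // Hypothetical CSV line: the second field is quoted and contains a comma.
  val line = "sunny,\"hot,dry\",high"

  // Naive split: every comma is treated as a separator, including the quoted one.
  val fields: Array[String] = line.split(",")

  def main(args: Array[String]): Unit = {
    // The line has three logical fields, but the naive split yields four pieces.
    println(fields.length)
    fields.foreach(println)
  }
}
```

If input like this is possible, a dedicated CSV parsing library is probably a safer choice than hand-rolled splitting.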

To use this new code, we can change the main method as follows:

def main(args: Array[String]){
  val csvInputFile: Option[Path] = args.headOption map (p => Paths get p)
  val examples = (csvInputFile map new InputFromCsvLoader().getInput).getOrElse(exampleInput)
  // ... your code

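Here, examples takes the value exampleInput, which is the current hardcoded value of examples, if no input argument is specified.

Important: for convenience, all error handling has been omitted from the code. Errors can occur in most cases when reading from files, and user input must be treated as potentially invalid, so, sadly, error handling at the boundaries of your program is usually not optional.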
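
As one possible way to add that error handling, the sketch below wraps the file read in scala.util.Try so that I/O failures become values the caller has to deal with. The names SafeCsvRead and readLines, and the UTF-8 default, are assumptions made for this example and are not part of the loader class above.

```scala
import java.nio.charset.{Charset, StandardCharsets}
import java.nio.file.{Files, Path, Paths}
import scala.util.{Failure, Success, Try}

object SafeCsvRead {
  /** Reads all lines of a file, capturing any I/O failure in a Try.
    * readLines is a hypothetical helper name used only in this sketch. */
  def readLines(file: Path, charset: Charset = StandardCharsets.UTF_8): Try[List[String]] =
    Try {
      val reader = Files.newBufferedReader(file, charset)
      try {
        // Pull lines one at a time until readLine returns null (end of file).
        Iterator.continually(reader.readLine()).takeWhile(_ != null).toList
      } finally reader.close() // always release the file handle
    }

  def main(args: Array[String]): Unit =
    args.headOption match {
      case None => println("usage: SafeCsvRead <file.csv>")
      case Some(p) =>
        // Building the Path can fail too, so it is wrapped as well.
        Try(Paths.get(p)).flatMap(path => readLines(path)) match {
          case Success(lines) => println(s"read ${lines.size} lines from $p")
          case Failure(e)     => println(s"could not read $p: ${e.getMessage}")
        }
    }
}
```

Returning a Try keeps the failure visible in the type, in the same spirit as the Option advice in the side notes.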

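Side notes:

Try not to use null in your code. Returning Option[T] is a better choice than returning null, because it makes the "nullness" explicit and provides static safety through the type system.
The return keyword is not needed in Scala, because the last value of a method is always returned. You can still use the keyword if you find the code more readable or if you want to break out in the middle of a method (which is usually a bad idea).
Prefer val over var, because immutable values are easier to reason about than mutable ones.
The code will fail with the provided CSV string because it contains the symbols TRUE and FALSE, which are not legal according to your program logic (they should be true and false instead).
Put all the relevant information into your error messages. Your error message only tells me that the value of the attribute 'wind is bad, but it does not tell me what the actual value is.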
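
Two of these notes can be sketched in a few lines: using Option instead of null, and an error message that names the offending value. The set of legal wind values and the names SideNotesDemo, checkWind and column are assumptions for illustration; the asker's actual validation logic is not shown in the question.

```scala
object SideNotesDemo {
  // Assumed legal values for the wind attribute; chosen for illustration only.
  val legalWind: Set[String] = Set("true", "false")

  /** Returns Right(value) if legal, otherwise Left with a message that names
    * both the attribute and the offending value. */
  def checkWind(raw: String): Either[String, String] =
    if (legalWind.contains(raw)) Right(raw)
    else Left(s"Illegal value '$raw' for attribute 'wind'; expected one of " +
      legalWind.toList.sorted.mkString(", "))

  /** Option instead of null: a missing column is visible in the return type. */
  def column(fields: Array[String], index: Int): Option[String] =
    if (index >= 0 && index < fields.length) Some(fields(index)) else None

  def main(args: Array[String]): Unit = {
    val fields = "sunny,hot,high,TRUE".split(",")
    println(column(fields, 3).map(checkWind)) // column present, value rejected
    println(column(fields, 7))                // column absent, so None
  }
}
```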

score 1 · Accepted Answer

To read the CSV file:

import scala.io.Source
val datalines = Source.fromFile(filepath).getLines()

So datalines is an iterator over all the lines of the CSV file.

Next, convert each line to a Map[Int,String]:

val datamap = datalines.map{ line =>
    line.split(",").zipWithIndex.map{ case (word, idx) => idx -> word}.toMap
    }

Here we split each line on ",". Then we build a map whose keys are the column numbers and whose values are the words produced by the split.
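
As a small worked example of that transformation on a single line (the sample line and the name LineToMapDemo are hypothetical, since the real file contents are not shown in the question):

```scala
object LineToMapDemo {
  // Hypothetical sample line; not taken from the asker's data set.
  val line = "sunny,hot,high"

  // Pair each field with its column index, then flip the pair into index -> field.
  val asMap: Map[Int, String] =
    line.split(",").zipWithIndex.map { case (word, idx) => idx -> word }.toMap

  def main(args: Array[String]): Unit =
    println(asMap == Map(0 -> "sunny", 1 -> "hot", 2 -> "high")) // prints true
}
```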

Next, if we want a List[Map[Int,String]]:

val datamap = datalines.map{ line =>
    line.split(",").zipWithIndex.map{ case (word, idx) => idx -> word}.toMap
    }.toList
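
Putting the pieces together, a self-contained version might look like the sketch below. It differs from the snippets above in two ways that are my own assumptions about what is wanted: it closes the Source after reading, and it passes -1 as the split limit so that trailing empty fields are kept (a plain split(",") drops them). The names CsvToMaps, lineToMap and readCsv are hypothetical.

```scala
import scala.io.Source

object CsvToMaps {
  /** Turns one CSV line into column index -> field.
    * The limit -1 keeps trailing empty fields, which a plain split(",") drops. */
  def lineToMap(line: String): Map[Int, String] =
    line.split(",", -1).zipWithIndex.map { case (word, idx) => idx -> word }.toMap

  /** Reads the whole file eagerly so the source can be closed before returning.
    * readCsv is a hypothetical helper name used only for this sketch. */
  def readCsv(path: String): List[Map[Int, String]] = {
    val source = Source.fromFile(path)
    try source.getLines().map(lineToMap).toList
    finally source.close()
  }

  def main(args: Array[String]): Unit =
    args.headOption.foreach(p => readCsv(p).foreach(println))
}
```

Calling toList inside the try matters here: getLines() is lazy, so the lines have to be consumed before the source is closed.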
