xml - Parsing with the scala-xml API using multiple attributes

Question

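I have an XML document that I am trying to parse with the Scala XML API. I have XPath-style queries to retrieve data from the XML tags. I want to retrieve the <price> tag value from <market>, but select it by two attributes, _id and type. I want to write a condition with && so that I get a single value for each price tag, for example where market _id = 1 && type = "A".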

For reference, here is the XML:

<publisher>
    <book _id = "0"> 
        <author _id="0">Dev</author>
        <publish_date>24 Feb 1995</publish_date>
        <description>Data Structure - C</description>
        <market _id="0" type="A">
            <price>45.95</price>            
        </market>
        <market _id="0" type="B">
            <price>55.95</price>
        </market>
    </book>
    <book _id="1"> 
        <author _id = "1">Ram</author>
        <publish_date>02 Jul 1999</publish_date>
        <description>Data Structure - Java</description>
        <market _id="1" type="A">
            <price>145.95</price>           
        </market>   
        <market _id="1" type="B">
            <price>155.95</price>           
        </market>
    </book>
</publisher>

The following code runs fine:

import scala.xml._

object XMLtoCSV extends App {

  val xmlLoad = XML.loadFile("C:/Users/sharprao/Desktop/FirstTry.xml")  

  val price = (((xmlLoad \ "book" filter { _ \ "@_id" exists (_.text == "0")}) \ "market" filter { _ \ "@_id" exists (_.text == "0")}) \ "price").text  // intended: 45.95, actual: 45.9555.95
  val price1 = (((xmlLoad \ "book" filter { _ \ "@_id" exists (_.text == "1")}) \ "market" filter { _ \ "@_id" exists (_.text == "1")}) \ "price").text  // intended: 155.95, actual: 145.95155.95

  println("price = " + price)
  println("price1 = " + price1)
}

The output is:

price = 45.9555.95
price1 = 145.95155.95

The code above gives me both values concatenated, because I could not work out how to express the && condition.

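Please suggest which Scala functions I can use other than filter.
Also, let me know how to get all attribute names.
If possible, please tell me where I can read about all of these APIs.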

Thanks in advance.

score 2 · Accepted Answer

You can write a custom predicate that checks multiple attributes:

def checkMarket(marketId: String, marketType: String)(node: Node): Boolean = {
  node.attribute("_id").exists(_.text == marketId) &&
  node.attribute("type").exists(_.text == marketType)
}

Then use it as a filter:

val price1 = (((xmlLoad \ "book" filter (_ \ "@_id" exists (_.text == "0"))) \ "market" filter checkMarket("0", "A")) \ "price").text
// 45.95

val price2 = (((xmlLoad \ "book" filter (_ \ "@_id" exists (_.text == "1"))) \ "market" filter checkMarket("1", "B")) \ "price").text
// 155.95
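
For anyone who wants to try this without the file on disk, here is a minimal self-contained sketch of the same idea. It assumes the scala-xml module is on the classpath. The sample value is a trimmed copy of the question's XML, and priceFor is a hypothetical helper name chosen for illustration, not something the library provides.

```scala
import scala.xml._

object MarketFilterDemo {
  // Trimmed copy of the XML from the question, inlined so the sketch runs without a file.
  val sample: Elem = XML.loadString(
    """<publisher>
      |  <book _id="0">
      |    <market _id="0" type="A"><price>45.95</price></market>
      |    <market _id="0" type="B"><price>55.95</price></market>
      |  </book>
      |  <book _id="1">
      |    <market _id="1" type="A"><price>145.95</price></market>
      |    <market _id="1" type="B"><price>155.95</price></market>
      |  </book>
      |</publisher>""".stripMargin)

  // True only when BOTH attributes match; \@ yields "" when the attribute is absent.
  def checkMarket(marketId: String, marketType: String)(node: Node): Boolean =
    (node \@ "_id") == marketId && (node \@ "type") == marketType

  // Hypothetical helper: price text for one book id and one market type.
  def priceFor(root: Node, id: String, marketType: String): String = {
    val books   = (root \ "book").filter(b => (b \@ "_id") == id)
    val markets = (books \ "market").filter(checkMarket(id, marketType))
    (markets \ "price").text
  }

  def main(args: Array[String]): Unit = {
    println(priceFor(sample, "0", "A")) // 45.95
    println(priceFor(sample, "1", "B")) // 155.95
  }
}
```

Because the predicate combines both comparisons with &&, only one market per book survives the filter, so the price text is no longer a concatenation of two values.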

score 1 · Accepted Answer

If you are interested in getting a CSV file of the data, this is how you could write it:

(xmlLoad \ "book").flatMap { bk =>
  (bk \ "market").flatMap { mkt =>
    (mkt \ "price").map { p =>
      Seq(
        bk \@ "_id",
        mkt \@ "_id",
        mkt \@ "type",
        p.text.toFloat
      )
    }
  }
}.map { cols =>
  cols.mkString("\t")
}.foreach { 
  println
}

It will output the following:

0       0       A       45.95
0       0       B       55.95
1       1       A       145.95
1       1       B       155.95
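
The \@ operator used above reads one attribute by a known name. The question also asks how to list every attribute name on an element. A minimal sketch, assuming scala-xml's attributes member on a node and its asAttrMap helper; the market element is a hand-written sample and attrNames is just an illustrative name:

```scala
import scala.xml._

object AttributeNamesDemo {
  // Sample element mirroring one <market> node from the question.
  val market: Elem =
    XML.loadString("""<market _id="1" type="A"><price>145.95</price></market>""")

  // Attribute names and values as a plain Map.
  val attrs: Map[String, String] = market.attributes.asAttrMap

  // Just the names, sorted so the result does not depend on storage order.
  val attrNames: List[String] = attrs.keys.toList.sorted

  def main(args: Array[String]): Unit = {
    println(attrNames)     // List(_id, type)
    println(attrs("type")) // A
  }
}
```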

A common pattern to recognize when writing Scala: most flatMap ... flatMap ... map chains can be rewritten as for-comprehensions:

for {
    book <- xmlLoad \ "book"
    market <- book \ "market"
    price <- market \ "price"
} yield {
  val cols = Seq(
    book \@ "_id",
    market \@ "_id",
    market \@ "type",
    price.text.toFloat
  )
  println(cols.mkString("\t"))
}
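
A for-comprehension also accepts if guards, which gives another way to express the && condition from the question. The following is a minimal sketch under the same assumptions (scala-xml on the classpath, an inlined sample document); prices is a hypothetical helper name.

```scala
import scala.xml._

object GuardDemo {
  // Trimmed copy of the question's XML, inlined so the sketch is self-contained.
  val sample: Elem = XML.loadString(
    """<publisher>
      |  <book _id="0">
      |    <market _id="0" type="A"><price>45.95</price></market>
      |    <market _id="0" type="B"><price>55.95</price></market>
      |  </book>
      |  <book _id="1">
      |    <market _id="1" type="A"><price>145.95</price></market>
      |    <market _id="1" type="B"><price>155.95</price></market>
      |  </book>
      |</publisher>""".stripMargin)

  // Hypothetical helper: all prices whose <market> matches both attributes.
  def prices(root: Node, id: String, marketType: String): Seq[String] =
    for {
      book   <- root \ "book"
      market <- book \ "market"
      // The guard keeps a market only if both attribute tests pass.
      if (market \@ "_id") == id && (market \@ "type") == marketType
      price  <- market \ "price"
    } yield price.text

  def main(args: Array[String]): Unit = {
    println(prices(sample, "1", "A").mkString(",")) // 145.95
  }
}
```

The guard desugars to a withFilter call on the market generator, so it plays the same role as the explicit filter in the other answer.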

score -1 · Accepted Answer

I used Spark with a HiveContext, and I was able to evaluate the XPath expressions that way.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object xPathReader extends App{

    System.setProperty("hadoop.home.dir","D:\\IBM\\DB\\Hadoop\\winutils")   // Path for my winutils.exe

    val sparkConf = new SparkConf().setAppName("XMLParcing").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)
    val hiveContext = new HiveContext(sc)
    val myXmlPath = "D:\\IBM\\DB\\xml"
    val xmlRDDList = XmlFileUtil.withCharset(sc, myXmlPath, "UTF-8", "publisher") //XmlFileUtil - this is a private class in scala hence I created a Java class to use it.

    import hiveContext.implicits._

    val xmlDf = xmlRDDList.toDF("tempXMLTable")
    xmlDf.registerTempTable("tempTable")

    hiveContext.sql("select xpath_string(tempXMLTable,\"/book/@_id\") as BookId, xpath_float(tempXMLTable,\"/book/market[@_id='1' and @type='B']/price\") as Price from tempTable").show()      

    /*  Output
        +------+------+
        |BookId| Price|
        +------+------+
        |     0| 55.95|
        |     1|155.95|
        +------+------+
    */
}
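
The XPath predicate used in the query above, [@_id='1' and @type='B'], can also be tried without Spark. This is only a sketch that swaps in the JDK's built-in javax.xml.xpath API, not what the code above does; the xml string is a trimmed copy of the question's document, eval is a hypothetical helper name, and the path starts at /publisher because the whole document is evaluated rather than one row per book.

```scala
import java.io.StringReader
import javax.xml.xpath.XPathFactory
import org.xml.sax.InputSource

object XPathDemo {
  // Trimmed copy of the question's XML.
  val xml: String =
    """<publisher>
      |  <book _id="0">
      |    <market _id="0" type="A"><price>45.95</price></market>
      |    <market _id="0" type="B"><price>55.95</price></market>
      |  </book>
      |  <book _id="1">
      |    <market _id="1" type="A"><price>145.95</price></market>
      |    <market _id="1" type="B"><price>155.95</price></market>
      |  </book>
      |</publisher>""".stripMargin

  // Evaluates an XPath 1.0 expression and returns the string value of the first matching node.
  def eval(expr: String): String =
    XPathFactory.newInstance().newXPath()
      .evaluate(expr, new InputSource(new StringReader(xml)))

  def main(args: Array[String]): Unit = {
    println(eval("/publisher/book/market[@_id='1' and @type='B']/price")) // 155.95
  }
}
```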
