【Big-Data】Scala

 

Scala Basics

Learning Materials

  1. scala core
  2. scala spark unit test
  3. scala underscore explained: "A summary of underscore ("_") usage in Scala" - Jianshu (jianshu.com)

How to create a maven scala project

Using IntelliJ IDEA

archetype

net.alchim31.maven:scala-archetype-simple

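If you prefer the command line to the IDEA wizard, the same archetype can be generated directly with Maven. This is only a sketch: com.example / my-scala-app are placeholder coordinates, and the archetype version to use should be checked on Maven Central.

    mvn archetype:generate \
      -DarchetypeGroupId=net.alchim31.maven \
      -DarchetypeArtifactId=scala-archetype-simple \
      -DgroupId=com.example \
      -DartifactId=my-scala-app \
      -DinteractiveMode=false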

Project dependency

Reference

Spark version vs. Scala version

Every Spark version requires a specific Scala version. When you build an application written in Scala, make sure your Scala version is compatible with the Spark version you are using.
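As an illustration, in sbt the %% operator appends the Scala binary version to the artifact name, which makes the pairing explicit; the versions below are only an example (Spark 3.3.x is published for Scala 2.12 and 2.13). In a Maven pom.xml the same pairing shows up as the artifactId suffix, e.g. spark-core_2.12.

    // build.sbt sketch: the Scala version must match the suffix Spark was built against
    scalaVersion := "2.12.17"

    // "%%" resolves to spark-core_2.12, keeping the two versions in sync
    libraryDependencies += "org.apache.spark" %% "spark-core" % "3.3.1"
    libraryDependencies += "org.apache.spark" %% "spark-sql"  % "3.3.1"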

If you change the dependencies and the IDE does not pick up the change, right-click on pom.xml and reimport the whole project; after that it should resolve correctly.


Usages

  1. Merge two arrays and deduplicate (usage sketch below)
  // Deduplicates on (id, belongsTo); when both arrays contain the same key,
  // the idType of the first occurrence (array `a` comes first) is kept.
  def mergeSiloList(a: Array[(String, String, String)], b: Array[(String, String, String)]): Array[(String, String, String)] = {
    val resultMap = scala.collection.mutable.Map[(String, String), String]()

    for ((id, idType, belongsTo) <- a ++ b) {
      val key = id -> belongsTo
      if (!resultMap.contains(key)) {
        resultMap.put(key, idType)
      }
    }
    // Rebuild the (id, idType, belongsTo) shape from the map entries
    resultMap.toArray.map { case ((id, belongsTo), idType) => (id, idType, belongsTo) }
  }
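A quick usage sketch with made-up tuples: when both arrays contain the same (id, belongsTo) pair, the idType from the first array wins.

  val a = Array(("u1", "email", "siloA"), ("u2", "phone", "siloA"))
  val b = Array(("u1", "device", "siloA"), ("u3", "email", "siloB"))

  // keeps ("u1", "email", "siloA") from `a`; result order is not guaranteed
  // because the tuples pass through a mutable Map
  mergeSiloList(a, b).foreach(println)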
  2. Compare two arrays: keep the elements of a1 whose id does not appear in a2 (usage sketch below)
  def compareTwoArray(a1: Array[(String, String, String)], a2: Array[(String, String, String)]): Array[(String, String, String)] = {
    // keep t1 only if no element of a2 shares its first field (the id)
    a1.filterNot(t1 => a2.exists(t2 => t1._1 == t2._1))
    // another way, as a per-element check:
    // attr.forall { case (id, _, _) => srcId != id }
  }
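Usage sketch with made-up tuples; only elements of a1 whose id is absent from a2 survive.

  val a1 = Array(("u1", "email", "siloA"), ("u2", "phone", "siloA"))
  val a2 = Array(("u2", "device", "siloB"))

  // prints only ("u1", "email", "siloA")
  compareTwoArray(a1, a2).foreach(println)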
  3. Convert an array of tuples to a string (a more idiomatic variant follows)
  propertyList.foldLeft("")((acc, obj) => acc + s"(${obj._1}, ${obj._2}, ${obj._3})\t")
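The same result is usually written with map plus mkString, which also avoids the trailing tab (propertyList is assumed to be a collection of 3-tuples as above):

  propertyList
    .map { case (a, b, c) => s"($a, $b, $c)" }
    .mkString("\t")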
  4. Insert a VertexRDD into Hive

    import org.apache.spark.sql.{Row, SparkSession}
    import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

    // Assuming sparkSession is already created (with Hive support, see the sketch below)

    // Convert the VertexRDD to an RDD[Row]
    val vertexRows = vertexRDD.map { case (vid, vprop) =>
      Row(vid, vprop.property1, vprop.property2) // add further properties as needed
    }

    // createDataFrame(RDD[Row], schema) needs a StructType, not just column names
    val vertexSchema = StructType(Seq(
      StructField("id", LongType, nullable = false),
      StructField("property1", StringType),   // adjust the types to match vprop
      StructField("property2", StringType)
    ))
    val vertexDF = sparkSession.createDataFrame(vertexRows, vertexSchema)

    // Write the DataFrame to Hive
    vertexDF.write.mode("overwrite").saveAsTable("database.table_name")
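saveAsTable only reaches the Hive metastore if the session was built with Hive support enabled; a minimal sketch of that setup (the app name is a placeholder):

    import org.apache.spark.sql.SparkSession

    val sparkSession = SparkSession.builder()
      .appName("vertex-to-hive")   // placeholder name
      .enableHiveSupport()         // required for saveAsTable to hit the Hive metastore
      .getOrCreate()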