Thursday, February 4, 2021

Spark-Scala-Compatibility and other notes


Downloaded Scala IDE - it came with default Scala SDK v2.12
Scala IDE build of Eclipse SDK  Build id: 4.7.0-vfinal-2017-09-29T14:34:02Z-Typesafe 

Tried to use the Spark 2.10 jars, however they were not compatible - threw compatibility errors, could not find the main method, etc.

Then downloaded Spark 1.6.1 (the version my client uses), unzipped it, and added all the jars from spark-1.6.1-bin-hadoop2.6\lib: the 3 datanucleus jars (api, core, rdbms), spark-1.6.1-yarn-shuffle.jar, and spark-assembly-1.6.1-hadoop2.6.0.jar. The spark-assembly jar alone is 183 MB - it bundles a lot of dependent jars.
I had to change the Scala compiler for the project to Scala 2.10 to make the two compatible.

Then it started running.
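A quick sanity check that surfaces such mismatches early (plain standard-library call, nothing project-specific):

// Prints the Scala version actually on the classpath,
// e.g. 2.10.x for the Spark 1.6.1 jars vs 2.12.x for the IDE's default SDK.
println(scala.util.Properties.versionString)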
=============================================

import org.apache.spark.{SparkConf, SparkContext}

// Run locally on all cores; the Hive/Cassandra settings are only applied if not already set.
val sparkConf = new SparkConf().setAppName("sparkSQLExamples").setMaster("local[*]")
    .setIfMissing("hive.execution.engine", "spark")
    .setIfMissing("spark.cassandra.connection.host", "127.0.0.1")
    .setIfMissing("spark.cassandra.connection.port", "9042")
val sparkContext = new SparkContext(sparkConf)


Spark SQLContext allows us to connect to different data sources to read or write data, but it has a limitation: when the program ends or the Spark shell is closed, all the tables we have registered against those data sources are temporary and will not be available in the next session.
This limitation is solved with HiveContext, since it uses a MetaStore to persist the metadata of those "external" tables. In our example this MetaStore is MySQL, and the configuration is included in a resource file (hive-site.xml) used by Hive.
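A minimal sketch of the difference, assuming the Spark 1.6 setup above and a hive-site.xml on the classpath (the file path and table names are made up):

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.HiveContext

// SQLContext: the registered table exists only for this session.
val sqlContext = new SQLContext(sparkContext)
sqlContext.read.json("/tmp/people.json").registerTempTable("people_tmp")   // gone when the app exits

// HiveContext: the table definition is persisted in the MetaStore (MySQL in our case),
// so it is visible again in the next session and from other applications.
val hiveContext = new HiveContext(sparkContext)
hiveContext.read.json("/tmp/people.json").write.saveAsTable("people")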

Hive External table vs Managed Table

External - the data is owned by the user: dropping the table deletes only the metadata, not the data, and partitions are managed by the user.
Managed - Hive owns the data: dropping the table deletes both the metadata and the underlying data in the warehouse directory.
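A hedged illustration of the two flavours, issued through the hiveContext above (schema, location and partition values are made up):

// Managed table: Hive owns the data; DROP TABLE removes both metadata and files.
hiveContext.sql("CREATE TABLE events_managed (id INT, payload STRING)")

// External table: only the metadata is registered; DROP TABLE leaves the files in place.
hiveContext.sql(
  """CREATE EXTERNAL TABLE events_ext (id INT, payload STRING)
    |PARTITIONED BY (dt STRING)
    |LOCATION '/data/events'""".stripMargin)

// Partitions of an external table are typically added by the user as data lands:
hiveContext.sql("ALTER TABLE events_ext ADD PARTITION (dt='2021-02-04')")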







Scala closures are functions that use one or more free variables, and whose return value depends on those variables. A free variable is any variable that is neither defined inside the function nor passed in as a parameter; it is defined outside the closure, and the function body itself carries no value for it - the value is picked up from the enclosing scope. That free variable is the only difference between a closure and a normal function.
Example:
If we define a function as shown below:

def example(a: Double) = a * p / 100   // here p is a free variable

Spark examples: UDFs, RDD and DataFrame operations all capture closures - see the sketch below.
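A runnable sketch of the closure above (the value of p is made up). In Spark, the function you pass to map/filter or wrap in a UDF is exactly such a closure, and every free variable it captures has to be serialized and shipped with the task:

object ClosureExample {
  val p: Double = 7.5                              // free variable, defined outside the function

  def example(a: Double): Double = a * p / 100     // closure: p is captured, not passed as a parameter

  def main(args: Array[String]): Unit = {
    println(example(200))                          // 15.0
  }
}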



@transient lazy val

Suppose you have a Scala object holding some data that you want to store or send around by serializing the object. It turns out the object can also perform some complex logic, and it caches the results of those calculations in its fields. While it is efficient to keep the results in memory for later lookup, it can be a bad idea to serialize those fields as well: they consume space you do not want to sacrifice, or they increase network traffic (e.g., in Spark), costing more time than simply recalculating the fields would. One could write a custom serializer for this, but let us be honest: that is not really what we want to spend our time on.

This is where the @transient lazy val pattern comes in. In Scala, a lazy val denotes a field that is only calculated when it is accessed for the first time and is then stored for future reference. @transient, on the other hand, marks a field that shall not be serialized.

Putting this together, we can now write our "recalculate rather than serialize" logic:
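A minimal sketch of the pattern, with made-up class and field names (the post below has the original version):

case class Order(items: Seq[Double]) extends Serializable {
  // @transient: the field is skipped during serialization.
  // lazy: after deserialization it starts out empty and is recalculated
  // the first time it is accessed on the receiving side.
  @transient lazy val total: Double = items.sum
}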

http://fdahms.com/2015/10/14/scala-and-the-transient-lazy-val-pattern/


So @transient lazy val, as used in a UDF's utility class, tells the execution engine: do not serialize the variable (say, the cryptoLib object) while broadcasting the to-be-UDF object (the UtilClass whose function is exposed as the UDF closure), and once that object is on the Spark executor, create the variable only once ("lazy") and reuse it for all records.
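A hedged sketch of such a UDF utility; CryptoLib here stands in for whatever heavyweight object the real code wraps:

import org.apache.spark.sql.functions.udf

// Stand-in for an expensive object we do not want to (or cannot) serialize.
class CryptoLib { def encrypt(s: String): String = s.reverse }

object UdfUtil extends Serializable {
  // @transient: not shipped when the UDF closure is serialized to the executors.
  // lazy: built once per executor JVM, on the first record that touches it, then reused.
  @transient lazy val crypto: CryptoLib = new CryptoLib

  val encrypt = udf((s: String) => crypto.encrypt(s))
}

// Usage, e.g. on a DataFrame read via the hiveContext above (column name is made up):
// df.withColumn("enc", UdfUtil.encrypt(df("payload")))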






















