Downloaded Scala IDE - it came with default Scala SDK v2.12
Scala IDE build of Eclipse SDK Build id: 4.7.0-vfinal-2017-09-29T14:34:02Z-Typesafe
Tried to use the Spark 2.10 jars, but they were not compatible - threw compatibility errors (could not find main method, etc.).
Then downloaded Spark 1.6.1 (the version my client uses), unzipped it, and added all the jars from spark-1.6.1-bin-hadoop2.6\lib: the 3 datanucleus jars (api, core, rdbms), spark-1.6.1-yarn-shuffle, and spark-assembly-1.6.1-hadoop2.6.0.jar. The spark-assembly jar is 183 MB and bundles a lot of dependent jars.
I had to change the project's Scala compiler to Scala 2.10 to make the two compatible.
Then it started running.
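For reference, a rough sbt equivalent of this manual jar setup (an assumption on my part - the project above was wired up by hand in Scala IDE, not built with sbt):

    // build.sbt - sketch only; versions chosen to match the setup described above
    scalaVersion := "2.10.6"                 // Spark 1.6.1 is built against Scala 2.10

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "1.6.1",
      "org.apache.spark" %% "spark-sql"  % "1.6.1"
    )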
=============================================
Hive External table vs Managed Table
=============================================
Scala Closures
Scala closures are functions that use one or more free variables, and whose return value depends on those variables. A free variable is any variable that is neither defined inside the function nor passed to it as a parameter; the function itself does not bind it to a value - the value comes from the enclosing scope. That free variable is what distinguishes a closure from a normal function.
Example (closures come up all over Spark: udf, rdd, dataframe operations):
If we define a function as shown below:
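(A minimal sketch - the names factor and multiplier are illustrative, not from the original notes.)

    object ClosureExample extends App {
      var factor = 3                      // free variable: defined outside the function, not a parameter

      // multiplier is a closure: i is a bound parameter, factor is free.
      // Its result depends on whatever value factor holds at call time.
      val multiplier = (i: Int) => i * factor

      println(multiplier(10))             // 30
      factor = 5
      println(multiplier(10))             // 50 - the closure sees the updated free variable
    }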
Say you have a Scala object holding some data that you want to store or send around by serializing it. It turns out the object can also perform some complex logic, and it keeps the results of those calculations in its fields. While it might be efficient to keep the calculation results in memory for later lookup, it can be a bad idea to also serialize those fields: they consume space you do not want to sacrifice, or they increase network traffic (e.g., in Spark), costing more time than simply recalculating the fields would. One could write a custom serializer for this, but let's be honest: that's not really what we want to spend our time on.
This is where the @transient lazy val pattern comes in. In Scala, a lazy val denotes a field that will only be calculated once it is accessed for the first time, and is then stored for future reference. With @transient, on the other hand, one can mark a field that shall not be serialized. Putting this together, we can write our "recalculate rather than serialize" logic:
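(A sketch of the pattern; the class and field names are illustrative, not taken from the linked post.)

    // Report is serializable, but its derived value never travels with it:
    // @transient skips the field during serialization, and lazy recomputes it
    // from rawData on first access after deserialization.
    class Report(val rawData: Seq[Int]) extends Serializable {
      @transient lazy val expensiveResult: Int = rawData.map(x => x * x).sum
    }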
http://fdahms.com/2015/10/14/scala-and-the-transient-lazy-val-pattern/
So @transient lazy val, as used in the UDF's utility class, tells the execution engine not to serialize that variable (say, the cryptoLib object) when broadcasting the to-be-UDF object (the utility class with the function exposed as the UDF closure); once the object is at the Spark executor, the field is created only once ("lazy") and reused for all records.
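For illustration, a sketch of that utility-class setup (CryptoLib and its encrypt method are hypothetical stand-ins for the real non-serializable library object):

    import org.apache.spark.sql.functions.udf

    // Hypothetical heavy, non-serializable library object
    class CryptoLib {
      def encrypt(s: String): String = s.reverse   // placeholder for the real crypto logic
    }

    // The "UtilClass" whose function is exposed as a UDF
    class EncryptUtil extends Serializable {
      // Skipped when the UDF closure is serialized (@transient) and
      // constructed exactly once per executor on first access (lazy)
      @transient lazy val cryptoLib: CryptoLib = new CryptoLib

      val encryptUdf = udf((s: String) => cryptoLib.encrypt(s))
    }

Only the lightweight EncryptUtil shell is shipped with the closure from the driver; each executor then builds its own CryptoLib the first time the UDF runs there.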