Tuesday, March 13, 2018

Spark Learn



scala> spark.version
scala> spark.readStream.format("rate").load.write // fails: write can not be called on a streaming Dataset; use writeStream
scala> import org.apache.spark.sql.streaming._ // to get Trigger
scala> import scala.concurrent.duration._ // to get .seconds for Integer
scala> spark.readStream.format("rate").load.writeStream.format("console").trigger(Trigger.ProcessingTime(10.seconds)).start
Note: by default, Spark Structured Streaming runs micro-batches as fast as possible (Trigger.ProcessingTime(0)) -- a new batch starts as soon as the previous one finishes
Note: Spark Structured Streaming has no "Streaming" tab in the Web UI
Here .start returns a StreamingQuery and actually kicks off the stream --> meaning you can't call .show on a streaming Dataset (use the console sink instead)
RateStreamSource is a streaming source that generates consecutive numbers with timestamp that can be useful for testing and PoCs.

---- Notes before March 2018----------------------

If provided paths are partition directories, please set "basePath" in the options of the data source to specify the root directory of the table. 
If there are multiple root directories, please load them separately and then union them.
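A rough sketch of the fix (paths are the placeholders from above, the parquet format is my assumption, and this assumes a live SparkSession named `spark` -- not self-contained, runnable code):

```
# Sketch: point "basePath" at the table root so partition discovery works,
# then load the non-partitioned dated dirs separately and union them.
df_partitioned = (spark.read
    .option("basePath", "/pathToMainData")
    .parquet("/pathToMainData/mktCd=<xyz>/event_mo=*"))

df_dated = spark.read.parquet("/pathToMainData/mktCd=<xyz>/<abc_date1_xyz>")

df_all = df_partitioned.union(df_dated)
```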

In my case the data path had multiple kinds of directories: one set of partition dirs and another set of dated dirs.
Meaning, there were paths like this --
/pathToMainData/mktCd=<xyz>/event_mo=<202008>/

/pathToMainData/mktCd=<xyz>/event_mo=<202009>/ 

/pathToMainData/mktCd=<xyz>/<abc_date1_xyz>/ 

 

 

Partition - repartition, coalesce and minPartitions
Set the minimum number of partitions created while reading a file via the optional minPartitions parameter of textFile:
rating_data_raw = sc.textFile("/<path_to_csv_file>.csv", minPartitions=24)
Another way to achieve this is repartition or coalesce: if you need to reduce the number of partitions use coalesce (it avoids a full shuffle), otherwise use repartition.
rating_data_raw = sc.textFile("/<path_to_csv_file>.csv").repartition(24)
rating_data_raw = rating_data_raw.coalesce(8)  # reduce partitions without a full shuffle

The scheduler splits the RDD graph into stages based on the transformations.
Narrow transformations (transformations without data movement) are grouped (pipelined) together into a single stage. That physical plan has two stages, with everything before the ShuffledRDD in the first stage.
Each stage consists of multiple tasks, one per partition.


Here is a summary of the components of execution:
  • Task: a unit of execution that runs on a single machine
  • Stage: a group of tasks, based on partitions of the input data, which will perform the same computation in parallel
  • Job: has one or more stages
  • Pipelining: collapsing of RDDs into a single stage, when RDD transformations can be computed without data movement
  • DAG: Logical graph of RDD operations
  • RDD: Parallel dataset with partitions
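The pipelining idea above can be mimicked very loosely in plain Python (this is an analogy, not Spark code): chained lazy generators let each element flow through the whole chain of narrow transformations without materializing intermediate collections, much like a single stage.

```python
# Analogy to stage pipelining: narrow transformations (map, filter)
# are chained lazily; nothing runs until an "action" forces evaluation.
data = range(10)                          # one "partition" of input data
mapped = (x * 2 for x in data)            # narrow transformation 1
filtered = (x for x in mapped if x > 5)   # narrow transformation 2
result = list(filtered)                   # "action": forces evaluation
print(result)  # [6, 8, 10, 12, 14, 16, 18]
```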


Here is how a Spark application runs:
  • A Spark application runs as independent processes, coordinated by the SparkContext object in the driver program.
  • The task scheduler launches tasks via the cluster manager; in this case it’s YARN.
  • The cluster manager assigns tasks to workers, one task per partition.
  • A task applies its unit of work to the elements in its partition, and outputs a new partition.
    • Partitions can be read from an HDFS block, HBase or other source and cached on a worker node (data does not have to be written to disk between tasks like with MapReduce).
  • Results are sent back to the driver application.

Spark Listener Events (YouTube: watch?v=mVP9sZ6K__Y)


Wednesday, January 17, 2018

My Quick Notes


Hashing

Hashing maps data of arbitrary size to a fixed-size value called a hash code. When the hash code is the same for different data (a collision), all of those entries go to the same bucket.

Used by HashMap, Hashtable etc.
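A toy sketch of the bucket idea in Python (the keys and bucket count are made up for illustration):

```python
# A hash function maps arbitrary keys to a fixed-size bucket index;
# keys with the same index end up in the same bucket.
def bucket_index(key, n_buckets=8):
    return hash(key) % n_buckets

buckets = {}
for key in ["apple", "banana", "cherry", "apple"]:
    buckets.setdefault(bucket_index(key), []).append(key)

# equal keys always hash to the same bucket
assert bucket_index("apple") == bucket_index("apple")
```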

HMAC - involves a cryptographic hash function and a secret cryptographic key. It can be used to verify both the data integrity and the authenticity of a message simultaneously.
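A minimal sketch with Python's standard hmac module (the key and messages are made up):

```python
import hmac
import hashlib

key = b"secret-key"                # shared secret (illustrative)
msg = b"important message"
tag = hmac.new(key, msg, hashlib.sha256).hexdigest()

# receiver recomputes the tag with the shared key and compares
expected = hmac.new(key, msg, hashlib.sha256).hexdigest()
assert hmac.compare_digest(tag, expected)  # integrity + authenticity OK

# a tampered message (or a wrong key) fails verification
bad = hmac.new(key, b"tampered message", hashlib.sha256).hexdigest()
assert not hmac.compare_digest(tag, bad)
```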

Though the HashMap implementation provides constant-time O(1) performance for get() and put(), that is only in the ideal case where the hash function distributes objects evenly among the buckets; with heavy collisions a bucket degenerates into a long linked list and lookups degrade to O(n).
Java 8 addresses this issue - once a bucket exceeds a threshold (8 entries), its linked list is converted into a balanced tree, bounding the worst case at O(log n).

Monday, November 11, 2013

Referring 3rd party Jars within your Jar /lib : MANIFEST.mf Class-Path

As per my current understanding, you cannot reference third-party utility Jars placed in your app's /lib folder when you export the app as a Jar.
Say yourJar.jar contains lib/log4j.jar -- your main class still cannot load the log4j classes.
http://www.velocityreviews.com/forums/t130458-executable-jar-classpath-problem.html 

The workaround for this situation: place those utility/third-party Jars in the same directory (or a subdirectory) as your Jar, and make sure that relative path is listed in the MANIFEST.mf Class-Path header.
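For example, a MANIFEST.MF along these lines (the Main-Class and jar names are taken from the build script below; entries are space-separated):

```text
Main-Class: com.sree.test.jar.MyInnerJarTesterMain
Class-Path: . lib/log4j.jar lib/ojdbc14.jar
```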



<!-- This build script builds your own Jar with the utility Jars included,
and adds all those Jars to the MANIFEST.mf Class-Path header
-->
<?xml version="1.0"?>
<project name="buildJar" default="build-jar" basedir=".">
<!-- Add all the jar files in lib folder to the class path -->
<path id="build.classpath">
<fileset dir="${basedir}/">
<include name="lib/*.jar" />
</fileset>
</path>
<!-- Copy all Jar names to the Class-Path; keep an eye on the SPACE separator -->
<pathconvert property="project.manifest.classpath" pathsep=" lib/">
<path refid="build.classpath" />
<mapper>
<chainedmapper>
<flattenmapper />
<globmapper from="*.jar" to="*.jar" />
</chainedmapper>
</mapper>
</pathconvert>
<target name="clean">
<delete dir="bin" />
<mkdir dir="bin/lib" />
<delete dir="build" />
<mkdir dir="build" />
</target>
<target name="copy-inner-jar-files">
<copy todir="bin/lib" includeemptydirs="false">
<fileset dir="lib" includes="*.jar" />
</copy>
</target>
<target name="copy-non-java-files">
<copy todir="bin" includeemptydirs="false">
<fileset dir="src" includes="**/*" />
</copy>
</target>
<target name="compile-jar-classes" depends="clean,copy-non-java-files, copy-inner-jar-files">
<javac srcdir="src" destdir="bin" classpathref="build.classpath" />
</target>
<target name="build-jar" depends="compile-jar-classes" >
<jar basedir="bin" jarfile="${basedir}/build/MyJar.jar">
<fileset dir="build" />
<manifest>
<attribute name="Main-Class" value="com.sree.test.jar.MyInnerJarTesterMain" />
<!-- <attribute name="Class-Path" value=" . lib/junit-4.10.jar lib/log4j.jar lib/ojdbc14.jar" /> -->
<attribute name="Class-Path" value=". lib/${project.manifest.classpath}" />
</manifest>
</jar>
</target>
<!-- To extract just the lib/ directory jars out to the same directory as this JAR, use the command below:
jar -xf MyJar.jar lib
-->
</project>
