Tuesday, March 13, 2018

Spark Learn



scala> spark.version
scala> spark.readStream.format("rate").load.write // fails: write can not be called on a streaming Dataset; use writeStream
scala> import org.apache.spark.sql.streaming._ // to get Trigger
scala> import scala.concurrent.duration._ // to get .seconds for Integer
scala> spark.readStream.format("rate").load.writeStream.format("console").trigger(Trigger.ProcessingTime(10.seconds)).start
Note: by default, Spark Structured Streaming runs micro-batches as fast as possible (Trigger.ProcessingTime(0)) -- a new batch starts as soon as the previous one finishes
Note: Spark Structured Streaming has no "Streaming" tab in the Web UI
Here .start returns a StreamingQuery and actually kicks off the stream --> meaning you can't call .show on a streaming Dataset (use the console sink instead)
RateStreamSource is a streaming source that generates consecutive numbers with timestamp that can be useful for testing and PoCs.

---- Notes before March 2018----------------------

If provided paths are partition directories, please set "basePath" in the options of the data source to specify the root directory of the table. 
If there are multiple root directories, please load them separately and then union them.
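A rough sketch of the fix (paths are the placeholders from above, the parquet format is my assumption, and this assumes a live SparkSession named `spark` -- not self-contained, runnable code):

```
# Sketch: point "basePath" at the table root so partition discovery works,
# then load the non-partitioned dated dirs separately and union them.
df_partitioned = (spark.read
    .option("basePath", "/pathToMainData")
    .parquet("/pathToMainData/mktCd=<xyz>/event_mo=*"))

df_dated = spark.read.parquet("/pathToMainData/mktCd=<xyz>/<abc_date1_xyz>")

df_all = df_partitioned.union(df_dated)
```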

In my case the data path had multiple kinds of directories: one set of partition dirs and another set of dated dirs.
Meaning, there were paths like this --
/pathToMainData/mktCd=<xyz>/event_mo=<202008>/

/pathToMainData/mktCd=<xyz>/event_mo=<202009>/ 

/pathToMainData/mktCd=<xyz>/<abc_date1_xyz>/ 

 

 

Partition - repartition, coalesce and minPartitions
Set the minimum number of partitions created while reading a file via the optional minPartitions parameter of textFile:
rating_data_raw = sc.textFile("/<path_to_csv_file>.csv", minPartitions=24)
Another way to achieve this is repartition or coalesce: if you need to reduce the number of partitions use coalesce (it avoids a full shuffle), otherwise use repartition.
rating_data_raw = sc.textFile("/<path_to_csv_file>.csv").repartition(24)
rating_data_raw = rating_data_raw.coalesce(8)  # reduce partitions without a full shuffle

The scheduler splits the RDD graph into stages based on the transformations.
Narrow transformations (transformations without data movement) are grouped (pipelined) together into a single stage. That physical plan has two stages, with everything before the ShuffledRDD in the first stage.
Each stage consists of multiple tasks, one per partition.


Here is a summary of the components of execution:
  • Task: a unit of execution that runs on a single machine
  • Stage: a group of tasks, based on partitions of the input data, which will perform the same computation in parallel
  • Job: has one or more stages
  • Pipelining: collapsing of RDDs into a single stage, when RDD transformations can be computed without data movement
  • DAG: Logical graph of RDD operations
  • RDD: Parallel dataset with partitions
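The pipelining idea above can be mimicked very loosely in plain Python (this is an analogy, not Spark code): chained lazy generators let each element flow through the whole chain of narrow transformations without materializing intermediate collections, much like a single stage.

```python
# Analogy to stage pipelining: narrow transformations (map, filter)
# are chained lazily; nothing runs until an "action" forces evaluation.
data = range(10)                          # one "partition" of input data
mapped = (x * 2 for x in data)            # narrow transformation 1
filtered = (x for x in mapped if x > 5)   # narrow transformation 2
result = list(filtered)                   # "action": forces evaluation
print(result)  # [6, 8, 10, 12, 14, 16, 18]
```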


Here is how a Spark application runs:
  • A Spark application runs as independent processes, coordinated by the SparkContext object in the driver program.
  • The task scheduler launches tasks via the cluster manager; in this case it’s YARN.
  • The cluster manager assigns tasks to workers, one task per partition.
  • A task applies its unit of work to the elements in its partition, and outputs a new partition.
    • Partitions can be read from an HDFS block, HBase or other source and cached on a worker node (data does not have to be written to disk between tasks like with MapReduce).
  • Results are sent back to the driver application.

Spark Listener Events (YouTube: watch?v=mVP9sZ6K__Y)


Wednesday, January 17, 2018

My Quick Notes


Hashing

Hashing maps data of arbitrary size to a fixed-size value called a hash code. When the hash code is the same for different data (a collision), all of those entries go to the same bucket.

Used by HashMap, Hashtable etc.
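A toy sketch of the bucket idea in Python (the keys and bucket count are made up for illustration):

```python
# A hash function maps arbitrary keys to a fixed-size bucket index;
# keys with the same index end up in the same bucket.
def bucket_index(key, n_buckets=8):
    return hash(key) % n_buckets

buckets = {}
for key in ["apple", "banana", "cherry", "apple"]:
    buckets.setdefault(bucket_index(key), []).append(key)

# equal keys always hash to the same bucket
assert bucket_index("apple") == bucket_index("apple")
```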

HMAC - involves a cryptographic hash function and a secret cryptographic key. It can be used to verify both the data integrity and the authenticity of a message simultaneously.
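A minimal sketch with Python's standard hmac module (the key and messages are made up):

```python
import hmac
import hashlib

key = b"secret-key"                # shared secret (illustrative)
msg = b"important message"
tag = hmac.new(key, msg, hashlib.sha256).hexdigest()

# receiver recomputes the tag with the shared key and compares
expected = hmac.new(key, msg, hashlib.sha256).hexdigest()
assert hmac.compare_digest(tag, expected)  # integrity + authenticity OK

# a tampered message (or a wrong key) fails verification
bad = hmac.new(key, b"tampered message", hashlib.sha256).hexdigest()
assert not hmac.compare_digest(tag, bad)
```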

Though the HashMap implementation provides constant-time O(1) performance for get() and put(), that is only in the ideal case where the hash function distributes objects evenly among the buckets; with heavy collisions a bucket degenerates into a long linked list and lookups degrade to O(n).
Java 8 addresses this issue - once a bucket exceeds a threshold (8 entries), its linked list is converted into a balanced tree, bounding the worst case at O(log n).

Monday, November 11, 2013

Referring 3rd party Jars within your Jar /lib : MANIFEST.mf Class-Path

As per my current understanding, you cannot reference third-party utility Jars placed in your app's /lib folder when you export the app as a Jar.
Say yourJar.jar contains lib/log4j.jar -- your main class still cannot load the log4j classes.
http://www.velocityreviews.com/forums/t130458-executable-jar-classpath-problem.html 

The workaround for this situation: place those utility/third-party Jars in the same directory (or a subdirectory) as your Jar, and make sure that relative path is listed in the MANIFEST.mf Class-Path header.
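For example, a MANIFEST.MF along these lines (the Main-Class and jar names are taken from the build script below; entries are space-separated):

```text
Main-Class: com.sree.test.jar.MyInnerJarTesterMain
Class-Path: . lib/log4j.jar lib/ojdbc14.jar
```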



<!-- This build script builds your own Jar with the utility Jars included,
and adds all those Jars to the MANIFEST.mf Class-Path header
-->
<?xml version="1.0"?>
<project name="buildJar" default="build-jar" basedir=".">
<!-- Add all the jar files in lib folder to the class path -->
<path id="build.classpath">
<fileset dir="${basedir}/">
<include name="lib/*.jar" />
</fileset>
</path>
<!-- Copy all Jar names to the Class-Path; keep an eye on the SPACE separator -->
<pathconvert property="project.manifest.classpath" pathsep=" lib/">
<path refid="build.classpath" />
<mapper>
<chainedmapper>
<flattenmapper />
<globmapper from="*.jar" to="*.jar" />
</chainedmapper>
</mapper>
</pathconvert>
<target name="clean">
<delete dir="bin" />
<mkdir dir="bin/lib" />
<delete dir="build" />
<mkdir dir="build" />
</target>
<target name="copy-inner-jar-files">
<copy todir="bin/lib" includeemptydirs="false">
<fileset dir="lib" includes="*.jar" />
</copy>
</target>
<target name="copy-non-java-files">
<copy todir="bin" includeemptydirs="false">
<fileset dir="src" includes="**/*" />
</copy>
</target>
<target name="compile-jar-classes" depends="clean,copy-non-java-files, copy-inner-jar-files">
<javac srcdir="src" destdir="bin" classpathref="build.classpath" />
</target>
<target name="build-jar" depends="compile-jar-classes" >
<jar basedir="bin" jarfile="${basedir}/build/MyJar.jar">
<fileset dir="build" />
<manifest>
<attribute name="Main-Class" value="com.sree.test.jar.MyInnerJarTesterMain" />
<!-- <attribute name="Class-Path" value=" . lib/junit-4.10.jar lib/log4j.jar lib/ojdbc14.jar" /> -->
<attribute name="Class-Path" value=". lib/${project.manifest.classpath}" />
</manifest>
</jar>
</target>
<!-- To extract just the lib/ directory jars out to the same directory as this JAR, use the command below:
jar -xf MyJar.jar lib
-->
</project>
