Thursday, February 4, 2021

Spark-Scala-Compatibility and other notes


Downloaded Scala IDE - it came with default Scala SDK v2.12
Scala IDE build of Eclipse SDK  Build id: 4.7.0-vfinal-2017-09-29T14:34:02Z-Typesafe 

Tried to use the Spark jars (built for Scala 2.10) with the default Scala 2.12 SDK - not compatible: it threw compatibility errors, could not find the main method, etc.

Then downloaded Spark 1.6.1 (which is the version my client used) - unzipped it and added all the jars from spark-1.6.1-bin-hadoop2.6\lib: the 3 datanucleus jars (api, core, rdbms), spark-1.6.1-yarn-shuffle.jar, and spark-assembly-1.6.1-hadoop2.6.0.jar - the spark-assembly jar is 183 MB and contains a lot of dependent jars.
I had to change the project's Scala compiler to Scala 2.10 to make the two compatible.

Then it started running.
=============================================

import org.apache.spark.{SparkConf, SparkContext}

// local[*] runs Spark locally with one worker thread per core;
// setIfMissing only applies when the property isn't already set elsewhere.
val sparkConf = new SparkConf().setAppName("sparkSQLExamples").setMaster("local[*]")
    .setIfMissing("hive.execution.engine", "spark")
    .setIfMissing("spark.cassandra.connection.host", "127.0.0.1")
    .setIfMissing("spark.cassandra.connection.port", "9042")
val sparkContext = new SparkContext(sparkConf)


Spark SQLContext allows us to connect to different data sources to read or write data, but it has a limitation: when the program ends or the Spark shell is closed, all the links to the data sources we have created are temporary and will not be available in the next session.
This limitation is solved with HiveContext, since it uses a MetaStore to store the information of those "external" tables. In our example, this MetaStore is MySQL. This configuration is included in a resource file (hive-site.xml) used by Hive.
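
A minimal sketch of this (Spark 1.6 HiveContext API; the table name is made up):

import org.apache.spark.sql.hive.HiveContext

// HiveContext picks up hive-site.xml from the classpath, so table
// definitions land in the configured MetaStore and survive the session.
val hiveContext = new HiveContext(sparkContext)
hiveContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
hiveContext.sql("SHOW TABLES").show()  // still listed in the next session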

Hive External table vs Managed Table

External - data owned by the user; only the metadata gets deleted on table drop, not the data; partitions are managed by the user.
Managed - Hive owns the data (stored under its warehouse directory); dropping the table deletes both data and metadata. See the DDL sketch below.
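
A hedged DDL illustration of the drop behavior (table names and the location path are invented), reusing the hiveContext from above:

// Managed: Hive owns the files under its warehouse directory.
hiveContext.sql("CREATE TABLE managed_logs (line STRING)")

// External: Hive records only metadata; the data stays at the given location.
hiveContext.sql(
  "CREATE EXTERNAL TABLE external_logs (line STRING) " +
  "LOCATION '/data/external_logs'")

hiveContext.sql("DROP TABLE managed_logs")   // removes data + metadata
hiveContext.sql("DROP TABLE external_logs")  // removes metadata only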







Scala Closures are functions that use one or more free variables, and the return value of the function depends on those variables. A free variable is defined outside the closure, is not passed in as a parameter, and is not bound to a value within the function itself. That free variable is what distinguishes a closure from a normal function.
Example:
If we define a function as shown below:

def example(a: Double) = a * p / 100   // here p is a free variable
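
A runnable sketch of that example (the values are arbitrary):

object ClosureDemo extends App {
  var p = 10.0                              // free variable, defined outside
  val example = (a: Double) => a * p / 100  // closure: p is captured, not a parameter
  println(example(500.0))                   // 50.0
  p = 20.0                                  // the closure sees the updated value
  println(example(500.0))                   // 100.0
}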

Examples in Spark: closures passed to UDFs and to RDD/DataFrame operations.



@transient lazy val

Say you have a Scala object holding some data that you want to store or send around by serializing the object. The object can also perform some complex logic, and it stores the results of these calculations in its fields. While it might be efficient to keep the calculation results in memory for later lookup, it can be a bad idea to also serialize these fields: this consumes space you do not want to sacrifice, or it increases network traffic (e.g., in Spark), costing more time than simply recalculating the fields would. One could write a custom serializer for this task, but let's be honest: that's not really the thing we want to spend our time on.

This is where the @transient lazy val pattern comes in. In Scala lazy val denotes a field that will only be calculated once it is accessed for the first time and is then stored for future reference. With @transient on the other hand one can denote a field that shall not be serialized.

Putting this together we can now write our "recalculate rather than serialize logic":
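
A minimal sketch of the pattern (class and field names invented; see the linked post below):

class Document(val text: String) extends Serializable {
  // Expensive to compute, cheap enough to recompute on demand:
  // skipped by serialization (@transient), computed once per JVM
  // on first access (lazy), then cached in memory.
  @transient lazy val wordCount: Int = text.split("\\s+").length
}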

http://fdahms.com/2015/10/14/scala-and-the-transient-lazy-val-pattern/


So @transient lazy val, as used in a UDF's utility class, tells the execution engine: don't serialize the variable (say, a cryptoLib object) while broadcasting the to-be UDF object (the utility class with the function to expose as the UDF closure), and once the object is at the Spark executor, create it only once ("lazy") for all records.
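
A hedged sketch of that usage (CryptoLib is a stand-in for whatever expensive, non-serializable object the utility class holds):

import org.apache.spark.sql.functions.udf

// Stand-in for an expensive, non-serializable library object.
class CryptoLib { def encrypt(s: String): String = s.reverse }

object EncryptUdf extends Serializable {
  // Not serialized with the closure; built once per executor JVM on first use.
  @transient lazy val crypto = new CryptoLib()
  val encrypt = udf((s: String) => crypto.encrypt(s))
}

// Usage: df.withColumn("ssn_enc", EncryptUdf.encrypt(df("ssn")))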























AWS Dashboards

 

Decide what data goes into each dashboard based on who will use the dashboard and why they will use the dashboard.

https://aws.amazon.com/builders-library/building-dashboards-for-operational-visibility/ 


A very common tendency when designing dashboards is to overestimate or underestimate the domain knowledge of the target user. 

It’s easy to build a dashboard that makes total sense to its creator. 

However, this dashboard might not provide value to users. 

We use the technique of working backwards from the customer (in this case, the dashboard users) to eliminate this risk. 

Define KPIs - Key Performance Indicators - these need to be identified with the dashboard users.

Before diving headfirst into dashboard design, sit with your end-users in order to gather requirements and define KPIs. Without doing that, you can design the most beautiful dashboard in the world, but it won’t change the way users make decisions in the long run.





Python note: Boto/Boto3 is the Amazon Web Services (AWS) SDK for Python.
It enables Python developers to create, configure, and manage AWS services.


4 Key factors about dashboard design




  • Relationship: connection between two or more variables
  • Comparison: compare two or more variables side by side
  • Composition: breaking data into separate components
  • Distribution: range and grouping of values within data

Data Pipeline - 


















Data Engineering | DW Design (Star/Snowflake) | NF | AWS Monitoring | Graphs-Charts

 

Data Engineering: Raw data to useful info

Role of Data Engineer, skills to have:

Data Modeling: Simplify a complex software design into simpler parts (break it down) - provide a visual representation

Design Schemas: Star & Snowflake

Structured & Unstructured Data

Reference link1

Big Data 4 Vs (Volume, Velocity, Variety, Veracity)

 

FSCK - File System Check - to check for files and discrepancies in files (in hdfs)

Job tracker, task tracker, Name node ports: 50030, 50060, 50070 respectively 

Hive Metastore: storage location of schema and tables (definitions, mapping) - later stored in RDBMS

spark.sql.warehouse.dir is a static configuration property that sets Hive’s hive.metastore.warehouse.dir property, i.e. the location of default database for the Hive warehouse.

Hive Collections - Array, Map, Struct, Union  - refer link

SerDe - Serializer (object to storage form) - DeSerializer (stored format to prior form)

.hiverc - for initialization

*args & **kwargs (Python - variable argument passing...)









AWS Monitoring - starts

System Status check

A system status check failure indicates a problem with the AWS systems that your instance runs on. 

Check for any outage, wait for it to resolve by itself, or stop and restart the instance

https://aws.amazon.com/premiumsupport/knowledge-center/ec2-windows-system-status-check-fail/

 Instance status check

Determine whether Amazon EC2 has detected any problems that might prevent your instances from running applications

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/monitoring-system-instance-status-check.html

Monitor your instances using CloudWatch

Collects and processes raw data from Amazon EC2 into readable, near real-time metrics. These statistics are recorded for a period of 15 months

By default, Amazon EC2 sends metric data to CloudWatch in 5-minute periods. To send metric data for your instance to CloudWatch in 1-minute periods, you can enable detailed monitoring on the instance.

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-cloudwatch.html

Amazon EventBridge (formerly called Amazon CloudWatch Events)

Serverless event bus service that makes it easy to connect your applications with data from a variety of sources. 
EventBridge delivers a stream of real-time data from your own applications, Software-as-a-Service (SaaS) applications, and AWS services and routes that data to targets such as AWS Lambda.

Events, Rules, Targets, Event Bus

https://docs.aws.amazon.com/eventbridge/latest/userguide/what-is-amazon-eventbridge.html 

CloudWatch Agent 

Collecting Metrics and Logs from Amazon EC2 Instances and On-Premises Servers with the CloudWatch Agent

https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Install-CloudWatch-Agent.html

AWS Monitoring - ends

Ad-hoc topic - EFS - Shared Access, size auto-scales to petabytes on demand.

Amazon Elastic File System (Amazon EFS) provides a simple, scalable, fully managed elastic NFS file system for use with AWS Cloud services and on-premises resources. 

It is built to scale on demand to petabytes without disrupting applications, growing and shrinking automatically as you add and remove files, eliminating the need to provision and manage capacity to accommodate growth. 

Amazon EFS is designed to provide massively parallel shared access to thousands of Amazon EC2 instances, enabling your applications to achieve high levels of aggregate throughput and IOPS with consistent low latencies. 

 

Data Warehouse design - Star Schema, Snowflake Schema (Fact & Dimension table)

    Star Schema (de-normalized dimension tables)

link1 link2

Fact tables have measures ----> dimension tables give more context to the fact table

Star


Snowflake Schema

 

Star-vs-Snowflake
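
A small Spark SQL sketch of that relationship (table and column names invented; hiveContext as in the Spark notes above):

// The fact table holds measures plus foreign keys; dimension tables add context.
val revenueByYearAndCategory = hiveContext.sql(
  """SELECT d.year, p.category, SUM(f.amount) AS revenue
    |FROM fact_sales f
    |JOIN dim_date d    ON f.date_key = d.date_key
    |JOIN dim_product p ON f.product_key = p.product_key
    |GROUP BY d.year, p.category""".stripMargin)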


Primary Key - Unique Key

Parameter | PRIMARY KEY | UNIQUE KEY
Basic | Serves as a unique identifier for each row in a table. | Uniquely identifies a row that isn't the primary key.
NULL value acceptance | Cannot accept NULL values. | Can accept one NULL value.
Number of keys per table | Only one primary key. | More than one unique key.
Index | Creates a clustered index. | Creates a non-clustered index.

NORMAL FORMS (1NF, 2NF, 3NF) - Eliminating Data Redundancy (link1)

1NF - Single column cannot have multiple values (Atomicity)
            (e.g., a multi-valued phone column gets modified into one row per value for atomicity)

2NF = 1NF + table should not have partial dependencies (no column depending on only part of a composite key)



Here the office location depends only on the department Id (only part of the key) - so split it into a separate table.

 3NF = 2NF + no transitive dependency on non-prime attributes

StudentId determines subject via subjectId --> transitive dependency (see below) 

Boyce-Codd NF (3.5NF) - every determinant must be a super key (violated when a non-prime attribute depends on a prime attribute)

 
 (Professor is a non-prime attribute that depends on the prime attribute, subject)

Solutions Architect notes

 

Database migration

AWS - DMS (homogeneous and heterogeneous source/sink)

Can migrate from RDBMS to DynamoDB, or MongoDB to DynamoDB etc.

CDC - Change Data Capture (or Continuous Data Conversion, as in AWS)

 SCT - Schema Conversion Tool (for heterogeneous migration)

RDBMS to DynamoDB migration approaches (AWS doc)

 1) Using AWS DMS 

2) Use EMR, Amazon Kinesis, and Lambda with custom scripts

 Can possibly use DataSync agent to copy data from onPrem to S3

 MySQL binlog (cdc ?)

Create a DMS instance (on EC2), define source and destination endpoints, create migration tasks

To map data to a DynamoDB target, you use a type of table-mapping rule called object-mapping

Caching on AWS           


 EMR

Amazon EMR is the industry-leading cloud big data platform for processing vast amounts of data using open source tools such as Apache Spark, Apache Hive, Apache HBase, Apache Flink, Apache Hudi, and Presto. Amazon EMR makes it easy to set up, operate, and scale your big data environments by automating time-consuming tasks like provisioning capacity and tuning clusters.

With EMR you can run petabyte-scale analysis at less than half of the cost of traditional on-premises solutions and over 3x faster than standard Apache Spark 

You can run workloads on Amazon EC2 instances, on Amazon Elastic Kubernetes Service (EKS) clusters, or on-premises using EMR on AWS Outposts.

           Master Node, Core Node (Data), Task Node (No data, optional)

MasterNode - Single Point of Failure (can set it up to save logs to S3 during cluster setup)

AWS Directory Service (like Active Directory)

Connects AWS resources with onPrem AD  (AD info below)

        ARN - Amazon Resource Name



IAM Policy JSON structure (attach it to a Role; then attach the role to an account or resource)
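
A minimal example of that structure (the bucket name and action are placeholders):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::example-bucket/*"
    }
  ]
}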




IAM Permissions Boundary - restrict access


Resource Access Manager (RAM)
SSO

SSO - Use one context to log in to another using SAML (Security Assertion Markup Language)

DNS 

Top level, 2nd level domains 

Domain Registrar - WHOIS Database - SOA (Start of Authority) Record

NS - Name Server Records


         A Record -     name to IP address


         CName (Canonical name) - resolve one domain address to another (like m.<domain>) 

A Canonical Name or CNAME record is a type of DNS record that maps an alias name to a true or canonical domain name. 

CNAME records are typically used to map a subdomain such as www or mail to the domain hosting that subdomain’s content. For example, a CNAME record can map the web address www.example.com to the actual web site for the domain example.com.
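
In zone-file form those two records might look like this (the IP is a documentation address):

example.com.       A       192.0.2.44
www.example.com.   CNAME   example.com.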

         Alias Records - map resource record set in the hosted zone to ELB, Cloud Front, S3 static website.

Routing Policies

Simple routing policy - 1 A record with multiple IP

Weighted Routing - multiple A records (IP) with different weights (healthcheck if ?)

Latency - latency to the region makes the routing decision

Failover - active/passive setup - add a health check (it's based on the public IP, which changes on restart - so make sure you update the health check or use a dedicated Elastic IP)

          GeoLocation - based on user location 

            GeoProximity  - complex rules (traffic only) - ignore

Multivalue Answer - Simple Routing with separate IPs, each with a health check

VPC



Private IP address ranges are defined by IANA
& Amazon restricts CIDR blocks larger than /16 - meaning the first 16 of 32 bits are masked (255.255.x.x)    /16 netmask
min /28 - 16 IP addresses (4 host bits)
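
Worked out for the /28 case (AWS reserves 5 addresses per subnet, as noted further below):

32 - 28 = 4 host bits  ->  2^4 = 16 addresses per subnet
16 - 5 reserved by AWS  =  11 usable addresses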

With a new VPC - what's created by default & what's not:
- by default a Route Table, NACL & SG (security group) are created

         - by default NO Subnet, no IGW.


Special note - Security Group (SG) - the default SG has an inbound rule allowing any traffic from the same SG only, and an outbound rule allowing any traffic to the Internet (outside world) - so if the subnet is public, the resource can connect to the internet. Note - SGs are stateful (NACL is not): even if the outbound rule is removed, traffic allowed inbound can still be replied to (outbound).

If you create a new SG - everything inbound is blocked - there won't be any inbound rules (add manually as needed), but outbound will be open to all.
SG can only "Allow"; there is no "Deny"/block option - NACL has one.

Can attach multiple SG to EC2/resources

Create a Subnet (it can't span multiple AZs)
 for one subnet - modify auto-assign public IP

Reserved IP addresses (5 are reserved in each subnet)


Create an Internet Gateway - and attach it to the VPC (it's HA)

Configure Route table - Routes, Subnet Association

Default route table (Main) - no public access by default; all subnets are associated with it by default. (So don't add a public route to the Main route table)


So create a new Route table and make it Public by adding a route out to the internet (0.0.0.0/0 to the IG) - and associate the subnet which needs to be public with this route table

<Always keep the Main Route table Private (by not adding a route out to the internet) and use a separate public route table>

Create one instance in the public subnet and one in the private subnet - only the public one will have a public IP

 

NACL inbound and outbound rules (default)



   ACL - Rule # increments of 100 (100, 200 .... & 101, 201 for IPv6)
   
   A new custom NACL denies everything inbound & outbound
   Rules are evaluated in ascending order of rule #,
   so put a DENY at a lower number than the ALLOW for it to take effect


Load Balancer - at least 2 public subnets are required (2 AZs for HA)
VPC Flow Logs - all traffic in/out of the VPC - stored using CloudWatch (VPC level, subnet level, network interface level)

Bastion Host - a hardened instance in a public subnet used to connect (SSH/RDP) to instances in private subnets







How to communicate to a Private instance?
NAT instances (single instance) and NAT Gateways (HA) (Network Address Translation)

     Create a NAT instance (EC2 NAT AMI), disable the source/destination check

      Then add a route in the Main route table to allow internet access via the NAT instance

       Single point of failure -- so use a NAT Gateway instead

        Create the new NAT gateway in the public subnet, create an Elastic IP (uses ephemeral ports) - then add the route



        


 


Elasticsearch - not just search, but analytics - massive scale, near-real-time, cheap (v7.9)
    
    ELK (ES, Logstash (bring data in - pipeline), Kibana (visualize))
    
    Document storage and retrieval engine (scaled Lucene engine)
   
     Document (text, JSON) - documentId, types (schema & mapping - going away), indices (inverted indices)

     Documents are hashed to separate shards (a shard is a self-contained Lucene index - kind of a mini search engine by itself)
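
     The routing is a simple hash modulo (Elasticsearch's documented default, where _routing defaults to the document id):

     shard_num = hash(_routing) % number_of_primary_shards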

      Primary & replica shards (writes go to the primary & are then replicated)

      Elasticsearch Service - Managed Service (not serverless) (avoids installing and managing ES on EC2 yourself)
   
      
instance hours will always cost..

    IoT --> ES for analysis - possible
    
     Need to choose # of master nodes
     Domains -> in ES means Cluster
     Snapshots to S3 can be set
     Login to Kibana (onPrem - internet - Kibana within the VPC) - use Cognito (create a Cognito user pool if needed)

          



Kinesis - processing via Lambda
There are several blueprints @ Lambda - search for one that converts Apache Access Logs
copy the index.js code

this converts the Apache log to JSON format

Kinesis firehose destination

Elasticsearch (APM - Application Performance Management) - analyze application logs and system metrics
Predicting trends (# of calls etc.) - via graphical representation
Anomaly detection
Data is stored as documents (like rows in an RDBMS) with fields/values
Query using the REST API
Logstash - if you want to bring data into ES and need data enrichment before that

If more NODES are added - the SHARDS are distributed evenly by ES.

ES Type/Index

Query String API - the _search API, e.g.:

GET /<index>/_search?q=pasta           (searches all fields)
GET /<index>/_search?q=<field>:pasta   (searches a specific field)

AWS - ES Domain comes with Kibana by default (if selected on setup)
set proper access to use it.

Kibana - create an index pattern - first give the domain name, then the timestamp.
Discover, Visualize, Dashboard....


Notebooks - Jupyter, Zeppelin

Athena - Glue - QuickSight
Glue - Crawler - from S3/JDBC/DynamoDB - will crawl and create a table in your database in Glue.
Schema - auto-detected if it is in the header; else edit once done.
Athena - SQL-like queries using the Glue Catalog.

RedShift
    client --> (JDBC/ODBC) --> Leader Node --> Compute Nodes (1-128), each with multiple node slices
Compute Node types -> Dense Storage / Dense Compute



DynamoDB - common usecases

SQS vs Kinesis Data Streams
SQS, Kinesis DataStream, Kinesis Firehose, SQS FIFO

IoT (Internet of Things)
   Thing Registry, Device Gateway, IoT Message Broker, IoT Rules Engine, Device Shadow
   ---> Kinesis, SQS, Lambda, DynamoDB, S3, SNS, ES, MQTT to ML model.. ...

IoT Greengrass - brings compute power (Lambda) onto the device

VPC Peering, VPC PrivateLink
If there are many VPCs, peering every pair is a big task - you'd have to manage multiple peering relations. So use PrivateLink instead - using a Network Load Balancer and an Elastic Network Interface (ENI)
Direct connect


VPC Endpoints (don't go through the internet) - Gateway endpoints (only S3 & DynamoDB) / Interface endpoints

CloudFormation - stack
create from a template or create a new template in Designer
Designer preview
Amazon Managed Service for Grafana
Powerful, interactive data visualizations for builders, operators, and business leaders