Thursday, February 4, 2021

Data Engineering | DW Design (Star/Snowflake) | NF | AWS Monitoring | Graphs-Charts

 

Data Engineering: Raw data to useful info

Role of Data Engineer, skills to have:

Data Modeling: Break a complex software design into simpler parts and provide a visual representation

Design Schemas: Star & Snowflake

Structured & Unstructured Data

Reference link1

Big Data 4 Vs

 

FSCK - File System Check - reports on the health of files in HDFS and flags discrepancies such as missing or corrupt blocks

JobTracker, TaskTracker, NameNode default web UI ports (Hadoop 1.x): 50030, 50060, 50070 respectively

Hive Metastore: stores schema and table metadata (definitions, mappings), persisted in an RDBMS

spark.sql.warehouse.dir is a static configuration property that sets Hive’s hive.metastore.warehouse.dir property, i.e. the location of default database for the Hive warehouse.

Hive Collections - Array, Map, Struct, Union  - refer link

SerDe - Serializer (object to storage form) - DeSerializer (stored format to prior form)
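As a rough analogy only (Hive SerDes are Java classes, not Python), the serialize/deserialize round trip can be sketched with Python's json module. The row and its field names are hypothetical.

```python
import json

# Hypothetical row object; the field names are illustrative only.
row = {"id": 1, "name": "alice", "tags": ["a", "b"]}

# Serialize: in-memory object -> storage form (here, a JSON string).
stored = json.dumps(row)

# Deserialize: storage form -> the prior in-memory form.
restored = json.loads(stored)

print(restored == row)  # True
```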

.hiverc - for initialization

*args & **kwargs (Python - passing a variable number of positional and keyword arguments)
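A minimal sketch of variable argument passing. Note that the keyword form takes two asterisks (**kwargs). The function name describe is made up for illustration.

```python
def describe(*args, **kwargs):
    """Return the positional and keyword arguments that were passed in."""
    # args is a tuple of positional arguments; kwargs is a dict of keyword arguments.
    return args, kwargs

positional, keyword = describe(1, 2, source="hive", limit=10)
print(positional)  # (1, 2)
print(keyword)     # {'source': 'hive', 'limit': 10}

# The same operators unpack a sequence or a dict at the call site.
values = [3, 4]
options = {"limit": 5}
print(describe(*values, **options))  # ((3, 4), {'limit': 5})
```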









AWS Monitoring - starts

System Status check

A system status check failure indicates a problem with the AWS systems that your instance runs on. 

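Check for an AWS outage, then either wait for AWS to resolve it or stop and start the instance (EBS-backed) so that it moves to a new host; for instance store-backed instances, terminate and replace the instance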

https://aws.amazon.com/premiumsupport/knowledge-center/ec2-windows-system-status-check-fail/

Instance status check

Determine whether Amazon EC2 has detected any problems that might prevent your instances from running applications

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/monitoring-system-instance-status-check.html

Monitor your instances using CloudWatch

Collects and processes raw data from Amazon EC2 into readable, near real-time metrics. These statistics are recorded for a period of 15 months

By default, Amazon EC2 sends metric data to CloudWatch in 5-minute periods. To send metric data for your instance to CloudWatch in 1-minute periods, you can enable detailed monitoring on the instance.
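The two periods map to the Period parameter (in seconds) when reading metrics back. The sketch below only builds the keyword arguments for boto3's CloudWatch get_metric_statistics call; the helper name build_cpu_query and the instance ID are hypothetical, and the actual call is left as a comment because it needs boto3 and AWS credentials.

```python
from datetime import datetime, timedelta, timezone

def build_cpu_query(instance_id, detailed_monitoring=False, hours=1):
    """Build keyword arguments for a CloudWatch get_metric_statistics call."""
    # 300 s matches basic monitoring; 60 s matches detailed monitoring.
    period = 60 if detailed_monitoring else 300
    end = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": period,
        "Statistics": ["Average"],
    }

# Placeholder instance ID for illustration.
query = build_cpu_query("i-0123456789abcdef0", detailed_monitoring=True)
print(query["Period"])  # 60

# With boto3 installed and credentials configured, the call would look like:
#   import boto3
#   boto3.client("cloudwatch").get_metric_statistics(**query)
```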

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-cloudwatch.html

Amazon EventBridge (formerly called Amazon CloudWatch Events)

Serverless event bus service that makes it easy to connect your applications with data from a variety of sources. 
EventBridge delivers a stream of real-time data from your own applications, Software-as-a-Service (SaaS) applications, and AWS services and routes that data to targets such as AWS Lambda.

Events, Rules, Targets, Event Bus
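How these four pieces fit together can be sketched with a toy, in-memory event bus. This is a simplified model, not the EventBridge API: the matcher only supports "any of these values" lists and nested dicts, and plain Python functions stand in for targets such as Lambda.

```python
def matches(pattern, event):
    """Return True if every field in the pattern matches the event.

    Simplified rule: a list in the pattern means "any of these values";
    a nested dict is matched recursively.
    """
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            return False
    return True

def put_event(event, rules):
    """Toy event bus: route the event to the targets of every matching rule."""
    for rule in rules:
        if matches(rule["pattern"], event):
            for target in rule["targets"]:
                target(event)

# A rule pairs an event pattern with the targets to invoke.
received = []
rule = {
    "pattern": {"source": ["aws.ec2"], "detail": {"state": ["stopped", "terminated"]}},
    "targets": [received.append],
}

put_event({"source": "aws.ec2", "detail": {"instance-id": "i-123", "state": "stopped"}}, [rule])
put_event({"source": "aws.ec2", "detail": {"instance-id": "i-456", "state": "running"}}, [rule])
print(len(received))  # 1
```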

https://docs.aws.amazon.com/eventbridge/latest/userguide/what-is-amazon-eventbridge.html 

CloudWatch Agent 

Collecting Metrics and Logs from Amazon EC2 Instances and On-Premises Servers with the CloudWatch Agent

https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Install-CloudWatch-Agent.html

AWS Monitoring - ends

Ad hoc topic - EFS - shared access; size auto-scales to petabytes on demand.

Amazon Elastic File System (Amazon EFS) provides a simple, scalable, fully managed elastic NFS file system for use with AWS Cloud services and on-premises resources. 

It is built to scale on demand to petabytes without disrupting applications, growing and shrinking automatically as you add and remove files, eliminating the need to provision and manage capacity to accommodate growth. 

Amazon EFS is designed to provide massively parallel shared access to thousands of Amazon EC2 instances, enabling your applications to achieve high levels of aggregate throughput and IOPS with consistent low latencies. 

 

Data Warehouse design - Star Schema, Snowflake Schema (Fact & Dimension table)

Star Schema (de-normalized dimension tables)

link1 link2

Fact tables hold measures ----> dimension tables give more context to the fact table
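A minimal star schema sketch using Python's built-in sqlite3 module. The table names, columns, and sample rows are all hypothetical; the point is that the fact table carries the measures (quantity, amount) and the dimension tables supply the context used for grouping.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id INTEGER REFERENCES dim_date(date_id),
    quantity INTEGER,
    amount REAL
);
INSERT INTO dim_product VALUES (1, 'Laptop', 'Electronics'), (2, 'Desk', 'Furniture');
INSERT INTO dim_date VALUES (10, 2021, 1), (11, 2021, 2);
INSERT INTO fact_sales VALUES (1, 10, 2, 2000.0), (1, 11, 1, 1000.0), (2, 11, 3, 600.0);
""")

# Aggregate the measure from the fact table, sliced by dimension attributes.
rows = conn.execute("""
SELECT p.category, d.month, SUM(f.amount)
FROM fact_sales f
JOIN dim_product p ON p.product_id = f.product_id
JOIN dim_date d ON d.date_id = f.date_id
GROUP BY p.category, d.month
ORDER BY p.category, d.month
""").fetchall()
print(rows)
# [('Electronics', 1, 2000.0), ('Electronics', 2, 1000.0), ('Furniture', 2, 600.0)]
```

In a snowflake variant, category would move out of dim_product into its own table, at the cost of one more join.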

Star


Snowflake Schema

 

Star-vs-Snowflake


Primary Key - Unique Key

Parameter | PRIMARY KEY | UNIQUE KEY
Basic | Serves as the unique identifier for each row in a table. | Uniquely identifies a row through a column (or columns) other than the primary key.
NULL value acceptance | Cannot accept NULL values. | Can accept NULL (one NULL value in SQL Server; several other databases allow more than one).
Number of keys that can be defined in the table | Only one primary key | More than one unique key
Index | Creates a clustered index (SQL Server default) | Creates a non-clustered index (SQL Server default)
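The uniqueness rules can be tried out with Python's built-in sqlite3 module. The employee table and the try_insert helper are hypothetical. NULL handling differs by database engine: SQLite, used here, accepts several NULLs in a UNIQUE column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE employee (
    emp_id INTEGER PRIMARY KEY,   -- only one primary key per table
    email TEXT UNIQUE,            -- a table may have several unique keys
    badge_no TEXT UNIQUE
)
""")
conn.execute("INSERT INTO employee VALUES (1, 'a@example.com', 'B1')")

def try_insert(values):
    """Return True if the row is accepted, False on a constraint violation."""
    try:
        conn.execute("INSERT INTO employee VALUES (?, ?, ?)", values)
        return True
    except sqlite3.IntegrityError:
        return False

print(try_insert((1, "b@example.com", "B2")))  # False: duplicate primary key
print(try_insert((2, "a@example.com", "B3")))  # False: duplicate unique key
print(try_insert((3, None, "B4")))             # True: NULL allowed in a unique column
print(try_insert((4, None, "B5")))             # True in SQLite: a second NULL is also accepted
```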

NORMAL FORMS (1NF, 2NF, 3NF) - Eliminating Data Redundancy (link1)

1NF - Single column cannot have multiple values (Atomicity)
            modified to below format for atomicity
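A small sketch of the 1NF rewrite, using hypothetical student rows in which one column holds a comma-separated list. Each multi-valued row is expanded into one row per value.

```python
# Hypothetical un-normalized rows: the "subjects" column holds several values.
unnormalized = [
    {"student_id": 1, "name": "Asha", "subjects": "Math, Physics"},
    {"student_id": 2, "name": "Ravi", "subjects": "Chemistry"},
]

def to_first_normal_form(rows):
    """Emit one row per (student, subject) so every column holds a single value."""
    result = []
    for row in rows:
        for subject in row["subjects"].split(","):
            result.append({
                "student_id": row["student_id"],
                "name": row["name"],
                "subject": subject.strip(),
            })
    return result

for row in to_first_normal_form(unnormalized):
    print(row)
```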

2NF = 1NF + Table should not have partial dependencies 



Here office location only depends on the department Id - so split it. 
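The split can be sketched as follows, assuming a composite key of (employee_id, department_id). The column names and sample values are hypothetical.

```python
# Hypothetical table with composite key (employee_id, department_id).
# office_location depends only on department_id: a partial dependency.
employee_department = [
    {"employee_id": 1, "department_id": "D1", "office_location": "Pune"},
    {"employee_id": 2, "department_id": "D1", "office_location": "Pune"},
    {"employee_id": 1, "department_id": "D2", "office_location": "Chennai"},
]

# Table 1: keeps the key columns (plus any attribute that depends on the whole key).
assignments = [(r["employee_id"], r["department_id"]) for r in employee_department]

# Table 2: department_id -> office_location, stored once per department.
departments = {r["department_id"]: r["office_location"] for r in employee_department}

print(assignments)  # [(1, 'D1'), (2, 'D1'), (1, 'D2')]
print(departments)  # {'D1': 'Pune', 'D2': 'Chennai'}
```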

3NF = 2NF + no transitive dependency of non-prime attributes on the key

StudentId determines subject via subjectId --> transitive dependency (see below) 
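Removing the transitive dependency can be sketched like this, assuming a table keyed by student_id. The column names and rows are hypothetical.

```python
# Hypothetical table keyed by student_id.
# student_id -> subject_id and subject_id -> subject, so subject depends on
# the key only transitively.
students = [
    {"student_id": 1, "name": "Asha", "subject_id": "S1", "subject": "Math"},
    {"student_id": 2, "name": "Ravi", "subject_id": "S1", "subject": "Math"},
    {"student_id": 3, "name": "Meena", "subject_id": "S2", "subject": "Physics"},
]

# Table 1: student attributes plus subject_id as a foreign key.
student_table = [
    {k: r[k] for k in ("student_id", "name", "subject_id")} for r in students
]

# Table 2: subject_id -> subject, stored once per subject.
subject_table = {r["subject_id"]: r["subject"] for r in students}

print(student_table[0])  # {'student_id': 1, 'name': 'Asha', 'subject_id': 'S1'}
print(subject_table)     # {'S1': 'Math', 'S2': 'Physics'}
```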

Boyce Codd NF (3.5NF) - for every functional dependency X --> Y, X must be a super key (a prime attribute must not depend on a non-prime attribute)

Professor is a non-prime attribute, yet the prime attribute subject depends on it, which violates BCNF.
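BCNF requires the left-hand side of every nontrivial functional dependency to be a super key. A minimal sketch of that check, using the standard attribute-closure computation; the relation and its dependencies are assumed for illustration, and only nontrivial dependencies should be passed in.

```python
def closure(attributes, dependencies):
    """Return the set of attributes functionally determined by `attributes`."""
    result = set(attributes)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in dependencies:
            # If we already have the whole left side, we gain the right side.
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def bcnf_violations(all_attributes, dependencies):
    """List the dependencies whose left-hand side is not a super key."""
    return [
        (lhs, rhs)
        for lhs, rhs in dependencies
        if closure(lhs, dependencies) != set(all_attributes)
    ]

# Hypothetical relation: (student_id, subject) is the key, and each professor
# teaches exactly one subject.
attributes = {"student_id", "subject", "professor"}
dependencies = [
    (("student_id", "subject"), ("professor",)),
    (("professor",), ("subject",)),
]
print(bcnf_violations(attributes, dependencies))
# [(('professor',), ('subject',))]
```

Here professor determines subject but is not a super key, so the relation would be split into (student_id, professor) and (professor, subject).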



 

 

 


 

 

 






 

 
