Data Engineering: Raw data to useful info
Role of Data Engineer, skills to have:
Data Modeling: Break a complex software design into simpler parts and provide a visual representation
Design Schemas: Star & Snowflake
Structured & Unstructured Data
Reference: link1
Big Data 4 Vs (Volume, Velocity, Variety, Veracity)
FSCK - File System Check - to check for files and discrepancies in files (in hdfs)
JobTracker, TaskTracker, NameNode web UI ports: 50030, 50060, 50070 respectively (Hadoop 1.x defaults)
Hive Metastore: stores schema and table metadata (definitions, mappings), typically kept in an RDBMS
spark.sql.warehouse.dir is a static configuration property that sets Hive’s hive.metastore.warehouse.dir
property, i.e. the location of default database for the Hive warehouse.
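Because the property is static, it has to be supplied before the session is created. A minimal PySpark sketch, assuming `pyspark` is installed; the function name `build_session` and the app name are hypothetical.

```python
def build_session(warehouse_dir: str):
    """Create a SparkSession whose default warehouse lives at warehouse_dir."""
    # Imported lazily so this sketch can be read or loaded without pyspark present.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName("warehouse-dir-demo")  # hypothetical app name
        # Static property: set it on the builder, before getOrCreate().
        .config("spark.sql.warehouse.dir", warehouse_dir)
        .getOrCreate()
    )
```

Usage would look like `spark = build_session("/tmp/spark-warehouse")`, where the path is only an example.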
Hive Collections - Array, Map, Struct, Union - refer link
SerDe - Serializer (object to storage form) - DeSerializer (stored format to prior form)
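The serialize/deserialize round trip can be sketched in plain Python. This is only an analogy built on the standard `json` module, not Hive's SerDe interface; the `row` record is made up.

```python
import json

def serialize(row: dict) -> bytes:
    """Object -> storage form (UTF-8 encoded JSON bytes)."""
    return json.dumps(row, sort_keys=True).encode("utf-8")

def deserialize(blob: bytes) -> dict:
    """Storage form -> the original object."""
    return json.loads(blob.decode("utf-8"))

row = {"id": 1, "name": "alice", "tags": ["a", "b"]}  # hypothetical record
stored = serialize(row)
restored = deserialize(stored)
print(restored == row)
```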
.hiverc - initialization file, run when the Hive CLI starts
*args & **kwargs (Python - passing a variable number of positional and keyword arguments)
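A short sketch of both uses of these symbols: collecting arguments in a function definition and unpacking them at a call site. The function names `describe` and `add` are made up for illustration.

```python
def describe(*args, **kwargs):
    """Collect positional arguments in a tuple and keyword arguments in a dict."""
    return args, kwargs

positional, keyword = describe(1, 2, mode="fast")
# positional is the tuple (1, 2); keyword is the dict {"mode": "fast"}

def add(a, b, c=0):
    return a + b + c

# The same symbols unpack a sequence or a mapping at the call site.
total = add(*[1, 2], **{"c": 3})
print(positional, keyword, total)
```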
AWS Monitoring - starts
System Status check
A system status check failure indicates a problem with the AWS systems that your instance runs on.
Check for an AWS outage, wait for AWS to resolve the issue, or stop and start the instance so that it moves to a new host
https://aws.amazon.com/premiumsupport/knowledge-center/ec2-windows-system-status-check-fail/
Instance status check
Determine whether Amazon EC2 has detected any problems that might prevent your instances from running applications
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/monitoring-system-instance-status-check.html
Monitor your instances using CloudWatch
Collects and processes raw data from Amazon EC2 into readable, near real-time metrics. These statistics are recorded for a period of 15 months
By default, Amazon EC2 sends metric data to CloudWatch in 5-minute periods. To send metric data for your instance to CloudWatch in 1-minute periods, you can enable detailed monitoring on the instance.
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-cloudwatch.html
Amazon EventBridge (formerly called Amazon CloudWatch Events)
Serverless event bus service that makes it easy to connect your applications with data from a variety of sources.
EventBridge delivers a stream of real-time data from your own applications, Software-as-a-Service (SaaS) applications, and AWS services and routes that data to targets such as AWS Lambda.
Events, Rules, Targets, Event Bus
https://docs.aws.amazon.com/eventbridge/latest/userguide/what-is-amazon-eventbridge.html
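How events, rules, targets, and the event bus relate can be sketched with a toy in-memory model. This is not the AWS SDK: the `Rule` and `EventBus` classes are hypothetical, and pattern matching is simplified to checking top-level fields against lists of accepted values.

```python
from dataclasses import dataclass, field
from typing import Callable

Event = dict

@dataclass
class Rule:
    pattern: dict  # field name -> list of accepted values
    targets: list[Callable[[Event], None]] = field(default_factory=list)

    def matches(self, event: Event) -> bool:
        # Every field named in the pattern must carry one of the accepted values.
        return all(event.get(key) in allowed for key, allowed in self.pattern.items())

@dataclass
class EventBus:
    rules: list[Rule] = field(default_factory=list)

    def put_event(self, event: Event) -> int:
        """Deliver the event to every target of every matching rule."""
        delivered = 0
        for rule in self.rules:
            if rule.matches(event):
                for target in rule.targets:
                    target(event)
                    delivered += 1
        return delivered

received = []  # stands in for a target such as a Lambda function
bus = EventBus()
bus.rules.append(Rule(pattern={"source": ["aws.ec2"]}, targets=[received.append]))

bus.put_event({"source": "aws.ec2", "detail-type": "EC2 Instance State-change Notification"})
bus.put_event({"source": "aws.s3", "detail-type": "Object Created"})
print(len(received))
```

Only the first event matches the rule, so only it reaches the target.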
CloudWatch Agent
Collecting Metrics and Logs from Amazon EC2 Instances and On-Premises Servers with the CloudWatch Agent
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Install-CloudWatch-Agent.html
AWS Monitoring - ends
Ad hoc topic - EFS - shared access, size auto-scales to petabytes on demand.
Amazon Elastic File System (Amazon EFS) provides a simple, scalable, fully managed elastic NFS file system for use with AWS Cloud services and on-premises resources.
It is built to scale on demand to petabytes without disrupting applications, growing and shrinking automatically as you add and remove files, eliminating the need to provision and manage capacity to accommodate growth.
Amazon EFS is designed to provide massively parallel shared access to thousands of Amazon EC2 instances, enabling your applications to achieve high levels of aggregate throughput and IOPS with consistent low latencies.
Data Warehouse design - Star Schema, Snowflake Schema (Fact & Dimension table)
Star Schema (de-normalized dimension tables)
Reference: link1, link2
Fact tables hold the measures; dimension tables give more context to the fact table
[Figure: Star schema]
[Figure: Star vs. Snowflake schema]
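A small star schema can be sketched with Python's built-in `sqlite3` module: one fact table of measures joined to two de-normalized dimension tables. All table names, column names, and rows here are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
-- The fact table stores measures plus a foreign key to each dimension.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id INTEGER REFERENCES dim_date(date_id),
    quantity INTEGER,
    amount REAL
);
INSERT INTO dim_product VALUES (1, 'Pen', 'Stationery'), (2, 'Lamp', 'Home');
INSERT INTO dim_date VALUES (10, 2024, 1), (11, 2024, 2);
INSERT INTO fact_sales VALUES (1, 10, 3, 6.0), (1, 11, 2, 4.0), (2, 10, 1, 20.0);
""")

# Dimensions supply the context (category, year) used to slice the measures.
rows = conn.execute("""
SELECT p.category, d.year, SUM(f.amount) AS total_amount
FROM fact_sales AS f
JOIN dim_product AS p ON p.product_id = f.product_id
JOIN dim_date AS d ON d.date_id = f.date_id
GROUP BY p.category, d.year
ORDER BY p.category
""").fetchall()
print(rows)
```

In a snowflake schema, `category` would move out of `dim_product` into its own table, adding one more join.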
Primary Key - Unique Key
| Parameter | PRIMARY KEY | UNIQUE KEY |
|---|---|---|
| Basic | Serves as the unique identifier for each row in a table. | Uniquely identifies a row through a column that is not the primary key. |
| NULL value acceptance | Cannot accept NULL values. | Can accept one NULL value (SQL Server; most other databases allow several). |
| Number of keys that can be defined in the table | Only one primary key | More than one unique key |
| Index | Creates a clustered index by default (SQL Server) | Creates a non-clustered index by default (SQL Server) |
NORMAL FORMS (1NF, 2NF, 3NF) - Eliminating Data Redundancy (link1)
1NF - Single column cannot have multiple values (Atomicity)
modified to below format for atomicity
2NF = 1NF + Table should not have partial dependencies
Here office location only depends on the department Id - so split it.
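The split can be sketched with plain Python data. The column names (`emp_id`, `dept_id`, `office_location`), the composite key (`emp_id`, `dept_id`), and the rows are assumptions made for illustration.

```python
# Hypothetical rows: composite key (emp_id, dept_id); office_location depends on dept_id only.
rows = [
    {"emp_id": 1, "dept_id": "D1", "office_location": "Pune"},
    {"emp_id": 2, "dept_id": "D1", "office_location": "Pune"},
    {"emp_id": 3, "dept_id": "D2", "office_location": "Delhi"},
]

# Table 1 keeps only the key columns.
employee_department = [{"emp_id": r["emp_id"], "dept_id": r["dept_id"]} for r in rows]

# Table 2 stores each department's office location exactly once.
department = {r["dept_id"]: r["office_location"] for r in rows}

# Joining the two tables reproduces the original rows, so no information is lost.
rejoined = [{**ed, "office_location": department[ed["dept_id"]]} for ed in employee_department]
print(department)
print(rejoined == rows)
```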
3NF = 2NF + no non-prime attribute transitively depends on the primary key
StudentId determines subject via subjectId --> transitive dependency (see below)
Boyce-Codd NF (BCNF, 3.5NF) = 3NF + for every functional dependency X -> Y, X must be a super key
Professor is a non-prime attribute that determines the prime attribute Subject, but Professor is not a super key, so the table violates BCNF
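The BCNF condition can be checked on sample data with two small helpers. This sketch assumes the usual form of the example, where each professor teaches one subject and the key is (`student_id`, `subject`); the helper names and rows are hypothetical. A check on sample rows can refute a dependency but cannot prove that it holds for the schema in general.

```python
def holds(rows, determinant, dependent):
    """Return True if the functional dependency determinant -> dependent holds in rows."""
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in determinant)
        value = tuple(row[c] for c in dependent)
        # The same determinant value must always map to the same dependent value.
        if seen.setdefault(key, value) != value:
            return False
    return True

def is_superkey(rows, columns):
    """Return True if no two rows in this sample share the same values for columns."""
    keys = [tuple(row[c] for c in columns) for row in rows]
    return len(set(keys)) == len(keys)

# Hypothetical sample data.
rows = [
    {"student_id": 101, "subject": "Java", "professor": "P.Java"},
    {"student_id": 101, "subject": "C++", "professor": "P.Cpp"},
    {"student_id": 102, "subject": "Java", "professor": "P.Java2"},
    {"student_id": 103, "subject": "Java", "professor": "P.Java"},
]

fd_holds = holds(rows, ["professor"], ["subject"])
prof_is_superkey = is_superkey(rows, ["professor"])
# BCNF is violated when a dependency holds but its left side is not a super key.
violates_bcnf = fd_holds and not prof_is_superkey
print(violates_bcnf)
```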