dw-bq-migration-overview MUST Read
https://cloud.google.com/solutions/migration/dw2bq/dw-bq-migration-overview
ARRAY_AGG, ARRAY_LENGTH
You can do some pretty useful things with arrays, like:
- finding the number of elements with ARRAY_LENGTH(<array>)
- deduplicating elements with ARRAY_AGG(DISTINCT <field>)
- ordering elements with ARRAY_AGG(<field> ORDER BY <field>)
- limiting with ARRAY_AGG(<field> LIMIT 5)
SELECT
  fullVisitorId,
  date,
  ARRAY_AGG(DISTINCT v2ProductName) AS products_viewed,
  ARRAY_LENGTH(ARRAY_AGG(DISTINCT v2ProductName)) AS distinct_products_viewed,
  ARRAY_AGG(DISTINCT pageTitle) AS pages_viewed,
  ARRAY_LENGTH(ARRAY_AGG(DISTINCT pageTitle)) AS distinct_pages_viewed
FROM `data-to-insights.ecommerce.all_sessions`
WHERE visitId = 1501570398
GROUP BY fullVisitorId, date
ORDER BY date
Querying datasets that already have ARRAYs
SELECT visitId, hits.page.pageTitle
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801`
WHERE visitId = 1501570398
You will get an error: Cannot access field page on a value with type ARRAY<STRUCT<hitNumber INT64, time INT64, hour INT64, ...>> at [3:8]
Before you can query REPEATED fields (arrays) normally, you must first break the arrays back into rows.
How do you do that with SQL?
Answer: Use the UNNEST() function on your array field:
SELECT DISTINCT
visitId,
h.page.pageTitle
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801`,
UNNEST(hits) AS h
WHERE visitId = 1501570398
LIMIT 10
- You need to UNNEST() arrays to bring the array elements back into rows
- UNNEST() always follows the table name in your FROM clause (think of it conceptually like a pre-joined table)
STRUCTs
The easiest way to think about a STRUCT is to consider it conceptually like a separate table that is already pre-joined into your main table. A STRUCT can have another STRUCT as one of its fields (you can nest STRUCTs)
A STRUCT can have:
- one or many fields in it
- the same or different data types for each field
- its own alias
SELECT visitId, totals.*, device.*
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801`
WHERE visitId = 1501570398
LIMIT 10
#standardSQL
SELECT race, participants.name
FROM racing.race_results
CROSS JOIN race_results.participants # this is the STRUCT (it is like a table within a table)
The query below gives you the same result:
#standardSQL
SELECT race, participants.name
FROM racing.race_results AS r, r.participants
If you have more than one race type (800M, 100M, 200M), wouldn't a CROSS JOIN just associate every racer name with every possible race like a cartesian product?
Answer: No. This is a correlated cross join which only unpacks the elements associated with a single row. For a greater discussion, see working with ARRAYs and STRUCTs
#standardSQL
SELECT COUNT(p.name) AS racer_count
FROM racing.race_results AS r, UNNEST(r.participants) AS p
QUANTILES & APPROX_QUANTILES
Quantiles (also called fractiles) divide an ordered dataset into equal-sized groups; the 100-quantile is the percentile.
Percentile -> divide the set into 100 parts.
SELECT APPROX_QUANTILES(x, 2) AS approx_quantiles
FROM UNNEST([1, 1, 1, 4, 5, 6, 7, 8, 9, 10]) AS x;
--> [1, 5, 10]
Approximate functions in BigQuery (link)
SELECT FORMAT("%T", APPROX_QUANTILES(DISTINCT x, 2 RESPECT NULLS)) AS approx_quantiles
FROM UNNEST([NULL, NULL, 1, 1, 1, 4, 5, 6, 7, 8, 9, 10]) AS x;
+------------------+
| approx_quantiles |
+------------------+
| [NULL, 6, 10] |
+------------------+
SELECT APPROX_QUANTILES(x, 4) AS output
FROM UNNEST(GENERATE_ARRAY(1, 100)) AS x;
--> [1, 25, 50, 75, 100]

SELECT APPROX_QUANTILES(x, 100) AS output
FROM UNNEST(GENERATE_ARRAY(1, 200)) AS x;
--> [1, 2, 4, 6, 8, 10, 12, ....., 198, 200]

SELECT APPROX_QUANTILES(x, 100)[OFFSET(5)] AS output
FROM UNNEST(GENERATE_ARRAY(1, 200)) AS x;
--> 10
Approximate aggregate functions are scalable in terms of memory usage and time, but produce approximate results instead of exact results. These functions typically require less memory than exact aggregation functions like COUNT(DISTINCT ...)
Read more here --> BigQuery Functions and Operators
SAFE. prefix
If you begin a function with the SAFE. prefix, it will return NULL instead of an error. The SAFE. prefix only prevents errors from the prefixed function itself.
SELECT SAFE.SUBSTR('foo', 0, -2) AS safe_output UNION ALL
SELECT SAFE.SUBSTR('bar', 0, 2) AS safe_output;
+-------------+
| safe_output |
+-------------+
| NULL |
| ba |
+-------------+
BigQuery- Query Cache
Cache results of previous queries
The BigQuery service automatically caches query results in a temporary table. If the identical query is submitted within approximately 24 hours, the results are served from this temporary table without any recomputation. Cached results are extremely fast and do not incur charges.
There are, however, a few caveats to be aware of. Query caching is based on exact string comparison, so even a difference in whitespace can cause a cache miss.
Queries are never cached if:
- they exhibit non-deterministic behavior (for example, they use CURRENT_TIMESTAMP or RAND),
- the table or view being queried has changed (even if the columns/rows of interest to the query are unchanged),
- the table is associated with a streaming buffer (even if there are no new rows),
- the query uses DML statements, or
- the query references external data sources.
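As a small illustration of the exact-string rule (using a table from BigQuery's public samples): the two statements below differ only in whitespace, so the second would likely not be served from the first one's cached result.

```sql
-- Runs and populates the query cache:
SELECT COUNT(*) FROM `bigquery-public-data.samples.shakespeare`;

-- Textually different (extra space after SELECT), so it misses the cache:
SELECT  COUNT(*) FROM `bigquery-public-data.samples.shakespeare`;
```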
WITH CLAUSE (Common Table Expression)
WITH clause (also called a Common Table Expression) improves readability but does not improve query speed or cost since results are not cached.
The same holds for views and subqueries as well.
If used frequently, one way to potentially improve performance is to store the result in a table (or materialized view).
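A minimal sketch of this trade-off (the destination dataset/table names are hypothetical; the source table is the one used earlier in these notes): the WITH clause is re-evaluated on every run, while CREATE TABLE AS materializes the result once for frequent reuse.

```sql
-- CTE: readable, but recomputed on every run (not cached)
WITH top_products AS (
  SELECT v2ProductName, COUNT(*) AS views
  FROM `data-to-insights.ecommerce.all_sessions`
  GROUP BY v2ProductName
)
SELECT * FROM top_products ORDER BY views DESC LIMIT 10;

-- Alternative for frequent reuse: store the result once
CREATE TABLE mydataset.top_products AS
SELECT v2ProductName, COUNT(*) AS views
FROM `data-to-insights.ecommerce.all_sessions`
GROUP BY v2ProductName;
```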
BIG QUERY - VERY RELEVANT INFO
https://googlecourses.qwiklabs.com/course_sessions/107473/labs/25818
_TABLE_SUFFIX (WildCard table reference)
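For example, against the public Google Analytics sample (which is sharded by date), a wildcard table reference plus _TABLE_SUFFIX restricts the scan to just the shards you need:

```sql
SELECT COUNT(*) AS sessions
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_2017*`
WHERE _TABLE_SUFFIX BETWEEN '0701' AND '0714';
```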
BQML - Big Query Machine Learning
EVALUATE
SELECT * FROM ML.EVALUATE(MODEL `<dataset>.<model_name>`)
or
SELECT * FROM ML.EVALUATE(MODEL `<dataset>.<model_name>`, (<SQL query to provide input sample data>))

SELECT * FROM ML.PREDICT(MODEL bike_model.model_bucketized,
  (SELECT 'Park Lane , Hyde Park' AS start_station_name,
          CURRENT_TIMESTAMP() AS start_date))
MODEL WEIGHTS
A linear regression model predicts the output as a weighted sum of its inputs. Oftentimes, the weights of the model need to be utilized in a production environment.
SELECT * FROM ML.WEIGHTS(MODEL bike_model.model_bucketized)
BQ Create, Load commands
bq load --source_format=CSV --location=EU --autodetect \
  movies.movielens_movies_raw gs://dataeng-movielens/movies.csv
To replace a formatted string (for example, turning a |-separated string into an array):
SELECT * REPLACE(SPLIT(genres, "|") AS genres)
FROM movies.movielens_movies_raw WHERE movieId < 5
Collaborative Filtering
Matrix factorization is a collaborative filtering technique that relies on two vectors called the user factors and the item factors. The user factors is a low-dimensional representation of a user_id, and the item factors similarly represents an item_id. To perform a matrix factorization of our data, we use the typical BigQuery ML syntax except that the model_type is matrix_factorization, and we have to identify which columns play what roles in the collaborative filtering setup.
CREATE OR REPLACE MODEL movies.movie_recommender
OPTIONS (model_type='matrix_factorization', user_col='userId', item_col='movieId', rating_col='rating', l2_reg=0.2, num_factors=16) AS
SELECT userId, movieId, rating
FROM movies.movielens_ratings
ML.PREDICT
SELECT * FROM ML.PREDICT(MODEL `cloud-training-prod-bucket.movies.movie_recommender`,
  (SELECT movieId, title, 903 AS userId
   FROM `movies.movielens_movies`, UNNEST(genres) g
   WHERE g = 'Comedy'))
ORDER BY predicted_rating DESC
LIMIT 5
Ref: https://googlecourses.qwiklabs.com/course_sessions/109507/labs/12073
Materialized VIEW - Automatic refresh (can turn on/off)
By default, materialized views are automatically refreshed within 5 minutes of a change to the base table. Examples of changes include row insertions or row deletions.
Automatic refresh can be enabled or disabled at any time.
To turn automatic refresh off when you create a materialized view, set enable_refresh to false.
CREATE MATERIALIZED VIEW project-id.my_dataset.my_mv_table
PARTITION BY RANGE_BUCKET(column, buckets)
OPTIONS (enable_refresh = false)
AS SELECT ...

For an existing materialized view, you can modify the enable_refresh value using ALTER MATERIALIZED VIEW.
ALTER MATERIALIZED VIEW project-id.my_dataset.my_mv_table
SET OPTIONS (enable_refresh = true)
Query Data - options, Federated Query...
- SQL GUI
- bq command-line tool
- Storage API --> Spark, TensorFlow, Dataflow, Pandas, scikit-learn
Controlling access to DataSets
You can apply access controls during dataset creation by calling the datasets.insert API method. Access controls can't be applied during dataset creation in the Cloud Console or the bq command-line tool.
You can apply access controls to a dataset after it is created in the following ways:
- Using the Cloud Console.
- Using the bq update command in the bq command-line tool.
- Calling the datasets.patch API method.
- Using the client libraries.
Dataset Sharing - option in UI [share dataset]
Partitioning
https://cloud.google.com/bigquery/docs/querying-partitioned-tables
By ingestion time: when you create an ingestion-time partitioned table, two pseudo columns are added to the table: a _PARTITIONTIME pseudo column and a _PARTITIONDATE pseudo column. When you query data in ingestion-time partitioned tables, you reference specific partitions by specifying the values in the _PARTITIONTIME or _PARTITIONDATE pseudo columns. For example:
_PARTITIONTIME >= "2018-01-29 00:00:00" AND _PARTITIONTIME < "2018-01-30 00:00:00"
_PARTITIONTIME BETWEEN TIMESTAMP('2016-01-01') AND TIMESTAMP('2016-01-02')
Limiting the partitions queried using pseudo columns:
SELECT column
FROM dataset.table
WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2016-01-01')
  AND TIMESTAMP('2016-01-02')
Partition by date/time column: no pseudo columns here.
Special partitions: __NULL__ (if the value is null) and __UNPARTITIONED__ (if the value is outside the allowed range).
Integer Range Partitioning
- Using the Cloud Console
- Using a DDL CREATE TABLE statement with a PARTITION BY RANGE_BUCKET clause that contains a partition expression:
RANGE_BUCKET (
<integer_column>,
GENERATE_ARRAY(<beginning>, <end + 1>, <interval_length>)
)
- Or, with the bq command-line tool, use --range_partitioning=pickup_location_id,0,300,5
Argument       Value
column name    customer_id
start          0
end            100
interval       10
The table will be partitioned on the customer_id column into ranges of interval 10. The values 0 to 9 will be in one partition, values 10 to 19 in another partition, ..., and finally values 90 to 99 will be in the last partition. Values outside of 0 to 99 (such as -1 or 100) will be in the __UNPARTITIONED__ partition. Null values will be in the __NULL__ partition.
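The argument table above corresponds to a DDL statement like the following sketch (the dataset, table, and extra column names are hypothetical):

```sql
CREATE TABLE mydataset.customers (
  customer_id INT64,
  name STRING
)
PARTITION BY RANGE_BUCKET(customer_id, GENERATE_ARRAY(0, 100, 10));
```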
Sharding & partitioning
Partitioning is preferred - less metadata, better performance, fewer permission checks.
Sharding - create a separate table for each day/hour, named like <TABLE_NAME_PREFIX>_<YYYYMMDD>.
When querying a date-sharded table, you only include the table(s) that you need. You can use either a UNION ALL or a wildcard table format:
SELECT *
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_201707*`
WHERE _TABLE_SUFFIX BETWEEN '01' AND '14'
Use union in queries to scan multiple tables to get the result.
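For instance, scanning two shards of the public Google Analytics sample can also be written as an explicit UNION ALL over the individual date-sharded tables:

```sql
SELECT * FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170701`
UNION ALL
SELECT * FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170702`;
```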
Clustering - to keep like data together - for example, if you often filter on a text column.
CREATE TABLE <my_partitioned_table>
PARTITION BY DATE(creation_date)
CLUSTER BY deviceName
AS SELECT * FROM <non_partitioned_table>
Data Analytics - Descriptive, Predictive, Prescriptive