Thursday, January 7, 2021

BigQuery: My Reference Notes


dw-bq-migration-overview MUST Read

https://cloud.google.com/solutions/migration/dw2bq/dw-bq-migration-overview

ARRAY_AGG, ARRAY_LENGTH 

You can do some pretty useful things with arrays like:

  • finding the number of elements with ARRAY_LENGTH(<array>)

  • deduplicating elements with ARRAY_AGG(DISTINCT <field>)

  • ordering elements with ARRAY_AGG(<field> ORDER BY <field>)

  • limiting the number of elements with ARRAY_AGG(<field> LIMIT 5)

SELECT
  fullVisitorId,
  date,
  ARRAY_AGG(DISTINCT v2ProductName) AS products_viewed,
  ARRAY_LENGTH(ARRAY_AGG(DISTINCT v2ProductName)) AS distinct_products_viewed,
  ARRAY_AGG(DISTINCT pageTitle) AS pages_viewed,
  ARRAY_LENGTH(ARRAY_AGG(DISTINCT pageTitle)) AS distinct_pages_viewed
FROM `data-to-insights.ecommerce.all_sessions`
WHERE visitId = 1501570398
GROUP BY fullVisitorId, date
ORDER BY date


Querying datasets that already have ARRAYs

In a BigQuery schema, an ARRAY field is shown with a REPEATED mode.

SELECT visitId, hits.page.pageTitle
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801`
WHERE visitId = 1501570398

You will get an error: Cannot access field page on a value with type ARRAY<STRUCT<hitNumber INT64, time INT64, hour INT64, ...>> at [3:8]

Before you can query REPEATED fields (arrays) normally, you must first break the arrays back into rows.

How do you do that with SQL?

Answer: Use the UNNEST() function on your array field:

SELECT DISTINCT
  visitId,
  h.page.pageTitle
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801`,
UNNEST(hits) AS h
WHERE visitId = 1501570398
LIMIT 10

  • You need to UNNEST() arrays to bring the array elements back into rows
  • UNNEST() always follows the table name in your FROM clause (think of it conceptually like a pre-joined table)

STRUCTs

The easiest way to think about a STRUCT is to consider it conceptually like a separate table that is already pre-joined into your main table. A STRUCT can have another STRUCT as one of its fields (you can nest STRUCTs)

A STRUCT can have:

    • one or many fields in it
    • the same or different data types for each field
    • its own alias

SELECT
  visitId,
  totals.*,
  device.*
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801`
WHERE visitId = 1501570398
LIMIT 10



#standardSQL
SELECT race, participants.name
FROM racing.race_results
CROSS JOIN race_results.participants  # this is the STRUCT (it is like a table within a table)

The query below gives the same result:

#standardSQL
SELECT race, participants.name
FROM racing.race_results AS r, r.participants

Q: If you have more than one race type (800M, 100M, 200M), wouldn't a CROSS JOIN just associate every racer name with every possible race, like a Cartesian product?

Answer: No. This is a correlated cross join, which only unpacks the elements associated with a single row. For a deeper discussion, see Working with ARRAYs and STRUCTs.

#standardSQL
SELECT COUNT(p.name) AS racer_count
FROM racing.race_results AS r, UNNEST(r.participants) AS p


QUANTILES & APPROX_QUANTILES

Quantiles (fractiles) divide a sorted dataset into equal-sized groups; the 100-quantile is the percentile.

Percentile -> divides the set into 100 parts.

SELECT APPROX_QUANTILES(x, 2) AS approx_quantiles
FROM UNNEST([1, 1, 1, 4, 5, 6, 7, 8, 9, 10]) AS x;

--> [1, 5, 10]


SELECT APPROX_QUANTILES(x, 4) AS approx_quantiles
FROM UNNEST([1, 1, 1, 4, 5, 6, 7, 8, 9, 10]) AS x;

--> [1, 1, 5, 8, 10]

Approximate functions in BigQuery (link)

SELECT FORMAT("%T", APPROX_QUANTILES(DISTINCT x, 2 RESPECT NULLS)) AS approx_quantiles
FROM UNNEST([NULL, NULL, 1, 1, 1, 4, 5, 6, 7, 8, 9, 10]) AS x;

+------------------+
| approx_quantiles |
+------------------+
| [NULL, 6, 10]    |
+------------------+
SELECT APPROX_QUANTILES(x, 4) AS output
FROM UNNEST(GENERATE_ARRAY(1, 100)) AS x;

--> [1, 25, 50, 75, 100]

SELECT APPROX_QUANTILES(x, 100) AS output
FROM UNNEST(GENERATE_ARRAY(1, 200)) AS x;
--> [1, 2, 4, 6, 8, 10, 12, ....., 198, 200]

SELECT APPROX_QUANTILES(x, 100)[OFFSET(5)] AS output
FROM UNNEST(GENERATE_ARRAY(1, 200)) AS x;

--> 10

Approximate aggregate functions are scalable in terms of memory usage and time, but produce approximate rather than exact results. They typically require much less memory than exact aggregation functions like COUNT(DISTINCT ...).
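
For example, a quick side-by-side on generated data (a minimal sketch):

SELECT
  COUNT(DISTINCT x) AS exact_count,         -- exact, memory-heavy at scale
  APPROX_COUNT_DISTINCT(x) AS approx_count  -- approximate, scalable
FROM UNNEST(GENERATE_ARRAY(1, 1000)) AS x;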

----------------------------------------------------------------------------------------------

Read more here --> BigQuery Functions and Operators

SAFE. prefix

If you begin a function with the SAFE. prefix, it will return NULL instead of raising an error. The SAFE. prefix only prevents errors from the prefixed function itself.

SELECT SAFE.SUBSTR('foo', 0, -2) AS safe_output UNION ALL
SELECT SAFE.SUBSTR('bar', 0, 2) AS safe_output;

+-------------+
| safe_output |
+-------------+
| NULL        |
| ba          |
+-------------+
 
If the SAFE. prefix is not used, the query fails with the error "Third argument in SUBSTR() cannot be negative".


BigQuery- Query Cache

Cache results of previous queries

The BigQuery service automatically caches query results in a temporary table. If the identical query is submitted within approximately 24 hours, the results are served from this temporary table without any recomputation. Cached results are extremely fast and do not incur charges.

There are, however, a few caveats. Query caching is based on exact string comparison, so even a difference in whitespace causes a cache miss.

Queries are never cached if they:

  • exhibit non-deterministic behavior (for example, they use CURRENT_TIMESTAMP or RAND)
  • query a table or view that has changed (even if the columns/rows of interest are unchanged)
  • query a table with an active streaming buffer (even if there are no new rows)
  • use DML statements
  • query external data sources
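
For example, this query (a sketch against a public sample table) is never served from cache, because RAND() is non-deterministic:

SELECT word
FROM `bigquery-public-data.samples.shakespeare`
WHERE RAND() < 0.01;  -- non-deterministic predicate disables caching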

WITH CLAUSE (Common Table Expression)

WITH clause (also called a Common Table Expression) improves readability but does not improve query speed or cost since results are not cached. 

The same holds for views and subqueries.

If a query is used frequently, one way to improve performance is to store its result in a table (or a materialized view), as sketched below.
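
A minimal sketch of the trade-off, assuming a hypothetical my_dataset.events table:

-- Recomputed on every run; the CTE result is not cached:
WITH daily AS (
  SELECT DATE(event_ts) AS day, COUNT(*) AS cnt
  FROM my_dataset.events   -- hypothetical table
  GROUP BY day
)
SELECT * FROM daily WHERE cnt > 100;

-- Computed once, then cheap to re-query:
CREATE OR REPLACE TABLE my_dataset.daily_counts AS
SELECT DATE(event_ts) AS day, COUNT(*) AS cnt
FROM my_dataset.events
GROUP BY day;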

BIG QUERY - VERY RELEVANT INFO

https://googlecourses.qwiklabs.com/course_sessions/107473/labs/25818 

_TABLE_SUFFIX (WildCard table reference)

#standardSQL
 CREATE OR REPLACE TABLE ecommerce.days_with_rain
 PARTITION BY date
 OPTIONS (
   partition_expiration_days=60,
   description="weather stations with precipitation, partitioned by day"
 ) AS
 SELECT
   DATE(CAST(year AS INT64), CAST(mo AS INT64), CAST(da AS INT64)) AS date,
   (SELECT ANY_VALUE(name) FROM `bigquery-public-data.noaa_gsod.stations` AS stations
    WHERE stations.usaf = stn) AS station_name,  -- Stations may have multiple names
   prcp
 FROM `bigquery-public-data.noaa_gsod.gsod*` AS weather
 WHERE prcp < 99.9  -- Filter unknown values
   AND prcp > 0      -- Filter out days with no precipitation
   AND CAST(_TABLE_SUFFIX AS int64) >= 2017
   AND CAST(_TABLE_SUFFIX AS int64) <= 2019


BQML - Big Query Machine Learning 

CREATE OR REPLACE MODEL bike_model.model_bucketized
TRANSFORM(
  * EXCEPT(start_date),
  IF(EXTRACT(DAYOFWEEK FROM start_date) BETWEEN 2 AND 6,
     'weekday', 'weekend') AS dayofweek,
  ML.BUCKETIZE(EXTRACT(HOUR FROM start_date), [5, 10, 17]) AS hourofday)
OPTIONS(input_label_cols=['duration'],
        model_type='linear_reg') AS
SELECT
  duration,
  start_station_name,
  start_date
FROM `bigquery-public-data`.london_bicycles.cycle_hire

EVALUATE

SELECT * FROM ML.EVALUATE(MODEL <dataset>.<model_name>), or
SELECT * FROM ML.EVALUATE(MODEL <dataset>.<model_name>, (<query supplying evaluation data>))

SELECT * FROM
ML.PREDICT(MODEL bike_model.model_bucketized,
  (SELECT 'Park Lane , Hyde Park' AS start_station_name,
          CURRENT_TIMESTAMP() AS start_date))


MODEL WEIGHTS

A linear regression model predicts the output as a weighted sum of its inputs. Often, the weights of the model need to be used in a production environment.

SELECT * FROM ML.WEIGHTS(MODEL bike_model.model_bucketized)


BQ Create, Load commands

bq --location=EU mk --dataset movies

bq load --source_format=CSV \
  --location=EU \
  --autodetect \
  movies.movielens_movies_raw \
  gs://dataeng-movielens/movies.csv

To transform a formatted string column in place (for example, splitting a |-separated string into an array), use SELECT * REPLACE:

SELECT * REPLACE(SPLIT(genres, "|") AS genres)
FROM movies.movielens_movies_raw
WHERE movieId < 5

 

Collaborative Filtering

Matrix factorization is a collaborative filtering technique that relies on two vectors called the user factors and the item factors. The user factors are a low-dimensional representation of a user_id, and the item factors similarly represent an item_id.

To perform a matrix factorization of our data, we use the typical BigQuery ML syntax except that the model_type is matrix_factorization and we have to identify which columns play what roles in the collaborative filtering setup.

CREATE OR REPLACE MODEL movies.movie_recommender
OPTIONS(model_type='matrix_factorization',
        user_col='userId', item_col='movieId', rating_col='rating',
        l2_reg=0.2, num_factors=16) AS
SELECT userId, movieId, rating
FROM movies.movielens_ratings

      

ML.PREDICT

SELECT * FROM
ML.PREDICT(MODEL `cloud-training-prod-bucket.movies.movie_recommender`,
  (SELECT movieId, title, 903 AS userId
   FROM `movies.movielens_movies`, UNNEST(genres) g
   WHERE g = 'Comedy'))
ORDER BY predicted_rating DESC
LIMIT 5


Ref: https://googlecourses.qwiklabs.com/course_sessions/109507/labs/12073


Materialized VIEW - Automatic refresh (can turn on/off)

By default, materialized views are automatically refreshed within 5 minutes of a change to the base table. Examples of changes include row insertions or row deletions.

Automatic refresh can be enabled or disabled at any time.

To turn automatic refresh off when you create a materialized view, set enable_refresh to false.

CREATE MATERIALIZED VIEW `project-id.my_dataset.my_mv_table`
PARTITION BY RANGE_BUCKET(column, buckets)
OPTIONS (enable_refresh = false)
AS SELECT ...

For an existing materialized view, you can modify the enable_refresh value using ALTER MATERIALIZED VIEW:

ALTER MATERIALIZED VIEW `project-id.my_dataset.my_mv_table`
SET OPTIONS (enable_refresh = true)


Query Data - options, Federated Query...

SQL GUI (Cloud Console)
bq command-line tool
Storage API --> Spark, TensorFlow, Dataflow, Pandas, scikit-learn


Controlling access to DataSets

You can apply access controls during dataset creation by calling the datasets.insert API method.

Access controls can't be applied during dataset creation in the Cloud Console or the bq command-line tool.

You can apply access controls to a dataset after it is created in the following ways:

  • Using the Cloud Console.
  • Using the bq update command in the bq command-line tool.
  • Calling the datasets.patch API method.
  • Using the client libraries.
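
For example, a sketch of the bq update route (project and dataset names are hypothetical): dump the dataset metadata, edit its "access" array, then apply the change.

# Dump current dataset metadata (including the access list) to a file:
bq show --format=prettyjson myproject:mydataset > dataset.json

# After editing the "access" array in dataset.json, apply it:
bq update --source dataset.json myproject:mydataset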

Dataset Sharing - option in UI [share dataset]


Partitioning

https://cloud.google.com/bigquery/docs/querying-partitioned-tables

By ingestion time

When you create an ingestion-time partitioned table, two pseudo columns are added to the table: a _PARTITIONTIME pseudo column and a _PARTITIONDATE pseudo column.

When you query data in ingestion-time partitioned tables, you reference specific partitions by specifying values in the _PARTITIONTIME or _PARTITIONDATE pseudo columns. For example:

  • _PARTITIONTIME >= "2018-01-29 00:00:00" AND _PARTITIONTIME < "2018-01-30 00:00:00"
  • _PARTITIONTIME BETWEEN TIMESTAMP('2016-01-01') AND TIMESTAMP('2016-01-02')

Limiting partitions queried using pseudo columns:

SELECT
  column
FROM
  dataset.table
WHERE
  _PARTITIONTIME BETWEEN TIMESTAMP('2016-01-01')
  AND TIMESTAMP('2016-01-02')


Partition by Date/Time column
No pseudo columns here.

Special partitions:
  __NULL__ (if the partitioning value is NULL)
  __UNPARTITIONED__ (if the partitioning value is outside the allowed range)
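
A minimal sketch of a date-column-partitioned table (table and column names are hypothetical):

CREATE TABLE my_dataset.transactions
PARTITION BY DATE(ts)          -- partition on the DATE of a TIMESTAMP column
AS
SELECT TIMESTAMP('2021-01-05 12:00:00') AS ts, 100 AS amount;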


Integer Range Partitioning
  • Using the Cloud Console
  • Using a DDL CREATE TABLE statement with a PARTITION BY RANGE_BUCKET clause that contains a partition expression
RANGE_BUCKET (
<integer_column>,
GENERATE_ARRAY(<beginning>, <end + 1>, <interval_length>)
)
* Or, with the bq command-line tool, use --range_partitioning=pickup_location_id,0,300,5 (format: column,start,end,interval)
Argument       Value
column name    customer_id
start          0
end            100
interval       10

The table will be partitioned on the customer_id column into ranges of interval 10. The values 0 to 9 will be in one partition, values 10 to 19 in another, ..., and values 90 to 99 in the last. Values outside 0 to 99 (such as -1 or 100) go into the __UNPARTITIONED__ partition, and NULL values go into the __NULL__ partition.
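
A sketch matching the table above (dataset and table names are hypothetical):

CREATE TABLE my_dataset.customers
-- Partitions of width 10 covering 0..99; the end bound is exclusive:
PARTITION BY RANGE_BUCKET(customer_id, GENERATE_ARRAY(0, 100, 10))
AS
SELECT 42 AS customer_id, 'Alice' AS name;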



Sharding & partitioning

Partitioning is preferred - less metadata, better performance, fewer permission checks

Sharding - create a separate table for each day/hour, named with a suffix as below:
<TABLE_NAME_PREFIX>_<YYYYMMDD>


When querying a date-sharded table, you only include the table(s) that you need. You can use either UNION ALL or the wildcard table format.

SELECT *
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_201707*`
WHERE _TABLE_SUFFIX BETWEEN '01' AND '14'


Alternatively, use UNION ALL to scan multiple sharded tables, as sketched below.
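
For example, the UNION ALL equivalent of scanning the first three July shards (same public dataset as the wildcard query above):

SELECT * FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170701`
UNION ALL
SELECT * FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170702`
UNION ALL
SELECT * FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170703`;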

Clustering - keeps related data together; useful, for example, when you frequently filter on a text column.

CREATE TABLE <my_partitioned_table>
PARTITION BY DATE(creation_date)
CLUSTER BY deviceName
AS
SELECT * FROM <non_partitioned_table>

Data Analytics - Descriptive, Predictive, Prescriptive


