Thursday, January 7, 2021

BigQuery: My Reference Notes


dw-bq-migration-overview MUST Read

https://cloud.google.com/solutions/migration/dw2bq/dw-bq-migration-overview

ARRAY_AGG, ARRAY_LENGTH 

You can do some pretty useful things with arrays like:

  • finding the number of elements with ARRAY_LENGTH(<array>)

  • deduplicating elements with ARRAY_AGG(DISTINCT <field>)

  • ordering elements with ARRAY_AGG(<field> ORDER BY <field>)

  • limiting the number of elements with ARRAY_AGG(<field> LIMIT 5)

SELECT
  fullVisitorId,
  date,
  ARRAY_AGG(DISTINCT v2ProductName) AS products_viewed,
  ARRAY_LENGTH(ARRAY_AGG(DISTINCT v2ProductName)) AS distinct_products_viewed,
  ARRAY_AGG(DISTINCT pageTitle) AS pages_viewed,
  ARRAY_LENGTH(ARRAY_AGG(DISTINCT pageTitle)) AS distinct_pages_viewed
FROM `data-to-insights.ecommerce.all_sessions`
WHERE visitId = 1501570398
GROUP BY fullVisitorId, date
ORDER BY date


Querying datasets that already have ARRAYs

In a BigQuery schema, an ARRAY field is shown with a REPEATED mode.

SELECT visitId, hits.page.pageTitle
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801`
WHERE visitId = 1501570398

You will get an error: Cannot access field page on a value with type ARRAY<STRUCT<hitNumber INT64, time INT64, hour INT64, ...>> at [3:8]

Before you can query REPEATED fields (arrays) normally, you must first break the arrays back into rows.

How do you do that with SQL?

Answer: Use the UNNEST() function on your array field:

SELECT DISTINCT
  visitId,
  h.page.pageTitle
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801`,
UNNEST(hits) AS h
WHERE visitId = 1501570398
LIMIT 10

  • You need to UNNEST() arrays to bring the array elements back into rows
  • UNNEST() always follows the table name in your FROM clause (think of it conceptually like a pre-joined table)

STRUCTs

The easiest way to think about a STRUCT is to consider it conceptually like a separate table that is already pre-joined into your main table. A STRUCT can have another STRUCT as one of its fields (you can nest STRUCTs)

A STRUCT can have:

    • one or many fields in it
    • the same or different data types for each field
    • its own alias

SELECT
  visitId,
  totals.*,
  device.*
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801`
WHERE visitId = 1501570398
LIMIT 10



#standardSQL
SELECT race, participants.name
FROM racing.race_results
CROSS JOIN race_results.participants  # this is the STRUCT (it is like a table within a table)

The query below gives the same result:

#standardSQL
SELECT race, participants.name
FROM racing.race_results AS r, r.participants

Q: If you have more than one race type (800M, 100M, 200M), wouldn't a CROSS JOIN just associate every racer name with every possible race, like a Cartesian product?

Answer: No. This is a correlated cross join, which only unpacks the elements associated with a single row. For a deeper discussion, see Working with ARRAYs and STRUCTs.

#standardSQL
SELECT COUNT(p.name) AS racer_count
FROM racing.race_results AS r, UNNEST(r.participants) AS p


QUANTILES & APPROX_QUANTILES

Quantiles (fractiles) divide a sorted dataset into equal-sized groups; the 100-quantile is the percentile.

Percentile -> divides the set into 100 parts.

SELECT APPROX_QUANTILES(x, 2) AS approx_quantiles
FROM UNNEST([1, 1, 1, 4, 5, 6, 7, 8, 9, 10]) AS x;

--> [1, 5, 10]


SELECT APPROX_QUANTILES(x, 4) AS approx_quantiles
FROM UNNEST([1, 1, 1, 4, 5, 6, 7, 8, 9, 10]) AS x;

--> [1, 1, 5, 8, 10]

Approximate functions in BigQuery (link)

SELECT FORMAT("%T", APPROX_QUANTILES(DISTINCT x, 2 RESPECT NULLS)) AS approx_quantiles
FROM UNNEST([NULL, NULL, 1, 1, 1, 4, 5, 6, 7, 8, 9, 10]) AS x;

+------------------+
| approx_quantiles |
+------------------+
| [NULL, 6, 10]    |
+------------------+
SELECT APPROX_QUANTILES(x, 4) AS output
FROM UNNEST(GENERATE_ARRAY(1, 100)) AS x;

--> [1, 25, 50, 75, 100]

SELECT APPROX_QUANTILES(x, 100) AS output
FROM UNNEST(GENERATE_ARRAY(1, 200)) AS x;
--> [1, 2, 4, 6, 8, 10, 12, ....., 198, 200]

SELECT APPROX_QUANTILES(x, 100)[OFFSET(5)] AS output
FROM UNNEST(GENERATE_ARRAY(1, 200)) AS x;

--> 10

Approximate aggregate functions are scalable in terms of memory usage and time, but produce approximate rather than exact results. They typically require much less memory than exact aggregation functions like COUNT(DISTINCT ...).
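
For example, a quick side-by-side on generated data (a minimal sketch):

SELECT
  COUNT(DISTINCT x) AS exact_count,         -- exact, memory-heavy at scale
  APPROX_COUNT_DISTINCT(x) AS approx_count  -- approximate, scalable
FROM UNNEST(GENERATE_ARRAY(1, 1000)) AS x;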

----------------------------------------------------------------------------------------------

Read more here --> BigQuery Functions and Operators

SAFE. prefix

If you begin a function with the SAFE. prefix, it will return NULL instead of raising an error. The SAFE. prefix only prevents errors from the prefixed function itself.

SELECT SAFE.SUBSTR('foo', 0, -2) AS safe_output UNION ALL
SELECT SAFE.SUBSTR('bar', 0, 2) AS safe_output;

+-------------+
| safe_output |
+-------------+
| NULL        |
| ba          |
+-------------+
 
If the SAFE. prefix is not used, the query fails with the error "Third argument in SUBSTR() cannot be negative".


BigQuery- Query Cache

Cache results of previous queries

The BigQuery service automatically caches query results in a temporary table. If the identical query is submitted within approximately 24 hours, the results are served from this temporary table without any recomputation. Cached results are extremely fast and do not incur charges.

There are, however, a few caveats. Query caching is based on exact string comparison, so even a difference in whitespace causes a cache miss.

Queries are never cached if they:

  • exhibit non-deterministic behavior (for example, they use CURRENT_TIMESTAMP or RAND)
  • query a table or view that has changed (even if the columns/rows of interest are unchanged)
  • query a table with an active streaming buffer (even if there are no new rows)
  • use DML statements
  • query external data sources
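
For example, this query (a sketch against a public sample table) is never served from cache, because RAND() is non-deterministic:

SELECT word
FROM `bigquery-public-data.samples.shakespeare`
WHERE RAND() < 0.01;  -- non-deterministic predicate disables caching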

WITH CLAUSE (Common Table Expression)

WITH clause (also called a Common Table Expression) improves readability but does not improve query speed or cost since results are not cached. 

The same holds for views and subqueries.

If a query is used frequently, one way to improve performance is to store its result in a table (or a materialized view), as sketched below.
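
A minimal sketch of the trade-off, assuming a hypothetical my_dataset.events table:

-- Recomputed on every run; the CTE result is not cached:
WITH daily AS (
  SELECT DATE(event_ts) AS day, COUNT(*) AS cnt
  FROM my_dataset.events   -- hypothetical table
  GROUP BY day
)
SELECT * FROM daily WHERE cnt > 100;

-- Computed once, then cheap to re-query:
CREATE OR REPLACE TABLE my_dataset.daily_counts AS
SELECT DATE(event_ts) AS day, COUNT(*) AS cnt
FROM my_dataset.events
GROUP BY day;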

BIG QUERY - VERY RELEVANT INFO

https://googlecourses.qwiklabs.com/course_sessions/107473/labs/25818 

_TABLE_SUFFIX (WildCard table reference)

#standardSQL
 CREATE OR REPLACE TABLE ecommerce.days_with_rain
 PARTITION BY date
 OPTIONS (
   partition_expiration_days=60,
   description="weather stations with precipitation, partitioned by day"
 ) AS
 SELECT
   DATE(CAST(year AS INT64), CAST(mo AS INT64), CAST(da AS INT64)) AS date,
   (SELECT ANY_VALUE(name) FROM `bigquery-public-data.noaa_gsod.stations` AS stations
    WHERE stations.usaf = stn) AS station_name,  -- Stations may have multiple names
   prcp
 FROM `bigquery-public-data.noaa_gsod.gsod*` AS weather
 WHERE prcp < 99.9  -- Filter unknown values
   AND prcp > 0      -- Filter out days with no precipitation
   AND CAST(_TABLE_SUFFIX AS int64) >= 2017
   AND CAST(_TABLE_SUFFIX AS int64) <= 2019


BQML - Big Query Machine Learning 

CREATE OR REPLACE MODEL bike_model.model_bucketized
TRANSFORM(
  * EXCEPT(start_date),
  IF(EXTRACT(DAYOFWEEK FROM start_date) BETWEEN 2 AND 6,
     'weekday', 'weekend') AS dayofweek,
  ML.BUCKETIZE(EXTRACT(HOUR FROM start_date), [5, 10, 17]) AS hourofday)
OPTIONS(input_label_cols=['duration'],
        model_type='linear_reg') AS
SELECT
  duration,
  start_station_name,
  start_date
FROM `bigquery-public-data`.london_bicycles.cycle_hire

EVALUATE

SELECT * FROM ML.EVALUATE(MODEL <dataset>.<model_name>), or
SELECT * FROM ML.EVALUATE(MODEL <dataset>.<model_name>, (<query supplying evaluation data>))

SELECT * FROM
ML.PREDICT(MODEL bike_model.model_bucketized,
  (SELECT 'Park Lane , Hyde Park' AS start_station_name,
          CURRENT_TIMESTAMP() AS start_date))


MODEL WEIGHTS

A linear regression model predicts the output as a weighted sum of its inputs. Often, the weights of the model need to be used in a production environment.

SELECT * FROM ML.WEIGHTS(MODEL bike_model.model_bucketized)


BQ Create, Load commands

bq --location=EU mk --dataset movies

bq load --source_format=CSV \
  --location=EU \
  --autodetect \
  movies.movielens_movies_raw \
  gs://dataeng-movielens/movies.csv

To transform a formatted string column in place (for example, splitting a |-separated string into an array), use SELECT * REPLACE:

SELECT * REPLACE(SPLIT(genres, "|") AS genres)
FROM movies.movielens_movies_raw
WHERE movieId < 5

 

Collaborative Filtering

Matrix factorization is a collaborative filtering technique that relies on two vectors called the user factors and the item factors. The user factors are a low-dimensional representation of a user_id, and the item factors similarly represent an item_id.

To perform a matrix factorization of our data, we use the typical BigQuery ML syntax except that the model_type is matrix_factorization and we have to identify which columns play what roles in the collaborative filtering setup.

CREATE OR REPLACE MODEL movies.movie_recommender
OPTIONS(model_type='matrix_factorization',
        user_col='userId', item_col='movieId', rating_col='rating',
        l2_reg=0.2, num_factors=16) AS
SELECT userId, movieId, rating
FROM movies.movielens_ratings

      

ML.PREDICT

SELECT * FROM
ML.PREDICT(MODEL `cloud-training-prod-bucket.movies.movie_recommender`,
  (SELECT movieId, title, 903 AS userId
   FROM `movies.movielens_movies`, UNNEST(genres) g
   WHERE g = 'Comedy'))
ORDER BY predicted_rating DESC
LIMIT 5


Ref: https://googlecourses.qwiklabs.com/course_sessions/109507/labs/12073


Materialized VIEW - Automatic refresh (can turn on/off)

By default, materialized views are automatically refreshed within 5 minutes of a change to the base table. Examples of changes include row insertions or row deletions.

Automatic refresh can be enabled or disabled at any time.

To turn automatic refresh off when you create a materialized view, set enable_refresh to false.

CREATE MATERIALIZED VIEW `project-id.my_dataset.my_mv_table`
PARTITION BY RANGE_BUCKET(column, buckets)
OPTIONS (enable_refresh = false)
AS SELECT ...

For an existing materialized view, you can modify the enable_refresh value using ALTER MATERIALIZED VIEW:

ALTER MATERIALIZED VIEW `project-id.my_dataset.my_mv_table`
SET OPTIONS (enable_refresh = true)


Query Data - options, Federated Query...

SQL GUI (Cloud Console)
bq command-line tool
Storage API --> Spark, TensorFlow, Dataflow, Pandas, scikit-learn


Controlling access to DataSets

You can apply access controls during dataset creation by calling the datasets.insert API method.

Access controls can't be applied during dataset creation in the Cloud Console or the bq command-line tool.

You can apply access controls to a dataset after it is created in the following ways:

  • Using the Cloud Console.
  • Using the bq update command in the bq command-line tool.
  • Calling the datasets.patch API method.
  • Using the client libraries.
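
For example, a sketch of the bq update route (project and dataset names are hypothetical): dump the dataset metadata, edit its "access" array, then apply the change.

# Dump current dataset metadata (including the access list) to a file:
bq show --format=prettyjson myproject:mydataset > dataset.json

# After editing the "access" array in dataset.json, apply it:
bq update --source dataset.json myproject:mydataset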

Dataset Sharing - option in UI [share dataset]


Partitioning

https://cloud.google.com/bigquery/docs/querying-partitioned-tables

By ingestion time

When you create an ingestion-time partitioned table, two pseudo columns are added to the table: a _PARTITIONTIME pseudo column and a _PARTITIONDATE pseudo column.

When you query data in ingestion-time partitioned tables, you reference specific partitions by specifying values in the _PARTITIONTIME or _PARTITIONDATE pseudo columns. For example:

  • _PARTITIONTIME >= "2018-01-29 00:00:00" AND _PARTITIONTIME < "2018-01-30 00:00:00"
  • _PARTITIONTIME BETWEEN TIMESTAMP('2016-01-01') AND TIMESTAMP('2016-01-02')

Limiting partitions queried using pseudo columns:

SELECT
  column
FROM
  dataset.table
WHERE
  _PARTITIONTIME BETWEEN TIMESTAMP('2016-01-01')
  AND TIMESTAMP('2016-01-02')


Partition by Date/Time column
No pseudo columns here.

Special partitions:
  __NULL__ (if the partitioning value is NULL)
  __UNPARTITIONED__ (if the partitioning value is outside the allowed range)
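
A minimal sketch of a date-column-partitioned table (table and column names are hypothetical):

CREATE TABLE my_dataset.transactions
PARTITION BY DATE(ts)          -- partition on the DATE of a TIMESTAMP column
AS
SELECT TIMESTAMP('2021-01-05 12:00:00') AS ts, 100 AS amount;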


Integer Range Partitioning
  • Using the Cloud Console
  • Using a DDL CREATE TABLE statement with a PARTITION BY RANGE_BUCKET clause that contains a partition expression
RANGE_BUCKET (
<integer_column>,
GENERATE_ARRAY(<beginning>, <end + 1>, <interval_length>)
)
* Or, with the bq command-line tool, use --range_partitioning=pickup_location_id,0,300,5 (format: column,start,end,interval)
Argument       Value
column name    customer_id
start          0
end            100
interval       10

The table will be partitioned on the customer_id column into ranges of interval 10. The values 0 to 9 will be in one partition, values 10 to 19 in another, ..., and values 90 to 99 in the last. Values outside 0 to 99 (such as -1 or 100) go into the __UNPARTITIONED__ partition, and NULL values go into the __NULL__ partition.
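
A sketch matching the table above (dataset and table names are hypothetical):

CREATE TABLE my_dataset.customers
-- Partitions of width 10 covering 0..99; the end bound is exclusive:
PARTITION BY RANGE_BUCKET(customer_id, GENERATE_ARRAY(0, 100, 10))
AS
SELECT 42 AS customer_id, 'Alice' AS name;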



Sharding & partitioning

Partitioning is preferred - less metadata, better performance, fewer permission checks

Sharding - create a separate table for each day/hour, named with a suffix as below:
<TABLE_NAME_PREFIX>_<YYYYMMDD>


When querying a date-sharded table, you only include the table(s) that you need. You can use either UNION ALL or the wildcard table format.

SELECT *
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_201707*`
WHERE _TABLE_SUFFIX BETWEEN '01' AND '14'


Alternatively, use UNION ALL to scan multiple sharded tables, as sketched below.
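
For example, the UNION ALL equivalent of scanning the first three July shards (same public dataset as the wildcard query above):

SELECT * FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170701`
UNION ALL
SELECT * FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170702`
UNION ALL
SELECT * FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170703`;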

Clustering - keeps related data together; useful, for example, when you frequently filter on a text column.

CREATE TABLE <my_partitioned_table>
PARTITION BY DATE(creation_date)
CLUSTER BY deviceName
AS
SELECT * FROM <non_partitioned_table>

Data Analytics - Descriptive, Predictive, Prescriptive


