Quickstart

Before you start, make sure that you have access to a Snowflake account and prepare the following information:

Snowflake Username
Snowflake Password (you will store it locally in an environment variable)
Snowflake Account Name
Snowflake Warehouse Name
Snowflake Database Name
Snowflake Schema Name

You will also need:

Python 3.8 (must-have ⚠️): this is enforced by the snowflake-snowpark-python package. Refer to the Snowflake documentation for more details.
A tool to manage Python virtual environments (e.g. venv, conda, virtualenv). Anaconda is recommended by Snowflake.
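
Since the Python version is a hard requirement, it can help to verify the interpreter inside the new virtual environment before installing anything. The following is a minimal sketch; the helper name `is_supported_python` is hypothetical and not part of the plugin.

```python
import sys

# Version required by this guide (enforced by snowflake-snowpark-python).
REQUIRED = (3, 8)


def is_supported_python(version_info=sys.version_info):
    """Return True when the interpreter's major.minor version is exactly 3.8."""
    return tuple(version_info[:2]) == REQUIRED


if __name__ == "__main__":
    major, minor = sys.version_info[:2]
    if is_supported_python():
        print("Python 3.8 detected, OK")
    else:
        print(f"Python {major}.{minor} detected, but 3.8 is required")
```

Run it with the interpreter of the environment you plan to use for the project.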

Prepare a new virtual environment with Python 3.8.
Install the plugin

pip install "kedro-snowflake>=0.1.2"

Create a new project from our starter

kedro new --starter=snowflights --checkout=0.1.2

 Project Name
 ============
 Please enter a human readable name for your new project.
 Spaces, hyphens, and underscores are allowed.
  [Snowflights]:

 Snowflake Account
 =================
 Please enter the name of your Snowflake account.
 This is the part of the URL before .snowflakecomputing.com
  []: abc-123

 Snowflake User
 ==============
 Please enter the name of your Snowflake user.
  []: user2137

 Snowflake Warehouse
 ===================
 Please enter the name of your Snowflake warehouse.
  []: compute-wh

 Snowflake Database
 ==================
 Please enter the name of your Snowflake database.
  [DEMO]:

 Snowflake Schema
 ================
 Please enter the name of your Snowflake schema.
  [DEMO]:

 Snowflake Password Environment Variable
 =======================================
 Please enter the name of the environment variable that contains your Snowflake password.
 Alternatively, you can re-configure the plugin later to use Kedro's credentials.yml
  [SNOWFLAKE_PASSWORD]:

 Pipeline Name Used As A Snowflake Task Prefix
 =============================================

  [default]:

 Enable Mlflow Integration (See Documentation For The Configuration Instructions)
 ================================================================================

  [False]:


 The project name 'Snowflights' has been applied to:
 - The project title in /tmp/snowflights/README.md
 - The folder created for your project in /tmp/snowflights
 - The project's python package in /tmp/snowflights/src/snowflights
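
The Snowflake Account prompt above asks for the part of the URL before .snowflakecomputing.com. If you only have the full URL at hand, a small helper like the one below can extract it. This is an illustrative sketch; `account_from_url` is a hypothetical name, not a function provided by the plugin.

```python
from urllib.parse import urlparse

SUFFIX = ".snowflakecomputing.com"


def account_from_url(url):
    """Return the part of the host name that precedes .snowflakecomputing.com."""
    # urlparse only fills in the host name when a scheme is present.
    if "://" not in url:
        url = "https://" + url
    host = urlparse(url).hostname or ""
    if not host.endswith(SUFFIX) or host == SUFFIX:
        raise ValueError(f"Not a Snowflake URL: {url!r}")
    return host[: -len(SUFFIX)]


if __name__ == "__main__":
    print(account_from_url("https://abc-123.snowflakecomputing.com/"))
```

The returned value is what you would type at the Snowflake Account prompt.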

The pipeline name parameter lets you run many pipelines in the same Snowflake database while avoiding conflicts between them. For the demo, it is fine to leave it as default.

Leave the MLflow integration disabled for now. More instructions on how to get the integration to work will be available later in a blog post.

The Snowflake Password Environment Variable is the name of the environment variable that contains your Snowflake password. Make sure to set it in your current terminal session. Alternatively, you can re-configure the plugin later to use Kedro’s credentials.yml. For example (using an environment variable):

export SNOWFLAKE_PASSWORD="super_secret!"
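
Before launching anything, you may want to confirm that the variable is actually visible to Python in the current session. The sketch below assumes the default variable name SNOWFLAKE_PASSWORD from the prompts; the helper `read_snowflake_password` is hypothetical and only illustrates the check, it is not how the plugin itself reads the value.

```python
import os


def read_snowflake_password(var_name="SNOWFLAKE_PASSWORD", environ=None):
    """Return the password stored in the given environment variable.

    Raises RuntimeError when the variable is missing or empty.
    """
    environ = os.environ if environ is None else environ
    password = environ.get(var_name)
    if not password:
        raise RuntimeError(f"Environment variable {var_name} is not set or is empty")
    return password


if __name__ == "__main__":
    try:
        read_snowflake_password()
        print("SNOWFLAKE_PASSWORD is set")
    except RuntimeError as error:
        print(error)
```

If you chose a different variable name during project creation, pass it as `var_name`.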

Go to the project’s directory: cd snowflights
Install the requirements

pip install -r src/requirements.txt

Launch Kedro pipeline in Snowflake

kedro snowflake run --wait-for-completion

After launching the command, you will see an auto-refreshing CLI interface showing the progress of task execution.

[Figure: Kedro Snowflake Pipeline execution]

In Snowpark, you can also see the history of task executions:

[Figure: Kedro Snowflake Tasks history]

Advanced configuration

This plugin uses a *snowflake.yml configuration file in the standard Kedro config directory to handle all of its configuration. Follow the comments in the example config to understand the meaning of each field and modify them as you see fit.

snowflake:
  connection:
    # Either set a credentials name (a reference to a key in credentials.yml, as in standard Kedro)
    # or leave
    # credentials: ~
    # and specify the rest of the fields
    credentials: snowflake
#    account: "abc-123"
#    database: "KEDRO"
    # Name of the environment variable to take the Snowflake password from
#    password_from_env: "SNOWFLAKE_PASSWORD"
#    role: ~
#    schema: "PUBLIC"
#    user: "user2137"
#    warehouse: "DEFAULT"
  runtime:
    # Default schedule for Kedro tasks
    schedule: "11520 minute"

    # Optional suffix for all kedro stored procedures
    stored_procedure_name_suffix: ""

    # Names of the stages
    # `stage` is for stored procedures etc.
    # `temporary_stage` is for temporary data serialization
    stage: "@KEDRO_SNOWFLAKE_STAGE"
    temporary_stage: '@KEDRO_SNOWFLAKE_TEMP_DATA_STAGE'

    # List of Python packages and imports to be used by the project
    # We recommend treating this list as add-only and not modifying existing entries,
    # as changes may break the project once deployed to Snowflake.
    # Modify at your own risk!
    dependencies:
      # imports will be taken from local environment and will get uploaded to Snowflake
      imports:
      - kedro
      - kedro_datasets
      - kedro_snowflake
      - omegaconf
      - antlr4
      - dynaconf
      - anyconfig
      # packages use Snowflake's official Conda channel
      # https://repo.anaconda.com/pkgs/snowflake/
      packages:
      - snowflake-snowpark-python
      - cachetools
      - pluggy
      - PyYAML==6.0
      - jmespath
      - click
      - importlib_resources
      - toml
      - rich
      - pathlib
      - fsspec
      - scikit-learn
      - pandas
      - zstandard
      - more-itertools
      - openpyxl
      - backoff
    # Optionally provide mapping for user-friendly pipeline names
    pipeline_name_mapping:
     __default__: default
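
The default schedule value in the config above, "11520 minute", is easier to reason about as a duration. The sketch below converts it, assuming the schedule is always written as "<number> minute" as in the example config; `schedule_to_timedelta` is a hypothetical helper and not part of the plugin.

```python
from datetime import timedelta


def schedule_to_timedelta(schedule):
    """Convert a '<number> minute' schedule string into a timedelta."""
    value, unit = schedule.split()
    # Assumption: only minute-based schedules, as in the example config.
    if unit.lower() not in ("minute", "minutes"):
        raise ValueError(f"Unsupported schedule unit: {unit!r}")
    return timedelta(minutes=int(value))


if __name__ == "__main__":
    interval = schedule_to_timedelta("11520 minute")
    print(interval.days)  # 11520 / (60 * 24) = 8
```

In other words, the default schedule corresponds to an interval of 8 days.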

Snowflake datasets

This plugin integrates with Kedro’s datasets and provides an additional set of datasets for Snowflake. The catalog.yml in our official Snowflights starter shows an example usage of each of them:

companies:
  type: kedro_datasets.snowflake.SnowparkTableDataSet
  table_name: companies
  database: kedro
  schema: PUBLIC
  credentials: snowflake

reviews:
  type: pandas.CSVDataSet
  filepath: data/01_raw/reviews.csv

shuttles:
  type: pandas.ExcelDataSet
  filepath: data/01_raw/shuttles.xlsx
  load_args:
    engine: openpyxl # Use the modern Excel engine; it is the default since Kedro 0.18.0

preprocessed_shuttles:
  type: kedro_snowflake.datasets.native.SnowflakeStageFileDataSet
  stage: "@KEDRO_SNOWFLAKE_TEMP_DATA_STAGE"
  filepath: data/02_intermediate/preprocessed_shuttles.csv
  credentials: snowflake
  dataset:
    type: pandas.CSVDataSet