
Snowflake cloud DB and Python: "two good friends"

What can Snowflake do as a cloud DB?

Snowflake is a native cloud DB which runs on AWS, and now also on Azure. The Internet connection from the client to the cloud and the data within the DB are encrypted. During execution, it can automatically scale up as required, and scale down again at the end. Because billing is based on (storage) volume and execution time, shorter execution times can reduce costs. Detailed online documentation is available at the following URL: https://docs.snowflake.net/manuals/index.html

Incidentally, one does not have to be an AWS customer to be able to use Snowflake. As a cloud-DB service, Snowflake itself offers no proprietary ETL tools, but leaves this to the manufacturers of ETL, reporting or self-service BI tools, which usually provide native drivers and connections for use with Snowflake. If no separate ETL tool is to be used in the company, there are several ways of loading data and implementing the ETL pipelines. One option is to implement the logic in SQL and orchestrate it via Python.

Snowflake offers drivers for Python, Spark, ODBC and JDBC

Snowflake's own implementation offers drivers for Python, Spark, ODBC and JDBC. This means that a Python connector for Snowflake is available for creating DB connections from one's own client.

This article describes this Python connector using sample snippets: ETL load pipelines can be implemented (manually) with the help of the Python connector (= DB connection). Scheduling can take place, for example, via cron job; sources (CSV, XML, tables, XLSX etc.) can be imported, and further SQL scripts can also be invoked.
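To illustrate the file-import case, here is a minimal sketch of loading a local CSV file into an existing table via the connector, using PUT into the table's internal stage followed by COPY INTO. The table name "my_table", the file path and the warehouse/database/schema names are placeholders, not part of the original example:

# Minimal sketch: load a local CSV file into an existing Snowflake table.
# All names below are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(user=<DB-User>,
                                   password=<DB-Password>,
                                   account=<Snowflake-Account>)
cur = conn.cursor()
try:
    cur.execute("use warehouse warehousename")
    cur.execute("use database dbname")
    cur.execute("use schema schemaname")
    # upload the file into the table's internal stage ...
    cur.execute("put file:///tmp/data.csv @%my_table")
    # ... and load it from there into the table
    cur.execute("copy into my_table file_format = (type = csv skip_header = 1)")
finally:
    cur.close()
    conn.close()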

 

Procedure

In Python, new packages are installed with the help of "pip":

"pip install snowflake-connector-python"

The following module can be imported in Python code:

"import snowflake.connector"

This makes the Python connector available, and implementation can begin. A first Python snippet is provided next (a test of the connection):

# *****************************************************************************
import snowflake.connector

conn = snowflake.connector.connect(host=<hostname>,
                                   user=<DB-User>,
                                   password=<DB-Password>,
                                   account=<Snowflake-Account>)
cur = conn.cursor()
try:
    cur.execute("SELECT current_version()")   # Snowflake SQL
    one_row = cur.fetchone()
    print(one_row[0])
finally:
    cur.close()

# =======================================================
# A variant of this is access by column name
# instead of index position:
#
# (an example)
# =======================================================
cur_dict = conn.cursor(snowflake.connector.DictCursor)   # cursor with parameter
try:
    cur_dict.execute("SELECT current_version() as c1, current_client() as c2, current_date() as c3")
    for r in cur_dict:
        print(f"{r['c1']}  --  {r['c2']}  --  {r['c3']}")
finally:
    cur_dict.close()
    conn.close()
# ******************************************************************************
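As a side note, the connector also supports bind variables in execute(), which avoids string concatenation when building SQL. A minimal sketch, assuming an open connection "conn" as above and a hypothetical table "my_table":

# Minimal sketch: bind variables with the Python connector.
# "my_table" is hypothetical; "conn" is an open connection as above.
cur = conn.cursor()
try:
    cur.execute("select col1 from my_table where id = %s", (123,))
    for row in cur:
        print(row[0])
finally:
    cur.close()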

After these first Python snippets, the next step is invoking stored procedures within Snowflake. Stored procedures in Snowflake are implemented in **JavaScript** (!)

 

This will amaze the ETL architect, data engineer and SQL developer...

The following question spontaneously arises: JavaScript and DBMS ... how is that supposed to work? Three Snowflake-specific objects are made available within stored procedures:

"snowflake", "Statement", "ResultSet"

They make JavaScript capable of DDL + DML within Snowflake. Stored procedures are invoked via "call procname()", regardless of whether from Snowflake or from Python.

Side note:

A **template** for JavaScript stored procedures with SQL access in Snowflake:

 

create or replace procedure my_proc1()
returns string
language javascript
as
$$
try {
    var stmt1 = snowflake.createStatement( { sqlText: "select * from Tabelle" } );
    var rs1 = stmt1.execute();

    while (rs1.next()) {
        // read from ResultSet "rs1" via
        //
        // column name :   rs1.column_name1
        // or
        // index pos.  :   rs1.getColumnValue(1)
        ...
        ...
    }
    return "SUCCESS";
} catch (err) {
    return "FAILED (Exception):   " + err;
}
$$
;
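The template above only reads data. Since the three objects also allow DML, a second, hedged sketch follows in the same form: a procedure that writes one row into a hypothetical table "LOG_TABLE", passing its argument as a bind variable (the procedure, table and column names are assumptions, not part of the original template):

create or replace procedure my_log_proc(MSG string)
returns string
language javascript
as
$$
try {
    // DML via the "snowflake" object; LOG_TABLE is a hypothetical table,
    // the argument MSG is passed as a bind variable
    snowflake.execute( { sqlText: "insert into LOG_TABLE (log_ts, log_msg) values (current_timestamp, ?)",
                         binds: [MSG] } );
    return "SUCCESS";
} catch (err) {
    return "FAILED (Exception):   " + err;
}
$$
;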

 

The invocation of the above stored procedure "my_proc1()" from Python is shown next. Some further Snowflake-specific SQL statements are needed for the environment; they are "mandatory" and shown here as examples, but not treated in greater detail, as that is beyond the scope of this article:

...
cur.execute("use warehouse warehousename")
cur.execute("alter warehouse warehousename resume")

cur.execute("alter session set timestamp_type_mapping = 'TIMESTAMP_NTZ'")
cur.execute("alter session set timezone = 'Europe/Berlin'")
cur.execute("alter session set TIME_OUTPUT_FORMAT = 'HH24:MI:SS.FF'")

cur.execute("use database dbname")
cur.execute("use schema schemaname")

# *********************************************
# Invocation of the stored procedure:
# =============================================
cur.execute("call my_proc1()")
# *********************************************

cur.execute("alter warehouse warehousename suspend")
...

What about the performance?

Which execution times are achieved with the variants in Python?

The following variants exist in Python (values determined from a "POC" project):

(1) Stored procedures in Snowflake are invoked from Python.

(2) Stored procedures are re-implemented or reproduced in Python, and also executed in Python.

(3) No stored procedures; instead, SQL statements (SELECT, INSERT) are executed in Snowflake directly from Python, without the detour via stored procedures (a minimal sketch of variants (2) and (3) follows after this list).
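To make variants (2) and (3) concrete, a minimal sketch; "src_table" and "tgt_table" are placeholders, and "conn" is an open connection as in the snippets above:

# Minimal sketch of variants (2)/(3): read from one table, write to another,
# with explicit TCL (begin/commit/rollback).
# "src_table"/"tgt_table" are placeholders; "conn" is an open connection.
cur = conn.cursor()
try:
    cur.execute("begin")                       # TCL: begin
    cur.execute("select col1, col2 from src_table")
    rows = cur.fetchall()
    # variant (2): any transformation logic would run here in Python
    cur.executemany("insert into tgt_table (col1, col2) values (%s, %s)", rows)
    cur.execute("commit")                      # TCL: commit
except snowflake.connector.errors.ProgrammingError as e:
    cur.execute("rollback")                    # TCL: rollback
    print(e)
finally:
    cur.close()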

 

Several passes and test invocations were performed here:

With (1): Python calling the JavaScript stored procedure:

Execution times (sec):           

(1) 8.61, (2) 7.28, (3) 6.00, (4) 6.70, (5) 8.64,

(6) 14.49, (7) 7.27, (8) 7.77, (9) 8.85, (10) 8.55

 

With (2): Python re-implementing the stored procedure:

Execution times (sec):   

(1) 6.83, (2) 4.76, (3) 6.57, (4) 5.27,  (5) 5.19,

(6) 4.81, (7) 5.58, (8) 5.38, (9) 12.54, (10) 5.83

 

With (3): Python SQL processing:

* select from (arbitrary) Snowflake table

* insert into another Snowflake table

  ==> including "TCL"   (= begin, rollback, commit)

Execution times (sec):   

(1) 4.42, (2) 6.52, (3) 35.08, (4) 3.91, (5) 3.69,

(6) 4.95, (7) 4.52, (8) 18.99, (9) 4.94, (10) 3.89

 

Evidently, there are outliers which cannot be explained by Snowflake either.
 

Note:  

Another variant is the invocation of SQL files or scripts containing multiple SQL statements, which can be executed sequentially with the help of "execute_stream()", as in the following Python snippet. However, no execution-time measurements are available for this variant:

 

try:
    # --------------------------------------------------------
    #    Execute SQL file with Snowflake SQL statements:
    # --------------------------------------------------------
    ret_values_list = []     # init: list for return values of the SQL execution

    with open("sqlskript_dateiname", 'r', encoding='utf-8') as f:
        for c in conn.execute_stream(f):      # executes all SQL statements one after another
            for ret in c:
                # save the return values
                # of each SQL statement:
                ret_values_list.append(ret)
    ...
    ...

except snowflake.connector.errors.ProgrammingError as e:
    # default error message
    print(e)
    ...

Conclusion

Snowflake provides a native Python connector that allows manual ETL processes to be implemented conveniently. Python is nonetheless only one option among many; graphical ETL tools now also offer native connectors.

Depending on one's requirements and general conditions, use cases can also be implemented in Python. If Python is already deployed in-house, import procedures can be integrated quickly. Modular implementation is possible, and packages can be set up. Snowflake does not offer this with stored procedures; frameworks are not possible in its JavaScript environment, so many procedures ultimately exist side by side without any modularity.

Because the environment is a cloud, there is no file system for logging; tables must be used instead. JavaScript in Snowflake is a new environment and a new "low-level" language for SQL developers, i.e. the learning curve must not be underestimated. It can be avoided by staying in Python instead of using stored procedures in JavaScript.
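A minimal sketch of such table-based logging, assuming a hypothetical log table "etl_log" and an open connection "conn":

# Minimal sketch: logging into a table instead of a log file.
# "etl_log" and its columns are hypothetical.
def log_to_table(conn, job_name, message):
    cur = conn.cursor()
    try:
        cur.execute("insert into etl_log (log_ts, job_name, log_msg) "
                    "values (current_timestamp, %s, %s)",
                    (job_name, message))
    finally:
        cur.close()

log_to_table(conn, "daily_import", "load finished without errors")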

Execution times with Python are acceptable thanks to the native Snowflake connector for Python: it already tunes the SQL internally on its way to Snowflake, which is not the case with the JDBC driver.

My recommendation

ETL with Python in the context of Snowflake is recommended if in-house knowledge of Python already exists. This also allows quick implementation of test scenarios.