
Exasol is a leading manufacturer of analytical database systems. Its core product is high-performance, in-memory, parallel-processing software designed for the rapid analysis of data. Within an SQL script, however, SQL statements are normally processed sequentially. So how can you execute several statements simultaneously? The simple script in this blog post shows you how.

The pros and cons of parallelizing statements

Exasol is known for its massive parallel processing capabilities. Queries are distributed across all the nodes in a cluster using optimized parallel algorithms that process data locally in each node's main memory.

ETL processes often rely on LUA scripts or user-defined functions (UDFs). Within these scripts, SQL statements are executed in order to perform the various tasks of the ETL process. Examples of this include creating tables and importing data. These steps normally have to run sequentially to ensure that, for example, tables are loaded only after they have been created.

When loading tables or exporting data, it is possible to carry out certain steps in parallel. When LUA scripts are used in an ETL process, however, the execution of the instructions within the script is not parallelized: all the instructions are executed in a serialized order. Because Exasol's computing potential is not being fully utilized, this results in longer execution times.
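To make the starting point concrete, a serialized ETL step inside a LUA script typically looks like the following minimal sketch (the schema, table and connection names are hypothetical); each statement starts only once the previous one has finished:

CREATE OR REPLACE LUA SCRIPT SCRIPTS.LUA_SERIAL_ETL () AS
    -- Each statement blocks until the previous one has completed
    query([[CREATE OR REPLACE TABLE STAGE.SALES_2019 LIKE PROD.SALES]])
    query([[INSERT INTO STAGE.SALES_2019 SELECT * FROM PROD.SALES WHERE sales_year = 2019]])
    query([[EXPORT STAGE.SALES_2019 INTO CSV AT my_ftp_connection FILE 'sales_2019.csv']])
/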

What are the requirements?

Executing multiple instructions in parallel requires a database cluster with multiple nodes. You will therefore need an enterprise version of Exasol with multiple nodes to implement this solution. Also, the solution is only suitable for instructions that do not require a specific execution sequence, as they are executed in random order.


How can multiple commands be executed in parallel?

To circumvent the serialized query execution limitation within LUA scripts, a two-step approach is necessary. In the first step, all of the instructions to be executed are written to a purpose-designed table, which can be created in advance or during execution of the script. In the second step, a Python script is invoked that connects to the database and executes the instructions it receives. This Python script is of the SET/EMITS type, which makes it possible to distribute the instructions across the different nodes and thereby execute them in parallel.

Example code

The following code uses an example to demonstrate the solution. In our example, all views of the schema EXA_STATISTICS are to be exported to separate CSV files.

Creating a connection to Exasol

First, create a connection to the Exasol database. It makes sense here to address the full IP range of all the nodes. This is shown in the code below, where 11..14 stands for the four nodes in the cluster.

CREATE OR REPLACE CONNECTION exa_connection
    TO '192.168.6.11..14:8563'
    USER 'tech_user'
    IDENTIFIED BY 'secret';


Implementing the Python script

The second step is to implement the Python script that will execute the statements in parallel. To establish contact with the cluster, we use the pyexasol package together with the previously defined connection, from which DSN, username and password are taken. A client name must also be specified; it can later be used as a filter to verify that the instructions were executed correctly. The script returns two values for each statement executed: if all is well, it emits "Success" and the number of rows involved; if there is a problem, it emits "Failed" along with the statement that failed. This script can be invoked from various LUA scripts in order to execute statements in parallel.

--/
CREATE OR REPLACE PYTHON SET SCRIPT SCRIPTS.PYT_PARALLEL_EXECUTION(stmt VARCHAR(20000))
EMITS(success_state VARCHAR(10), outputs VARCHAR(20000)) AS
import pyexasol

# Connect to the cluster via the predefined connection object
con = exa.get_connection("exa_connection")
C = pyexasol.connect(dsn=con.address, user=con.user, password=con.password,
                     autocommit=True, encryption=True, client_name="PARALLEL_EXECUTION")

def run(ctx):
    # Process every statement assigned to this node
    while True:
        stmt = ctx.stmt
        try:
            result = C.execute(stmt)
            ctx.emit("Success", str(result.rowcount()))
            C.commit()
        except Exception:
            ctx.emit("Failed", stmt)
            C.rollback()
        if not ctx.next():
            break
    C.close()
/
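Because the script opens its own database session, the client name set above makes its work easy to spot from outside. As a quick check, you can filter the session system view by it; a minimal sketch, assuming the standard EXA_DBA_SESSIONS view (the CLIENT column may also contain a client version, hence the LIKE):

SELECT session_id, status, duration
FROM EXA_DBA_SESSIONS
WHERE client LIKE 'PARALLEL_EXECUTION%';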


Implementing the LUA script

The final step is to implement the LUA script. We first create a table and populate it with the statements needed to export the EXA_STATISTICS tables. We then invoke the Python script to execute these instructions, and finally we delete the table containing the export statements. Bear in mind that the Python script opens a new session and terminates before the LUA script does. You therefore have to issue COMMITs in the LUA script: once so that the new session can see the generated statements, and to ensure that the changes made by the Python script are not overwritten.

CREATE OR REPLACE LUA SCRIPT SCRIPTS.LUA_PARALLEL_EXECUTION () AS
require "string"
nClock = os.clock()
-- Example case: export all stats tables

    -- Create tmp table
    local suc, res = pquery([[CREATE OR REPLACE TABLE TABLES.TEMP_TABLE (i VARCHAR(20000))]])
    if suc == true then
        output("Temp table created")
        output("Execution time: " .. os.clock()-nClock)
    elseif suc == false then
        output("ERROR: It was not possible to create the temp table")
        output("Script stopped")
        output("Execution time: " .. os.clock()-nClock)
        exit()
    end

    -- Generate export statements: get table names
    local suc, res = pquery([[SELECT OBJECT_NAME FROM EXA_SYSCAT WHERE SCHEMA_NAME = 'EXA_STATISTICS']])

    -- Fill tmp table with statements for parallel execution
    local suc_sum = 0
    for i=1, #res do
        exp_stmt = "EXPORT EXA_STATISTICS."..res[i].OBJECT_NAME.." INTO LOCAL CSV FILE '/tmp/"..res[i].OBJECT_NAME..".csv'"
        local suc, res1 = pquery([[INSERT INTO TABLES.TEMP_TABLE VALUES(:s)]], {s=exp_stmt})
        if suc == true then
            suc_sum = suc_sum + 1
        elseif suc == false then
            output("WARNING: It was not possible to create the following export statement: ")
            output(exp_stmt)
        end
    end
    output(suc_sum.." export statements created and saved")

    -- Commit so that the new session opened by the Python script sees the statements
    query([[COMMIT]])

    -- Execute python script to parallelize the export
    res = query([[SELECT SCRIPTS.PYT_PARALLEL_EXECUTION(i) FROM TABLES.TEMP_TABLE GROUP BY iproc()]])

    -- Sum up the total number of exported rows
    total = 0
    for i=1, #res do
        local stmt_return = res[i][1]
        if stmt_return == "Failed" then
            output("WARNING: It was not possible to execute the following statement")
            output(res[i][2])
        elseif stmt_return == "Success" then
            total = total + res[i][2]
        end
    end

    -- Drop temp table
    local suc, res = pquery([[DROP TABLE TABLES.TEMP_TABLE]])
    if suc == true then
        output("Temp table dropped")
        output("Execution time: " .. os.clock()-nClock)
    elseif suc == false then
        output("ERROR: Temp table could not be dropped")
    end

    output("Number of exported rows: " .. total)
    output("Script finished successfully")
    output("Execution time: " .. os.clock()-nClock)
/

The statements in column i, which are held in the temporary table TEMP_TABLE, are passed to the function. The GROUP BY iproc() clause distributes the statements across the nodes of the cluster; the Python function is started on each node, and the statements are executed in parallel.
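If you want to check in advance how the statements will be spread over the nodes, you can group the temporary table by iproc() yourself; a quick sketch:

SELECT iproc() AS node_id, COUNT(*) AS stmt_count
FROM TABLES.TEMP_TABLE
GROUP BY iproc();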

Done! ... and now?

Now that everything is set up, we can execute the LUA script and export the data in parallel.

EXECUTE SCRIPT SCRIPTS.LUA_PARALLEL_EXECUTION();

By parallelizing SQL statements within LUA scripts, we have succeeded in reducing the runtime of our regular import and export statements by 90%. This is a significant improvement over executing the SQL statements within LUA scripts in serialized order.

Our example shows how you can execute SQL statements in parallel within LUA scripts.  The template can, of course, be adapted to deal with many different use cases and is not restricted just to the export of data.
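For example, to parallelize imports instead of exports, only the statements written to the temporary table need to change; a brief sketch (the target tables, connection and file names are hypothetical):

INSERT INTO TABLES.TEMP_TABLE VALUES ('IMPORT INTO STAGE.ORDERS FROM CSV AT my_ftp_connection FILE ''orders.csv''');
INSERT INTO TABLES.TEMP_TABLE VALUES ('IMPORT INTO STAGE.CUSTOMERS FROM CSV AT my_ftp_connection FILE ''customers.csv''');

SELECT SCRIPTS.PYT_PARALLEL_EXECUTION(i) FROM TABLES.TEMP_TABLE GROUP BY iproc();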


If you'd like to learn more about how to maximize the potential of Exasol, or need help implementing this script, please feel free to contact us!
