Elastic Queries with Azure SQL Database and Synapse Analytics Serverless SQL Pools

Recently I have been exploring the option of using an Azure SQL Database as a small-footprint Data Warehouse, and wondered whether I could create an Elastic Query connection to a Serverless SQL Pools database and query a Delta folder in an Azure Data Lake Gen2 account. It can be done successfully by creating an External Table in a Serverless SQL Pools database, then creating the same table as an External Table in the Azure SQL Database. The data within the Azure SQL Database can then be queried alongside the data within the Data Lake, with the Serverless SQL Pools engine doing the work of reading the lake.

The goal here is to query the internal data and external data via a single surface area – the Azure SQL Database.

In this blog scenario we’ll use an example setup: an Azure SQL Database that contains AdventureWorks Dimension and Fact tables with less than 10GB of data overall (not enough data for a Dedicated SQL Pool!), and an Azure Data Lake Gen2 account that contains around 1.2 billion web telemetry events about users’ browsing/purchasing behaviour on the (fictitious) website. This data is stored in Delta format (Parquet files). The goal here is to query this data via a single surface area without moving the web telemetry data, using the Serverless SQL Pools engine to do the “heavy lifting” when querying data in the data lake.

The SQL queries are on GitHub here.

What is Elastic Query?

Elastic Query allows querying of remote tables in other databases using T-SQL. A connection can be made from one Azure SQL Database to another, and T-SQL queries can then be run across the databases. You can read more here.

Elastic Query functionality is included in the cost of the Azure SQL Database and is supported on any tier.
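
Once everything is configured (as we’ll do in the walkthrough below), a remote table can be queried with ordinary T-SQL as if it were local. A minimal sketch, using hypothetical table names:

--query the remote table as if it were local
SELECT TOP 10 *
FROM RemoteSchema.RemoteTable;

--remote tables can also be joined to local tables
SELECT L.ProductName, R.EventType
FROM dbo.LocalProducts L
INNER JOIN RemoteSchema.RemoteTable R ON R.ProductID = L.ProductID;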

Basic Solution

Here, in its simplest form, is the architecture: Power BI (or any SQL query tool) connects only to the Azure SQL Database, which holds both imported data and the External Tables that reference the Serverless SQL Pools database. Serverless SQL Pools itself connects to the underlying data, in Delta format, in Azure Data Lake Gen2.

Solution Steps

The following steps cover the walkthrough: we need to create objects in the Serverless SQL Pools service first, then switch to an Azure SQL Database to complete the Elastic Query setup.

  • Serverless SQL Pools
    • Create a Serverless SQL Pools database
    • Create an External Table over Delta data in Azure Data Lake Gen2
    • Create SQL Login and User in Serverless SQL Pools
    • Grant SELECT on External Table to SQL User
    • Grant REFERENCE on database scoped credential used in External Table to SQL User
  • Azure SQL Database
    • Create Master Key
    • Create a Database Scoped Credential using the SQL User created in Serverless
    • Create External Data Source (URL of Serverless, Credential)
    • Create External Table in SQL Database

Solution Walkthrough

We’ll now step through the process to allow Azure SQL Database to send queries to Serverless SQL Pools and retrieve the results. This walkthrough assumes you have an Azure Synapse Analytics workspace already set up; here is a Getting Started with Serverless SQL Pools article which includes setting up a new Synapse workspace. You will also require an Azure SQL Database. In this scenario, the Synapse workspace and SQL Database are co-located in the same region.

Object Dependencies

The following is a SQL object dependency diagram showing the relationship between each object created.

Serverless SQL Pools Setup

Let’s create a new Serverless SQL Pools database. This step includes creating a data source pointing to the Azure Data Lake Gen2 account, the associated Managed Identity security to allow Serverless SQL Pools to access the data, and an External Table over the Delta data.

Create SQL Objects

The following T-SQL can be run in Synapse Studio on the inbuilt Serverless SQL Pools service.

--create login
USE master;
CREATE LOGIN elasticuser WITH PASSWORD = '<strong_password>';

--create new database
CREATE DATABASE SQLElasticQuery;
GO

USE SQLElasticQuery;
GO

--create objects for connecting to data lake

--Create a schema to hold our objects
CREATE SCHEMA LDW AUTHORIZATION dbo;

--encryption to allow authentication
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong_password>';

--Create a credential using Managed Identity
CREATE DATABASE SCOPED CREDENTIAL DataLakeManagedIdentity
WITH IDENTITY = 'Managed Identity';

--Create a data source for use in queries
CREATE EXTERNAL DATA SOURCE ExternalDataSourceDataLakeUKMI
	WITH (
		LOCATION   = 'https://<storageaccount>.dfs.core.windows.net/<container>',
        CREDENTIAL = DataLakeManagedIdentity
	    );

--Create Delta file format    
CREATE EXTERNAL FILE FORMAT SynapseDeltaFormat
WITH ( 
        FORMAT_TYPE = DELTA
     );

--enable support for UTF8
ALTER DATABASE SQLElasticQuery COLLATE Latin1_General_100_BIN2_UTF8;

Create External Table over Delta

We can now create an External Table over the Delta data in the Data Lake using the Data Source and File Format created in the previous step. The Location specifies the root folder of the Delta data.

--create external table over Delta
CREATE EXTERNAL TABLE LDW.WebTelemetryDelta
(
    UserID varchar(20),
    EventType varchar(100),
    ProductID varchar(100),
    URL varchar(100),
    Device varchar(50),
    SessionViewSeconds int,
    MessageKey varchar(100),
    EventYear int,
    EventMonth int,
    EventDate date
)  
WITH (
    LOCATION = 'cleansed/webtelemetrydeltakey',
    DATA_SOURCE = ExternalDataSourceDataLakeUKMI,  
    FILE_FORMAT = SynapseDeltaFormat
)
GO

Create SQL Objects required for Elastic Query

Once the External Table is created, we can create a User in the database and grant SELECT permissions on the table. We also need to grant access to the Managed Identity credential that Serverless SQL Pools uses to connect to the data lake account (I have an ongoing discussion with Microsoft about the process of securing External Tables…I’ll keep you posted).

--switch to the Serverless database
USE SQLElasticQuery;
GO
--create user from login
CREATE USER elasticuser FROM LOGIN elasticuser;

--grant select on the Delta table
GRANT SELECT ON LDW.WebTelemetryDelta TO elasticuser;

--grant the sql user access to the credential to access the data lake
GRANT REFERENCES ON DATABASE SCOPED CREDENTIAL::[DataLakeManagedIdentity] TO [elasticuser];

Once the objects are in place, we can now switch to the Azure SQL Database and create the necessary objects there.

Azure SQL Database Setup

Once the Serverless SQL Pools objects have been created, we can now create the necessary objects in the appropriate Azure SQL Database. In this scenario there is an Azure SQL Database with existing Dimension and Fact tables. We’ll create a master key, a database scoped credential that mirrors the login in Serverless SQL Pools, then a data source that references the Serverless SQL Pools endpoint.

--run on azure sql database if a master key doesn't exist
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong_password>';

--create a credential with the same details as the login created earlier
--in Serverless SQL Pools
CREATE DATABASE SCOPED CREDENTIAL ElasticDBQueryCredV2
WITH IDENTITY = 'elasticuser',
SECRET = '<strong_password>';

--now create a data source pointing to the Serverless SQL Pools endpoint and database
--use the credential created in the earlier step
CREATE EXTERNAL DATA SOURCE ExternalDataSourceDataLakeUKMIV2 WITH
    (TYPE = RDBMS,
    LOCATION = '<workspacename>-ondemand.sql.azuresynapse.net',
    DATABASE_NAME = 'SQLElasticQuery',
    CREDENTIAL = ElasticDBQueryCredV2
);

Once the security mechanism is in place, we can create an External Table with the same schema as the table in the Serverless SQL Pools database.

--create schema to hold tables
CREATE SCHEMA LDW AUTHORIZATION dbo;
GO

--create external table referencing the data source created earlier
CREATE EXTERNAL TABLE LDW.WebTelemetryDelta
(
    UserID varchar(20),
    EventType varchar(100),
    ProductID varchar(100),
    URL varchar(100),
    Device varchar(50),
    SessionViewSeconds int,
    MessageKey varchar(100),
    EventYear int,
    EventMonth int,
    EventDate date
)  
WITH (
       DATA_SOURCE = ExternalDataSourceDataLakeUKMIV2
);
GO

Now that all the required objects are in place in both the Serverless SQL Pools database and the Azure SQL Database, we can start querying.

Querying

We can run a SELECT statement on the External Table within the Azure SQL Database and observe the results. We’ll also check the Monitor > SQL Requests area in Synapse Studio to see the SQL requests sent to the Serverless SQL Pools service. The following SQL queries were run using SSMS connected to the Azure SQL Database using an Active Directory login.

--show 10 rows of the source data
SELECT TOP 10 *
FROM [LDW].[WebTelemetryDelta]
WHERE EventDate = '2022-03-03';

Let’s run an aggregate query and then check the monitoring area in Synapse Studio.

--aggregate the web telemetry data using eventtype and device
SELECT	
	EventType,
	Device,
	SUM(CAST(SessionViewSeconds AS BIGINT)) AS TotalSessionViewSeconds,
	COUNT(*) AS TotalEventCount
FROM LDW.WebTelemetryDelta
GROUP BY
	EventType,
	Device;

The SQL generated in the Request content field shows that the aggregate query was sent to Serverless SQL Pools successfully.

--SQL generated by running the elastic query from Azure SQL Database
SELECT 
	[Tbl1002].[EventType] [Col1036],
	[Tbl1002].[Device] [Col1037],
	SUM(CONVERT(bigint,[Tbl1002].[SessionViewSeconds],0)) [Expr1003],
	COUNT(*) [Expr1004] 
FROM [LDW].[WebTelemetryDelta] [Tbl1002] 
GROUP BY 
	[Tbl1002].[EventType],
	[Tbl1002].[Device]

If we JOIN an internal and external table in a GROUP BY query, we’ll see that the query sent to Serverless SQL Pools requests every row. This is hugely inefficient as the query is not using the Serverless SQL Pools engine other than to pull the data row-by-row…the Azure SQL Database then has to perform the GROUP BY operation itself!

--run aggregate query
SELECT 
	DP.EnglishProductCategoryName AS ProductCategory,	
	COUNT(WT.MessageKey) AS TotalEventCount
FROM DW.vwProductHierarchy DP
INNER JOIN LDW.WebTelemetryDelta WT ON DP.ProductKey + 400 = WT.ProductID
GROUP BY DP.EnglishProductCategoryName

As we can see, the SQL query sent to Serverless SQL Pools is requesting each row of data.

--generated SQL
SELECT 
    [Tbl1007].[MessageKey] [Col1018],
    CONVERT(int,[Tbl1007].[ProductID],0) [Expr1012] 
FROM [LDW].[WebTelemetryDelta] [Tbl1007]

What we can do instead is SELECT the data from the external table first, then JOIN to the internal table, using a CTE or a temp table (a temp table variant is sketched after the generated SQL below). In this example we’ll use a CTE to hold the results from the Serverless SQL Pools database.

--aggregate product category and name 
;WITH webtelemetry
AS
(
	SELECT
		ProductID,
		COUNT(*) AS TotalEventCount
	FROM LDW.WebTelemetryDelta
	GROUP BY ProductID
)
SELECT 
	DP.EnglishProductCategoryName AS ProductCategory,
	DP.EnglishProductName AS ProductName,
	SUM(WT.TotalEventCount) AS TotalEventCount
FROM DW.vwProductHierarchy DP
INNER JOIN webtelemetry WT ON DP.ProductKey + 400 = WT.ProductID
GROUP BY 
	DP.EnglishProductCategoryName,
	DP.EnglishProductName

--aggregate product category
;WITH webtelemetry
AS
(
	SELECT
		ProductID,
		COUNT(*) AS TotalEventCount
	FROM LDW.WebTelemetryDelta
	GROUP BY ProductID
)
SELECT 
	DP.EnglishProductCategoryName AS ProductCategory,	
	SUM(WT.TotalEventCount) AS TotalEventCount
FROM DW.vwProductHierarchy DP
INNER JOIN webtelemetry WT ON DP.ProductKey + 400 = WT.ProductID
GROUP BY DP.EnglishProductCategoryName

If we look at the SQL generated and run on the Serverless SQL Pools database, we can see an aggregate query being run, which is the desired result.

--SQL generated and run on Serverless SQL Pools database
SELECT 
    [Expr1009],
    CONVERT(int,[Col1032],0) [Expr1015] 
FROM (
        SELECT [Tbl1008].[ProductID] [Col1032],
        COUNT(*) [Expr1009] 
        FROM [LDW].[WebTelemetryDelta] [Tbl1008] 
        GROUP BY [Tbl1008].[ProductID]
        ) Qry1033
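
As mentioned above, a temp table can be used in place of the CTE to materialise the aggregated results in the Azure SQL Database before the join. A sketch using the same tables as the examples above (the aggregate should still be pushed down to Serverless SQL Pools, though it’s worth verifying in the Monitor area):

--aggregate the external table into a temp table first
SELECT
	ProductID,
	COUNT(*) AS TotalEventCount
INTO #webtelemetry
FROM LDW.WebTelemetryDelta
GROUP BY ProductID;

--join the internal dimension table to the aggregated temp table
SELECT
	DP.EnglishProductCategoryName AS ProductCategory,
	SUM(WT.TotalEventCount) AS TotalEventCount
FROM DW.vwProductHierarchy DP
INNER JOIN #webtelemetry WT ON DP.ProductKey + 400 = WT.ProductID
GROUP BY DP.EnglishProductCategoryName;

DROP TABLE #webtelemetry;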

Using Elastic Query to load Azure SQL Database

We also have the option of using the Elastic Query link to INSERT INTO…SELECT data from the Serverless SQL Pools External Table into an internal table in the Azure SQL Database. Perhaps aggregated web telemetry data could be useful sitting alongside the already-imported Dimension and Fact tables.
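
A minimal sketch of this approach, assuming a (hypothetical) summary table DW.WebTelemetrySummary is created to hold the results:

--hypothetical internal table to hold aggregated telemetry
CREATE TABLE DW.WebTelemetrySummary
(
	ProductID varchar(100),
	TotalEventCount bigint
);

--load aggregated results from the Serverless SQL Pools external table
INSERT INTO DW.WebTelemetrySummary (ProductID, TotalEventCount)
SELECT
	ProductID,
	COUNT_BIG(*) AS TotalEventCount
FROM LDW.WebTelemetryDelta
GROUP BY ProductID;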

Conclusion

In this blog post we’ve looked at how to set up an Elastic Query between an Azure SQL Database and a Serverless SQL Pools database to allow querying of data in an Azure Data Lake Gen2 account. This gives us the option of holding data internally in an Azure SQL Database while using the Serverless SQL Pools engine to process data in the Data Lake.