Mastering DP-500 Exam: Querying JSON Data in Serverless SQL Pools

1

This blog forms part of the DP-500 blog series on this site here and also from Data Mozart’s blog here.

Summary

Part of the DP-500: Designing and Implementing Enterprise-Scale Analytics Solutions Using Microsoft Azure and Microsoft Power BI (official link here) is understanding how to query complex data types including JSON data types. In this blog we’ll look at the SQL functions we can use to query JSON data using Azure Synapse Serverless SQL Pools. We’ll look at an example JSON file then use this file to query using Serverless SQL Pools and JSON functions.

In this blog we’ll look at how to query JSON files using the SQL functions JSON_QUERY and JSON_VALUE and highlighting the differences in these functions

The example SQL scripts in this blog are being run on a Synapse Analytics Serverless SQL Pools database, please refer to the Getting Started with Azure Synapse Analytics SQL Serverless blog if you are new to Serverless SQL Pools.


Example JSON File

In the following example JSON file, saved as orders.json to a container & folder in an Azure Data Lake Gen2 account, there are a number of different JSON structures such as key/value pairs (orderId, orderDate, orderSource), objects (orderDetails) and arrays (deliveryAddress).

Each order is an object that contains key/value pairs to describe the order, then nested objects and arrays to describe the details of the order.

{
	"orderId": 4567,
	"orderDate": "2022-04-28",
	"orderSource": "webstore",
	"customerDevice": "android",
	"orderDetails": [
		{
			"productID": 608,
			"orderQuantity": 5
		},
		{
			"productID": 710,
			"orderQuantity": 3
		}
	],
	"deliveryAddress": ["London", "England"]				
}
{
	"orderId": 4568,
	"orderDate": "2022-04-29",
	"orderSource" : "webstore",
	"customerDevice": "windows",
	"orderDetails": [
		{
			"productID": 780,
			"orderQuantity": 9
		},
		{
			"productID": 690,
			"orderQuantity": 6
		}
	],
	"deliveryAddress": [ "King Street", "Newcastle", "England"]
}

Show Basic JSON Data as Rows

We can query the JSON documents and retrieve each object as a row. We need to specify the FIELDQUOTE and FIELDTERMINATOR values in the OPENROWSET command, in this case we are using default values of 0x0b (as specified here). We need to pay attention to the ROWTERMINATOR value and ensure the correct hex value is used. In the example below, 0x0D hex value is used as each object is separated by a carriage return (list of hex values here).

SELECT 
    jsonContent
FROM
    OPENROWSET(
        BULK 'https://<storage>.dfs.core.windows.net/raw/json/orders.json',
        FORMAT = 'CSV',
        FIELDQUOTE = '0x0b',
        FIELDTERMINATOR ='0x0b',
        ROWTERMINATOR = '0x0D'
    )
    WITH (
        jsonContent VARCHAR(500)
    ) AS r

In the results, you can see each JSON object is being displayed on a separate row.


Using JSON_VALUE to extract scalar values

We can use JSON_VALUE to extract scalar values from a JSON document. In the following SQL we can refer to an attribute in the JSON document by declaring the JSON content in the WITH statement, then passing this into the JSON_VALUE function with the attribute we want.

Note that to retrieve an object or array value using JSON_VALUE we need to specify an ordinal position, for the first productId we therefore need to specify $.orderDetails[0].productID. Also note that we do not need the ROWTERMINATOR value in the OPENROWSET command as we are now parsing the JSON document.

SELECT 
    JSON_VALUE(jsonContent, '$.orderId') AS orderId,
    JSON_VALUE(jsonContent, '$.orderDate') AS orderDate,
    JSON_VALUE(jsonContent, '$.orderSource') AS orderSource,
    JSON_VALUE(jsonContent, '$.customerDevice') AS customerDevice,
    JSON_VALUE(jsonContent, '$.orderDetails[0].productID') AS productID
FROM
    OPENROWSET(
        BULK 'https://<storage>.dfs.core.windows.net/raw/json/orders.json',
        FORMAT = 'CSV',
        fieldterminator ='0x0b',
        fieldquote = '0x0b'
    )
    WITH (
        jsonContent VARCHAR(500)
    ) AS r;

In the results you can see each key/value attribute is displayed, plus the first productID in the orderDetails object.


Using JSON_QUERY to extract objects and arrays

If we wish to retrieve all object values and array values then we need to use JSON_QUERY. In the following SQL we are retrieving all the order details information by passing in the JSON content into the JSON_QUERY function and specifying the orderDetails object and the deliveryAddress array.

SELECT 
    JSON_QUERY(jsonContent,'$.orderDetails') AS orderDetailsObject,
    JSON_QUERY(jsonContent,'$.deliveryAddress') AS deliveryAddressArray    
FROM
    OPENROWSET(
        BULK 'https://<storage>.dfs.core.windows.net/datalakehouseuk/raw/json/orders.json',
        FORMAT = 'CSV',
        FIELDTERMINATOR ='0x0b',
        FIELDQUOTE = '0x0b'
    )
    WITH (
        jsonContent VARCHAR(500)
    ) AS r;

In the results below we have retrieved the orderDetails object and also the deliveryAddress array.

We can combine both JSON_VALUE and JSON_QUERY in the same query to retrieve the scalar values of the orderID and orderDate for each JSON object plus the orderDetails object and deliveryAddress array.

SELECT 
    JSON_VALUE(jsonContent,'$.orderId') AS orderId,
    JSON_VALUE(jsonContent,'$.orderDate') AS orderDate,
    JSON_QUERY(jsonContent,'$.orderDetails') AS orderDetailsObject,
    JSON_QUERY(jsonContent,'$.deliveryAddress') AS deliveryAddressArray
FROM
    OPENROWSET(
        BULK 'https://<storage>.dfs.core.windows.net/datalakehouseuk/raw/json/orders.json',
        FORMAT = 'CSV',
        FIELDTERMINATOR ='0x0b',
        FIELDQUOTE = '0x0b'
    )
    WITH (
        jsonContent VARCHAR(500)
    ) AS r;

In the results you can see both the scalar value columns (key/value in JSON object) and the orderDetails object and deliveryAddress array for each order.


Using OPENJSON to explode objects into rows

We can also use OPENJSON to explode out objects into rows, we do this by using APPLY to apply each object to an element, E.G. for each orderId, explode all the products ordered within that orderId.

In the following example, we query the JSON file and use CROSS APPLY to explode the orderDetails object to show the individual values of the products ordered for each orderId.

SELECT 
    orderId,
    orderDate,
    orderSource,
    productID,
    orderQuantity
FROM
    OPENROWSET(
        BULK 'https://<storage>.dfs.core.windows.net/raw/json/orders.json',
        FORMAT = 'CSV',
        FIELDTERMINATOR ='0x0b',
        FIELDQUOTE = '0x0b'
    )
    WITH (
        jsonContent varchar(MAX)
    ) AS r
    CROSS APPLY openjson(jsonContent)
    WITH   
        (   orderId int '$.orderId',
            orderDate date '$.orderDate',
            orderSource varchar(20) '$.orderSource',
            orderDetails nvarchar(max) '$.orderDetails' AS JSON)
    CROSS APPLY openjson(orderDetails)
    WITH
        ( productId int '$.productID',
          orderQuantity int '$.orderQuantity')

In the results you can see that now each key/value pairs from the document can be displayed in a table.


Querying Nested Data in Parquet Files

We can also apply the same JSON functions listed above to nested/JSON data that is stored within columns in Parquet files. Let’s look at 2 examples. The first example is a single column where each JSON document is stored (each order is a single JSON document). The second example has both columns with values and a column with JSON data in.

Example 1

In this example, we have a single column in a Parquet file which stores JSON data.

We can use the same functions to perform the same functionality: JSON_VALUE to extract scalar values, JSON_QUERY to extract objects and arrays.

SELECT 
    JSON_VALUE(JSONOrderData, '$.orderId') AS orderId,
    JSON_VALUE(JSONOrderData, '$.orderDate') AS orderDate,
    JSON_VALUE(JSONOrderData, '$.deliveryAddress.city') AS deliveryCity,
    JSON_VALUE(JSONOrderData, '$.orderDetails[0].productID') AS productID,
    JSONOrderData
FROM
    OPENROWSET(
        BULK 'https://<storage>.dfs.core.windows.net/raw/parquet/orders.parquet',
        FORMAT='PARQUET'
    ) AS r;

In the results we can see the values have been extracted from the JSON data in the Parquet column, plus we’re also showing the source JSON column for reference.

Example 2

In this example, we have a mixture of column types with the basic order information such as orderId, orderDate etc stored in separate columns, and orderDetails stored as JSON in a column.

We can again use OPENJSON and CROSS APPLY to extract both the separate columns and also the JSON column.

SELECT
   orderId,
   orderDate,
   orderSource,
   productId,
   orderQuantity
FROM
    OPENROWSET(
        BULK 'https://<storage>.dfs.core.windows.net/raw/json/parquet/orders.parquet',
        FORMAT='PARQUET'
    ) AS arrays
    CROSS APPLY OPENJSON (orderDetails) 
    WITH ( 
            productId int '$.productID',
            orderQuantity int '$.orderQuantity') as array_values

From the results we can see both the individual columns and the attributes in the JSON column are shown in table format.


References

1 thought on “Mastering DP-500 Exam: Querying JSON Data in Serverless SQL Pools

Leave a Reply

Your email address will not be published. Required fields are marked *