Using Aggregate Functions With MySQL Document Store
MySQL MySQL Document Store Aggregate Functions
MySQL Document Store is a JSON document storage solution built on MySQL. One of the advantages of using MySQL Document Store for JSON document storage is when we need to run complex queries for analytics or reporting we can use raw SQL to retrieve data from our JSON documents. But what if we want to view basic aggregate data, such as average, minimum, maximum, etc.? Can we get that data while still using the Document Store API? Yes, we can, and in this post, we will show you how it can be done.
Simple vs. Complex Values
We can only aggregate values from properties where all the values are ‘simple’. When we talk about ‘simple’ property values, we mean properties that contain simple values, like numbers or strings. In other words, these properties do not have values that are arrays or objects. Let’s take a look at an example.
score are simple values because they only contain strings or numbers. The property
course has a complex value because its value is an object.
Aggregating Root Properties
When using the MySQL Document Store API, we can aggregate data using the
groupBy() methods. Let’s look at this example to get the average score for a single user (who happens to be me).
db.scores.find("lastName = 'Stroz' and year(date) = 2022").fields(['firstName', 'lastName']).groupBy(['lastName', 'firstName'])
By examining our call to
find(), we see that we are filtering our results where the lastName property is equal to ‘Stroz’ and the date was from 2022. If we do not wish to return all the properties of a document, we can use
fields() to specify which properties we want to return. In this example, we are returning just the
lastName properties (we will add more fields in the next section). Also, the call to
groupBy() shows we want to group our results by the
firstName properties. When we run this command, we will see the following results.
If we wanted to get the average score, we would use
avg() in the call to
fields(). In this example, we also use
round() to round the result to two decimal places.
db.scores.find("lastName = 'Stroz' and year(date) = 2022").fields(['firstName', 'lastName', 'round(avg(score),2) as avg_score']).groupBy(['lastName', 'firstName'])
Executing this command gives us the following results:
If we wanted to add a property that counted the number of scores, the command would look like this:
db.scores.find("lastName = 'Stroz' and year(date) = 2022").fields(['firstName', 'lastName', 'round(avg(score),2) as avg_score', 'count(score) as score_count']).groupBy(['lastName', 'firstName'])
We can see in the results that I had 20 scores for 2022.
If we wanted to find the lowest and highest scores, we would use
max(), respectively. However, because of an issue with how Document Store uses JSON functions on the backend, we need to massage the data a little (The problem has been reported). The issue is that the value returned from the calls to
max() return strings instead of numbers. This may not be an issue, but if we compared
32 as strings,
123 would be returned by
The way we massage this data is to use
cast() to cast the value as an unsigned integer inside of
max()). Don’t worry. Even when this issue is addressed, using code like the sample below will still work.
db.scores.find("lastName = 'Stroz' and year(date)=2022").fields(['firstName', 'lastName', 'round(avg(score),2) as avg_score', 'count(score) as round_count', 'min(cast(score as unsigned)) as low_score', 'max(cast(score as unsigned)) as high_score']).groupBy(['lastName', 'firstName'])
When we run this command, we will see the results below.
Note: You may have noticed that the properties are not returned in the same order specified in
fields(). This is due to how the data is stored in the database table.
Aggregating Array Data
If we want to aggregate data that is in an array, we usually would need to use raw SQL to extract those values using
json_table() and aggregate them. Remember, the reason behind this post is to show how we can handle this by using the Document Store API.
The New Data Schema
To demonstrate how we would handle this, we need to use a different set of data. Instead of each document containing the name and score information, we will use documents where the scores are stored in an array for each person. Take a look at the sample schema below.
In this new schema, we see that each document contains properties with simple values for
lastName and a property named
scores that contains an array of objects. Each object has two properties with simple values,
When researching methods of getting the data from an array, as we have above, I could not find a combination that worked inside the call to
fields() the way I wanted. I even tried using a subquery in fields that used
While we cannot do this natively in the API, we can use user-defined functions to give us the functionality we need and use them as we did in the above examples.
Creating the User-Defined Function
To get started, I created the user-defined function below.
DELIMITER $$ CREATE FUNCTION `JSON_ARRAY_AVG`(arr JSON) RETURNS float DETERMINISTIC begin declare val float; select avg(ar.item) into val from json_table( arr, '$[*]' columns( item float path '$' ) ) ar; return val; end$$ DELIMITER ;
Let’s break down the syntax of the SQL command.
DELIMITER $$- This command tells MySQL that we want to change the command delimiter. We need to do this because we have commands in the function’s body that need to be ended with a semicolon.
(arr JSON)- This line tells MySQL that we are creating a function named
JSON_ARRAY_AVG, and the function expects one argument that should be of type
RETURNS float DETERMINISTIC- This line defines what data type is returned by the function and that the value is
- A function is deterministic if it always produces the same result for the same input parameters.
begin- This line starts the function definition.
declare val float;- This line declares a variable that will be used in the function and specifies its data type as
select avg(ar.item) into val from json_table( arr, '$[*]' columns(item float path '$')) ar;- This line does all the heavy lifting. We will break this down into smaller pieces.
select avg(ar.item) into val- sets the value of a variable named
valto the result of the call to
from json_table( arr, '$[*]' columns(item float path '$')) ar- uses
json_table()to put the values in our array into a virtual table that can be used in a query.
arris the array we passed into the function.
'$[*]specifies that we are using all the values in the root of the array.
- This function expects to get an array of simple values. We will see how to get that data in a bit.
columns(item float path '$')defines the columns in our virtual table.
- This example has a single column named
item, a float data type.
pathto this value is the root of the current element of the array, which is identified with
- This example has a single column named
- Whenever we use
json_table()we must use an alias. In this example, the alias name is
return val;- This line returns the value of our variable named
end$$- This line specifies the end of our function. We use
$$at the end because this is the new delimiter we specified.
DELIMITER ;- This line returns the delimiter to a semicolon.
Using Our Function
Now that we have our user-defined function created to get the average of values in the array passed to the function, let’s put it to work. Here is a command that uses this function in the call to
fields(). I’ll break down the syntax below.
db.scores2.find("lastName = 'Stroz'").fields(['firstName', 'lastName', 'round(json_array_avg(scores[*].score),2) as avg_score'])
The call to
find() shows we want documents where the
lastName property is equal to ‘Stroz’. In the call to
fileds(), we can see that we want to return the
lastName properties and that we want to return the rounded result of our call to
json_array_avg(). Here are the results from the command above:
Note that what we pass to
scores[*].score is a JSON path. This path tells MySQL to use all the elements of the scores array but only return values for the
score property. The values will be populated into a JSON array which can be used by
json_array_avg(). If we had specified the path as
scores[*], what would be passed to
json_array_avg() would be the full array of score objects.
We can create similar user-defined functions to return the lowest and highest scores.
When using the MySQL Document Store API, we can specify the results of MySQL functions in the
fields() method. We can use aggregate functions such as
avg() to return the average of simple values in the document root. To return this same value for properties stored in an array in our document while still using the Document Store API, we need to create user-defined functions to handle aggregating the data for us. Yes, we still need to write our own SQL commands to define the user-defined functions, but at least we can use those functions in the API.
P.S. - For those who may not understand why I chose the header image for this post: The rock around railroad ties is called ‘aggregate’.