Efficient Data Analytics with Two Simple Vectorization Techniques

One of the most prominent benefits of code vectorization is the elimination of repeated operations on a data set. For instance, categorizing and analyzing data through the use of a loop to iterate over each element in the data set may be feasible for small data sets, but it will quickly become costly in terms of time and computing resources. Instead, you can leverage the inherent properties of matrix operations and built-in functions to perform operations across specific dimensions of the data set, and logical operators to efficiently categorize data in an entire set at once. These two techniques along with a few others are introduced in this blog post.

In order to illustrate the power of these two vectorization techniques, let us consider a simple example (code provided for Matlab). Keep in mind that this is an example meant to cover the basic concepts and thought processes behind vectorization and only scratches the surface of its possibilities.

Now, let's get to it.

Imagine you are the owner of a department store with 500 locations nationwide. Your Point of Sale (PoS) system generates transaction data for every second between 8am and 8pm, and categorizes your products into 20 individual categories. At the end of each day, this data is stored in a database and you can extract this data into Matlab in the form of a 43200 x 20 x 500 matrix, where each element represents the total sales for a particular second (dimension 1), in a particular category (dimension 2) and at a particular store (dimension 3). Imagine now that once the system has generated this data, you want to extract the following statistical data, each day:

The total sales per second and per category, across all stores
The total sales per second and per store, across all categories
The percentage of sales, in each store, above $1, $5 and $10

Let's start by creating a matrix of random sales transactions, per second, between $0 and $15 as illustrated in the figure below.

Creating a matrix of sales transaction per second (dimension 1), category (dimension 2) and store (dimension 3)

sales = single(rand([12*3600,20,500])*15);

Loop Method

Without vectorization, we can initialize the output matrices to the correct dimensions to avoid dynamic memory allocation. You will notice that I also include the tic and toc functions to time the code execution.

tic
tot_sales_pcat_psec_1 = zeros(size(sales,1),size(sales,2));
tot_sales_pstore_psec_1 = zeros(size(sales,1),size(sales,3));
perc_sales_lvl3_1 = zeros(size(sales,3),1);
perc_sales_lvl2_1 = zeros(size(sales,3),1);
perc_sales_lvl1_1 = zeros(size(sales,3),1);

Next, we can set up the nested for-loops to iterate over each transaction. Since the aim is to compute the sales percentages per store, I decided to make the outermost for-loop one across the third dimension of the input matrix. The order of the other two for-loops, in this case, is irrelevant. Hence, I decided to tackle each of the dimensions in order of dimension, and create three nested for-loops across all stores, categories, and seconds. Inside the inner most for-loop we can digest each of the transactions by adding the transaction amount to two matrices:

to the row corresponding to the loop variable seconds and column corresponding to the loop variable categories of the matrix which represents the total sales per category for each second
to the row corresponding to the loop variable seconds and column corresponding to the loop variable stores of the matrix which represents the total sales per store for each second

Next, we can categorize the current transaction based on its amount:

Level 1: all transaction amounts greater than $1
Level 2: all transaction amounts greater than $5
Level 3: all transaction amounts greater than $10

Finally, at the end of each iteration of the outer-most for-loop across the stores, we can compute the percentages corresponding to each of the transaction levels, by dividing number of transactions accumulated in each level "bucket" across the inner two for-loops by the total number of transactions recorded for that store. This will provide to us the output to all three objectives above.

for stores = 1:size(sales,3)
    for categories = 1:size(sales,2)
        for seconds = 1:size(sales,1)
            tot_sales_pcat_psec_1(seconds,categories) =
            tot_sales_pcat_psec_1(seconds,categories)
            +sales(seconds,categories,stores);
            
            tot_sales_pstore_psec_1(seconds,stores) =
            tot_sales_pstore_psec_1(seconds,stores)
            +sales(seconds,categories,stores);
            if sales(seconds,categories,stores)>1
                perc_sales_lvl1_1(stores) =
                perc_sales_lvl1_1(stores)+1;
            end
            if sales(seconds,categories,stores)>5
                perc_sales_lvl2_1(stores) =
                perc_sales_lvl2_1(stores)+1;
            end
            if sales(seconds,categories,stores)>9
                perc_sales_lvl3_1(stores) =
                perc_sales_lvl3_1(stores)+1;
            end
        end
    end
    perc_sales_lvl1_1(stores) =
    (perc_sales_lvl1_1(stores)/numel(sales(:,:,stores)))*100;
    perc_sales_lvl2_1(stores) =
    (perc_sales_lvl2_1(stores)/numel(sales(:,:,stores)))*100;
    perc_sales_lvl3_1(stores) =
    (perc_sales_lvl3_1(stores)/numel(sales(:,:,stores)))*100;
end
t_loop = toc;

Vectorized Method

As promised, you can achieve the same outcome in just 6 lines and a much faster execution time through two simple vectorization concepts:

Performing dimensional operations to compute, for instance, the sum across an entire dimension of the input matrix at once
Using logical operators to create a matrix of boolean values on which operations can be performed

Let's tackle the first objective:

The total sales per second and per category, across all stores

The key to the first objective is "across all stores", meaning that we wish to end up with a matrix of dimension 43200 x 20, where each element is the sum across the third dimension (the stores). For instance, element (1,1) in the output matrix is the sum of (1,1,1)+(1,1,2)+(1,1,3)+...+(1,1,500) of the input matrix, sales. An easier way to remember this is that, in Matlab, the size of the output variable of sum along the dimension specified in the function call will be 1, whereas the size of remaining dimensions will be equal to their respective sizes in the input matrix to the function. In this case, specifying dimension 3 in the sum function call will generate a matrix of size 43200 x 20 x 1, which is what we are looking for. Therefore, we can achieve the first objective by writing:

tot_sales_pcat_psec_2 = sum(sales,3);

Now, let's look once more at the second objective:

The total sales per second and per store, across all categories

This is very similar to the first objective, but now with respect to all categories (i.e. the second dimension). Since we want to end up with a matrix of size 43200 x 500, we can write this as:

tot_sales_pstore_psec_2 = sum(sales,2);

And finally, let's take another look at the third objective:

The percentage of sales, in each store, above $1, $5 and $10

As in the outer-most for-loop above, we know that we will have to compute the percentage relative to the number of elements in each page of the 3D input matrix, sales. In order not to re-compute this three times, let's compute the total number of elements by which we will divide, as the product of dimensions 1 and 2, and store it in a variable.

tot_num_el = prod([size(sales,1),size(sales,2)]);

Computing the percentage of sales per second above each threshold will consist of three parts:

categorizing the transactions
computing the total number of transactions per store above each threshold
dividing by tot_num_el

To categorize the transactions, we can use the greater than logical operator to create a matrix of boolean values indicating whether a transaction exceeds the threshold value or not. When computing the total number of transactions exceeding the threshold, we know that we wish to end up with a 1-dimensional matrix of length equal to the number of stores. Hence, when we take the sum we would like the size of dimensions 1 and 2 in the output to be equal to 1. That is, we would like an output matrix of size 1 x 1 x 500. Therefore, we call the sum function with the boolean matrix as input, and specify to compute the sum across dimensions 1 and 2. Finally, we can divide the output of the sum function call by the total number of elements in each page and multiply by 100 to get the percentage. In Matlab, this can be written as follows:

perc_sales_lvl3_2 = (sum(sales>9,[1 2])/tot_num_el)*100;
perc_sales_lvl2_2 = (sum(sales>5,[1 2])/tot_num_el)*100;
perc_sales_lvl1_2 = (sum(sales>1,[1 2])/tot_num_el)*100;

To summarize, the complete vectorized solution to this problem would look like this:

tic
tot_sales_pcat_psec_2 = sum(sales,3);
tot_sales_pstore_psec_2 = sum(sales,2);
tot_num_el = prod([size(sales,1),size(sales,2)]);
perc_sales_lvl3_2 = (sum(sales>9,[1 2])/tot_num_el)*100;
perc_sales_lvl2_2 = (sum(sales>5,[1 2])/tot_num_el)*100;
perc_sales_lvl1_2 = (sum(sales>1,[1 2])/tot_num_el)*100;
t_vec = toc;

Comparison of Both Methods

In order to compare the performance and equivalency of both methods, we can use Matlab's tic-toc and isequal functionalities and print their respective outputs.

logicalstr = {'FALSE','TRUE'};
fprintf('RUN TIME:\nNon-vectorized solution: %.5fs\n
        Vectorized solution: %.5fs\n\n',t_loop,t_vec)
fprintf('OUTPUT COMPARISON\n')
fprintf('Total sales per store: %s\n',
logicalstr{isequal(tot_sales_pstore_psec_1(:),tot_sales_pstore_psec_2(:))+1})
fprintf('Total sales per category: %s\n',
logicalstr{isequal(tot_sales_pcat_psec_1(:),tot_sales_pcat_psec_2(:))+1})
fprintf('Percentage sales above thresholds: %s\n',
logicalstr{(isequal(perc_sales_lvl3_1(:),perc_sales_lvl3_2(:))&isequal(perc_sales_lvl2_1(:),perc_sales_lvl2_2(:))&isequal(perc_sales_lvl1_1(:),perc_sales_lvl1_2(:)))+1})

Try it out for yourself, and you will see that, while the output from each of the two methods is identical, the execution time of the vectorized solution is about 20 to 30 times faster and will take less than a second. On the other hand, the non vectorized solution will take on the order of 20 seconds to execute. Imagine that this data is analyzed every day, for a year, then the non-vectorized solution would result in a total execution time of around 2 hours, whereas the vectorized solution computes the same output in under 5 minutes.

Are you trying to solve a similar problem? Wondering where to go from here?

Let's get in touch!

Efficient Data Analytics with Two Simple Vectorization Techniques

Loop Method

Vectorized Method

Comparison of Both Methods

Recent Posts

ความคิดเห็น