Recommendation engine to boost Starbucks offers

Capstone Project for Udacity’s Data Scientist Nanodegree.

1. Introduction

The true secret to the smooth coordination of Starbucks’ business lies hidden in a huge amount of data. Through strong data processing, Starbucks can understand the needs and wishes of each of its clients and thus recommend the right offer to each of them to boost its coffee sales.

However, the greatest challenge in increasing client engagement through data analysis is to develop a recommendation engine capable of recognising the best offer to send to each client. Therefore, this post proposes a feasible approach to this problem, within the scope of the Udacity Data Scientist Nanodegree’s capstone project.

A. Project Definition

The recommendation engine was built in Python in a Jupyter Notebook, using machine learning algorithms and data visualisation strategies. For further details, the entire data analysis and the engine’s code are available in this GitHub repository.

The development of this engine followed three steps. The first step was to understand and clean the available data to prepare them for further analysis. With the well-formatted data, an in-depth analysis was conducted to build three different machine learning models that predict the most suitable offer to send to a client. Finally, the last step was to compare the performance of these three models based on the mean squared error. The model with the lowest mean squared error is the most suitable to operate as a recommendation engine.

B. Metrics
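
The models in this post are compared through the mean squared error (MSE), i.e. the average of the squared differences between the observed and the predicted net revenue. A minimal sketch of this metric is given below; the function name and the example values are illustrative and are not taken from the project code.

```python
import numpy as np


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))


# Hypothetical net revenues (in dollars) and the corresponding predictions.
observed = [10.0, 0.0, 25.0]
predicted = [12.0, 1.0, 20.0]

# Squared errors are 4, 1 and 25, so the mean is 10.0.
mse = mean_squared_error(observed, predicted)
```

The lower the MSE, the closer the predicted net revenue is to the observed one.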

2. Data Exploration and visualization

  1. Portfolio — dataset describing the characteristics of each offer type, including its offer type, difficulty, and duration.
  2. Profile — dataset containing information regarding customer demographics including age, gender, income, and the date they created an account for Starbucks Rewards.
  3. Transcript — dataset containing all the instances when a customer made a purchase. Moreover, it indicates when the client viewed, received, and completed an offer.

These three datasets (i.e. portfolio, profile and transcript) were preprocessed to determine which demographic group responds best to which offer type. The preprocessed data were then used to train different machine learning models that estimate the net revenue of an offer.

A. Portfolio dataset

As shown in the graph below, the ten offers are categorized into three types: bogo, discount and informational. There are 4 bogo offers, 4 discount offers, and only 2 informational offers.

The difficulty, duration and reward of each offer are shown in the graph below. The difficulty and reward of “bogo” offers (short for ‘buy one, get one’) are the same. Additionally, “informational” offers have difficulty and reward equal to zero. Notably, one of the discount offers has the highest difficulty: the client has to spend at least $20 to receive a discount of $5.

B. Profile dataset

Fig. 2: A part of the client profile dataset.

C. Transcript dataset

As illustrated in the figure above, the Transcript dataset indicates when an offer was received, viewed and completed by the client. It also records the total money spent during a transaction. However, the structure of this dataframe is not straightforward to analyse. The data are organised in dictionaries, which makes it difficult to extract key information from them.
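
One possible way to unpack such dictionaries with Pandas is sketched below. The column names (person, event, time, value) and the dictionary keys (‘offer id’, ‘offer_id’, ‘amount’, ‘reward’) are assumptions about the raw layout, and the rows are toy data rather than real records.

```python
import pandas as pd

# Toy rows mimicking the transcript layout; column and key names are assumptions.
transcript = pd.DataFrame({
    "person": ["c1", "c1", "c1"],
    "event": ["offer received", "transaction", "offer completed"],
    "time": [0, 6, 12],
    "value": [
        {"offer id": "o1"},
        {"amount": 9.5},
        {"offer_id": "o1", "reward": 2},
    ],
})

# Expand each dictionary into its own columns (missing keys become NaN).
details = pd.DataFrame(transcript["value"].tolist(), index=transcript.index)

# Merge the two assumed spellings of the offer key into a single column.
details["offer_id"] = details["offer_id"].fillna(details["offer id"])
details = details.drop(columns="offer id")

# Flat table: one column per piece of information instead of a dictionary.
flat = pd.concat([transcript.drop(columns="value"), details], axis=1)
```

Once flattened, each event can be filtered and aggregated with ordinary column operations.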

The entire Transcript dataset is divided into ‘offer completed’, ‘offer received’, ‘offer viewed’, and ‘transaction’ events, as shown in the graph below. There are many more ‘transaction’ events than events of any other category. On average, each client has 18 historic events in the database.

3. Methodology to build the machine learning recommendation engines

  • Data preprocessing — prepare and clean the data to be used in the machine learning models.
  • Implementation — build and train three machine learning models using Random Forest Regressor and Linear Regression.

A. Data preprocessing

Notably, the Net Revenue (NR) is calculated from the total money spent by the client, the offer reward, and the cost of advertisement, as expressed in the formula underneath.

In other words, the NR is calculated from the difference between the total income (I) and the total expenses related to this offer (R+C). The total income (I) represents the total money spent by the client between two offers. It is assumed that viewing an offer will impact the client’s future behaviour up to the receipt of another offer. Therefore, the total money spent between the viewing of an offer and the receipt of the next offer defines the total income (I).

On the other hand, the total offer expense is the sum of the cost of advertisement (C) and the reward received by the client (R). The reward is only subtracted from the net revenue (NR) if the client has completed the offer (Delta_completed); otherwise, Starbucks does not need to give the reward (R) to the client. In this study, the cost of advertisement is set to $0.1.
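
Written out from the description above, the net revenue reads NR = I - (R * Delta_completed + C), where Delta_completed equals 1 if the offer was completed and 0 otherwise. The sketch below turns this into a small function; the function name, the argument names and the example amounts are illustrative assumptions.

```python
AD_COST = 0.1  # cost of advertisement (C) in dollars, as stated in the text


def net_revenue(income, reward, completed, ad_cost=AD_COST):
    """NR = I - (R * Delta_completed + C)."""
    delta_completed = 1 if completed else 0
    return income - (reward * delta_completed + ad_cost)


# Hypothetical client who spent $30 and completed an offer with a $5 reward.
nr_completed = net_revenue(income=30.0, reward=5.0, completed=True)

# Hypothetical client who spent nothing: only the advertisement cost is lost.
nr_ignored = net_revenue(income=0.0, reward=5.0, completed=False)
```

In the second case the net revenue is negative, since the advertisement cost is paid even when the client does not react to the offer.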

i. Preprocessing of the portfolio dataset

Therefore, by using the get_dummies method of the Pandas library, the dummy columns can be easily created, as shown in the figures below.

Fig. 1: Result of the preprocessing of portfolio dataset.
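
For readers unfamiliar with get_dummies, the sketch below applies it to a toy portfolio. The column names and the three rows are assumptions chosen to match the offer types described in this post, not the actual dataset.

```python
import pandas as pd

# Toy portfolio with the three offer types named in this post; values are illustrative.
portfolio = pd.DataFrame({
    "offer_type": ["bogo", "discount", "informational"],
    "difficulty": [10, 20, 0],
    "reward": [10, 5, 0],
})

# One 0/1 indicator column per offer type.
type_dummies = pd.get_dummies(portfolio["offer_type"], dtype=int)

# Replace the text column by its indicator columns.
portfolio_encoded = pd.concat(
    [portfolio.drop(columns="offer_type"), type_dummies], axis=1
)
```

Each categorical value becomes a numeric column, which is the form expected by the regression models used later.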

ii. Preprocessing of the profile dataset

The distribution of clients’ income by gender can be visualized in the graph below. Women have an average income about 16% higher than men.

Additionally, female clients are on average about 3 years older than male clients, as can be observed in the graph below.

Looking at the number of days each of these groups has held a Starbucks account, most clients are relatively new, with fewer than 1500 days of membership. On the other hand, very few clients are long-standing members with more than 2000 days of subscription.
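
The number of membership days can be derived from the account creation date. The sketch below assumes that this date is stored as an integer in YYYYMMDD format in a column named became_member_on, and it uses an arbitrary reference date; both are assumptions made for illustration only.

```python
import pandas as pd

# Toy profile rows; the column name and the YYYYMMDD integer format are assumptions.
profile = pd.DataFrame({"became_member_on": [20170715, 20180101]})

# Parse the integer dates into proper timestamps.
joined = pd.to_datetime(profile["became_member_on"].astype(str), format="%Y%m%d")

# Hypothetical reference date from which the membership duration is measured.
reference_date = pd.Timestamp("2018-08-01")

# Whole days elapsed between account creation and the reference date.
profile["client_duration_days"] = (reference_date - joined).dt.days
```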

iii. Preprocessing of the transcript dataset

The preprocessed transcript data are shown in the figure below, where the client_id is repeated as many times as the number of offers received. Each received offer is associated with two Boolean variables indicating whether the offer has been viewed and completed, i.e. is_offer_viewed and is_offer_completed, respectively. Moreover, each offer is associated with its respective net revenue (column net_revenue).

As can be noted in the first two rows of the Dataframe in the figure below, the net revenue is negative when an offer was sent to the client but was either not seen or not completed. On the contrary, the net revenue is positive when a user has seen the offer, though not necessarily completed it, as indicated in rows 2 and 3 of the Dataframe below.

It is important to note that the flag is_offer_completed is True only when the client has both viewed and completed the offer. An offer that was completed without having been viewed is not considered a successful offer.
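
The rule for the two flags can be sketched as follows. The intermediate columns was_viewed and was_completed are hypothetical names for the raw event indicators; only is_offer_viewed and is_offer_completed come from this post.

```python
import pandas as pd

# Toy table with one row per received offer; was_viewed and was_completed
# are hypothetical raw indicators derived from the transcript events.
offers = pd.DataFrame({
    "client_id": ["c1", "c1", "c2"],
    "offer_id": ["o1", "o2", "o1"],
    "was_viewed": [True, False, True],
    "was_completed": [True, True, False],
})

offers["is_offer_viewed"] = offers["was_viewed"]

# An offer only counts as completed if it was also viewed.
offers["is_offer_completed"] = offers["was_viewed"] & offers["was_completed"]
```

In this toy table, the second row is completed but never viewed, so its is_offer_completed flag stays False.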

In the bar graph below, the average net revenue per offer is shown. The 10th offer (i.e. fafdcd668e3743c1bb461111dcafc2a4), which is a discount offer, has the highest net revenue, at about $30. On the contrary, the first offer (i.e. 0b1e1539f2cc45b7b9fa7c272da2e1d7), which is a bogo offer, has the lowest net revenue amongst all Starbucks offers.

B. Implementation

For this, the client profile will be used as input for our machine learning model, whereas the net revenue will be the output. The recommendation engine runs ten machine learning models, one per offer, to calculate the expected net revenue (NR) of each offer. As a result, the offer with the highest NR will be sent to the client.
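
The selection logic described above can be sketched as follows, assuming one fitted regressor per offer stored in a dictionary. The function name recommend_offer, the offer identifiers and the single income feature are illustrative assumptions and not the project’s actual code.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def recommend_offer(models, client_features):
    """Return the offer whose model predicts the highest net revenue."""
    x = np.asarray(client_features, dtype=float).reshape(1, -1)
    predictions = {
        offer_id: float(model.predict(x)[0]) for offer_id, model in models.items()
    }
    best_offer = max(predictions, key=predictions.get)
    return best_offer, predictions[best_offer]


# Two toy models trained on one hypothetical feature (income in thousands of dollars).
X = np.array([[30.0], [60.0], [90.0]])
models = {
    "offer_a": LinearRegression().fit(X, np.array([5.0, 10.0, 15.0])),
    "offer_b": LinearRegression().fit(X, np.array([12.0, 12.0, 12.0])),
}

best_offer, expected_nr = recommend_offer(models, [45.0])
```

With ten offers, the dictionary would simply hold ten fitted models instead of two.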

Therefore, by using the piece of code below, the cleaned portfolio (portfolio_clean) was combined with the cleaned transcript dataframe (df_client_offer) and the cleaned client profile.

Joining these three datasets yields a complete Dataframe linking the client profile with the offer portfolio and the expected net revenue, as shown in the figure below.
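
A join of this kind can be sketched with the Pandas merge method, as below. The dataframe name profile_clean, the key columns client_id and offer_id, and all the values are assumptions for illustration; the original joining code may differ.

```python
import pandas as pd

# Toy versions of the three cleaned dataframes; names of keys and values are assumptions.
df_client_offer = pd.DataFrame({
    "client_id": ["c1", "c2"],
    "offer_id": ["o1", "o2"],
    "net_revenue": [24.9, -0.1],
})
profile_clean = pd.DataFrame({
    "client_id": ["c1", "c2"],
    "age": [54, 37],
    "income": [70000.0, 52000.0],
})
portfolio_clean = pd.DataFrame({
    "offer_id": ["o1", "o2"],
    "difficulty": [10, 20],
    "reward": [2, 5],
})

# Attach the client profile, then the offer characteristics, to each received offer.
df_full = (
    df_client_offer
    .merge(profile_clean, on="client_id", how="inner")
    .merge(portfolio_clean, on="offer_id", how="inner")
)
```

An inner join keeps only the offers whose client and offer identifiers are present in all three dataframes.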

Consequently, by joining the three dataframes, it is possible to compute some key statistics of the entire input data, as summarized in the table below. While 78% of offers were viewed, only 52% were completed. The average net revenue is about $19 per person. Most offers are sent by email and mobile. The average age of Starbucks clients is 54 years, and most of them are men.

In order to determine the best machine learning model to implement our recommendation engine, three different approaches were evaluated using both Random Forest Regressor and Linear Regression, notably:

  1. Baseline model (BM): a simple machine learning model to predict the net revenue considering only the user gender (F, M, O) as input.
  2. Client profile model (CM): a machine learning model to predict the net revenue considering the full client profile as input (gender, age, income and client_duration_days).
  3. Client-profile and Offer channel model (COM): a machine learning model to predict the net revenue considering the full client profile (gender, age, income and client_duration_days) and the offer communication channel (email, mobile, social and web) as input.

Therefore, the three machine learning models implement either a Random Forest Regressor or a Linear Regression and differ only in the number of input features. The three models were implemented from the two functions shown below. The difference between the three models (BM, CM and COM) lies in num_features, which equals 3, 6, and 10, respectively.

Function to train a Random Forest Regressor model
Function to train a Linear Regression model
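
Two training functions of this kind could look like the sketch below, written with scikit-learn. The function names, the 75/25 train/test split, the number of trees and the synthetic data are all assumptions; they are not the original functions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def _fit_and_score(model, X, y, random_state):
    """Fit the model on a training split and return it with its test MSE."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=random_state
    )
    model.fit(X_train, y_train)
    return model, mean_squared_error(y_test, model.predict(X_test))


def train_random_forest(X, y, random_state=42):
    model = RandomForestRegressor(n_estimators=20, random_state=random_state)
    return _fit_and_score(model, X, y, random_state)


def train_linear_regression(X, y, random_state=42):
    return _fit_and_score(LinearRegression(), X, y, random_state)


# Synthetic data with 6 input features, standing in for num_features = 6.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = X @ np.arange(1.0, 7.0) + rng.normal(scale=0.1, size=200)

rf_model, rf_mse = train_random_forest(X, y)
lr_model, lr_mse = train_linear_regression(X, y)
```

Switching between BM, CM and COM then only requires passing a feature matrix with 3, 6 or 10 columns.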

The baseline model serves to verify whether the model can estimate at least the average net revenue per gender. It is a simplified model that allows us to verify whether the machine learning model is working properly. Therefore, the Mean Squared Error (MSE) of the baseline model will be compared with that of the other two models (CM and COM) to identify the best one. The most suitable model is the one that guarantees the lowest MSE.

The following three subsections describe each of these three machine learning models.

i. The Baseline model (BM)

Based on the results shown in the three graphs above, the baseline model can estimate the average net revenue of clients with an error below 6% for women and men and below 32% for other genders. The accuracy of the ‘Other’ prediction is low because the number of people who have indicated ‘Other’ as gender is very small (less than 2% of clients).

According to the graph below, the Random Forest and Linear Regression models yield similar MSE. This leads us to conclude that, for the baseline model, both algorithms are comparable.

The Random Forest Regressor was set up through a GridSearch, as shown in the code below:

The n_estimators, min_samples_split and bootstrap parameters were tuned to configure the model. As a result, the best parameters to configure each of the ten RandomForestRegressor models are shown below. Most of the time, it is recommended to activate bootstrap, min_samples_split equals either 2 or 4, and n_estimators varies between 10 and 30.
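
A grid search over these three parameters can be sketched with scikit-learn’s GridSearchCV, as below. The candidate values in the grid are assumptions inferred from the ranges reported above, and the data are synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the client features and the net revenue.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=120)

# Candidate values are assumptions based on the ranges mentioned in the text.
param_grid = {
    "n_estimators": [10, 20, 30],
    "min_samples_split": [2, 4],
    "bootstrap": [True, False],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_squared_error",  # scikit-learn maximises, hence the negative MSE
    cv=5,
)
search.fit(X, y)

best_params = search.best_params_
best_mse = -search.best_score_  # back to a positive MSE
```

Repeating this search once per offer would give one set of best parameters for each of the ten models.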

Moreover, in the model training and validation process, 5-fold cross validation was used to verify whether there is a bias in the model. The cross validation is implemented by using the code below. This allows us to check whether the model is overfitting or underfitting. By analysing the results of the cross validation above, it is possible to note that the MSE varies considerably across folds, which indicates that either the dataset is unbalanced or the model is underfitting.
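
A 5-fold cross validation reporting one MSE per fold can be sketched as follows. The use of cross_val_score, the shuffling, the random seed and the synthetic data are assumptions made for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data standing in for the client features and the net revenue.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
y = X @ np.array([1.0, 0.5, -2.0, 3.0]) + rng.normal(scale=0.2, size=100)

# Five folds; shuffling and the seed are assumptions of this sketch.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

scores = cross_val_score(
    LinearRegression(), X, y, scoring="neg_mean_squared_error", cv=kfold
)

mse_per_fold = -scores  # one positive MSE per fold
mean_mse = float(mse_per_fold.mean())
```

A large spread in mse_per_fold is the kind of signal discussed above: it suggests that the folds are not comparable or that the model does not capture the data well.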

ii. Client profile model (CM)

  • Client age
  • Client income
  • Number of days a user is a client member
  • Client gender (F, M and O)

The MSE for each of these models is shown in the graph below. It is important to highlight that GridSearch and KFold cross validation were also used in the model training and validation steps. The MSE shown in this graph is the average over the 5 folds. For the sake of simplicity, the MSE per fold and the GridSearch results are not shown in this blog post. However, the full analysis is available in my GitHub repository.

From the graph below, it is possible to note that the Linear Regression model achieves a lower MSE than the Random Forest.

iii. Client-profile and Offer channel Model (COM)

4. Results and discussions

A. Comparison of the three linear regression models

From the graph above, it is possible to note that the Client profile Model (CM) and the Client-profile and Offer channel Model (COM) using Linear Regression achieve the lowest average MSE. It is important to highlight that the average MSE is the average over the ten offers. The graph below summarises the MSE obtained for each offer and for each machine learning model.

In these graphs, it is possible to note that the CM model (section 3.B.ii) and the COM model (section 3.B.iii) result in very similar mean squared errors, which are lower than that of the BM (section 3.B.i). This indicates that the offer channel (social, web, email, etc.) has little influence on the net revenue estimation. Therefore, it is better to use the CM model to build the recommendation engine, since it is the simpler machine learning model.

B. Building the recommendation engine

Running the recommendation engine on all 14825 client profiles available in the cleaned Starbucks dataset gives the statistics for the top three offers shown below.

The offer in rank #1:

From the graph above, it is possible to note that the discount offer with difficulty 10, duration 10 and reward 2 (i.e. ID = ‘fafdcd668e3743c1bb461111dcafc2a4’) is the most suitable offer to be sent most of the time. The developed engine recommended this offer for more than 12000 out of 14825 clients.

According to the table below, the average age of these clients is 57 years and most of them are men.

The offer in rank #2:

The offer in rank #3:

5. Conclusion

The first step was to understand and explore the available data. This step was essential to recognise the type of information in each dataset and to evaluate the potential for extracting valuable knowledge from it. After recognising how the data were structured, the second step was to clean the data to handle missing values and facilitate the data analysis. Finally, the third step was to develop the machine learning model to build the final recommendation engine.

In particular, the preprocessing of the Transcript dataset was the most challenging, especially because of its format based on temporal dictionaries. Using the Pandas library, the preprocessing of the Transcript was oriented towards calculating the net revenue of each offer.

The net revenue was used as the key parameter to develop the linear regression models. To determine which features are best suited to estimate the expected offer net revenue, three different linear models were developed. By comparing the mean squared error of each linear model, it was concluded that the model using the full client profile is the most suitable, because it offers a good trade-off between complexity and precision. It achieved a lower mean squared error than the baseline model while using fewer features than the model based on both the client profile and the communication channel.

As a result, the developed recommendation engine uses the linear model with the full client profile as input to estimate the offer net revenue. With the expected net revenue of each offer, the proposed recommendation engine selects the offer with the highest net revenue to send to the client.

After running the engine on the 14825 client profiles, it is possible to observe that some offers can be considered “best sellers”, such as the discount offer with difficulty 10, duration 10 and reward 2.

As an improvement, it is necessary to verify the impact of the cost of investment on the net revenue and on the engine’s final recommendation. Additionally, it is necessary to verify the assumption that the money spent by the client between two consecutive offers has no influence on future purchases. Finally, the performance of the developed engine can be compared with a classifier, such as Random Forest or support vector machines.