Capstone Project for Udacity’s Data Scientist Nanodegree.
Having a good product at the right time with the best price: this is the secret formula behind Starbucks’ success. With 32,660 stores in more than 80 countries, Starbucks sells more than 4 million coffee drinks per day. How, then, does this American multinational manage quality, price, and timing with such precision?
The true secret behind the smooth coordination of its business lies in a huge amount of data. Through strong data processing, Starbucks can understand the needs and wishes of each of its clients and thus recommend the right offer to boost its coffee sales.
However, the greatest challenge in increasing client engagement through data analysis is developing a recommendation engine capable of recognising the best offer to send to each client. This post therefore proposes a feasible approach to this problem, within the scope of the Udacity Data Scientist Nanodegree’s capstone project.
1. Project Definition
In collaboration with Udacity and Starbucks, the Data Scientist Nanodegree’s capstone project aims to develop a recommendation engine that identifies the next offer to send to each client, keeping clients engaged and coming back for another cup of coffee.
Using machine learning algorithms and data visualisation strategies, the recommendation engine was built in Python using Jupyter Notebook. The entire data analysis and the engine’s code are available in this GitHub repository.
The development of this engine followed three steps. The first step was to understand and clean the available data to prepare them for further analysis. With the well-formatted data, an in-depth analysis was conducted to build three different machine learning models that predict the most suitable offer to send to a client. Finally, the last step was to compare the performance of these three models based on the mean squared error: the model with the lowest mean squared error is the most suitable to operate as a recommendation engine.
The Mean Squared Error (MSE) was used as the metric for comparing the models. The three models are evaluated by dividing the entire dataset into training and test sets and applying KFold cross-validation. The model with the lowest MSE is selected to build the recommendation engine.
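As a rough illustration of this metric, the sketch below computes the MSE per fold with 5-fold cross-validation in scikit-learn. The data are synthetic and the variable names are assumptions made for the example; this is not the code from the project repository.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (assumption): 200 clients, 3 features, noisy linear target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# 5-fold cross-validation. scikit-learn reports the MSE negated
# ("neg_mean_squared_error") so that higher scores are always better.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
neg_mse = cross_val_score(
    LinearRegression(), X, y, cv=cv, scoring="neg_mean_squared_error"
)

mse_per_fold = -neg_mse          # one MSE value per fold
mean_mse = mse_per_fold.mean()   # the figure used to compare models
```

A large spread in `mse_per_fold` is a hint that the folds differ noticeably from each other or that the model does not capture the data well.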
2. Data Exploration and visualization
The data available for analysis are structured into three different DataFrames, notably:
- Portfolio: dataset describing the characteristics of each offer, including its offer type, difficulty, and duration.
- Profile: dataset containing customer demographics, including age, gender, income, and the date on which the customer created a Starbucks Rewards account.
- Transcript: dataset containing every purchase made by a customer. It also indicates when the client received, viewed, and completed an offer.
These three datasets (i.e. portfolio, profile and transcript) were preprocessed to determine which demographic group responds best to which offer type. The preprocessed data are then used to train different machine learning models that estimate the net revenue of an offer.
A. Portfolio dataset
The portfolio dataset is a small dataframe containing only ten rows, one for each of the ten possible offers. It holds the essential information about the offers, such as duration, difficulty, reward, and communication channels.
As shown in the graph below, the ten offers are categorized into three types: bogo, discount and informational. There are four bogo offers, four discount offers, and only two informational offers.
The difficulty, duration and reward of each offer are shown in the graph below. The difficulty and the reward of “bogo” offers (also known as ‘buy one, get one’) are the same. Additionally, “informational” offers have difficulty and reward equal to zero. Notably, one of the discount offers has the highest difficulty: the client has to spend at least $20 to receive a discount of $5.
B. Profile dataset
The client profile dataset has many missing values, especially for gender and income, as shown in the table and graph below. Moreover, some clients have an indicated age of 118 years, which is unrealistic. Conversely, the column ‘became_member_on’ has no missing values. Consequently, this dataset needs to be cleaned before being used by machine learning models.
C. Transcript dataset
The transcript dataset contains the full history of how clients interact with offers. The dataset covers 17000 unique clients, each identified by an ID. An example of the history of one client (e.g. client_id = “0009655768c64bdeb2e877511632db8f”) is shown in the figure below.
As illustrated in the figure above, the Transcript dataset indicates when an offer was received, viewed and completed by the client. It also reports the total money spent during a transaction. However, the structure of this dataframe is not straightforward to analyse. The data are organised in dictionaries, which makes it difficult to extract key information from them.
The entire Transcript dataset is divided into ‘offer completed’, ‘offer received’, ‘offer viewed’, and ‘transaction’ events, as shown in the graph below. There are many more ‘transaction’ events than events of any other category. On average, each client has 18 events in the database.
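One possible way to unpack such dictionaries with pandas is sketched below. The column names (`person`, `event`, `time`, `value`) and the dictionary keys are assumptions chosen for illustration and may not match the real dataset exactly.

```python
import pandas as pd

# Toy events; the column names and dictionary keys are assumptions.
transcript = pd.DataFrame({
    "person": ["c1", "c1", "c1"],
    "event": ["offer received", "offer viewed", "transaction"],
    "time": [0, 6, 12],
    "value": [{"offer id": "o1"}, {"offer id": "o1"}, {"amount": 8.5}],
})

# Expand each dictionary into its own columns; keys that are missing
# from a given row become NaN in that row.
details = pd.DataFrame(transcript["value"].tolist(), index=transcript.index)

# Replace the dictionary column with the expanded columns.
flat = pd.concat([transcript.drop(columns="value"), details], axis=1)
```

After this step, each piece of information sits in a regular column and can be filtered or aggregated like any other field.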
3. Methodology to build the machine learning recommendation engines
The methodology chosen to build the machine learning recommendation engine is structured into two main steps, notably:
- Data preprocessing: prepare and clean the data to be used in the machine learning models.
- Implementation: build and train three machine learning models using Random Forest Regressor and Linear Regression.
A. Data preprocessing
The three input datasets (portfolio, profile and transcript) are preprocessed so that they can be used to train machine learning models that estimate the Net Revenue (NR) of each offer.
The Net Revenue (NR) is calculated from the total money spent by the client, the offer reward, and the cost of advertisement, as expressed in the formula below.
In other words, the NR is the difference between the total income (I) and the total expenses related to the offer (R+C). The total income (I) represents the total money spent by the client between two offers. It is assumed that viewing an offer influences the client’s behaviour until another offer is received. Therefore, the total money spent in the period between the viewing of an offer and the receipt of the next offer defines the total income (I).
The total offer expense, on the other hand, is the sum of the cost of advertisement (C) and the reward received by the client (R). The reward is only subtracted from the net revenue (NR) if the client has completed the offer (Delta_completed); otherwise, Starbucks does not need to give the reward (R) to the client. In this study, the cost of advertisement is set to $0.1.
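Read literally, the description above amounts to NR = I - (Delta_completed × R + C), where Delta_completed is 1 when the offer was completed and 0 otherwise. A minimal sketch of this calculation follows; the function and parameter names are hypothetical.

```python
def net_revenue(income, reward, completed, ad_cost=0.1):
    """Net revenue of one offer: NR = I - (Delta_completed * R + C).

    income    -- total money spent by the client in the offer window (I)
    reward    -- reward attached to the offer (R)
    completed -- True if the client completed the offer (Delta_completed)
    ad_cost   -- cost of advertisement (C), $0.1 in this study
    """
    # The reward is only paid out when the offer was completed.
    reward_paid = reward if completed else 0.0
    return income - (reward_paid + ad_cost)


# A client spends $20 and completes an offer with a $5 reward.
completed_case = net_revenue(20.0, 5.0, True)
# A client ignores the offer and spends nothing: only the ad cost is lost.
ignored_case = net_revenue(0.0, 5.0, False)
```

The second case shows why an offer that triggers no spending ends up with a small negative net revenue.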
i. Preprocessing of the portfolio dataset
The preprocessing of the portfolio dataset consists of creating dummy columns that indicate each offer channel (i.e. email, mobile, social, and web). The main purpose of this preprocessing is to assess the potential of this type of data to enhance the machine learning models.
Using the get_dummies method of the pandas library, the dummy columns can be created easily, as shown in the figures below.
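The sketch below shows one way to obtain such dummy columns, assuming the channels are stored as a list per offer. It uses the string-accessor variant `Series.str.get_dummies`, which may differ from the exact call used in the project; the toy data are made up.

```python
import pandas as pd

# Toy portfolio; storing the channels as a list per offer is an assumption.
portfolio = pd.DataFrame({
    "id": ["o1", "o2", "o3"],
    "channels": [
        ["email", "mobile", "social"],
        ["web", "email"],
        ["email", "mobile", "social", "web"],
    ],
})

# Join each list into one delimited string, then let pandas build
# one 0/1 indicator column per channel.
channel_dummies = portfolio["channels"].str.join("|").str.get_dummies(sep="|")

# Replace the list column with the indicator columns.
portfolio_clean = pd.concat(
    [portfolio.drop(columns="channels"), channel_dummies], axis=1
)
```

Each offer then carries four indicator columns (email, mobile, social, web) that a regression model can consume directly.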
ii. Preprocessing of the profile dataset
The preprocessing of the profile dataset consists of dropping all rows that contain null values (NaN or None) or an age equal to 118. After cleaning, the dataset contains 14825 complete client profiles out of the 17000 in the original dataset. The distribution of client profiles by gender is shown in the graph below. There are about 38% more men than women in the dataset. Additionally, very few users have indicated ‘other’ as gender.
The distribution of client income by gender can be visualized in the graph below. Women have an average income about 16% higher than men.
Additionally, female clients are on average 3 years older than male clients, as can be observed in the graph below.
Looking at the number of days each of these groups has held a Starbucks account, most clients are relatively new, with fewer than 1500 days of membership. On the other hand, very few clients are long-standing members with more than 2000 days of membership.
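The cleaning rule described above can be sketched in a few lines of pandas. The toy rows and column names below are assumptions for illustration only.

```python
import pandas as pd

# Toy profiles; the column names are assumptions.
profile = pd.DataFrame({
    "id": ["c1", "c2", "c3", "c4"],
    "gender": ["F", None, "M", "O"],
    "age": [55, 118, 118, 40],
    "income": [72000.0, None, 50000.0, 61000.0],
})

# Drop every row with at least one missing value ...
profile_clean = profile.dropna()
# ... and every row whose age is the placeholder value 118.
profile_clean = profile_clean[profile_clean["age"] != 118].reset_index(drop=True)
```

In this toy example, `c2` is removed because of missing values and `c3` because of the placeholder age, leaving `c1` and `c4`.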
iii. Preprocessing of the transcript dataset
The transcript data must be preprocessed to indicate the offers received by a client and whether each offer has been seen and completed. Furthermore, it is necessary to calculate the net revenue for each received offer based on Equation I above.
The preprocessed transcript data are shown in the figure below, where the client_id is repeated as many times as the number of received offers. Each received offer is associated with two Boolean variables indicating whether the offer has been viewed and completed, i.e. is_offer_viewed and is_offer_completed, respectively. Moreover, each offer is associated with its respective net revenue (column net_revenue).
As can be seen in the first two rows of the DataFrame in the figure below, the net revenue is negative when an offer was sent to the client but was either not seen or not completed. Conversely, the net revenue is positive when a user has seen the offer, even if they did not necessarily complete it, as indicated in rows 2 and 3 of the DataFrame below.
It is important to note that the flag is_offer_completed is True only when the client has both viewed and completed the offer. An offer completed without being viewed is not considered a successful offer.
The bar graph below shows the average net revenue per offer. The 10th offer (i.e. fafdcd668e3743c1bb461111dcafc2a4), which is a discount offer, has the highest net revenue, at about $30. In contrast, the first offer (i.e. 0b1e1539f2cc45b7b9fa7c272da2e1d7), which is a bogo offer, has the lowest net revenue amongst all possible Starbucks offers.
According to the portfolio dataset, Starbucks has ten different offers. Therefore, the idea is to design one machine learning model per offer to estimate the net revenue given a type of client.
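The rules above can be sketched for a single client as follows. This is only an illustration under simplifying assumptions: events are given as `(time, kind, payload)` tuples sorted by time, an offer's window ends when the next offer is received, and the helper name and event layout are hypothetical rather than taken from the project code.

```python
AD_COST = 0.1  # cost of advertisement per offer sent, as stated in the text


def summarise_offers(events, rewards):
    """Return one record per received offer for a single client.

    events  -- list of (time, kind, payload) tuples sorted by time, where
               kind is 'offer received', 'offer viewed', 'offer completed'
               or 'transaction'; payload is an offer id, or the amount
               spent for a transaction
    rewards -- dict mapping offer id to its reward
    """
    receipts = [(t, p) for t, kind, p in events if kind == "offer received"]
    records = []
    for i, (start, offer_id) in enumerate(receipts):
        # The window of an offer closes when the next offer is received.
        end = receipts[i + 1][0] if i + 1 < len(receipts) else float("inf")
        window = [e for e in events if start <= e[0] < end]

        view_times = [t for t, kind, p in window
                      if kind == "offer viewed" and p == offer_id]
        viewed = bool(view_times)
        # A completion only counts when the offer was also viewed.
        completed = viewed and any(
            kind == "offer completed" and p == offer_id
            for t, kind, p in window
        )
        # Income: money spent between the first view and the next receipt.
        income = sum(
            p for t, kind, p in window
            if kind == "transaction" and viewed and t >= view_times[0]
        )
        net = income - (rewards[offer_id] if completed else 0.0) - AD_COST
        records.append({
            "offer_id": offer_id,
            "is_offer_viewed": viewed,
            "is_offer_completed": completed,
            "net_revenue": net,
        })
    return records


events = [
    (0, "offer received", "o1"),
    (6, "offer viewed", "o1"),
    (12, "transaction", 10.0),
    (12, "offer completed", "o1"),
    (24, "offer received", "o2"),
    (30, "transaction", 4.0),
]
records = summarise_offers(events, rewards={"o1": 2.0, "o2": 5.0})
```

Here the first offer is viewed and completed, so its net revenue is the $10 spent minus the $2 reward and the advertisement cost. The second offer is never viewed, so the later $4 purchase is not attributed to it and only the advertisement cost remains.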
For this, the client profile will be used as input for our machine learning model, whereas the net revenue will be the output. The recommendation engine will run the ten machine learning models to calculate the expected net revenue (NR) for each of these offers. As a result, the offer with the highest NR will be sent to the client.
Therefore, using the piece of code below, the cleaned portfolio (portfolio_clean) was combined with the cleaned transcript dataframe (df_client_offer) and the cleaned client profile.
Joining these three datasets yields a complete DataFrame linking the client profile with the offer portfolio and the expected net revenue, as shown in the figure below.
The joined dataframe also provides some key statistics of the entire input data, as summarized in the table below. While 78% of offers were viewed, only 52% were completed. The average net revenue is about $19 per person. Most offers are sent by email and mobile. The average age of Starbucks clients is 54 years, and most of them are men.
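A join of this kind can be sketched with `DataFrame.merge`. The names `df_client_offer` and `portfolio_clean` come from the text; `profile_clean`, the key columns and the toy values are assumptions for illustration.

```python
import pandas as pd

# Toy stand-ins for the three cleaned dataframes (values are made up).
df_client_offer = pd.DataFrame({
    "client_id": ["c1", "c1", "c2"],
    "offer_id": ["o1", "o2", "o1"],
    "net_revenue": [7.9, -0.1, 3.4],
})
profile_clean = pd.DataFrame({
    "client_id": ["c1", "c2"],
    "gender": ["F", "M"],
    "age": [55, 40],
})
portfolio_clean = pd.DataFrame({
    "offer_id": ["o1", "o2"],
    "offer_type": ["bogo", "discount"],
    "email": [1, 1],
})

# Attach the client profile and the offer description to every received offer.
df_full = (
    df_client_offer
    .merge(profile_clean, on="client_id", how="inner")
    .merge(portfolio_clean, on="offer_id", how="inner")
)
```

An inner join drops received offers whose client was removed during profile cleaning, which keeps only rows with a complete set of features.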
In order to determine the best machine learning model to implement our recommendation engine, three different approaches were evaluated using both Random Forest Regressor and Linear Regression, notably:
- Baseline model (BM): a simple machine learning model to predict the net revenue considering only the user gender (F, M, O) as input.
- Client profile model (CM): a machine learning model to predict the net revenue considering the full client profile as input (gender, age, income and client_duration_days).
- Client-profile and Offer channel model (COM): a machine learning model to predict the net revenue considering the full client profile (gender, age, income and client_duration_days) and the offer communication channel (email, mobile, social and web) as input.
The three machine learning models therefore implement either a Random Forest Regressor or a Linear Regression and differ only in the number of input features. The three models were implemented from the two functions shown below. The difference between the three models (BM, CM and COM) lies in num_features, which is 3, 6, and 10, respectively.
The purpose of the baseline model is to verify whether the model can estimate at least the average net revenue per gender. It is a simplified model that allows us to check whether the machine learning model is working properly. The Mean Squared Error (MSE) of the baseline model is then compared with that of the other two models (CM and COM) to identify the best one. The most suitable model is the one with the lowest MSE.
The following three subsections describe each of these three machine learning models.
i. The Baseline model (BM)
To verify whether the prediction model works correctly, a simple model that uses only the client gender (i.e. F, M and O) as feature is evaluated. This check is implemented by the function compare_baseline_with_average_per_gender, available in this GitHub repository. The evaluation consists of comparing the predicted value (named ‘Net Revenue Prediction’ in the graphs below) for the three possible input values (i.e. [1 0 0], [0 1 0], and [0 0 1]) with the average net revenue per gender over the whole dataset.
Based on the results shown in the three graphs above, the baseline model can estimate the average net revenue of participants with an error below 6% for women and men and below 32% for other genders. The accuracy of the ‘Other’ prediction is low because the number of people who have indicated ‘Other’ as gender is very small (less than 2% of clients).
According to the graph below, the Random Forest and Linear Regression models achieved similar MSE. This leads us to conclude that, for the baseline model, both algorithms are comparable.
The Random Forest Regressor was set up through a grid search, as shown in the code below.
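The idea behind this sanity check can be illustrated on toy data: a linear regression fitted on one-hot gender columns alone reproduces the per-gender mean of the data it was fitted to. The values below are made up, and the snippet is not the repository's compare_baseline_with_average_per_gender function.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: one-hot gender columns in the order [F, M, O]
# and a net revenue value per row (all values are made up).
X = np.array([
    [1, 0, 0], [1, 0, 0],
    [0, 1, 0], [0, 1, 0], [0, 1, 0],
    [0, 0, 1],
])
y = np.array([20.0, 24.0, 15.0, 18.0, 21.0, 30.0])

model = LinearRegression().fit(X, y)

# Predictions for the three possible inputs [1 0 0], [0 1 0], [0 0 1].
predicted = model.predict(np.eye(3))

# Average net revenue per gender, computed directly from the data.
group_means = np.array([y[X[:, k] == 1].mean() for k in range(3)])
```

On held-out data, or with a different estimator such as a random forest, the predictions can deviate from the overall group averages, which is consistent with the small errors reported above.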
The n_estimators, min_samples_split and bootstrap parameters were tuned to configure the model. The resulting best parameters for each of the ten RandomForestRegressor models are shown below. In most cases, bootstrap should be enabled, min_samples_split is either 2 or 4, and n_estimators varies between 10 and 30.
Moreover, during model training and validation, cross-validation with 5 folds (KFold) was used to verify whether the model is biased. The cross-validation is implemented using the code below. This makes it possible to check whether the model is overfitting or underfitting. The cross-validation results show that the MSE varies considerably from fold to fold, which indicates that either the dataset is unbalanced or the model is underfitting.
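A grid search over these three parameters could look like the sketch below, which uses scikit-learn's `GridSearchCV`. The candidate values are assumptions chosen to match the ranges mentioned above, and the data are synthetic; the exact grid used in the project may differ.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption): 120 rows, 6 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 6))
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=120)

# Candidate values are assumptions consistent with the ranges in the text.
param_grid = {
    "n_estimators": [10, 20, 30],
    "min_samples_split": [2, 4],
    "bootstrap": [True, False],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_squared_error",  # MSE, negated so that higher is better
    cv=5,
)
search.fit(X, y)

best_params = search.best_params_
best_mse = -search.best_score_  # mean cross-validated MSE of the best setting
```

Running one such search per offer yields a separate set of best parameters for each of the ten models.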
ii. Client profile model (CM)
To enhance the model prediction, the full client profile is used. The CM model therefore considers six features (gender is encoded as three dummy columns), notably:
- Client age
- Client income
- Number of days the client has been a member
- Client gender (F, M and O)
The MSE for each of these models is shown in the graph below. It is important to highlight that GridSearch and KFold cross-validation were also used in the model training and validation steps. The MSE shown in this graph is the average over the 5 KFold iterations. For the sake of simplicity, the MSE per KFold iteration and the GridSearch results are not shown in this blog post. However, the full analysis is available in my GitHub repository.
From the graph below, it is possible to note that the Linear Regression model achieves a lower MSE than Random Forest.
iii. Client-profile and Offer channel Model (COM)
Similarly, to verify whether it is possible to reduce the MSE further, the offer communication channel was added as a feature for training the model. The comparison of the MSE per offer is shown in the table below. The Linear Regression model again performs better than Random Forest, since it achieves a lower MSE.
4. Results and discussions
After building three different machine learning models, this section evaluates the performance of the three developed models based on the MSE. Thereafter, using the most suitable model to predict the net revenue, the recommendation engine was designed to identify the relation between users and offers.
A. Comparison of the three linear regression models
To compare the three linear regression models explained above, their average MSE values were compared, as shown in the graph below.
From the graph above, it is possible to note that the Client profile Model (CM) and the Client Offer Model (COM) using Linear Regression achieve the lowest average MSE. It is important to highlight that the average MSE is the average over the ten offers. The graph below summarises the MSE obtained for each offer and for each machine learning model.
In these graphs, it is possible to note that the CM (section 3.B) and COM (section 3.C) models yield very similar mean squared errors, both lower than that of the BM (section 3.A). This suggests that the offer channel (social, web, email, etc.) does not noticeably influence the net revenue estimation. It is therefore better to use the CM model to build the recommendation engine, since it is the simpler machine learning model.
B. Building the recommendation engine
To determine the best offer to send to the client, the proposed recommendation engine uses the ten CM models described above. As shown in the schematic in the figure below, each of these models will calculate the expected net revenue of a specific offer sent to the client. Subsequently, the offers will be ranked according to the expected net revenue. Finally, the engine will select the offer with the highest net revenue as the best one to be sent to the client.
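The ranking logic described above can be sketched as follows. The per-offer models are replaced here by hypothetical stand-in functions; in the project, each would be a fitted CM model, and the offer names and coefficients below are made up.

```python
def recommend(profile, models, top_n=3):
    """Rank offers by predicted net revenue for one client profile.

    profile -- dict of client features
    models  -- dict mapping offer id to a callable that returns the
               expected net revenue for a given profile
    """
    predictions = {
        offer_id: predict(profile) for offer_id, predict in models.items()
    }
    # Highest expected net revenue first.
    ranked = sorted(predictions.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


# Hypothetical stand-ins for fitted per-offer models.
models = {
    "offer_a": lambda p: 0.0002 * p["income"] + 5.0,
    "offer_b": lambda p: 0.1 * p["age"] + 10.0,
    "offer_c": lambda p: 12.0,
}

client = {"age": 57, "income": 70000}
ranking = recommend(client, models)
best_offer = ranking[0][0]  # the offer the engine would send
```

Keeping the full ranking, rather than only the top offer, makes it possible to report statistics for the second and third best offers as well.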
After running the recommendation engine for all 14825 client profiles available in the cleaned Starbucks dataset, the statistics for the top three offers are shown below.
The offer in rank #1:
From the graph above, it is possible to note that the discount offer with difficulty 10, duration 10 and reward 2 (i.e. ID = ‘fafdcd668e3743c1bb461111dcafc2a4’) is the most suitable offer to send most of the time. The developed engine recommended this offer for more than 12000 out of 14825 clients.
According to the table below, the average age of these clients is 57 years and most of them are men.
The offer in rank #2:
The second best offer to send to clients is also a discount offer. However, it is a discount offer with difficulty 7, duration 7, and reward 3.
The offer in rank #3:
The third position is more diverse, as can be noted in the graph below. The third best offer is often a bogo offer with duration 5, difficulty 5, and reward 5 (i.e. ID = f19421c1d4aa40978ebb69ca19b0e20d). Most of the clients who will complete this offer are women aged 78.
5. Conclusion
To boost Starbucks’ coffee sales, a recommendation engine based on a linear regression model was designed using real data provided by Udacity in collaboration with Starbucks. The proposed recommendation engine was designed following three main steps.
The first step was to understand and explore the available data. This step was essential to recognise the type of information in each dataset and to evaluate the potential for extracting valuable knowledge from it. After recognising how the data were structured, the second step was to clean the data to remove missing values and facilitate the data analysis. Finally, the third step was to develop the machine learning model to build the final recommendation engine.
The preprocessing of the Transcript dataset was the most challenging part, especially because of its format based on temporal dictionaries. Using the pandas library, the preprocessing of the Transcript dataset was geared towards calculating the net revenue of each offer.
The net revenue was used as the key parameter to develop the linear regression models. To determine which features are best suited to estimating the expected offer net revenue, three different linear models were developed. By comparing the mean squared error of each linear model, it was concluded that the model using the full client profile is the most suitable one, because it offers a good trade-off between complexity and precision. It achieved a lower mean squared error than the baseline model while using fewer features than the model based on the client profile and communication channel.
As a result, the developed recommendation engine uses the linear model with the full client profile as input to estimate the offer net revenue. With the expected net revenue of each offer, the proposed recommendation engine selects the offer with the highest net revenue to send to the client.
After running the engine on the 14825 client profiles, it is possible to observe that some offers can be considered “best sellers”, such as the discount offer with difficulty 10, duration 10 and reward 2.
As an improvement, it is necessary to verify the impact of the cost of investment on the net revenue and on the final engine recommendation. Additionally, it should be verified whether the money spent by the client between two consecutive offers has any influence on future purchases. Finally, the performance of the developed engine can be compared with other models, such as Random Forest or support vector machines.