Starbucks Offer Prediction

Udacity Capstone Project

Ben Stone
Apr 3, 2021
Original Starbucks sign at Pike Place, Seattle

Once every few days, Starbucks sends out an offer to users of the mobile app. An offer can be merely an advertisement for a drink or an actual offer such as a discount or BOGO (buy one get one free).

Intro

I was given raw test data from Starbucks that simulated how members make purchasing decisions and how those decisions are influenced by promotional offers. There are no explicit products to track, nor any record of which products an offer affects; only the amount of each transaction or offer is recorded.

My base task was to process offer data to determine which demographic groups respond best to which offer type.

To challenge myself further, I wanted to predict how successful offers would be for a given person based solely on their demographic. This would require processing the data into a usable form and deciding which methods and metrics to use.

My full code can be found on my GitHub, here.

I had three main tasks in mind, and I will split this report/post into three corresponding sections.

  1. Data Cleaning — Bringing the data together into one clean, organised Data Frame.
  2. Analysing Data — Exploring the Data Frame to draw conclusions.
  3. Offer Prediction — Using Machine Learning to predict user activity.

Data Files

Raw data was given in the form of the following three JSON files:

portfolio.json — (Descriptions of each offer)

  • id (string) — offer id
  • offer_type (string) — (BOGO, discount, informational)
  • difficulty (int) — minimum required spend to complete an offer
  • reward (int) — reward given for completing an offer
  • duration (int) — time for offer to be open, in days
  • channels (list of strings) — e.g. web, email, mobile, social

profile.json — (Each person’s demographics)

  • age (int) — age of the customer
  • became_member_on (int) — date when customer created an app account
  • gender (str) — gender of the customer (M: Male, F: Female, O: Other)
  • id (str) — customer id
  • income (float) — customer's income

transcript.json — (Simulated results of offers over time)

  • event (str) — record description (i.e. transaction, offer received, offer viewed, etc.)
  • person (str) — customer id
  • time (int) — time in hours since start of test; the data begins at time t=0
  • value (dict of strings) — either an offer id or transaction amount, depending on the record
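
For orientation, all three files load directly with pandas. A minimal sketch, assuming the line-delimited JSON format of the Udacity workspace and these file paths:

import pandas as pd

# One JSON record per line, hence lines=True
portfolio = pd.read_json('portfolio.json', orient='records', lines=True)
profile = pd.read_json('profile.json', orient='records', lines=True)
transcript = pd.read_json('transcript.json', orient='records', lines=True)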

1. Data Cleaning

General Clean

The data itself wasn’t too dirty all considering with the majority of the task unpacking lists, converting data types and encoding long id strings. There are 11 offers, 10 given with an 11th considered as ‘No Offer’.

Judging Success

One of the two big jobs for the clean was judging whether or not an offer had been successful, and how best to record this. I decided that a Success column of 1s and 0s, denoting success and failure respectively, would serve well. The following steps outline my thought process:

  • For every ‘offer completed’, there is a corresponding transaction. Combining these two events clearly separates the successful offers from regular transactions.
  • A lone transaction then shows the absence of an offer, i.e. a successful non-offer.
  • An ‘offer received’ with no corresponding ‘offer completed’ is unsuccessful.
  • ‘offer viewed’ entries lie outside the scope of my report for the time being and will be dropped.
  • A successful informational offer is one that has been viewed, with a transaction taking place within a certain time of it being received.

Below are most of the steps for this final point: identifying successful informational offers. This chunk of code runs through each row of a person's activity and judges whether an offer has been viewed and whether it has been completed. As multiple transactions may lie within a valid informational time window, a check ensures only the first transaction is attributed to the offer. If the offer is deemed successful, its entry has the relevant data updated and the transaction is removed.

# State tracked while walking each person's activity in time order
viewed = False
last_info_time = float('-inf')  # guards transactions seen before any offer
i_dur = 0
p_id_i = i_offer = None
success_list, drop_list = [], []

for m in range(len(df)):
    # Find each informational offer
    if df.loc[m, 'offer_type'] == 'informational':
        last_info_time = df.loc[m, 'days_elapsed']
        p_id_i = df.loc[m, 'person_id']
        i_idx = m                          # row index of the offer entry
        i_offer = df.loc[m, 'offer_id']    # id of that informational offer
        i_dur = df.loc[m, 'duration']
    # Check whether the informational offer has been viewed
    elif df.loc[m, 'event'] == 'offer viewed':
        viewed = (df.loc[m, 'offer_id'] == i_offer)
    # Find each transaction
    elif df.loc[m, 'event'] == 'transaction':
        last_trans_time = df.loc[m, 'days_elapsed']
        p_id_t = df.loc[m, 'person_id']
        time_elap = last_trans_time - last_info_time
        time_diff = i_dur - time_elap
        # Check transaction is linked to the person, within time, and viewed
        if (time_diff >= 0) and (p_id_i == p_id_t) and viewed:
            # Check for duplicate: only the first transaction counts
            if i_idx not in success_list:
                success_list.append(i_idx)
                # Add to transaction drop list
                drop_list.append(m)
                # Update the 'offer received' entry
                df.at[i_idx, 'event'] = 'offer completed'
                df.at[i_idx, 'amount'] = df.loc[m, 'amount']
                df.at[i_idx, 'days_elapsed'] = df.loc[m, 'days_elapsed']
                df.at[i_idx, 'success'] = 1

As a result of these steps, we now have a clear structure, with each row showing one of:

Offer Received | Success = 0

Offer Completed | Success = 1

Transaction | Success = 1

Impute or Remove NaNs

As all of the NaN values that couldn't be easily filled lay in the same rows, it was in theory possible to impute them with some Machine Learning. Had this been an overwhelming success, I would talk my way through the methods and metrics used to impute them. However, as I came to find out many (many, many) hours later, this approach did not quite work. I will whittle the problems down to three main ones:

1. Bias.

The data itself is somewhat skewed towards middle-aged, middle-income men, as we will discover later. Fig. 1 shows that, despite my best attempts, the bias of the data still remains. I had hoped to avoid or lessen this, but perhaps that was optimistic of me.

Figure 1 — Predicted Missing Age Values

2. Computing Time

With such a large data frame to process, each model was excessively slow to build and too space-intensive to store.

3. Modelling Accuracy

After all of the attempted imputing of the NaNs, the effect it had was to:

  • Reduce accuracy on Success by ~10% and increase the F1 score
  • Increase the M.S.E. for the transaction amount

After all this work, then, I decided not to accept the losses from these predicted data points, and dropped them.

Outliers

Outliers were identified using the interquartile range: typical amounts were taken to lie between the 25th percentile (which was essentially 0) and the 75th percentile. Far bigger amounts were considered atypical, likely a group order or an error in the data, and were removed.
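
A minimal sketch of such a cut, assuming a df with an amount column; the 1.5 × IQR fence is a common rule of thumb, and the exact threshold used in the project may differ:

# Interquartile range of transaction amounts
q1 = df['amount'].quantile(0.25)  # essentially 0 in this data
q3 = df['amount'].quantile(0.75)
upper = q3 + 1.5 * (q3 - q1)      # illustrative upper fence

# Keep rows with no amount (offer events) and typical amounts
df = df[df['amount'].isna() | (df['amount'] <= upper)]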

Cleaned Data Description

At this point our data was rid of NaN values, success had been judged, and some categorical data converted into dummy columns. Breaking the (156,715 x 24) Data Frame down into its smaller parts, we get the following columns:

IDs/Indexes:

person_id: customer ID

person_index: simplified customer ID

offer_id: Offer ID

offer_index: Simplified Offer ID

Demographic data attributed to each individual:

age: Age in years

income: Annual Income

gender: Gender (M: Male, F: Female, O: Other)

member_length: Number of days customer has been a member

Time/Test Based Data:

hours_elapsed: Hours elapsed in test

days_elapsed: Days elapsed in test

Event Data:

amount: Amount spent by a customer either via an offer or by transaction

success: 1 for successful offer, 0 for not

event_offer completed: 1 for completed offer, 0 for not

event_offer received: 1 for received (but not completed) offer, 0 for not

event_transaction: 1 for transaction, 0 for not

reward: Discount given

Offer Data:

offer_reward: Discount given if offer completed

difficulty: Minimum required spend to complete an offer

duration: Time for offer to be open, in days

offer_type: Type of offer (BOGO, Discount, Informational or Transaction)

email: 1 if offer delivered via email, 0 if not

mobile: 1 if offer delivered via mobile, 0 if not

social: 1 if offer delivered via social media, 0 if not

web: 1 if offer delivered via a website, 0 if not

2. Analysing Data

Demographic Groups

Having become well acquainted with the dataset at this point, I was aware of the four points of data I could use to describe a single person: Age, Income, Gender and Member Length.

Figure 2 — Individual Demographics

As previously mentioned, Fig. 1 and now Fig. 2 show that members are predominantly middle-aged males earning $60,000–70,000 a year, with a membership length of around 1 year.

Figure 2.1 — Demographic Event Activity

Fig. 2.1 shows a slight shift in shape: the under-40s make up more of the activity than would be expected if activity were the same for every person. This was an initial stop on my journey, and it is a naïve view of the data; with one small extra step, much more depth is revealed.

Figure 3 — Demographic Event Breakdown

Fig. 3 starts to show some real insight on how demographics can affect the outcome.

The main conclusions from this are as follows:

  • Age — Offers have less effect on younger members
  • Income — Offers out-perform regular transactions for the highest earners
  • Gender — Women have the highest offer success-to-failure rate
  • Membership Length — Offer success rate is highest for members of 1–3 years

From this particular graph, the most responsive person would be a middle-aged female earning over $50,000 who has been a member for between 1 and 3 years.

Offer Types

Table 1 — Total Offers by Offer Type

Table 1 shows that a similar number of each offer was sent to each person, with the notable exception of Informational, which was sent out roughly 30% less often.

Figure 4 — Successful Offers by Demographic

These numbers help bring some sense to Fig. 4. They roughly equate to a similar dominance of Discount and BOGO, with a jump down to Informational. What's interesting is the relative success Informational offers have with younger and lower-income earners; these match and even outperform BOGO, despite considerably fewer being distributed. Only Discount consistently outperforms the other offers.

Table 2 — Offer Description

To try to work out why Discount and Informational offers were more successful than expected, I turned to the descriptions of the offers themselves (Table 2). On the face of it, it's not entirely clear: BOGO has similar, if not easier, requirements for reward; the rewards themselves are actually better in most cases; and the distribution methods are pretty similar. The most notable difference is the length of time for which the offer is valid, which is generally longer for Discounts than for BOGOs.

Following on from this, finding out why Informational was popular in some cases was less complex. Looking at the average spend per person by Age and Income makes it very clear (Fig. 5).

Figure 5 — Average Spend by Age/Income
Figure 6 — Average Spend by Age/Income

The answer is simply that the offers are too expensive to be viable for these people. The choice for the lowest earners is especially clear, as the Difficulty of most of the offers is more than their average spend. Factoring in how the rewards are given also explains why Discount and Informational offers are most popular: the total spend required is the lowest.

Membership Length didn’t seem to have much effect on average spend but as Table 3 shows, women generally spend more. This would also explain why a greater proportion of women complete more offers than men (Fig. 4).

Table 3 — Average Spend by Gender

What doesn't quite make sense is that these offers are seemingly designed to miss one of the most active demographic sections of the member base, as Fig. 6 shows. This is most apparent in the Annual Income demographic, which sees very little difference in activity levels between low and mid income earners.

Conclusions

My base task was to combine transaction, demographic and offer data to determine which demographic groups respond best to which offer type. We found that women not only spend more but are much more likely to respond to an offer than their male counterparts. Informational offers are most successful with the youngest members and lower earners. We also found that Discount offers are the most likely to prove successful in all situations, due to the lower overall spend required and the longer duration of validity. No offer/regular transactions are still by far the most popular method of purchase.

Side Note:

Instead of mixing all of the data together, having clear periods separating each offer could help show clearer differences. This would especially help with 'No Offer': although in some weeks a person may not be offered a new one, there may still be a valid offer running into that time period, which makes it hard to compare regular transaction patterns to 'no offer' patterns.

3. Offer Prediction

Aim

I wanted to know whether, given any person, it was possible to predict:

1. The success of any offer given to that person

2. The transaction amount for each offer, if successful

My aim was to build a class, as this provides flexible, modular code that simplifies some of the steps in what I wished to build.

Data Selection / Preparation

The first steps were to choose what data and which Machine Learning models I would need to fulfil my aims. As I wanted to work entirely from demographic data, for both aims Age, Income, Gender and Member_length would be the features, with Success and Amount as the prediction targets.

It was important to treat each offer as a separate entity and predict for each one. This meant building 11 models per aim, which highlighted efficiency as an important consideration when selecting which model to use.

Models

  • For Success, it was clear that Logistic Regression was the best method: it is easy to implement and interpret, efficient to train, and directly designed for predicting binary outcomes (a sketch follows below).
  • For Amount, and for very similar efficiency reasons, I chose to use a Linear Regression model.
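
The Success models were built per offer, in the same shape as the Amount snippet shown later in the Modelling section. A minimal sketch, assuming the demographic columns described earlier; the container name success_models is mine, not the project's:

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

success_models = {}  # illustrative container name
for offer_num in offer_list:
    offer_df = pred_df[pred_df['offer_index'] == offer_num]
    X = pd.get_dummies(offer_df[['age', 'income', 'gender', 'member_length']])
    y = offer_df['success']
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)
    clf = LogisticRegression(warm_start=True)  # warm_start chosen in the tuning below
    clf.fit(X_train, y_train)
    success_models[f'offer_{offer_num}'] = clf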

Metrics

It was important to select the right metrics for evaluating my models. For the Success model I chose Accuracy and F1 Score, and for Amount, R² and Mean Squared Error. I selected these for the following reasons:

  • Accuracy — A simple but effective measure of the proportion of correct predictions.
Accuracy = (True_Positives + True_Negatives) /
           (True_Positives + True_Negatives + False_Positives + False_Negatives)
  • F1 Score — Precision would likely be affected by the large number of transactions in the dataset, but I didn't want to ignore it. Using Recall and Precision separately becomes a balancing act: as one goes up, the other often comes down. I therefore chose the Harmonic Mean of both, the F1 Score.
Precision = True_Positives / (True_Positives + False_Positives)
Recall = True_Positives / (True_Positives + False_Negatives)
F1 = 2 * Precision * Recall / (Precision + Recall)

  • R² — or Coefficient of Determination — is the proportion of the variance in the dependent variable that is predictable from the independent variables. It's a useful metric in itself and very easy to implement and interpret.

“R² is simply the square of the sample correlation coefficient between the observed outcomes and the observed predictor values.” — Devore, Jay L. (2011). Probability and Statistics for Engineering and the Sciences

R² = 1 − u / v
u = ((y_true - y_pred) ** 2).sum()
v = ((y_true - y_true.mean()) ** 2).sum()
— From the Sklearn Linear Regression documentation
  • Mean Squared Error (M.S.E.) — Takes the average squared distance between the predicted values and the actual values. It's easy to implement, easy to interpret, and popular as a result.
MSE = ((y_true - y_pred) ** 2).sum() / N
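
All four metrics are one-liners in Sklearn, which keeps evaluation cheap across many models. A minimal sketch with illustrative variable names:

from sklearn.metrics import (accuracy_score, f1_score,
                             mean_squared_error, r2_score)

# Success (classification) metrics
accuracy = accuracy_score(y_test, clf.predict(X_test))
f1 = f1_score(y_test, clf.predict(X_test))

# Amount (regression) metrics
r2 = r2_score(y_test_amt, reg.predict(X_test_amt))
mse = mean_squared_error(y_test_amt, reg.predict(X_test_amt))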

Scaling

Whereas Logistic Regression does not require assumptions of normality, in Linear Regression the coefficients are partly determined by the scale of each feature, so features must be normalised prior to training. For this reason I used the StandardScaler provided by the Sklearn library before building the Amount models.

Parameters

Using GridSearchCV from the Sklearn library, I aimed to refine and optimise my models' hyper-parameters where possible. I passed in lists of candidate parameters, and GridSearchCV tried them all and returned the best parameters along with a score: the mean cross-validated score of the best model.
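
A minimal sketch of one such search for a Success model; the grid values here are illustrative rather than the exact lists I tried (those are in Tables 4 and 5):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

param_grid = {
    'C': [0.01, 0.1, 1, 10],
    'solver': ['lbfgs', 'liblinear', 'saga'],
    'warm_start': [True, False],
}
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)  # mean cross-validated score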

Below, Tables 4 and 5 show a couple of the results from this testing.

Table 4 — Success Model Parameter Optimising

For Success, not all of the scores were improved. The parameters in the lower half of Table 4 were considered the best, with a mix of the C values and solvers you see above and a global warm_start = True. All other hyper-parameters were kept at their defaults.

Side Note: This does optimise the model in terms of its mean cross-validated score; however, it made little to no difference to the overall Accuracy and F1 Scores.

Table 5 — Amount Model Parameter Optimising

Perhaps surprisingly, as Table 5 shows, the Amount score was largely unchanged even with the majority of parameters being tested. Though fit_intercept changed occasionally, further investigation found little evidence of any effect; whatever change there is must be small enough to be insignificant. The main takeaway from the tests was that setting a global parameter of n_jobs = -1 made a very small but positive difference. All other hyper-parameters were kept at their defaults.

Side Note: This, unsurprisingly, had no effect on the later R² and MSE scores.

Modelling

After selecting and refining my models, I built each model and put them into a dictionary for later use. In the absence of 'unsuccessful' transactions, globally 'failed' offers were used.

A snippet of the code used to build the Amount prediction models is below.

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

for offer_num in offer_list:
    # Slice out this offer's rows; the trans_id entry holds the
    # 'No Offer' transactions
    offer_dict[f'offer_{offer_num}'] = pred_df[
        pred_df['offer_index'] == offer_num].copy()
    offer_dict[f'offer_{trans_id}'] = df[
        ['offer_index', 'amount', 'gender', 'age', 'income',
         'member_length']].copy()

    # Demographic features, one-hot encoded and scaled
    X = offer_dict[f'offer_{offer_num}'].iloc[:, 2:].copy()
    X = pd.get_dummies(X)
    scaler = StandardScaler()
    x_scaled = scaler.fit_transform(X.values)
    X = pd.DataFrame(x_scaled, columns=X.columns)

    # Target: the transaction amount
    y = offer_dict[f'offer_{offer_num}'].iloc[:, 1]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)
    model = LinearRegression(n_jobs=-1)
    model.fit(X_train, y_train)

Once each model was created, it was added to a dictionary, along with its R² and Mean Squared Error metrics:

    amount_r2.append([offer_num, model.score(X_test, y_test)])
    amount_mse.append([offer_num,
                       mean_squared_error(y_test, model.predict(X_test))])
    model_dict[f'offer_{offer_num}'] = model

Table 6 — Success and Amount Predictions
Table 7 — Offer Recommendations

The Success models were built in a similar way, and afterwards I had two dictionaries of 11 models, one model per available offer for each aim.

The data was then sent to each of the 22 models and the results presented in a data frame like Table 6.
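
A hedged sketch of that step, assuming one dictionary of models per aim and a single scaled demographic row, person_X; the dictionary and variable names are illustrative:

import pandas as pd

rows = []
for name, clf in success_model_dict.items():  # 11 Success models
    reg = amount_model_dict[name]             # the matching Amount model
    rows.append({'offer': name,
                 'pred_success': int(clf.predict(person_X)[0]),
                 'pred_amount': float(reg.predict(person_X)[0])})
results = pd.DataFrame(rows)  # shaped like Table 6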

After taking difficulty into account and removing predicted failures, the resulting recommendations for the above look like Table 7.

Evaluation

With the method I implemented, computing time is excellent. The class makes predicting nice and easy, with the majority of the functions kept in a separate file. This will make it much easier to add extra functionality, should I wish to at a later date.

Table 8 — Prediction Model Coefficients

Table 8 shows the model coefficients for all 22 models.

For Success, the most important feature is shown to be Age, largely in a negative sense. Member Length comes second, in a positive way. Perhaps most interesting is that Income, though positive, has very little influence on the model's decision.

For Amount, and contrary to Success, the most important feature by a big margin is Income. This is perhaps to be expected after looking at Fig. 5 earlier. Next are Age and the Female gender, which again fit our expectations from earlier; both have a positive relationship with amount spent.

Table 9 — Success F1 and Accuracy Scores, and Amount R² and Mean Squared Error

The score metrics from earlier are combined into the handy table you see here (Table 9). With an average Accuracy of 69% and an F1 score of 0.64, the Success models turned out well.

The R² and Mean Squared Error of the Amount models, however, aren't quite as good. To improve these metrics alone, I suspect a neural network approach would be best, or increasing the number of demographic features available. It could also perhaps be achieved with an ensemble method, but with 22 models that becomes computationally heavy, and so wasn't attempted.

Figure 7 — Individual Data vs Prediction

Taking an individual approach, Fig. 7 highlights a rather nice prediction, with all values around the $20 mark. This is a great representation of what the model can do, and highlights where some of the accuracy might be getting lost.

Figure 8 — Total Successes vs Predicted

Fig. 8 and Fig. 9 show the result of predicting for every person in the dataset, to show where inaccuracies lie. It is worth mentioning that the Member Length plot in each is a histogram, as opposed to a bar graph.

For Success, the general trend of each predicted response largely follows the data. The prediction is a tad optimistic for the Gender responses, but the differences between the genders are about right. Membership Length is largely correct, and although there are discrepancies for younger members, the accuracy is high from age 50 upwards. Income is the furthest from the mark: with the exception of around $75,000, the rest is either under- or over-predicted. As the coefficients put little to no weight on Income, this is not too worrying a find.

Figure 9 — Average Spend vs Predicted

For Amount (Fig. 9), the predictions are somewhat similar, in that the general trend of each follows the data. Membership Length successfully follows its downward trend. Gender remains optimistic for women, with men being under-predicted. Age sees a reversal in accuracy compared with Fig. 8, with the younger members most accurately predicted. Income is still struggling for accuracy, although there is a marked improvement for low to mid earners.

Conclusions

The models themselves run beautifully, with very low computing time. For the limited number of features present, I was pleased with the outcome.

Success accuracy by metric shows promise, and the rough trends the models predict are in line enough that using this as a model for sending offers to members would be worth testing.

On the face of it, I wouldn't fully trust the Amount model as an accurate representation of spend, though it is roughly in line with most of the trends. As a method for ranking successfully predicted offers, however, it could prove very useful.

Side Note

To add some spice to my class, I added the ability to add a completely new person by entering their demographics. Without testing, though, or spending a good deal of time exploring this, it doesn't add much (yet). Along with this, some general graph-creation functions were written that produced a number of the figures shown here.

Final Conclusions

My base task was to process offer data to determine which demographic groups respond best to which offer type. This I completed, finding that whilst some demographics (e.g. women) were more likely to respond well, others (e.g. low earners) were more likely to be disadvantaged by the difficulty of the offers.

I challenged myself further and predicted how successful offers would be for a given person, based solely on their demographic data. This was largely successful, with a robust Success model utilising the Amount model to predict which offers would be more successful and to rank them in order of potential spending. Both of these models would likely be improved with a greater number of features, but with the limited demographic data present, I am pleased with the result.

Personal Reflection

There are a myriad of improvements I would suggest for the data collection. It would be great to know whether members get a notification when offers are delivered, what items are being discounted, which items are getting bought, the prices of said items, and so on. I am not naïve, though, and know that messy and incomplete data is part and parcel of data science.

For my own models, I would like to try and implement a neural network approach but time and computing limitations will put this on the back burner for a little while.

If I were to continue on from this, I would love to find out which part of an offer members respond to most, and try to design the most effective offer possible. I would also like to deploy this and make a shiny dashboard, but sinking close to a week into imputing NaNs that I ended up dropping really hampered my schedule.
