Amazon Purchase History

3 minute read

Published:

Analyzing my Amazon Purchase History

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('seaborn')
from datetime import datetime
%matplotlib notebook

Read in raw data, break out day/month/year, fix currency strings

  • Note use of .strftime("%B") to get month names
  • .apply() with a lambda that returns a pd.Series to generate multiple new columns at once
  • fix ‘Item Total’ column to make dollar amounts floats
df = pd.read_csv('data/amazon-redacted.csv')
df[['year','month','month_str','day']] = df.apply(lambda row: pd.Series([
    datetime.strptime(row['Order Date'], '%m/%d/%y').year,
    datetime.strptime(row['Order Date'], '%m/%d/%y').month,
    datetime.strptime(row['Order Date'], '%m/%d/%y').strftime('%B'),
    datetime.strptime(row['Order Date'], '%m/%d/%y').day
]),
                                     axis=1)
df['Item Total'] = \
    df['Item Total'].replace('[\$,)]','', regex=True).astype(float)
df.head()
Order DateCategoryWebsiteConditionList Price Per UnitPurchase Price Per UnitQuantityItem SubtotalItem Subtotal TaxItem TotalTax Exemption AppliedTax Exemption TypeExemption Opt-OutCurrencyGroup Nameyearmonthmonth_strday
01/5/19Health and BeautyAmazon.comnew$10.25$9.701$9.70$0.5810.28NaNNaNNaNUSDNaN20191January5
11/5/19NaNAmazon.comnew$11.99$11.991$11.99$0.0011.99NaNNaNNaNUSDNaN20191January5
21/5/19Misc.Amazon.comnew$0.00$6.001$6.00$0.006.00NaNNaNNaNUSDNaN20191January5
31/5/19Misc.Amazon.comnew$0.00$14.521$14.52$0.0014.52NaNNaNNaNUSDNaN20191January5
41/5/19Tools & Home ImprovementAmazon.comnew$0.00$43.991$43.99$0.0043.99NaNNaNNaNUSDNaN20191January5

Define a set of useful column names in a good order

useful=['year', 'month_str', 'day', 'Category', 'Item Total']
df[useful].head()
yearmonth_strdayCategoryItem Total
02019January5Health and Beauty10.28
12019January5NaN11.99
22019January5Misc.6.00
32019January5Misc.14.52
42019January5Tools & Home Improvement43.99

At this point I can plot my monthly purchase total

datetime.strptime('1',"%m").strftime('%B')
'January'
month_totals = df.groupby('month').sum().reset_index()

fig, ax = plt.subplots(figsize=(8,6), tight_layout=True)
sns.barplot(x='month',
            y='Item Total',
            facecolor='xkcd:cadet blue',
            data=month_totals,
            ax=ax)
plt.ylabel('Total Amazon Orders ($)')
ax.set_xticklabels(month_totals['month'].apply(lambda m: datetime.strptime(str(m),"%m").strftime('%B')));
<IPython.core.display.Javascript object>

What if this is skewed by large purchases?

Use .groupby() and .idxmax() to find most expensive item purchased each month

df.loc[df.groupby('month').idxmax()['Item Total'], useful]
yearmonth_strdayCategoryItem Total
42019January5Tools & Home Improvement43.99
352019February3Health and Beauty29.99
602019March18Baby Product105.95
832019April27Tools & Home Improvement60.60
1002019May27Kitchen41.99
1082019June2Paperback100.28
1782019July31Baby Product211.99
1862019August3Electronics185.49
2282019September6Baby Product205.64
df_low = df[df['Item Total']<100]
adjusted_month_totals = df_low.groupby('month').sum().reset_index()

fig, ax = plt.subplots(figsize=(8,6), tight_layout=True)
sns.barplot(x='month',
            y='Item Total',
            facecolor='xkcd:cadet blue',
            data=adjusted_month_totals,
            ax=ax)
plt.ylabel('Total Amazon Orders ($)')
ax.set_xticklabels(month_totals['month'].apply(lambda m: datetime.strptime(str(m),"%m").strftime('%B')));
<IPython.core.display.Javascript object>

That didn’t really change the picture

Just for kicks let’s look at the order histogram for each month

fig, axes = plt.subplots(int(np.ceil(df['month'].nunique()/2)),
                         2,
                         figsize=(10,15),
                         tight_layout=True)
for month,month_str,ax in zip(df['month'].unique(), df['month_str'].unique(), axes.ravel()):
    sns.distplot(df.query('month==@month')['Item Total'],
                 bins=20,
                 ax=ax,
                 kde=False,
                 rug=True)
    plt.sca(ax)
    plt.xlabel('Item Price ($)')
    plt.ylabel('Number of Items')
    plt.title(month_str)
<IPython.core.display.Javascript object>