Unveiling Patterns: A Dive into Market Basket Analysis with Apriori Algorithm

Felix Kiprotich
7 min read · Oct 15, 2023


In the dynamic realm of data science, the quest for valuable insights is perpetual. One such fascinating journey I recently undertook at CodeClause was into the intricate landscapes of Market Basket Analysis using the Apriori Algorithm. Let’s embark on this enlightening expedition together!

Understanding Market Basket Analysis

Market Basket Analysis is a strategic approach that dissects customer purchase patterns, unraveling hidden connections between products. Imagine having the ability to predict what products are likely to be purchased together — this is the essence of Market Basket Analysis.

What is Association Rule Learning?

Association Rule Learning is rule-based learning that identifies associations between different variables in a database. One of the best-known and most popular examples of Association Rule Learning is Market Basket Analysis. The problem analyses the associations between items that have the highest probability of being bought together by a customer.

For example, the association rule {onions, chicken masala} => {chicken} says that a person who has both onions and chicken masala in their basket has a high probability of also buying chicken.

Apriori Algorithm

The algorithm was first proposed in 1994 by Rakesh Agrawal and Ramakrishnan Srikant. Apriori algorithm finds the most frequent itemsets or elements in a transaction database and identifies association rules between the items just like the above-mentioned example.

How does Apriori work?

To construct association rules between elements or items, the algorithm considers three important factors: support, confidence, and lift. Each of these factors is explained as follows:

1. Support:

The support of an item I is defined as the ratio of the number of transactions containing item I to the total number of transactions: support(I) = (transactions containing I) / (total transactions).

Support indicates how popular an itemset is, as measured by the proportion of transactions in which an itemset appears. In Table 1 below, the support of {apple} is 4 out of 8, or 50%. Itemsets can also contain multiple items. For instance, the support of {apple, beer, rice} is 2 out of 8, or 25%.

If you discover that sales of items beyond a certain proportion tend to have a significant impact on your profits, you might consider using that proportion as your support threshold. You may then identify itemsets with support values above this threshold as significant itemsets.
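The support calculation above can be sketched in a few lines of Python. The toy transactions below are illustrative stand-ins for Table 1 (arranged so the {apple} and {apple, beer, rice} supports match the figures quoted above):

```python
# Toy transactions mirroring the Table 1 example (items are illustrative).
transactions = [
    {"apple", "beer", "rice", "chicken"},
    {"apple", "beer", "rice"},
    {"apple", "beer"},
    {"apple", "mango"},
    {"milk", "beer", "rice", "chicken"},
    {"milk", "beer", "rice"},
    {"milk", "beer"},
    {"milk", "mango"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

print(support({"apple"}, transactions))               # 4 of 8 -> 0.5
print(support({"apple", "beer", "rice"}, transactions))  # 2 of 8 -> 0.25
```

Itemsets whose support falls below your chosen threshold would simply be filtered out at this step.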

2. Confidence:

This is measured by the proportion of transactions containing item I1 in which item I2 also appears. The confidence of I1 => I2 is defined as the total number of transactions containing both I1 and I2 divided by the total number of transactions containing I1. (In what follows, read I1 as X and I2 as Y.)

Confidence says how likely item Y is purchased when item X is purchased, expressed as {X -> Y}. This is measured by the proportion of transactions with item X, in which item Y also appears. In Table 1, the confidence of {apple -> beer} is 3 out of 4, or 75%.

One drawback of the confidence measure is that it might misrepresent the importance of an association. This is because it only accounts for how popular apples are, but not beers. If beers are also very popular in general, there will be a higher chance that a transaction containing apples will also contain beers, thus inflating the confidence measure. To account for the base popularity of both constituent items, we use a third measure called lift.
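The confidence measure can be sketched the same way. The toy transactions below are illustrative, arranged so that the {apple -> beer} confidence matches the 75% quoted for Table 1:

```python
# Toy transactions mirroring the Table 1 example (items are illustrative).
transactions = [
    {"apple", "beer", "rice", "chicken"},
    {"apple", "beer", "rice"},
    {"apple", "beer"},
    {"apple", "mango"},
    {"milk", "beer", "rice", "chicken"},
    {"milk", "beer", "rice"},
    {"milk", "beer"},
    {"milk", "mango"},
]

def confidence(X, Y, transactions):
    """Confidence of {X -> Y}: proportion of X-transactions that also contain Y."""
    X, Y = set(X), set(Y)
    has_x = [t for t in transactions if X <= t]
    return sum(1 for t in has_x if Y <= t) / len(has_x)

print(confidence({"apple"}, {"beer"}, transactions))  # 3 of 4 -> 0.75
```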

3. Lift:

Lift is the ratio of the rule's confidence to the support of the consequent: lift(X -> Y) = confidence(X -> Y) / support(Y).

Lift says how likely item Y is purchased when item X is purchased, while controlling for how popular item Y is. In Table 1, the lift of {apple -> beer} is 1, which implies no association between the items. A lift value greater than 1 means that item Y is likely to be bought if item X is bought, while a value less than 1 means that item Y is unlikely to be bought if item X is bought. (Here X represents apple and Y represents beer.)
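Putting the two previous measures together gives lift. On the same illustrative toy transactions, the {apple -> beer} lift works out to exactly 1, matching the "no association" reading above:

```python
# Toy transactions mirroring the Table 1 example (items are illustrative).
transactions = [
    {"apple", "beer", "rice", "chicken"},
    {"apple", "beer", "rice"},
    {"apple", "beer"},
    {"apple", "mango"},
    {"milk", "beer", "rice", "chicken"},
    {"milk", "beer", "rice"},
    {"milk", "beer"},
    {"milk", "mango"},
]

def support(itemset, transactions):
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def lift(X, Y, transactions):
    """Lift of {X -> Y}: confidence(X -> Y) divided by support(Y)."""
    conf = support(set(X) | set(Y), transactions) / support(X, transactions)
    return conf / support(Y, transactions)

# confidence(apple -> beer) = 0.75, support(beer) = 0.75, so lift = 1.0
print(lift({"apple"}, {"beer"}, transactions))  # 1.0
```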

For extra reading on this: Introduction to apriori algorithm

I got this dataset from Kaggle; it contains information about customers buying different grocery items at a supermarket, around 7,501 rows in total. Let's explore the dataset before modeling with the Apriori algorithm.
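For readers following along, loading such a file could look like the sketch below. The real Kaggle CSV and its filename are assumptions here, so a three-row inline sample stands in for it; like the real file, it has no header row, a variable number of items per line, and trailing slots that load as NaN:

```python
import io
import pandas as pd

# Tiny stand-in for the Kaggle grocery file (the real one has ~7,501 rows,
# no header, and up to 20 items per transaction; short rows pad out as NaN).
sample_csv = io.StringIO(
    "shrimp,almonds,avocado\n"
    "burgers,meatballs,eggs\n"
    "chutney\n"
)
df = pd.read_csv(sample_csv, header=None)
print(df.shape)               # (3, 3)
print(df.isna().sum().sum())  # 2 empty slots on the 'chutney' row
```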

How does the Apriori algorithm work?

First, I’ll keep all of the “missing values” for this analysis and handle them at the end, once we have the result. The first step relies on the Apriori principle, that all subsets of a frequent itemset must themselves be frequent, and builds a frequency table over the transactions.

# collect every transaction's items into a list of lists
import pandas as pd

list_all_item = []
for i in range(0, 7501):
    # NaN cells become the string 'nan' here; they are handled at the end
    list_all_item.append([str(df.values[i, j]) for j in range(0, 20)])

# create a frequent items (dataframe) with Transaction Encoder
from mlxtend.preprocessing import TransactionEncoder
te = TransactionEncoder()
te.fit(list_all_item)
data = te.transform(list_all_item)
data = pd.DataFrame(data, columns = te.columns_)
data.head().T

The next step is to build the Apriori model. You can change all of the parameters of the apriori function (here from the apyori package, whose rule objects we unpack further down). I'll use the min_support parameter for this model: only itemsets whose support is greater than the threshold will show up in the result.

# Call the apriori function, which takes minimum support, confidence and lift;
# min_length is the minimum number of items per rule (default is 2).
from apyori import apriori

rules = apriori(list_all_item, min_support=0.003, min_confidence=0.2, min_lift=3, min_length=2)

## min_support = 0.003 -> means selecting items with min support of 0.3%
## min_confidence = 0.2 -> means min confidence of 20%
## min_lift = 3
## min_length = 2 -> means no. of items in the rule should be at least 2

# it generates the set of rules lazily, as a generator object...
rules

# all rules need to be converted in a list..
Results = list(rules)
Results

# convert result in a dataframe for further operation...
df_results = pd.DataFrame(Results)

# as we see, "ordered_statistics" is itself a list, so it needs to be converted into a proper format..
df_results.head()

Final Association Rules result

The association rules result above contains many missing values, and I can't combine an itemset with missing values. So, I'll handle the missing values for each itemset in the antecedents and consequents. Both of them come back as frozensets.

The frozenset() constructor returns an immutable frozenset object initialized with elements from the given iterable.
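As a quick illustration of why the conversion below is needed, a frozenset supports membership tests like a normal set but cannot be modified, so turning it into a list is the usual way to get it into a dataframe-friendly shape (the items here are just examples):

```python
fs = frozenset(["eggs", "mineral water"])

print("eggs" in fs)        # membership works like a normal set -> True
print(sorted(fs))          # convert to a (sorted) list for readable output

# frozensets are immutable: they expose no add/remove methods at all
print(hasattr(fs, "add"))  # False
```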

# keep support in a separate data frame so we can use later.. 
support = df_results.support

'''
Convert ordered_statistics into a proper format.
ordered_statistics holds lhs => rhs as well as rhs => lhs;
we can choose either one for convenience.
Let's choose the first one, which is df_results['ordered_statistics'][i][0].
'''

# four empty lists which will hold lhs, rhs, confidence and lift respectively.
first_values = []
second_values = []
third_values = []
fourth_value = []

# loop over every row, appending the values one by one to the separate lists;
# the first and second elements are frozensets, which need to be converted to lists.
for i in range(df_results.shape[0]):
    single_list = df_results['ordered_statistics'][i][0]
    first_values.append(list(single_list[0]))
    second_values.append(list(single_list[1]))
    third_values.append(single_list[2])
    fourth_value.append(single_list[3])

# convert all four lists into dataframes for further operations
lhs = pd.DataFrame(first_values)
rhs = pd.DataFrame(second_values)

confidence = pd.DataFrame(third_values, columns=['confidence'])

lift = pd.DataFrame(fourth_value, columns=['lift'])


# concatenate everything into a single dataframe
df_final = pd.concat([lhs, rhs, support, confidence, lift], axis=1)
df_final


From the result, the lift of the association rule "if ground beef & eggs then mineral water" is 2.12, and its confidence is 50% (the highest confidence for this dataset). This means that consumers who purchase ground beef and eggs are 2.12 times more likely to also purchase mineral water than a randomly chosen customer. A larger lift means a more interesting rule. Rules with high support are potentially interesting too, and likewise rules with high confidence. So draw your conclusions from the rules that score highest on all three parameters.
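One simple way to surface those rules programmatically is to sort the final dataframe. The sketch below uses a hand-made stand-in for df_final; only the ground beef & eggs rule comes from the result above, and the other rows carry purely illustrative numbers:

```python
import pandas as pd

# Stand-in for df_final; only the first row's figures come from the article,
# the rest are illustrative.
rules = pd.DataFrame({
    "lhs": ["ground beef, eggs", "pasta", "herb & pepper"],
    "rhs": ["mineral water", "escalope", "ground beef"],
    "support": [0.004, 0.006, 0.016],
    "confidence": [0.50, 0.37, 0.32],
    "lift": [2.12, 4.70, 3.29],
})

# Rank the rules: highest lift first, with confidence as a tie-breaker.
top = rules.sort_values(["lift", "confidence"], ascending=False)
print(top)
```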

RESULTS AND APPLICATION:

Cross Selling could be improved by bundling up the items/ products

Promotional activities could be run to improve the sales of goods that customers rarely buy.

Store layout could be modified so the sales could be improved when certain items/ products are kept together.

That is all from me. I hope you can take away some insight from this dataset. There are still many mistakes and shortcomings in every model that I build. For more detail about this data, the code, and more visualizations, you can reach my GitHub by following this link. Feel free to ask questions, and let's start a discussion, guys!

Thank you, I hope you enjoyed it, guys. See you in the next stories. Have a nice day! :)
