Think of a grocery app like BigBasket. You plan your grocery essentials and list out a few things you intend to buy, but by checkout you end up buying more than you planned. This is called impulse buying, and brands like BigBasket and Grofers use the Apriori algorithm to leverage this phenomenon.
The Apriori algorithm is a concept from data mining. Data mining allows users to analyze, classify, and discover correlations among data, and the Apriori algorithm is tied to a crucial data mining task called Association Rule Learning.
What is Association Rule Learning?
Association Rule Learning, or ARL, is a data mining technique for discovering associations among variables in large databases. Given a collection of transactions, ARL looks for rules of the form "if a customer buys X, they are likely to buy Y", surfacing relationships that would be hard to spot manually. It is related to, but distinct from, anomaly detection, which looks for items or events that do not correspond to a familiar pattern.
What is the Apriori Algorithm?
The Apriori algorithm, also known as level-wise search, is an iterative approach applied to a database of records, particularly transactional records or records made up of a fixed set of fields or items. Apriori is a classic algorithm that follows a bottom-up approach, incrementally building larger frequent itemsets from smaller ones. It is among the most widely used algorithms for association rule mining in today's world of machine learning and artificial intelligence.
The Apriori algorithm consists of two primary steps:
- Self-join
- Pruning
Repeat these two steps k times; after the k-th iteration, you get the frequent itemsets containing k items. Another point to remember is that the Apriori algorithm follows a sequential, level-by-level approach.
Example – take the BigBasket app again. Using the transaction data it collects, BigBasket can figure out which items are frequently purchased together.
The Apriori algorithm is commonly described through its pseudocode, which is widely available online. In practice, it is used in collaboration with other algorithms to sort and filter data effectively, giving a better picture of the patterns and trends a complex system exhibits.
Another example: an e-commerce website may learn from its data that a customer who purchases a keyboard is also likely to buy a mouse at the same time. For this, the association rule Keyboard → Mouse can be formed and evaluated with the following measures:
Support: the percentage of task-relevant data transactions for which the pattern is true.

Support(Keyboard → Mouse) = (Number of transactions containing both Keyboard and Mouse) / (Total number of transactions)

Confidence: the measure of certainty or trustworthiness associated with each discovered pattern.

Confidence(Keyboard → Mouse) = (Number of transactions containing both Keyboard and Mouse) / (Number of transactions containing Keyboard)
Thus, the algorithm focuses on finding the rules that satisfy both a minimum support threshold and a minimum confidence threshold; such rules are called strong rules.
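Here is a minimal Python sketch of these two measures. The transaction list is a hypothetical illustration, not data from the article:

```python
# Hypothetical transactions, each represented as a set of items.
transactions = [
    {"keyboard", "mouse", "monitor"},
    {"keyboard", "mouse"},
    {"keyboard", "headset"},
    {"mouse", "mousepad"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent):
    """support(antecedent ∪ consequent) / support(antecedent)."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"keyboard", "mouse"}))       # 2 of 4 transactions -> 0.5
print(confidence({"keyboard"}, {"mouse"}))  # 2 of 3 keyboard buyers -> ~0.67
```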
- Item: an article in the basket.
- Itemset: a group of items purchased together in a single transaction.
The working of the Apriori algorithm
- Find all frequent itemsets:
  - Get frequent items: the items whose occurrence in the database is greater than or equal to the minimum support threshold.
  - Get frequent itemsets: generate candidates from the frequent items, then prune the results to keep only the frequent itemsets.
- Generate strong association rules from the frequent itemsets: the rules that satisfy both the minimum support and minimum confidence thresholds.
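To make the level-wise loop concrete, here is a minimal, self-contained Python sketch (self-join, prune, count, repeat). The basket data and `min_count` value are illustrative; they anticipate the worked example later in this article, and this is a teaching sketch rather than an optimized implementation:

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Level-wise search: self-join, prune, count, repeat."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count individual items and keep the frequent ones.
    item_counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            item_counts[key] = item_counts.get(key, 0) + 1
    current = {s for s, c in item_counts.items() if c >= min_count}

    frequent = set(current)
    k = 2
    while current:
        # Self-join: combine (k-1)-itemsets into size-k candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in current
                             for sub in combinations(c, k - 1))}
        # Count the surviving candidates with one scan of the database.
        current = {c for c in candidates
                   if sum(1 for t in transactions if c <= t) >= min_count}
        frequent |= current
        k += 1
    return frequent

baskets = [{"M","O","N","K","E","Y"}, {"D","O","N","K","E","Y"},
           {"M","A","K","E"}, {"M","U","C","K","Y"}, {"C","O","K","I","E"}]
print(sorted("".join(sorted(s)) for s in apriori(baskets, min_count=3)))
```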
[Flowchart of the above workflow; image not reproduced here.]
Key Concepts
- Frequent itemsets: all itemsets whose support is at least the minimum support (the set of frequent k-itemsets is denoted L_k).
- Apriori property: any subset of a frequent itemset must itself be frequent.
- Join operation: to find L_k, a set of candidate k-itemsets is generated by joining L_{k-1} with itself.
Example of Apriori algorithm
Consider a transaction database D with 4 transactions (T1, T2, T3, and T4).
Let minimum support = 2 (a support count of 2 of the 4 transactions, i.e., 50%)
Let minimum confidence = 50%
Now, let's generate the association rules. After mining the frequent itemsets, rules are generated from each of them; for a frequent itemset such as {A, C}, the candidate rules are A → C and C → A.

To find the confidence of a rule:

Confidence(A → C) = Support({A, C}) / Support({A})

To convert confidence into a percentage (confidence%), multiply the confidence by 100.
[For example, confidence%(A → C) = 0.66 × 100 = 66%.]

Now compare the confidence% of each association rule with the minimum confidence threshold (50%). Since the confidence% of both association rules exceeds the minimum confidence threshold, both are accepted as final (strong) rules.
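Here is a hedged sketch of this rule-generation step. Since the original transaction table is not reproduced above, the database below is a hypothetical stand-in chosen to be consistent with the numbers quoted (A → C at 66%, and both rules passing the 50% threshold):

```python
from itertools import combinations

# Hypothetical 4-transaction database consistent with the example above.
transactions = [{"A","C"}, {"A","B"}, {"A","C","D"}, {"B","E"}]

def support(itemset):
    """Fraction of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def rules_from(itemset, min_conf=0.5):
    """Yield every rule antecedent -> consequent meeting min_conf."""
    itemset = frozenset(itemset)
    for r in range(1, len(itemset)):
        for lhs in map(frozenset, combinations(itemset, r)):
            conf = support(itemset) / support(lhs)
            if conf >= min_conf:
                yield lhs, itemset - lhs, conf

for lhs, rhs, conf in rules_from({"A", "C"}):
    print(set(lhs), "->", set(rhs), f"confidence = {conf:.0%}")
# Prints both rules, e.g. {'A'} -> {'C'} confidence = 67%
#                     and {'C'} -> {'A'} confidence = 100%
```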
Was that example too technical?
Don't worry! Even an undergraduate should be able to follow this without any problem. Let's make it simple, without any terminology or jargon.
Let's start with a concrete example:
| Transaction ID | Items Bought |
| --- | --- |
| T1 | {Mango, Onion, Nintendo, Keychain, Eggs, Yoyo} |
| T2 | {Doll, Onion, Nintendo, Keychain, Eggs, Yoyo} |
| T3 | {Mango, Apple, Keychain, Eggs} |
| T4 | {Mango, Umbrella, Corn, Keychain, Yoyo} |
| T5 | {Corn, Onion, Onion, Keychain, Icecream, Eggs} |
Now, we follow a simple golden rule: we say an item/itemset is frequently bought if it appears in at least 60% of the transactions. So here it must appear in at least 3 of the 5 transactions.
For simplicity, abbreviate each item to its first letter:
M = Mango
O = Onion
...and so on.
So the original table becomes:

| Transaction ID | Items Bought |
| --- | --- |
| T1 | {M, O, N, K, E, Y} |
| T2 | {D, O, N, K, E, Y} |
| T3 | {M, A, K, E} |
| T4 | {M, U, C, K, Y} |
| T5 | {C, O, O, K, I, E} |
Step 1: Count the number of transactions in which each item occurs. Note that 'O = Onion' is bought 4 times in total, but it occurs in just 3 transactions.
| Item | Number of transactions |
| --- | --- |
| M | 3 |
| O | 3 |
| N | 2 |
| K | 5 |
| E | 4 |
| Y | 3 |
| D | 1 |
| A | 1 |
| U | 1 |
| C | 2 |
| I | 1 |
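Step 1 translates directly into code. A small sketch using the baskets above; representing each basket as a set is what makes the two onions in T5 count only once:

```python
from collections import Counter

# Count the number of *transactions* containing each item.
baskets = [{"M","O","N","K","E","Y"}, {"D","O","N","K","E","Y"},
           {"M","A","K","E"}, {"M","U","C","K","Y"},
           {"C","O","K","I","E"}]   # T5: the duplicate O collapses in a set
counts = Counter(item for basket in baskets for item in basket)
print(counts)   # Counter({'K': 5, 'E': 4, 'M': 3, 'O': 3, 'Y': 3, ...})
```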
Step 2: Now remember, an item is said to be frequently bought if it is bought at least 3 times. So in this step we remove all the items that are bought fewer than 3 times from the above table, and we are left with:
| Item | Number of transactions |
| --- | --- |
| M | 3 |
| O | 3 |
| K | 5 |
| E | 4 |
| Y | 3 |
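The filtering in Step 2 is one line of Python, shown here with the counts from the table above:

```python
# Keep only the items that meet the 60% golden rule (3 of 5 transactions).
counts = {"M": 3, "O": 3, "N": 2, "K": 5, "E": 4, "Y": 3,
          "D": 1, "A": 1, "U": 1, "C": 2, "I": 1}
frequent_items = {item: n for item, n in counts.items() if n >= 3}
print(frequent_items)   # {'M': 3, 'O': 3, 'K': 5, 'E': 4, 'Y': 3}
```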
These are the single items that are bought frequently. Now let’s say we want to find a pair of items that are bought frequently. We continue from the above table (Table in step 2)
Step 3: We start making pairs from the first item: MO, MK, ME, MY. Then we pair up the second item: OK, OE, OY. We skip OM because we already formed MO when making pairs with M, and buying a mango and an onion together is the same as buying an onion and a mango together. After making all the pairs we get:
Item pairs: MO, MK, ME, MY, OK, OE, OY, KE, KY, EY
Step 4: Now we count how many times each pair is bought together. For example, M and O are bought together in just one transaction, {M,O,N,K,E,Y}, while M and K are bought together in three: {M,O,N,K,E,Y}, {M,A,K,E}, and {M,U,C,K,Y}.
After doing that for all the pairs we get:
| Item pairs | Number of transactions |
| --- | --- |
| MO | 1 |
| MK | 3 |
| ME | 2 |
| MY | 2 |
| OK | 3 |
| OE | 3 |
| OY | 2 |
| KE | 4 |
| KY | 3 |
| EY | 2 |
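Steps 3 and 4 in code: form every unordered pair of the frequent single items, then count the transactions containing both members of each pair. The output matches the table above:

```python
from itertools import combinations

frequent_items = ["M", "O", "K", "E", "Y"]
baskets = [{"M","O","N","K","E","Y"}, {"D","O","N","K","E","Y"},
           {"M","A","K","E"}, {"M","U","C","K","Y"}, {"C","O","K","I","E"}]

# combinations() yields each unordered pair once: MO, MK, ..., EY.
pair_counts = {
    "".join(pair): sum(1 for b in baskets if set(pair) <= b)
    for pair in combinations(frequent_items, 2)
}
print(pair_counts)   # {'MO': 1, 'MK': 3, ..., 'KE': 4, 'KY': 3, 'EY': 2}
```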
Step 5: The golden rule to the rescue. Remove all the item pairs bought together in fewer than three transactions, and we are left with:
| Item pairs | Number of transactions |
| --- | --- |
| MK | 3 |
| OK | 3 |
| OE | 3 |
| KE | 4 |
| KY | 3 |
These are the pairs of items frequently bought together.
Now let's say we want to find the sets of three items that are bought together.
We use the above table (the table in Step 5) and make sets of 3 items.
Step 6: To make the sets of three items we need one more rule (it's termed self-join).
It simply means: from the item pairs in the above table, we find two pairs with the same first letter, so we get
- OK and OE, giving OKE
- KE and KY, giving KEY
Then we count how many times O, K, E are bought together in the original table, do the same for K, E, Y, and we get the following table:
| Item set | Number of transactions |
| --- | --- |
| OKE | 3 |
| KEY | 2 |
While we are on this: suppose you have sets of 3 items, say ABC, ABD, ACD, ACE, and BCD, and you want to generate item sets of 4 items. You look for two sets having the same first two letters:
- ABC and ABD → ABCD
- ACD and ACE → ACDE
And so on. In general, you look for sets that differ only in their last letter/item, as the sketch below shows.
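A minimal sketch of this self-join rule. It assumes each itemset string lists its items in a consistent order, matching the tables above:

```python
from itertools import combinations

# Two k-itemsets join only when they agree on everything except
# the last item; the joined set appends that differing last item.
def self_join(itemsets):
    return [a + b[-1] for a, b in combinations(itemsets, 2)
            if a[:-1] == b[:-1]]

print(self_join(["OK", "OE", "KE", "KY"]))             # ['OKE', 'KEY']
print(self_join(["ABC", "ABD", "ACD", "ACE", "BCD"]))  # ['ABCD', 'ACDE']
```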
Step 7: So we again apply the golden rule: the item set must be bought together at least 3 times. That leaves us with just OKE, since K, E, Y are bought together just two times.
Thus, the set of three items bought together most frequently is {O, K, E}.
Applications of Apriori Algorithm
- Data analysis
- Data classification
- Cross-marketing
- Clustering
- Catalogue design
- Loss-leader analysis, etc.
Limitations of using the Apriori algorithm
- The Apriori algorithm can be very slow, and the bottleneck is candidate generation.
- For example, if the transaction database has 10^4 frequent 1-itemsets, it will generate around 10^7 candidate 2-itemsets even after employing downward closure.
- To compute the itemsets with support greater than the minimum support, the database needs to be scanned at every level. This requires (n + 1) scans, where n is the length of the longest pattern.
Methods to improve Apriori’s efficiency
- Hash-based itemset counting: a k-itemset whose corresponding hash bucket count is below the threshold cannot be frequent.
- Transaction reduction: a transaction that does not contain any frequent k-itemset is useless in subsequent scans (see the sketch after this list).
- Partitioning: an itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB.
- Sampling: mining on a subset of the given data, with a lower support threshold plus a method to determine completeness.
- Dynamic itemset counting: add new candidate itemsets only when all of their subsets are estimated to be frequent.
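As an illustration, here is a hedged sketch of transaction reduction, one of the improvements listed above. The basket data and frequent pairs are illustrative:

```python
from itertools import combinations

# A transaction containing no frequent k-itemset cannot contain any
# frequent (k+1)-itemset, so it can be dropped from later scans.
def reduce_transactions(transactions, frequent_k, k):
    return [t for t in transactions
            if any(frozenset(c) in frequent_k
                   for c in combinations(t, k))]

baskets = [{"M","O","N","K","E","Y"}, {"M","A","K","E"}, {"D","I"}]
frequent_2 = {frozenset(p) for p in [("M","K"), ("O","K"), ("O","E"),
                                     ("K","E"), ("K","Y")]}
print(reduce_transactions(baskets, frequent_2, 2))  # {'D','I'} is dropped
```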
Advantages of Apriori
- Uses the large itemset property
- Easily parallelized
- Easy to implement
Disadvantages of Apriori
- Assumes the transaction database is memory-resident.
- Requires many database scans.
Final Remarks
In this technical tutorial on data science, we have learned about association rule mining as implemented by the Apriori algorithm. We have also worked through an example of the Apriori algorithm and seen how it operates step by step. The Apriori algorithm can also be used from Python to perform market basket analysis, as the sketch below illustrates.
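For a practical starting point, here is a sketch using the third-party mlxtend library (install with `pip install mlxtend pandas`), which provides an Apriori implementation commonly used for market basket analysis; the basket data is the toy example from this article, and the exact output columns may vary between mlxtend versions:

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

baskets = [["Mango", "Onion", "Nintendo", "Keychain", "Eggs", "Yoyo"],
           ["Doll", "Onion", "Nintendo", "Keychain", "Eggs", "Yoyo"],
           ["Mango", "Apple", "Keychain", "Eggs"],
           ["Mango", "Umbrella", "Corn", "Keychain", "Yoyo"],
           ["Corn", "Onion", "Keychain", "Icecream", "Eggs"]]

# One-hot encode the baskets into a boolean DataFrame.
encoder = TransactionEncoder()
df = pd.DataFrame(encoder.fit(baskets).transform(baskets),
                  columns=encoder.columns_)

frequent = apriori(df, min_support=0.6, use_colnames=True)  # the 60% rule
rules = association_rules(frequent, metric="confidence", min_threshold=0.5)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```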