All you need to know about the Apriori algorithm

Think of a grocery app like BigBasket. You plan your grocery essentials and list out a few things you intend to buy, but at checkout you end up buying more than you planned! This is called impulse buying, and brands like BigBasket and Grofers use the Apriori algorithm to leverage this phenomenon.

The Apriori algorithm comes from the field of data mining. Data mining allows users to analyze, classify, and discover correlations among data. Apriori addresses a crucial data-mining task called Association Rule Learning.

What is Association Rule Learning?

Association Rule Learning, or ARL, is a rule-based data mining method for discovering interesting relationships between variables in large databases. ARL looks for items or events that frequently occur together; such regular patterns transcribe critical and actionable information in many application fields. In short, ARL looks for associations among variables.

What is the Apriori Algorithm?

The Apriori algorithm, also known as level-wise search, is an iterative approach applied to transactional databases (records consisting of a particular number of fields or items). Apriori is a classic algorithm that follows a bottom-up approach, extending frequent itemsets one item at a time. It is one of the most widely used algorithms for association rule mining in machine learning and artificial intelligence.

The Apriori algorithm consists of two primary steps:

  1. Self-join
  2. Pruning

These two steps are repeated level by level: after the k-th iteration you get the frequent itemsets containing k items. Another point to remember is that the Apriori algorithm follows a sequential (breadth-first) approach.

Example – Take the BigBasket app. Using its transaction data, one can figure out which items are frequently purchased together.

The Apriori algorithm is usually described through its pseudocode, which is widely available online. In practice, it is used in collaboration with other algorithms to sort and compare data effectively, giving a clearer picture of the patterns and trends in a complex system.

Another example: from the information gathered on an e-commerce website, a customer who purchases a keyboard is also likely to buy a mouse at the same time. This relationship is expressed as the association rule Keyboard -> Mouse, measured as follows:

Support: The percentage of task-relevant data transactions for which the pattern is true.

Support (Keyboard -> Mouse) = (transactions containing both Keyboard and Mouse) / (total transactions)

Confidence: The measure of certainty or trustworthiness associated with each discovered pattern.

Confidence (Keyboard -> Mouse) = (transactions containing both Keyboard and Mouse) / (transactions containing Keyboard)

Thus, the algorithm focuses on finding the rules that satisfy both a minimum support threshold and a minimum confidence threshold; such rules are called strong rules.

  • Item: an article in the basket.
  • Itemset: a group of items purchased together in a single transaction.
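
To make these two measures concrete, here is a minimal sketch in Python; the toy transactions and helper names below are illustrative, not part of the original example:

```python
# Minimal sketch: computing support and confidence for the rule
# Keyboard -> Mouse over a toy list of transactions (illustrative data).

transactions = [
    {"keyboard", "mouse", "mousepad"},
    {"keyboard", "mouse"},
    {"keyboard", "headset"},
    {"monitor", "mouse"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(antecedent ∪ consequent) / support(antecedent)."""
    return (support(set(antecedent) | set(consequent), transactions)
            / support(antecedent, transactions))

print(support({"keyboard", "mouse"}, transactions))       # 0.5
print(confidence({"keyboard"}, {"mouse"}, transactions))  # 2/3 ≈ 0.67
```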

The working of the Apriori algorithm

  • Find all frequent itemsets:
    • Get frequent items: items whose occurrence in the database is greater than or equal to the min. support threshold.
    • Get frequent itemsets: generate candidates from the frequent items, then prune the results to keep the frequent itemsets.
  • Generate strong association rules from the frequent itemsets:
    • Rules which satisfy the min. support and min. confidence thresholds.
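
Putting the two steps together, here is a minimal, self-contained Python sketch of this level-wise search; the function names and the count-based representation are my own, not from a specific library:

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return every frequent itemset (as a frozenset) mapped to its count.

    transactions: a list of sets of items
    min_count: minimum number of transactions an itemset must occur in
    """
    def count(candidates):
        # Count, for each candidate, the transactions that contain it.
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    # Level 1: frequent individual items.
    singles = {frozenset([item]) for t in transactions for item in t}
    frequent = {c: n for c, n in count(singles).items() if n >= min_count}
    result = dict(frequent)

    k = 2
    while frequent:
        # Self-join: union pairs of frequent (k-1)-itemsets into k-itemsets.
        prev = list(frequent)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune: a candidate survives only if all its (k-1)-subsets are frequent.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in frequent
                             for sub in combinations(c, k - 1))}
        frequent = {c: n for c, n in count(candidates).items() if n >= min_count}
        result.update(frequent)
        k += 1
    return result
```

On the toy grocery table used later in this post, apriori(transactions, min_count=3) would return the frequent itemsets up to {O, K, E}.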

 

Flowchart of the above procedure (figure not shown)


 

Key Concepts

  • Frequent itemsets: all itemsets whose support is at least the minimum support (the set of frequent k-itemsets is commonly denoted Lk).
  • Apriori property: any subset of a frequent itemset must itself be frequent.
  • Join operation: to find Lk, a set of candidate k-itemsets is generated by joining Lk-1 with itself.
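
The pruning inside the join step relies on exactly this Apriori property. Here is a tiny Python sketch of that check; the function name and the example itemsets are hypothetical:

```python
from itertools import combinations

def passes_apriori_property(candidate, frequent_prev):
    """Downward closure: every (k-1)-subset of a frequent k-itemset must
    itself be frequent, so any candidate failing this check is pruned."""
    k = len(candidate)
    return all(frozenset(sub) in frequent_prev
               for sub in combinations(candidate, k - 1))

# Hypothetical example: {A,B,C} survives only if all three of its
# 2-subsets {A,B}, {A,C}, {B,C} are frequent.
frequent_2 = {frozenset("AB"), frozenset("AC"), frozenset("BC")}
print(passes_apriori_property(frozenset("ABC"), frequent_2))  # True
```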

Example of the Apriori algorithm

Consider a database (D) with 4 transactions (T1, T2, T3 and T4).

Let the minimum support count = 2 (i.e., 50% of the 4 transactions)

Let minimum confidence = 50%

(The transaction table and the frequent itemsets generated from it are not shown here.)

Now, let's generate the association rules. The confidence of a rule is found as:

Confidence (A -> C) = (transactions containing both A and C) / (transactions containing A)

To convert confidence into confidence%, multiply confidence by 100.

[For example, confidence% (A -> C) = 0.66 x 100 = 66%]

Now compare the confidence% of each association rule with the minimum confidence threshold (50%). Since the confidence% of both association rules is greater than the minimum confidence threshold, both association rules are final (strong) rules.

Was that example too technical?

Don't worry! Even an undergraduate should be able to understand this without any problem. Let's make it simple, without jargon.

Let's start with an example:

 

Transaction ID    Items Bought
T1                {Mango, Onion, Nintendo, Key-chain, Eggs, Yo-yo}
T2                {Doll, Onion, Nintendo, Key-chain, Eggs, Yo-yo}
T3                {Mango, Apple, Key-chain, Eggs}
T4                {Mango, Umbrella, Corn, Key-chain, Yo-yo}
T5                {Corn, Onion, Onion, Key-chain, Ice-cream, Eggs}

Now, we follow a simple golden rule: we say an item/itemset is frequently bought if it appears in at least 60% of the transactions. With 5 transactions here, that means it must appear in at least 3 of them.

 

For simplicity, let us abbreviate each item by a letter: M = Mango, O = Onion, and so on. The coded table is then:

Transaction ID    Items Bought
T1                {M, O, N, K, E, Y}
T2                {D, O, N, K, E, Y}
T3                {M, A, K, E}
T4                {M, U, C, K, Y}
T5                {C, O, O, K, I, E}

 

 

Step 1: Count the number of transactions in which each item occurs. Note that 'O = Onion' is bought 4 times in total, but it occurs in just 3 transactions.

 

Item    Number of transactions
M       3
O       3
N       2
K       5
E       4
Y       3
D       1
A       1
U       1
C       2
I       1

 

 

Step 2: Remember, an item is said to be frequently bought if it occurs in at least 3 transactions. So in this step we remove all the items that occur fewer than 3 times from the above table, and we are left with:

 

Item    Number of transactions
M       3
O       3
K       5
E       4
Y       3

 

 

These are the single items that are bought frequently. Now let's say we want to find the pairs of items that are bought frequently. We continue from the table in Step 2.

 

Step 3: We start making pairs from the first item: MO, MK, ME, MY. Then we move to the second item: OK, OE, OY. We do not form OM because we already have MO; buying a Mango and an Onion together is the same as buying an Onion and a Mango together. After making all the pairs we get:

 

Item pairs: MO, MK, ME, MY, OK, OE, OY, KE, KY, EY

 

 

Step 4: Now we count how many times each pair is bought together. For example, M and O are bought together only once, in {M, O, N, K, E, Y}, while M and K are bought together 3 times: in {M, O, N, K, E, Y}, {M, A, K, E} and {M, U, C, K, Y}.

After doing this for all the pairs we get:

 

Item pair    Number of transactions
MO           1
MK           3
ME           2
MY           2
OK           3
OE           3
OY           2
KE           4
KY           3
EY           2

 

 

Step 5: Golden rule to the rescue. Remove all the item pairs that occur in fewer than 3 transactions, and we are left with:

 

Item pair    Number of transactions
MK           3
OK           3
OE           3
KE           4
KY           3

 

These are the pairs of items frequently bought together.

Now let’s say we want to find a set of three items that are bought together.

We use the above table (table in step 5) and make a set of 3 items.

 

Step 6: To make the sets of three items we need one more rule (termed the self-join). It simply means that, from the item pairs in the above table, we find two pairs with the same first letter, so we get:

  • OK and OE, which give OKE
  • KE and KY, which give KEY

 

Then we count how many times O, K, E are bought together in the original table, do the same for K, E, Y, and get the following table:

 

Item set    Number of transactions
OKE         3
KEY         2

 

While we are on this, suppose you have sets of 3 items, say ABC, ABD, ACD, ACE, BCD, and you want to generate itemsets of 4 items. You look for two sets having the same first two letters:

  • ABC and ABD -> ABCD
  • ACD and ACE -> ACDE

And so on. In general, you look for sets that differ only in the last letter/item, as the sketch below illustrates.
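
Here is a minimal Python sketch of this join rule; the function name and the sorted-tuple representation are illustrative choices, not from a specific library:

```python
# Sketch of the self-join generalization: two sorted (k-1)-itemsets join
# only when they agree on everything except their last item.

def join(a, b):
    """Join two sorted item tuples that differ only in the last item."""
    if a[:-1] == b[:-1] and a[-1] < b[-1]:
        return a + (b[-1],)
    return None

triples = [("A", "B", "C"), ("A", "B", "D"), ("A", "C", "D"),
           ("A", "C", "E"), ("B", "C", "D")]
candidates = [j for i, a in enumerate(triples)
                for b in triples[i + 1:] if (j := join(a, b))]
print(candidates)  # [('A', 'B', 'C', 'D'), ('A', 'C', 'D', 'E')]
```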

 

Step 7: So we again apply the golden rule: the itemset must be bought together at least 3 times. This leaves us with just OKE, since K, E, Y are bought together only twice.

 

Thus, the set of three items bought together most frequently is {O, K, E}.
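
To double-check the arithmetic in Steps 1-7, here is a small brute-force Python sketch. It simply counts every k-combination rather than using the self-join, so it is a verification aid, not the Apriori algorithm itself:

```python
from itertools import combinations
from collections import Counter

# The five coded transactions from the tables above (using sets drops the
# duplicate Onion in T5, matching the per-transaction counting used here).
transactions = [
    {"M", "O", "N", "K", "E", "Y"},
    {"D", "O", "N", "K", "E", "Y"},
    {"M", "A", "K", "E"},
    {"M", "U", "C", "K", "Y"},
    {"C", "O", "I", "K", "E"},
]

MIN_COUNT = 3  # the "golden rule": at least 3 of the 5 transactions

for k in (1, 2, 3):
    counts = Counter(combo for t in transactions
                     for combo in combinations(sorted(t), k))
    frequent = {combo: n for combo, n in counts.items() if n >= MIN_COUNT}
    print(f"frequent {k}-itemsets:", frequent)
# k=1 -> E, K, M, O, Y; k=2 -> MK, OK, OE, KE, KY; k=3 -> OKE only
```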

 

Applications of Apriori Algorithm

  • Data analysis
  • Data Classification
  • Cross-marketing
  • Clustering
  • Catalog design
  • Loss-leader analysis, etc.

Limitations of using Apriori algorithm

  • The Apriori algorithm can be very slow, and the bottleneck is candidate generation.
  • For example, if the transaction database has 10^4 frequent 1-itemsets, it will generate roughly 10^7 candidate 2-itemsets even after employing downward closure.
  • To compute the itemsets with support greater than the minimum support, the database needs to be scanned at every level. This requires (n + 1) scans, where n is the length of the longest pattern.

Methods to improve Apriori’s efficiency

  • Hash-based itemset counting: a k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent.
  • Transaction reduction: a transaction that does not contain any frequent k-itemset is useless in subsequent scans.
  • Partitioning: any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB.
  • Sampling: mine on a subset of the given data with a lowered support threshold, plus a method to determine completeness.
  • Dynamic itemset counting: add new candidate itemsets only when all of their subsets are estimated to be frequent.

Advantages of Apriori

  • Uses the large itemset property.
  • Easily parallelized.
  • Easy to implement.

Disadvantages of Apriori

  • Assumes the transaction database is memory-resident.
  • Requires many database scans.

 

Final Remarks

In this data science tutorial, we have learned about association rule mining and how it is implemented with the help of the Apriori algorithm. We have also worked through an example of the Apriori algorithm and seen how it works step by step. The Apriori algorithm can also be used from Python to perform market basket analysis, as sketched below.
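
As a starting point for trying this in Python, here is a sketch of market basket analysis using the third-party mlxtend library (assuming mlxtend and pandas are installed); the data is the toy table from this post:

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [
    ["Mango", "Onion", "Nintendo", "Key-chain", "Eggs", "Yo-yo"],
    ["Doll", "Onion", "Nintendo", "Key-chain", "Eggs", "Yo-yo"],
    ["Mango", "Apple", "Key-chain", "Eggs"],
    ["Mango", "Umbrella", "Corn", "Key-chain", "Yo-yo"],
    ["Corn", "Onion", "Key-chain", "Ice-cream", "Eggs"],
]

# One-hot encode the transactions into a boolean DataFrame.
te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions),
                  columns=te.columns_)

# Frequent itemsets with support >= 60% (the "golden rule" above).
itemsets = apriori(df, min_support=0.6, use_colnames=True)

# Strong rules with confidence >= 50% (the worked example's threshold).
rules = association_rules(itemsets, metric="confidence", min_threshold=0.5)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```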

 
