ALL YOU NEED TO KNOW ABOUT APRIORI ALGORITHM

Last updated on Nov. 15, 2020, 7:09 p.m. 480 Views

SAIRAM UPPUGUNGLA

He is a tech expert with 7 years of industry experience in Python, Data Analysis, Big Data, Machine Learning, and NLP, with well-rounded expertise across all of these subjects. He is known for his practical approach to real-time industrial problems and for his keen interest in helping students reach their true potential and scale greater heights. Believing in problem-based teaching pedagogies, he left his job as a data engineer in Malaysia and returned to the newly formed state to fill the void between students and the industry.


Think of a grocery app like BigBasket. You plan your grocery essentials and list out the few things you intend to buy, but after making the list, you end up buying more than you planned! This is called impulse buying, and brands like BigBasket and Grofers use the Apriori algorithm to leverage this phenomenon.

The Apriori algorithm is a concept from Data Mining. Data mining allows users to analyze, classify, and discover correlations among data. The algorithm belongs to the crucial process called Association Rule Learning.

What is Association Rule Learning?

Association Rule Learning, or ARL, is a rule-based data mining method for discovering associations among variables in large databases. It looks for items or events that frequently occur together and turns such familiar patterns into critical and actionable insights in different application fields.

What is the Apriori Algorithm?

The Apriori algorithm, also known as level-wise search, is an iterative approach applied to database records (particularly transactional records, or records that include a particular number of fields or items). Apriori is a classic algorithm that follows a bottom-up approach, incrementally extending frequent itemsets one item at a time. It is among the most widely used algorithms for association rule mining in today's world of machine learning and artificial intelligence.

Apriori algorithm consists of two primary steps:

  1. Self-join
  2. Pruning

Repeat these two steps k times; after the k-th iteration you get the frequent itemsets containing k items. Another point to remember is that the Apriori algorithm follows a sequential, level-wise approach.

Example – take the BigBasket app. Using the transaction data collected through BigBasket, one can figure out which items are frequently purchased together.

Scientists usually describe the Apriori algorithm through its pseudo-code, which is widely available online. In practice, the Apriori algorithm is used in collaboration with other algorithms to sort and compare data effectively, giving a better picture of the patterns and trends a complex system exhibits.

Another example: information gathered from an e-commerce website shows that a customer who purchases a keyboard is also likely to buy a mouse at the same time. For this, the following association rule will be made:

Support: The percentage of task-relevant data transactions for which the pattern is true.

Support (Keyboard -> Mouse) = (transactions containing both Keyboard and Mouse) / (total transactions)

Confidence: The measure of certainty or trustworthiness associated with each discovered pattern.

Confidence (Keyboard -> Mouse) = (transactions containing both Keyboard and Mouse) / (transactions containing Keyboard)

Thus, the algorithm focuses on the finding of rules that satisfy both the minimum support threshold and a minimum confidence threshold (strong rules).

  • Item: article in the basket
  • Itemset: a group of items purchased together in a single transaction
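As an illustration, the two measures can be computed in a few lines of Python. The four transactions below are made up for the example (not real BigBasket data):

```python
# Hypothetical transaction log to illustrate support and confidence
# for the rule Keyboard -> Mouse.
transactions = [
    {"keyboard", "mouse"},
    {"keyboard", "mouse", "monitor"},
    {"keyboard"},
    {"monitor"},
]

both = sum(1 for t in transactions if {"keyboard", "mouse"} <= t)
with_keyboard = sum(1 for t in transactions if "keyboard" in t)

support = both / len(transactions)   # fraction of all transactions containing both items
confidence = both / with_keyboard    # fraction of keyboard buyers who also bought a mouse

print(round(support, 3))     # 0.5   (2 of 4 transactions contain both)
print(round(confidence, 3))  # 0.667 (2 of 3 keyboard transactions contain a mouse)
```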

The working of the Apriori algorithm

  • Find all frequent itemsets:
    • Get frequent items: items whose occurrence count in the database is greater than or equal to the minimum support threshold.
    • Get frequent itemsets: generate candidates from the frequent items, then prune the results to find the frequent itemsets.
  • Generate strong association rules from the frequent itemsets:
    • Rules which satisfy the minimum support and minimum confidence thresholds.
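The level-wise search described above can be sketched as a small, self-contained Python function (an illustrative sketch, not production code; the function name is mine). It reproduces the frequent itemsets of the worked grocery example later in this article:

```python
from collections import Counter
from itertools import combinations

def apriori_frequent_itemsets(transactions, min_count):
    """Level-wise search: start from frequent 1-itemsets, then repeatedly
    self-join (build k-item candidates) and prune (drop candidates with an
    infrequent subset) until no candidates survive."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count single items (baskets are sets, so duplicates count once)
    item_counts = Counter(i for t in transactions for i in t)
    level = {frozenset([i]) for i, c in item_counts.items() if c >= min_count}
    frequent = {s: item_counts[next(iter(s))] for s in level}

    k = 2
    while level:
        # Self-join: union pairs of frequent (k-1)-itemsets into k-item candidates
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune (Apriori property): every (k-1)-subset must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k - 1))}
        # Scan the database to count the surviving candidates
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c for c, n in counts.items() if n >= min_count}
        frequent.update((c, counts[c]) for c in level)
        k += 1
    return frequent

# Five baskets, minimum count 3 (the worked example used later in this article)
baskets = [set("MONKEY"), set("DONKEY"), set("MAKE"), set("MUCKY"), set("COOKIE")]
result = apriori_frequent_itemsets(baskets, 3)
print(result[frozenset("OKE")])  # 3 -- {O, K, E} is the frequent 3-itemset
```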

 

Flowchart of the above problem statement

 

Key Concepts

  • Frequent itemsets: all the sets whose items meet the minimum support (the set of frequent k-itemsets is denoted L_k).
  • Apriori property: any subset of a frequent itemset must also be frequent.
  • Join operation: to find L_k, a set of candidate k-itemsets is generated by joining L_{k-1} with itself.

Example of Apriori algorithm

Consider the following database (D) with 4 transactions (T1, T2, T3 and T4).

Let minimum support = 2%

Let minimum confidence = 50%

Now, let's generate the association rules for the above. The support of a rule A -> C is:

support(A -> C) = (transactions containing both A and C) / (total transactions)

And, to find confidence:

confidence(A -> C) = support(A ∪ C) / support(A)

To convert confidence into confidence %, multiply the confidence by 100.

[For example, confidence %(A -> C) = 0.66 x 100 = 66%]

Now compare the confidence % of each association rule with the minimum confidence threshold (50%). Since the confidence % of both association rules is greater than the minimum confidence threshold, both association rules are final (strong) rules.

Was that example too technical?

Don't worry! I would like even an undergraduate to be able to follow this without any problem. Let's make it simple, without any terminology or jargon.

Let's start with a simple example,

 

| Transaction ID | Items Bought |
|---|---|
| T1 | {Mango, Onion, Nintendo, Key-chain, Eggs, Yo-yo} |
| T2 | {Doll, Onion, Nintendo, Key-chain, Eggs, Yo-yo} |
| T3 | {Mango, Apple, Key-chain, Eggs} |
| T4 | {Mango, Umbrella, Corn, Key-chain, Yo-yo} |
| T5 | {Corn, Onion, Onion, Key-chain, Ice-cream, Eggs} |

Now, we follow a simple golden rule: we say an item/itemset is frequently bought if it is bought in at least 60% of transactions. Since there are 5 transactions here, that means it must be bought at least 3 times.

 

For simplicity

M = Mango

O = Onion

And so on……

 

So the table will be

 

Original table:

| Transaction ID | Items Bought |
|---|---|
| T1 | {M, O, N, K, E, Y} |
| T2 | {D, O, N, K, E, Y} |
| T3 | {M, A, K, E} |
| T4 | {M, U, C, K, Y} |
| T5 | {C, O, O, K, I, E} |

Step 1: Count the number of transactions in which each item occurs. Note that 'O = Onion' is bought 4 times in total, but it occurs in just 3 transactions.

| Item | Number of transactions |
|---|---|
| M | 3 |
| O | 3 |
| N | 2 |
| K | 5 |
| E | 4 |
| Y | 3 |
| D | 1 |
| A | 1 |
| U | 1 |
| C | 2 |
| I | 1 |

Step 2: Now remember, as said, an item is frequently bought if it is bought at least 3 times. So in this step we remove from the table above all the items bought fewer than 3 times, and we are left with

| Item | Number of transactions |
|---|---|
| M | 3 |
| O | 3 |
| K | 5 |
| E | 4 |
| Y | 3 |
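Steps 1 and 2 can be sketched in a few lines of Python (the baskets are hard-coded from the table above):

```python
from collections import Counter

# The five baskets from the table above. Each basket is a set, so the duplicate
# Onion in T5 counts once per transaction, exactly as Step 1 notes.
baskets = [set("MONKEY"), set("DONKEY"), set("MAKE"), set("MUCKY"), set("COOKIE")]

# Step 1: count the number of transactions in which each item occurs
counts = Counter(item for basket in baskets for item in basket)

# Step 2: keep only items bought at least 3 times (the golden rule)
min_count = 3
frequent_items = {item: n for item, n in counts.items() if n >= min_count}
print(dict(sorted(frequent_items.items())))  # {'E': 4, 'K': 5, 'M': 3, 'O': 3, 'Y': 3}
```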

 

 

These are the single items that are bought frequently. Now let's say we want to find the pairs of items that are bought frequently. We continue from the table above (the table in Step 2).

 

Step 3: We start making pairs from the first item: MO, MK, ME, MY. Then we start with the second item: OK, OE, OY. We do not form OM because we already formed MO when making pairs with M, and buying a Mango and an Onion together is the same as buying an Onion and a Mango together. After making all the pairs we get

| Item pairs |
|---|
| MO |
| MK |
| ME |
| MY |
| OK |
| OE |
| OY |
| KE |
| KY |
| EY |

 

 

Step 4: Now we count how many times each pair is bought together. For example, M and O are bought together only once, in {M, O, N, K, E, Y}, while M and K are bought together 3 times: in {M, O, N, K, E, Y}, {M, A, K, E} and {M, U, C, K, Y}.

After doing that for all the pairs we get

 

| Item Pairs | Number of transactions |
|---|---|
| MO | 1 |
| MK | 3 |
| ME | 2 |
| MY | 2 |
| OK | 3 |
| OE | 3 |
| OY | 2 |
| KE | 4 |
| KY | 3 |
| EY | 2 |

 

 

Step 5: The golden rule to the rescue. Remove all the item pairs bought together in fewer than three transactions and we are left with

 

| Item Pairs | Number of transactions |
|---|---|
| MK | 3 |
| OK | 3 |
| OE | 3 |
| KE | 4 |
| KY | 3 |

 

These are the pairs of items frequently bought together.
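Steps 3 to 5 can be sketched with `itertools.combinations`, which generates exactly the unordered pairs built by hand above:

```python
from itertools import combinations

# The five baskets from the worked example and the items that survived Step 2
baskets = [set("MONKEY"), set("DONKEY"), set("MAKE"), set("MUCKY"), set("COOKIE")]
frequent_items = ["M", "O", "K", "E", "Y"]

# Steps 3-5: form unordered pairs, count co-occurrences, keep pairs bought >= 3 times
pair_counts = {
    pair: sum(1 for basket in baskets if set(pair) <= basket)
    for pair in combinations(frequent_items, 2)
}
frequent_pairs = {pair: n for pair, n in pair_counts.items() if n >= 3}
print(frequent_pairs)
# {('M', 'K'): 3, ('O', 'K'): 3, ('O', 'E'): 3, ('K', 'E'): 4, ('K', 'Y'): 3}
```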

Now let's say we want to find the sets of three items that are bought together.

We use the table above (the table in Step 5) and make sets of 3 items.

 

Step 6: To make the sets of three items we need one more rule (it's termed self-join).

It simply means: from the item pairs in the table above, we find pairs that share the same first letter, so we get

  • OK and OE, which gives OKE
  • KE and KY, which gives KEY

 

Then we find how many times O, K, E are bought together in the original table, and the same for K, E, Y, and we get the following table

 

| Item Set | Number of transactions |
|---|---|
| OKE | 3 |
| KEY | 2 |

 

While we are on this: suppose you have sets of 3 items, say ABC, ABD, ACD, ACE, BCD, and you want to generate item sets of 4 items. You look for two sets having the same first two letters:

  • ABC and ABD -> ABCD
  • ACD and ACE -> ACDE

And so on. In general, you have to look for sets that differ only in the last letter/item.
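This prefix-join rule can be sketched directly in Python (`join_step` is a name I chose; itemsets are kept as lexicographically sorted tuples so that "same items except the last" is well-defined):

```python
def join_step(frequent_sets):
    """Self-join: merge two k-itemsets (kept as sorted tuples) that agree on
    every item except the last, producing (k+1)-item candidates."""
    candidates = set()
    for a in frequent_sets:
        for b in frequent_sets:
            if a < b and a[:-1] == b[:-1]:   # same prefix, different last item
                candidates.add(a + (b[-1],))
    return sorted(candidates)

# The 3-item example from above: only sets sharing their first two letters join
triples = [tuple("ABC"), tuple("ABD"), tuple("ACD"), tuple("ACE"), tuple("BCD")]
print(join_step(triples))  # [('A', 'B', 'C', 'D'), ('A', 'C', 'D', 'E')]
```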

 

Step 7: So we again apply the golden rule: the item set must be bought together at least 3 times. That leaves us with just OKE, since K, E, Y are bought together only twice.

 

Thus the set of three items that is bought together most frequently is {O, K, E}.

 

Applications of Apriori Algorithm

  • Data analysis
  • Data Classification
  • Cross-marketing
  • Clustering
  • Catalogue design
  • Loss-leader analysis, etc.

Limitations of using Apriori algorithm

  • The Apriori algorithm can be very slow, and the bottleneck is candidate generation.
  • For example, if the transaction DB has 10^4 frequent 1-itemsets, it will generate roughly 10^7 candidate 2-itemsets even after employing downward closure.
  • To count those with support above the minimum support, the database needs to be scanned at every level. This requires (n + 1) scans, where n is the length of the longest pattern.

Methods to improve Apriori’s efficiency

  • Hash-based itemset counting: a k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent.
  • Transaction reduction: a transaction that does not contain any frequent k-itemset is useless in subsequent scans.
  • Partitioning: an itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB.
  • Sampling: mine on a subset of the given data with a lower support threshold, plus a method to determine completeness.
  • Dynamic itemset counting: add new candidate itemsets only when all of their subsets are estimated to be frequent.

Advantages of Apriori

  • Uses the large itemset property
  • Easily parallelized
  • Easy to implement

Disadvantages of Apriori

  • Assumes the transaction database is memory-resident.
  • Requires many database scans.

 

Final Remarks

In this technical Data Science tutorial, we learned about association rule mining as implemented by the Apriori algorithm, worked through an example, and saw how the algorithm operates step by step. The Apriori algorithm can also be implemented in the Python programming language to perform market basket analysis.

 

 
