Classify — The main alligator behind the ML world 🐉🐲

Sangeetha Venkatesan
7 min readSep 22, 2022

CLASSIFY after you get CLARIFIED about the data 🚀

The major chunk of problem-solving with machine learning involves understanding how machine learning decisions are made when classifying. (Justifying the predictions that are made 🧠)

Classification has diverse applications. One of the most predominant ones in the conversational AI landscape is text classification or intent classification.


We have a feature set(attributes for the model) + Target (Outcome) — a kind of supervised machine learning task.

(X) -> classification model -> Output class label (y) — given ‘y’ is a discrete attribute.

When solving a text classification problem, one of the essential points to consider is how well we build an explanatory tool that distinguishes between the objects of different classes. That represents descriptive modeling — what features represent an Intent or a category? What features of the training data for “Hours of Operation” Intent explain the intent?

Predictive modeling: Consider a FAQ chatbot, where I have to predict the intent of the user. Here, we predict the class labels.

Alert: The type of categorization is very important. The categories we choose should have good distinction.

Class confusion — one of the crucial problems to solve with closed categories where there is some kind of relationship (which could be parent class, subclass category)

👑 Learning algorithm — we can come up with different models, understanding the learning algorithm behind each model to kind of decipher which model best fits the relationship between text and label is very important and also generalizes on the unseen data.

What kind of learning algorithm does the decision Tree have? It's a linear model. Given the space of all decision trees on the input variables — This is referred to as hypothesis class. When restricting the linear model with the specific hypothesis class — we are inducing an “inductive bias” on it. They are less effective in representation when compared to the “universal approximators” hypothesis class that feed-forward neural networks have.

But stick to the advantages it does provide — Easy and efficient to train, convex optimization objectives, interpretable.

Let us take the decision of learning about the decision tree classification algorithm 💥

Decision Trees Classifiers: 🌴🌲🎄

It all starts with building on what follow-up questions or rules you can impose. But the idea is how well we can craft the question such that leads to a conclusion about the class labels. The features for each class label account for to form of a hierarchical structure comprising nodes and directed edges.

Tree — what it can have — Root node, Internal nodes, Leaf or terminal nodes (target or class label)

Root + Internal nodes — The conditions on which the records had different characteristics.

Two possibilities: 🌴🍃🌿

  1. It could be a series of test conditions that follow the appropriate branch based on the outcome of a test and it leads to an internal node and the application of another test condition starts
  2. It hits the leaf node.

How many decision trees are possible to construct? If there are 3 features in a dataset, then there are exponentially many decision trees(8–2 power 3)

Given we have exponential search space, finding an optimal tree could be computationally expensive in a case where a big paragraph data has around 700 feature matrix space. Let's try to find a suboptimal tree.

Can we apply the greedy approach to see which attribute could be chosen to perform optimum decisions at each level?

Basic Idea: Categorizing the samples in a tree structure having each internal node comes up with a test condition. We do recursively this until each node is pure (all data points have the same label)

What makes this an ML model is that there are exponential splitting conditions we have to come up with on the dataset.

What's the exhausting part here? Our model should learn which features to take and come up with split conditions sub-optimal.

Conceptual intuition: To come up with splits such that we end up splitting the datasets as pure subsets.

Mathematical intuition: What makes it pure? How can we quantify it?

Theory of Information Gain — what split helps gain most information about features of the data.

Decision Tree

Let’s go back to the goal — to ultimately classify the target labels based on the available features. When we build the tree and evolve it into an optimal tree, what are the branches we want to consume? The more info the split condition or the feature provides, the probability of choosing that branch is high!


It's a measure of information contained in a state. If the entropy is high, there is no surprise for us, because the distributions are equally partitioned,, leading to no proper classification criteria.

If I say there is 50% distribution of classes in the root node, then the entropy is at the highest value of 1.

  1. For each split that we make, calculate the entropy measure of the state (state involves how data points are classified based on the target variable) if there are 3 target variables, we will get a three sets of data points belonging to each category.

Consider the target variable (love, hate)



We have a root node now — What’s the entropy for this one?

Remember — Entropy is the summation of the product of Probability of the state we wanted to calculate to the inverse probability of the state.

Entropy of Root Node:



Total Samples = 20

Highest possible value of entropy = 1

Left of RootNode: (L1)



Total Samples = 14

Entropy of left of Root node (L1):

Right of RootNode (R1):



Total Samples = 6

Entropy of Right of RootNode (R1):

Left of RootNode in Second split which is (L2):


Total Samples = 4

Entropy of L2:

Right of RootNode in Second split (R2):



Total Samples = 16

Entropy of R2:

Amongst the above nodes, which has minimum Entropy? You found it :)

Its 🦹‍♂️🦹‍♂️🦹‍♂️🦹‍♂ state — We have a pure node now — Only Hate class. 💥

Now what next? Till now we have calculated at node level, now we have to move to split level. We have to find the Information Gain — How much useful information a state has to go further down that line.

Information Gain corresponding to a split = Entropy of the parent Node — Combined Entropy of the child nodes

Split 1: Root node — Entropy of Root node = 1

L1: Weights of L1 = 14/20 — Entropy of L1 = 0.99 (Weights with respect to parent)

R1: Weights of R1 = 6/20 — Entropy of R1 = 0.91

Split 2: Root node — Entropy of Root node = 1

L2: Weights of L2 = 4/20 — Entropy of L2 = 0

R2: Weights of R2 = 16/20 — Entropy of R2 = 0.95

In this, the second split gives the Greater Information Gain. What the decision tree model comes up with is possible splits and choose the one that maximizes the information gain. As you can see, it's based on the greedy algorithm, looks out to find local optima (which is the split that maximizes the information gain). So, according to this approach, we learn to go forward. There is no backtracking in the tree.

Every decision is based on the current split. So keeping in point that this may not give the most optimal set of overall splits.

After we chose Split2 (High Information Gain) in the splits, we again go forward with another split. At each split level, the level of impurity (mixed samples) decreases.

So ultimately, the tree has leaf nodes that are pure:




In the next blog, we can apply these concepts to a dataset :)



Sangeetha Venkatesan

NLP Engineer, Language actors talking the stage with large language models!