Decision Tree: Decide with Intuition 🚀
Entropy, information gain, and the rest: what do they mean for the dataset? 😵💫
We will take a dataset and analyze how interpretable the decisions behind the classification of a particular target are. If you are new to Decision Trees, here is the link to my previous work: https://medium.com/@sangeetha-venkatesan/classify-the-main-alligator-behind-the-ml-world-19a968961511
Understanding the data 🚀
Let's take another problem: classifying the quality of milk 🥛 based on a few features. Each classification algorithm drills down to specific strategies that decide how to split the tree. The dataset contains a categorical (multiclass) target along with binary and continuous features.
Below are the features of the dataset, with "Grade" as the target. I have started from this tabular dataset and will later work on a language dataset by applying ML algorithms.
pH: numeric (continuous), ranging from about 3 to 9.5
Temperature: numeric (continuous), ranging from about 34°C to 90°C
Taste, Odor, Fat, Turbidity: binary, with 0 representing "Bad" and 1 representing "Good"
Colour: numeric, ranging from about 240 to 255
Grade: categorical, multiclass target
Let's construct the tree 🌲 A Decision Tree is a good algorithm for both prediction and interpretability. Let us see how the features contribute to picking the class: when a data point reaches a leaf node, its class is determined.
Decision Tree Algorithm: 🚀
- For each feature or attribute of the dataset, the decision tree algorithm forms a node. We start with the root node, and there are two kinds of nodes: decision nodes and leaf nodes.
- From each decision node, we drill down based on the splitting condition. The process continues until we reach a leaf node, which holds the final prediction value.
It's a non-parametric approach: based on the features and their threshold values, we find decision boundaries that split the records into (ideally) pure classes.
Below is the dataset; the milk grade is determined from these feature attributes.
There are several families of decision tree algorithms: ID3, C4.5, C5.0, and CART. For example, one difference between ID3 and C4.5 is that C4.5 includes a SplitInfo term (gain ratio) that penalizes splits producing many fine-grained branches, which would otherwise increase the complexity of the model.
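To make the SplitInfo idea concrete, here is a small standalone sketch (the function names and example sizes are mine, not from any reference C4.5 implementation) of how the gain ratio penalizes splits that scatter records across many small branches:

import numpy as np

def split_info(child_sizes):
    # SplitInfo: entropy of the split proportions themselves
    total = sum(child_sizes)
    proportions = np.array(child_sizes) / total
    return -np.sum(proportions * np.log2(proportions))

def gain_ratio(information_gain, child_sizes):
    # C4.5 divides the raw information gain by SplitInfo
    return information_gain / split_info(child_sizes)

# a 2-way split of 100 records vs. a 10-way split of the same records
print(split_info([50, 50]))   # 1.0
print(split_info([10] * 10))  # ~3.32 -> larger denominator, lower gain ratio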
Goal: build a tree from the data by deciding which questions to ask, when to ask them, and which questions lead to a class categorization.
CART Algorithm: Classification and Regression Trees. We start at the root node with the entire dataset, then come up with a condition that splits the training records. This process is repeated until the splits have the purest possible distribution of class labels.
We will implement both impurity measures, the Gini index and entropy, along with the corresponding Gini gain and information gain, which quantify the uncertainty at a node and how much a split reduces it.
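As a quick intuition check, here is a tiny standalone sketch (the label arrays are made up) showing that both measures assign roughly zero uncertainty to a pure node and higher values to a mixed one:

import numpy as np

def gini(labels):
    # Gini impurity: 1 - sum of squared class probabilities
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

def entropy(labels):
    # Shannon entropy in bits: -sum of p * log2(p)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

pure = ["high", "high", "high", "high"]
mixed = ["high", "low", "medium", "low"]

print(gini(pure), entropy(pure))    # ~0, ~0  -> no uncertainty
print(gini(mixed), entropy(mixed))  # 0.625, 1.5 -> more uncertainty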
😵💫 🍃 — Node class (Decision Node, Leaf Node)
class Node:
    def __init__(self, feature_index=None, threshold=None,
                 left=None, right=None, info_gain=None, value=None):
        # for decision node 😵💫
        self.feature_index = feature_index
        self.threshold = threshold
        self.left = left
        self.right = right
        self.info_gain = info_gain
        # for leaf node 🍃
        self.value = value
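As a quick illustration (the feature index, threshold, and gain here are made-up values), a decision node wires together a split condition and its two subtrees, while a leaf node only carries a value:

# hypothetical example: split on feature 0 (pH) at threshold 6.5
leaf_high = Node(value="high")
leaf_low = Node(value="low")
decision = Node(feature_index=0, threshold=6.5,
                left=leaf_high, right=leaf_low, info_gain=0.12)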
Here, for a decision node we need the feature index used for the split and the threshold value of that feature; we iterate over candidate thresholds to come up with "what good question we can ask" to categorize the records. A decision node has a left and a right sub-tree, and the deciding factor at each decision node is the information gain.
Leaf nodes hold the class label values, which are high, medium, and low in this dataset.
Now let's initialize the tree and come up with the methods needed for building it. Building the decision tree is a recursive process, and the recursion needs a stopping criterion. Here, that is crucial to keep the model from overfitting the training data.
# Tree class
class DecisionTreeClassifier():
    def __init__(self, min_samples_split=2, max_depth=2):
        # initialize the root of the tree
        self.root = None
        # stopping criteria
        self.min_samples_split = min_samples_split
        self.max_depth = max_depth
Here, the stopping criterion is based on the minimum number of samples required to split a node, so we don't branch out nodes holding only a handful of records (e.g. 0, 0, 1) with unique feature values; such a tree does not generalize well. The maximum depth also matters: the shallower the tree, the easier it is to interpret.
✍🏻 Now split the dataset into train and test sets before passing the training records to build the tree.
import numpy as np
from sklearn.model_selection import train_test_split

# X -> feature values, Y -> target class
X = dataset.iloc[:, :-1].values
Y = dataset.iloc[:, -1].values.reshape(-1, 1)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=41)
🦹♂️ Now let's fit the decision tree to this training data (X_train):
classifier = DecisionTreeClassifier(min_samples_split=3, max_depth=3)
classifier.fit(X_train, Y_train)
We have initialized the DecisionTreeClassifier with a minimum sample split of 3 and a maximum tree depth of 3. These are hyperparameters we can tune to build a decision tree with a good classification F1-score.
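If you want to see what tuning could look like, here is a rough sketch that tries a few depths and compares weighted F1-scores on the held-out split; the candidate depths are arbitrary and this is not a substitute for proper cross-validation:

from sklearn.metrics import f1_score

# try a few depths and compare weighted F1-scores on the test split
for depth in [2, 3, 4, 5]:
    candidate = DecisionTreeClassifier(min_samples_split=3, max_depth=depth)
    candidate.fit(X_train, Y_train)
    score = f1_score(Y_test.ravel(), candidate.predict(X_test), average="weighted")
    print(f"max_depth={depth}: weighted F1 = {score:.3f}")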
What are the underlying steps behind the fit method?
def fit(self, X, Y):
    # stack features and target into one array so the tree can split on both
    dataset = np.concatenate((X, Y), axis=1)
    self.root = self.build_tree(dataset)
    return self
The root node starts with the entire X_train dataset.
How are we going to build a tree from this root node?
The build is a recursive process driven by the purity of each split, continuing until a leaf node is reached.
Having established the stopping criteria, let's see how to get the best split of the dataset over samples and features. The best-split method returns a dict holding the computed information gain for a candidate split. If the information gain is greater than 0, we split the node into a left and a right subtree, and each subtree is built with the same recursive procedure, again choosing splits by information gain.
When we exhaust the depth of the decision tree (or another stopping criterion fires), we end up at a leaf node whose value is the target class label we are looking to predict.
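Since the build method itself isn't listed above, here is a minimal sketch of how build_tree could look given the pieces we have discussed; the dict keys (feature_index, threshold, dataset_left, dataset_right, info_gain) and the helpers get_best_split and calculate_leaf_value are my naming assumptions, filled in below.

def build_tree(self, dataset, curr_depth=0):
    X, Y = dataset[:, :-1], dataset[:, -1]
    num_samples, num_features = np.shape(X)

    # keep splitting while the stopping criteria are not met
    if num_samples >= self.min_samples_split and curr_depth <= self.max_depth:
        best_split = self.get_best_split(dataset, num_samples, num_features)
        # only split if it actually reduces uncertainty
        if best_split and best_split["info_gain"] > 0:
            left_subtree = self.build_tree(best_split["dataset_left"], curr_depth + 1)
            right_subtree = self.build_tree(best_split["dataset_right"], curr_depth + 1)
            return Node(best_split["feature_index"], best_split["threshold"],
                        left_subtree, right_subtree, best_split["info_gain"])

    # otherwise this is a leaf node holding the majority class label
    leaf_value = self.calculate_leaf_value(Y)
    return Node(value=leaf_value)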
Now onto the crucial part: how to get the best split of the dataset. 💥🧠
Let's consider the initial dataset has the entire training dataset, X_train.
The best split is chosen by information gain: we loop over the features and, for each feature, over its unique values as candidate thresholds, split the dataset at each threshold, and keep the split with the highest information gain.
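Here is a hedged sketch of what split and get_best_split could look like under those assumptions (the dict keys match the build_tree sketch above, and information_gain is defined in the next section):

def split(self, dataset, feature_index, threshold):
    # records with feature value <= threshold go left, the rest go right
    dataset_left = np.array([row for row in dataset if row[feature_index] <= threshold])
    dataset_right = np.array([row for row in dataset if row[feature_index] > threshold])
    return dataset_left, dataset_right

def get_best_split(self, dataset, num_samples, num_features):
    best_split = {}
    max_info_gain = -float("inf")

    for feature_index in range(num_features):
        # every unique feature value is a candidate threshold
        possible_thresholds = np.unique(dataset[:, feature_index])
        for threshold in possible_thresholds:
            dataset_left, dataset_right = self.split(dataset, feature_index, threshold)
            if len(dataset_left) > 0 and len(dataset_right) > 0:
                y = dataset[:, -1]
                left_y, right_y = dataset_left[:, -1], dataset_right[:, -1]
                curr_info_gain = self.information_gain(y, left_y, right_y, "gini")
                if curr_info_gain > max_info_gain:
                    best_split = {
                        "feature_index": feature_index,
                        "threshold": threshold,
                        "dataset_left": dataset_left,
                        "dataset_right": dataset_right,
                        "info_gain": curr_info_gain,
                    }
                    max_info_gain = curr_info_gain
    return best_split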
💥 Computing the information gain: the "Gini index" and "entropy" of the nodes.
The node-level impurity measure can be either the Gini index or entropy, while information gain is the split-level measure.
We can toggle the node impurity measure between "gini" and "entropy"; this is also a natural place for ensembling, building different versions of the model with different impurity measures.
Before finding the information gain of a split, let's start with the gini_index of a node.
def gini_index(self, y):
    # Gini impurity of a node: 1 - sum of squared class probabilities
    class_labels = np.unique(y)
    gini = 0
    for cls in class_labels:
        p_cls = len(y[y == cls]) / len(y)
        gini += p_cls ** 2
    return 1 - gini
When mode == "gini", the information gain is computed from the Gini split, i.e. using the Gini index as the impurity measure.
The Gini split is the weighted sum of the Gini indices of the individual child nodes; subtracting it from the parent node's Gini index gives the information gain.
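In code, that could look like the following sketch of an information_gain method; the mode flag for toggling between "gini" and "entropy" is my assumption, and the entropy helper is sketched just below:

def information_gain(self, parent, l_child, r_child, mode="gini"):
    # weights: the share of parent records that went to each child
    weight_l = len(l_child) / len(parent)
    weight_r = len(r_child) / len(parent)

    if mode == "gini":
        # Gini gain = parent Gini - weighted sum of child Gini indices
        gain = self.gini_index(parent) - (weight_l * self.gini_index(l_child)
                                          + weight_r * self.gini_index(r_child))
    else:
        # information gain = parent entropy - weighted sum of child entropies
        gain = self.entropy(parent) - (weight_l * self.entropy(l_child)
                                       + weight_r * self.entropy(r_child))
    return gain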
Let's now look at the "entropy" impurity measure and the information gain based on it.
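A minimal sketch of the entropy method, mirroring gini_index; with mode="entropy" in the information_gain sketch above, the gain is then the parent entropy minus the weighted child entropies:

def entropy(self, y):
    class_labels = np.unique(y)
    entropy = 0
    for cls in class_labels:
        p_cls = len(y[y == cls]) / len(y)
        # -p * log2(p), summed over the classes present at the node
        entropy += -p_cls * np.log2(p_cls)
    return entropy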
Tracing back: once these measures are calculated, the tree is built from the conditions on feature values.
def calculate_leaf_value(self, Y):
    # a leaf predicts the most frequent class label among its records
    Y = list(Y)
    return max(Y, key=Y.count)
If the information gain is 0, we have reached a leaf node where no further decision is needed; we simply take the most frequent class label from the count distribution.
After fitting the data to the decision tree, let's print the tree:
classifier.print_tree()
The tree is printed with the information gain at each decision node; the leaf nodes show the class labels.
Now, let's see the prediction part:
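The print_tree method isn't reproduced in the text; a simple recursive version could look like this (the formatting choices are mine):

def print_tree(self, tree=None, indent=" "):
    if tree is None:
        tree = self.root
    if tree.value is not None:
        # leaf node: just print the class label
        print(tree.value)
    else:
        # decision node: print the split condition and its information gain
        print("X_" + str(tree.feature_index), "<=", tree.threshold, "?", tree.info_gain)
        print("%sleft:" % (indent), end="")
        self.print_tree(tree.left, indent + indent)
        print("%sright:" % (indent), end="")
        self.print_tree(tree.right, indent + indent)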
Y_pred = classifier.predict(X_test)

from sklearn.metrics import accuracy_score
accuracy_score(Y_test, Y_pred)
Based on the tree constructed from the training data, the class label for each test record is determined by traversing the tree.
Let's traverse one record to find its label:
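A hedged sketch of what predict and its recursive helper could look like: compare the record's value for the node's feature against the stored threshold, go left or right, and stop at a leaf (the helper name make_prediction is my assumption):

def predict(self, X):
    # traverse the tree once per record
    return [self.make_prediction(x, self.root) for x in X]

def make_prediction(self, x, tree):
    if tree.value is not None:
        return tree.value  # reached a leaf: return its class label
    feature_value = x[tree.feature_index]
    if feature_value <= tree.threshold:
        return self.make_prediction(x, tree.left)
    return self.make_prediction(x, tree.right)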
array([[ 4.7, 38. , 1. , 0. , 1. , 0. , 255. ],
[ 8.6, 55. , 0. , 1. , 1. , 1. , 255. ]])
x0 = 4.7 (pH)
x1 = 38.0 (Temperature)
x2 = 1 (Taste)
x3 = 0 (Odor)
x4 = 1 (Fat)
x5 = 0 (Turbidity)
x6 = 255 (Colour)
The same applies to the second record; this is how the tree is traversed for unseen records.
Applied to all the unseen records (X_test), this gives the accuracy of the baseline model. Note that we did not perform cross-validation or apply pruning techniques to make the tree more robust; those will be covered when we get to cross-validation.
0.8726415094339622
To visualize the tree neatly, let's fit the data with scikit-learn's DecisionTreeClassifier and plot it.
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
import matplotlib.pyplot as plt  # data visualization

# Fit the classifier
clf_gini = DecisionTreeClassifier(criterion='gini', max_depth=2, random_state=0)
clf_gini.fit(X_train, Y_train)

# Plot the fitted tree
plt.figure(figsize=(12, 8))
tree.plot_tree(clf_gini)
plt.show()
This is how the tree is constructed from feature value thresholds and the node impurity measure.
To get the best out of a decision tree classifier, we have to cross-validate and apply pruning techniques, which I will cover as I work on cross-validation.