Dimensionality Reduction
As Naval Ravikant says, "Everyone should learn computers, so you don't get scared when something breaks." The same applies to math and machine learning: learn the mathematical intuition behind the algorithms to some extent, so you don't get scared working with them.
Deciphering the intuition!
For most natural language problems, we can try to decipher the intuition that led to a specific representation, method, or solution.
Progress toward that understanding comes from drilling down into the properties of the dataset that most strongly shape the search for a better representation or solution.
Consider text representation for understanding semantics: it started with denotational semantics and then moved to distributional semantics.
This progression was driven by the following properties:
- Dimensionality
- Sparsity
- Resolution
Vector and matrix representations evolved in response to the limitations of each earlier idea of representation.
- Representation of semantics started with denotation, which objectifies meaning in a simple way (WordNet). Here the characteristic of sparsity comes into play.
- With distributional semantics, more dimensions are used to preserve context, and the "curse of dimensionality" comes into play.
- With resolution, multimodality and the use of different perspectives to understand the ground truth of a sentence come into play.
Dimensionality reduction offers data compression with little loss of information, and better visualization!
Two methods:
- Feature Selection (selecting a subset of the original attributes)
- Feature Extraction (creating new features from combinations of the original attributes)
PCA falls under the Feature Extraction type.
Benefits:
- Data mining algorithms tend to work better when the data has fewer dimensions (attributes), partly because reducing dimensionality can remove noisy or irrelevant attributes
- Explainability of the model
- Visualization
- Less memory or computing time
Curse of higher dimensions:
- Data becomes increasingly sparse
- Impact on classification: There are not enough data objects to build a model that reliably assigns a class to every data object
- Effect on clustering: Density and distance between data points become less meaningful.
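The effect on distances can be illustrated with a small experiment. The sketch below is not part of the original walkthrough: it assumes uniformly random synthetic points, and `distance_contrast` is a hypothetical helper name. It measures how much farther the farthest point is than the nearest one, relative to the nearest distance.

```python
import numpy as np

def distance_contrast(n_points, n_dims, seed=0):
    """Relative gap between the farthest and nearest point from one query point."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))  # synthetic uniform samples in the unit hypercube
    query = rng.random(n_dims)               # one random query point
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

# Print the contrast for a few dimensionalities
for n_dims in (2, 10, 100, 1000):
    print(n_dims, round(distance_contrast(500, n_dims), 3))
```

Under these assumptions the contrast shrinks as the number of dimensions grows, which is one way to see why "nearest" and "farthest" lose meaning for clustering.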
Principal Component Analysis (linear dimensionality reduction technique)
Intuitions:
- Find new attributes (principal components) that are linear combinations of the original attributes
- The components are orthogonal to each other.
- The components maximize the amount of variation captured from the data (mathematical intuition: compute the eigenvectors of the covariance matrix that correspond to the largest eigenvalues). Hence, they retain as much of the variance of the original data as possible.
In practice, PCA is often computed with singular value decomposition (SVD). Each principal direction is reduced to a vector of unit length, since what matters for a projection is the direction of the vector, not its magnitude.
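To make the link between SVD and the eigenvalue view concrete, here is a minimal sketch on synthetic data (the data and the `_demo` names are assumptions for illustration, not the banking dataset). It checks that the squared singular values of the centered data, divided by n - 1, match the eigenvalues of the covariance matrix, and that the SVD directions are unit vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic correlated data: 200 samples, 5 features
X_demo = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
X_centered = X_demo - X_demo.mean(axis=0)

# Route 1: eigenvalues of the covariance matrix, sorted in descending order
eig_vals = np.linalg.eigh(np.cov(X_centered, rowvar=False))[0][::-1]

# Route 2: singular values of the centered data (returned in descending order)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
svd_vals = S**2 / (X_centered.shape[0] - 1)

print(np.allclose(eig_vals, svd_vals))    # the two routes agree on the variances
print(np.linalg.norm(Vt, axis=1))         # each row of Vt is a unit-length direction
```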
Algorithm:
Input: The data contains noise, and that noise carries information too; it is part of what characterizes each sample.
Goal: To reduce the dimensionality while retaining as much of the variance in the data samples as possible. PCA is a linear mapping of the data to a lower-dimensional space, chosen so that the variance of the data in the lower-dimensional space is maximized. The components created by PCA are uncorrelated with one another.
Dataset: We are going to use the banking dataset: https://huggingface.co/datasets/banking77. To encode the dataset, I am going to use the Sentence Transformers model "all-MiniLM-L6-v2", which encodes the data samples into a 384-dimensional vector space.
What's the point behind PCA?
X = the data samples above, with D = 384 dimensions. We want to reduce the dimensions to M, where M < D, so we come up with newly transformed data samples z that preserve the variance.
Each sample x in X is a real-valued vector in D dimensions.
Each transformed sample z is a real-valued vector in M dimensions.
PCA: Linear Dimensionality Reduction. It uses linear transformations to encode in the reduced feature space.
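In symbols, the mapping can be sketched as follows. The notation is introduced here for illustration and is not from the original post: N is the number of samples, \bar{x} is the sample mean, S is the covariance matrix, and B is the matrix whose columns are the M unit-length eigenvectors of S with the largest eigenvalues.

```latex
\bar{x} = \frac{1}{N}\sum_{n=1}^{N} x_n, \qquad
S = \frac{1}{N-1}\sum_{n=1}^{N} (x_n - \bar{x})(x_n - \bar{x})^{\top}, \qquad
z_n = B^{\top}(x_n - \bar{x})

x_n \in \mathbb{R}^{D}, \quad z_n \in \mathbb{R}^{M}, \quad
B \in \mathbb{R}^{D \times M}, \quad B^{\top}B = I_M, \quad M < D
```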
1) Install sentence transformers and load the model:
!pip install sentence_transformers
from sentence_transformers import SentenceTransformer
import numpy as np
model = SentenceTransformer("all-MiniLM-L6-v2")
2) Load the dataset
from datasets import load_dataset
dataset = load_dataset("banking77")
3) Take the training examples:
df = dataset['train']
4) Convert them to a pandas DataFrame:
df_pandas = df.to_pandas()
5) Inspect the dataframe:
df_pandas.head()
6) Now we encode the text to get the data points whose dimensionality we want to reduce:
X = model.encode(df_pandas['text'].tolist())
7) Check the dimensions of the data points:
X.shape
There are 10003 data points with 384 dimensions (variables). The categories are described here: https://huggingface.co/datasets/banking77#data-fields
8) Since dimensionality reduction is performed on the data samples only, we separate out the target variable, which can be used for categorization.
target = df_pandas.iloc[:,1]
target
9) Now we are ready to perform the principal component analysis:
What's behind the following?
from sklearn.decomposition import PCA
pca = PCA(n_components=4)
pca_result = pca.fit_transform(X)
Let's break it down!
Let's start from step 1.
Subtract the mean of each variable
Subtract the mean of each variable so that the dataset is centered around the origin. Since the ultimate goal is to fit the data along the direction of greatest variance, calculating the covariance is the crucial next step.
X_meaned = X - np.mean(X , axis = 0)
Find the covariance matrix
It's a square matrix giving the covariance between each pair of variables (columns) in the data.
cov_mat = np.cov(X_meaned , rowvar = False)
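As a sanity check on what `np.cov` computes here, the sketch below uses a small synthetic matrix (the `_demo` names are assumptions for illustration) and compares it with the textbook formula on mean-centered data. With `rowvar=False`, each column is treated as a variable, and the default normalization divides by n - 1.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))              # synthetic data: 100 samples, 3 variables
X_demo_meaned = X_demo - X_demo.mean(axis=0)    # center each column

n_samples = X_demo_meaned.shape[0]
cov_manual = X_demo_meaned.T @ X_demo_meaned / (n_samples - 1)  # textbook formula
cov_numpy = np.cov(X_demo_meaned, rowvar=False)                 # columns are variables

print(cov_numpy.shape)                      # one row and one column per variable
print(np.allclose(cov_manual, cov_numpy))   # both routes give the same matrix
```

For the real data, `cov_mat` would therefore be a 384 x 384 symmetric matrix.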
Find the eigenvalues and eigenvectors
This is based on the intuition of the orthogonality (mutual perpendicularity) of the vectors. The higher the eigenvalue, the higher the variance. Hence, the axis with the highest eigenvalue captures the most variability in the data points. This is done component by component, each capturing a different layer of variation in the data points.
eigen_values , eigen_vectors = np.linalg.eigh(cov_mat)
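A short check of what `np.linalg.eigh` returns, sketched on a small synthetic covariance matrix (the `_demo` names are assumptions for illustration). `eigh` is intended for symmetric matrices such as a covariance matrix; it returns eigenvalues in ascending order and eigenvectors as columns.

```python
import numpy as np

rng = np.random.default_rng(0)
cov_demo = np.cov(rng.normal(size=(50, 4)), rowvar=False)  # synthetic 4x4 covariance matrix

vals_demo, vecs_demo = np.linalg.eigh(cov_demo)

# Columns are mutually perpendicular unit vectors: V^T V = I
is_orthonormal = np.allclose(vecs_demo.T @ vecs_demo, np.eye(4))
# Each column v_j satisfies C v_j = lambda_j v_j (broadcast scales column j by vals_demo[j])
satisfies_definition = np.allclose(cov_demo @ vecs_demo, vecs_demo * vals_demo)

print(vals_demo)  # ascending order, which is why the next step re-sorts
print(is_orthonormal, satisfies_definition)
```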
Sort the eigenvalues in descending order
Since we want to capture the highest variance among the data points, sort the eigenvalues in descending order and reorder the corresponding eigenvectors to match.
sorted_index = np.argsort(eigen_values)[::-1]
sorted_eigenvalue = eigen_values[sorted_index]
sorted_eigenvectors = eigen_vectors[:,sorted_index]
Specify the number of components to break the data points down into
Since there are enough data samples to yield useful results, I am setting the number of principal components to 50.
num_components = 50
The leading columns of sorted_eigenvectors are the principal components that capture the highest variability.
Select the subset of components from the above eigenvector matrix
eigenvector_subset = sorted_eigenvectors[:,0:num_components]
Transform the data by applying a sequence of transposes
Taking the dot product of the transposed eigenvector subset with the transposed mean-centered data, and then applying a final transpose, gives the data in reduced dimensions.
X_reduced = np.dot(eigenvector_subset.transpose() , X_meaned.transpose() ).transpose()
Now X_reduced will have data points with reduced dimensions as specified by the number of components.
X_reduced.shape  # (10003, 50)
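The chain of transposes is equivalent to a single matrix product of the centered data with the eigenvector subset. The sketch below checks this on synthetic data (all `_demo` names and `W_demo` are assumptions for illustration), and also confirms that the variance of each projected column equals its eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 6)) @ rng.normal(size=(6, 6))  # synthetic correlated data
X_demo_meaned = X_demo - X_demo.mean(axis=0)

vals_demo, vecs_demo = np.linalg.eigh(np.cov(X_demo_meaned, rowvar=False))
order = np.argsort(vals_demo)[::-1]      # indices of eigenvalues, largest first
W_demo = vecs_demo[:, order[:2]]         # top-2 eigenvectors, shape (6, 2)

# Same form as in the walkthrough: transpose, multiply, transpose back
Z_transposes = np.dot(W_demo.transpose(), X_demo_meaned.transpose()).transpose()
# Equivalent single matrix product
Z_matmul = X_demo_meaned @ W_demo

same_result = np.allclose(Z_transposes, Z_matmul)
# Variance along each principal component equals the corresponding eigenvalue
variance_matches = np.allclose(Z_matmul.var(axis=0, ddof=1), vals_demo[order[:2]])
print(Z_matmul.shape, same_result, variance_matches)
```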
Compute the explained variance of each principal component
From the eigenvalues above, calculate the explained variance ratio of each component.
total_eigenvalues = sum(eigen_values)
variance_exp = [(i/total_eigenvalues) for i in sorted(eigen_values, reverse=True)]
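The same ratios can also guide the choice of the number of components, instead of fixing it by hand. The sketch below is an illustration rather than part of the original walkthrough: `components_for_variance` is a hypothetical helper, and the eigenvalues are made up.

```python
import numpy as np

def components_for_variance(eigen_values, threshold=0.95):
    """Smallest number of components whose cumulative explained variance reaches threshold."""
    ratios = np.sort(eigen_values)[::-1] / np.sum(eigen_values)  # descending ratios
    cumulative = np.cumsum(ratios)
    count = int(np.searchsorted(cumulative, threshold)) + 1
    return min(count, len(ratios))  # guard against floating-point round-off near 1.0

demo_eigen_values = np.array([4.0, 3.0, 2.0, 1.0])  # made-up eigenvalues
print(components_for_variance(demo_eigen_values, threshold=0.8))
```

scikit-learn's PCA offers a similar option: passing a float between 0 and 1 as `n_components` selects enough components to explain that fraction of the variance.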
Now let's plot the explained variance of the principal components we have broken down:
# Plot the explained variance
import matplotlib.pyplot as plt
cum_sum_exp = np.cumsum(variance_exp)
plt.bar(range(0, len(variance_exp)), variance_exp, alpha=0.5, align='center', label='Individual explained variance')
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal component index')
plt.legend(loc='best')
plt.tight_layout()
plt.show()
# Plot the cumulative variance
plt.step(range(0,len(cum_sum_exp)), cum_sum_exp, where='mid',label='Cumulative explained variance')
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal component index')
plt.legend(loc='best')
plt.tight_layout()
plt.show()
Now take the first two components for visualization:
import pandas as pd
pca_df = pd.DataFrame(columns = ['pca1','pca2'])
pca_df['pca1'] = X_reduced[:,0]
pca_df['pca2'] = X_reduced[:,1]
To reconstruct the dataset, let's concatenate with the target variable
pca_df = pd.concat([pca_df , pd.DataFrame(target)] , axis = 1)
Visualize the PCA results:
import plotly.express as px
fig = px.scatter(pca_df, x="pca1", y="pca2", color="label", hover_data=['label'])
fig.show()
The above principal components can be used for clustering or as features for classification.
That's it :) It's simple if we break it down!