Dimensionality Reduction 🐊 -> 🦎

Sangeetha Venkatesan
6 min read · Sep 17, 2022

As Naval Ravikant says, “Everyone should learn computers, so you don't get scared when something breaks.” The same applies to math and machine learning: learn the mathematical intuition behind the algorithms, at least to some extent, so you don't get scared working with them.

🦹‍♂️ Deciphering the intuition!

Most natural language problems can be approached by first deciphering the intuition that led to a particular representation, method, or solution.

Progress toward that understanding comes from drilling down into the properties of the dataset that most influence the choice of a better representation or method.

✍🏻 Consider text representation for capturing semantics: the field started with denotational semantics and then moved to distributional semantics.

Each step was driven by the following properties:

🙌 Dimensionality

🙌 Sparsity

🙌 Resolution

Vector and matrix representations progressed in response to the limitations of each earlier idea of representation.

👀 Representation of semantics started with denotation, objectifying meaning in a simple way (WordNet). This is where sparsity comes into play.

👀 With distributional semantics, more dimensions are used to preserve context, and the “curse of dimensionality” comes into play.

👀 With resolution, multimodality and applying different perspectives to really understand the ground truth of a sentence come into play.
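A minimal sketch of why a denotation-style representation runs into sparsity. The five-word vocabulary is a made-up toy, not taken from the banking dataset:

```python
import numpy as np

# Hypothetical toy vocabulary, for illustration only.
vocab = ["bank", "card", "transfer", "refund", "balance"]

# One dimension per word, a single 1 in each vector.
one_hot = np.eye(len(vocab))

# Fraction of entries that are zero; it approaches 1 as the vocabulary grows.
sparsity = 1.0 - np.count_nonzero(one_hot) / one_hot.size
print(f"one-hot sparsity: {sparsity:.2f}")

# Any two distinct one-hot vectors have a dot product of 0,
# so this representation carries no notion of similarity between words.
print(one_hot[0] @ one_hot[1])
```

With a realistic vocabulary of tens of thousands of words, almost every entry is zero, which is one motivation for the denser distributional vectors.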

✍🏻 Data compression with little loss of information, better visualization!

πŸ•Έ Two methods:

  1. Feature Selection (selecting new features from the subset of the old attributes)
  2. Feature Extraction

PCA comes under the Feature Extraction type.
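The difference between the two methods can be sketched on a toy matrix. The data, the kept columns, and the projection matrix `W` below are arbitrary placeholders; PCA would learn `W` from the data rather than draw it at random:

```python
import numpy as np

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(6, 4))   # 6 samples, 4 original attributes

# Feature selection: keep a subset of the original columns unchanged.
keep = [0, 2]                     # hypothetical choice of attributes
X_selected = X_toy[:, keep]

# Feature extraction: every new column is a combination of all originals.
W = rng.normal(size=(4, 2))       # arbitrary projection, not learned
X_extracted = X_toy @ W

print(X_selected.shape, X_extracted.shape)
```

Both results have two columns, but only the selected ones can be read directly as original attributes.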

💃🏻 Benefits:

  1. Data mining algorithms often work better when the data has fewer attributes, partly because reducing dimensionality can remove irrelevant features and noise
  2. Explainability of the model
  3. Visualization
  4. Less memory or computing time

🦹‍♂️ Curse of higher dimensions:

  1. Data becomes increasingly sparse
  2. Impact on classification: There are not enough data objects to build a model that reliably assigns a class to every possible object
  3. Effect on clustering: Density and distance between data points become less meaningful.
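The clustering point can be illustrated with a small experiment on uniformly random points. This is only a sketch: `distance_contrast` is a helper made up for this post, and the point counts and dimensions are arbitrary.

```python
import numpy as np

def distance_contrast(dim, n_points=200, seed=0):
    """Relative gap between the farthest and nearest neighbour of one query point."""
    rng = np.random.default_rng(seed)
    points = rng.uniform(size=(n_points, dim))
    query = rng.uniform(size=dim)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

# The contrast shrinks as the dimension grows: "near" and "far" start to look alike.
for dim in (2, 20, 384):
    print(dim, round(distance_contrast(dim), 3))
```

When the nearest and farthest points are almost equally far away, distance-based clustering has little to work with.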

🦊 Principal Component Analysis (Linear dimensionality reduction technique)

🧠 Intuitions:

  1. Find new attributes (principal components) that are linear combinations of the original attributes
  2. The components are orthogonal to each other.
  3. Capture the maximum amount of variation in the data (Mathematical Intuition: take the eigenvectors of the covariance matrix that correspond to the largest eigenvalues), so that as much of the variance of the original data as possible is preserved.

Each principal component is reduced to a unit vector because only its direction matters: projections along the same direction differ only in magnitude. In practice, the components are often computed with singular value decomposition (SVD) of the centered data rather than an explicit eigen-decomposition.
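A small check, on made-up toy data, that the eigenvalue route and the SVD route describe the same variances, and that the SVD directions are unit vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy stand-in for the embeddings: 100 samples, 5 correlated features.
X_toy = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))
X_centered = X_toy - X_toy.mean(axis=0)

# Route 1: eigen-decomposition of the covariance matrix.
eig_vals, eig_vecs = np.linalg.eigh(np.cov(X_centered, rowvar=False))
eig_vals = eig_vals[::-1]                 # eigh returns ascending order

# Route 2: SVD of the centered data itself.
_, singular_vals, Vt = np.linalg.svd(X_centered, full_matrices=False)
svd_vals = singular_vals**2 / (X_centered.shape[0] - 1)

print(np.allclose(eig_vals, svd_vals))                # same variances
print(np.allclose(np.linalg.norm(Vt, axis=1), 1.0))   # unit-length directions
```

The rows of `Vt` and the columns of `eig_vecs` span the same directions, though individual vectors may differ in sign.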

👑 Algorithm:

💥 Input: The data contains noise, but part of that variation carries information too. The variation between samples is what characterizes them, so not all of it should be thrown away.

😈 Goal: To reduce the dimensionality but retain much of the variance in the data samples. PCA is a linear mapping of the data to a lower-dimensional space in such a way that the variance of the data in the lower-dimensional space is maximized. The components created in PCA are uncorrelated with one another.

👑 Dataset: We are going to use the banking dataset: https://huggingface.co/datasets/banking77. To encode the dataset, I am going to use the Sentence Transformers model “all-MiniLM-L6-v2”, which encodes the data samples into a 384-dimensional vector space.

What's the point behind PCA?

X = the data samples above, with D = 384 dimensions. The aim is to reduce the dimensions to M, where M < D, producing newly transformed data samples z that preserve as much of the variance as possible.

Each sample in X is a real-valued vector in D dimensions.

Each sample in z is a real-valued vector in M dimensions.

PCA: Linear Dimensionality Reduction. It uses linear transformations to encode in the reduced feature space.
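Written out, the linear mapping looks roughly as follows. The symbols N (number of samples), μ (mean), S (covariance matrix), λ_i, b_i (eigenvalues and unit eigenvectors) and B are notation introduced here for illustration; X, z, D and M are as above.

```latex
\begin{aligned}
\mu &= \frac{1}{N}\sum_{n=1}^{N} x_n, \qquad
S = \frac{1}{N-1}\sum_{n=1}^{N} (x_n-\mu)(x_n-\mu)^\top \in \mathbb{R}^{D\times D},\\
S\,b_i &= \lambda_i\, b_i, \qquad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_D \ge 0,\\
z_n &= B^\top (x_n-\mu) \in \mathbb{R}^{M}, \qquad
B = [\,b_1,\dots,b_M\,] \in \mathbb{R}^{D\times M}.
\end{aligned}
```

Here x_n is one row of X, and B keeps only the M eigenvectors with the largest eigenvalues.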

1) Install sentence transformers:

!pip install sentence_transformers

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')

2) Load the dataset

from datasets import load_dataset

dataset = load_dataset("banking77")

3) Take the training examples:

df = dataset['train'] 
df_pandas = df.to_pandas()

5) Below is the dataframe, with a text column and a label column:

df_pandas.head()

6) Now we encode the text to get the data points whose dimensionality we want to reduce:

X = model.encode(df_pandas['text'].tolist())

7) Checking the dimensions of the data points:

X.shape

This gives (10003, 384): 10003 data points with 384 dimensions or variables. The categories are described here: https://huggingface.co/datasets/banking77#data-fields

8) Since the dimensionality reduction is performed on the data samples, we separate the target variable here; it can be used later for categorization.

target = df_pandas.iloc[:,1] 
target

9) Now we are ready to perform the principal component analysis:

What's behind the following calls?

from sklearn.decomposition import PCA

pca = PCA(n_components=4)

pca_result = pca.fit_transform(X)

🙌 Let's break it down!

Let's start from step 1

✍🏻 Subtract the mean of each variable

Subtracting the mean of each variable so that the dataset could be centered around the origin. Since the ultimate intuition is to fit the data in a direction of the greatest variance of the data, calculating covariance is crucial step.

X_meaned = X - np.mean(X , axis = 0)

✍🏻 Find the covariance matrix

Its a square matrix giving the covariance between each pair of elements of a given vector.

cov_mat = np.cov(X_meaned , rowvar = False)
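To see what np.cov computes here, a sketch on toy data (random numbers, not the embeddings) compares it with the formula written out by hand:

```python
import numpy as np

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 3))              # toy data: 50 samples, 3 variables
X_centered = X_toy - X_toy.mean(axis=0)

# Covariance by hand: X^T X on the centered data, divided by n - 1.
cov_manual = X_centered.T @ X_centered / (X_centered.shape[0] - 1)
cov_numpy = np.cov(X_centered, rowvar=False)  # rowvar=False: columns are variables

print(cov_numpy.shape)                        # one row and one column per variable
print(np.allclose(cov_manual, cov_numpy))
```

For the 384-dimensional embeddings, the same call yields a 384 x 384 matrix.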

✍🏻 Find the Eigen values and Eigen vectors

This is based on the intuition of the orthogonality(mutually perpendicular) of the vectors. Higher the Eigenvalue higher the variance. Hence, the axis on which the higher eigenvalue is observed captures the higher variability on the data points. This is done component by component, each capturing different layers of variation of data points.

eigen_values , eigen_vectors = np.linalg.eigh(cov_mat)
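A quick sanity check of what np.linalg.eigh returns, again on toy data with made-up scales per column:

```python
import numpy as np

rng = np.random.default_rng(1)
X_toy = rng.normal(size=(200, 4)) * np.array([3.0, 2.0, 1.0, 0.5])
cov_toy = np.cov(X_toy - X_toy.mean(axis=0), rowvar=False)

vals, vecs = np.linalg.eigh(cov_toy)

# Each column v of `vecs` satisfies cov @ v = eigenvalue * v.
print(np.allclose(cov_toy @ vecs, vecs * vals))
# The eigenvectors are mutually perpendicular unit vectors.
print(np.allclose(vecs.T @ vecs, np.eye(4)))
# eigh returns the eigenvalues smallest-first.
print(np.all(np.diff(vals) >= 0))
```

eigh is the appropriate routine here because a covariance matrix is symmetric.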

✍🏻 Sort the Eigenvalues in descending order

Since we wanted to capture higher variance amongst the data points, sort the eigenvalues corresponding to the eigenvector.

sorted_index = np.argsort(eigen_values)[::-1]
sorted_eigenvalue = eigen_values[sorted_index]
sorted_eigenvectors = eigen_vectors[:,sorted_index]

✍🏻 Specify the number of components to break down the data points

Since the data samples are large enough to decipher some useful results, I am setting the number of principal components to 50

num_components = 50

The columns of sorted_eigenvectors are the principal components, ordered from the highest captured variability to the lowest.

✍🏻 Select the subset of components from the above eigenvector matrix

eigenvector_subset = sorted_eigenvectors[:,0:num_components]

✍🏻 Transforming the data by applying a sequence of transpose

Perform dot product of subset along with the original mean-centered data and applying a final transpose gives the vector in reduced dimensions.

X_reduced = np.dot(eigenvector_subset.transpose() , X_meaned.transpose() ).transpose()

Now X_reduced will have data points with reduced dimensions as specified by the number of components.

X_reduced.shape  # (10003, 50)
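Two properties of the projection are worth checking on toy data: the chain of transposes equals a single matrix product, and the projected columns are uncorrelated with variances equal to the kept eigenvalues. The data and the choice of `k` below are placeholders:

```python
import numpy as np

rng = np.random.default_rng(2)
X_toy = rng.normal(size=(300, 6)) @ rng.normal(size=(6, 6))
X_centered = X_toy - X_toy.mean(axis=0)

vals, vecs = np.linalg.eigh(np.cov(X_centered, rowvar=False))
order = np.argsort(vals)[::-1]
vals, vecs = vals[order], vecs[:, order]

k = 3                                    # hypothetical number of components
subset = vecs[:, :k]

with_transposes = np.dot(subset.transpose(), X_centered.transpose()).transpose()
direct = X_centered @ subset             # same projection, one matrix product

cov_reduced = np.cov(direct, rowvar=False)
print(np.allclose(with_transposes, direct))
# Off-diagonal entries vanish; the diagonal holds the top-k eigenvalues.
print(np.allclose(cov_reduced, np.diag(vals[:k])))
```

The shorter form `X_meaned @ eigenvector_subset` would therefore give the same X_reduced.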

✍🏻 Computing the explained variance of the reduced datapoints dimensions

From the above data, perform the calculation of explained variance.

total_eigenvalues = sum(eigen_values)
variance_exp = [(i/total_eigenvalues) for i in sorted(eigen_values, reverse=True)]
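The same ratios can also guide the choice of how many components to keep. A sketch with made-up eigenvalues and an assumed 95% threshold:

```python
import numpy as np

# Hypothetical eigenvalues, already sorted in descending order.
toy_eigen_values = np.array([5.0, 2.5, 1.5, 0.6, 0.3, 0.1])

ratio = toy_eigen_values / toy_eigen_values.sum()
cumulative = np.cumsum(ratio)

target = 0.95                            # assumed threshold, pick your own
# Index of the first cumulative value >= target, plus one to turn it into a count.
n_needed = int(np.searchsorted(cumulative, target) + 1)
print(cumulative.round(3), n_needed)
```

Applied to the sorted eigenvalues of the embeddings, this would give a data-driven alternative to fixing the number of components by hand.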

Now let's plot the explained variance of the principal components we have broken down:

# Plot the explained variance
import matplotlib.pyplot as plt
cum_sum_exp = np.cumsum(variance_exp)
plt.bar(range(0,len(variance_exp)), variance_exp, alpha=0.5, align='center', label='Individual explained variance')
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal component index')
plt.legend(loc='best')
plt.tight_layout()
plt.show()
# Plot the cumulative variance
plt.step(range(0,len(cum_sum_exp)), cum_sum_exp, where='mid',label='Cumulative explained variance')
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal component index')
plt.legend(loc='best')
plt.tight_layout()
plt.show()

✍🏻 Now taking the first two components for visualization:

pca_df = pd.DataFrame(columns = ['pca1','pca2'])pca_df['pca1'] = X_reduced[:,0]
pca_df['pca2'] = X_reduced[:,1]

✍🏻 To reconstruct the dataset, let's concatenate with the target variable

pca_df = pd.concat([pca_df , pd.DataFrame(target)] , axis = 1)

✍🏻 Visualize the PCA results:

import plotly.express as pxfig = px.scatter(pca_df, x="pca1", y="pca2", color="label", hover_data=['label'])
fig.show()

The above principal components can be used for clustering or as features for classification.
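To tie the walkthrough together, the steps can be gathered into one function. This is a sketch on random toy data rather than the banking embeddings, and `pca_numpy` is a name invented here. It also maps the reduced points back to the original space to show how much information each choice of components loses:

```python
import numpy as np

def pca_numpy(X, num_components):
    """Return (projected data, components, mean) following the steps in this post."""
    mean = X.mean(axis=0)
    X_centered = X - mean
    cov = np.cov(X_centered, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)
    order = np.argsort(vals)[::-1][:num_components]
    components = vecs[:, order]          # shape (D, num_components)
    return X_centered @ components, components, mean

rng = np.random.default_rng(3)
X_toy = rng.normal(size=(500, 10)) @ rng.normal(size=(10, 10))

for k in (2, 5, 10):
    Z, components, mean = pca_numpy(X_toy, k)
    X_back = Z @ components.T + mean     # map back to the original space
    error = np.mean((X_toy - X_back) ** 2)
    print(k, Z.shape, float(error))
```

The reconstruction error shrinks as more components are kept and is essentially zero when all of them are used.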

That's it :) It's simple if we break it down!
