Deep Learning
Deep Learning is a subset of machine learning where artificial neural networks, algorithms inspired by the human brain, learn from large amounts of data. It leverages multiple layers (hence "deep") of interconnected neurons to model complex patterns in data. This method is particularly effective for tasks involving unstructured data such as images, audio, and text.
Importance of Deep Learning:
- Automation: Reduces the need for manual feature extraction by automatically learning representations from raw data.
- Scalability: Can handle vast amounts of data and leverage powerful computational resources to improve performance.
- Innovation: Drives advancements in numerous fields, from healthcare and autonomous driving to entertainment and finance.
Differences Between AI, ML & DL:
To understand deep learning, it's essential to distinguish it from broader concepts like artificial intelligence (AI) and machine learning (ML).
- Artificial Intelligence (AI): The broad field of building systems that can perform tasks which normally require human intelligence.
Example: A chess-playing program using handcrafted rules and strategies.
- Machine Learning (ML): A subset of AI that focuses on the development of algorithms that allow computers to learn from and make predictions based on data. It emphasizes learning patterns and making decisions with minimal human intervention.
Example: A spam email filter that improves its accuracy as it processes more emails.
- Deep Learning (DL): A subset of ML that uses multi-layered neural networks to automatically learn representations from data.
Example: A convolutional neural network (CNN) that identifies objects in images with high accuracy.
DS (Data Science): -
Data science is the study of data. The role of a data scientist involves developing methods of recording, storing, and analyzing data to effectively extract useful information. The final goal of data science is to gain insights and knowledge from any type of data.
Why is Deep Learning becoming more popular?
Deep Learning is gaining popularity due to its supremacy in terms of accuracy when trained with huge amounts of data.
Fundamentals of Neural Networks
Neurons: -
A neuron is the basic processing unit of a neural network: it takes weighted inputs, adds a bias, and applies an activation function to produce its output.
Forward Propagation: -
Forward Propagation is the process of feeding the input data into the neural network. In simple words, the input data is passed through the different hidden layers and, by applying a suitable activation function at each layer, we get the output result.
-->In forward propagation we take the inputs, feed them into the hidden layers, and finally get the output.
-->After getting the result we calculate the loss function; if the loss value is high, we can conclude that this is not the result we expected.
-->We need to update the weights to reduce the loss function value. In order to update the weights we perform Backward Propagation.
Backward Propagation: -
Backward Propagation is the process of moving from right to left, i.e. backward from the output layer to the input layer.
-->Backward Propagation is the preferred method of adjusting (correcting) the weights so as to minimize the loss function.
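To make forward and backward propagation concrete, here is a minimal sketch (my own NumPy illustration, not code from this post) of one forward pass, a binary cross entropy loss, and one backward pass with a weight update for a tiny one-hidden-layer network:
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))                  # 4 samples, 3 input features
y = np.array([[0.0], [1.0], [1.0], [0.0]])

W1, b1 = rng.normal(size=(3, 5)) * 0.1, np.zeros((1, 5))   # hidden layer weights
W2, b2 = rng.normal(size=(5, 1)) * 0.1, np.zeros((1, 1))   # output layer weights
lr = 0.1

# Forward propagation: inputs -> hidden layer -> output
a1 = sigmoid(X @ W1 + b1)
y_hat = sigmoid(a1 @ W2 + b2)
loss = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))  # binary cross entropy

# Backward propagation: gradients of the loss with respect to each weight
d_out = (y_hat - y) / len(X)                 # gradient at the output (BCE + sigmoid)
dW2 = a1.T @ d_out
db2 = d_out.sum(axis=0, keepdims=True)
d_hidden = (d_out @ W2.T) * a1 * (1 - a1)    # chain rule through the hidden sigmoid
dW1 = X.T @ d_hidden
db1 = d_hidden.sum(axis=0, keepdims=True)

# Update the weights to reduce the loss
W1 -= lr * dW1; b1 -= lr * db1
W2 -= lr * dW2; b2 -= lr * db2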
Vanishing Gradient Problem: -
-->The vanishing gradient problem occurs when the gradients become very small as they are propagated backward through many layers, so the weights of the earlier layers barely get updated. Whether this happens depends largely on the activation function used.
-->The activation functions we have are
1. Sigmoid function
The sigmoid function is defined as sigmoid(z) = 1 / (1 + e^(-z)); its curve is S-shaped with outputs between 0 and 1.
Advantages of Sigmoid Function : -
- The sigmoid function has a smooth gradient, preventing "jumps" in output values.
- Output values range between 0 and 1.
- The sigmoid activation function gives clear predictions, i.e. outputs very close to either 1 or 0.
Disadvantages of Sigmoid Function : -
- The sigmoid activation function is prone to the vanishing gradient problem.
- The output of the sigmoid activation function is not zero-centered. Zero-centered means the curve passes through the origin, i.e. (0,0).
- It is computationally expensive (it involves an exponential), which makes it time consuming.
2. Tanh function
The tanh function is defined as tanh(z) = (e^z - e^(-z)) / (e^z + e^(-z)); its curve is S-shaped with outputs between -1 and 1, and it is zero-centered.
3. ReLU function
ReLU function formula and curve are as follows
--> ReLU(z) = max(0, z)
-->Whenever the input z is negative, ReLU(z) = max(0, z) = 0, i.e. the output of ReLU is 0 for all negative inputs. Since the gradient is also 0 in that region, such neurons may stop updating; this is known as the Dead ReLU problem.
4. Leaky ReLU function:-
-->Leaky ReLU is a variant of ReLU that outputs a small negative slope (for example 0.01z) instead of 0 for negative inputs, which helps to avoid the Dead ReLU problem.
5. ELU (Exponential Linear Units) function:-
- There is no problem of Dead ReLU issues in ELU.
- The output mean is close to 0, i.e. it is zero-centered.
- The main problem with ELU is that it takes more computational time.
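For reference, here is a minimal NumPy sketch (my own illustration; the alpha values are just example choices) of the activation functions discussed above:
import numpy as np

def sigmoid(z):                  # output in (0, 1), not zero-centered
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):                     # output in (-1, 1), zero-centered
    return np.tanh(z)

def relu(z):                     # max(0, z); gradient is 0 for negative inputs
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):   # small slope for negative inputs avoids Dead ReLU
    return np.where(z > 0, z, alpha * z)

def elu(z, alpha=1.0):           # smooth for negative inputs, output mean close to 0
    return np.where(z > 0, z, alpha * (np.exp(z) - 1.0))

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for name, fn in [("sigmoid", sigmoid), ("tanh", tanh), ("relu", relu),
                 ("leaky_relu", leaky_relu), ("elu", elu)]:
    print(name, fn(z))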
Technique to decide which activation function we should use:-
--> We generally avoid using Sigmoid and Tanh in hidden layers because they can cause the vanishing gradient problem, especially in deep neural networks.
--> For a Binary Classification problem, prefer the ReLU activation function in the hidden layers and use the Sigmoid activation function in the output layer.
Binary Cross Entropy is the loss function applied here.
--> In case convergence is not happening with ReLU, use Leaky ReLU / PReLU / ELU in the hidden layers. But for the output layer in Binary Classification, always use the Sigmoid activation function only.
Binary Cross Entropy is the loss function applied here.
--> For a Multiclass Classification problem, use ReLU, Leaky ReLU, PReLU, or ELU in the hidden layers, but in the output layer always use the Softmax activation function.
Categorical Cross Entropy is the loss function applied here.
--> For a Linear Regression problem, use ReLU or its variants in the hidden layers, but use the Linear activation function in the output layer.
The loss functions applied here are MSE, MAE, or Huber Loss.
-->For non-linear regression, we use non-linear activations like ReLU in hidden layers, but still use linear activation in the output layer with MSE/MAE/Huber loss.
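As a concrete illustration of these pairings, here is a minimal Keras sketch (my own example, assuming TensorFlow 2.x; the layer sizes and input shape are arbitrary):
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

# Binary classification: ReLU in hidden layers, Sigmoid output, Binary Cross Entropy loss
binary_model = Sequential([
    Dense(32, activation="relu", input_shape=(10,)),
    Dense(1, activation="sigmoid"),
])
binary_model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Multiclass classification: ReLU hidden layers, Softmax output, Categorical Cross Entropy loss
multiclass_model = Sequential([
    Dense(32, activation="relu", input_shape=(10,)),
    Dense(5, activation="softmax"),
])
multiclass_model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

# Regression: ReLU hidden layers, Linear output, MSE (or MAE / Huber) loss
regression_model = Sequential([
    Dense(32, activation="relu", input_shape=(10,)),
    Dense(1, activation="linear"),
])
regression_model.compile(optimizer="adam", loss="mse", metrics=["mae"])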
Loss Function and cost function: -
--> In loss function, the error is calculated for a single data record (one sample) during forward propagation.
--> In cost function, the error is calculated over a batch of records or the entire dataset by taking the average loss.
Regression loss functions are:-
- MSE (Mean Squared Error)
- MAE (Mean Absolute Error)
- Huber Loss
- Pseudo Huber Loss
1)MSE (Mean Squared Error): -
-->Mean squared error is the average of the squared differences between the predicted values and the actual values: MSE = (1/n) Σ (yᵢ − ŷᵢ)².
Advantages:-
1)Mean squared error is differentiable.
2)It has only one local/global minimum.
3)It converges faster.
Disadvantages:-
1)It is not robust to outliers.
2)MAE (Mean Absolute Error): -
-->Mean absolute error is robust to outliers.
-->Mean absolute error is the average amount of error in the measurements, i.e. the average absolute difference between the predicted value and the actual value.
MAE = (1/n) Σ |yᵢ − ŷᵢ|
where:
MAE = mean absolute error
ŷᵢ = prediction
yᵢ = true value
n = total number of data points
3)Huber Loss: -
-->Huber loss combines MSE and MAE: it behaves like MSE for small errors and like MAE for large errors (beyond a threshold δ), so it is differentiable and still robust to outliers.
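Here is a minimal NumPy sketch (my own illustration; delta = 1.0 is just an example threshold) of the three regression losses discussed above:
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def huber(y_true, y_pred, delta=1.0):
    err = y_true - y_pred
    small = np.abs(err) <= delta
    # quadratic (MSE-like) for small errors, linear (MAE-like) for large errors
    return np.mean(np.where(small, 0.5 * err ** 2, delta * (np.abs(err) - 0.5 * delta)))

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])
print(mse(y_true, y_pred), mae(y_true, y_pred), huber(y_true, y_pred))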
Classification loss functions are:-
--> In classification, we generally use Cross Entropy–based loss functions.
--> Cross Entropy is divided into:
- Binary Cross Entropy – for binary classification
- Categorical Cross Entropy – for one-hot encoded multiclass labels
- Sparse Categorical Cross Entropy – for integer-labeled multiclass data
--> Hinge Loss is another classification loss, mainly used in SVM-based classifiers.
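As a quick illustration, here is a minimal Keras sketch (my own example, assuming TensorFlow 2.x) of which loss string to pass for each label format; the tiny helper model is just a placeholder:
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

def make_model(units, activation):
    return Sequential([Dense(8, activation="relu", input_shape=(4,)),
                       Dense(units, activation=activation)])

make_model(1, "sigmoid").compile(loss="binary_crossentropy", optimizer="adam")              # binary labels (0/1)
make_model(3, "softmax").compile(loss="categorical_crossentropy", optimizer="adam")         # one-hot labels
make_model(3, "softmax").compile(loss="sparse_categorical_crossentropy", optimizer="adam")  # integer labels
make_model(1, "linear").compile(loss="hinge", optimizer="adam")                             # SVM-style hinge loss (labels -1/+1)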
Optimizers: -
Mini-Batch Gradient Descent: -
Instead of processing the entire dataset at once (as in Batch Gradient Descent) or updating weights after every single data point (as in Stochastic Gradient Descent), Mini-Batch Gradient Descent takes a middle approach:
1. Divide the dataset into smaller batches.
- Example: If you have 1,000 records and choose a batch size of 100, you will have 10 batches in total.
2. Iteratively update the weights for each batch:
- Take the first batch (100 records), compute the loss, and update the weights.
- Move to the next batch, using the updated weights from the previous step, compute the loss, and update the weights again.
- Repeat this process for all batches.
3. Multiple weight updates per epoch:
- If the entire dataset were processed at once, the weights would be updated only once per epoch.
- With mini-batches, weights get updated multiple times per epoch (in this case, 10 times for 10 batches), leading to more frequent learning and better optimization.
Why Use Mini-Batch Gradient Descent?
✅ More stable updates compared to Stochastic Gradient Descent (SGD).
✅ Faster and more memory-efficient than Batch Gradient Descent.
✅ Helps generalization by introducing some randomness in learning.
-->In this way we reduce the noise in the weight updates.
-->It gives quicker convergence.
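Here is a minimal NumPy sketch (my own illustration) of Mini-Batch Gradient Descent on a toy linear-regression problem, using 1,000 records and a batch size of 100 so the weights are updated 10 times per epoch:
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=1000)

w = np.zeros(3)
lr, batch_size, epochs = 0.1, 100, 5

for epoch in range(epochs):
    idx = rng.permutation(len(X))                  # shuffle once per epoch
    for start in range(0, len(X), batch_size):     # 10 batches of 100 records
        batch = idx[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(Xb)  # gradient of MSE on this batch
        w -= lr * grad                             # one weight update per batch
    print(f"epoch {epoch}: w = {w.round(3)}")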
5)AdaGrad: -
-->AdaGrad stands for Adaptive Gradient Descent. In plain gradient descent the learning rate is fixed; in AdaGrad the learning rate adapts so that it is high initially and decreases as we move towards the global minimum point.
-->In other words, AdaGrad brings adaptiveness into the learning: the learning rate is not fixed, it keeps decreasing as we approach the global minimum point.
6)RMSPROP: -
-->RMSProp (Root Mean Squared Propagation) is an extension of gradient descent and of the AdaGrad version of gradient descent; it uses a decaying average of the squared gradients so the learning rate does not shrink too aggressively.
7)Adam Optimizer: -
-->In the Adam optimizer we combine momentum with RMSProp.
-->It is one of the best and most widely used optimizers.
-->It smooths the weight updates using momentum.
-->The learning rate becomes adaptive.
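As a quick reference, here is a minimal Keras sketch (my own example, assuming TensorFlow 2.x; the learning rates are just typical defaults) of creating the optimizers discussed above:
from tensorflow.keras.optimizers import SGD, Adagrad, RMSprop, Adam

sgd = SGD(learning_rate=0.01, momentum=0.9)   # mini-batch SGD with momentum
adagrad = Adagrad(learning_rate=0.01)         # learning rate shrinks as training progresses
rmsprop = RMSprop(learning_rate=0.001)        # AdaGrad extension with a decaying average
adam = Adam(learning_rate=0.001)              # momentum + RMSProp combined

# Usage sketch (model assumed to be an already-built Keras model):
# model.compile(optimizer=adam, loss="binary_crossentropy", metrics=["accuracy"])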
Batch Normalization:
Batch Normalization is a technique in deep learning that helps speed up training and improve the stability of neural networks. Here’s a simple way to understand it:
Why Do We Need Batch Normalization?
As training progresses, the distribution of each layer's inputs keeps shifting because the weights of the previous layers keep changing. Batch Normalization normalizes the activations of each mini-batch (to roughly zero mean and unit variance), which keeps these distributions stable and makes training faster and more reliable.
Benefits of Batch Normalization
✅ Faster training – allows higher learning rates.
✅ More stable training – prevents drastic changes in activations.
✅ Reduces overfitting – acts like a regularizer.
✅ Less sensitive to weight initialization.
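Here is a minimal Keras sketch (my own example) of placing a Batch Normalization layer between a Dense layer and its activation; the layer sizes are arbitrary:
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, BatchNormalization, Activation

model = Sequential([
    Dense(64, input_shape=(20,)),
    BatchNormalization(),        # normalizes the batch before the activation
    Activation("relu"),
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")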
Sequential:-
Taking the entire neural network at once as a block (a linear stack of layers) is called Sequential. This indicates we can do forward propagation & backward propagation through it.
Dense: -
Dense helps us to create the input, hidden & output layers (fully connected layers).
Activation: -
Activation helps us to apply different types of activation functions to the layers.
-->ANN, CNN, RNN are Black Box models.
-->We need to scale the data before we apply ANN & CNN.
-->For scaling the data for ANN, the Standard Scaler is mostly used.
-->One of the libraries we use to implement ANN is TensorFlow, which was developed by Google. Before TensorFlow 2.0 we needed to install TensorFlow & Keras separately, but from versions greater than 2.0 Keras is integrated into TensorFlow.
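Here is a minimal ANN sketch (my own example, assuming TensorFlow >= 2.0 with Keras integrated; the data is randomly generated purely for illustration) that applies the Standard Scaler before training a small Sequential model:
import numpy as np
from sklearn.preprocessing import StandardScaler
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
y = (X.sum(axis=1) > 0).astype(int)

X_scaled = StandardScaler().fit_transform(X)          # scale the data before the ANN

model = Sequential([
    Dense(16, activation="relu", input_shape=(8,)),   # hidden layer
    Dense(8, activation="relu"),                      # hidden layer
    Dense(1, activation="sigmoid"),                   # output layer
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X_scaled, y, epochs=5, batch_size=32, verbose=0)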
Click here to see the ANN implementation
Dropout in Neural Networks
Why is Dropout Needed?
- Sometimes, a neural network overfits, meaning it performs well on training data but poorly on test data.
- Overfitting happens when neurons memorize training data instead of learning general patterns.
- To prevent this, we use Dropout, a regularization technique.
How Dropout Works
- Dropout is a regularization layer that randomly deactivates a percentage of neurons during training.
- If we set dropout = 0.3, then 30% of neurons are randomly turned off during training.
- Key point:
- In Batch 1, 30% of neurons are randomly deactivated.
- In Batch 2, a different set of 30% of neurons are deactivated.
- This continues for all batches during training.
- Dropout does not remove neurons permanently; it reduces the dependency of one neuron on others, helping the model learn more generalized patterns.
Benefits of Dropout
✅ Reduces overfitting
✅ Makes the model more robust
✅ Ensures neurons don’t become overly reliant on each other
🔹 Works well in fully connected (dense) layers, but not always needed in convolutional layers (CNNs).
🔹 Common Dropout values: 0.2 - 0.5
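Here is a minimal Keras sketch (my own example) of adding Dropout after dense layers, using the 0.3 rate from the example above:
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Dropout

model = Sequential([
    Dense(128, activation="relu", input_shape=(20,)),
    Dropout(0.3),                      # randomly deactivates 30% of neurons in each training batch
    Dense(64, activation="relu"),
    Dropout(0.3),
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")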
Early Stopping
- Stops training when validation loss stops improving.
- Prevents the model from continuing to learn noise after reaching the optimal point.
- Saves training time and prevents overfitting.
🔹 Steps:
- Monitor validation loss.
- If loss stops decreasing for a set number of epochs, stop training.
- Use the model from the epoch with the best validation performance.
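Here is a minimal Keras sketch (my own example; patience=5 is just an example value) of Early Stopping following these steps:
from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(
    monitor="val_loss",          # watch the validation loss
    patience=5,                  # stop if it does not improve for 5 epochs
    restore_best_weights=True,   # roll back to the epoch with the best validation performance
)
# Usage sketch (model and training data assumed to exist):
# model.fit(X_train, y_train, validation_split=0.2, epochs=100, callbacks=[early_stop])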
Data Augmentation
- Increases the diversity of the training data artificially.
- Commonly used in image processing (rotating, flipping, adding noise).
- Helps prevent overfitting by ensuring the model learns more general patterns.
🔹 Examples in Image Classification:
✔ Random cropping, flipping, rotation
✔ Adding noise or blurring
✔ Changing brightness or contrast
📌 Works best for computer vision tasks.
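Here is a minimal Keras sketch (my own example, assuming a recent TensorFlow 2.x release where these preprocessing layers are available) of the augmentations listed above:
from tensorflow.keras import Sequential
from tensorflow.keras.layers import RandomFlip, RandomRotation, RandomContrast

augment = Sequential([
    RandomFlip("horizontal"),    # random horizontal flipping
    RandomRotation(0.1),         # random rotation up to roughly 36 degrees
    RandomContrast(0.2),         # random contrast change
])
# Usage sketch: place `augment` as the first layers of an image model, or apply it to a batch:
# augmented_images = augment(images, training=True)   # `images` is an assumed batch tensor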
Label Smoothing
- Instead of using hard labels (e.g., 1 for cat, 0 for dog), soften the labels (e.g., 0.9 for cat, 0.1 for dog).
- Prevents the model from becoming too confident, improving generalization.
- Used in classification tasks like NLP and computer vision.
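Here is a minimal Keras sketch (my own example; 0.1 is a typical smoothing value) of applying label smoothing through the loss function:
from tensorflow.keras.losses import CategoricalCrossentropy

loss = CategoricalCrossentropy(label_smoothing=0.1)   # softens the one-hot targets
# Usage sketch: model.compile(optimizer="adam", loss=loss, metrics=["accuracy"])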
Gradient Clipping
- Limits the size of gradients to prevent exploding gradients, especially in RNNs.
- Helps in stabilizing training when using deep networks.
📌 Common in recurrent neural networks (RNNs) and deep transformers like GPT.
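Here is a minimal Keras sketch (my own example; clipnorm=1.0 is just an example threshold) of gradient clipping configured through the optimizer:
from tensorflow.keras.optimizers import Adam

optimizer = Adam(learning_rate=0.001, clipnorm=1.0)    # rescales gradients whose norm exceeds 1.0
# optimizer = Adam(learning_rate=0.001, clipvalue=0.5) # or clip each gradient element to [-0.5, 0.5]
# Usage sketch: model.compile(optimizer=optimizer, loss="categorical_crossentropy")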
Weight Initialization in Deep Learning
Weight initialization is the process of assigning initial values to the weights of a neural network before training. Proper initialization is crucial to ensure stable gradient updates and faster convergence.
Why is Weight Initialization Important?
- Avoids Exploding/Vanishing Gradients: Poor initialization can cause gradients to become too large or too small, making training unstable.
- Speeds Up Convergence: Proper initialization helps the model learn faster by providing meaningful starting points.
- Prevents Dead Neurons: Ensures neurons remain active and contribute to learning.
Types of Weight Initialization
1️⃣ Zero Initialization:
- All weights are set to zero.
- Problem: Neurons will learn the same features, leading to a lack of diversity.
- ❌ Not recommended.
- Causes symmetry issue, no learning
2️⃣ Random Initialization:
- Weights are set randomly.
- If values are too high, gradients may explode. If too low, gradients may vanish.
- ❌ Not ideal for deep networks.
- Can cause vanishing/exploding gradients
3️⃣ Xavier (Glorot) Initialization:
- Designed for sigmoid & tanh activations.
- Ensures balanced variance of activations and gradients.
- ✅ Good for shallow networks.
- Maintains variance across layers
4️⃣ He Initialization (Kaiming Initialization):
- Designed for ReLU activations.
- Prevents neurons from becoming "dead."
- ✅ Best for deep networks with ReLU.
5️⃣ Lecun Initialization:
- Designed for SELU activations.
- ✅ Works best for self-normalizing networks.
How to Apply in Keras?
from keras.layers import Dense
from keras.initializers import HeNormal, GlorotUniform
# Using He Initialization for ReLU activation
layer1 = Dense(128, activation="relu", kernel_initializer=HeNormal())
# Using Xavier Initialization for Tanh activation
layer2 = Dense(64, activation="tanh", kernel_initializer=GlorotUniform())
CONVOLUTIONAL NEURAL NETWORK (CNN): -
Whenever we work with images & video frames, e.g. image classification or object detection, we prefer to use CNN.
-->A CNN works similarly to the visual cortex, which is present in the cerebral cortex of the human brain, i.e. a CNN passes the input through different layers to process the result.
-->The first step we do in CNN is bring the pixel values between 0 & 1 by dividing each pixel value by 255; this step is called MIN-MAX SCALING.
Before we go into the working of CNN, let's see the basics of how an image is represented. Images are classified into 2 types: 1)Black & White (grayscale) images 2)RGB images, where RGB means RED, GREEN, BLUE.
--> The filter is applied on the image matrix, i.e. each value of the image patch is multiplied element-wise with the corresponding value in the filter/kernel, the products are summed, and the sum is assigned to the output matrix; this process continues as the filter slides over the image.
-->Feature scaling is applied on the matrix we get after applying the filter: the lowest value is mapped to 0 & the highest value to 255.
-->255 is referred to as WHITE and 0 as BLACK.
Whenever we pass a 6X6 image through a 3X3 filter, we get a 4X4 matrix as output. We can calculate this as below:
For example, let's assume the input image is 6X6, so n=6. As the filter is a 3X3 matrix, f=3. The formula to calculate the output size is n-f+1 = 6-3+1 = 4, i.e. the output is a 4X4 matrix.
-->But the major problem here is that the input is 6X6 while the output is 4X4, which means the image size is decreasing, i.e. we are losing some information. In order to prevent this loss we use PADDING.
-->PADDING is nothing but building a border (compound) of extra pixels around the image. The concept of applying such a layer around the image is called PADDING.
-->After building the border around the image, the 6X6 image is converted to 8X8.
-->Let's calculate the size of the output image after padding; the formula is as below:
(n + 2p - f)/s + 1
where p = number of layers of padding applied; in the above use case p = 1
s = number of strides; in the above use case s = 1
(n + 2p - f)/s + 1 = (6 + 2 - 3)/1 + 1 = 6, i.e. the output is a 6X6 matrix, so the original size is preserved.
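Here is a tiny helper (my own sketch) implementing this formula, checked against the 6X6 image with a 3X3 filter:
def conv_output_size(n, f, p=0, s=1):
    # output size = (n + 2p - f) / s + 1
    return (n + 2 * p - f) // s + 1

print(conv_output_size(6, 3))        # no padding  -> 4 (4X4 output)
print(conv_output_size(6, 3, p=1))   # padding p=1 -> 6 (6X6 output, size preserved)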
-->In ANN as we update the weights in back propagation similarly we update the filter in CNN in back propagation.
-->We apply the ReLU activation function on each & every value of the output in CNN.
-->All the above steps come under the convolution operation. After completing the convolution operation we do pooling (commonly Max Pooling).
Pooling:-
Pooling is nothing but the down-sampling of an image. Pooling helps us achieve location invariance, i.e. the network becomes less sensitive to exactly where a feature appears. We have three types of pooling:
1)Avg Pooling
2)Min Pooling
3)Max Pooling
-->In Avg pooling the average of the pool is considered, whereas in Min pooling the minimum of the pool is considered, and in Max pooling the maximum of the pool is considered.
Flattening Layer:-
This is the next step after completing pooling. Flattening is used to convert all the resultant 2-D arrays from the pooled feature maps into a single long linear vector. This flattened vector is passed through a fully connected neural network & finally we get the results.
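Putting the steps together, here is a minimal CNN sketch (my own example, assuming TensorFlow 2.x; the shapes assume 28x28 grayscale images with 10 classes):
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential([
    Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),  # convolution + ReLU
    MaxPooling2D((2, 2)),                                            # max pooling (down-sampling)
    Conv2D(64, (3, 3), activation="relu"),
    MaxPooling2D((2, 2)),
    Flatten(),                                                       # flattening layer
    Dense(64, activation="relu"),                                    # fully connected layer
    Dense(10, activation="softmax"),                                 # output layer
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# Min-max scaling before training (assumed image arrays): X_train = X_train / 255.0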
Click here to know the implementation of CNN
LSTM RNN Networks:-
Bidirectional LSTM 🚀
What is it?
A Bidirectional LSTM (BiLSTM) is an advanced LSTM that processes data in both forward and backward directions to capture more context from sequences.
How it works?
🔹 Standard LSTM reads input from past to future (left to right).
🔹 BiLSTM has two LSTMs:
- Forward LSTM (left → right)
- Backward LSTM (right → left)
🔹 Both outputs are combined to improve understanding.
Why use it?
✅ Captures past & future context
✅ Improves performance in NLP tasks (e.g., translation, speech recognition)
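Here is a minimal Keras sketch (my own example; the vocabulary size and layer sizes are arbitrary) of wrapping an LSTM in a Bidirectional layer for a text classification task:
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense

model = Sequential([
    Embedding(input_dim=10000, output_dim=64),
    Bidirectional(LSTM(64)),          # forward LSTM + backward LSTM, outputs combined
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])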
What is Sequence-to-Sequence (Seq2Seq)?
Seq2Seq models are designed for tasks where both the input and output are sequences of variable lengths. They are widely used in:
- Machine Translation (e.g., English → French)
- Text Summarization (e.g., Long text → Summary)
- Speech-to-Text (e.g., Audio → Transcription)
A Seq2Seq model consists of two primary components: Encoder and Decoder, both usually implemented using Recurrent Neural Networks (RNNs) like LSTM (Long Short-Term Memory) or GRU (Gated Recurrent Unit).
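Here is a minimal encoder-decoder (Seq2Seq) sketch (my own example, assuming TensorFlow 2.x; the vocabulary sizes and dimensions are arbitrary, and no attention is included) using LSTM layers:
from tensorflow.keras import Model, Input
from tensorflow.keras.layers import LSTM, Dense, Embedding

latent_dim, src_vocab, tgt_vocab = 256, 8000, 8000

# Encoder: reads the source sequence and keeps only its final states
encoder_inputs = Input(shape=(None,))
enc_emb = Embedding(src_vocab, latent_dim)(encoder_inputs)
_, state_h, state_c = LSTM(latent_dim, return_state=True)(enc_emb)

# Decoder: generates the target sequence, initialised with the encoder states
decoder_inputs = Input(shape=(None,))
dec_emb = Embedding(tgt_vocab, latent_dim)(decoder_inputs)
dec_out, _, _ = LSTM(latent_dim, return_sequences=True, return_state=True)(
    dec_emb, initial_state=[state_h, state_c])
decoder_outputs = Dense(tgt_vocab, activation="softmax")(dec_out)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")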
🔥 Final Thoughts
The Attention Mechanism revolutionized Seq2Seq models, allowing them to handle long sequences efficiently. This led to the development of Transformer-based architectures like T5, BART, GPT, which rely entirely on attention (instead of RNNs).