Deep Learning


Deep Learning is a subset of machine learning where artificial neural networks, algorithms inspired by the human brain, learn from large amounts of data. It leverages multiple layers (hence "deep") of interconnected neurons to model complex patterns in data. This method is particularly effective for tasks involving unstructured data such as images, audio, and text.

Importance of Deep Learning:

- Performance: Deep learning models often outperform traditional machine learning models, especially in tasks such as image recognition, natural language processing, and speech recognition.
- Automation: Reduces the need for manual feature extraction by automatically learning representations from raw data.
- Scalability: Can handle vast amounts of data and leverage powerful computational resources to improve performance.
- Innovation: Drives advancements in numerous fields, from healthcare and autonomous driving to entertainment and finance.

Differences Between AI, ML & DL:

To understand deep learning, it's essential to distinguish it from broader concepts like artificial intelligence (AI) and machine learning (ML).

- Artificial Intelligence (AI): A broad field encompassing any technique that enables computers to mimic human intelligence. This includes rule-based systems, statistical methods, and various forms of learning.
  Example: A chess-playing program using handcrafted rules and strategies.

- Machine Learning (ML): A subset of AI that focuses on the development of algorithms that allow computers to learn from and make predictions based on data. It emphasizes learning patterns and making decisions with minimal human intervention.
  Example: A spam email filter that improves its accuracy as it processes more emails.

- Deep Learning (DL): A specialized subset of ML that uses neural networks with many layers (deep networks) to model complex patterns in large datasets. It eliminates the need for manual feature extraction by learning hierarchical representations.
  Example: A convolutional neural network (CNN) that identifies objects in images with high accuracy.

DS (Data Science): -

Data science is the study of data. The role of a data scientist involves developing methods for recording, storing, and analyzing data so that useful information can be extracted effectively. The final goal of data science is to gain insights and knowledge from any type of data.

Why is Deep Learning becoming more popular?

   Deep Learning is gaining popularity due to its supremacy in terms of accuracy when trained with huge amounts of data.

As the amount of data increases, traditional machine learning models stop improving, so for huge datasets deep learning is preferred.
From the trends of deep learning, we can observe that deep learning started evolving rapidly from around 2015, when people began using social media platforms heavily and huge amounts of data started being created. Using this data, companies started building AI models to improve the user experience.
-->The advancement in hardware is also one of the reasons for the increase in the usage of deep learning.


Fundamentals of Neural Networks

Basic Concepts

Neurons

Neurons are the fundamental units of neural networks, inspired by biological neurons in the human brain. A neuron receives input, processes it, and produces an output. In the context of artificial neural networks, a neuron typically performs a weighted sum of its inputs, applies an activation function, and outputs the result.

- **Mathematical Representation**: the neuron computes a weighted sum of its inputs plus a bias, z = w1·x1 + w2·x2 + … + wn·xn + b, and then applies an activation function f to produce the output a = f(z).
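A minimal NumPy sketch of a single artificial neuron; the input, weight, and bias values below are hypothetical:

```python
import numpy as np

def sigmoid(z):
    # squashes any real number into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    # weighted sum of the inputs plus the bias, then the activation function
    z = np.dot(w, x) + b
    return sigmoid(z)

x = np.array([1.0, 2.0])      # inputs (hypothetical values)
w = np.array([0.5, -0.25])    # weights
b = 0.0                       # bias
y = neuron(x, w, b)           # z = 0.5*1 - 0.25*2 + 0 = 0, sigmoid(0) = 0.5
```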
Forward Propagation: -

                     Forward Propagation is the process of feeding the input data into the neural network: the input data passes through the different hidden layers, and by applying suitable activation functions we obtain the output result.

-->In forward propagation we take the inputs, feed them into the hidden layers, and finally get the output.

-->After getting the result we calculate the loss function; if the loss value is high, we can conclude that this is not the result we expected.

-->We need to update the weights to reduce the loss function value. In order to update the weights we perform Backward Propagation.

Backward Propagation: -

              Backward Propagation is the process of moving from right to left, i.e. backward from the output layer to the input layer.

-->Backward Propagation is the method of adjusting or correcting the weights so as to minimize the loss function.

-->In Backward Propagation we use the
         Weight Updating Formula

                Wnew = Wold − α × (∂L/∂Wold)

                Where Wnew = New Weight
                            Wold = Old Weight
                                  α = Learning Rate
         The gradient is the derivative of the Loss with respect to the old weight.
-->By using the Weight Updating Formula we reach the Global Minimum point.
-->The derivative of the Loss with respect to the old weight is also called the Slope.
-->When the slope is negative, the new weight is greater than the old weight. If the slope is positive, the new weight is less than the old weight. Using this concept we construct the Gradient Descent graph and obtain the Global Minimum point.
-->The Learning Rate should be very small. Usually the preferred Learning Rate is 0.001 or 0.01.

-->Similarly, for updating the bias use: bnew = bold − α × (∂L/∂bold)
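A minimal numeric sketch of the updating formulas above. The gradient values here are assumed; in practice they come from backpropagation:

```python
# One application of the weight/bias updating formula (all numbers hypothetical).
w_old, b_old = 0.8, 0.1
alpha = 0.01                  # learning rate (small, e.g. 0.01 or 0.001)
dL_dw, dL_db = 0.5, -0.2      # dLoss/dW_old and dLoss/db_old (assumed values)

w_new = w_old - alpha * dL_dw   # positive slope: new weight < old weight
b_new = b_old - alpha * dL_db   # negative slope: new bias > old bias
# w_new ≈ 0.795, b_new ≈ 0.102
```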

Vanishing Gradient Problem: -

                         Vanishing gradient happens when gradients become so small that the weights effectively stop updating, mainly due to activation functions like sigmoid in deep networks. In other words, in the Vanishing Gradient Problem there is only a tiny change between the new weight and the old weight, i.e. the new weight is almost the same as the old weight.
-->Some activation functions, like the sigmoid function, squish a large input space into a small output space between 0 and 1.
-->During backpropagation, their derivatives are very small, which causes gradients to shrink layer by layer, leading to vanishing gradients.
-->A large change in the input of the sigmoid function results in only a small change in the output. Therefore the derivative becomes small.

                                              [Figure: the sigmoid function and its derivative]
-->In the figure of the sigmoid function and its derivative, we can notice that the derivative becomes close to zero.
-->In order to avoid the Vanishing Gradient Problem we use different activation functions.
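A small numeric illustration of the shrinking effect: the sigmoid derivative is at most 0.25, so chaining it across ten layers via the chain rule makes the gradient nearly vanish.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)          # at most 0.25, reached at z = 0

# chain rule through 10 layers: even in the best case (derivative = 0.25)
# the gradient shrinks geometrically layer by layer
grad = 1.0
for _ in range(10):
    grad *= sigmoid_derivative(0.0)
# grad = 0.25 ** 10, roughly 1e-6: the earliest layers barely learn
```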

-->The activation functions we have are

                        1) Sigmoid
                        2) Tanh
                        3) ReLU (Rectified Linear Unit)
                        4) Leaky ReLU
                        5) Parameterised ReLU

1. Sigmoid function

              The function formula is: σ(z) = 1 / (1 + e^(−z))

The sigmoid function was the most commonly used activation function in the early days of deep learning.
-->The sigmoid activation function is a smooth function that is easy to differentiate.
-->If the function output is centered at 0, weight updates take less time; but the sigmoid output is not centered at 0, which makes weight updates take more time.
-->The derivative of the sigmoid function ranges from 0 to 0.25.

Advantages of the Sigmoid Function: -

  1. Smooth gradient, preventing “jumps” in output values.
  2. Output values range between 0 and 1.
  3. Clear predictions, i.e. the output is very close to either 1 or 0.
Disadvantages:
  • Prone to vanishing gradients.
  • The output is not zero-centered. Zero-centered means the curve passes through the origin, i.e. (0,0).

  • It is computationally expensive (the exponential is slow to compute).

2. Tanh function

The tanh function formula is: tanh(z) = (e^z − e^(−z)) / (e^z + e^(−z))

-->The tanh function is also called the hyperbolic tangent function.
-->Whenever we apply the tanh activation function, the output ranges from -1 to +1, and its derivative ranges from 0 to 1.
-->Even though we use the tanh activation function, there are still chances of the Vanishing Gradient Problem.

3. ReLU function

The ReLU function formula is:

--> ReLU(z) = max(0, z)

-->Whenever the input is negative, ReLU = max(0, negative) = 0, i.e. whenever the input is negative the result of ReLU is 0.

-->ReLU stands for Rectified Linear Unit.
-->The derivative of the ReLU function is either 0 or 1.
-->In backpropagation, if the derivative of ReLU is 0, that neuron is completely dead, i.e. Wnew is approximately equal to Wold.
-->The ReLU function is very fast to compute.
-->The disadvantage of ReLU is that it is not zero-centered.

4. Leaky ReLU function: -

                          In order to solve the problem of dead neurons in ReLU, we use the Leaky ReLU function: LeakyReLU(z) = z if z > 0, otherwise a small multiple of z (e.g. 0.01·z).

In Leaky ReLU the graph changes slightly for negative input values, i.e. the line is slightly bent downward, so instead of getting 0 as the result we get small negative values.


5. ELU (Exponential Linear Units) function

  • There is no Dead ReLU issue in ELU.
  • The output mean is close to 0, i.e. it is nearly zero-centered.
  • The main problem with ELU is that it takes more computation time (because of the exponential).
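The activation functions listed above can be written as short NumPy sketches; the slope parameters below are illustrative defaults:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))            # range (0, 1)

def tanh(z):
    return np.tanh(z)                          # range (-1, 1)

def relu(z):
    return np.maximum(0.0, z)                  # max(0, z)

def leaky_relu(z, slope=0.01):
    return np.where(z > 0, z, slope * z)       # small negative slope, no dead neurons

def parameterised_relu(z, a):
    return np.where(z > 0, z, a * z)           # the slope `a` is learned in PReLU

def elu(z, alpha=1.0):
    return np.where(z > 0, z, alpha * (np.exp(z) - 1.0))  # smooth negative side
```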

How to decide which activation function to use: -

--> We generally avoid using Sigmoid and Tanh in hidden layers because they can cause the vanishing gradient problem, especially in deep neural networks.

--> For a binary classification problem, prefer the ReLU activation function in the hidden layers and use the Sigmoid activation function in the output layer.
Binary Cross Entropy is the loss function applied here.

--> In case convergence does not happen with ReLU, use Leaky ReLU / PReLU / ELU in the hidden layers. But for the output layer in binary classification, always use the Sigmoid activation function only.
Binary Cross Entropy is the loss function applied here.

--> For a multiclass classification problem, use ReLU, Leaky ReLU, PReLU, or ELU in the hidden layers, but in the output layer always use the Softmax activation function.
Categorical Cross Entropy is the loss function applied here.

--> For a linear regression problem, use ReLU or its variants in the hidden layers, but in the output layer use a linear activation function.
The loss functions applied here are MSE, MAE, or Huber Loss.

-->For non-linear regression, we also use non-linear activations like ReLU in the hidden layers, but still use a linear activation in the output layer, with MSE/MAE/Huber loss.


Loss Function and cost function: -

--> In loss function, the error is calculated for a single data record (one sample) during forward propagation.

--> In cost function, the error is calculated over a batch of records or the entire dataset by taking the average loss.

Regression loss functions are:-

  1. MSE (Mean Squared Error)

  2. MAE (Mean Absolute Error)

  3. Huber Loss

  4. Pseudo Huber Loss


1) MSE (Mean Squared Error): -
     Mean squared error is the average of the squares of the errors.

MSE = (1/n) Σᵢ₌₁ⁿ (Yᵢ − Ŷᵢ)²

where n = number of data points, Yᵢ = observed values, Ŷᵢ = predicted values.

Advantages: -

1) Mean squared error is differentiable.

2) It has only one local/global minimum (it is convex).

3) It converges faster.

Disadvantages: -

1) It is not robust to outliers.

2) MAE (Mean Absolute Error): -

-->Mean absolute error is robust to outliers.

-->Mean Absolute Error is the average magnitude of the errors, i.e. the average absolute difference between the predicted value and the actual value.

MAE = (1/n) Σᵢ₌₁ⁿ |yᵢ − xᵢ|

where yᵢ = prediction, xᵢ = true value, n = total number of data points.

3) Huber Loss: -
-->Huber loss is a combination of both MSE (Mean Squared Error) and MAE (Mean Absolute Error): it behaves like MSE for small errors and like MAE for large errors, with a threshold δ controlling the switch.
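A NumPy sketch of these three regression losses, evaluated on hypothetical targets with one outlier-like error:

```python
import numpy as np

def mse(y, y_hat):
    # average of squared errors: punishes outliers heavily
    return np.mean((y - y_hat) ** 2)

def mae(y, y_hat):
    # average absolute error: robust to outliers
    return np.mean(np.abs(y - y_hat))

def huber(y, y_hat, delta=1.0):
    # MSE-like for |error| <= delta, MAE-like beyond it
    err = y - y_hat
    quadratic = 0.5 * err ** 2
    linear = delta * (np.abs(err) - 0.5 * delta)
    return np.mean(np.where(np.abs(err) <= delta, quadratic, linear))

y = np.array([1.0, 2.0, 3.0])
y_hat = np.array([1.0, 2.0, 6.0])   # one large error (outlier-like prediction)
# mse = 3.0 (dominated by the outlier), mae = 1.0, huber(delta=1) = 2.5/3
```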


Classification loss functions are:-

--> In classification, we generally use Cross Entropy–based loss functions

--> Cross Entropy is divided into:

  1. Binary Cross Entropy – for binary classification

  2. Categorical Cross Entropy – for one-hot encoded multiclass labels

  3. Sparse Categorical Cross Entropy – for integer-labeled multiclass data

  4. Hinge Loss – mainly used in SVM-based classifiers

1)Binary Cross Entropy:-
      Binary cross entropy is used for Binary classification.
2)Categorical Cross Entropy: -
 Categorical Cross Entropy is used for multiclass classification.
-->In Categorical Cross Entropy first step we perform is one-hot encoding.
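A minimal NumPy sketch of both cross-entropy losses; the predicted probabilities below are hypothetical:

```python
import numpy as np

def binary_cross_entropy(y, p, eps=1e-12):
    # y: true labels (0 or 1), p: predicted probability of class 1
    p = np.clip(p, eps, 1.0 - eps)          # avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def categorical_cross_entropy(y_onehot, p, eps=1e-12):
    # y_onehot: one-hot encoded labels, p: predicted class probabilities per row
    p = np.clip(p, eps, 1.0)
    return -np.mean(np.sum(y_onehot * np.log(p), axis=1))

y = np.array([1, 0])
p = np.array([0.9, 0.1])                    # confident, correct predictions
bce = binary_cross_entropy(y, p)            # -log(0.9) ≈ 0.105, a small loss
```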

Optimizers: -

                 Optimizers are very important in backward propagation: they determine how the weights get updated.
--> Different types of optimizers are
              1)Gradient Descent
              2)SGD(Stochastic Gradient Descent)
              3)Mini Batch SGD
              4)Mini Batch SGD with momentum
              5)Adagrad
              6)RMSPROP
              7)Adam Optimizer(Mostly used optimizer)

1) Gradient Descent: -
                                        The weight updating formula in backward propagation is Wnew = Wold − α × (∂L/∂Wold).
The major disadvantage of Gradient Descent is that it requires huge resources (e.g. huge RAM) to process each epoch, since the whole dataset is loaded at once.
Epoch: -
     One full cycle consisting of both forward propagation and backward propagation over the data is called an Epoch.
-->In gradient descent, if we have one million records, for each epoch we pass all one million records at once.
 
2) SGD (Stochastic Gradient Descent): -
   In Stochastic Gradient Descent we pass records one at a time, updating the weights after every single record.
-->Processing the 1st record through forward propagation, finding ŷ, finding the loss, and processing the record through backward propagation to update the weights — this entire process is called an Iteration.
-->When we pass the first record it is called Iteration 1, and so on.
-->The major disadvantage of SGD is that convergence is very slow, as each and every record is passed individually.
-->The time complexity is very high.
-->In Gradient Descent there is no zig-zag movement in the optimization path, but Gradient Descent requires huge resources.
-->In the case of SGD there is a large zig-zag movement in the optimization path.

3) Mini Batch SGD: -
                   In mini-batch SGD, instead of passing each and every record individually or passing the whole dataset at once, we set an optimal batch size and pass that many records at a time. For example, if our batch size = 1000, we pass 1000 records each time.
-->In Mini Batch SGD the zig-zag movement is still present, but it is smaller compared to SGD. The zig-zag movement in the optimization path is called Noise.

Instead of processing the entire dataset at once (as in Batch Gradient Descent) or updating weights after every single data point (as in Stochastic Gradient Descent), Mini-Batch Gradient Descent takes a middle approach:

  1. Divide the dataset into smaller batches.

    • Example: If you have 1,000 records and choose a batch size of 100, you will have 10 batches in total.
  2. Iteratively update the weights for each batch:

    • Take the first batch (100 records), compute the loss, and update the weights.
    • Move to the next batch, using the updated weights from the previous step, compute the loss, and update the weights again.
    • Repeat this process for all batches.
  3. Multiple weight updates per epoch:

    • If the entire dataset were processed at once, the weights would be updated only once per epoch.
    • With mini-batches, weights get updated multiple times per epoch (in this case, 10 times for 10 batches), leading to more frequent learning and better optimization.

Why Use Mini-Batch Gradient Descent?

✅ More stable updates compared to Stochastic Gradient Descent (SGD).
✅ Faster and more memory-efficient than Batch Gradient Descent.
✅ Helps generalization by introducing some randomness in learning.
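The batching scheme above can be sketched as a toy 1-D linear regression fit with mini-batch updates; the data, learning rate, and batch size are assumed:

```python
import numpy as np

# toy 1-D linear regression, fit with mini-batch gradient descent
rng = np.random.default_rng(0)
X = rng.normal(size=1000)
y = 3.0 * X + rng.normal(scale=0.1, size=1000)   # true weight is 3.0

w, alpha, batch_size = 0.0, 0.1, 100             # 1000 records / 100 = 10 batches
for epoch in range(20):
    order = rng.permutation(1000)                # shuffle each epoch
    for start in range(0, 1000, batch_size):
        b = order[start:start + batch_size]
        grad = np.mean(2.0 * (w * X[b] - y[b]) * X[b])   # dMSE/dw on this batch
        w -= alpha * grad                        # 10 weight updates per epoch
# w ends up close to the true value 3.0
```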


4) Mini Batch SGD with Momentum: -
                 In order to reduce the noise present in SGD and Mini Batch SGD we use SGD with momentum. Momentum helps us reduce the noise.
-->By using the concept of Exponentially Weighted Averages we smoothen the curve so that the noise reduces.

-->In this way we reduce the noise.

-->It has quicker convergence.


 5) AdaGrad: -

-->AdaGrad stands for Adaptive Gradient. In plain gradient descent the learning rate is fixed; AdaGrad makes it adaptive: the learning rate starts high and decreases as we move towards the Global Minimum point.

 6) RMSProp: -

-->RMSProp stands for Root Mean Squared Propagation; it is an extension of gradient descent and of the AdaGrad version of gradient descent.

7) Adam Optimizer: -

   -->In the Adam Optimizer we combine momentum with RMSProp.

-->It is the most widely used optimizer.

-->It solves the problem of smoothening (it reduces noise).

-->The learning rate becomes adaptive.
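A sketch of the Adam update on a toy 1-D loss, showing how the momentum term (an EMA of gradients) and the RMSProp term (an EMA of squared gradients) combine. The loss function is hypothetical and the hyperparameters are the commonly quoted defaults:

```python
import numpy as np

# toy 1-D loss L(w) = (w - 5)^2, so dL/dw = 2(w - 5); the minimum is at w = 5
def grad(w):
    return 2.0 * (w - 5.0)

w, m, v = 0.0, 0.0, 0.0
alpha, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
for t in range(1, 1001):
    g = grad(w)
    m = beta1 * m + (1 - beta1) * g        # momentum: EMA of gradients
    v = beta2 * v + (1 - beta2) * g ** 2   # RMSProp: EMA of squared gradients
    m_hat = m / (1 - beta1 ** t)           # bias correction for the EMAs
    v_hat = v / (1 - beta2 ** t)
    w -= alpha * m_hat / (np.sqrt(v_hat) + eps)   # adaptive step size
# w converges towards the minimum at 5.0
```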

Batch Normalization:

Batch Normalization is a technique in deep learning that helps speed up training and improve the stability of neural networks. It normalizes the activations of each layer over the current mini-batch, so later layers receive inputs with a stable distribution.

Benefits of Batch Normalization

✅ Faster training – allows higher learning rates.
✅ More stable training – prevents drastic changes in activations.
✅ Reduces overfitting – acts like a regularizer.
✅ Less sensitive to weight initialization.
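A minimal sketch of the normalization step itself (per feature, over the batch); the activation values below are hypothetical and on very different scales:

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    # normalize each feature over the mini-batch, then scale (gamma) and shift (beta)
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

# two features on very different scales
x = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])
out = batch_norm(x)   # each column now has ~zero mean and ~unit variance
```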

Sequential: -

     Taking the entire neural network at once as a block is called Sequential. This indicates we can do forward propagation and backward propagation through it.

Dense: -

                 Dense helps us create the hidden layers, input layer, and output layer (fully connected layers).

Activation: -

      Activation lets us apply different types of activation functions to a layer.


-->ANN, CNN, and RNN are black-box models.

-->We need to scale the data before applying an ANN or CNN.

-->For scaling data for an ANN, StandardScaler is mostly used.

-->One of the libraries used to implement ANNs is TensorFlow, which was developed by Google. Before TensorFlow 2.0 we needed to install TensorFlow and Keras separately, but from versions 2.0 onwards Keras is integrated into TensorFlow.



 Dropout in Neural Networks

Why is Dropout Needed?

  • Sometimes, a neural network overfits, meaning it performs well on training data but poorly on test data.
  • Overfitting happens when neurons memorize training data instead of learning general patterns.
  • To prevent this, we use Dropout, a regularization technique.

How Dropout Works

  • Dropout is a regularization layer that randomly deactivates a percentage of neurons during training.
  • If we set dropout = 0.3, then 30% of neurons are randomly turned off during training.
  • Key point:
    • In Batch 1, 30% of neurons are randomly deactivated.
    • In Batch 2, a different set of 30% of neurons are deactivated.
    • This continues for all batches during training.
  • Dropout does not remove neurons permanently; it reduces the dependency of one neuron on others, helping the model learn more generalized patterns.

Benefits of Dropout

✅ Reduces overfitting
✅ Makes the model more robust
✅ Ensures neurons don’t become overly reliant on each other

🔹 Works well in fully connected (dense) layers, but not always needed in convolutional layers (CNNs).
🔹 Common Dropout values: 0.2 - 0.5
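A minimal sketch of (inverted) dropout with a hypothetical rate of 0.3:

```python
import numpy as np

def dropout(x, rate=0.3, training=True, seed=0):
    # randomly zero out about `rate` of the activations, during training only
    if not training:
        return x                            # dropout is switched off at inference
    rng = np.random.default_rng(seed)
    mask = rng.random(x.shape) >= rate      # keep each neuron with prob 1 - rate
    return x * mask / (1.0 - rate)          # "inverted" dropout: rescale survivors

x = np.ones((4, 5))
train_out = dropout(x, rate=0.3)                  # mix of 0s and rescaled values
test_out = dropout(x, rate=0.3, training=False)   # unchanged at inference
```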

 Early Stopping

  • Stops training when validation loss stops improving.
  • Prevents the model from continuing to learn noise after reaching the optimal point.
  • Saves training time and prevents overfitting.

🔹 Steps:

  1. Monitor validation loss.
  2. If loss stops decreasing for a set number of epochs, stop training.
  3. Use the model from the epoch with the best validation performance.
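The patience logic in the steps above can be sketched as follows; the validation-loss history here is hypothetical:

```python
# sketch of early stopping; val_losses is a hypothetical per-epoch history
val_losses = [0.90, 0.70, 0.60, 0.55, 0.56, 0.57, 0.58]
patience = 2                       # epochs to wait without improvement

best_loss, best_epoch, wait = float("inf"), 0, 0
for epoch, loss in enumerate(val_losses):
    if loss < best_loss:
        best_loss, best_epoch, wait = loss, epoch, 0   # improvement: reset counter
    else:
        wait += 1
        if wait >= patience:       # no improvement for `patience` epochs: stop
            break
# training stops at epoch 5; the best model is the one saved at epoch 3
```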

Data Augmentation

  • Increases the diversity of the training data artificially.
  • Commonly used in image processing (rotating, flipping, adding noise).
  • Helps prevent overfitting by ensuring the model learns more general patterns.

🔹 Examples in Image Classification: ✔ Random cropping, flipping, rotation
✔ Adding noise or blurring
✔ Changing brightness or contrast

📌 Works best for computer vision tasks.


 Label Smoothing

  • Instead of using hard labels (e.g., 1 for cat, 0 for dog), soften the labels (e.g., 0.9 for cat, 0.1 for dog).
  • Prevents the model from becoming too confident, improving generalization.
  • Used in classification tasks like NLP and computer vision.
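A one-function sketch of label smoothing with ε = 0.1, matching the cat/dog example above:

```python
import numpy as np

def smooth_labels(y_onehot, eps=0.1):
    # move eps of the probability mass from the true class evenly across all K classes
    k = y_onehot.shape[-1]
    return y_onehot * (1.0 - eps) + eps / k

hard = np.array([[1.0, 0.0]])       # hard label: definitely "cat"
soft = smooth_labels(hard)          # softened to [[0.95, 0.05]]
```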

Gradient Clipping

  • Limits the size of gradients to prevent exploding gradients, especially in RNNs.
  • Helps in stabilizing training when using deep networks.

📌 Common in recurrent neural networks (RNNs) and deep transformers like GPT.
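A minimal sketch of clipping by L2 norm; the gradient values are hypothetical:

```python
import numpy as np

def clip_by_norm(grad, max_norm=1.0):
    # rescale the gradient vector if its L2 norm exceeds max_norm
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad

g = np.array([30.0, 40.0])          # exploding gradient, norm = 50
g = clip_by_norm(g, max_norm=5.0)   # direction kept, norm reduced to 5
```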

Weight Initialization in Deep Learning

Weight initialization is the process of assigning initial values to the weights of a neural network before training. Proper initialization is crucial to ensure stable gradient updates and faster convergence.


Why is Weight Initialization Important?

  1. Avoids Exploding/Vanishing Gradients: Poor initialization can cause gradients to become too large or too small, making training unstable.
  2. Speeds Up Convergence: Proper initialization helps the model learn faster by providing meaningful starting points.
  3. Prevents Dead Neurons: Ensures neurons remain active and contribute to learning.

Types of Weight Initialization

1️⃣ Zero Initialization:

  • All weights are set to zero.
  • Problem: Neurons will learn the same features, leading to a lack of diversity.
  • Not recommended.
  • Causes symmetry issue, no learning

2️⃣ Random Initialization:

  • Weights are set randomly.
  • If values are too high, gradients may explode. If too low, gradients may vanish.
  • Not ideal for deep networks.
  • Can cause vanishing/exploding gradients

3️⃣ Xavier (Glorot) Initialization:

  • Designed for sigmoid & tanh activations.
  • Ensures balanced variance of activations and gradients.
  • Good for shallow networks.
  • Maintains variance across layers

4️⃣ He Initialization (Kaiming Initialization):

  • Designed for ReLU activations.
  • Prevents neurons from becoming "dead."
  • Best for deep networks with ReLU.

5️⃣ Lecun Initialization:

  • Designed for SELU activations.
  • Works best for self-normalizing networks.

How to Apply in Keras?

from keras.layers import Dense
from keras.initializers import HeNormal, GlorotUniform

# Using He Initialization for ReLU activation
layer1 = Dense(128, activation="relu", kernel_initializer=HeNormal())

# Using Xavier Initialization for Tanh activation
layer2 = Dense(64, activation="tanh", kernel_initializer=GlorotUniform())


CONVOLUTIONAL NEURAL NETWORK (CNN): -

  Whenever we work with images and video frames, e.g. image classification or object detection, we prefer to use a CNN.

-->A CNN works similarly to the visual cortex, which is present in the cerebral cortex of the human brain, i.e. a CNN passes the input through different layers to process the result.

-->The first step we do in a CNN is bring the pixel values between 0 and 1 by dividing each pixel value by 255; this step is called MIN-MAX SCALING.

Before we go to the working of a CNN, let's see the basics of how an image is represented: images are classified into 2 types, 1) black & white (grayscale) images and 2) RGB images, where RGB means RED, GREEN, BLUE.


-->    The filter is applied on the image matrix (convolution):
each value of the image patch is multiplied with the corresponding value in the filter/kernel, the sum is calculated, and that sum is assigned to the output matrix; this process continues as the filter slides over the image.

-->Feature scaling is applied on the matrix obtained after applying the filter: the lowest value is mapped to 0 and the highest value to 255.

-->255 refers to WHITE and 0 refers to BLACK.


Whenever we pass a 6X6 image through a 3X3 filter we get a 4X4 matrix as output; this can be calculated as below.

For example, assume the input image is 6X6, so n=6. The filter is a 3X3 matrix, so f=3. The formula to calculate the output size is n−f+1 = 6−3+1 = 4, i.e. the output is a 4X4 matrix.

-->But the major problem here is that the input is 6X6 while the output is 4X4: the image size is decreasing, which means we are losing some information. In order to prevent this loss we use PADDING.

-->PADDING means building a border around the image. The concept of applying an extra layer of pixels around the image is called PADDING.

-->After building the border around the image, the 6X6 image becomes 8X8.


-->PADDING can be done in 2 ways: the first is filling the newly created cells with ZERO (0), and the second is filling the cells with the value nearest to that cell.
-->[Example image: nearest-value filling]

-->[Example image: zero filling]

-->Let's calculate the size of the output image after padding. The formula for the output size is

((n + 2p − f) / s) + 1

where p = number of layers of padding applied (in the above use case p = 1)

s = number of strides (in the above use case s = 1)

((6 + 2·1 − 3) / 1) + 1 = 5 + 1 = 6
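The output-size formula can be wrapped in a small helper to check both cases above:

```python
def conv_output_size(n, f, p=0, s=1):
    # output side length = floor((n + 2p - f) / s) + 1
    return (n + 2 * p - f) // s + 1

no_pad = conv_output_size(6, 3)         # 6x6 image, 3x3 filter -> 4x4 output
padded = conv_output_size(6, 3, p=1)    # one layer of padding keeps it 6x6
strided = conv_output_size(8, 3, s=2)   # stride 2 halves the spatial size
```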

-->Just as we update the weights during backpropagation in an ANN, in a CNN we update the filter values during backpropagation.

-->We apply the ReLU activation function on each value of the convolution output.

-->All of the above steps come under the convolution operation. After completing the convolution operation we do pooling (commonly max pooling).

Pooling: -

 Pooling is the down-sampling of an image. Pooling helps us solve the problem of location invariance. We have three types of pooling:

1) Avg Pooling

2) Min Pooling

3) Max Pooling

-->In avg pooling the average of the pooling window is taken, in min pooling the minimum of the window is taken, and in max pooling the maximum of the window is taken.
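A minimal NumPy sketch of 2x2 pooling with the operations above, on a hypothetical 4x4 feature map:

```python
import numpy as np

def pool2x2(x, op):
    # down-sample a (2h, 2w) array to (h, w) using the given pooling operation
    h, w = x.shape[0] // 2, x.shape[1] // 2
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            window = x[2 * i:2 * i + 2, 2 * j:2 * j + 2]
            out[i, j] = op(window)
    return out

x = np.array([[1,  2,  5,  6],
              [3,  4,  7,  8],
              [9, 10, 13, 14],
              [11, 12, 15, 16]])
max_pooled = pool2x2(x, np.max)    # [[4, 8], [12, 16]]
avg_pooled = pool2x2(x, np.mean)   # [[2.5, 6.5], [10.5, 14.5]]
min_pooled = pool2x2(x, np.min)    # [[1, 5], [9, 13]]
```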

Flattening Layer: -

This is the next step after pooling. Flattening converts all the resultant 2-D arrays from the pooled feature maps into a single long linear vector. This flattened vector is passed through a fully connected neural network, and finally we get the results.






Recurrent Neural Network (RNN): -
                  Let's understand the Recurrent Neural Network with an example. At time t−1, we pass a record to the RNN model, processing happens, and we get an output. For the next input, at time t, the output we got from the previous step is sent into the network along with the new input. Because of this process the sequence information is maintained, and the process continues at t+1, t+2, and so on.
-->This entire process occurs in forward propagation.
-->In backpropagation, we update the weights to reach the global minimum.

-->In backward propagation, the weights are updated continuously with respect to time by applying the chain rule through the time steps (backpropagation through time).
-->If we apply the sigmoid activation function, the vanishing gradient problem occurs: the derivative of sigmoid lies between 0 and 0.25, and as backpropagation proceeds through time the product of these derivatives becomes very small, so the new weight is almost the same as the old weight.
-->Because of that there is only a small movement and it won't reach the global minimum point.
-->If we use other activation functions like ReLU, the gradients are not squashed, and the repeated multiplications through time can make them grow very large, creating the exploding gradient problem: the weight changes become so big that we won't reach the global minimum.
-->To overcome these problems we use LSTM.

-->Types of RNN
      1) One to One RNN: one input and one output.
      2) One to Many RNN: one input and many outputs.
      3) Many to One RNN: many inputs and one output.
      4) Many to Many RNN: many inputs and many outputs.
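The recurrence described above can be sketched as a single RNN step reused over time; the weights here are random placeholders:

```python
import numpy as np

def rnn_step(x_t, h_prev, Wx, Wh, b):
    # the hidden state h carries information from all earlier time steps
    return np.tanh(Wx @ x_t + Wh @ h_prev + b)

rng = np.random.default_rng(0)
Wx = rng.normal(scale=0.1, size=(3, 2))   # input -> hidden
Wh = rng.normal(scale=0.1, size=(3, 3))   # hidden -> hidden (the recurrence)
b = np.zeros(3)

h = np.zeros(3)                            # initial hidden state
sequence = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
for x_t in sequence:
    h = rnn_step(x_t, h, Wx, Wh, b)        # the same weights are reused every step
```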


LSTM RNN Networks: -

     In an RNN the problem of vanishing gradients (dead neurons) occurs, and if we have a very deep network where some outputs depend on the very first word, that long-range context cannot be captured easily with a plain RNN. So we prefer the LSTM RNN.
    In order to control the flow of information, the LSTM (Long Short-Term Memory) network mainly contains 3 gates:
                   1) Forget Gate
                   2) Input Gate
                   3) Output Gate


1) Memory Cell: -
         The memory cell is used for remembering and forgetting information based on the context of the input.
2) Forget Gate: -
        The forget gate decides what information we should forget from the previous cell state when updating the current cell state of the LSTM unit.
3) Input Gate: -
        The input gate decides what new information from the current input we should add to the current cell state of the LSTM unit, i.e. we are adding information to the memory cell.
4) Output Gate: -
      The information from the memory cell is passed through a tanh function (which maps it into −1 to +1) and combined, by a point-wise operation, with a sigmoid output (between 0 and 1). This process extracts the meaningful information, and the result is passed to the next cell.
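A compact NumPy sketch of one LSTM step with the gates described above; the weights are random placeholders, and the four gate matrices are stacked into one `W` for brevity (real implementations often keep them separate):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    # W maps [h_prev, x_t] to the four stacked gate pre-activations
    z = W @ np.concatenate([h_prev, x_t]) + b
    n = h_prev.size
    f = sigmoid(z[0:n])          # forget gate: what to drop from the cell state
    i = sigmoid(z[n:2 * n])      # input gate: what new information to add
    g = np.tanh(z[2 * n:3 * n])  # candidate cell values, in (-1, 1)
    o = sigmoid(z[3 * n:4 * n])  # output gate: what part of the cell to expose
    c = f * c_prev + i * g       # updated memory cell
    h = o * np.tanh(c)           # new hidden state, passed to the next cell
    return h, c

rng = np.random.default_rng(0)
hidden, inputs = 3, 2
W = rng.normal(scale=0.1, size=(4 * hidden, hidden + inputs))
b = np.zeros(4 * hidden)
h, c = np.zeros(hidden), np.zeros(hidden)
h, c = lstm_step(np.array([1.0, -1.0]), h, c, W, b)
```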
-->For hyperparameter tuning, use the Keras Tuner grid search method.
BIDIRECTIONAL RNN: -
                    In an RNN all the information passes in a unidirectional way; we cannot get information from future words. For this case we have the concept of the Bidirectional RNN.
-->A Bidirectional RNN is slower than a plain RNN.

Similarly we have the Bidirectional LSTM.

Bidirectional LSTM  🚀

What is it?

A Bidirectional LSTM (BiLSTM) is an advanced LSTM that processes data in both forward and backward directions to capture more context from sequences.

How it works?

🔹 Standard LSTM reads input from past to future (left to right).
🔹 BiLSTM has two LSTMs:

  • Forward LSTM (left → right)

  • Backward LSTM (right → left)
    🔹 Both outputs are combined to improve understanding.

Why use it?

✅ Captures past & future context
✅ Improves performance in NLP tasks (e.g., translation, speech recognition)


 What is Sequence-to-Sequence (Seq2Seq)?

Seq2Seq models are designed for tasks where both the input and output are sequences of variable lengths. They are widely used in:

  • Machine Translation (e.g., English → French)

  • Text Summarization (e.g., Long text → Summary)

  • Speech-to-Text (e.g., Audio → Transcription)

A Seq2Seq model consists of two primary components: Encoder and Decoder, both usually implemented using Recurrent Neural Networks (RNNs) like LSTM (Long Short-Term Memory) or GRU (Gated Recurrent Unit).


🔥 Final Thoughts

The Attention Mechanism revolutionized Seq2Seq models, allowing them to handle long sequences efficiently. This led to the development of Transformer-based architectures like T5, BART, GPT, which rely entirely on attention (instead of RNNs).
