Deep Learning


Content:-

### Introduction to Deep Learning

1. **What is Deep Learning?**

   - Definition and importance

   - Historical context and evolution

   - Differences between AI, Machine Learning, and Deep Learning


2. **Applications of Deep Learning**

   - Computer vision (image classification, object detection, etc.)

   - Natural language processing (text generation, sentiment analysis, etc.)

   - Speech recognition

   - Healthcare (medical imaging, drug discovery, etc.)

   - Autonomous vehicles

   - Other real-world applications


### Fundamentals of Neural Networks

3. **Basic Concepts**

   - Neurons and perceptrons

   - Activation functions (sigmoid, ReLU, tanh, etc.)

   - Loss functions (MSE, cross-entropy, etc.)


4. **Training Neural Networks**

   - Forward and backward propagation

   - Gradient descent and optimization algorithms (SGD, Adam, RMSprop, etc.)

   - Overfitting and underfitting

   - Regularization techniques (L2 regularization, dropout, etc.)


5. **Building Blocks of Neural Networks**

   - Layers: dense, convolutional, recurrent, etc.

   - Batch normalization

   - Initialization methods


### Deep Learning Architectures

6. **Feedforward Neural Networks**

   - Introduction to multilayer perceptrons (MLPs)

   - Practical considerations and common issues


7. **Convolutional Neural Networks (CNNs)**

   - Convolutional layers and pooling layers

   - Popular CNN architectures (LeNet, AlexNet, VGG, ResNet, etc.)

   - Applications in image processing


8. **Recurrent Neural Networks (RNNs)**

   - Basics of RNNs and LSTMs/GRUs

   - Applications in time series and sequence modeling

   - Introduction to Transformer models


9. **Autoencoders and Generative Models**

   - Autoencoders and variational autoencoders (VAEs)

   - Generative adversarial networks (GANs)

   - Applications in data generation and image synthesis


### Practical Deep Learning

10. **Data Preparation and Preprocessing**

    - Data collection and labeling

    - Data augmentation techniques

    - Feature scaling and normalization


11. **Model Evaluation and Validation**

    - Train/test split and cross-validation

    - Evaluation metrics (accuracy, precision, recall, F1 score, etc.)

    - Model selection and hyperparameter tuning


12. **Deep Learning Frameworks**

    - Introduction to popular frameworks (TensorFlow, PyTorch, Keras, etc.)

    - Basic usage and examples

    - Building and training models


### Advanced Topics and Emerging Trends

13. **Transfer Learning**

    - Pretrained models and fine-tuning

    - Transfer learning strategies


14. **Reinforcement Learning**

    - Basics of reinforcement learning

    - Deep Q-learning and policy gradients


15. **Explainable AI and Interpretability**

    - Importance of model interpretability

    - Techniques for interpreting models


16. **Ethical Considerations in Deep Learning**

    - Bias and fairness

    - Privacy and security concerns

    - Ethical use of AI


### Practical Projects and Case Studies

17. **End-to-End Projects**

    - Project 1: Image Classification

    - Project 2: Sentiment Analysis

    - Project 3: Speech Recognition


18. **Case Studies**

    - Analysis of successful deep learning applications

    - Lessons learned and best practices


### Conclusion and Future Directions

19. **Future Trends in Deep Learning**

    - Current research and breakthroughs

    - Speculative future applications


20. **Resources for Further Learning**

    - Books, courses, and online resources

    - Research papers and journals

     Introduction to Deep Learning

1.1 What is Deep Learning?

1.1.1 Definition and Importance

**Deep Learning** is a subset of machine learning where artificial neural networks, algorithms inspired by the human brain, learn from large amounts of data. It leverages multiple layers (hence "deep") of interconnected neurons to model complex patterns in data. This method is particularly effective for tasks involving unstructured data such as images, audio, and text.

**Importance of Deep Learning**:

- **Performance**: Deep learning models often outperform traditional machine learning models, especially in tasks such as image recognition, natural language processing, and speech recognition.
- **Automation**: Reduces the need for manual feature extraction by automatically learning representations from raw data.
- **Scalability**: Can handle vast amounts of data and leverage powerful computational resources to improve performance.
- **Innovation**: Drives advancements in numerous fields, from healthcare and autonomous driving to entertainment and finance.


1.1.2 Historical Context and Evolution

The concept of neural networks dates back to the 1940s, but several key milestones have shaped the evolution of deep learning:
1. **1943**: Warren McCulloch and Walter Pitts propose a model of artificial neurons.
2. **1950s**: The **Perceptron** algorithm is developed by Frank Rosenblatt, representing one of the earliest forms of neural networks.
3. **1980s**: Backpropagation is popularized by David Rumelhart, Geoffrey Hinton, and Ronald Williams (1986), enabling efficient training of multi-layer networks.
4. **1990s**: Neural networks fall out of favor due to limited computational power and data.
5. **2006**: Geoffrey Hinton and his colleagues reintroduce deep learning through unsupervised pre-training of deep networks, sparking renewed interest.
6. **2012**: The breakthrough of AlexNet in the ImageNet competition, led by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, demonstrating the potential of deep convolutional networks.
7. **2014-Present**: Rapid advancements in hardware (GPUs and TPUs), algorithms, and availability of large datasets propel deep learning to new heights, with applications across various domains.


1.1.3 Differences Between AI, Machine Learning, and Deep Learning
To understand deep learning, it's essential to distinguish it from broader concepts like artificial intelligence (AI) and machine learning (ML).

- **Artificial Intelligence (AI)**: A broad field encompassing any technique that enables computers to mimic human intelligence. This includes rule-based systems, statistical methods, and various forms of learning.
  Example: A chess-playing program using handcrafted rules and strategies.

- **Machine Learning (ML)**: A subset of AI that focuses on the development of algorithms that allow computers to learn from and make predictions based on data. It emphasizes learning patterns and making decisions with minimal human intervention.
  Example: A spam email filter that improves its accuracy as it processes more emails.

- **Deep Learning (DL)**: A specialized subset of ML that uses neural networks with many layers (deep networks) to model complex patterns in large datasets. It eliminates the need for manual feature extraction by learning hierarchical representations.
  Example: A convolutional neural network (CNN) that identifies objects in images with high accuracy.


In summary:
- **AI** is the overarching concept of machines exhibiting human-like intelligence.
- **ML** is a methodology within AI that involves learning from data.
- **DL** is a sophisticated form of ML that uses multi-layered neural networks to automatically discover patterns in data.



Data Science (DS):

Data science is the study of data. A data scientist develops methods for recording, storing, and analyzing data in order to extract useful information. The final goal of data science is to gain insights and knowledge from any type of data.

1.1.4 Why Is Deep Learning Becoming More Popular?

Deep learning is gaining popularity mainly because of its accuracy when it is trained on huge amounts of data.

As the amount of data increases, the performance of traditional machine learning models tends to level off, whereas deep learning models keep improving. For very large datasets, deep learning is therefore usually preferred.

Deep learning adoption accelerated from around 2015. One reason is that more people started using social media platforms, which began generating huge amounts of data. Companies used this data to build AI models that improve the user experience.

-->Advances in hardware are another reason for the increased use of deep learning.


2. Applications of Deep Learning


 2.1: Computer Vision

**Computer vision** is one of the most prominent fields where deep learning has made a significant impact. It involves enabling computers to interpret and make decisions based on visual data from the world.

- **Image Classification**: Identifying objects within an image and classifying them into predefined categories. Example: Recognizing animals in photos.
  
- **Object Detection**: Locating and classifying multiple objects within an image. Example: Detecting pedestrians and vehicles in autonomous driving.

- **Image Segmentation**: Dividing an image into segments or regions of interest. Example: Segmenting different organs in medical imaging.

- **Face Recognition**: Identifying or verifying individuals from their facial features. Example: Unlocking smartphones using facial recognition.

2.2: Natural Language Processing (NLP)

**Natural language processing (NLP)** focuses on the interaction between computers and human language, enabling machines to understand, interpret, and generate human language.

- **Text Generation**: Creating coherent and contextually relevant text. Example: Generating news articles or product descriptions.

- **Sentiment Analysis**: Determining the sentiment or emotional tone of a piece of text. Example: Analyzing customer reviews to gauge satisfaction.

- **Language Translation**: Converting text from one language to another. Example: Translating documents from English to Spanish.

- **Chatbots and Virtual Assistants**: Facilitating human-computer interactions through natural language. Example: Customer support chatbots.

2.3: Speech Recognition

**Speech recognition** involves converting spoken language into text, allowing for various applications where voice commands or dictation are required.

- **Voice Assistants**: Powering devices like Amazon Alexa, Google Assistant, and Apple Siri, enabling them to respond to voice commands.

- **Transcription Services**: Automatically transcribing spoken content into written text. Example: Transcribing meetings or interviews.

- **Voice-Activated Controls**: Enabling hands-free control of devices and applications. Example: Voice-controlled smart home systems.

2.4: Healthcare

Deep learning has revolutionized various aspects of healthcare by improving diagnostic accuracy, treatment planning, and patient outcomes.

- **Medical Imaging**: Enhancing the analysis of medical images (X-rays, MRIs, CT scans) for disease detection. Example: Identifying tumors in radiology images.

- **Drug Discovery**: Accelerating the discovery of new drugs by predicting molecular interactions. Example: Finding potential compounds for treating diseases.

- **Personalized Medicine**: Tailoring treatment plans based on individual patient data. Example: Predicting patient responses to specific treatments.

2.5: Autonomous Vehicles

**Autonomous vehicles** leverage deep learning for navigating and making decisions in complex environments without human intervention.

- **Object Detection and Recognition**: Identifying other vehicles, pedestrians, road signs, and obstacles on the road.

- **Path Planning**: Determining the optimal route and making real-time adjustments based on traffic and road conditions.

- **Driver Assistance Systems**: Enhancing safety through features like adaptive cruise control, lane-keeping assistance, and automated parking.

2.6: Other Real-World Applications

Deep learning has found applications in numerous other fields, demonstrating its versatility and wide-ranging impact.

- **Finance**: Fraud detection, algorithmic trading, credit scoring, and risk management.
  
- **Entertainment**: Content recommendation systems for streaming services like Netflix and Spotify.

- **Agriculture**: Crop monitoring, yield prediction, and pest detection using satellite imagery and sensor data.

- **Manufacturing**: Predictive maintenance, quality control, and automation of industrial processes.

- **Retail**: Customer behavior analysis, inventory management, and personalized marketing.

- **Energy**: Predicting energy consumption, optimizing grid operations, and identifying patterns in energy usage.

In conclusion, deep learning's ability to process and analyze vast amounts of data has led to significant advancements across various industries, transforming how tasks are performed and leading to new innovations.

Fundamentals of Neural Networks

3: Basic Concepts

3.1: Neurons and Perceptrons


**Neurons** are the fundamental units of neural networks, inspired by biological neurons in the human brain. A neuron receives input, processes it, and produces an output. In the context of artificial neural networks, a neuron typically performs a weighted sum of its inputs, applies an activation function, and outputs the result.

- **Mathematical Representation**:
  1. Specific Model:

    • A perceptron is a type of artificial neuron, the simplest form of a neural network, and an early model for binary classifiers.
  2. Functionality:

    • A single-layer perceptron takes a set of binary inputs, applies weights, sums them, and passes the result through a step function (activation function) to produce a binary output.
    • It is essentially a linear classifier.
  3. Structure:

    • The perceptron performs the following operations:
      \[
      z = \sum_{i=1}^{n} w_i x_i + b
      \]
      \[
      \text{output} = \begin{cases} 1 & \text{if } z \geq 0 \\ 0 & \text{if } z < 0 \end{cases}
      \]
    • Here, the step function (or Heaviside step function) is the activation function.
  4. Limitations:

    • A single-layer perceptron can only solve linearly separable problems, which limits its capability to model complex patterns.
    • Multilayer perceptrons (MLPs) with multiple layers of neurons and non-linear activation functions can overcome this limitation.
  5. Historical Significance:

    • The perceptron was introduced by Frank Rosenblatt in the 1950s and laid the foundation for more complex neural network architectures.
    • Despite its limitations, it has historical significance as one of the first models of artificial neural networks.
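
The perceptron described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the weights and bias are hypothetical values picked by hand so that the perceptron computes logical AND, which is a linearly separable problem.

```python
def perceptron_output(inputs, weights, bias):
    """Weighted sum of the inputs plus bias, passed through a step function."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if z >= 0 else 0


# Hypothetical hand-picked parameters (an assumption for this example):
# the weighted sum only reaches 0 or more when both inputs are 1.
and_weights = [1.0, 1.0]
and_bias = -1.5

and_table = {
    (a, b): perceptron_output([a, b], and_weights, and_bias)
    for a in (0, 1)
    for b in (0, 1)
}
print(and_table)
```

No single choice of weights and bias reproduces XOR in the same way, which is the linear separability limitation mentioned above.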

Summary

  • Neuron: General term for the basic unit in a neural network that processes inputs using weights and an activation function, capable of handling complex patterns when arranged in layers.
  • Perceptron: A specific type of neuron that acts as a linear binary classifier, with historical significance but limited in its ability to solve only linearly separable problems.
##### Activation Functions

Activation functions introduce non-linearity into the neural network, allowing it to model complex relationships in data. Common activation functions include:

- **Sigmoid**:
  \[
  \sigma(x) = \frac{1}{1 + e^{-x}}
  \]
  - Outputs values in the range (0, 1)
  - Often used in binary classification tasks

- **Tanh** (Hyperbolic Tangent):
  \[
  \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}
  \]
  - Outputs values in the range (-1, 1)
  - Zero-centered, making it often preferred over the sigmoid function

- **ReLU** (Rectified Linear Unit):
  \[
  \text{ReLU}(x) = \max(0, x)
  \]
  - Outputs zero for negative inputs and the input value for positive inputs
  - Helps mitigate the vanishing gradient problem, making training faster and more effective

- **Leaky ReLU**:
  \[
  \text{Leaky ReLU}(x) = \begin{cases} 
  x & \text{if } x \geq 0 \\
  \alpha x & \text{if } x < 0 
  \end{cases}
  \]
  - Similar to ReLU, but allows a small, non-zero gradient when the unit is not active
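
The four activation functions above translate directly into code. The sketch below uses only the Python standard library and works on single numbers. The default `alpha=0.01` for Leaky ReLU is an assumed, commonly used value and is not prescribed by these notes.

```python
import math


def sigmoid(x):
    """Squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    """Zero-centered squashing function with outputs in (-1, 1)."""
    return math.tanh(x)


def relu(x):
    """Returns 0 for negative inputs and the input itself otherwise."""
    return max(0.0, x)


def leaky_relu(x, alpha=0.01):
    """Like ReLU, but keeps a small slope `alpha` for negative inputs."""
    return x if x >= 0 else alpha * x


for value in (-2.0, 0.0, 2.0):
    print(value, sigmoid(value), tanh(value), relu(value), leaky_relu(value))
```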

##### Loss Functions

Loss functions measure the difference between the predicted output and the actual target value. They guide the optimization process to minimize prediction errors.

- **Mean Squared Error (MSE)**:
  \[
  \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
  \]
  - Used for regression tasks
  - Measures the average squared difference between predicted and actual values

- **Cross-Entropy Loss**:
  - Binary Cross-Entropy:
    \[
    \text{Binary Cross-Entropy} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]
    \]
    - Used for binary classification tasks
    - Measures the difference between the true label distribution and the predicted probability distribution
  
  - Categorical Cross-Entropy:
    \[
    \text{Categorical Cross-Entropy} = -\sum_{i} y_i \log(\hat{y}_i)
    \]
    - Used for multi-class classification tasks
    - Compares the one-hot encoded true labels with the predicted probabilities for each class
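
As a rough illustration, the three loss formulas above can be written in plain Python. The small constant `eps` is an assumption added here to avoid taking `log(0)`. It is not part of the formulas themselves.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error over paired true and predicted values."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Average binary cross-entropy; y_pred holds probabilities of class 1."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep p away from exactly 0 or 1
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Cross-entropy for one sample; y_true is a one-hot encoded label."""
    return -sum(y * math.log(max(p, eps)) for y, p in zip(y_true, y_pred))


print(mse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
print(binary_cross_entropy([1, 0], [0.9, 0.2]))
print(categorical_cross_entropy([0, 1, 0], [0.2, 0.7, 0.1]))
```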

Understanding these fundamental concepts—neurons and perceptrons, activation functions, and loss functions—is crucial for grasping how neural networks learn and make predictions. These elements form the building blocks of more complex deep learning architectures.




Forward Propagation:

In forward propagation, the input data is fed into the neural network. The data passes through the hidden layers, each of which applies its weights and a suitable activation function, until the output layer produces the result.

-->After obtaining the output, we calculate the loss function. If the loss is large, the prediction is far from the result we expected.

-->To reduce the loss, we need to update the weights. The weights are updated through backward propagation.

Backward Propagation:

Backward propagation is the process of moving from right to left through the network, i.e. backward from the output layer to the input layer.

-->Backward propagation is the standard method of adjusting the weights to minimize the loss function.

-->The weight updating formula used in backward propagation is

\[
W_{new} = W_{old} - \alpha \frac{\partial L}{\partial W_{old}}
\]

                Where Wnew = New Weight
                            Wold  = Old Weight
                                  α  = Learning Rate
         The gradient ∂L/∂Wold is the derivative of the loss with respect to the old weight.
-->Applying the weight updating formula repeatedly moves the weights toward a minimum of the loss, ideally the global minimum point.
-->The derivative of the loss with respect to the old weight is also called the slope.
-->When the slope is negative, the new weight is greater than the old weight. When the slope is positive, the new weight is less than the old weight. This is the idea behind the gradient descent curve and how the global minimum point is approached.
-->The learning rate should be small. Commonly used learning rates are 0.001 or 0.01.

-->The bias is updated in the same way:

\[
b_{new} = b_{old} - \alpha \frac{\partial L}{\partial b_{old}}
\]
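
The update rule (new weight = old weight minus learning rate times gradient) can be tried on a toy problem. The sketch below assumes a hypothetical one-weight loss L(w) = (w - 3)^2 with its minimum at w = 3. The learning rate of 0.1 is larger than the values recommended above and was chosen only so that this small example converges in a few steps.

```python
def loss(w):
    """Hypothetical loss with a single minimum at w = 3."""
    return (w - 3.0) ** 2


def gradient(w):
    """Derivative of the loss with respect to the weight (the slope)."""
    return 2.0 * (w - 3.0)


def update(w_old, learning_rate):
    """One step of the rule: w_new = w_old - learning_rate * gradient."""
    return w_old - learning_rate * gradient(w_old)


w = 0.0
history = [w]
for _ in range(50):
    w = update(w, learning_rate=0.1)
    history.append(w)

print(history[:3], w, loss(w))
```

Starting at w = 0 the slope is negative, so the first update increases the weight. Starting to the right of the minimum, the slope is positive and the update decreases it.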

Vanishing Gradient Problem:

The vanishing gradient problem occurs when the gradient becomes so small that there is only a tiny change between the new weight and the old weight, i.e. the new weight is almost equal to the old weight and the network effectively stops learning.
-->Some activation functions, such as the sigmoid function, squash a large input space into a small output range between 0 and 1.
-->A large change in the input of the sigmoid function therefore results in only a small change in the output, so the derivative becomes small.

[Figure: the sigmoid function and its derivative]

-->For inputs far from zero, the derivative of the sigmoid function becomes close to zero.
-->To avoid the vanishing gradient problem, we use different activation functions.
-->The activation functions we have are
                        1)Sigmoid
                        2)Tanh
                        3)ReLU (Rectified Linear Unit)
                        4)Leaky ReLU
                        5)Parameterised ReLU
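
The shrinking derivative can be checked numerically. The sketch below computes the sigmoid derivative and then, as a simplifying assumption, treats the gradient reaching an early layer as a product of one sigmoid derivative per layer, ignoring the weights. Even in the best case (a derivative of 0.25 at every layer) the product shrinks quickly as the depth grows.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_derivative(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


# The derivative peaks at x = 0 and falls toward 0 for large |x|.
max_derivative = sigmoid_derivative(0.0)


def gradient_scale(depth):
    """Best-case product of sigmoid derivatives across `depth` layers."""
    return max_derivative ** depth


for depth in (1, 5, 10, 20):
    print(depth, gradient_scale(depth))
```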

1. Sigmoid function

[Figure: the sigmoid function formula and curve]

The sigmoid function was the most widely used activation function in the early days of deep learning.
-->The sigmoid activation function is a smooth function that is easy to differentiate.
-->When a function's output is centered on 0, weight updates take less time. The sigmoid output is not centered on 0, so weight updates take more time.
-->The derivative of the sigmoid function ranges from 0 to 0.25.

Advantages of Sigmoid Function:

  1. The sigmoid function has a smooth gradient, preventing "jumps" in output values.
  2. Output values range between 0 and 1.
  3. The sigmoid activation function gives clear predictions, i.e. outputs very close to either 1 or 0.
Disadvantages:
  • The sigmoid activation function is prone to vanishing gradients.
  • The output of the sigmoid activation function is not zero-centered. Zero-centered means the outputs are distributed around 0, whereas the sigmoid outputs are always positive.

  • It is computationally expensive because it involves an exponential.

2. Tanh function

[Figure: the tanh function formula and curve]

-->The tanh function is also called the hyperbolic tangent function.
-->The output of the tanh activation function ranges from -1 to +1, and its derivative ranges from 0 to 1.
-->Even with the tanh activation function, the vanishing gradient problem can still occur.

3. ReLU function

[Figure: the ReLU function formula and curve]

--> ReLU(x) = max(0, x)

-->Whenever x is negative, ReLU(x) = max(0, x) = 0.

-->ReLU stands for Rectified Linear Unit.
-->The derivative of the ReLU function is either 0 or 1.
-->In backward propagation, if the derivative of ReLU is 0, the weights feeding that neuron are not updated, i.e. Wnew is equal to Wold. A neuron stuck in this state is called a dead neuron.
-->The ReLU function is very fast to compute.
-->A disadvantage of ReLU is that its output is not zero-centered.

4. Leaky ReLU function

To solve the dead neuron problem in ReLU, use the Leaky ReLU function.

In Leaky ReLU, the line for negative x values has a small slope instead of being flat. For negative inputs the output is therefore a small negative value rather than 0, and the derivative is never exactly 0.


5. ELU (Exponential Linear Units) function

  • ELU does not suffer from the dead ReLU problem.
  • The mean of its output is close to 0, i.e. it is approximately zero-centered.
  • The main drawback of ELU is that it takes more computation time.

How to choose which activation function to use:

-->Avoid the sigmoid and tanh activation functions in hidden layers, because they are prone to the vanishing gradient problem.

-->For a binary classification problem, prefer the ReLU activation function in the hidden layers and the sigmoid activation function in the output layer. Binary cross-entropy is the loss function applied here.
-->If training does not converge with ReLU, use PReLU (Parametric ReLU) or ELU in the hidden layers. The output layer for binary classification still uses the sigmoid activation function.
-->For a multiclass classification problem, use ReLU, PReLU, or ELU in the hidden layers and the softmax activation function in the output layer. Categorical cross-entropy is the loss function applied here.
-->For a regression problem, use ReLU or one of its variants in the hidden layers and a linear activation function in the output layer. The loss functions applied here are MSE, MAE, or Huber loss.
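
The guidelines above name softmax for the multiclass output layer without defining it. As a minimal sketch, softmax turns a list of raw output scores into probabilities that sum to 1. Subtracting the maximum score before exponentiating is a common numerical-stability step added here as an assumption. It does not change the result.

```python
import math


def softmax(logits):
    """Converts raw scores into probabilities that sum to 1."""
    largest = max(logits)
    # Shifting by the largest score avoids overflow in exp().
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores for a three-class problem.
probs = softmax([2.0, 1.0, 0.1])
print(probs)
```

The class with the highest raw score receives the highest probability, which is why softmax pairs naturally with categorical cross-entropy.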

Exploding Gradient Problem 🚀

What is it?

The exploding gradient problem occurs when gradients become too large during backpropagation, causing unstable training and NaN values in the model.

Why does it happen?

  • Common in deep networks (especially RNNs & LSTMs).

  • During backpropagation, gradients are repeatedly multiplied—if weights are too large (>1), they grow exponentially, leading to "explosion."

Effects:

❌ Training becomes unstable
❌ Loss function diverges (keeps increasing)
❌ Model fails to learn

How to fix it?

Gradient Clipping → Limits gradient values to a fixed range
Proper Weight Initialization → Like Xavier or He initialization
Lower Learning Rate → Prevents large updates
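
Gradient clipping can be sketched in two common variants. Clipping by value limits each gradient component to a fixed range, as described above. Clipping by norm rescales the whole gradient vector when its length exceeds a threshold. The function names and thresholds below are hypothetical choices for illustration, and deep learning frameworks provide their own equivalents.

```python
import math


def clip_by_value(gradients, limit):
    """Limits every gradient component to the range [-limit, limit]."""
    return [max(-limit, min(limit, g)) for g in gradients]


def clip_by_norm(gradients, max_norm):
    """Rescales the gradient vector if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in gradients))
    if norm <= max_norm:
        return list(gradients)
    scale = max_norm / norm
    return [g * scale for g in gradients]


print(clip_by_value([0.5, -12.0, 7.0], limit=5.0))
print(clip_by_norm([30.0, 40.0], max_norm=5.0))
```

Clipping by norm keeps the direction of the gradient and only shortens it, whereas clipping by value can change the direction.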



Loss Function and Cost Function:
-->The loss function is computed for a single record. If we have 100 records, the loss measures the error for one record passed through forward propagation.
-->The cost function is computed over a batch of records, typically as the average of their individual losses. One full pass over all the records is called an epoch.

Regression loss functions are: - 

                  1)MSE(Mean Squared error)

                   2)MAE(Mean Absolute error)

                   3)Huber Loss

                   4)pseudo huber loss


1)MSE (Mean Squared Error):
     Mean squared error is the average of the squares of the errors.
\[
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n}(Y_{i}-\hat{Y}_{i})^2
\]
where MSE = mean squared error, n = number of data points, Y_i = observed values, and \hat{Y}_i = predicted values.

Advantages:

1)Mean squared error is differentiable.

2)It is a quadratic function of the error, so with respect to the predictions it has only one minimum, which is the global minimum.

3)It converges faster.

Disadvantages:

1)It is not robust to outliers.

2)MAE (Mean Absolute Error):

-->Mean absolute error is more robust to outliers than MSE.

-->Mean absolute error is the average of the absolute differences between the predicted values and the actual values.

\[
\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - x_i|
\]
where MAE = mean absolute error, y_i = prediction, x_i = true value, and n = total number of data points.

3)Huber Loss:
-->Huber loss is a combination of MSE (Mean Squared Error) and MAE (Mean Absolute Error): it is quadratic for small errors and linear for large errors.
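
One way to see how Huber loss combines MSE and MAE is the sketch below. It uses the standard form with a threshold `delta`: errors up to `delta` are penalized quadratically (like MSE) and larger errors linearly (like MAE). The parameter name `delta` and its default of 1.0 are assumptions, since these notes do not specify a threshold.

```python
def huber(y_true, y_pred, delta=1.0):
    """Average Huber loss: quadratic for small errors, linear for large ones."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        error = abs(y - p)
        if error <= delta:
            total += 0.5 * error ** 2
        else:
            total += delta * (error - 0.5 * delta)
    return total / len(y_true)


def mse(y_true, y_pred):
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical data where the last record is an outlier.
actual = [1.0, 2.0, 3.0, 100.0]
predicted = [1.1, 1.9, 3.2, 4.0]
print(huber(actual, predicted), mse(actual, predicted))
```

On this data the outlier dominates MSE far more than it dominates the Huber loss.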


Classification loss functions are:

 In classification we generally use cross-entropy.
-->Common classification loss functions are
            1)Binary Cross Entropy
            2)Categorical Cross Entropy
            3)Sparse Categorical Cross Entropy
            4)Hinge Loss
1)Binary Cross Entropy:
      Binary cross-entropy is used for binary classification.
2)Categorical Cross Entropy:
 Categorical cross-entropy is used for multiclass classification.
-->With categorical cross-entropy, the first step is to one-hot encode the class labels.

Optimizers:
                 Optimizers determine how the weights are updated during backward propagation.
--> Different types of optimizers are
              1)Gradient Descent
              2)SGD(Stochastic Gradient Descent)
              3)Mini Batch SGD
              4)Mini Batch SGD with momentum
              5)Adagrad
              6)RMSPROP
              7)Adam Optimizer(Mostly used optimizer)

1)Gradient Descent:
The weight updating formula in backward propagation is

\[
W_{new} = W_{old} - \alpha \frac{\partial L}{\partial W_{old}}
\]

The major disadvantage of gradient descent is that it requires huge resources (a large amount of RAM, etc.), because the entire dataset is processed in every epoch.
Epoch:
     One full cycle over the training data, consisting of both forward propagation and backward propagation, is called an epoch.
-->In gradient descent, if we have one hundred million records, all of them are passed through the network in each epoch before the weights are updated.
 
2)SGD (Stochastic Gradient Descent):
   In stochastic gradient descent, we pass only one record at a time and update the weights after each record.
-->Processing the first record through forward propagation, finding ŷ, computing the loss, and processing the record through backward propagation to update the weights: this entire process is called an iteration.
-->Passing the first record is iteration 1, passing the second record is iteration 2, and so on.
-->The major disadvantage of SGD is that convergence is very slow, because each record is passed individually.
-->The training time is therefore very high.
-->With gradient descent, the path toward the minimum shows no zig-zag movement, but gradient descent requires huge resources.
-->With SGD, the path toward the minimum shows a lot of zig-zag movement.

3)Mini Batch SGD:
In mini-batch SGD, instead of passing each record individually or passing the whole dataset at once, we choose a suitable batch size and pass that many records at a time. For example, if the batch size is 1000, we pass 1000 records in each iteration.
-->In mini-batch SGD the zig-zag movement is still present, but it is smaller than in SGD. This zig-zag movement is called noise.

Instead of processing the entire dataset at once (as in Batch Gradient Descent) or updating weights after every single data point (as in Stochastic Gradient Descent), Mini-Batch Gradient Descent takes a middle approach:

  1. Divide the dataset into smaller batches.

    • Example: If you have 1,000 records and choose a batch size of 100, you will have 10 batches in total.
  2. Iteratively update the weights for each batch:

    • Take the first batch (100 records), compute the loss, and update the weights.
    • Move to the next batch, using the updated weights from the previous step, compute the loss, and update the weights again.
    • Repeat this process for all batches.
  3. Multiple weight updates per epoch:

    • If the entire dataset were processed at once, the weights would be updated only once per epoch.
    • With mini-batches, weights get updated multiple times per epoch (in this case, 10 times for 10 batches), leading to more frequent learning and better optimization.

Why Use Mini-Batch Gradient Descent?

✅ More stable updates compared to Stochastic Gradient Descent (SGD).
✅ Faster and more memory-efficient than Batch Gradient Descent.
✅ Helps generalization by introducing some randomness in learning.
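
The batching procedure above can be sketched on the same numbers used in the example: 1,000 records and a batch size of 100, which gives 10 weight updates per epoch. The data (points on the line y = 2x), the single-weight model, the learning rate, and the seed are all hypothetical choices made for this illustration.

```python
import random


def minibatch_sgd(xs, ys, batch_size, learning_rate, epochs, seed=0):
    """Fits y = w * x with mini-batch SGD on the mean squared error."""
    rng = random.Random(seed)
    w = 0.0
    updates = 0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the records in a new order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the batch MSE with respect to w.
            grad = sum(2.0 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= learning_rate * grad
            updates += 1
    return w, updates


xs = [i / 1000 for i in range(1000)]
ys = [2.0 * x for x in xs]
w, updates = minibatch_sgd(xs, ys, batch_size=100, learning_rate=0.5, epochs=20)
print(w, updates)
```

With 20 epochs the weight is updated 200 times, compared with 20 updates if the whole dataset were processed at once in each epoch.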


4)Mini Batch SGD with momentum:
To reduce the noise present in SGD and mini-batch SGD, we use SGD with momentum.
-->Using an exponentially weighted average of the gradients, we smooth the update path so that the noise is reduced.

-->It converges more quickly.
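
The smoothing effect of an exponentially weighted average can be seen on a deliberately noisy sequence. In the sketch below, `beta=0.9` is an assumed, commonly used smoothing factor, the zig-zag gradients are made up for illustration, and `momentum_step` shows one common way to write the resulting weight update.

```python
def exponentially_weighted_average(values, beta=0.9):
    """Running average in which recent values count more than old ones."""
    average = 0.0
    smoothed = []
    for value in values:
        average = beta * average + (1.0 - beta) * value
        smoothed.append(average)
    return smoothed


def momentum_step(w, velocity, gradient, learning_rate=0.01, beta=0.9):
    """One update that follows the smoothed gradient, not the raw one."""
    velocity = beta * velocity + (1.0 - beta) * gradient
    return w - learning_rate * velocity, velocity


# Hypothetical zig-zag gradients that flip sign at every step.
noisy_gradients = [1.0, -1.0] * 10
smoothed = exponentially_weighted_average(noisy_gradients)
print(max(abs(g) for g in noisy_gradients), max(abs(s) for s in smoothed))
```

The raw gradients swing between -1 and +1, while the smoothed values stay close to 0, so the update path zig-zags much less.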


 5)AdaGrad:

-->AdaGrad stands for Adaptive Gradient Descent. In plain gradient descent the learning rate is fixed. In AdaGrad the learning rate is initially high and decreases as we move toward the global minimum point.

-->This brings adaptiveness to the learning: the learning rate is not fixed but keeps decreasing as we approach the global minimum point.

 6)RMSPROP:

-->RMSPROP stands for Root Mean Square Propagation. It is an extension of gradient descent and of AdaGrad: it uses an exponentially weighted average of the squared gradients, so the learning rate does not keep shrinking as it does in AdaGrad.

7)Adam Optimizer:

-->The Adam optimizer combines momentum with RMSPROP.

-->It is the most widely used optimizer and is often a good default choice.

-->The momentum part smooths the updates.

-->The RMSPROP part makes the learning rate adaptive.
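
A rough sketch of how Adam combines the two ideas is given below: `m` is the momentum-style average of the gradients and `v` is the RMSPROP-style average of the squared gradients. The defaults `beta1=0.9`, `beta2=0.999`, and `eps=1e-8` are commonly used values assumed here, and the bias-correction step comes from the standard formulation of Adam rather than from these notes. The toy loss (w - 3)^2 is hypothetical.

```python
import math


def adam_minimize(gradient_fn, w, steps, learning_rate=0.001,
                  beta1=0.9, beta2=0.999, eps=1e-8):
    """Minimizes a one-dimensional function given its gradient."""
    m = 0.0  # average of gradients (momentum part)
    v = 0.0  # average of squared gradients (RMSPROP part)
    for t in range(1, steps + 1):
        g = gradient_fn(w)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1 ** t)  # bias correction for early steps
        v_hat = v / (1.0 - beta2 ** t)
        w -= learning_rate * m_hat / (math.sqrt(v_hat) + eps)
    return w


def toy_gradient(w):
    """Gradient of the hypothetical loss (w - 3)^2."""
    return 2.0 * (w - 3.0)


w_final = adam_minimize(toy_gradient, w=0.0, steps=2000, learning_rate=0.05)
print(w_final)
```

Because the update divides by the root of the squared-gradient average, the size of each step depends mainly on the learning rate rather than on the raw gradient magnitude.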

Batch Normalization is a technique in deep learning that helps speed up training and improve the stability of neural networks. Here’s a simple way to understand it:

Why Do We Need Batch Normalization?

  1. Neural networks learn better when input data is well-scaled.
    • If some features in the data have very large values and others have very small values, the network may struggle to learn efficiently.
  2. Internal Covariate Shift:
    • As data passes through multiple layers, its distribution keeps changing, making training harder. Batch Normalization helps to keep things stable.

How Does Batch Normalization Work?

  • It normalizes the inputs to each layer so they have a mean of 0 and a standard deviation of 1.
  • This normalization is done for each mini-batch during training.
  • It then applies two learnable parameters (scale & shift) to maintain flexibility.

Steps in Batch Normalization:

  1. Compute the mean and variance for the mini-batch.
  2. Normalize the values: $x_{\text{norm}} = \frac{x - \mu}{\sigma}$, where $\mu$ is the mean and $\sigma$ is the standard deviation of the batch.
  3. Scale and shift using two parameters $\gamma$ and $\beta$: $y = \gamma x_{\text{norm}} + \beta$
    • $\gamma$ (scale) and $\beta$ (shift) are learned during training.
  4. As a rule of thumb, batch normalization is often applied once every two hidden layers instead of after every hidden layer.
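
The steps above can be sketched with NumPy. This is only the training-time forward pass under simple assumptions: the small constant `eps` is added inside the square root for numerical safety, and the function name `batch_norm_forward` is illustrative, not a library API.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Batch normalization for a mini-batch x of shape (batch, features)."""
    mu = x.mean(axis=0)                      # step 1: mean per feature
    var = x.var(axis=0)                      # step 1: variance per feature
    x_norm = (x - mu) / np.sqrt(var + eps)   # step 2: normalize
    return gamma * x_norm + beta             # step 3: scale and shift

rng = np.random.default_rng(0)
batch = rng.normal(loc=50.0, scale=10.0, size=(32, 4))   # badly scaled inputs
out = batch_norm_forward(batch, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0), out.std(axis=0))
```

With `gamma = 1` and `beta = 0` the output has roughly zero mean and unit standard deviation per feature. During training the network is free to learn other values for both parameters.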

Benefits of Batch Normalization

✅ Faster training – allows higher learning rates.
✅ More stable training – prevents drastic changes in activations.
✅ Reduces overfitting – acts like a regularizer.
✅ Less sensitive to weight initialization.

Sequential: -

     Sequential treats the entire neural network as a single block, i.e. a linear stack of layers. This lets us run forward propagation & backward propagation through all the layers in order.

Dense: -

     Dense helps us to create fully connected layers: the hidden layers & the output layer. The first Dense layer also receives the input.

Activation: -

     Activation helps us to apply different types of activation functions (ReLU, sigmoid, tanh, etc.).


-->ANN, CNN & RNN are black box models.

-->We need to scale the data before we apply ANN & CNN.

-->To scale the data for ANN, the standard scaler is used most often.

-->One of the libraries we use to implement ANN is TensorFlow, which was developed by Google. Before TensorFlow 2.0 we had to install TensorFlow & Keras separately. From version 2.0 onwards, Keras is integrated into TensorFlow.
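
The scaling step mentioned above can be sketched with plain NumPy. This is the same computation a standard scaler performs (subtract the column mean, divide by the column standard deviation); the sample values and the function name `standardize` are made up for illustration.

```python
import numpy as np

def standardize(features):
    """Scale each column to zero mean and unit standard deviation."""
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    return (features - mean) / std

# Hypothetical records: [age, salary]. The two columns have very different scales.
X = np.array([[25.0, 40000.0],
              [35.0, 60000.0],
              [45.0, 80000.0]])
X_scaled = standardize(X)
print(X_scaled)
```

In a real project the mean and standard deviation should be computed on the training data only and then reused for the test data.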


What is Regularization in Deep Learning?

Regularization is a technique used in deep learning to prevent overfitting and improve the generalization of a model. It does this by adding constraints or modifications during training to ensure the model doesn't memorize the training data but instead learns meaningful patterns.


Why is Regularization Important?

Deep learning models have millions of parameters, and if trained on a limited dataset, they can easily memorize the data instead of learning general patterns. This leads to overfitting, where the model performs well on training data but poorly on new, unseen data.

Regularization helps by: 

✅ Preventing overfitting and making the model more generalizable.
✅ Avoiding high variance in predictions.
✅ Allowing the model to learn robust features instead of noise.


Types of Regularization in Deep Learning

Regularization can be applied in multiple ways, each addressing overfitting differently.

1️⃣ L1 and L2 Regularization (Weight Decay)

These methods modify the loss function by adding a penalty on large weights.

🔹 L1 Regularization (Lasso)

  • Adds the sum of absolute values of weights to the loss function: $\text{Loss} = \text{Original Loss} + \lambda \sum |W|$
  • Encourages sparsity, meaning some weights become zero, effectively eliminating less important features.
  • Useful for feature selection in high-dimensional data.

🔹 L2 Regularization (Ridge)

  • Adds the sum of squared weights to the loss function: $\text{Loss} = \text{Original Loss} + \lambda \sum W^2$
  • Encourages smaller weights rather than eliminating them.
  • Helps in preventing extreme values, making the model more stable.

🔹 L1 vs. L2 Regularization

| Feature | L1 (Lasso) | L2 (Ridge) |
|---|---|---|
| Weight effect | Some weights become zero | All weights shrink but remain nonzero |
| Feature selection | Yes, removes less important features | No, but helps in generalization |
| Best for | Sparse models, high-dimensional data | Deep networks, general regularization |

💡 L1 and L2 are often combined as Elastic Net, which balances both techniques.
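
A small worked example of the two penalties, assuming a made-up weight vector, a hypothetical original loss of 0.30 and λ = 0.01. The helper names are illustrative.

```python
import numpy as np

def l1_penalty(weights, lam):
    """lambda * sum(|W|)"""
    return lam * np.sum(np.abs(weights))

def l2_penalty(weights, lam):
    """lambda * sum(W^2)"""
    return lam * np.sum(weights ** 2)

w = np.array([0.5, -2.0, 0.0, 1.5])
original_loss = 0.30   # hypothetical value of the unregularized loss
lam = 0.01

loss_l1 = original_loss + l1_penalty(w, lam)   # 0.30 + 0.01 * 4.0 = 0.34
loss_l2 = original_loss + l2_penalty(w, lam)   # 0.30 + 0.01 * 6.5 = 0.365
print(loss_l1, loss_l2)
```

The large weight (-2.0) contributes 2.0 to the L1 sum but 4.0 to the L2 sum, which is why L2 pushes hardest on the largest weights.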


2️⃣ Dropout in Neural Networks

Why is Dropout Needed?

  • Sometimes, a neural network overfits, meaning it performs well on training data but poorly on test data.
  • Overfitting happens when neurons memorize training data instead of learning general patterns.
  • To prevent this, we use Dropout, a regularization technique.

How Dropout Works

  • Dropout is a regularization layer that randomly deactivates a percentage of neurons during training.
  • If we set dropout = 0.3, then 30% of neurons are randomly turned off during training.
  • Key point:
    • In Batch 1, 30% of neurons are randomly deactivated.
    • In Batch 2, a different set of 30% of neurons are deactivated.
    • This continues for all batches during training.
  • Dropout does not remove neurons permanently; it reduces the dependency of one neuron on others, helping the model learn more generalized patterns.
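
The behaviour above can be sketched with NumPy. This sketch uses the common "inverted dropout" variant, which is an assumption on top of the description: the surviving activations are divided by `1 - rate` so that their expected value stays the same. The function name `dropout` is illustrative.

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    """Randomly zero out a fraction `rate` of activations during training."""
    if not training or rate == 0.0:
        return activations                            # nothing is dropped at inference time
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)     # rescale the survivors

rng = np.random.default_rng(42)
acts = np.ones((1, 1000))                 # 1000 neurons, all outputting 1.0
batch1 = dropout(acts, 0.3, rng)          # roughly 30% of neurons turned off
batch2 = dropout(acts, 0.3, rng)          # a different random set is turned off
print((batch1 == 0).mean(), (batch2 == 0).mean())
```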

Benefits of Dropout

✅ Reduces overfitting
✅ Makes the model more robust
✅ Ensures neurons don’t become overly reliant on each other

🔹 Works well in fully connected (dense) layers, but not always needed in convolutional layers (CNNs).
🔹 Common Dropout values: 0.2 - 0.5


3️⃣ Batch Normalization (BN)

  • Normalizes the activations at each layer, stabilizing training.
  • Helps models train faster and with higher learning rates.
  • Acts as a form of regularization by introducing slight noise in training.
  • Works well in deep networks like CNNs and Transformers.

📌 Alternative: Layer Normalization (better for NLP tasks & RNNs).


4️⃣ Early Stopping

  • Stops training when validation loss stops improving.
  • Prevents the model from continuing to learn noise after reaching the optimal point.
  • Saves training time and prevents overfitting.

🔹 Steps:

  1. Monitor validation loss.
  2. If loss stops decreasing for a set number of epochs, stop training.
  3. Use the model from the epoch with the best validation performance.
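
The three steps can be sketched as a small function over a list of validation losses. The loss values and the `patience` setting are made up for illustration, and `early_stopping_epoch` is a hypothetical helper, not a framework API.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (best_epoch, stopped_at) for a sequence of validation losses."""
    best_loss, best_epoch, waited = float("inf"), -1, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0   # improvement: reset the counter
        else:
            waited += 1
            if waited >= patience:                           # no improvement for `patience` epochs
                return best_epoch, epoch
    return best_epoch, len(val_losses) - 1

val_losses = [0.90, 0.70, 0.60, 0.62, 0.65, 0.50]
best_epoch, stopped_at = early_stopping_epoch(val_losses, patience=2)
print(best_epoch, stopped_at)   # 2 4
```

Training stops at epoch 4 and the model from epoch 2 is kept. Note that the later loss of 0.50 is never seen, which is the trade-off of a small patience value.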

5️⃣ Data Augmentation

  • Increases the diversity of the training data artificially.
  • Commonly used in image processing (rotating, flipping, adding noise).
  • Helps prevent overfitting by ensuring the model learns more general patterns.

🔹 Examples in Image Classification:
✔ Random cropping, flipping, rotation
✔ Adding noise or blurring
✔ Changing brightness or contrast

📌 Works best for computer vision tasks.
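
A minimal sketch of a few of these augmentations with NumPy, assuming a square grayscale image with pixel values already scaled to the range 0 to 1. The function name `augment` and the noise level are illustrative choices.

```python
import numpy as np

def augment(image, rng):
    """Return a randomly flipped, rotated and slightly noisy copy of the image."""
    if rng.random() < 0.5:
        image = np.fliplr(image)                          # random horizontal flip
    image = np.rot90(image, k=int(rng.integers(0, 4)))    # rotate by 0, 90, 180 or 270 degrees
    noise = rng.normal(0.0, 0.02, size=image.shape)       # small Gaussian noise
    return np.clip(image + noise, 0.0, 1.0)               # keep pixels in [0, 1]

rng = np.random.default_rng(0)
image = rng.random((8, 8))                                # stand-in for a real image
augmented = [augment(image, rng) for _ in range(4)]
print([a.shape for a in augmented])
```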


6️⃣ Label Smoothing

  • Instead of using hard labels (e.g., 1 for cat, 0 for dog), soften the labels (e.g., 0.9 for cat, 0.1 for dog).
  • Prevents the model from becoming too confident, improving generalization.
  • Used in classification tasks like NLP and computer vision.
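
The cat/dog example above can be reproduced with one common smoothing formula, `y * (1 - ε) + ε / K`, where K is the number of classes. The value ε = 0.2 is chosen here only because it gives the 0.9 / 0.1 labels for two classes.

```python
import numpy as np

def smooth_labels(one_hot, epsilon):
    """Soften one-hot labels: y * (1 - epsilon) + epsilon / num_classes."""
    num_classes = one_hot.shape[-1]
    return one_hot * (1.0 - epsilon) + epsilon / num_classes

hard = np.array([[1.0, 0.0]])              # [cat, dog]
soft = smooth_labels(hard, epsilon=0.2)    # [[0.9, 0.1]]
print(soft)
```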

7️⃣ Gradient Clipping

  • Limits the size of gradients to prevent exploding gradients, especially in RNNs.
  • Helps in stabilizing training when using deep networks.

📌 Common in recurrent neural networks (RNNs) and deep transformers like GPT.
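
A minimal sketch of clipping by norm, one common way to limit gradient size. The function name `clip_by_norm` and the threshold are illustrative.

```python
import numpy as np

def clip_by_norm(grad, max_norm):
    """Rescale the gradient so that its L2 norm is at most max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)   # same direction, smaller length
    return grad

g = np.array([3.0, 4.0])                  # norm is 5.0
clipped = clip_by_norm(g, max_norm=1.0)   # [0.6, 0.8]
print(clipped)
```

The direction of the gradient is preserved; only its length is reduced.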


Comparison of Regularization Techniques

| Technique | Prevents Overfitting? | Works on? | Best for |
|---|---|---|---|
| L1/L2 Regularization | ✅ Yes | Any deep learning model | General use cases |
| Dropout | ✅ Yes | Dense (fully connected) layers | Deep feedforward networks |
| Batch Normalization | ✅ Yes | CNNs, Deep Networks | Training stability & speed |
| Early Stopping | ✅ Yes | Any model | Saving time, avoiding unnecessary training |
| Data Augmentation | ✅ Yes | Image-based models | Image classification, object detection |
| Label Smoothing | ✅ Yes | Classification tasks | NLP, Computer Vision |
| Gradient Clipping | Indirectly (mainly stabilizes training) | RNNs, deep models | Preventing exploding gradients |

Which Regularization Should You Use?

🔹 For general deep learning problems: Use L2 Regularization + Dropout
🔹 For CNNs: Use Batch Normalization + Data Augmentation
🔹 For RNNs (LSTMs, GRUs): Use Gradient Clipping + Layer Normalization
🔹 For NLP tasks: Use Label Smoothing + Layer Normalization
🔹 For faster training: Use Batch Normalization + Early Stopping

Weight Initialization in Deep Learning

Weight initialization is the process of assigning initial values to the weights of a neural network before training. Proper initialization is crucial to ensure stable gradient updates and faster convergence.


Why is Weight Initialization Important?

  1. Avoids Exploding/Vanishing Gradients: Poor initialization can cause gradients to become too large or too small, making training unstable.
  2. Speeds Up Convergence: Proper initialization helps the model learn faster by providing meaningful starting points.
  3. Prevents Dead Neurons: Ensures neurons remain active and contribute to learning.

Types of Weight Initialization

1️⃣ Zero Initialization:

  • All weights are set to zero.
  • Problem: Neurons will learn the same features, leading to a lack of diversity.
  • Not recommended.
  • Causes symmetry issue, no learning

2️⃣ Random Initialization:

  • Weights are set randomly.
  • If values are too high, gradients may explode. If too low, gradients may vanish.
  • Not ideal for deep networks.
  • Can cause vanishing/exploding gradients

3️⃣ Xavier (Glorot) Initialization:

  • Designed for sigmoid & tanh activations.
  • Formula (the second argument is the variance): $W \sim \mathcal{N}\left(0, \frac{2}{n_{in} + n_{out}}\right)$
  • Where $n_{in}$ = number of inputs, $n_{out}$ = number of outputs.
  • Ensures balanced variance of activations and gradients.
  • Good for shallow networks.
  • Maintains variance across layers

4️⃣ He Initialization (Kaiming Initialization):

  • Designed for ReLU activations.
  • Formula: $W \sim \mathcal{N}\left(0, \frac{2}{n_{in}}\right)$
  • Prevents neurons from becoming "dead."
  • Best for deep networks with ReLU.

5️⃣ LeCun Initialization:

  • Designed for SELU activations.
  • Formula: $W \sim \mathcal{N}\left(0, \frac{1}{n_{in}}\right)$
  • Works best for self-normalizing networks.
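
The three schemes can be compared with a short NumPy sketch that samples weights and checks their variance. It assumes the usual variances: 2 / (n_in + n_out) for Glorot normal, 2 / n_in for He normal and 1 / n_in for LeCun normal. The helper names are illustrative and plain normal sampling is used, which may differ in detail from what a framework does internally.

```python
import numpy as np

def glorot_normal(n_in, n_out, rng):
    return rng.normal(0.0, np.sqrt(2.0 / (n_in + n_out)), size=(n_in, n_out))

def he_normal(n_in, n_out, rng):
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))

def lecun_normal(n_in, n_out, rng):
    return rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_in, n_out))

rng = np.random.default_rng(0)
n_in, n_out = 256, 128
w_glorot = glorot_normal(n_in, n_out, rng)
w_he = he_normal(n_in, n_out, rng)
w_lecun = lecun_normal(n_in, n_out, rng)

# Empirical variances should be close to the target values.
print(w_glorot.var(), 2.0 / (n_in + n_out))
print(w_he.var(), 2.0 / n_in)
print(w_lecun.var(), 1.0 / n_in)
```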

How to Apply in Keras?

from keras.layers import Dense
from keras.initializers import HeNormal, GlorotUniform

# Using He Initialization for ReLU activation
layer1 = Dense(128, activation="relu", kernel_initializer=HeNormal())

# Using Xavier Initialization for Tanh activation
layer2 = Dense(64, activation="tanh", kernel_initializer=GlorotUniform())

Which Initialization to Use?

  • For ReLU / Leaky ReLU → Use He Initialization
  • For Sigmoid / Tanh → Use Xavier (Glorot) Initialization
  • For SELU → Use LeCun Initialization

Choosing the right weight initialization improves training stability and speed while avoiding common deep learning problems. 🚀

CONVOLUTIONAL NEURAL NETWORK (CNN): -

  Whenever we work with images & video frames, for tasks like image classification & object detection, we prefer to use CNN.

-->CNN works similarly to the visual cortex, which is part of the cerebral cortex of the human brain, i.e. a CNN passes the image through different layers to produce the result.



-->The first step in CNN is to bring the pixel values between 0 & 1 by dividing each pixel value by 255. This step is called MIN-MAX SCALING.

Before we go to the working of CNN, let's see the basics of how an image is represented. Images are classified into 2 types: 1) Black & white (grayscale) images, which have a single channel 2) RGB images, which have three channels: RED, GREEN & BLUE.


-->The filter (kernel) is applied on the image matrix,
i.e. each value in the current window of the image is multiplied with the corresponding value in the filter/kernel & the sum is calculated. The sum is assigned to one cell of the output matrix. The filter then slides to the next position & this process continues.

-->Feature scaling is applied on the matrix that we got after applying the filter: the lowest value is converted to 0 & the highest value is converted to 255.

-->255 is referred to as WHITE & 0 is referred to as BLACK.


Whenever we pass a 6X6 image through a 3X3 filter we get a 4X4 matrix as output. We can calculate this as below.

For example, let's assume the input image is 6X6, so n=6. As the filter is a 3X3 matrix, f=3. The formula to calculate the output size is n-f+1=6-3+1=4, i.e. the output is a 4X4 matrix.
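
The sliding-window operation and the n-f+1 output size can be checked with a small NumPy sketch. The image values and the Sobel-like vertical edge filter are made up for illustration, and `conv2d_valid` is a hypothetical helper written as plain loops for readability, not speed.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1)."""
    n_h, n_w = image.shape
    f_h, f_w = kernel.shape
    out = np.zeros((n_h - f_h + 1, n_w - f_w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            window = image[i:i + f_h, j:j + f_w]
            out[i, j] = np.sum(window * kernel)   # multiply element-wise, then sum
    return out

image = np.arange(36, dtype=float).reshape(6, 6)   # 6X6 input, n = 6
kernel = np.array([[1.0, 0.0, -1.0],
                   [2.0, 0.0, -2.0],
                   [1.0, 0.0, -1.0]])              # 3X3 filter, f = 3
feature_map = conv2d_valid(image, kernel)
print(feature_map.shape)   # (4, 4)
```

Because every row of this toy image increases by 1 per column, each output cell is the same value (-8), which shows the filter responding only to the horizontal change in intensity.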

-->The major problem here is that the input is 6X6 but the output is 4X4. This means the image size is decreasing, i.e. we are losing some data, mostly at the borders. In order to prevent this loss we use PADDING.

-->PADDING means building a border (like a compound wall) of extra cells around the image.

-->After adding one layer of padding around the image, the 6X6 image is converted to 8X8.


-->PADDING can be done in 2 ways: the first is filling the newly created cells with ZERO (0) & the second is filling each new cell with the value of its nearest cell.
-->Below is the example of nearest value filling



-->Below is the example for zero filling.


-->Let's calculate the size of the output image after padding. The formula is as below (rounded down when the division is not exact)

((n+2p-f)/s)+1

where p = number of layers of padding applied; according to the above use case p=1

s = stride, i.e. the number of cells the filter moves at each step; according to the above use case s=1

((n+2p-f)/s)+1=((6+2-3)/1)+1=6
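
The output size and the two padding styles can be checked with NumPy. The sketch uses the usual form floor((n + 2p - f) / s) + 1 for the output size; `conv_output_size` is a hypothetical helper, while `np.pad` with `mode="constant"` and `mode="edge"` is the real NumPy call for zero filling and nearest value filling.

```python
import numpy as np

def conv_output_size(n, f, p=0, s=1):
    """Output width/height of a convolution: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * p - f) // s + 1

image = np.arange(1, 37).reshape(6, 6)   # made-up 6X6 image

zero_padded = np.pad(image, pad_width=1, mode="constant", constant_values=0)  # zero filling
edge_padded = np.pad(image, pad_width=1, mode="edge")                         # nearest value filling

print(zero_padded.shape)                    # (8, 8)
print(conv_output_size(6, 3))               # 4, no padding
print(conv_output_size(6, 3, p=1, s=1))     # 6, output size matches the input
```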

-->Just as we update the weights in back propagation in ANN, we update the filter values in back propagation in CNN.

-->We apply the ReLU activation function on each & every value of the output in CNN.

-->All the above steps come under the convolution operation. After completing the convolution operation we do Max Pooling.

Pooling:-

 Pooling is the down sampling of an image (feature map). Pooling helps us to achieve location invariance, i.e. a feature is still detected even if its position shifts slightly. We have three types of Pooling

1)Avg Pooling

2)Min Pooling

3)Max Pooling

-->In Avg pooling the average of the pool is taken, whereas in Min pooling the minimum of the pool is taken and in Max pooling the maximum of the pool is taken.
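
The three pooling types can be sketched with NumPy on a made-up 4X4 feature map and non-overlapping 2X2 pools. The function name `pool2d` is illustrative.

```python
import numpy as np

def pool2d(feature_map, size=2, mode="max"):
    """Down-sample a 2-D feature map with non-overlapping size x size pools."""
    h, w = feature_map.shape
    h, w = h - h % size, w - w % size                      # drop rows/columns that do not fit
    blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    reducer = {"max": np.max, "min": np.min, "avg": np.mean}[mode]
    return reducer(blocks, axis=(1, 3))                    # reduce inside each pool

fm = np.array([[1.0, 3.0, 2.0, 4.0],
               [5.0, 6.0, 1.0, 0.0],
               [7.0, 2.0, 9.0, 8.0],
               [3.0, 4.0, 6.0, 5.0]])

max_pooled = pool2d(fm, mode="max")   # [[6, 4], [7, 9]]
min_pooled = pool2d(fm, mode="min")   # [[1, 0], [2, 5]]
avg_pooled = pool2d(fm, mode="avg")   # [[3.75, 1.75], [4.0, 7.0]]
print(max_pooled)
```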

Flattening Layer:-

This is the next step after pooling. Flattening converts all the resultant 2-D arrays from the pooled feature maps into a single long linear vector. This flattened vector is passed through a fully connected neural network & finally we get the results.






Recurrent Neural Network (RNN):-
                  Let's understand the Recurrent Neural Network with the example below. At time t-1, we pass our record to the RNN model, the processing happens & we get an output. For the next input, i.e. at time t, the output we got from the first input is also sent along with the new input into the recurrent neural network. Because of this, the sequence information is maintained, and the process continues at t+1, t+2 and so on.
-->This entire process occurs in forward propagation.
-->In back propagation, we update the weights using gradient descent to move towards the global minimum.
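
The forward pass described above can be sketched with NumPy: at every time step the previous hidden state is fed back in together with the new input. The sizes, random weights and the choice of tanh are assumptions for illustration, and `rnn_forward` is a hypothetical helper.

```python
import numpy as np

def rnn_forward(inputs, w_x, w_h, b):
    """Run a simple RNN over a sequence and return the hidden state at every step."""
    h = np.zeros(w_h.shape[0])                   # no previous output at the first step
    states = []
    for x_t in inputs:                           # t-1, t, t+1, ...
        h = np.tanh(w_x @ x_t + w_h @ h + b)     # new input + previous output
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
seq = rng.normal(size=(5, 3))                    # 5 time steps, 3 features each
w_x = rng.normal(scale=0.5, size=(4, 3))         # input-to-hidden weights
w_h = rng.normal(scale=0.5, size=(4, 4))         # hidden-to-hidden (recurrent) weights
b = np.zeros(4)

hidden_states = rnn_forward(seq, w_x, w_h, b)
print(hidden_states.shape)   # (5, 4)
```

The same weights `w_x` and `w_h` are reused at every time step, which is what makes the network recurrent.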


-->In backward propagation, the weights are updated using derivatives computed with the chain rule across all time steps (back propagation through time).
-->If we use the sigmoid activation function, the vanishing gradient problem occurs: the derivative of sigmoid lies between 0 and 0.25, so as back propagation proceeds through many time steps the product of the derivatives becomes very small. As a result the new weight is almost the same as the old weight.
-->So there will be only a small movement & training will not reach the global minimum point.
-->If we use other activation functions like ReLU, the derivative does not shrink the gradient, so when the recurrent weights are greater than 1 the gradient can keep growing as back propagation proceeds. This creates the exploding gradient problem, i.e. the weight changes become so big that training does not reach the global minimum.
-->To overcome these problems we use LSTM.
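
A quick numeric check shows how fast repeated multiplication shrinks or blows up a gradient. The 50 time steps and the factor 1.5 are made-up numbers for illustration; the sigmoid derivative at its maximum is 0.25.

```python
import math

def sigmoid_derivative(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

max_slope = sigmoid_derivative(0.0)   # 0.25, the largest value the derivative can take

# Best case for sigmoid: multiply the maximum derivative over 50 time steps.
vanishing = max_slope ** 50           # about 7.9e-31, effectively zero

# A per-step factor slightly above 1 (hypothetical value 1.5) grows instead.
exploding = 1.5 ** 50                 # about 6.4e8

print(vanishing, exploding)
```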

-->Types of RNN
      1)One to One RNN
                In a One to One RNN we have one input and one output.
      2)One to Many RNN
                In a One to Many RNN we have one input and many outputs.
      3)Many to One RNN
                In a Many to One RNN we have many inputs and one output.
      4)Many to Many RNN
                In a Many to Many RNN we have many inputs and many outputs.


LSTM RNN Networks:-

     In RNN the vanishing gradient problem occurs. If we have a very long sequence and some of the output really depends on the first word, the context cannot be captured easily with a plain RNN. So we prefer LSTM RNN.
    In order to control the flow of information, an LSTM (Long Short-Term Memory) network mainly contains a memory cell & 3 gates
                   1)Forget Gate
                   2)Input Gate
                   3)Output Gate


1)Memory Cell:-
         The memory cell is used for remembering & forgetting information based on the context of the input.
2)Forget Gate:-
        The forget gate decides what information we should forget from the previous cell state when updating the current cell state of the LSTM unit.
3)Input Gate:-
        The input gate decides what new information from the current input we should add to the memory cell when updating the current cell state of the LSTM unit.
4)Output Gate:-
      The cell state is passed through the tanh function, which squashes it to the range -1 to +1, and it is combined by a point-wise multiplication with the output gate's sigmoid, which lies between 0 and 1. This keeps only the meaningful information & the result is passed to the next cell as the hidden state.
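
One LSTM step with the memory cell and the three gates can be sketched with NumPy. This follows the standard LSTM equations under simple assumptions: all four weight blocks are stacked in a single matrix `W`, the sizes and random weights are made up, and `lstm_step` is a hypothetical helper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step. W has shape (4 * hidden, hidden + features)."""
    hidden = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f = sigmoid(z[0:hidden])                   # forget gate: what to drop from the old cell state
    i = sigmoid(z[hidden:2 * hidden])          # input gate: how much new information to add
    g = np.tanh(z[2 * hidden:3 * hidden])      # candidate values for the memory cell
    o = sigmoid(z[3 * hidden:4 * hidden])      # output gate: what to expose as hidden state
    c_t = f * c_prev + i * g                   # updated memory cell
    h_t = o * np.tanh(c_t)                     # point-wise combination passed to the next cell
    return h_t, c_t

rng = np.random.default_rng(0)
hidden, features = 4, 3
W = rng.normal(scale=0.1, size=(4 * hidden, hidden + features))
b = np.zeros(4 * hidden)

h, c = np.zeros(hidden), np.zeros(hidden)
for x_t in rng.normal(size=(5, features)):     # a sequence of 5 time steps
    h, c = lstm_step(x_t, h, c, W, b)
print(h, c)
```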
-->For hyperparameter tuning, use Keras Tuner with its grid search method.
BIDIRECTIONAL RNN:-
                    In a plain RNN all the information passes in one direction, so we cannot use the information from future words. To handle this we have the concept of Bidirectional RNN.
-->A Bidirectional RNN is slower than a plain RNN because it processes the sequence twice.

Similarly, we have the bidirectional LSTM.

Bidirectional LSTM  🚀

What is it?

A Bidirectional LSTM (BiLSTM) is an advanced LSTM that processes data in both forward and backward directions to capture more context from sequences.

How it works?

🔹 Standard LSTM reads input from past to future (left to right).
🔹 BiLSTM has two LSTMs:

  • Forward LSTM (left → right)

  • Backward LSTM (right → left)

🔹 Both outputs are combined to improve understanding.

Why use it?

✅ Captures past & future context
✅ Improves performance in NLP tasks (e.g., translation, speech recognition)


1️⃣ What is Sequence-to-Sequence (Seq2Seq)?

Seq2Seq models are designed for tasks where both the input and output are sequences of variable lengths. They are widely used in:

  • Machine Translation (e.g., English → French)

  • Text Summarization (e.g., Long text → Summary)

  • Speech-to-Text (e.g., Audio → Transcription)

A Seq2Seq model consists of two primary components: Encoder and Decoder, both usually implemented using Recurrent Neural Networks (RNNs) like LSTM (Long Short-Term Memory) or GRU (Gated Recurrent Unit).


2️⃣ In-depth Intuition of Encoder & Decoder Architecture

🔹 Encoder (Processing the Input)

The encoder’s job is to understand the input sequence and convert it into a context vector (fixed-size representation) that summarizes the entire sequence.

✅ Steps:

  1. Takes input word-by-word (or token-by-token).

  2. Processes it using an RNN, LSTM, or GRU.

  3. Produces a final hidden state (context vector) that contains the encoded meaning.

📌 Example: If translating “I love NLP” into French, the encoder processes the sentence and produces a compact representation of its meaning.


🔹 Decoder (Generating the Output)

The decoder takes the context vector from the encoder and generates the output sequence one token at a time.

✅ Steps:

  1. Takes the context vector as input.

  2. Generates words sequentially using an RNN/LSTM/GRU.

  3. Uses the previously generated word as input for the next step.

📌 Example: If translating “I love NLP” → “J’adore le NLP,” the decoder starts generating words in French based on the encoded context.


3️⃣ Problem with Encoder-Decoder Architecture

🚨 Issue: Long Sequences Lead to Information Loss

A major limitation of the traditional Seq2Seq model is that it compresses the entire input sequence into a single context vector.

🔴 As sentence length increases, BLEU score drops (BLEU score is a metric for evaluating text generation models like machine translation).
🔴 Context vector mainly retains information from the latest words, causing information loss from earlier words.

📌 Example:

  • If processing “I love NLP”, the context vector will hold more information about “NLP” than “I” or “love.”

  • What happens if a sentence has 100 words? The first words will be almost forgotten.

Thus, a single context vector fails to retain long-term dependencies, leading to inaccurate translations or text generation.


4️⃣ Solution: Attention Mechanism in Seq2Seq

To fix the limitations of the encoder-decoder model, we introduce the Attention Mechanism. Instead of relying on a single fixed context vector, attention allows the decoder to dynamically focus on different parts of the input at each step.

🔹 How Attention Works

Instead of passing only the final hidden state from the encoder, we allow the decoder to look at all encoder hidden states and assign weights to them dynamically.

✅ Steps of Attention Mechanism:

  1. Each encoder hidden state contributes to the context, weighted differently at each step.

  2. The decoder calculates an attention score for each input word.

  3. Words with higher attention scores influence the output more.

  4. The decoder generates words based on a weighted sum of encoder hidden states, not just the last one.
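
The steps above can be sketched with NumPy using a simple dot-product score. The encoder states and decoder state are made-up 2-dimensional vectors, and `dot_attention` is a hypothetical helper, not a library function.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())                      # subtract the max for numerical stability
    return e / e.sum()

def dot_attention(decoder_state, encoder_states):
    """Return the context vector and the attention weights for one decoder step."""
    scores = encoder_states @ decoder_state      # one attention score per input word
    weights = softmax(scores)                    # higher score -> more influence
    context = weights @ encoder_states           # weighted sum of all encoder hidden states
    return context, weights

encoder_states = np.array([[1.0, 0.0],           # hidden state for word 1
                           [0.0, 1.0],           # hidden state for word 2
                           [1.0, 1.0]])          # hidden state for word 3
decoder_state = np.array([2.0, 0.0])

context, weights = dot_attention(decoder_state, encoder_states)
print(weights, context)
```

Words 1 and 3 get equal, larger weights because their hidden states align with the decoder state, while word 2 contributes very little to the context vector.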


5️⃣ Advanced: Attention Mechanism In-Depth Architecture

🔹 Bidirectional LSTM

To further improve performance, Bidirectional LSTMs (BiLSTMs) are used instead of standard LSTMs.

  • A forward LSTM reads the input in its original order.

  • A backward LSTM reads the input in reverse order.

  • The outputs from both directions are combined, creating a richer representation for each word.

📌 Example:
If the input is "I love NLP", a bidirectional LSTM will process:

  • Forward: (I → love → NLP)

  • Backward: (NLP → love → I)

  • The final representation for each word is enriched with both past and future context.

🔹 Types of Attention Mechanisms

1️⃣ Additive Attention (Bahdanau Attention)

  • Computes attention scores using a feedforward network.

  • Better for variable-length sequences.

2️⃣ Multiplicative Attention (Luong Attention)

  • Uses dot product similarity between hidden states.

  • Faster but works best when input and output lengths are similar.
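
The difference between the two scoring styles can be sketched as below. This is a simplified version under assumed shapes: the additive score uses one hidden layer with randomly initialized weights `w_dec`, `w_enc` and `v`, which would be learned in a real model. All names are illustrative.

```python
import numpy as np

def multiplicative_score(decoder_state, encoder_state):
    """Luong-style score: dot product between the two hidden states."""
    return float(decoder_state @ encoder_state)

def additive_score(decoder_state, encoder_state, w_dec, w_enc, v):
    """Bahdanau-style score: a small feedforward network over both states."""
    return float(v @ np.tanh(w_dec @ decoder_state + w_enc @ encoder_state))

rng = np.random.default_rng(0)
dec = rng.normal(size=4)              # decoder hidden state
enc = rng.normal(size=4)              # one encoder hidden state
w_dec = rng.normal(size=(8, 4))       # hypothetical learned weights
w_enc = rng.normal(size=(8, 4))
v = rng.normal(size=8)

dot = multiplicative_score(dec, enc)
add = additive_score(dec, enc, w_dec, w_enc, v)
print(dot, add)
```

The dot product needs no extra parameters, which is why it is faster. The additive score has its own weights, so it costs more to compute but does not require the two states to have the same size.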


🔥 Final Thoughts

The Attention Mechanism revolutionized Seq2Seq models, allowing them to handle long sequences efficiently. This led to the development of Transformer-based architectures like T5, BART, GPT, which rely entirely on attention (instead of RNNs).
