Auxiliary Materials: Overview of the Architecture
Before the inception module was introduced, the main theme in the development of CNN classifiers was stacking as many convolutional layers as possible.
In most standard network architectures, the intuition is not clear as to why and when to perform the max-pooling operation and when to use the convolutional operation. For example, in AlexNet the convolutional and max-pooling operations follow each other, whereas in VGGNet we have 3 convolutional operations in a row and then 1 max-pooling layer.
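To make the contrast concrete, here is a minimal PyTorch sketch of the two stacking patterns; the channel sizes and layer counts are simplified placeholders, not the full published architectures:

```python
import torch.nn as nn

# AlexNet-style: convolution and max-pooling follow each other.
alexnet_style = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(64, 192, kernel_size=5, padding=2), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
)

# VGG-style: three 3x3 convolutions in a row, then one max-pooling layer.
vgg_style = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),
)
```

Blindly stacking ever deeper runs of such blocks raises the problems discussed next.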
Scale Variability
→ The salient parts of an image vary in size, so we cannot dictate one specific kernel size. It all depends on the type of information we are looking for: global (favoring a larger kernel) or local (favoring a smaller one).
Overfitting
→ Blindly adopting this stacking technique makes the model very deep and thus prone to overfitting, especially when the training data is limited.
Gradient Vanishing
→ Stacking a huge number of convolutional layers introduces a new problem: gradient depletion (vanishing gradients), since the gradient signal shrinks as it is backpropagated through many layers.
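A tiny numeric sketch illustrates the depletion; the per-layer factor of 0.25 is an assumed illustrative value (the maximum derivative of a sigmoid), not a measured one:

```python
# Gradients are multiplied layer by layer during backpropagation, so a
# per-layer factor below 1 shrinks the signal exponentially with depth.
per_layer_factor = 0.25  # assumed illustrative value
for depth in (5, 10, 20, 40):
    print(depth, per_layer_factor ** depth)
# 5  -> ~9.8e-04
# 10 -> ~9.5e-07
# 20 -> ~9.1e-13
# 40 -> ~8.3e-25
```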
Computationally Expensive
→ Every added layer also brings its own set of weights, so blindly going deeper makes the network increasingly expensive to train and to run.
Multiple Sizes on the Same Level (Deeper → Wider)
→ The inception module tackles the kernel-size dilemma by applying filters of multiple sizes (1x1, 3x3, 5x5) in parallel on the same input and concatenating their outputs, making the network wider instead of deeper. Naively done, however, this adds a heavy computational burden on the model: the number of weights in the (3x3) and (5x5) convolutional layers is huge, since each filter operates on the entire depth of the previous layer's output.
→ To solve this problem, a 1x1 convolutional layer is used as an adapter: it reduces the depth of the input before the expensive (3x3) and (5x5) convolutions.
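A back-of-the-envelope multiply count shows why the adapter helps; the input size (28x28x192) and channel counts here are illustrative assumptions in the spirit of GoogLeNet-style modules, not figures from this text:

```python
# Rough multiply count for one 5x5 convolution on a 28x28x192 input,
# with and without the 1x1 adapter (all sizes assumed for illustration).
H = W = 28
in_depth, out_depth, reduced = 192, 32, 16

direct = H * W * out_depth * (5 * 5 * in_depth)        # 5x5 on full depth
adapted = (H * W * reduced * (1 * 1 * in_depth)        # 1x1 reduction first
           + H * W * out_depth * (5 * 5 * reduced))    # 5x5 on reduced depth

print(f"{direct:,}")   # 120,422,400 multiplies
print(f"{adapted:,}")  # 12,443,648 multiplies, roughly a 10x saving
```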
The max-pooling branch of the module introduces two problems: (1) pooling preserves the full depth of its input, so the concatenated output becomes deeper after every module, and (2) this growing depth makes each subsequent module more expensive. Applying a 1x1 convolution after the pooling path reduces its depth and addresses both, as in the sketch below.
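Putting the pieces together, here is a minimal PyTorch sketch of a dimension-reduced inception module; the class name and channel counts are illustrative choices, not canonical:

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Minimal sketch of a dimension-reduced inception module
    (GoogLeNet-style branches; channel counts are illustrative)."""

    def __init__(self, in_ch, c1, c3_red, c3, c5_red, c5, pool_proj):
        super().__init__()
        # Branch 1: plain 1x1 convolution.
        self.b1 = nn.Conv2d(in_ch, c1, kernel_size=1)
        # Branch 2: 1x1 adapter reduces depth before the 3x3 convolution.
        self.b2 = nn.Sequential(
            nn.Conv2d(in_ch, c3_red, kernel_size=1), nn.ReLU(),
            nn.Conv2d(c3_red, c3, kernel_size=3, padding=1),
        )
        # Branch 3: 1x1 adapter before the expensive 5x5 convolution.
        self.b3 = nn.Sequential(
            nn.Conv2d(in_ch, c5_red, kernel_size=1), nn.ReLU(),
            nn.Conv2d(c5_red, c5, kernel_size=5, padding=2),
        )
        # Branch 4: max-pooling followed by a 1x1 projection, so the pooled
        # path no longer forces the full input depth into the output.
        self.b4 = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, pool_proj, kernel_size=1),
        )

    def forward(self, x):
        # All branches keep the spatial size, so outputs concatenate on depth.
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

# Example: a 28x28 map with 192 channels -> 256 channels (64+128+32+32).
x = torch.randn(1, 192, 28, 28)
out = InceptionModule(192, 64, 96, 128, 16, 32, 32)(x)
print(out.shape)  # torch.Size([1, 256, 28, 28])
```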