Auxiliary Materials: Overview of the Architecture
Before the inception module was introduced, the main theme in the development of CNN classifiers was stacking as many convolutional layers as possible.
In most standard network architectures, the intuition is not clear as to why and when to perform the max-pooling operation and when to use the convolutional operation. For example, in AlexNet the convolutional and max-pooling operations follow each other, whereas in VGGNet we have 3 convolutional operations in a row and then 1 max-pooling layer.
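To make the contrast concrete, here is a minimal PyTorch sketch of the two stacking patterns; the channel sizes and layer counts are simplified placeholders, not the full published architectures:

```python
import torch.nn as nn

# AlexNet-style: convolution and max-pooling follow each other.
alexnet_style = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(64, 192, kernel_size=5, padding=2), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
)

# VGG-style: three 3x3 convolutions in a row, then one max-pooling layer.
vgg_style = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),
)
```

Blindly stacking ever deeper runs of such blocks raises the problems discussed next.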
Scale Variability
→ The salient parts of an image vary in size, so we cannot dictate one specific kernel size. It all depends on the type of information we are looking for: global (favoring a larger kernel) or local (favoring a smaller one).
Overfitting
→ Blindly adopting this stacking technique makes the model very deep and thus prone to overfitting, especially when the training data is limited.
Gradient Vanishing
→ Stacking a huge number of convolutional layers introduces a new problem: gradient depletion (vanishing gradients), since the gradient signal shrinks as it is backpropagated through many layers.
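A tiny numeric sketch illustrates the depletion; the per-layer factor of 0.25 is an assumed illustrative value (the maximum derivative of a sigmoid), not a measured one:

```python
# Gradients are multiplied layer by layer during backpropagation, so a
# per-layer factor below 1 shrinks the signal exponentially with depth.
per_layer_factor = 0.25  # assumed illustrative value
for depth in (5, 10, 20, 40):
    print(depth, per_layer_factor ** depth)
# 5  -> ~9.8e-04
# 10 -> ~9.5e-07
# 20 -> ~9.1e-13
# 40 -> ~8.3e-25
```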
Computationally Expensive
→ Every added layer also brings its own set of weights, so blindly going deeper makes the network increasingly expensive to train and to run.
Multiple Sizes on the Same Level (Deeper → Wider)
→ The inception module tackles the kernel-size dilemma by applying filters of multiple sizes (1x1, 3x3, 5x5) in parallel on the same input and concatenating their outputs, making the network wider instead of deeper. Naively done, however, this adds a heavy computational burden on the model: the number of weights in the (3x3) and (5x5) convolutional layers is huge, since each filter operates on the entire depth of the previous layer's output.
→ To solve this problem, a 1x1 convolutional layer is used as an adapter: it reduces the depth of the input before the expensive (3x3) and (5x5) convolutions.
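A back-of-the-envelope multiply count shows why the adapter helps; the input size (28x28x192) and channel counts here are illustrative assumptions in the spirit of GoogLeNet-style modules, not figures from this text:

```python
# Rough multiply count for one 5x5 convolution on a 28x28x192 input,
# with and without the 1x1 adapter (all sizes assumed for illustration).
H = W = 28
in_depth, out_depth, reduced = 192, 32, 16

direct = H * W * out_depth * (5 * 5 * in_depth)        # 5x5 on full depth
adapted = (H * W * reduced * (1 * 1 * in_depth)        # 1x1 reduction first
           + H * W * out_depth * (5 * 5 * reduced))    # 5x5 on reduced depth

print(f"{direct:,}")   # 120,422,400 multiplies
print(f"{adapted:,}")  # 12,443,648 multiplies, roughly a 10x saving
```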
The max-pooling branch of the module introduces two problems: (1) pooling preserves the full depth of its input, so the concatenated output becomes deeper after every module, and (2) this growing depth makes each subsequent module more expensive. Applying a 1x1 convolution after the pooling path reduces its depth and addresses both, as in the sketch below.
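Putting the pieces together, here is a minimal PyTorch sketch of a dimension-reduced inception module; the class name and channel counts are illustrative choices, not canonical:

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Minimal sketch of a dimension-reduced inception module
    (GoogLeNet-style branches; channel counts are illustrative)."""

    def __init__(self, in_ch, c1, c3_red, c3, c5_red, c5, pool_proj):
        super().__init__()
        # Branch 1: plain 1x1 convolution.
        self.b1 = nn.Conv2d(in_ch, c1, kernel_size=1)
        # Branch 2: 1x1 adapter reduces depth before the 3x3 convolution.
        self.b2 = nn.Sequential(
            nn.Conv2d(in_ch, c3_red, kernel_size=1), nn.ReLU(),
            nn.Conv2d(c3_red, c3, kernel_size=3, padding=1),
        )
        # Branch 3: 1x1 adapter before the expensive 5x5 convolution.
        self.b3 = nn.Sequential(
            nn.Conv2d(in_ch, c5_red, kernel_size=1), nn.ReLU(),
            nn.Conv2d(c5_red, c5, kernel_size=5, padding=2),
        )
        # Branch 4: max-pooling followed by a 1x1 projection, so the pooled
        # path no longer forces the full input depth into the output.
        self.b4 = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, pool_proj, kernel_size=1),
        )

    def forward(self, x):
        # All branches keep the spatial size, so outputs concatenate on depth.
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

# Example: a 28x28 map with 192 channels -> 256 channels (64+128+32+32).
x = torch.randn(1, 192, 28, 28)
out = InceptionModule(192, 64, 96, 128, 16, 32, 32)(x)
print(out.shape)  # torch.Size([1, 256, 28, 28])
```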