AlexNet is a convolutional neural network (CNN) that rose to prominence when it won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), an annual challenge that aims to evaluate algorithms for object detection and image classification. The model is trained on more than a million images and can classify images into 1000 object categories.
The ILSVRC evaluates image classification solutions using two important metrics, the top-1 and top-5 errors. Each image in a set of N images, often called test images, is mapped to a target class, and for each image:
- top-1 error checks if the top predicted class is the same as the target class.
- top-5 error checks if the target class is one of the top five predictions.
For both metrics, the error is calculated as the number of times the prediction does not match the target class, divided by the total number of test images. In simpler terms, a lower score is better.
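The two metrics can be sketched in a few lines of plain Python. This is only an illustration: the function name `top_k_error` and the toy scores below are made up for the example and are not ILSVRC data or real model output.

```python
def top_k_error(scores, targets, k):
    """Fraction of images whose target class is NOT among the k highest-scoring classes."""
    misses = 0
    for image_scores, target in zip(scores, targets):
        # Class indices sorted from highest to lowest score for this image.
        ranked = sorted(range(len(image_scores)),
                        key=lambda c: image_scores[c], reverse=True)
        if target not in ranked[:k]:
            misses += 1
    return misses / len(targets)


# Toy data: 3 test images, 6 classes (hypothetical scores).
scores = [
    [0.10, 0.50, 0.20, 0.05, 0.10, 0.05],  # top prediction: class 1
    [0.30, 0.10, 0.25, 0.20, 0.10, 0.05],  # top prediction: class 0
    [0.02, 0.08, 0.10, 0.10, 0.30, 0.40],  # top prediction: class 5
]
targets = [1, 2, 0]

top1 = top_k_error(scores, targets, k=1)  # images 2 and 3 miss
top5 = top_k_error(scores, targets, k=5)  # only image 3 misses
print(f"top-1 error: {top1:.3f}, top-5 error: {top5:.3f}")
```

With these toy scores the top-1 error is 2/3 and the top-5 error is 1/3. The top-5 error can never be higher than the top-1 error, because any image counted as correct under top-1 is also correct under top-5.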
AlexNet achieved a top-5 error of around 16%, which was an extremely good result back in 2012. To put it into context, until that year no other classifier had been able to achieve results under 20%. AlexNet also beat the runner-up by more than 10 percentage points.
Since 2012, other CNNs, such as VGG and ResNet, have improved on AlexNet's performance, as illustrated in this graph.
What does AlexNet consist of?
AlexNet is made up of eight trainable layers: five convolution layers and three fully connected layers. All the trainable layers are followed by a ReLU activation function, except for the last fully connected layer, where the softmax function is used.
Besides the trainable layers, the network also has:
- Three pooling layers.
- Two normalization layers.
- One dropout layer. This is used only during training, to reduce overfitting.
This table shows the layers and their details:
| Layer | Type | Details |
|---|---|---|
| 1 | Convolution | 11x11x3x96 - Stride(4,4) - Pad(0,0) |
| 4 | Pooling | 3x3 - Stride(2,2) |
| 5 | Grouping Convolution | 5x5x96x256 - Stride(1,1) - Pad(2,2) |
| 8 | Pooling | 3x3 - Stride(2,2) |
| 9 | Convolution | 3x3x256x384 - Stride(1,1) - Pad(1,1) |
| 11 | Grouping Convolution | 3x3x384x384 - Stride(1,1) - Pad(1,1) |
| 13 | Grouping Convolution | 3x3x384x256 - Stride(1,1) - Pad(1,1) |
| 15 | Pooling | 3x3 - Stride(2,2) |
| 16 | Fully connected | 4096x9216 - Stride(1,1) - Pad(1,1) |
| 18 | Fully connected | 4096x4096 - Stride(1,1) - Pad(1,1) |
| 20 | Fully connected | 1000x4096 - Stride(1,1) - Pad(1,1) |
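As a sanity check on the table, the spatial size of the feature maps can be traced through the convolution and pooling layers with the usual formula `floor((size + 2*pad - kernel) / stride) + 1`. The sketch below rests on assumptions that the table does not state: a 227x227 input image, no padding in the pooling layers, and a stride of 1 for every 3x3 convolution. The layer names (`conv1`, `pool1`, ...) are labels chosen for this example.

```python
def output_size(size, kernel, stride, pad):
    """Spatial output size of a convolution or pooling layer (square input)."""
    return (size + 2 * pad - kernel) // stride + 1


# (name, kernel, stride, pad) for the convolution and pooling layers.
layers = [
    ("conv1", 11, 4, 0),
    ("pool1", 3, 2, 0),
    ("conv2", 5, 1, 2),
    ("pool2", 3, 2, 0),
    ("conv3", 3, 1, 1),
    ("conv4", 3, 1, 1),
    ("conv5", 3, 1, 1),
    ("pool3", 3, 2, 0),
]

size = 227  # assumed input width and height
sizes = {}
for name, kernel, stride, pad in layers:
    size = output_size(size, kernel, stride, pad)
    sizes[name] = size
    print(f"{name}: {size}x{size}")

# The last pooling layer outputs 256 feature maps of size 6x6.
flattened = size * size * 256
print("inputs to the first fully connected layer:", flattened)
```

Under these assumptions the last pooling layer produces 6x6 feature maps, and 6 x 6 x 256 = 9216, which matches the input dimension of the first fully connected layer in the table.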
Some of the convolution layers in the table are actually grouping convolutions. This is an engineering trick that lets the network be split across two GPUs and run faster without sacrificing accuracy.
If the number of groups is set to two, the first half of the filters is connected only to the first half of the input feature maps, and the second half of the filters only to the second half of the input feature maps, as this image shows.
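The connection pattern can be made concrete with a small sketch. The helper `grouped_connections` is a hypothetical name introduced for this illustration. It uses the channel counts of layer 5 in the table (96 input feature maps, 256 filters) and assumes both counts divide evenly by the number of groups.

```python
def grouped_connections(in_channels, out_channels, groups):
    """Map each filter index to the range of input feature maps it reads."""
    in_per_group = in_channels // groups
    out_per_group = out_channels // groups
    connections = {}
    for f in range(out_channels):
        g = f // out_per_group  # group this filter belongs to
        connections[f] = range(g * in_per_group, (g + 1) * in_per_group)
    return connections


conn = grouped_connections(in_channels=96, out_channels=256, groups=2)
print("filter 0   reads input maps", conn[0].start, "to", conn[0].stop - 1)
print("filter 255 reads input maps", conn[255].start, "to", conn[255].stop - 1)
```

Filters 0 to 127 read only input maps 0 to 47, and filters 128 to 255 read only input maps 48 to 95, so the two halves never exchange data inside the layer.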
The grouping convolution not only allows you to spread the workload over multiple GPUs, it also halves the number of multiply-accumulate operations (MACs) needed for the layer.
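The saving in MACs follows from a simple count: each output value needs `kernel * kernel * (in_channels / groups)` multiply-accumulates. The sketch below applies that count to layer 5 of the table, assuming a 27x27 output feature map (which in turn assumes a 227x227 input image). The function name `conv_macs` is made up for this example, and biases are ignored.

```python
def conv_macs(out_h, out_w, kernel, in_channels, out_channels, groups=1):
    """Multiply-accumulate count for one convolution layer (biases ignored)."""
    macs_per_output = kernel * kernel * (in_channels // groups)
    return out_h * out_w * out_channels * macs_per_output


# Layer 5: 5x5 kernels, 96 input maps, 256 filters, assumed 27x27 output.
dense = conv_macs(27, 27, 5, 96, 256, groups=1)
grouped = conv_macs(27, 27, 5, 96, 256, groups=2)

print(f"without grouping: {dense:,} MACs")
print(f"with two groups:  {grouped:,} MACs")
print("ratio:", dense / grouped)
```

With two groups, each filter sees 48 input maps instead of 96, so the count drops from roughly 448 million to roughly 224 million MACs for this layer.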