Introducing AlexNet

AlexNet is a convolutional neural network (CNN) that rose to prominence when it won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), an annual challenge that aims to evaluate algorithms for object detection and image classification. The model is trained on more than a million images and can classify images into 1000 object categories.

The ILSVRC evaluates the success of image classification solutions by using two important metrics: the top-1 and top-5 errors. Given a set of N test images, each labeled with a single target class, the model produces a ranked list of predicted classes for each image:

  • top-1 error checks whether the top predicted class matches the target class. 
  • top-5 error checks whether the target class appears among the top five predictions.  

For both metrics, the error is calculated as the number of images for which the prediction does not match the target class, divided by the total number of test images. In other words, a lower score is better. 
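As a concrete illustration, here is a minimal sketch of both metrics in plain Python. The class labels and ranked predictions below are made up for the example:

```python
def top_k_error(predictions, targets, k):
    """Fraction of images whose target class is not among the top-k
    predicted classes. Each entry in predictions is a list of class
    indices sorted from most to least confident."""
    misses = sum(1 for preds, target in zip(predictions, targets)
                 if target not in preds[:k])
    return misses / len(targets)

# Hypothetical ranked predictions for four test images.
predictions = [
    [3, 1, 4, 0, 2],   # target 3 -> top-1 hit
    [2, 0, 3, 1, 4],   # target 0 -> top-1 miss, top-5 hit
    [1, 2, 0, 4, 3],   # target 4 -> top-1 miss, top-5 hit
    [0, 1, 2, 3, 4],   # target 9 -> miss for both metrics
]
targets = [3, 0, 4, 9]

print(top_k_error(predictions, targets, k=1))  # 0.75
print(top_k_error(predictions, targets, k=5))  # 0.25
```

Note how the top-5 error is always less than or equal to the top-1 error, since a top-1 hit is also a top-5 hit.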

AlexNet achieved a top-5 error of around 16%, which was an extremely good result back in 2012. To put this into context, until that year no other classifier had been able to achieve results under 20%, and AlexNet beat the runner-up by more than 10 percentage points.  

Since 2012, other CNNs, such as VGG and ResNet, have improved on AlexNet's performance, as illustrated in this graph.

This graph shows ImageNet classification results over the years. 
What does AlexNet consist of?

AlexNet is made up of eight trainable layers: five convolution layers and three fully connected layers. Every trainable layer is followed by a ReLU activation function, except for the last fully connected layer, which is followed by a Softmax.
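For reference, a quick sketch of these two activation functions in plain Python (the input values below are arbitrary):

```python
import math

def relu(xs):
    # ReLU zeroes out negative activations and passes positives through.
    return [max(0.0, x) for x in xs]

def softmax(xs):
    # Subtracting the max before exp() keeps the computation numerically
    # stable; the result is a probability distribution over the classes.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

print(relu([-2.0, 0.5, 3.0]))   # [0.0, 0.5, 3.0]
probs = softmax([1.0, 2.0, 3.0])
print(sum(probs))               # 1.0 (up to floating-point rounding)
```

In AlexNet, the Softmax turns the 1000 raw outputs of the last fully connected layer into class probabilities.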

Besides the trainable layers, the network also has:

  • Three pooling layers.
  • Two normalization layers.
  • One dropout layer. This is only used during training, to reduce overfitting.
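The dropout idea can be sketched in a few lines of plain Python. This uses the "inverted" dropout formulation, which scales the surviving activations at training time; the original AlexNet paper instead scaled outputs at test time, but the effect is equivalent:

```python
import random

def dropout(xs, p=0.5, training=True, rng=None):
    # At inference time dropout is a no-op.
    if not training:
        return list(xs)
    rng = rng or random.Random(0)
    keep = 1.0 - p
    # Each activation is zeroed with probability p; survivors are
    # scaled by 1/keep so the expected activation is unchanged.
    return [x / keep if rng.random() < keep else 0.0 for x in xs]

activations = [1.0] * 8
print(dropout(activations, p=0.5))                  # mix of 0.0 and 2.0
print(dropout(activations, p=0.5, training=False))  # unchanged
```

Randomly silencing units forces the network not to rely on any single activation, which is what reduces overfitting.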

This table shows the layers and their details:

 n.  Layer                 Info 
 1   Convolution           11x11x3x96 - Stride(4,4) - Pad(0,0) 
 2   Activation            ReLU 
 3   Normalization         Cross Map, 5, 0.0001, 0.75 
 4   Pooling               3x3 - Stride(2,2) 
 5   Grouping Convolution  5x5x96x256 - Stride(1,1) - Pad(2,2) 
 6   Activation            ReLU 
 7   Normalization         Cross Map, 5, 0.0001, 0.75 
 8   Pooling               3x3 - Stride(2,2) 
 9   Convolution           3x3x256x384 - Stride(1,1) - Pad(1,1) 
 10  Activation            ReLU 
 11  Grouping Convolution  3x3x384x384 - Stride(1,1) - Pad(1,1) 
 12  Activation            ReLU 
 13  Grouping Convolution  3x3x384x256 - Stride(1,1) - Pad(1,1) 
 14  Activation            ReLU 
 15  Pooling               3x3 - Stride(2,2) 
 16  Fully connected       4096x9216 
 17  Activation            ReLU 
 18  Fully connected       4096x4096 
 19  Activation            ReLU 
 20  Fully connected       1000x4096 
 21  Softmax  
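One way to sanity-check these shapes is to count the trainable parameters, using the standard AlexNet layer sizes. In the grouped convolutions each filter only sees half of the input channels, so the in-channel term is divided by the group count. The sketch below includes bias terms; the exact published total varies slightly depending on how it is counted:

```python
# (kernel_h, kernel_w, in_channels, out_channels, groups) per conv layer;
# groups=2 marks the grouped convolutions.
conv_layers = [
    (11, 11,   3,  96, 1),   # conv1
    ( 5,  5,  96, 256, 2),   # conv2 (grouped)
    ( 3,  3, 256, 384, 1),   # conv3
    ( 3,  3, 384, 384, 2),   # conv4 (grouped)
    ( 3,  3, 384, 256, 2),   # conv5 (grouped)
]
fc_layers = [
    (9216, 4096),            # fc6
    (4096, 4096),            # fc7
    (4096, 1000),            # fc8
]

total = 0
for kh, kw, cin, cout, g in conv_layers:
    total += kh * kw * (cin // g) * cout + cout   # weights + biases
for fan_in, fan_out in fc_layers:
    total += fan_in * fan_out + fan_out           # weights + biases

print(f"{total:,}")  # 60,965,224 -- roughly 61 million parameters
```

Note how the three fully connected layers account for the vast majority of the parameters, even though most of the computation happens in the convolutions.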

Some of the convolution layers in the table are grouping convolutions. This is an efficient engineering trick that allowed the network to be split across two GPUs, accelerating training without sacrificing accuracy.

If the group size is set to two, the first half of the filters is connected to the first half of the input feature maps, and the second half of the filters is connected to the second half of the input feature maps, as this image shows.

This diagram shows how the input and output tensors are split into groups. 

The grouping convolution not only lets you spread the workload over multiple GPUs, it also halves the number of multiply-accumulate operations (MACs) needed for the layer.
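This saving can be checked with a little arithmetic. Each output element of a convolution needs kernel_h x kernel_w x in_channels multiply-accumulates, and with g groups each filter only sees in_channels/g inputs. The layer shapes below follow the second convolution in the table; the 27x27 output size is an assumption for the example:

```python
def conv_macs(out_h, out_w, out_c, k, in_c, groups=1):
    # Each output element needs k*k*(in_c/groups) multiply-accumulates.
    return out_h * out_w * out_c * k * k * (in_c // groups)

# Second convolution from the table: 5x5 kernels, 96 -> 256 channels,
# producing a 27x27 feature map (spatial size assumed for illustration).
ungrouped = conv_macs(27, 27, 256, 5, 96, groups=1)
grouped   = conv_macs(27, 27, 256, 5, 96, groups=2)

print(ungrouped, grouped)
print(ungrouped // grouped)  # 2 -- grouping with g=2 halves the MACs
```

The same argument shows that g groups reduce the MACs by a factor of g, since only the in-channel term changes.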
