Adversarial attacks and the robustness measure

Machine learning is applied in industries where it is crucial for algorithms to be reliable. (You don’t want your self-driving car to be easy to fool!) Attackers can undermine the integrity of neural networks by causing them to misclassify certain inputs at test time.

An adversarial perturbation is a very small, carefully crafted change that attackers make to the test data in order to degrade the target classifier’s accuracy.

Below is an example of an adversarial perturbation against our neural net, whose job is to recognize handwritten digits from the MNIST dataset. Although both images look like the number “2” to humans, our neural net, which has 98% test accuracy on clean images, thinks the picture on the right is the number “1”.


There are many ways of conducting adversarial attacks. One of the most popular is the Fast Gradient Sign Method (FGSM). It is an efficient attack that perturbs each input in the direction of the sign of the gradient of the loss with respect to that input.

    # Apply the FGSM perturbation: step each pixel by epsilon
    # in the direction of the sign of the loss gradient w.r.t. the input x
    perturbation = tf.sign(tf.gradients(loss, x))
    perturbed_op = tf.squeeze(epsilon * perturbation) + images
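To make the idea concrete, here is a self-contained sketch of FGSM on a toy logistic-regression model, independent of the TensorFlow snippet above. The weights and inputs are made up for illustration; for this model the loss gradient with respect to the input can be written in closed form.

```python
import math

# Toy logistic model: p = sigmoid(w . x + b)  (weights chosen for illustration)
w = [2.0, -3.0]
b = 0.5

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def fgsm(x, y, epsilon):
    # For sigmoid + cross-entropy, the loss gradient w.r.t. the input is (p - y) * w
    p = predict(x)
    grad = [(p - y) * wi for wi in w]
    # Step each feature by epsilon in the direction of the gradient's sign
    sign = lambda g: (g > 0) - (g < 0)
    return [xi + epsilon * sign(g) for xi, g in zip(x, grad)]

x = [1.0, 1.0]                      # original input, true label y = 1
x_adv = fgsm(x, y=1, epsilon=0.1)
print(predict(x), predict(x_adv))   # confidence in the true label drops
```

The same idea scales to images: each pixel moves by at most epsilon, so the perturbed image stays visually close to the original while the loss increases.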

There have also been many attempts (both failed and successful) in the research community to defend against these adversarial perturbations, i.e., to make classifiers more “robust”. In this post, the robustness of a classifier refers to its ability to resist perturbations from the FGSM attack.
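Concretely, the robustness measure used here is just the classifier’s accuracy on FGSM-perturbed test inputs. A minimal sketch of computing it (the 1-D threshold “classifier” and sign-step “attack” below are toy placeholders for illustration):

```python
def perturbed_accuracy(model, attack, xs, ys, epsilon):
    """Fraction of examples still classified correctly after an attack.

    `model(x)` returns a predicted label; `attack(x, y, epsilon)` returns a
    perturbed input. Both stand in for a real classifier and a real attack.
    """
    correct = sum(model(attack(x, y, epsilon)) == y for x, y in zip(xs, ys))
    return correct / len(xs)

# Toy demo: a 1-D threshold "classifier" and an attack that pushes
# each point toward the decision boundary at 0
model = lambda x: int(x > 0)
attack = lambda x, y, eps: x - eps if y == 1 else x + eps
xs, ys = [0.3, -0.4, 0.05, -0.02], [1, 0, 1, 0]
print(perturbed_accuracy(model, attack, xs, ys, 0.1))  # 0.5
```

Only the examples far from the decision boundary survive the attack, which is exactly the behavior FGSM exploits in high dimensions.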

Overfitting vs. Adversarial Perturbation

An overfitted classifier behaves similarly to a classifier facing adversarial perturbations. Overfitting is when the classifier performs well on the training data but badly on the test set.

In the case of adversarial attacks, the neural net performs well on the training set and the current test set, but badly on the perturbed test data, which is often indistinguishable from the original test data to the human eye.

In overfitting, the classifier memorizes the training space in a way that does not generalize to the test space. In adversarial attack scenarios, the classifier fits the normal space (training space + test space) in a way that does not generalize to the perturbed space.

So an interesting question is: what happens to the robustness of a classifier if we intentionally overfit it? (A classifier is robust if it generalizes well in the perturbed space.)

The intuition was: if a classifier does not generalize in the test space, it should not generalize in the perturbed space either, because the perturbed space is farther from the training space than the normal test space is. Overfitting a classifier should therefore make it less robust to adversarial perturbations. However, the results from my experiments say the exact opposite: the overfitted classifier appears more robust to FGSM perturbations.

Experiment Design

# Hidden layer 1
with tf.variable_scope("layer1"):
    w1 = tf.get_variable("w", [input_dimension, l1_units], dtype=tf.float64,
                         initializer=tf.contrib.layers.xavier_initializer(seed=1))
    b1 = tf.get_variable("b", [l1_units], dtype=tf.float64,
                         initializer=tf.zeros_initializer())
    z1 = tf.matmul(x, w1) + b1
    y1 = tf.nn.relu(z1)

In the experiment, we compare the classifier’s test accuracy on perturbed data at various levels of overfitting. We repeat the experiment 5 times and average the accuracies to observe the general trend.

The classifier setup: we developed a simple 3-layer classifier for the MNIST dataset, with 200, 300, and 10 neurons in the respective layers. We used ReLU as the activation function for the intermediate layers and softmax for the output layer. We minimized the cross-entropy loss with TensorFlow’s Adam optimizer at a learning rate of 0.002, and initialized the weights with the Xavier initializer and the biases with the zero initializer.
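As a quick sanity check on the size of this network (assuming flattened 28×28 = 784-dimensional MNIST inputs, which is the standard setup):

```python
# Layer widths from the setup above; 784 is the flattened MNIST input size
sizes = [784, 200, 300, 10]

# Each fully connected layer has n_in * n_out weights plus n_out biases
params = sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))
print(params)  # 220310 trainable parameters
```

A network this small can still reach ~98% test accuracy on MNIST, which is why it makes a convenient testbed for the overfitting experiments.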


We overfit the classifier by training it for 181 epochs, saving a snapshot every 20 epochs, which gives us 10 classifiers trained for 1, 21, 41, 61, …, 181 epochs. By graphing the training and validation losses, we know that overfitting starts occurring before the 25th epoch. We then compare the 10 models’ accuracy on the perturbed test data.
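The checkpoint schedule above can be written as a one-liner (assuming, as in the text, that the first snapshot is taken after epoch 1):

```python
# Snapshot after epoch 1, then every 20 epochs up to 181
checkpoints = list(range(1, 182, 20))
print(checkpoints)  # [1, 21, 41, 61, 81, 101, 121, 141, 161, 181]
```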

Experiment Results

[Graphs: perturbed test accuracy vs. training epochs; left panel covers epochs 1–81, right panel covers epochs 101–181]

The graphs show the relationship between the number of training epochs and perturbed accuracy under an FGSM attack with epsilon = 0.05. We notice that the perturbed accuracy goes up (the classifier becomes more robust) as we train the classifier for longer. This is the opposite of our intuition.

For this MNIST classifier, overfitting actually made the classifier more robust against FGSM attacks. Note that due to the nature of the MNIST dataset, overfitting does not lead to a significant loss of test accuracy. It would be interesting to see whether this result replicates on other, larger datasets.

If you have any hypothesis about why these counterintuitive results happened, please comment or contact me. I’d also love to discuss this further and conduct more experiments.

The code for the experiments can be found here.