Transfer learning is a machine learning research problem that focuses on storing knowledge gained while solving one problem and applying it to another related problem. For example, knowledge gained while learning to recognize different species of dogs can be useful for classifying cats as well.
Yosinski et al. discovered that transfer learning can also help the classifier generalize better. For example, a dog classifier trained with transfer learning tends to produce higher test accuracy on dog images it has never seen at training time. We refer to this effect as the “generalization boost”. However, as that paper’s focus was on quantifying how general or specific the features at different layers are, the authors only spent a few sentences on this phenomenon.
This discovery is worth a closer look. If the result holds more broadly, it could be a useful technique for improving the performance of deep neural networks.
In this post, we will first go over some basic concepts of transfer learning. Then, we will elaborate on the experiments that discovered the generalization boost. Finally, we will come up with a few hypotheses on what caused this improvement in generalization, and design a set of experiments to verify our hypotheses.
Transfer Learning Basics
This section briefly summarizes transfer learning and two ways for fine tuning weights after the transfer: by freezing the weights of the transferred layers or fine tuning all the weights with labeled data. If you’re familiar with these concepts, feel free to skip to the next section on the discovery of the “generalization boost”.
Deep ConvNets have achieved human-level performance on large-scale image classification tasks, thanks to researchers’ hard work in creating large datasets like ImageNet.
However, we don’t always have a large dataset available for the task at hand. For example, if I were to make a dog breed classifier, I would have to rely on a smaller dataset such as the Stanford Dogs Dataset, which only has about 20k images (far fewer than the 14 million images in ImageNet).
The good news is, part of what a ConvNet learns is generally applicable to other tasks. For example, when trained on images, the features captured by the first layer of ConvNets tend to resemble Gabor filters and edge detectors. The image below visualizes some of the features learned by the first layer of a ConvNet. These features are generally useful both for large-scale image classification and for my dog breed classifier.
This idea naturally arises: can we transfer a model’s knowledge about task A (large-scale image classification) to solve a different task B (dog breed classification), when only a small labeled dataset for B is available?
Absolutely! In 2014, numerous papers achieved state-of-the-art results on a variety of vision tasks by transferring the knowledge from another dataset.
- CNN features off-the-shelf: An astounding baseline for recognition, A. Razavian et al.
- Learning and transferring mid-level image representations using convolutional neural networks, M. Oquab et al.
- DeCAF: A deep convolutional activation feature for generic visual recognition, J. Donahue et al.
The “knowledge transfer” is done by first training a network on base task A. We call this our base network. Then we copy its first n layers to the first n layers of the target network, which is for solving task B, the problem at hand. The remaining layers of the target network are initialized randomly. Then we train the target network with the limited amount of target labels.
We can either leave the weights of the first n layers frozen and only update the remaining weights, or fine tune all the weights in the network. The latter approach results in better training performance, because it gives the network more flexibility to fit arbitrary distributions. However, it may also result in overfitting if the amount of labeled data for the target task is very small.
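The copy-then-train procedure and the two fine tuning options can be sketched in a few lines of plain Python. This is only a toy illustration: each layer is a dict holding a flat list of weights and a `trainable` flag, and the names `random_layer` and `transfer` are hypothetical helpers, not part of any library. A real implementation would use a deep learning framework and would typically express freezing by excluding the copied layers from gradient updates.

```python
import copy
import random


def random_layer(n_weights, rng):
    """Return a freshly initialized toy layer (hypothetical representation)."""
    return {
        "weights": [rng.gauss(0.0, 0.01) for _ in range(n_weights)],
        "trainable": True,
    }


def transfer(base_net, n, freeze, rng):
    """Build a target network from a trained base network.

    The first n layers are copied from base_net and the remaining layers
    are re-initialized randomly. If freeze is True, the copied layers are
    marked as not trainable, so only the new layers would be updated.
    """
    target = []
    for i, layer in enumerate(base_net):
        if i < n:
            copied = copy.deepcopy(layer)
            copied["trainable"] = not freeze
            target.append(copied)
        else:
            target.append(random_layer(len(layer["weights"]), rng))
    return target


rng = random.Random(0)
# Stand-in for a network already trained on task A (a 4-layer toy, assumed).
base_net = [random_layer(8, rng) for _ in range(4)]

frozen = transfer(base_net, n=2, freeze=True, rng=rng)
fine_tuned = transfer(base_net, n=2, freeze=False, rng=rng)

print([layer["trainable"] for layer in frozen])      # copied layers are frozen
print([layer["trainable"] for layer in fine_tuned])  # every layer is trainable
```

In both variants the first two layers start from the base network's weights. The only difference is whether training on task B is allowed to change them.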
In general, if we have a large amount of labeled data available for the task at hand, transfer learning should not be needed. If we have a very small amount of data, we want to keep the weights of the transferred layers frozen to avoid overfitting. If it’s somewhere in the middle, we transfer the layers and then fine tune all the weights.
Yosinski et al.’s “How transferable are features in deep neural networks?” found that transfer learning boosts generalization accuracy even after substantial fine tuning on a large target dataset. Let’s first look at what the authors observed that brought them to this conclusion.
In their experiments, the 1000 ImageNet classes were randomly split into two groups, each containing 500 classes and approximately half the data. Image classification on the first group is referred to as base task A, and classification on the second group is referred to as target task B.
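A random class split of this kind can be sketched as follows. The function name `split_classes`, the seed, and the use of integer class ids 0 to 999 are assumptions for illustration. The paper's actual split is not reproduced here.

```python
import random


def split_classes(num_classes=1000, seed=0):
    """Randomly split class ids into two disjoint halves (tasks A and B)."""
    rng = random.Random(seed)
    class_ids = list(range(num_classes))
    rng.shuffle(class_ids)
    half = num_classes // 2
    return sorted(class_ids[:half]), sorted(class_ids[half:])


task_a, task_b = split_classes()
print(len(task_a), len(task_b))
```

Because the split is over classes rather than images, no class appears in both tasks, so the two networks never share labels.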
They first trained an 8-layer ConvNet A on task A, and transferred its first N (1 <= N <= 7) layers to Transferred Network B, which has the same architecture, while randomly initializing the remaining layers. Transferred Network B was then fine tuned by retraining on target dataset B. The number of retraining epochs was not specified, but the authors said “It is surprising that this effect lingers through so much retraining”.
They then compared the test accuracy of Transferred Network B with that of another ConvNet of the same structure that was trained only on target dataset B. They observed that Transferred Network B had a 1-2% higher test accuracy. When more layers from A were transferred, higher test accuracy was achieved.
I am curious about what caused this generalization boost. In this post, I will list some of my speculations, and I will include relevant experiments in a later post.
Longer Training Time
The authors thought the performance improvement could plausibly be attributed to the longer total training time.
They created another network, BnB+, that follows the same training procedure as Transferred Network B. BnB+ was first trained on dataset B; its later layers were then randomly re-initialized, and the whole network was retrained on dataset B. BnB+ was trained for the same amount of time as Transferred Network B but never observed any data from dataset A.
BnB+ did not show a performance improvement. The authors concluded that the longer training time does not account for the generalization boost.
Closer to Better Local Minima
We can think of transfer learning as a weight initialization technique based on the knowledge of another network trained on a similar task. The loss function of an image classification network is not convex. Therefore, gradient descent might get the network stuck in a local minimum of its loss function.
Not all local minima are equal. In the diagram above, local min 2 is better than local min 1 because it results in a lower loss. We think the weights initialized via transfer learning might make it easier for the network to reach a better local minimum than some randomly initialized weights, therefore resulting in better performance.
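To make the intuition concrete, here is a minimal one-dimensional sketch of how the starting point decides which local minimum plain gradient descent ends up in. The toy loss `x**4 - 4*x**2 + x`, the learning rate, and the two starting points are arbitrary choices for illustration and have nothing to do with a real network's loss surface.

```python
def loss(x):
    """A toy non-convex loss with two local minima of different depth."""
    return x**4 - 4 * x**2 + x


def grad(x):
    """Derivative of the toy loss."""
    return 4 * x**3 - 8 * x + 1


def gradient_descent(x0, lr=0.01, steps=500):
    """Run plain gradient descent from the initial point x0."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x


x_left = gradient_descent(-2.0)   # one initialization
x_right = gradient_descent(2.0)   # another initialization

print(f"start -2.0 -> x = {x_left:.3f}, loss = {loss(x_left):.3f}")
print(f"start +2.0 -> x = {x_right:.3f}, loss = {loss(x_right):.3f}")
```

Both runs converge, but to different minima, and the one on the left has the lower loss. Under this hypothesis, transferred weights would play the role of the luckier starting point.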
The better performance due to better weight initialization should not be limited to the generalization accuracy. Unfortunately, the authors did not report training accuracy in the paper. But if this hypothesis were true, we should expect the training accuracy of Transferred Network B to be higher as well (equivalently, its training loss to be lower).
Regularized Weights Initialization
It is very likely that transferring from task A makes the network less likely to overfit on B due to having seen more training data.
Without transfer learning, only the target dataset is available to the network. With transfer learning, the network as a whole has seen training data from the source dataset as well as the target dataset. This acts as a form of regularization, and the effect shows up as better generalization accuracy.
If this hypothesis were true, we would observe a larger gap between training and test accuracy after fine tuning on the target dataset when the network is initialized randomly than when it is initialized via transfer learning. The target network with transfer learning would also likely perform better on the base dataset than a randomly initialized network.
Additionally, we would expect that transferring from a more diverse task A or from a larger dataset A would result in a larger generalization boost, because this would initialize the network weights in a region that solves more similar tasks. This initialization makes overfitting on the target task less likely.
We want to design an experiment that shows whether the generalization boost is due to “Closer to Better Local Minima”, “Regularized Weights Initialization” or a combination of both. Our intuition is that “Regularized Weights Initialization” is the major cause.
To confirm or refute our intuition, we first need to reproduce the generalization boost phenomenon. We will use the CIFAR-10 image dataset for faster iterations and lower cost. We will split its classes into 3 groups: group A has 4 classes, group B has 4 classes, and group C has 2 classes.
We will train a small ConvNet with two conv layers and one fully connected layer on group A, transfer its first conv layer to the target task B, and retrain on group B. We call this Network Transferred.
We will train a ConvNet of the same structure on group B directly, keep only the weights of the first layer, and retrain on group B again. Note that this network has only seen group B. We call this Network Non-transferred.
We will create a new dataset which contains all of group C and 2 classes from group A. We will train a transferred network, transferring from this new dataset to the target dataset group B. We call this Network Transferred-Diverse.
We will train a transferred network, transferring from all the data in A and C to the target dataset group B. We call this Network Transferred-More-Data.
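The class groups and the source dataset of each network can be laid out as in the sketch below. The ten CIFAR-10 class names are the standard ones. The random assignment of classes to groups, the seed, and the choice of which 2 classes of group A go into the Transferred-Diverse source are assumptions, since the post does not fix them.

```python
import random

CIFAR10_CLASSES = [
    "airplane", "automobile", "bird", "cat", "deer",
    "dog", "frog", "horse", "ship", "truck",
]


def make_groups(classes, seed=0):
    """Randomly split the 10 classes into groups A (4), B (4) and C (2)."""
    rng = random.Random(seed)
    shuffled = list(classes)
    rng.shuffle(shuffled)
    return shuffled[:4], shuffled[4:8], shuffled[8:]


group_a, group_b, group_c = make_groups(CIFAR10_CLASSES)

# Classes each network is first trained on, before retraining on group B.
source_classes = {
    "Transferred": group_a,
    "Non-transferred": group_b,
    "Transferred-Diverse": group_c + group_a[:2],  # which 2 of A is assumed
    "Transferred-More-Data": group_a + group_c,
}
target_classes = group_b

for name, classes in source_classes.items():
    print(f"{name}: source = {classes}, target = {target_classes}")
```

Note that the Transferred-Diverse source has the same number of classes as group A, while the Transferred-More-Data source has 6 classes and therefore more training images.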
1. We compare the training loss of Network Transferred and Network Non-transferred. We expect the values to be very close.
2. We plot the training and validation accuracy for the trained Network Transferred and the trained Network Non-transferred. We expect the gap between training and validation accuracy to be smaller for Network Transferred and larger for Network Non-transferred.
3. We expect Network Transferred-Diverse to have better generalization accuracy than Network Transferred.
4. We expect Network Transferred-More-Data to have the best generalization accuracy.
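Once the four networks are trained, the four expectations above could be checked mechanically, for instance as in this sketch. The helper names and the tolerance `loss_tol` for "very close" are hypothetical. The numbers in `placeholder_results` are made-up placeholders that only exercise the code; they are not experimental results.

```python
def generalization_gap(train_acc, val_acc):
    """Gap between training and validation accuracy (larger = more overfit)."""
    return train_acc - val_acc


def check_expectations(results, loss_tol=0.05):
    """Evaluate the four expectations on a dict of per-network metrics.

    loss_tol is an assumed threshold for what counts as "very close".
    """
    t = results["Transferred"]
    n = results["Non-transferred"]
    d = results["Transferred-Diverse"]
    m = results["Transferred-More-Data"]
    best_val = max(r["val_acc"] for r in results.values())
    return {
        "1. training losses are close":
            abs(t["train_loss"] - n["train_loss"]) <= loss_tol,
        "2. transfer has the smaller gap":
            generalization_gap(t["train_acc"], t["val_acc"])
            < generalization_gap(n["train_acc"], n["val_acc"]),
        "3. Diverse beats Transferred": d["val_acc"] > t["val_acc"],
        "4. More-Data is best": m["val_acc"] == best_val,
    }


# Made-up placeholder values, NOT measurements from any experiment.
placeholder_results = {
    "Transferred": {"train_loss": 0.30, "train_acc": 0.90, "val_acc": 0.85},
    "Non-transferred": {"train_loss": 0.31, "train_acc": 0.91, "val_acc": 0.80},
    "Transferred-Diverse": {"train_loss": 0.30, "train_acc": 0.90, "val_acc": 0.86},
    "Transferred-More-Data": {"train_loss": 0.29, "train_acc": 0.91, "val_acc": 0.88},
}

checks = check_expectations(placeholder_results)
for name, passed in checks.items():
    print(f"{name}: {'holds' if passed else 'does not hold'}")
```

If expectation 1 holds while 2 to 4 also hold, that would favor “Regularized Weights Initialization” over “Closer to Better Local Minima”, since a better minimum should also show up as a lower training loss.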
This post aims to explain the concept of the generalization boost in transfer learning, and offers some of my speculations on why the boost was observed. A follow-up post will present the results of my experiments and discuss which hypotheses are closer to reality. Please feel free to comment below or email me if you have some thoughts on this topic.