On The Power of Curriculum Learning in Training Deep Networks
Hacohen and Weinshall, 2019
Summary
- Various, sometimes seemingly contradictory, methods for curriculum learning in CNNs have shown empirical benefits
- Analyzes the effect of curriculum learning by studying different scoring and pacing functions
- Provides theoretical evidence that curriculum learning changes the optimization landscape, but not the global minimum
- Links: [ website ] [ pdf ]
Background
- Taking inspiration from formal education, curriculums have been applied to DNNs
- The intuition is that presenting easier examples first helps the learner
- Creating a curriculum involves addressing two challenges
- Arranging the content in a way that reflects its difficulty
- Presenting the content at an appropriate pace
- In contrast to curriculum learning, which ranks examples with respect to a target hypothesis, teacher-student methods use the learner's current hypothesis
- Methods like hard data mining and boosting prefer more difficult examples with respect to the current hypothesis
Methods
- Curriculum learning attempts to leverage prior information about the difficulty of training examples
- Scoring function: specifies the difficulty of any given example
- Transfer scoring: confidence score from classifier trained on top of pre-trained ImageNet features
- Self-taught scoring: confidence score from network trained using vanilla method
- Pacing function: determines the sequence of data subsets from which batches of examples are sampled – limit to monotonically increasing staircase functions
- Fixed exponential pacing: steps of fixed length, with the subset size increasing exponentially at each step
- Varied exponential pacing: varying step length
- Single step pacing: single step staircase
Results
- Baselines:
- Anti-curriculum: training examples are sorted in descending order of difficulty
- Random curriculum: uses random scoring function
- Vanilla: uniformly sample mini-batches from whole dataset
- Case 1: moderate-sized network trained on 5 classes (same super-class) from CIFAR-100
- Curriculum learning (transfer scoring, fixed exponential pacing) learns faster and better than the other methods
- Anti-curriculum is the worst, with random and vanilla in the middle
- Self-taught scoring has similar performance to transfer scoring, but self-paced learning (using the current hypothesis) reduces the test accuracy
- Single step pacing does as well as fixed exponential pacing, which is surprising since it only uses a small fraction of the easiest examples
- Advantage of curriculum learning is larger when task is harder, based on using different super-classes
- Case 2 and 3: moderate-sized network on CIFAR-10 and CIFAR-100
- Like before, curriculum learning has a larger effect for CIFAR-100
- Case 4 and 5: large VGG-based network on CIFAR-10 and CIFAR-100
- Curriculum learning (transfer scoring, varied exponential pacing) gives smaller benefit, possibly because of larger network
- Case 6: moderate-sized network on 7 classes of cats from ImageNet
Conclusion
- They show few meaningful empirical differences between the different scoring and pacing functions, and the empirical benefit of curriculum learning had already been demonstrated in various contexts
- The datasets they use are all very simple, ranging from 3k-60k images and 5-100 classes