Alex Graves - Automated Curriculum Learning for Neural Networks (2017)
Created: June 23, 2017 / Updated: November 2, 2024 / Status: finished / 4 min read (~682 words)
- Neural architecture search where picking the right action is seen as a multi-armed bandit
- One reason for the slow adoption of curriculum learning is that its effectiveness is highly sensitive to the mode of progression through the tasks
- One popular approach is to define a hand-chosen performance threshold for advancement to the next task, along with a fixed probability of returning to earlier tasks, to prevent forgetting
- We propose to treat the decision about which task to study next as a stochastic policy, continuously adapted to optimize some notion of learning progress
- We consider supervised or unsupervised learning problems where target sequences $b^1, b^2, \ldots$ are conditionally modelled given their respective input sequences $a^1, a^2, \ldots$
- We suppose that the targets are drawn from a finite set $\mathcal{B}$
- As is typical for neural networks, sequences may be grouped together in batches $(b^{1:B}, a^{1:B})$ to accelerate training
- The conditional probability output by the model is
  $$p(b^{1:B} \mid a^{1:B}) = \prod_{i,j} p(b^i_j \mid b^i_{1:j-1}, a^i_{1:j-1})$$
- We consider each batch as a single example $x$ from $\mathcal{X} := (\mathcal{A} \times \mathcal{B})^N$, and write $p(x) := p(b^{1:B} \mid a^{1:B})$ for its probability
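- A toy numerical sketch of this factorization (with random placeholder probabilities standing in for model outputs, and arbitrary batch size and sequence length):

```python
import numpy as np

# Placeholder per-token conditional probabilities p(b^i_j | b^i_{1:j-1}, a^i_{1:j-1})
# for a batch of B sequences of length T; a real model would produce these
B, T = 4, 6
cond_probs = np.random.uniform(0.1, 1.0, size=(B, T))

# Log-probability of the whole batch treated as a single example x:
# the sum of the per-token conditional log-probabilities over i and j
log_p_x = np.log(cond_probs).sum()
```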
- A task is a distribution $D$ over sequences from $\mathcal{X}$
- A curriculum is an ensemble of tasks $D_1, \ldots, D_N$
- A sample is an example drawn from one of the tasks of the curriculum
- A syllabus is a time-varying sequence of distributions over tasks
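- A minimal sketch of how these definitions could map onto code; the copy tasks of increasing length are an illustrative choice, not the tasks used in the paper:

```python
import random

def make_copy_task(length):
    """A task: a distribution over (input, target) sequence pairs."""
    def sample():
        seq = [random.randint(0, 9) for _ in range(length)]
        return seq, seq  # copy task: the target equals the input
    return sample

# A curriculum: an ensemble of N tasks (here, copy tasks of growing length)
curriculum = [make_copy_task(n) for n in (2, 4, 8)]

def draw_sample(task_weights, curriculum):
    """A syllabus gives a distribution over tasks at each step; here we
    draw a task index from that distribution, then a sample from the task."""
    idx = random.choices(range(len(curriculum)), weights=task_weights)[0]
    return idx, curriculum[idx]()

idx, (inp, target) = draw_sample([0.2, 0.3, 0.5], curriculum)
```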
- We consider two related settings:
- The multiple tasks setting, where the goal is to perform as well as possible on all tasks in the ensemble
- The target task setting, where the goal is to minimize the loss on the final task, with the other tasks acting as a series of stepping stones to the real problem
- We view a curriculum containing N tasks as an N-armed bandit, and a syllabus as an adaptive policy which seeks to maximize payoffs from this bandit
- In the bandit setting, an agent selects a sequence of arms (actions) $a_1, \ldots, a_T$ over $T$ rounds of play ($a_t \in \{1, \ldots, N\}$)
- After each round, the selected arm yields a payoff $r_t$; the payoffs for the other arms are not observed
- The classical algorithm for adversarial bandits is Exp3, which uses multiplicative weight updates to guarantee low regret with respect to the best arm
- On round $t$, the agent selects an arm stochastically according to a policy $\pi_t$. This policy is defined by a set of weights $w_{t,i}$:
  $$\pi_t^{\text{EXP3}}(i) := \frac{e^{w_{t,i}}}{\sum_{j=1}^N e^{w_{t,j}}}$$
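- A minimal sketch of Exp3 built around this softmax policy, assuming rewards rescaled to $[0, 1]$; the uniform exploration term and the importance-weighted update are the standard ones, and the constants are illustrative:

```python
import numpy as np

class Exp3:
    def __init__(self, n_arms, gamma=0.1):
        self.n = n_arms            # number of tasks (arms)
        self.gamma = gamma         # exploration rate
        self.w = np.zeros(n_arms)  # weights w_{t,i}

    def policy(self):
        # Softmax of the weights, mixed with a uniform exploration term
        e = np.exp(self.w - self.w.max())  # subtract max for numerical stability
        return (1 - self.gamma) * e / e.sum() + self.gamma / self.n

    def draw(self):
        return np.random.choice(self.n, p=self.policy())

    def update(self, arm, reward):
        # Importance-weighted reward estimate -> multiplicative weight update
        self.w[arm] += self.gamma * reward / (self.n * self.policy()[arm])

bandit = Exp3(n_arms=3)
arm = bandit.draw()      # which task to train on next
bandit.update(arm, 0.7)  # reward = observed learning progress, rescaled to [0, 1]
```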
- Ideally we would like the policy to maximize the rate at which we minimize the loss, and the reward should reflect this rate
- However, it is usually computationally undesirable or even impossible to measure the effect of a training sample on the target objective, so we turn instead to surrogate measures of progress
- Two types of measures:
- Loss-driven, in the sense that they equate reward with a decrease in some loss
- Complexity-driven, in the sense that they equate reward with an increase in model complexity
- According to the Minimum Description Length (MDL) principle, increasing the model complexity by a certain amount is only worthwhile if it compresses the data by a greater amount
- We would therefore expect the complexity to increase most in response to the training examples from which the network is best able to generalize
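- One concrete loss-driven measure of the first type above is the decrease in a sample's loss after training on it (the paper's prediction gain); a minimal sketch with a placeholder model, optimizer, and data shapes:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)                          # placeholder model
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

def prediction_gain(x, y):
    """Loss on the batch before a gradient step on it, minus loss after."""
    loss_before = loss_fn(model(x), y)
    opt.zero_grad()
    loss_before.backward()
    opt.step()
    with torch.no_grad():
        loss_after = loss_fn(model(x), y)
    return (loss_before - loss_after).item()

x = torch.randn(8, 10)          # a batch drawn from the selected task
y = torch.randint(0, 2, (8,))
reward = prediction_gain(x, y)  # fed to the bandit as the payoff r_t
```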
- We note that uniformly sampling from all tasks is a surprisingly strong benchmark. We speculate that this is because learning is dominated by gradients from the tasks on which the network is making fastest progress, inducing a kind of implicit curriculum, albeit with the inefficiency of unnecessary samples
- Graves, Alex, et al. "Automated Curriculum Learning for Neural Networks." arXiv preprint arXiv:1704.03003 (2017).