Bradly Stadie - Third-Person Imitation Learning (2017)

Created: May 26, 2017 / Updated: November 2, 2024 / Status: finished / 2 min read (~361 words)
Machine learning

  • Third-person learning is posed as the adversarial challenge of replicating an expert trace (a sequence of states/actions) that is considered optimal under some unknown reward function

  • One of the major weaknesses of RL is the need to manually specify a reward function
  • This weakness is addressed by the field of Inverse Reinforcement Learning (IRL)
    • Given a set of expert trajectories, IRL algorithms produce a reward function under which these expert trajectories enjoy the property of optimality
  • While IRL algorithms are appealing, they impose the somewhat unrealistic requirement that the demonstrations should be provided from the first-person point of view with respect to the agent
  • The high-level idea is to introduce an optimization problem under which we can recover both a domain-agnostic representation of the agent's observations and a cost function that uses this domain-agnostic representation to capture the essence of the expert trajectories (a sketch follows this list)
  • The approach uses Trust Region Policy Optimization
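
A minimal sketch of the domain-confusion idea referenced above, assuming a PyTorch implementation; the class names, layer sizes, and the lam scaling are illustrative, not taken from the paper:

```python
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; flips (and scales) the gradient on the backward pass."""

    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None


class FeatureExtractor(nn.Module):
    """Maps raw observations from either viewpoint into a shared feature space."""

    def __init__(self, obs_dim, feat_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, feat_dim)
        )

    def forward(self, obs):
        return self.net(obs)


class DomainClassifier(nn.Module):
    """Predicts whether a feature came from the expert's or the imitator's domain.
    Because its input passes through GradReverse, training it pushes the feature
    extractor toward domain-agnostic (domain-confusing) features."""

    def __init__(self, feat_dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, feat, lam=1.0):
        return self.net(GradReverse.apply(feat, lam))
```

An expert-vs-novice discriminator head on the same features would then supply the imitation cost, and the policy is improved against that cost (with TRPO, as noted above).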

  • In the (first-person) imitation learning setting, we are not given the reward function. Instead, we are given traces (i.e., sequences of states) traversed by an expert acting according to an unknown policy πE. The goal is to find a policy πθ that performs as well as the expert with respect to the unknown reward function
  • Concretely, find a policy πθ that makes it impossible for a discriminator to distinguish states visited by the expert from states visited by the imitator agent (a sketch of this discriminator game follows below)
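
A minimal sketch of this distinguishability game, assuming a PyTorch implementation; the network shape and the reward form mentioned in the comment are illustrative assumptions, not the paper's exact choices:

```python
import torch
import torch.nn as nn


class Discriminator(nn.Module):
    """Binary classifier over states: expert (label 1) vs. imitator (label 0)."""

    def __init__(self, state_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, states):
        return self.net(states)  # logits; higher means "looks like the expert"


def discriminator_loss(disc, expert_states, novice_states):
    """Binary cross-entropy for the distinguishability game.
    The imitator can then be rewarded for fooling the discriminator,
    e.g. r(s) = -log(1 - sigmoid(disc(s)))."""
    bce = nn.BCEWithLogitsLoss()
    ones = torch.ones(expert_states.shape[0], 1)
    zeros = torch.zeros(novice_states.shape[0], 1)
    return bce(disc(expert_states), ones) + bce(disc(novice_states), zeros)
```

Training would alternate between fitting the discriminator on fresh rollouts and improving πθ against the reward the discriminator implies (TRPO in the paper).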

  • Formally, the third-person imitation learning problem can be stated as follows
  • Suppose we are given two Markov Decision Processes MπE and Mπθ
  • Suppose further there exists a set of traces ρ = {(s_1, …, s_n)}_{i=0}^{n} which were generated under a policy πE acting optimally under some unknown reward RπE
  • In third-person imitation learning, one attempts to recover, via the proxy ρ, a policy πθ = f(ρ) which acts optimally with respect to Rπθ (restated below)
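
A LaTeX transcription of the two statements above, with explicit indices; nothing beyond the bullet points is added:

```latex
% Expert traces generated in the expert's MDP M_{\pi_E}
\rho = \{(s_1, \dots, s_n)\}_{i=0}^{n},
\qquad \text{each trace generated by } \pi_E
\text{ acting optimally w.r.t.\ an unknown } R_{\pi_E}

% Third-person imitation: recover, via the proxy \rho, a policy for M_{\pi_\theta}
\pi_\theta = f(\rho)
\qquad \text{such that } \pi_\theta \text{ acts optimally w.r.t.\ } R_{\pi_\theta}
```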

  • Stadie, Bradly C., Pieter Abbeel, and Ilya Sutskever. "Third-Person Imitation Learning." arXiv preprint arXiv:1703.01703 (2017).