Paul Christiano - Deep Reinforcement Learning from Human Preferences (2017)

Created: August 5, 2017 / Updated: November 2, 2024 / Status: finished / 4 min read (~635 words)

  • Human evaluators are provided short video clips (trajectory segments) where they have to rate two clips against each other
  • The RL agent is given information as to which trajectory segments are preferred over other trajectory segments
    • This allows it to try and determine which state/action are better than others given the human feedback
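
A minimal Python sketch of how two clips (trajectory segments) might be drawn from rollouts so a human can rate them against each other; the helper names (`extract_clip`, `sample_comparison_pair`) are illustrative assumptions, not the paper's code.

```python
# Sketch (assumed helper names): slicing two short clips out of longer
# episode rollouts so a human can rate them against each other.
import random
from typing import List, Tuple

Step = Tuple[tuple, int]   # (observation, action) pair


def extract_clip(episode: List[Step], clip_length: int) -> List[Step]:
    """Take a random contiguous trajectory segment of `clip_length` steps."""
    assert len(episode) >= clip_length
    start = random.randrange(0, len(episode) - clip_length + 1)
    return episode[start:start + clip_length]


def sample_comparison_pair(episodes: List[List[Step]],
                           clip_length: int) -> Tuple[List[Step], List[Step]]:
    """Pick two clips (possibly from different episodes) to show to the rater."""
    ep1, ep2 = random.choice(episodes), random.choice(episodes)
    return extract_clip(ep1, clip_length), extract_clip(ep2, clip_length)
```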

  • In order to practically train deep RL systems with human feedback, we need to decrease the amount of feedback required by several orders of magnitude
  • Our approach is to learn a reward function from human feedback and then to optimize that reward function (the overall loop is sketched after this list)
  • We desire a solution to sequential decision problems without a well-specified reward function that
    • enables us to solve tasks for which we can only recognize the desired behavior, but not necessarily demonstrate it
    • allows agents to be taught by non-expert users
    • scales to large problems
    • is economical with user feedback
  • Our work could also be seen as a specific instance of the cooperative inverse reinforcement learning framework
  • This framework considers a two-player game between a human and a robot interacting with an environment with the purpose of maximizing the human's reward function
  • In our setting the human is only allowed to interact with this game by stating their preferences
  • Our key contribution is to scale human feedback up to deep reinforcement learning and to learn much more complex behaviors
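
As a rough illustration of the approach above (learn a reward function from human feedback, then optimize it), here is a skeleton of the training loop. Every component here (`Policy`, `RewardModel`, `query_human`) is a hypothetical stub standing in for a neural network, an RL algorithm, and a human rater; this is not the paper's implementation.

```python
# Illustrative skeleton of the reward-learning loop described above.
import random
from typing import List, Tuple

Segment = List[Tuple[tuple, int]]   # [(observation, action), ...]


class RewardModel:
    """Stub reward predictor r_hat(o, a); a real version would be a neural net."""
    def predict(self, obs: tuple, act: int) -> float:
        return 0.0

    def fit(self, comparisons: List[Tuple[Segment, Segment, Tuple[float, float]]]) -> None:
        pass  # fit r_hat so preferred segments get higher summed predicted reward


class Policy:
    """Stub policy; a real version would be trained with an RL algorithm
    against the learned reward r_hat instead of an environment reward."""
    def rollout(self, length: int) -> Segment:
        return [((random.random(),), random.randint(0, 1)) for _ in range(length)]

    def improve(self, reward_model: RewardModel) -> None:
        pass  # optimize expected predicted reward


def query_human(seg1: Segment, seg2: Segment) -> Tuple[float, float]:
    """Stand-in for showing two clips to a human and recording mu."""
    return (1.0, 0.0) if random.random() < 0.5 else (0.0, 1.0)


def train(iterations: int = 10, segment_length: int = 25) -> None:
    policy, reward_model, database = Policy(), RewardModel(), []
    for _ in range(iterations):
        # 1. The policy acts in the environment, producing trajectory segments.
        seg1, seg2 = policy.rollout(segment_length), policy.rollout(segment_length)
        # 2. A human compares the two clips; the result is stored as (sigma1, sigma2, mu).
        database.append((seg1, seg2, query_human(seg1, seg2)))
        # 3. The reward model is fit to the comparisons collected so far.
        reward_model.fit(database)
        # 4. The policy is optimized against the learned reward.
        policy.improve(reward_model)


if __name__ == "__main__":
    train()
```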

  • Instead of assuming that the environment produces a reward signal, we assume that there is a human overseer who can express preferences between trajectory segments
  • A trajectory segment is a sequence of observations and actions, $\sigma = ((o_0, a_0), (o_1, a_1), \ldots, (o_{k-1}, a_{k-1})) \in (\mathcal{O} \times \mathcal{A})^k$
  • $\sigma^1 \succ \sigma^2$ indicates that the human prefers trajectory segment $\sigma^1$ over trajectory segment $\sigma^2$
  • Informally, the goal of the agent is to produce trajectories which are preferred by the human, while making as few queries as possible to the human
  • We will evaluate our algorithms' behavior in two ways:
    • Quantitative: We say that preferences are generated by a reward function $r: \mathcal{O} \times \mathcal{A} \to \mathbb{R}$ if

      $$((o^1_0, a^1_0), \ldots, (o^1_{k-1}, a^1_{k-1})) \succ ((o^2_0, a^2_0), \ldots, (o^2_{k-1}, a^2_{k-1}))$$

      whenever

      $$r(o^1_0, a^1_0) + \cdots + r(o^1_{k-1}, a^1_{k-1}) > r(o^2_0, a^2_0) + \cdots + r(o^2_{k-1}, a^2_{k-1})$$

      (a small code sketch of this criterion follows this list)

    • Qualitative: evaluate qualitatively how well the agent satisfies the human's preferences
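
To make the quantitative criterion concrete, here is a small sketch that computes summed segment rewards, checks the consistency condition, and also shows a softened variant commonly used in preference-based reward learning, where the two sums are turned into a preference probability. `RewardFn` and the function names are placeholders, not the paper's code.

```python
# Sketch: the quantitative criterion above says a reward function r "generates"
# the preferences if the preferred segment always has the larger summed reward.
# A common way to soften this into a trainable objective is a softmax
# (Bradley-Terry style) preference probability over the two sums.
import math
from typing import Callable, List, Tuple

Step = Tuple[tuple, int]                  # (observation, action)
RewardFn = Callable[[tuple, int], float]  # r: O x A -> R (placeholder signature)


def segment_return(reward_fn: RewardFn, segment: List[Step]) -> float:
    """Sum of r(o_t, a_t) over the segment."""
    return sum(reward_fn(o, a) for o, a in segment)


def generates_preference(reward_fn: RewardFn,
                         preferred: List[Step],
                         other: List[Step]) -> bool:
    """True iff r ranks the preferred segment strictly higher,
    as in the quantitative criterion above."""
    return segment_return(reward_fn, preferred) > segment_return(reward_fn, other)


def preference_probability(reward_fn: RewardFn,
                           seg1: List[Step],
                           seg2: List[Step]) -> float:
    """P[seg1 preferred over seg2] = exp(R1) / (exp(R1) + exp(R2))."""
    r1, r2 = segment_return(reward_fn, seg1), segment_return(reward_fn, seg2)
    # Subtract the max before exponentiating for numerical stability.
    m = max(r1, r2)
    e1, e2 = math.exp(r1 - m), math.exp(r2 - m)
    return e1 / (e1 + e2)
```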

  • The human overseer is given a visualization of two trajectory segments, in the form of short movie clips
  • The human then indicates which segment they prefer, that the two segments are equally good (or bad?), or that they are unable to compare the two segments
  • The human judgments are recorded in a database D of triples (σ1,σ2,μ), where σ1 and σ2 are the two segments and μ is a distribution over {1, 2} indicating which segment the user preferred
    • If the human selects one segment as preferable, then μ puts all of its mass on that choice
    • If the human marks the segments as equally preferable, then μ is uniform
    • If the human marks the segments as incomparable, then the comparison is not included in the database
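
A minimal sketch of how the three cases above could map onto entries of the database D; the function and type names are assumptions for illustration, not taken from the paper.

```python
# Sketch of how the three judgment cases above map onto database entries.
from typing import List, Optional, Tuple

Segment = list                                         # trajectory segment (placeholder)
Triple = Tuple[Segment, Segment, Tuple[float, float]]  # (sigma1, sigma2, mu)


def record_judgment(database: List[Triple],
                    seg1: Segment,
                    seg2: Segment,
                    judgment: str) -> Optional[Triple]:
    """judgment is one of 'prefer_1', 'prefer_2', 'equal', 'incomparable'."""
    if judgment == "prefer_1":
        mu = (1.0, 0.0)          # all mass on segment 1
    elif judgment == "prefer_2":
        mu = (0.0, 1.0)          # all mass on segment 2
    elif judgment == "equal":
        mu = (0.5, 0.5)          # uniform over {1, 2}
    else:
        return None              # incomparable: comparison is not stored
    triple = (seg1, seg2, mu)
    database.append(triple)
    return triple
```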

  • In general we discovered that asking humans to compare longer clips was significantly more helpful per clip, and significantly less helpful per frame
  • We found that for short clips it took human raters a while just to understand the situation, while for longer clips the evaluation time was a roughly linear function of the clip length
  • We tried to choose the shortest clip length for which the evaluation time was linear

  • Christiano, Paul, et al. "Deep reinforcement learning from human preferences." arXiv preprint arXiv:1706.03741 (2017).