Joshua Achiam, David Held, Aviv Tamar, Pieter Abbeel; Berkeley AI Research
A faulty reward function can cause an agent to learn unintended and sometimes harmful behaviour. When RL algorithms are applied in real-world settings such as UAV control or household robots, the first concern to address is safety. Even while the agent is still learning, unsafe exploratory behaviour must be kept to a minimum or avoided altogether, especially in the initial stages. CPO (Constrained Policy Optimization) is designed to meet this need in the deep RL setting.
Policy optimization is the search for the best policy. Constrained policy optimization is a local policy search method: each new policy is required to be close (local) in some sense to the previous one, and the update is repeated until convergence.
Another example of local policy search is the policy gradient method, explained well in Andrej Karpathy’s blog. It keeps policies close by taking small steps in the direction of the gradient of the performance.
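To make the "small step along the gradient" idea concrete, here is a minimal sketch on a toy problem. It assumes a single-state task (a three-armed bandit) with a softmax policy, so the gradient of the expected reward can be written exactly; the reward values, `step_size`, and function names are illustrative choices, not anything taken from the paper or the blog.

```python
import numpy as np

def softmax(theta):
    """Turn policy parameters into action probabilities."""
    z = np.exp(theta - np.max(theta))  # subtract max for numerical stability
    return z / z.sum()

def expected_reward(theta, rewards):
    """Performance of the policy: expected reward under its action distribution."""
    return float(softmax(theta) @ rewards)

def policy_gradient(theta, rewards):
    """Exact gradient of expected reward for a softmax policy on a bandit:
    dJ/dtheta_i = pi_i * (r_i - J)."""
    pi = softmax(theta)
    return pi * (rewards - pi @ rewards)

rewards = np.array([1.0, 0.0, 0.5])  # hypothetical reward per action
theta = np.zeros(3)                  # start from a uniform policy
step_size = 0.1                      # small step keeps the new policy close to the old one

new_theta = theta + step_size * policy_gradient(theta, rewards)
```

With a small `step_size`, the new action probabilities differ only slightly from the old ones, which is the sense in which plain policy gradients stay "local".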
In trust region methods, each new policy must be close to the old one in terms of average KL-divergence. Since a policy outputs a probability distribution over actions, and KL-divergence measures how much two probability distributions differ from each other, this is a natural way to constrain policy updates.
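As a rough illustration of the closeness measure, the sketch below computes the average KL-divergence between an old and a new policy over a small batch of states. It assumes discrete actions (each row is one state's action distribution), uses the direction KL(old || new), and picks an arbitrary trust-region radius `delta`; all of these are assumptions made for the example.

```python
import numpy as np

def mean_kl(old_probs, new_probs, eps=1e-12):
    """Average KL(old || new) over a batch of states.
    Each row is a categorical action distribution for one state."""
    old = np.clip(old_probs, eps, 1.0)  # avoid log(0)
    new = np.clip(new_probs, eps, 1.0)
    per_state = np.sum(old * (np.log(old) - np.log(new)), axis=1)
    return float(np.mean(per_state))

# Hypothetical action distributions for two states, two actions each.
old_policy = np.array([[0.5, 0.5], [0.9, 0.1]])
new_policy = np.array([[0.6, 0.4], [0.8, 0.2]])

delta = 0.01  # hypothetical trust-region radius
kl = mean_kl(old_policy, new_policy)
within_trust_region = kl <= delta
```

A trust region method would reject or shrink an update whose average KL exceeds `delta`, which is the case for the example policies here.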
CPO also uses a trust region update, but it is designed for constrained RL: it approximately enforces the constraints at every policy update. It uses approximations of the constraints to predict how much the constraint costs would change after a given update, and then chooses the update that most improves performance while keeping the constraint costs below their limits.
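The selection rule described above can be sketched in a deliberately simplified form. The code below is not the CPO algorithm itself: it brute-forces over a handful of hand-picked candidate steps instead of solving the optimization problem, and it assumes a first-order estimate of the reward and cost changes (gradients `g` and `b`) and a second-order estimate of the KL-divergence (matrix `H`). All names and numbers are hypothetical.

```python
import numpy as np

def choose_update(candidates, g, b, cost_now, cost_limit, H, delta):
    """Pick the candidate step with the largest predicted reward gain among
    those predicted to satisfy both the cost limit and the trust region.
    Returns None if no candidate is predicted to be feasible."""
    best_step, best_gain = None, -np.inf
    for step in candidates:
        predicted_cost = cost_now + b @ step  # first-order estimate of constraint cost
        predicted_kl = 0.5 * step @ H @ step  # second-order estimate of KL-divergence
        predicted_gain = g @ step             # first-order estimate of reward improvement
        feasible = predicted_cost <= cost_limit and predicted_kl <= delta
        if feasible and predicted_gain > best_gain:
            best_step, best_gain = step, predicted_gain
    return best_step

g = np.array([1.0, 0.5])    # hypothetical reward gradient
b = np.array([1.0, -1.0])   # hypothetical constraint-cost gradient
H = np.eye(2)               # hypothetical KL curvature matrix
candidates = [
    np.array([0.2, 0.0]),   # best predicted gain, but predicted cost exceeds the limit
    np.array([0.1, 0.1]),   # predicted to be feasible
    np.array([0.0, 0.15]),  # predicted to be feasible, smaller gain
    np.array([0.3, 0.3]),   # predicted to leave the trust region
]

chosen = choose_update(candidates, g, b, cost_now=0.9, cost_limit=1.0, H=H, delta=0.02)
```

Here the step with the highest predicted reward gain is rejected because its predicted cost crosses the limit, so the next-best step that is predicted to be safe is chosen instead.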