DDPG & TRPO
DDPG : Deep Deterministic Policy Gradient
Deterministic Policy Gradient Theorem
\[\begin{aligned} J(\theta) = \mathbb{E}_{s\sim \rho_\mu}\left[Q^{\mu}(s, a)\right] = \int_\mathcal{S} \rho_\mu (s)\, Q^{\mu} (s, a)\, ds \quad \text{where } a = \mu_\theta(s) \end{aligned}\]

\[\begin{aligned} \nabla_\theta J(\theta) &= \nabla_\theta \int_\mathcal{S} \rho_\mu (s)\, Q^{\mu} (s, \mu_\theta(s)) \, ds\\ &= \int_\mathcal{S} \rho_\mu (s)\, \nabla_\theta Q^{\mu} (s, \mu_\theta (s)) \, ds\\ &= \int_\mathcal{S} \rho_\mu (s)\, \nabla_\theta \mu_\theta (s)\, \nabla_a Q^{\mu} (s, a) \big|_{a=\mu_\theta(s)} \, ds \\ &= \mathbb{E}_{s\sim \rho_\mu}\!\left[\nabla_\theta \mu_\theta (s)\, \nabla_a Q^{\mu} (s, a) \big|_{a=\mu_\theta(s)}\right] \end{aligned}\]

The third line applies the chain rule through the action $a = \mu_\theta(s)$, and the last line rewrites the integral as an expectation over the state distribution $\rho_\mu$.
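As a concrete illustration, the DDPG actor update implements exactly this gradient: backpropagating through $a = \mu_\theta(s)$ into the critic produces $\nabla_a Q \cdot \nabla_\theta \mu_\theta$ automatically. Below is a minimal PyTorch sketch of that actor step (network sizes, names, and the random batch are illustrative assumptions, not code from this post).

```python
import torch
import torch.nn as nn

state_dim, action_dim = 3, 1

# Deterministic actor mu_theta(s) and critic Q(s, a); sizes are illustrative.
actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                      nn.Linear(64, action_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                       nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)

states = torch.randn(32, state_dim)  # stand-in for a replay-buffer batch

# grad_theta J = E[ grad_theta mu_theta(s) * grad_a Q(s, a)|_{a = mu_theta(s)} ]
# Autograd composes both factors when we backprop through a = mu_theta(s).
actions = actor(states)
actor_loss = -critic(torch.cat([states, actions], dim=1)).mean()

actor_opt.zero_grad()
actor_loss.backward()  # chain rule: dQ/da * dmu/dtheta, as in the DPG theorem
actor_opt.step()       # only the actor's parameters are updated here
```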
TRPO : Trust Region Policy Optimization

TRPO addresses the limitation that DDPG cannot guarantee monotonic improvement. "Monotonic" (단조 in Korean) means changing in a single direction, as in the mathematical terms monotonic increase and monotonic decrease.
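For reference, the monotonic-improvement guarantee that TRPO builds on is the lower bound from Schulman et al. (2015); the notation below is the standard one from that paper rather than from this post:

\[\begin{aligned} J(\tilde{\pi}) &\ge L_{\pi}(\tilde{\pi}) - C \, D_{\mathrm{KL}}^{\max}(\pi, \tilde{\pi}), \qquad C = \frac{4 \epsilon \gamma}{(1-\gamma)^2}, \quad \epsilon = \max_{s,a} \lvert A_\pi(s, a)\rvert \\ L_{\pi}(\tilde{\pi}) &= J(\pi) + \sum_s \rho_\pi(s) \sum_a \tilde{\pi}(a \mid s)\, A_\pi(s, a) \end{aligned}\]

Maximizing this lower bound at every update guarantees $J(\tilde{\pi}) \ge J(\pi)$, i.e., monotonic improvement of the policy.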
TRPO is built on the following two