Comments on "On The Highway: Breaking the Caesar cipher with Deep Reinforcement Learning" by Sergey Kishchenko

Anonymous (2017-08-20 15:51):
Hi Sergey,
Thank you so much for taking the time to answer these questions, it helped a lot!

Sergey Kishchenko (2017-08-20 14:19):
MathJax is very broken, so I'm reposting the second part of the answer.
As far as I remember, `\mu_\theta(x)` is `self.policy_network` in the code and `\mu_{old}` is `self.prev_policy`. `self.kl_divergence_op` is how the KL divergence is computed. `M` from the article is the second derivative of the KL divergence; that is what is computed from `self.kl_divergence_op` and finally used in the `fisher_vector_product` function.

Sergey Kishchenko (2017-08-20 14:13):
I haven't touched this code for almost a year, so some of my understanding of TRPO may have faded. But here you go:
1) This is not directly related to TRPO but rather to the conjugate gradients method. I've seen several different damping schemes used when conjugate gradients are involved, and this is one of them. Removing this part should not break the algorithm in theory, but in practice we start hitting the safeguard, which breaks the line search and slows down learning (compare https://gym.openai.com/evaluations/eval_VlBZIU6zTVu5PXVr9Ntdkg, which uses damping, with https://gym.openai.com/evaluations/eval_Nyl6z8QWTi2jb1jlsY2vBA, which doesn't).
There is a short discussion about introducing this kind of damping in the repo I picked up the damping (and most of the other ideas) from: https://github.com/wojzaremba/trpo/issues/2
2) As far as I remember, `\mu_\theta(x)` is `self.policy_network` in the code and `\mu_{old}` is `self.prev_policy`. `self.kl_divergence_op` is how the KL divergence is computed. `M` from the article is the second derivative of the KL divergence; that is what is computed from `self.kl_divergence_op` and finally used in `fisher_vector_product`.

Anonymous (2017-08-20 12:08):
Thanks for a nice blog post and the code. I have followed the code and have two questions:

1) Why are you using conj_grads_damping (=0.1)? Is this related somehow to conjugate gradients in general, or to something in TRPO? I've seen something like it in general conjugate gradients, but there they use a convex combination like (1 - conj_grads_damping) * A + conj_grads_damping * B.

2) Do you understand what's going on in TRPO appendix C.1, "Computing the Fisher-Vector Product", where they introduce the mean vector mu? I don't understand the odd KL-divergence discussion there (using some small kl rather than D_kl, etc.). And if you did understand that part, is it somehow apparent in the code too?

Sincerely,
Aleksis
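The mechanics discussed in this thread, solving for the TRPO step direction with conjugate gradients while adding a damping term to the Fisher-vector product, can be sketched as follows. This is a minimal illustration, not the post's actual code: the explicit toy matrix `F` stands in for the KL second derivative (which TRPO computes with a double-backward pass rather than an explicit matrix), and the names `conjugate_gradients` and `fisher_vector_product` mirror the discussion but are assumptions for the example.

```python
import numpy as np

def conjugate_gradients(fvp, b, iters=10, tol=1e-10):
    """Solve F x = b using only matrix-vector products fvp(v)."""
    x = np.zeros_like(b)
    r = b.copy()          # residual b - F x (x starts at zero)
    p = r.copy()          # search direction
    rdotr = r @ r
    for _ in range(iters):
        Fp = fvp(p)
        alpha = rdotr / (p @ Fp)
        x += alpha * p
        r -= alpha * Fp
        new_rdotr = r @ r
        if new_rdotr < tol:
            break
        p = r + (new_rdotr / rdotr) * p
        rdotr = new_rdotr
    return x

# Toy stand-in for the Fisher matrix M (the KL second derivative);
# in TRPO this product comes from differentiating the KL op twice,
# never from materializing the matrix.
F = np.array([[2.0, 0.5],
              [0.5, 1.0]])
conj_grads_damping = 0.1

def fisher_vector_product(v):
    # Damped product (F + damping * I) v. The damping keeps the
    # linear system well conditioned; without it the iterates can
    # hit the safeguard and break the subsequent line search.
    return F @ v + conj_grads_damping * v

g = np.array([1.0, -1.0])   # policy gradient estimate
step_dir = conjugate_gradients(fisher_vector_product, g)
```

Note that this is additive damping, (F + lambda * I) v, rather than the convex combination mentioned in the question; both serve the same purpose of regularizing a near-singular curvature matrix.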