## tl;dr;

You can teach your machine to break arbitrary Caesar cipher by observing enough training examples using Trusted Region Policy Optimization for Policy Gradients:

## Full text

Imagine the world where a hammer was introduced to the public just couple a years ago. Everyone is running around trying to apply the hammer to anything that even resembles a nail. This is the world we are living in and the hammer is deep learning.
Today I will be applying it to a task that can be much easier solved by other means but hey, it's Deep Learning Age! Specifically, I will teach my machine to break a simple cipher like Caesar cipher just by looking at several (actually, a lot) examples of English text and corresponding encoded strings.
You may have heard that machines are getting pretty good at playing games so I decided to formulate this code breaking challenge as a game. Fortunately there is this OpenAI Gym toolkit that can be used "for developing and comparing reinforcement learning algorithms". It provides some great abstractions that help us define games in terms that computer can understand. For instance, they have a game (or environment) called "Copy-v0" with the following setup and rules:
• There is an input tape with some characters.
• You can move cursor one step left or right along this tape.
• You can read symbols under the cursor and output characters one at a time to the output tape.
• You need to copy input tape characters to output tape to win.
But this is almost exactly what we need! Let's just change the win condition: instead of just copying input tape characters you need to decode them first to win.
Now let's talk a bit about the hammer itself. The hottest thing on the Reinforcement Learning market right now is Policy Gradients and specifically this flavorTrust Region Policy Optimization. There is an amazing article from Andrej Karpathy on Policy Gradients so I will not give here an introduction. If you are new to Reinforcement Learning you just stop reading this post and go read that one. Seriously, it's so much better!

Still here? Ok, I will tell you about TRPO then. TRPO is a technique for Policy Gradients optimization that produces much better results than vanilla gradient descent and even guarantees (theoretically, of course) that you can get an improved policy network on every iteration.
With vanilla PG you start by defining a policy network that produces scores for the actions given the current state. You then simulate hundreds and thousands of games taking actions suggested by the network and note which actions produced better results. Having this data available you can then use backpropagation to update your policy network and start all over again. The only thing that TRPO adds to this is that you solve a constrained optimization problem instead of an unconstrained one: $$\textrm{maximize } L(\theta) \textrm{ subject to } \bar{D}_{KL}(\theta_{\textrm{old}},\theta)<\delta$$ Here $$L(\theta)$$ is a loss that we are trying to optimize. It is defined as $$E_{a \sim q}[\frac{\pi_\theta(a|s_n)}{q(a|s_n)} A_{\theta_{\textrm{old}}}(s_n,a)],$$ where $$\theta$$ is our weights vector, $$\pi_\theta(a|s_n)$$ is a probability (score) of the selected action $$a$$ in state $$s_n$$ according to the policy network, $$q(a|s_n)$$ is a corresponding score using the policy network from the iteration before and $$A_{\theta_{\textrm{old}}}(s_n,a)$$ is an advantage (more on it later). Running simple gradient descent on this is the vanilla Policy Gradients approach. TRPO approach doesn't blindly descend along the gradient but takes into account the $$\bar{D}_{KL}(\theta_{\textrm{old}},\theta)<\delta$$ constraint. To make sure the constraint is satisfied we do the following. First, we approximately solve the following equation to find a search direction: $$Ax = g,$$ where A is the Fisher information matrix, $$A_{\textrm{ij}} = \frac{\partial}{\partial \theta_i}\frac{\partial}{\partial \theta_j}\bar{D}_{KL}(\theta_{\textrm{old}},\theta)$$ and $$g$$ is the gradient that you can get from the loss using backpropagation. This is done using conjugate gradients algorithm. Once we have a search direction we can easily find a maximum step along this direction that still satisfies the constraint.
One thing that I promised to get back to is the advantage. It is defined as $$A_\pi(s,a)= Q_\pi(s,a)−V_\pi(s),$$ where $$Q_\pi(s,a)$$ is a state-action value function (actual reward of taking an action in this state, it usually includes discounted rewards for all upcoming states) and $$V_\pi(s)$$ is a value function (in our case it's just a separate network that we train to predict the value of the state).

Bored enough already? I promise, it's not that scary in code. You can find the full implementation here: tilarids/reinforcement_learning_playground. Specifically, look at trpo_agent.py. You can reproduce the Caesar cipher breaking by running trpo_caesar.py.
For those of you who thinks the code resembles wojzaremba's implementation a lot - you are right. I was copying some TRPO code from there and then rewriting it to make it more readable and also to make sure it follows the paper closely.

## Results

As you can see, I've used The Zen of Python (which is already conveniently encoded with ROT13) to generate my training data. There are sample runs after 45000, 100000 and 164000 episodes (each episode == 1 game). After 164000 the agent is smart enough to decode long sentences: You may say (and you will be right) that it's an overkill to use such a complex approach to solve such a simple problem and there are much simpler ways to break Caesar cipher having large sets of encoded and decoded text pairs. But the beauty of the approach described above is that it can be trained to solve other tasks without a slightest change in the agent code and network configuration. For instance, here the same agent balances a pole on the cart and here it learns how to copy symbols. So this whole article is only half joke. Deep Reinforcement Learning using TRPO is a powerful technique and I look forward towards the future with fully self-trained robots strolling the streets. Scary, huh?

## Sunday, December 16, 2012

### Отчёт о Dee Mon Winter Contest

По традиции участвовал в этом контесте в составе команды THIRTEEN. Отчёт(осторожно, там СПОЙЛЕРЫ-СПОЙЛЕРЫ-СПОЙЛЕРЫ), написанный san'ом и brox'ом с моими небольшими дополнениями можно прочитать здесь. Контест выдался замечательный, вполне в духе, автору браво и аплодисменты.

## Thursday, October 11, 2012

### Talk: What every programmer should know about merging branches in Mercurial

Using named branches in Mercurial can be rather difficult especially when it gets to merging these branches. There are several useful techniques and process improvements that simplify the development/release loop. You can find some of them in the talk slides I did on Oct 11, 2012 here at Quickoffice(aka DoctorMobile). It doesn't restrain any development flow and can be useful for any Mercurial (or even non-Mercurial) user. For those who are not familiar with Mercurial at all I recommend to read some guides before looking at the slides. What every programmer should know about merging branches in Mercurial Sources for the slides can be found here

## Sunday, May 27, 2012

### Counting bits

How would you count the number of bits set in a 32-bit integer? This question is amongst the favorite interviewing questions. The naïve approach is to loop through the bits aggregating the sum in a variable. It's already fast enough but you can make the interviewer happy with a much better "lookup table" solution: just use a static 256-entry table that contains pre-evaluated bits count for any 8-bit integer. With such table you need only 4 lookups and 4 summations to get the answer. "Lookup table" algorithm is even implemented as a GCC builtin function. Take a look at the code:

#include <stdint.h>

const uint32_t tbl[] = {
4181239173/*,lots of random data, I used 16384 integers for tests*/,2866015773 } ;

#define TEST_COUNT 100000
static const uint32_t ARRAY_SIZE = sizeof(tbl)/sizeof(tbl[0]);

int main()
{
uint32_t i = 0;
volatile register uint32_t c;

for (; i< TEST_COUNT; ++i)
{
uint32_t j = 0;
for (; j < ARRAY_SIZE; ++j)
{
c = __builtin_popcount(tbl[j]);
}
}
}


Here __builtin_popcount is a function that calculates the population count. If compiled with
gcc -O3 t.c -o bitcount
one can get such results:
$time ./bitcount real 0m17.465s user 0m17.425s sys 0m0.000s  Not too fast, huh? Some of the time is wasted calling the function and inlining it can speed up things. But! There is even a better way. Compiling it with gcc -O3 -mpopcnt t.c -o bitcount results in: $ time ./bitcount
real 0m1.738s
user 0m1.728s
sys 0m0.004s

That's simply because modern processors have POPCNT instruction that can be used to count the number of bits and GCC is nice enough to support it. As you can see, in this case hardware optimization beats smart algorithm design with ease.

## Resume

Algorithms are designed on paper but executed on a real hardware. Knowing what your hardware is capable of is a must for a software engineer. And please don't tell me about all these patterns and object-oriented stuff: unless you invent a machine that can run concepts instead of machine instructions, you must be aware of all the low-level stuff because it will save lives because I don't want to buy a new laptop capable of running new version of nano/Notepad.

I also highly recommend an awesome article "Benchmarking CRC32 and PopCnt instructions" for further reading.

## Thursday, March 29, 2012

### Ruby-like blocks and yield "keyword" in C

Today I want to show you some of the plain old C black magic. This is the magic of direct user context manipulating. You can use this low-level feature to implement some high-level language abstractions such as generators and ruby-like blocks. Let's start with an example(test1.c):

#include <stdio.h>#include <string.h>#include "gens.h"#include "cblocks.h"int main(int argc, char** argv){    EXECUTE_BLOCK(enumerate, (1, 10), int, x)        printf("Got %d\n", *x);    END_BLOCK    char* buf = "Hello, world!";    EXECUTE_BLOCK(enumerate, (1, 4), int, z)        printf("Saying it %d time\n", *z);        EXECUTE_BLOCK(for_each, (buf, buf + strlen(buf)), char, c)            printf("%c.", *c);        END_BLOCK        printf("\n");    END_BLOCK    return 0;}

The code above is a valid C code. It produces such output:
Yielding 1Got 1Yielding 2Got 2Yielding 3Got 3Yielding 4Got 4Yielding 5Got 5Yielding 6Got 6Yielding 7Got 7Yielding 8Got 8Yielding 9Got 9Yielding 1Saying it 1 timeH.H.e.e.l.l.l.l.o.o.,.,. . .w.w.o.o.r.r.l.l.d.d.!.!.Yielding 2Saying it 2 timeH.H.e.e.l.l.l.l.o.o.,.,. . .w.w.o.o.r.r.l.l.d.d.!.!.Yielding 3Saying it 3 timeH.H.e.e.l.l.l.l.o.o.,.,. . .w.w.o.o.r.r.l.l.d.d.!.!.

Let's take a look on the implementation of used generators (enumerate, for_each in gens.c):
#include "gens.h"#include <stdio.h>START_GENERATOR(int, enumerate, (int start, int end))    int i = start;    for(; i<end; ++i)    {        printf("Yielding %d\n", i);        YIELD(i);    }END_GENERATORSTART_GENERATOR(char, for_each, (char* start, char* end))    while (start != end)    {        YIELD(*start);        YIELD(*start); // intentionally        ++start;    }END_GENERATOR

As you can see, the implementation of generators is pretty straightforward. All the magic is done in cblocks.h:
#include <stdlib.h>#include <ucontext.h>#define DECLARE_GENERATOR(type, name, params) extern ucontext_t name ## _context__; \extern ucontext_t name ## _out_context__; \extern int name ## _finished__; \extern type name ## _out_var__; \extern void name ## _fun__ params#define START_GENERATOR(type, name, params) ucontext_t name ## _context__; \ucontext_t name ## _out_context__; \int name ## _finished__ = 0; \type name ## _out_var__; \void name ## _fun__ params \{ \name ## _finished__ = 0; \int* current_finished = & name ## _finished__; \ucontext_t* current_context = & name ## _context__; \ucontext_t* current_out_context = & name ## _out_context__; \type* current_var = & name ## _out_var__;#define END_GENERATOR *current_finished = 1; \}#define YIELD(x) *current_var = x; swapcontext(current_context, current_out_context);#define EXECUTE_BLOCK(gen, params, type, varname) do{ type* varname = & gen ## _out_var__; \ucontext_t* current_context = & gen ## _context__; \ucontext_t* current_out_context = & gen ## _out_context__; \volatile int gen ## _temp_var__ = 0; \getcontext(& gen ## _out_context__); \if (gen ## _temp_var__ == 0) \{ \    gen ## _temp_var__ = 1; \    gen ## _fun__ params; \}\else \{ \    while (! gen ## _finished__) {          #define END_BLOCK swapcontext(current_out_context, current_context); } \} \} while(0);

Manual pages for swapcontext and getcontext can help in understanding what's going on here.

### Resume

• It's possible to implement nifty language features in C using its low-level features. The generators described above can be even used to build cooperative multitasking systems.
• It doesn't mean that you should do it just because you can ;). Beware that context manipulating doesn't work on all platforms.
• You can find the source code on github. The code should be considered just a playground so please forgive me its untidiness. If you want something mature(e.g., you're seeking for coroutines support in C), check out the libtask, libtorque or libcoro

As always I will be happy to accept any criticism on the code above. Feel free to contact me in any way you can imagine.