I have been inconsistent with my journal, but I am back and fresher than ever.
Vlog version as usual:
Today (and yesterday) I did & learned:
RL seems to involve a lot more exploration than some other ML tasks. One popular application is definitely beating video games; the Mario AI was a viral hit in 2015. I decided to build an RL model that could beat Atari Breakout. That goal was soon classified as impossible given my current coding skills, so I chose to first implement a Medium article that beat Atari Breakout. The article was great at linking the original Atari Breakout RL paper with the code, but the full code was not posted, so I was stuck. Luckily, a user named boyuanf hit us up with a TensorFlow implementation of the article on Medium; here's the forked version of it.
I downloaded the trained weights and model, and I ran it after installing OpenAI Gym in conda with pip. Unfortunately, atari-py seems incompatible with Windows 10, so I had to go through a very annoying process before finally coming through with one easy line of code that solved the problem:
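(The line itself didn't survive the crosspost, so take this as my best reconstruction: the commonly cited Windows workaround at the time was installing a prebuilt atari-py wheel instead of compiling from source. The exact command below is an assumption, not necessarily the one I originally found.)

```shell
# Assumed fix: grab a Windows-compatible atari-py build from prebuilt wheels,
# skipping the source compile that fails on Windows 10.
pip install --no-index -f https://github.com/Kojoley/atari-py/releases atari_py
```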
Yea it is just one of those problems man.
Anyways, I was then able to run Gym and watch the beautiful pre-trained model do its work. It got a pretty good high score, I think 57 or something.
It's actually after I implement a project that I come back to reading the papers; this works for me. I usually try to guess what the original algorithm does by doing the project first. Doing the project first and then reading the paper also gives that revelation of: "oh, the reason I have this line in the code is because of that sentence in the paper".
The paper and this medium article helped my understanding a lot. This pseudocode in the paper opened the doors for me:
I'm going to try to explain this pseudocode in even English-er language. We input the current frame and a few previous frames to our RL model. The model interprets these inputs as the state, and it either chooses an action based on the Q-table or chooses a random action. As the model gets better, we want it to choose fewer random actions and rely on what it has learned, but in the early stages, when the model has no idea what to do, we probably want to let it explore randomly; a decreasing epsilon value models this trade-off.

The emulator receives the action chosen by the RL model, runs that action, then displays the new image and returns the reward. The Q-table is updated based on this reward. The Q-table is just a table mapping states to potential actions. When the model is trained and epsilon is low, the RL model chooses actions based on the Q-table: a higher value in a state-action entry (which means higher expected reward) makes it more likely the model chooses that action.
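The loop above can be sketched as plain tabular Q-learning with epsilon-greedy exploration. This is a toy version: the actual paper replaces the table with a deep network, and the `alpha`, `gamma`, and decay numbers here are values I made up for illustration.

```python
import collections
import random

# Q-table: maps (state, action) pairs to an estimated value, defaulting to 0.
Q = collections.defaultdict(float)

def choose_action(state, n_actions, epsilon):
    """Epsilon-greedy: with probability epsilon explore randomly,
    otherwise exploit the best-known action in the Q-table."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    return max(range(n_actions), key=lambda a: Q[(state, a)])

def update(state, action, reward, next_state, n_actions, alpha=0.1, gamma=0.99):
    """Q-learning update: nudge Q(s, a) toward the observed reward plus
    the discounted value of the best action available in the next state."""
    best_next = max(Q[(next_state, a)] for a in range(n_actions))
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

def decay(epsilon, rate=0.995, floor=0.1):
    """Decreasing epsilon: explore a lot early, exploit more and more later."""
    return max(floor, epsilon * rate)
```

Each emulator step, you'd call `choose_action`, pass the action to the environment, then call `update` with the reward and new state, decaying epsilon as training goes on.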
That's it for this one. I learned a lot since it was my first time exploring RL! Exciting, can't wait to do more.
submitted by /u/RedditAcy