Reinforcement Learning: Cart-pole, Deep Q learning

Karan Jakhar
4 min read · Oct 12, 2019


#day4 of #100daysofcode

Today I explored Deep Q learning further. The concept is very interesting, and there are various tweaks involved in training the agent well. I worked on the CartPole problem from OpenAI Gym.

The problem statement:

A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The system is controlled by applying a force of +1 or -1 to the cart. The pendulum starts upright, and the goal is to prevent it from falling over. A reward of +1 is provided for every timestep that the pole remains upright. The episode ends when the pole is more than 15 degrees from vertical, or the cart moves more than 2.4 units from the center.
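To make that concrete, here is a tiny sketch (using OpenAI Gym, which the rest of this post assumes) of what the environment exposes:

```python
import gym

env = gym.make('CartPole-v1')
print(env.observation_space)  # 4 numbers: cart position, cart velocity, pole angle, pole angular velocity
print(env.action_space)       # Discrete(2): push the cart left (0) or right (1)
```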

It took me some time to understand how a deep network is used in place of the Q-table, but once I got the idea the rest was pretty easy. Another thing I want to explore is this formula:
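For reference, the standard form of that update (the Bellman equation that Q-learning is built on) is:

```latex
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
```

In Deep Q learning the term r + γ · max Q(s′, a′) becomes the training target for the network, which is exactly what the replay step later in this post computes.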

At a glance it is clear, but I want to understand it in detail; that is work for the coming days. So far Reinforcement Learning has been very interesting and exciting. Deep Q learning is a mixture of Reinforcement Learning and Deep Learning, and it was first introduced by DeepMind in the research paper “Playing Atari with Deep Reinforcement Learning”.

Let’s have a quick discussion about the code.

I will be explaining every part briefly.

This part you know :). Importing required libraries.
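In case you want to follow along, the imports would look roughly like this (a sketch assuming the standalone Keras API and OpenAI Gym):

```python
import random
from collections import deque

import gym
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam
```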

Here starts our Deep Q learning class. Classes make the code modular and easy to reuse, so from now on I am trying to use classes only. Above we define some parameters that we will use in the rest of the code. A few are self-explanatory, like “self.memory = deque(maxlen=2000)”, which stores (state, action, reward, new_state, done) tuples. We use a deque instead of a normal Python list because the buffer has a fixed size: when a bounded deque is full, a new append drops the oldest entry in O(1), whereas keeping a list trimmed from the front costs O(N) per removal. The other variables, “self.epsilon”, “self.epsilon_min”, “self.epsilon_decay”, “self.learning_rate” and “self.gamma”, have a lot of theory behind them. I will explain them tomorrow along with the formula above, which I am also going to explore tomorrow. The theory is simple and straightforward. Don’t worry, we will get to it tomorrow.
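A minimal sketch of that class setup (the class name DQNAgent and the exact hyperparameter values here are illustrative, not necessarily the ones from my code):

```python
class DQNAgent:
    def __init__(self, state_size, action_size):
        self.state_size = state_size       # 4 observations for CartPole
        self.action_size = action_size     # 2 possible actions
        self.memory = deque(maxlen=2000)   # replay buffer of (state, action, reward, next_state, done)
        self.gamma = 0.95                  # discount factor for future rewards
        self.epsilon = 1.0                 # exploration rate, starts fully random
        self.epsilon_min = 0.01            # never stop exploring completely
        self.epsilon_decay = 0.995         # shrink epsilon a little after every training step
        self.learning_rate = 0.001         # step size for the optimizer
        self.model = self._build_model()
```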

This is a very simple model, but it will be enough for our problem. Think of it as a black box for now that maps input to output: the state is the input to our model, and the predicted value of each action is the output. Earlier we used a Q-table to get an action given a state, but that is not a memory-efficient solution for large problems.
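Continuing the DQNAgent sketch, the model could be a small fully connected network like this (layer sizes are illustrative):

```python
    def _build_model(self):
        # State in, one Q-value per action out
        model = Sequential()
        model.add(Dense(24, input_dim=self.state_size, activation='relu'))
        model.add(Dense(24, activation='relu'))
        model.add(Dense(self.action_size, activation='linear'))
        model.compile(loss='mse', optimizer=Adam(lr=self.learning_rate))
        return model
```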

We store (state, action, reward, next_state, done) tuples as our agent explores, and then we train our deep network model on these stored values. We train the model on a batch of the stored values, not on all of them.
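The storing itself is just one append into the deque (again, a sketch of the idea):

```python
    def remember(self, state, action, reward, next_state, done):
        # The deque silently drops the oldest transition once it holds 2000 of them
        self.memory.append((state, action, reward, next_state, done))
```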

We draw a random value; if that value is less than epsilon (a hyperparameter, meaning we set it to a certain value ourselves), our agent explores the environment, i.e. it takes a random action. Otherwise, it exploits the environment and uses the action predicted by our model.
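In code, that epsilon-greedy choice looks roughly like this:

```python
    def act(self, state):
        # Explore with probability epsilon, otherwise exploit the model's prediction
        if np.random.rand() <= self.epsilon:
            return random.randrange(self.action_size)
        q_values = self.model.predict(state)   # state has shape (1, state_size)
        return np.argmax(q_values[0])          # index of the best predicted action
```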

Here we train our model. This is called replay. We train the model on a random batch of the stored values.
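A sketch of that replay step, using the Bellman target from the formula above (details such as the epoch count may differ from my actual code):

```python
    def replay(self, batch_size):
        # Sample a random mini-batch of stored transitions
        minibatch = random.sample(self.memory, batch_size)
        for state, action, reward, next_state, done in minibatch:
            target = reward
            if not done:
                # Bellman target: reward now plus discounted best future value
                target = reward + self.gamma * np.amax(self.model.predict(next_state)[0])
            target_f = self.model.predict(state)
            target_f[0][action] = target       # only the taken action gets a new target
            self.model.fit(state, target_f, epochs=1, verbose=0)
        # Slowly shift from exploration to exploitation
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay
```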

This is the main part of the code. From here we call all the functions. There is a lot to explore in it, and I will go through it one by one. Stay tuned for more.
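Here is a rough sketch of that main loop (episode count, step limit, and batch size are illustrative, and it assumes the older Gym API where reset() returns just the state):

```python
if __name__ == "__main__":
    env = gym.make('CartPole-v1')
    state_size = env.observation_space.shape[0]
    action_size = env.action_space.n
    agent = DQNAgent(state_size, action_size)
    batch_size = 32

    for episode in range(1000):
        state = env.reset().reshape(1, state_size)
        for t in range(500):
            action = agent.act(state)
            next_state, reward, done, _ = env.step(action)
            next_state = next_state.reshape(1, state_size)
            agent.remember(state, action, reward, next_state, done)
            state = next_state
            if done:
                print("episode {}, score {}".format(episode, t))
                break
            if len(agent.memory) > batch_size:
                agent.replay(batch_size)
```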

So this is how my day went. :) It took more than an hour, but it was fun. Here is the result:

Resources I used:

Happy Learning!!!
