The Playground of AI: Exploring the Basics of Reinforcement Learning
Photo generated from OpenAI's ChatGPT (Prompted March 30, 2025)

The current trend in the field of Data Science revolves around Generative Artificial Intelligence (GenAI), particularly chatbots built on Large Language Models (LLMs). Before that, the focus was on predicting a class or score from different features using classical and deep learning models. Throughout this timeline, however, a subset of machine learning has always existed, though not as widely recognized, and continues to evolve and thrive. This field is Reinforcement Learning (RL), which focuses on training agents to make decisions by interacting with their environment to maximize cumulative rewards.

Photo taken from https://www.devopsschool.com/

Unlike supervised learning, where the objective is to learn from labeled examples, or unsupervised learning, which focuses on identifying patterns in data, RL involves an autonomous agent that learns by making decisions and adapting based on the outcomes of its actions, often without prior data and typically through trial and error. [1][2] Reinforcement Learning also has deep roots in various disciplines, including psychology, neuroscience, economics, and engineering. This plethora of perspectives and influences makes RL a dynamic and highly interdisciplinary field. [3] In this introduction to Reinforcement Learning, we will explore the foundations and mathematics behind the field, the main framework with a brief teaser of the different advancements, and, of course, a showcase of how RL works in Python.

Fundamentals

Photo taken from https://thedecisionlab.com/

At the core of Reinforcement Learning lies the Markov Decision Process (MDP), a mathematical framework that models decision-making in environments filled with uncertainty. An MDP consists of a set of states representing the different situations an agent can encounter, the actions the agent can take, and transition probabilities that dictate the likelihood of moving between states. Additionally, a reward function provides feedback to the agent, helping it learn which actions lead to favorable outcomes. A key aspect of MDPs is the discount factor, which determines how much future rewards influence the agent's decisions, favoring either short-term or long-term gains. Another important property of an MDP is the Markov property: the prediction of the next state depends only on the current state and action, independent of past states and actions.

Photo taken from https://people.stfx.ca/

Another key concept in RL is the Multi-Armed Bandits (MAB) problem, which captures the trade-off between exploration and exploitation. In this framework, an agent repeatedly chooses from K possible actions (or arms) to maximize cumulative rewards over time, even though the reward distribution for each action is unknown. The agent must balance exploring new options to gather information and exploiting the best-known choice for immediate benefit. Unlike supervised learning, which provides direct feedback on correct decisions, RL uses evaluative feedback that only reflects the effectiveness of the chosen actions.
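To make the exploration-exploitation trade-off concrete, here is a minimal epsilon-greedy sketch in Python; the arm probabilities, epsilon value, and number of pulls are made-up values for illustration only, not part of any library used later in this article.

import numpy as np

rng = np.random.default_rng(42)

# Hypothetical 3-armed bandit: each arm pays 1 with an unknown probability
true_probs = [0.3, 0.5, 0.7]    # unknown to the agent, assumed for illustration
n_arms = len(true_probs)

q_estimates = np.zeros(n_arms)  # estimated value of each arm
pull_counts = np.zeros(n_arms)  # how many times each arm was tried
epsilon = 0.1                   # probability of exploring a random arm

for step in range(1000):
    if rng.random() < epsilon:
        arm = int(rng.integers(n_arms))     # explore: pick a random arm
    else:
        arm = int(np.argmax(q_estimates))   # exploit: pick the best-known arm

    reward = float(rng.random() < true_probs[arm])  # sample a Bernoulli reward

    # Incremental average update of the action-value estimate
    pull_counts[arm] += 1
    q_estimates[arm] += (reward - q_estimates[arm]) / pull_counts[arm]

print("Estimated arm values:", np.round(q_estimates, 2))

Even in this tiny example, the estimates for the better arms improve only because the agent occasionally explores instead of always exploiting its current best guess.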
Design

Photo taken from https://lilianweng.github.io/

The vanilla framework of Reinforcement Learning involves an Agent, the actor or decision-maker, operating within an Environment, the world or system that defines the rules under which the agent can operate. Think of it as a game where the player is the Agent and the Environment is the confines of that game. The Agent is bounded by the rules and design of the game: it cannot think outside the box, and no cheat codes (!!!). As the player or Agent plays the game, it performs an Action (a, A) from the set of possible moves to interact with the Environment. Performing an action leads to a State (s, S), a specific condition or configuration of the Environment at a given time as perceived by the Agent. As we play the game, we usually aim for an objective in order to progress; this is called the Reward (r, R). Defined as the feedback or result from the Environment based on the Agent's action, it tells the Agent how good or bad the action was. These are the basic parts of a vanilla or basic RL framework.

Now we go deeper into the framework of RL. The main goal of an RL problem is to find the optimal strategy, the way of choosing Actions across States that the Agent must follow in order to maximize the Rewards. This mapping from States to Actions is called the Policy (π). Think of it as the strategy for achieving things like the highest score, finishing the game, or even just trolling around and achieving nothing.

A simple RL model could work with this framework alone and let the Agent solve the Environment by trial and error, simulating all the possible combinations there are in order to identify the Policy or Policies that maximize the Rewards. However, depending on the complexity of the Environment, there can be too many combinations of Actions and States, which becomes computationally expensive and time consuming. This is why algorithms were added on top of the initial framework to solve this dilemma.

First, in order to measure the quality of a Policy, we need a quantitative measure of the expected return of the agent being in a certain state. This is called a Value Function, and it is derived from the Bellman equation, which expresses the value of a state (or state-action pair) in terms of the expected immediate reward plus the discounted value of the next state (or next state-action pair). In RL, the Value Function can be divided into two broad categories: the State Value Function and the Action Value Function.

Equation for the State Value Function (standard form):

V^{\pi}(s) = \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \mid S_t = s\right]

The State Value Function represents the expected cumulative reward an agent can achieve starting from a specific State and following a given Policy. It is crucial for evaluating deterministic policies or when understanding the value of being in a particular state is required.

Equation for the Action Value Function (standard form):

Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \mid S_t = s, A_t = a\right]

On the other hand, the Action Value Function represents the expected cumulative reward an agent can achieve from a given State by taking a specific Action and following a given Policy thereafter. It is mainly used to evaluate and compare the potential of different actions taken in the same state. Action values are crucial for action selection, where the goal is to determine the most appropriate action for each situation. Because action-value functions take into account the expected return of different actions, they are particularly useful in environments with stochastic policies.

Equations for the Optimal State Value Function and Action Value Function (Bellman Optimality Equations):

V^{*}(s) = \max_{a} \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma V^{*}(s')\right]

Q^{*}(s, a) = \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma \max_{a'} Q^{*}(s', a')\right]

Solving an RL task involves identifying a Policy that maximizes long-term Rewards, which follows the Bellman Optimality Equation above. The equation also reflects the probabilistic nature of RL: the transition to a next State with a certain Reward is governed by a probability conditioned on the current State and chosen Action. This equation serves as the baseline for developing the RL algorithms and models currently used in the field.
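To see how the Bellman Optimality Equation is actually used, here is a short value-iteration sketch on a tiny, made-up two-state MDP; the transition probabilities, rewards, and discount factor are invented purely for demonstration and do not come from any environment discussed later.

import numpy as np

# Hypothetical MDP: 2 states, 2 actions, invented for illustration only.
# P[s][a] is a list of (probability, next_state, reward) transitions.
P = {
    0: {0: [(0.8, 0, 0.0), (0.2, 1, 1.0)],
        1: [(0.1, 0, 0.0), (0.9, 1, 1.0)]},
    1: {0: [(1.0, 1, 2.0)],
        1: [(0.5, 0, 0.0), (0.5, 1, 2.0)]},
}
gamma = 0.9              # discount factor
V = np.zeros(len(P))     # initial state values

# Value iteration: repeatedly apply the Bellman optimality backup
for _ in range(200):
    for s in P:
        V[s] = max(
            sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
            for a in P[s]
        )

# Greedy policy with respect to the converged values
policy = {
    s: max(P[s], key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
    for s in P
}
print("Optimal state values:", np.round(V, 2))
print("Greedy policy:", policy)

Each sweep of the loop is one application of the Bellman optimality backup; once the values stop changing, acting greedily with respect to them gives an optimal policy for this toy MDP.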
Playbook

There are many models and methods currently developed in the field of Reinforcement Learning. To keep things brief, we will go through the general classifications of the models to give an overview of how things are defined.

Photo taken from https://www.sciencedirect.com

One classification of Reinforcement Learning is Model-free versus Model-based Methods. As the figure above indicates, Model-free Methods determine the optimal policy or value function directly, without building a model of the environment. In this framework, the Agent learns only from the Observations, Actions, and Rewards it experiences in the Environment (experience-based learning). This makes it a straightforward and flexible approach, especially for complex Environments where understanding the system's dynamics is difficult or impractical. However, it often requires a massive number of interactions with the Environment, making it computationally expensive and slower to learn. [4]

On the other hand, Model-based Methods build a representation of how the environment behaves and use it for planning and improving decision-making. The process involves explicitly learning or using a model of the environment's dynamics instead of relying only on direct experience. This makes Model-based Methods significantly more sample-efficient, as they allow for planning and strategic decision-making rather than pure trial and error. The challenge and downside is learning an accurate model of the environment: if the model is imperfect or inaccurate, the agent may make poor decisions based on incorrect predictions.

Photo taken from https://github.com/

Another classification is in terms of how a Policy is updated based on interaction with the Environment. Here we can distinguish Online, Off-policy, and Offline Reinforcement Learning. First, Online RL is a dynamic learning approach where an agent continuously interacts with the environment, takes Actions, and updates its Policy based on real-time feedback. This method allows the Agent to adapt quickly to changes in the Environment, making it suitable for tasks where conditions are unpredictable. However, since learning happens through direct interaction, Online RL often requires a large number of trials, making it computationally expensive and inefficient for complex problems.

Unlike Online RL, Off-policy RL does not rely solely on real-time interactions. Instead, it allows agents to learn from previously collected data, making training more sample-efficient. This approach enables the agent to improve its Policy using experiences generated by other Policies or past iterations. While Off-policy RL provides flexibility and efficiency, it also introduces challenges such as distribution mismatch, where the data used for training may not fully align with the Optimal Policy being learned.

Offline RL, also known as Batch RL, takes this a step further by training Policies exclusively from pre-collected datasets, without any interaction with the Environment. This makes it highly valuable in situations where real-world data collection is costly, dangerous, or impractical, such as healthcare, robotics, and autonomous driving. Since Offline RL lacks direct interaction with the environment, it faces difficulties in generalizing to new situations and in avoiding biases in the dataset.

Again, this is only a glimpse of the diverse models in the field of Reinforcement Learning; a small sketch of one classic method follows below.
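To give a small taste of one classic model-free, off-policy method, here is a sketch of tabular Q-learning on the small FrozenLake-v1 environment from gymnasium; the hyperparameters (learning rate, discount factor, epsilon, episode count) are illustrative choices, not tuned values.

import gymnasium as gym
import numpy as np

# Tabular Q-learning on a small discrete environment
env = gym.make("FrozenLake-v1", is_slippery=False)
n_states = env.observation_space.n
n_actions = env.action_space.n

Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.99, 0.1   # assumed, untuned hyperparameters
rng = np.random.default_rng(0)

for episode in range(5000):
    state, _ = env.reset()
    done = False
    while not done:
        # Epsilon-greedy action selection (the behavior policy)
        if rng.random() < epsilon:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(Q[state]))

        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated

        # Off-policy update: bootstrap from the greedy (target) policy
        Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
        state = next_state

env.close()
print("Greedy action per state:", np.argmax(Q, axis=1))

The update bootstraps from the greedy max over the next state's action values while the agent behaves epsilon-greedily, which is exactly what makes Q-learning off-policy and model-free.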
Extensive discussions are needed to understand the ins and outs of each algorithm. But for now, let us move on to showcasing and visualizing how RL works.

Simulation

Now that we know the basics of Reinforcement Learning, we can proceed to applying and simulating an RL problem. The sections below include an overview of the libraries being used, initializing an Environment, simulating an Action, and training an Agent on two different Environments: CartPole and Atari Breakout.

In this code walkthrough, the main libraries used are gymnasium, which provides an Application Programming Interface (API) standard for reinforcement learning with a diverse collection of reference Environments, and Stable Baselines3 (SB3), which contains a set of reliable implementations (i.e., algorithms and wrappers) of reinforcement learning algorithms in PyTorch. Other modules are used for navigating the file directory, ensuring compatibility, and visualizing the results by rendering videos of the Environment simulations.

# File Directory
import glob
import io
import base64
import os
import shutil

# RL
import gymnasium as gym
from stable_baselines3 import PPO  # Algorithm, check docs for others
from stable_baselines3.common.vec_env import DummyVecEnv  # Wrapper for the env

# For Rendering Video in Colab
from gymnasium.wrappers import RecordVideo
from IPython.display import HTML
from pyvirtualdisplay import Display
from IPython import display as ipythondisplay
import matplotlib.pyplot as plt

# Compatibility
import numpy as np
np.bool8 = np.bool_

CartPole Level

env_name = "CartPole-v1"
environ = gym.make(env_name, render_mode="rgb_array")

The first part is initializing the Environment of the RL problem. We can choose from the different Environments in the gymnasium documentation as well as third-party Environments. Note that each environment has different dependencies, so check the documentation first. For the first simulation, we select CartPole-v1, where the task is to balance a pole attached to a cart by moving the cart left or right. The goal is to balance the pole for as long as possible, with the Environment's threshold set to 500 frames to ensure that an episode (a trial/run of the game) will not be too long.

environ = gym.make(env_name, render_mode="rgb_array")
env = RecordVideo(environ, video_folder="./video", disable_logger=True, video_length=1000)

for episode in range(5):
    obs, info = env.reset()
    done = False
    score = 0
    while not done:
        action = env.action_space.sample()  # Generate random action
        obs, reward, terminated, truncated, info = env.step(action)  # Proceed with the generated action
        score += reward
        done = terminated or truncated  # Ensure loop ends properly
        print([action, obs, reward, terminated])
        print(f'Action: {action}')
        print(f'State: {obs}')
        print(f'Reward: {reward}')
    print(f'Episode: {episode} Total Score: {score}\n')
env.close()

To visualize the CartPole Environment, we can simulate an episode by choosing random Actions, dictated by env.action_space.sample(), and check what happens. If we look at the output of the code, we can see that the Action can either be 0 (move the cart left) or 1 (move it right). For the State, as described in the documentation, it is an array of length 4 with the elements (in sequence) being cart position, cart velocity, pole angle, and pole angular velocity. The third part is the Reward, which is 1 for every step the pole stays balanced; the episode ends once the pole exceeds the threshold angle of +12 or -12 degrees.
Lastly, at the end of each Episode, we tally how long the pole stayed balanced across the cart's movements and get the total Reward.

# Opening Video of Policy in Colab
# Similar to env.render()
def show_video(path='video/*.mp4'):
    mp4list = glob.glob(path)
    if len(mp4list) > 0:
        mp4 = mp4list[0]
        video = io.open(mp4, 'r+b').read()
        encoded = base64.b64encode(video)
        ipythondisplay.display(HTML(data='''<video alt="test" autoplay loop controls style="height: 400px;">
            <source src="data:video/mp4;base64,{0}" type="video/mp4" />
        </video>'''.format(encoded.decode('ascii'))))
    else:
        print("Could not find video")

show_video()

CartPole-v1 Episode with random Actions

As we can see in the rendered video above, the pole was balanced for about 2 seconds before the game was terminated for exceeding the threshold angle. Note that this is the result of taking random Actions at every step, which means the Policy is not optimal. With this, we can proceed to training the Agent by simulating multiple Episodes or trials so the Agent learns to approach the Environment better than by taking random Actions.

# Wrap the environment into a dummy vectorized environment (for compatibility purposes)
env = DummyVecEnv([lambda: gym.make(env_name)])  # a fresh CartPole instance for training

# Defining the agent (policy, environment, log path)
log_path = os.path.join('training', 'logs')  # directory for TensorBoard logs (not shown in the original snippet)
model = PPO('MlpPolicy', env, tensorboard_log=log_path, verbose=1)
model.learn(total_timesteps=20000)  # Timesteps depend on the complexity of the environment

Training the model or the Agent requires the Environment to be wrapped into a vectorized Environment, which ensures compatibility with the code and allows us to train on multiple copies of the Environment per step to speed up the training process. Next is defining the algorithm for the Policy, which here is the default Proximal Policy Optimization (PPO), whose main idea is that after an update, the new policy should not be too far from the old policy. As for MlpPolicy, it is the base policy network to use, and the right choice depends on the Environment; MlpPolicy is used with low-dimensional vector observations. Next, we make the Agent learn the Environment by simulating multiple Policies, setting a maximum number of timesteps to cap the learning process. Given that this is a simple RL problem, 20,000 timesteps is roughly enough to achieve high Rewards. Running model.learn() outputs the state of the training, showing multiple metrics (losses, variance, deviation) on how the training is performing.

from stable_baselines3.common.evaluation import evaluate_policy  # Testing/Validation

# Trained agent
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=100)
print('Trained Model')
print(f"mean_reward:{mean_reward:.2f} +/- {std_reward:.2f}")

# Untrained (base) agent, for comparison
model2 = PPO('MlpPolicy', env, tensorboard_log=log_path, verbose=1)
mean_reward2, std_reward2 = evaluate_policy(model2, env, n_eval_episodes=100)
print('Base Model')
print(f"mean_reward:{mean_reward2:.2f} +/- {std_reward2:.2f}")

The next part is evaluating how the Agent performs after learning the Environment and simulating multiple Policies. As indicated above, the Agent now achieves the threshold Reward of 500, which means it has learned the optimal Policy for the Environment.
If we compare the trained model with the base model, we can see the major improvement in the achieved Reward, going from an average of around 33 to the maximum of 500 (with a standard deviation of 0).

folder_path = "/content/video_test/"  # Change to your folder path
shutil.rmtree(folder_path)

env_name = 'CartPole-v1'
environ = gym.make(env_name, render_mode="rgb_array")
env = RecordVideo(environ, video_folder="/content/video_test", disable_logger=True, video_length=1000)

for episode in range(2):
    obs, info = env.reset()
    done = False
    score = 0
    while not done:
        action, _ = model.predict(obs)
        obs, reward, terminated, truncated, info = env.step(action)  # Proceed with the predicted action
        score += reward
        done = terminated or truncated  # Ensure loop ends properly
        # print([action, obs, reward, terminated])
    print(f'Episode: {episode} Score: {score}\n')
env.close()

show_video('video_test/*.mp4')

A CartPole-v1 Episode after Training with 20,000 Timesteps

Visualizing the results of the training, we can see in the rendered video above that the pole stays balanced throughout the simulation. The video lasts around 10 seconds, which is the maximum duration set by the Environment (500 frames). This is basically how a Reinforcement Learning workflow goes: initialize an Environment, train the Agent with a selected algorithm for a chosen number of timesteps, then check the results of the training.

Breakout Level

Next, we proceed to a more difficult Environment: Breakout, a famous Atari game. The dynamics of the Environment are similar to Pong: moving a paddle to direct the ball toward the brick wall at the top of the screen. The goal is to destroy as many bricks as possible, if not all of them, before the ball reaches the bottom of the screen.

from stable_baselines3 import A2C
from stable_baselines3.common.vec_env import VecFrameStack
from stable_baselines3.common.env_util import make_atari_env
import ale_py

gym.register_envs(ale_py)

env_atari = gym.make('ALE/Breakout-v5', render_mode="rgb_array")
env = RecordVideo(env_atari, video_folder="/content/atari", disable_logger=True, video_length=1000)

for episode in range(5):
    obs, info = env.reset()
    done = False
    score = 0
    while not done:
        action = env.action_space.sample()  # Generate random action
        obs, reward, terminated, truncated, info = env.step(action)  # Proceed with the generated action
        score += reward
        done = terminated or truncated  # Ensure loop ends properly
        # print([action, obs, reward, terminated])
        print(f'Action: {action}')
        print(f'State: {obs}')
        print(f'Reward: {reward}')
    print(f'Episode: {episode} Score: {score}\n')
env.close()

Again, we initialize the Environment (check the documentation to ensure compatibility and install the necessary dependencies) and simulate an Episode using random Actions. For this game, there are four Actions: 0 for no action, 1 to fire the ball (to start the game), 2 to move the paddle right, and 3 to move the paddle left. The State is an observation space of Box(0, 255, (210, 160, 3), np.uint8), the RGB pixel values of the Environment. The Reward is given when a brick is destroyed in a specific state. Again, the goal is to destroy as many bricks as possible before game over.

show_video('atari/*.mp4')

Breakout Episode with Random Actions

Looking at the results of the episode with random Actions, we can see that the agent does not follow the trajectory of the ball (as expected, given the Actions are random). It got lucky at the end and moved the paddle to hit the ball twice, breaking two bricks and scoring two points.
Again, we need to train the Agent and make it learn by simulating different Policies.

env_atari = make_atari_env('ALE/Breakout-v5', n_envs=4, seed=0)
env_atari_vec = VecFrameStack(env_atari, n_stack=4)

# Reset environment to get initial frames
obs = env_atari_vec.reset()

# Capture a frame from each environment
frames = env_atari.get_images()  # Returns a list of 4 frames (one per env)

# Create a 2x2 grid to display the frames
fig, axes = plt.subplots(2, 2, figsize=(10, 10))
for i, ax in enumerate(axes.flat):
    ax.imshow(frames[i])  # Display the frame for each env
    ax.axis("off")
    ax.set_title(f"Env {i+1}")
plt.tight_layout()
plt.show()

Different Breakout Instances to be Stacked During Training

For the training, we will be running four separate instances of the Environment at the same time. These instances run in parallel, speeding up training by processing multiple game states at once. One more thing to note: in this Environment, the trajectory of the ball matters, so the paddle can be moved toward where the ball is falling. A single frame is not enough to tell where the ball is going, which is why we stack consecutive frames (handled by VecFrameStack).

model_atari = A2C('CnnPolicy', env_atari_vec, verbose=1, tensorboard_log=log_path)
model_atari.learn(total_timesteps=500000, log_interval=10000)

For the Atari Breakout problem, we will be using A2C (Advantage Actor-Critic), an algorithm that combines value-based and policy-based approaches. The CnnPolicy is required for image-based observations such as Breakout. Next, we train the Agent for 500,000 timesteps, simulating different Policies and updating them using the chosen algorithm. As indicated along with the different training metrics, it took around 30 minutes to complete the 500,000 timesteps, and that is with the process already parallelized across four instances.

from stable_baselines3.common.vec_env import VecVideoRecorder  # needed for recording from a vectorized env

mean_reward, std_reward = evaluate_policy(model_atari, env_atari_vec, n_eval_episodes=20)
print('Trained Model')
print(f"mean_reward:{mean_reward:.2f} +/- {std_reward:.2f}")

folder_path = "/content/atari_test/"  # Change to your folder path
shutil.rmtree(folder_path)

# env_atari = gym.make('ALE/Breakout-v5', render_mode="rgb_array")
env_atari = make_atari_env('ALE/Breakout-v5', n_envs=1, seed=13)
env = VecFrameStack(env_atari, n_stack=4)
env = VecVideoRecorder(env, video_folder="/content/atari_test", record_video_trigger=lambda x: x == 0, video_length=1000)

for episode in range(5):
    obs = env.reset()
    done = False
    score = 0
    while not done:
        action, _ = model_atari.predict(obs)
        obs, reward, dones, info = env.step(action)  # Vectorized envs return arrays, one entry per env
        score += reward[0]
        done = dones[0]  # Ensure loop ends properly (single env in the vector)
        # print([action, obs, reward, dones])
    print(f'Episode: {episode} Score: {score}\n')
env.close()

show_video('atari_test/*.mp4')

A Breakout Episode after Training with 500,000 Timesteps

Looking at the results of the training, we achieved an average reward of 23 destroyed bricks. Looking at the sample simulation of the trained model, we can see that the movement of the paddle is now slightly coordinated with the trajectory of the ball (although it failed badly after one successful life). With the maximum score for Atari Breakout being 432, imagine how many timesteps would be needed to train the model for the Agent to reach that maximum.
This highlights a dilemma in RL: it is computationally expensive, and even a simple Environment can take a very long time to train before the optimal Policy is found.

Next Steps

In the walkthrough above, we simulated a Reinforcement Learning problem and trained an Agent through a trial-and-error process. There are multiple directions to take from here, especially if we want to achieve more Reward, as in the Atari Breakout game where the agent only scored 23 on average after training. The most straightforward option is increasing the number of timesteps to millions to let the Agent experience more policies. This will take a very long time but will generally give better results. Other options, similar to traditional supervised learning, are hyperparameter tuning and algorithm selection. Each algorithm has its own strengths, so it is important to check the papers of the high-performing ones (if not all). Each algorithm also has its own parameters, such as the learning rate, that can be tweaked for each Environment to improve performance.

Another exploration is, instead of using model-free and on-policy methods, to use model-based, off-policy, or offline RL methods and check how they fare on the specific Environment. There are many branches of RL in terms of models and algorithms, so it is best to get to know them.

Lastly, in my opinion, the best way to understand RL, and to specialize in it, is to learn to create a custom Environment: defining the specifics, the rules, the Agent, what Actions it can take, and how to score the Rewards. Knowing how this works lets us be creative enough to apply RL in different situations; a minimal example of what this can look like is sketched below.
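To illustrate what defining those specifics can look like, here is a minimal, purely illustrative custom gymnasium Environment: a made-up coin-guessing game that is not part of the notebook above, with toy choices for the action space, observation space, and reward rule.

import gymnasium as gym
from gymnasium import spaces

class CoinFlipEnv(gym.Env):
    """Toy custom Environment (invented for illustration): guess a hidden coin.
    The Agent picks 0 or 1 and gets a Reward of 1 for a correct guess."""

    def __init__(self):
        super().__init__()
        self.action_space = spaces.Discrete(2)       # guess heads (0) or tails (1)
        self.observation_space = spaces.Discrete(1)  # nothing useful to observe
        self._steps = 0

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self._steps = 0
        return 0, {}  # observation, info

    def step(self, action):
        coin = self.np_random.integers(2)            # hidden coin flip
        reward = 1.0 if action == coin else 0.0
        self._steps += 1
        terminated = False
        truncated = self._steps >= 10                # an episode lasts 10 guesses
        return 0, reward, terminated, truncated, {}

env = CoinFlipEnv()
obs, info = env.reset(seed=0)
print(env.step(env.action_space.sample()))

Once an Environment follows this reset/step interface and declares its spaces, the same SB3 algorithms used in the walkthrough can, in principle, be trained on it.

Conclusion

Reinforcement Learning is normally not as popular or widely used as our typical supervised and unsupervised machine learning worlds. However, it is a vast and quickly evolving field and, like those two, it provides insights that are new even to domain experts. RL is already being applied in different fields such as Robotics, where instead of investing a lot of money in hardware for testing, RL can run the simulations; Gaming, where AI is used in different aspects such as game content, game testing, and strategies; and Autonomous Driving, which enables vehicles to learn optimal behaviors to ensure safety and efficiency. As part of the education sector, it is noteworthy to highlight the application of RL in creating customized curricula to maximize the learning and motivation of students going through classes.

Again, this is only part 1 of a Reinforcement Learning series from someone who started with minimal knowledge of the field. Hopefully this becomes a road to specializing in RL, or at least in a specific area of the field. This is only the start of uncovering the ins and outs of the Playground of AI.

Python notebook for the scripts provided: https://github.com/redvjames/RL_sandbox (Tested in Google Colab)

References

[1] Ghasemi, M., Moosavi, A. H., Sorkhoh, I., Agrawal, A., Alzhouri, F., & Ebrahimi, D. (2024). An introduction to reinforcement learning: Fundamental concepts and practical applications. arXiv preprint arXiv:2408.07712. https://doi.org/10.48550/arXiv.2408.07712

[2] Naeem, M., Rizvi, S. T. H., & Coronato, A. (2020). A gentle introduction to reinforcement learning and its application in different fields. IEEE Access, 8, 209320-209344. https://doi.org/10.1109/ACCESS.2020.3038605

[3] Ahilan, S. (2023). A succinct summary of reinforcement learning. arXiv preprint arXiv:2301.01379. https://doi.org/10.48550/arXiv.2301.01379

[4] AlMahamid, F., & Grolinger, K. (2021, September). Reinforcement learning algorithms: An overview and classification. In 2021 IEEE Canadian Conference on Electrical and Computer Engineering (CCECE) (pp. 1-7). IEEE. https://doi.org/10.1109/CCECE53047.2021.9569056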