
Quick Start Robotics and Reinforcement Learning with MuJoCo
towardsai.net
Author(s): Yasin Yousif Originally published on Towards AI. A starter tutorial of its basic structure, capabilities, and main workflowImages are rendered from xml source in menagerie repo under BSD-3-Clause for Trossen, and Apache license for Franka and ApptronikMujoCo is a physics simulator for robotics research developed by Google DeepMind and written in C++ with a Python API. The advantage of using MujoCo lies in its various implemented models along with full dynamic and physics properties, such as friction, inertia, elasticity, etc. This realism allows researchers to rigorously test reinforcement learning algorithms in simulations before deployment, mitigating risks associated with real-world applications. Simulating exact replicas of robot manipulators is particularly valuable, as it enables training in a safe virtual environment and then seamless transition to production. Notable examples of these models include popular brands like ALOHA, FRANKA, and KUKA readily available within MujoCo.Table of Content:OverviewMJCF FormatThe TaskContinuous Proximal Policy OptimizationTraining ResultsConclusionOverviewBeyond the core MujoCo library (installable via pip install mujoco), two invaluable packages enhance its capabilities: dm_control (https://github.com/google-deepmind/dm_control) and mujoco_menagerie (https://github.com/google-deepmind/mujoco_menagerie).mujoco_menagerie offers a wealth of open-source robot models in .xml format, simplifying the simulation of complex systems. These models encompass diverse designs, as illustrated in the image above.In dm_control, (installable also with pip: pip install dm_control with its own version of MujoCo), a very useful code base is provided for creating Reinforcement Learning pipelines from MujoCo models as environments classes with suitable .step(), .reward() methods. These pipelines are available via its suite subpackage, and are intended to serve as benchmarks, on which different proposed Reinforcement learning methods can be evaluated and compared. Therefore, it should not be alerted when used for that purpose.These benchmarks can be shown by running the following:# Control Suitefrom dm_control import suitefor domain, task in suite.BENCHMARKING: print(f'{domain:<{max_len}} {task}')which will give the following domains and tasks among others:Additionally dm_control allow the manipulation of the MJCF models of the entities from within the running script, utilizing its PyMJCF subpackage. Therefore, the user doesn't need to change the xml files in order to add new joints, or replicate a certain structure for example.MJCF MuJoCoMuJoCo XML Configuration File Format MJCFTo show a working example of an MJCF file, we will review the car.xml source code available with MujoCo github repository here. It basically exhibits a triple-wheel toy vehicle, with two front lights, where it has two main degrees of freedom DoFs: Forward-Backward and Left-Right movements.By taking a look at the first part of the code we note that the code always is between <mujoco> ..</mujoco> tags. We also note the compiler tag defines what <compiler> is used (Euler by default) and allow setting its options.<mujoco> <compiler autolimits="true"/>Next, as some objects in the model may need its own customized texture and geometric shape unlike standard shapes such as spheres and boxes, the <texture>, <material> and <mesh> tags can be utilized as follows. We note also in the <mesh> tag that the exact points coordinates are provided in the vertex option, where each row represent a point in the surface.<asset> <texture name="grid" type="2d" builtin="checker" width="512" height="512" rgb1=".1 .2 .3" rgb2=".2 .3 .4"/> <material name="grid" texture="grid" texrepeat="1 1" texuniform="true" reflectance=".2"/> <mesh name="chasis" scale=".01 .006 .0015" vertex=" 9 2 0/> -10 10 10 9 -2 0 10 3 -10 10 -3 -10 -8 10 -10 -10 -10 10 -8 -10 -10 -5 0 20" </asset>The <default> tag is helpful to set some default values for certain classes, like the wheelclass which will be always of certain shape, size and color (defined with type, size, and rgba respectively)<default> <joint damping=".03" actuatorfrcrange="-0.5 0.5"/> <default class="wheel"> <geom type="cylinder" size=".03 .01" rgba=".5 .5 1 1"/> </default> <default class="decor"> <site type="box" rgba=".5 1 .5 1"/> </default></default>The first body in Mujoco models is always is the <worldbody> with the order of 0, as a parent object for all the other bodies in the model. Since we have only one car, the only children body of it should be car.Within each body we can define its children of other bodies, geometries, joint or lights, with <body>, <geom>, <joint>, <light> tags respectively.This is shown in the next snippet, in which we note the options of name, class and pos among others, which define the unique name, the defined class in default and the initial position of the parent tag respectively.<worldbody> <geom type="plane" size="3 3 .01" material="grid"/> <body name="car" pos="0 0 .03"> <freejoint/> <light name="top light" pos="0 0 2" mode="trackcom" diffuse=".4 .4 .4"/> <geom name="chasis" type="mesh" mesh="chasis" rgba="0 .8 0 1"/> <geom name="front wheel" pos=".08 0 -.015" type="sphere" size=".015" condim="1" priority="1"/> <light name="front light" pos=".1 0 .02" dir="2 0 -1" diffuse="1 1 1"/> <body name="left wheel" pos="-.07 .06 0" zaxis="0 1 0"> <joint name="left"/> <geom class="wheel"/> <site class="decor" size=".006 .025 .012"/> <site class="decor" size=".025 .006 .012"/> </body> <body name="right wheel" pos="-.07 -.06 0" zaxis="0 1 0"> <joint name="right"/> <geom class="wheel"/> <site class="decor" size=".006 .025 .012"/> <site class="decor" size=".025 .006 .012"/> </body> </body></worldbody>As the car can move in any direction including jumping and flipping with respect to the ground floor, it gets <freejoint/> tag with 6 DoFs. While each of its wheels: right and left wheels, get one DoF, along its previously defined axis with the zaxis="0 1 0"option the yaxis.The active control handles in MujoCo are defined with the <tendon> tag, defining group of joints as the <fixed> tag, and then with the <actuator> tag, to define the exact name and control range of the motor <tag>. As in the following code.<tendon> <fixed name="forward"> <joint joint="left" coef=".5"/> <joint joint="right" coef=".5"/> </fixed> <fixed name="turn"> <joint joint="left" coef="-.5"/> <joint joint="right" coef=".5"/> </fixed></tendon><actuator> <motor name="forward" tendon="forward" ctrlrange="-1 1"/> <motor name="turn" tendon="turn" ctrlrange="-1 1"/></actuator>This system of tendons conveniently control the car, as we can control the linear movement of the car, with the "forward" tendon, having forward displacement of 0.5 for both wheels, and the turning movement with the "turn" tendon, having opposite directions of displacement for each of the wheels, which physically will make the car turn.The degree of displacement is controlled with both of the defined motors, by multiplying their values with the coef values of the tendons.Lastly, the <sensor> tag defines the joints that should read, as generalized displacements value on its DoF.<sensor> <jointactuatorfrc name="right" joint="right"/> <jointactuatorfrc name="left" joint="left"/> </sensor></mujoco>The TaskIn order to train and run the reinforcement learning agent to control the car, we must set a clear purpose of the intended behavior. For example we may aim to make the car take a circular path or drive towards a fixed, but unknown position.For this example, we will define a reward so that the car drive from its initial position A=(0,0,0) towards B=(-1,4,0). This point is somehow to the left of the car, so it has to turn as well as drive in straight line, as shown below.Made by authorFor this task, we must define a reward function in relation to the euclidean distance between the current position of the car and the target position. We choose to take the exponent of the negative distance np.exp(-np.linalg.norm(A,B)) to represent this reward so that the values are always in the range [0,1].Continuous Proximal Policy OptimizationAs we noted in the XML file, the range of the actuators values is continuous, from -1 to 1. This means that the action space is continuous too; therefore, the training algorithm should be suitable to address these scenarios.This means that algorithms like DQN will not be suitable, as it can only be applied to discrete action spaces. However, actor critic methods, like PPO, can still be used to train models of continuous action space.The PPO code used here for this task is based on CleanRL single-file implementation for the continuous PPO, but with modifying some parameters and replacing the environment with our newly written environment enveloping the previous MujoCo model.Practically we train for 2e6 steps, with 2500 steps per episode. As the default update rate in MujoCo is 2ms, then 2500 steps translates to 5 seconds.It is worth noting that the discrete PPO update formulas are the same for the continuous case, expect for the type of the output distribution in the policy model, where it will be categorical Categorical in the discrete case, and Gaussian Normal, or any other continuous distribution in the continuous case.Next we will show the used environment for stepping and simulating the MujuCo model, which will be used for the training program of PPO.Training EnvironmentAs we will be using the main MujoCo package (not dm_control), we will do the following imports:import mujocoimport mujoco.viewerimport numpy as npimport timeimport torchWe then define the init method of the environment class, in which:The XML file of the model is loaded with the command: mujoco.MjModel.from_xml_path(), which will result in the model structure containing the geometries and constants such as time-steps, and gravity constant in model.opt.The data structure are loaded from the model structure with the command data = mujoco.MjData(model). In this structure, the current state values (like generalized velocity data.qvel, generalized position data.qpos, as well as actuator values data.ctrl), can be read and set.Duration is 5 seconds, which can be mapped to the simulation time by delaying it in specific amount, as the simulation is usually much faster. For example 5 seconds maybe simulated in 0.5 seconds.Rendering: if the render variable is set to True. A viewer GUI will be initialized with mujoco.viewer.launch_passive(model,data). The passive feature is needed so that the GUI doesn't block the code execution. However, it will be updated to the recent values in data when viewer.sync() is called, and it should be closed with viewer.close()class Cars(): def __init__(self, max_steps = 3*500, seed=0,render=False): self.model = mujoco.MjModel.from_xml_path('./car.xml') self.data = mujoco.MjData(self.model) self.duration = int(max_steps//500) self.single_action_space = (2,) self.single_observation_space = (13,) self.viewer = None self.reset() if render: self.viewer = mujoco.viewer.launch_passive(self.model, self.data)In reset() method, the data structure will be rested based on the original model using mujoco.mj_resetData.Here we can choose the shape of the state we will be using to solve our problem. We note as the task is only moving in 2D, therefore we need the current Cartesianposition of the car data.body('car').xpos, in addition to its orientation data.body('car').xquat, and lastly the velocity data.body('car').cvel also maybe helpful to judge if we want to accelerate of decelerate.Note that data.body() or data.geom() allow named access to these objects as defined in the XML file, or even by their index number , where 0 always indicate the worldbody.def reset(self): mujoco.mj_resetData(self.model, self.data) self.episodic_return = 0 state = np.hstack((self.data.body('car').xpos[:3], self.data.body('car').cvel, self.data.body('car').xquat)) if self.viewer is not None: self.viewer.close() self.viewer = mujoco.viewer.launch_passive(self.model, self.data) return stateAs our task is to reach the point [-1,4], then our reward can be as simple as the distance between the current position and the destination. However, taking exp(-distance) seems more suitable since it will restrict the rewards values to the range [0,1], which can lead to better stability in learning.As mentioned previously all we have to do to synchronize the change to the viewer window is to invoke the command self.viewer.sync().def reward(self, state, action): car_dist = (np.linalg.norm(np.array([-1,4]-state[:2]))) return np.exp(-((car_dist)))def render(self): if self.viewer.is_running(): self.viewer.sync()def close(self): self.viewer.close()In the step() routine, the actual model will be updated. First by setting the current action of the forward and turning movements in the data.ctrl. But it is noted that the action is transformed with the np.tanh() which has the output range of [-1,1]. This will allow the neural network of the policy to be trained on the full range [-Inf, Inf] for its output vector, which is easier to represent, as smaller values may get rounded during training.We additionally keep count of the episodic return, and handle the terminal case, by resting the environment.def step(self, action): self.data.ctrl = np.tanh(action) mujoco.mj_step(self.model, self.data) state = np.hstack((self.data.body('car').xpos[:3], self.data.body('car').cvel, self.data.body('car').xquat)) reward = self.reward(state, np.tanh(action)) self.episodic_return += reward done = False info = {} if self.data.time>=self.duration: done = True info.update({'episode':{'r':self.episodic_return,'l':self.data.time}}) info.update({"terminal_observation":state.copy()}) state = self.reset() return state, reward, done, infoThis finished the main environment class of the car model. It is not that complicated or hard to write. However, in dm_control a customized environments and pipelines with various tools are already available and ready to be used for training RL agents. An extensive topic, that is left for explorations in future posts.Training ResultsAfter training the PPO program with the previous environment and using a suitable agent network, we record the following training curve for the episodic return.Made by authorWe can see that the model is clearly learning, albeit slowly. There you have it. Your first simulated and controlled RL agent with MujoCo.However, we still need to see it in action: does the robot really move towards point [-1,4]? To do that we need to run the following testing program with the render variable set to True.def main(): duration = 5 env = Cars(max_steps=duration*500,render=True) #2000000 is the training iterations policy = torch.load(f'ppo_agent_cars_{2000000}_mlp.pth') state = env.reset() start = time.time() while time.time() - start < duration: with torch.no_grad(): action = policy.actor(torch.Tensor(state).to('cuda')).cpu().numpy()[:2] state, reward, done, info = env.step(action) if done: break time.sleep(0.003) env.render() env.close()After initializing the environment and loading the trained model with pytorch. We get the initial state by resetting the environment. Inside the while loop, we alternate between inferring the action from the actor model, and stepping the environment. Lastly we keep rendering each frame with env.render().If we run the program without any delay, we will get a very fast simulation that we may not be able to observe and depending on our while condition, it may get repeated many times before the program finishes.To avoid that, we delay the execution by some amount with time.sleep(). The program may still run several times (before duration seconds has passed), but it will be observable.In my case, this code shows the car moving exactly as shown in the image above (in The Task section), but as the speed is limited and the episode length is only 5 seconds, the simulation ends before reaching the point [-1,4], because reaching the point will be physically impossible in that case, no matter how long the model is trained.ConclusionWhile this tutorial merely scratches the surface of MuJoCos vast API capabilities, it equips you with the foundational knowledge to embark on your robotic simulation journey. MuJoCos C++ foundation enables lightning-fast performance, making it ideal for training intricate robots of diverse configurations.This versatility positions MuJoCo as a valuable tool in both research and industry:Research: Researchers can rigorously test and compare novel reinforcement learning algorithms within challenging, realistic scenarios without the logistical complexities and costs of physical prototyping.Industry: Manufacturers can thoroughly evaluate robot designs and models in environments mirroring real-world conditions, ensuring optimal performance before deployment.This Reinforcement and Imitation Learning series will delve deeper into specific, popular algorithms, exploring their intricacies and applications. Subscribe or follow along to stay informed and explore the full potential of these powerful techniques!Join thousands of data leaders on the AI newsletter. Join over 80,000 subscribers and keep up to date with the latest developments in AI. From research to projects and ideas. If you are building an AI startup, an AI-related product, or a service, we invite you to consider becoming asponsor. Published via Towards AI
0 Comments
·0 Shares
·66 Views