Quickstart Guide¶
The basic idea of reinforcement learning is to find a behavior for an agent inside an environment that leads to a maximal reward. Such a behavior is called a policy and it decides what action to take based on the current observation (also called state).
For example, the environment can be an Atari game. In this case the reward is the score, the actions are the controller actions, and the current frame/image of the game is an observation.
The gym library (GitHub) by OpenAI provides several types of environments. A basic reinforcement learning setup to learn a policy for the Breakout environment could look like this:
import gym
# create the environment
env = gym.make('BreakoutNoFrameskip-v4')
# receive an initial observation (frame) to select the first action
observation = env.reset()
while True:
# let the current policy select an action
action = policy(observation)
# execute the action and take one step in the environment (go to next frame)
next_observation, reward, terminal, info = env.step(action)
# improve the policy based on this experience
improve_policy(observation, action, reward, terminal, next_observation)
observation = next_observation
if terminal:
observation = env.reset()
terminal
indicates whether the game ended, so the game has to be reset. reward
is just a number that
represents the points that were achieved in this step. info
contains debug information (the current number of
lives).
A2C and ACKTR actually use multiple environments at once by running them in multiple subprocesses. This means that
we can improve the policy faster, since we simply have more observations and rewards available.
For that reason there is MultiEnv
:
from actorcritic.multi_env import MultiEnv
envs = create_environments() # create multiple environments
multi_env = MultiEnv(envs)
Yet the crucial parts are policy(observation)
and
improve_policy(observation, action, reward, next_observation)
. We need to know how to define a policy and
especially how to improve it.
Actor-critic methods define the policy as a probability distribution, such that it computes the probability of every action based on the current observation. Then these probabilities are used to sample one of the actions. For example, if the ball approaches the bottom in Breakout, the probability to move the paddle towards the ball should be high.
We typically use a neural network to compute these probabilities. Then the observations (frames) are sent into the
network, which produces a score for every action. These scores can be passed in the softmax function to obtain
probabilities. AtariModel
provides a neural network and a policy made for Atari
environments:
from actorcritic.envs.atari.model import AtariModel
# observation_space and action_space define the type and shape of the observations and actions
# e.g. the size of the frames
model = AtariModel(multi_env.observation_space, multi_env.action_space)
Additionally A2C and ACKTR do not take one step only and improve the policy immediately. Instead they take multiple
steps and use all the experienced observations and rewards to improve the policy.
A MultiEnvAgent
simplifies this process. It takes the neural network and the policy
(the ‘model’), and the environments. Then we just have to call interact()
and
it uses the policy to take multiple steps:
from actorcritic.agents import MultiEnvAgent
agent = MultiEnvAgent(multi_env, model, num_steps=5)
while True:
# take 5 steps in all environments
# session is a tf.Session used to compute the values of the neural network
observations, actions, rewards, terminals, next_observations, infos = agent.interact(session)
# improve the policy based on this experience
improve_policy(observations, actions, rewards, terminals, next_observations)
In actor-critic methods we do not define a loss function directly, but a policy objective function to optimize the neural network. It needs the observations, the actions, and the rewards that the agent experienced. Then we can learn through the policy objective, which looks at the rewards in order to decide whether the actions were good or not.
Furthermore we need a baseline function that enhances the policy objective. It should express how much reward we can expect if we would follow our policy proceeding from the observations we just have seen. This helps the policy to decide whether the actions it has taken actually were better or worse than expected. This baseline function is the ‘critic’ of actor-critic (the policy is the ‘actor’). It distinguishes actor-critic methods from policy gradient methods which just have an ‘actor’.
Unfortunately we do not have such a baseline function. That is why we will learn the baseline, too, at the same time
as the policy. Therefore an ActorCriticModel
like the
AtariModel
has to provide a baseline. A2C and ACKTR use the
state-value function which indeed tells us how much reward we can expect from a given observation.
It can be beneficial to use the same neural network as the policy for the baseline.
AtariModel
does exactly this.
In summary we need a ActorCriticObjective
. The policy objective of A2C and ACKTR is
implemented in A2CObjective
. It discounts the rewards and uses
entropy regularization (see A2CObjective
).
from actorcritic.objectives import A2CObjective
objective = A2CObjective(model, discount_factor=0.99, entropy_regularization_strength=0.01)
Next we need an optimizer for our neural network:
import tensorflow as tf
# A2C uses the RMSProp optimizer
optimizer = tf.train.RMSPropOptimizer(learning_rate=0.0007)
# create an 'optimize' operation that we can call
# use optimize_shared() since we share the network between the policy and the baseline
optimize_op = objective.optimize_shared(optimizer)
That is all. We can use all variables defined above to run the A2C algorithm:
while True:
# take multiple steps in all environments
observations, actions, rewards, terminals, next_observations, infos = agent.interact(session)
# improve the policy and the baseline
session.run(optimize_op, feed_dict={
model.observations_placeholder: observations,
model.bootstrap_observations_placeholder: next_observations,
model.actions_placeholder: actions,
model.rewards_placeholder: rewards,
model.terminals_placeholder: terminals
})
bootstrap_observations_placeholder
is needed to compute the
bootstrap_values
, which are used in the policy objective.
In order to use ACKTR we just have to change the optimizer to a kfac.KfacOptimizer
.
See a2c_acktr.py for a
full implementation, especially how to implement create_environments()
and how to use the K-FAC optimizer.