actorcritic.objectives

Contains objectives that are used to optimize actor-critic models.

Classes

  • A2CObjective(model[, discount_factor, …]) – An objective that defines the loss of the policy and the baseline according to the A3C and A2C/ACKTR papers.
  • ActorCriticObjective – An objective that takes an ActorCriticModel and determines how it is optimized.
class actorcritic.objectives.A2CObjective(model, discount_factor=0.99, entropy_regularization_strength=0.01, name=None)[source]

Bases: actorcritic.objectives.ActorCriticObjective

An objective that defines the loss of the policy and the baseline according to the A3C and A2C/ACKTR papers.

The rewards are discounted and the policy loss uses entropy regularization. The baseline is optimized using a squared error loss.
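
The discounting itself is the standard bootstrapped return. A minimal NumPy sketch of how such discounted target values could be computed; the function name discounted_returns and the bootstrap_value argument are illustrative and not part of this module:

    import numpy as np

    def discounted_returns(rewards, discount_factor=0.99, bootstrap_value=0.0):
        # Computes G_t = r_t + discount_factor * G_{t+1}, bootstrapping from an
        # estimated value of the state that follows the last reward.
        returns = np.zeros(len(rewards))
        running = bootstrap_value
        for t in reversed(range(len(rewards))):
            running = rewards[t] + discount_factor * running
            returns[t] = running
        return returns

    print(discounted_returns([1.0, 0.0, 1.0]))  # approximately [1.9801, 0.99, 1.0]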

The policy objective uses entropy regularization:

J(theta) = log(policy(state, action | theta)) * (target_values - baseline) + beta * entropy(policy)

where beta determines the strength of the entropy regularization.
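
For reference, a self-contained sketch of this objective for a categorical (discrete-action) policy in TensorFlow; a2c_policy_loss and its arguments are illustrative names, not part of this module:

    import tensorflow as tf

    def a2c_policy_loss(logits, actions, target_values, baseline_values,
                        entropy_strength=0.01):
        # log(policy(state, action | theta)): negative cross-entropy of the taken actions.
        action_log_probs = -tf.nn.sparse_softmax_cross_entropy_with_logits(
            labels=actions, logits=logits)

        # The advantage (target_values - baseline) is treated as a constant
        # when differentiating with respect to the policy parameters.
        advantages = tf.stop_gradient(target_values - baseline_values)

        # Entropy of the categorical policy, weighted by beta to encourage exploration.
        probs = tf.nn.softmax(logits)
        log_probs = tf.nn.log_softmax(logits)
        entropy = -tf.reduce_sum(probs * log_probs, axis=-1)

        # J(theta) is maximized, so its negation is returned as a loss to minimize.
        objective = action_log_probs * advantages + entropy_strength * entropy
        return -tf.reduce_mean(objective)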

__init__(model, discount_factor=0.99, entropy_regularization_strength=0.01, name=None)[source]
Parameters:
  • model (ActorCriticModel) – A model that provides the policy and the baseline that will be optimized.
  • discount_factor (float) – Used for discounting the rewards. Should be in the interval [0, 1].
  • entropy_regularization_strength (float or tf.Tensor) – Determines the strength of the entropy regularization. Corresponds to the beta parameter in A3C.
  • name (string, optional) – A name for this objective.
baseline_loss

tf.Tensor – The current loss of the baseline of the model.

mean_entropy

tf.Tensor – The current mean entropy of the policy of the model.

policy_loss

tf.Tensor – The current loss of the policy of the model.
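
A hedged usage sketch; build_model is a hypothetical placeholder for constructing an ActorCriticModel, which is application-specific:

    from actorcritic.objectives import A2CObjective

    model = build_model()  # hypothetical helper that returns an ActorCriticModel

    objective = A2CObjective(model,
                             discount_factor=0.99,
                             entropy_regularization_strength=0.01,
                             name='a2c_objective')

    # The loss tensors can now be inspected, logged, or passed to an optimizer.
    print(objective.policy_loss, objective.baseline_loss, objective.mean_entropy)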

class actorcritic.objectives.ActorCriticObjective[source]

Bases: object

An objective takes an ActorCriticModel and determines how it is optimized. It defines the loss of the policy and the loss of the baseline, and can create train operations based on these losses.
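
A toy sketch of what a custom objective might look like, assuming a subclass simply exposes the policy_loss and baseline_loss tensors documented below; the real extension mechanism may differ:

    import tensorflow as tf
    from actorcritic.objectives import ActorCriticObjective

    class ZeroObjective(ActorCriticObjective):
        # Illustration only: exposes constant tensors as the two losses.
        # A real objective derives them from an ActorCriticModel.

        @property
        def policy_loss(self):
            return tf.constant(0.0)

        @property
        def baseline_loss(self):
            return tf.constant(0.0)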

baseline_loss

tf.Tensor – The current loss of the baseline of the model.

optimize_separate(policy_optimizer, baseline_optimizer, policy_kwargs=None, baseline_kwargs=None)[source]

Creates an operation that minimizes the policy loss and the baseline loss separately. This means that it minimizes the losses using two different optimizers.

Parameters:
  • policy_optimizer (tf.train.Optimizer) – An optimizer that is used for the policy loss.
  • baseline_optimizer (tf.train.Optimizer) – An optimizer that is used for the baseline loss.
  • policy_kwargs (dict, optional) – Keyword arguments passed to the minimize() method of the policy_optimizer.
  • baseline_kwargs (dict, optional) – Keyword arguments passed to the minimize() method of the baseline_optimizer.
Returns:

tf.Operation – An operation that updates both the policy and the baseline.
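
A hedged example using the TensorFlow 1.x optimizer API; the RMSProp optimizers and learning rates are illustrative choices, and objective is assumed to be an ActorCriticObjective such as the A2CObjective above:

    import tensorflow as tf

    policy_optimizer = tf.train.RMSPropOptimizer(learning_rate=7e-4)
    baseline_optimizer = tf.train.RMSPropOptimizer(learning_rate=7e-4)

    train_op = objective.optimize_separate(policy_optimizer, baseline_optimizer)

    # train_op performs both updates, e.g. session.run(train_op, feed_dict=...)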

optimize_shared(optimizer, baseline_loss_weight=0.5, **kwargs)[source]

Creates an operation that minimizes both the policy loss and the baseline loss using the same optimizer. This is used for models that share parameters between the policy and the baseline. The shared loss is defined as:

shared_loss = policy_loss + baseline_loss_weight * baseline_loss

where baseline_loss_weight scales the baseline loss and thus determines its effective ‘learning rate’ relative to the policy loss.

Parameters:
  • optimizer (tf.train.Optimizer) – An optimizer that is used for both the policy loss and the baseline loss.
  • baseline_loss_weight (float or tf.Tensor) – Determines the weight of the baseline loss relative to the policy loss, effectively its relative ‘learning rate’.
  • kwargs (dict, optional) – Keyword arguments passed to the minimize() method of the optimizer.
Returns:

tf.Operation – An operation that updates both the policy and the baseline.
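
A hedged example, assuming objective belongs to a model that shares parameters between the policy and the baseline; the Adam optimizer, its learning rate, and the weight of 0.5 are illustrative choices:

    import tensorflow as tf

    optimizer = tf.train.AdamOptimizer(learning_rate=1e-4)

    train_op = objective.optimize_shared(optimizer, baseline_loss_weight=0.5)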

policy_loss

tf.Tensor – The current loss of the policy of the model.