actorcritic.objectives

Contains objectives that are used to optimize actor-critic models.

Classes

  • A2CObjective(model[, discount_factor, …]) – An objective that defines the loss of the policy and the baseline according to the A3C and A2C/ACKTR papers.
  • ActorCriticObjective – An objective that takes an ActorCriticModel and determines how it is optimized.
class actorcritic.objectives.A2CObjective(model, discount_factor=0.99, entropy_regularization_strength=0.01, name=None)[source]

Bases: actorcritic.objectives.ActorCriticObjective

An objective that defines the loss of the policy and the baseline according to the A3C and A2C/ACKTR papers.

The rewards are discounted and the policy loss uses entropy regularization. The baseline is optimized using a squared error loss.
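
The discounting itself is the standard bootstrapped return. A minimal NumPy sketch of how such discounted target values could be computed; the function name discounted_returns and the bootstrap_value argument are illustrative and not part of this module:

    import numpy as np

    def discounted_returns(rewards, discount_factor=0.99, bootstrap_value=0.0):
        # Computes G_t = r_t + discount_factor * G_{t+1}, bootstrapping from an
        # estimated value of the state that follows the last reward.
        returns = np.zeros(len(rewards))
        running = bootstrap_value
        for t in reversed(range(len(rewards))):
            running = rewards[t] + discount_factor * running
            returns[t] = running
        return returns

    print(discounted_returns([1.0, 0.0, 1.0]))  # approximately [1.9801, 0.99, 1.0]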

The policy objective uses entropy regularization:

J(theta) = log(policy(state, action | theta)) * (target_values - baseline) + beta * entropy(policy)

where beta determines the strength of the entropy regularization.
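
For reference, a self-contained sketch of this objective for a categorical (discrete-action) policy in TensorFlow; a2c_policy_loss and its arguments are illustrative names, not part of this module:

    import tensorflow as tf

    def a2c_policy_loss(logits, actions, target_values, baseline_values,
                        entropy_strength=0.01):
        # log(policy(state, action | theta)): negative cross-entropy of the taken actions.
        action_log_probs = -tf.nn.sparse_softmax_cross_entropy_with_logits(
            labels=actions, logits=logits)

        # The advantage (target_values - baseline) is treated as a constant
        # when differentiating with respect to the policy parameters.
        advantages = tf.stop_gradient(target_values - baseline_values)

        # Entropy of the categorical policy, weighted by beta to encourage exploration.
        probs = tf.nn.softmax(logits)
        log_probs = tf.nn.log_softmax(logits)
        entropy = -tf.reduce_sum(probs * log_probs, axis=-1)

        # J(theta) is maximized, so its negation is returned as a loss to minimize.
        objective = action_log_probs * advantages + entropy_strength * entropy
        return -tf.reduce_mean(objective)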

__init__(model, discount_factor=0.99, entropy_regularization_strength=0.01, name=None)[source]
Parameters:
  • model (ActorCriticModel) – A model that provides the policy and the baseline that will be optimized.
  • discount_factor (float) – Used for discounting the rewards. Should be in the interval [0, 1].
  • entropy_regularization_strength (float or tf.Tensor) – Determines the strength of the entropy regularization. Corresponds to the beta parameter in A3C.
  • name (string, optional) – A name for this objective.
baseline_loss

tf.Tensor – The current loss of the baseline of the model.

mean_entropy

tf.Tensor – The current mean entropy of the policy of the model.

policy_loss

tf.Tensor – The current loss of the policy of the model.
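
A hedged usage sketch; build_model is a hypothetical placeholder for constructing an ActorCriticModel, which is application-specific:

    from actorcritic.objectives import A2CObjective

    model = build_model()  # hypothetical helper that returns an ActorCriticModel

    objective = A2CObjective(model,
                             discount_factor=0.99,
                             entropy_regularization_strength=0.01,
                             name='a2c_objective')

    # The loss tensors can now be inspected, logged, or passed to an optimizer.
    print(objective.policy_loss, objective.baseline_loss, objective.mean_entropy)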

class actorcritic.objectives.ActorCriticObjective[source]

Bases: object

An objective takes an ActorCriticModel and determines how it is optimized. It defines the loss of the policy and the loss of the baseline, and can create train operations based on these losses.
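
A toy sketch of what a custom objective might look like, assuming a subclass simply exposes the policy_loss and baseline_loss tensors documented below; the real extension mechanism may differ:

    import tensorflow as tf
    from actorcritic.objectives import ActorCriticObjective

    class ZeroObjective(ActorCriticObjective):
        # Illustration only: exposes constant tensors as the two losses.
        # A real objective derives them from an ActorCriticModel.

        @property
        def policy_loss(self):
            return tf.constant(0.0)

        @property
        def baseline_loss(self):
            return tf.constant(0.0)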

baseline_loss

tf.Tensor – The current loss of the baseline of the model.

optimize_separate(policy_optimizer, baseline_optimizer, policy_kwargs=None, baseline_kwargs=None)[source]

Creates an operation that minimizes the policy loss and the baseline loss separately. This means that it minimizes the losses using two different optimizers.

Parameters:
  • policy_optimizer (tf.train.Optimizer) – An optimizer that is used for the policy loss.
  • baseline_optimizer (tf.train.Optimizer) – An optimizer that is used for the baseline loss.
  • policy_kwargs (dict, optional) – Keyword arguments passed to the minimize() method of the policy_optimizer.
  • baseline_kwargs (dict, optional) – Keyword arguments passed to the minimize() method of the baseline_optimizer.
Returns:

tf.Operation – An operation that updates both the policy and the baseline.
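
A hedged example using the TensorFlow 1.x optimizer API; the RMSProp optimizers and learning rates are illustrative choices, and objective is assumed to be an ActorCriticObjective such as the A2CObjective above:

    import tensorflow as tf

    policy_optimizer = tf.train.RMSPropOptimizer(learning_rate=7e-4)
    baseline_optimizer = tf.train.RMSPropOptimizer(learning_rate=7e-4)

    train_op = objective.optimize_separate(policy_optimizer, baseline_optimizer)

    # train_op performs both updates, e.g. session.run(train_op, feed_dict=...)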

optimize_shared(optimizer, baseline_loss_weight=0.5, **kwargs)[source]

Creates an operation that minimizes both the policy loss and the baseline loss using the same optimizer. This is used for models that share parameters between the policy and the baseline. The shared loss is defined as:

shared_loss = policy_loss + baseline_loss_weight * baseline_loss

where baseline_loss_weight scales the baseline loss and thus determines its effective ‘learning rate’ relative to the policy loss.

Parameters:
  • optimizer (tf.train.Optimizer) – An optimizer that is used for both the policy loss and the baseline loss.
  • baseline_loss_weight (float or tf.Tensor) – Determines the weight of the baseline loss relative to the policy loss, effectively its relative ‘learning rate’.
  • kwargs (dict, optional) – Keyword arguments passed to the minimize() method of the optimizer.
Returns:

tf.Operation – An operation that updates both the policy and the baseline.
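
A hedged example, assuming objective belongs to a model that shares parameters between the policy and the baseline; the Adam optimizer, its learning rate, and the weight of 0.5 are illustrative choices:

    import tensorflow as tf

    optimizer = tf.train.AdamOptimizer(learning_rate=1e-4)

    train_op = objective.optimize_shared(optimizer, baseline_loss_weight=0.5)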

policy_loss

tf.Tensor – The current loss of the policy of the model.