actorcritic.objectives

Contains objectives that are used to optimize actor-critic models.
Classes

- A2CObjective(model[, discount_factor, …]) – An objective that defines the loss of the policy and the baseline according to the A3C and A2C/ACKTR papers.
- ActorCriticObjective – An objective takes an ActorCriticModel and determines how it is optimized.
class actorcritic.objectives.A2CObjective(model, discount_factor=0.99, entropy_regularization_strength=0.01, name=None)

Bases: actorcritic.objectives.ActorCriticObjective

An objective that defines the loss of the policy and the baseline according to the A3C and A2C/ACKTR papers. The rewards are discounted and the baseline is optimized using a squared error loss. The policy objective uses entropy regularization:

    J(theta) = log(policy(state, action | theta)) * (target_values - baseline) + beta * entropy(policy)

where beta determines the strength of the entropy regularization. A sketch of this computation follows the references below.
See also
- https://arxiv.org/pdf/1602.01783.pdf (A3C)
- https://arxiv.org/pdf/1708.05144.pdf (A2C/ACKTR)
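The following is a minimal sketch of how the losses above can be computed with plain TensorFlow 1.x ops. The placeholder names (log_probs, entropies, target_values, baseline) are illustrative assumptions, not part of the actorcritic API.

    import tensorflow as tf

    log_probs = tf.placeholder(tf.float32, [None])      # log(policy(state, action | theta))
    entropies = tf.placeholder(tf.float32, [None])      # entropy(policy) per state
    target_values = tf.placeholder(tf.float32, [None])  # discounted rewards
    baseline = tf.placeholder(tf.float32, [None])       # baseline values, e.g. V(state)
    beta = 0.01                                         # entropy_regularization_strength

    # The advantage is treated as a constant in the policy gradient, so no
    # gradient flows into the baseline through the policy objective.
    advantage = tf.stop_gradient(target_values - baseline)

    # J(theta) is maximized, so the policy loss is its negated mean over the batch.
    policy_loss = -tf.reduce_mean(log_probs * advantage + beta * entropies)

    # The baseline is optimized using a squared error loss.
    baseline_loss = tf.reduce_mean(tf.square(target_values - baseline))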
__init__(model, discount_factor=0.99, entropy_regularization_strength=0.01, name=None)

Parameters:
- model (ActorCriticModel) – A model that provides the policy and the baseline that will be optimized.
- discount_factor (float) – Used for discounting the rewards. Should be in [0, 1].
- entropy_regularization_strength (float or tf.Tensor) – Determines the strength of the entropy regularization. Corresponds to the beta parameter in A3C.
- name (string, optional) – A name for this objective.
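A hypothetical construction sketch; `model` stands in for an already-built ActorCriticModel, and only the keyword arguments come from the signature documented above.

    from actorcritic.objectives import A2CObjective

    def build_objective(model):
        # `model` is assumed to be a ready-built ActorCriticModel.
        return A2CObjective(
            model,
            discount_factor=0.99,                  # discounting of the rewards
            entropy_regularization_strength=0.01,  # beta in the policy objective
            name='a2c_objective',
        )

The resulting objective exposes the loss tensors listed below, which can be fetched in a tf.Session for monitoring.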
baseline_loss
    tf.Tensor – The current loss of the baseline of the model.

mean_entropy
    tf.Tensor – The current mean entropy of the policy of the model.

policy_loss
    tf.Tensor – The current loss of the policy of the model.
class actorcritic.objectives.ActorCriticObjective

Bases: object

An objective takes an ActorCriticModel and determines how it is optimized. It defines the loss of the policy and the loss of the baseline, and can create train operations based on these losses. A hypothetical subclass sketch follows.
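A sketch of a custom objective. This page does not show the exact override points, so the assumption that a subclass supplies the policy_loss and baseline_loss tensors (mirroring the properties documented below) is exactly that, an assumption.

    from actorcritic.objectives import ActorCriticObjective

    class MyObjective(ActorCriticObjective):
        """Illustrative subclass that wires externally computed losses in."""

        def __init__(self, policy_loss, baseline_loss):
            self._policy_loss = policy_loss
            self._baseline_loss = baseline_loss

        @property
        def policy_loss(self):
            # tf.Tensor: the loss that the policy optimizer will minimize.
            return self._policy_loss

        @property
        def baseline_loss(self):
            # tf.Tensor: the loss that the baseline optimizer will minimize.
            return self._baseline_loss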
baseline_loss
    tf.Tensor – The current loss of the baseline of the model.
optimize_separate(policy_optimizer, baseline_optimizer, policy_kwargs=None, baseline_kwargs=None)

Creates an operation that minimizes the policy loss and the baseline loss separately, i.e. using two different optimizers.

Parameters:
- policy_optimizer (tf.train.Optimizer) – An optimizer that is used for the policy loss.
- baseline_optimizer (tf.train.Optimizer) – An optimizer that is used for the baseline loss.
- policy_kwargs (dict, optional) – Keyword arguments passed to the minimize() method of the policy_optimizer.
- baseline_kwargs (dict, optional) – Keyword arguments passed to the minimize() method of the baseline_optimizer.

Returns: tf.Operation – An operation that updates both the policy and the baseline.
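A usage sketch for optimize_separate, assuming an `objective` built as in the A2CObjective example above; the optimizer choices and learning rates are arbitrary illustrations.

    import tensorflow as tf

    policy_optimizer = tf.train.RMSPropOptimizer(learning_rate=7e-4)
    baseline_optimizer = tf.train.GradientDescentOptimizer(learning_rate=1e-3)

    train_op = objective.optimize_separate(
        policy_optimizer,
        baseline_optimizer,
        policy_kwargs={'name': 'optimize_policy'},      # forwarded to minimize()
        baseline_kwargs={'name': 'optimize_baseline'},  # forwarded to minimize()
    )
    # Running train_op in a tf.Session performs both updates.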
optimize_shared(optimizer, baseline_loss_weight)

Creates an operation that minimizes both the policy loss and the baseline loss using the same optimizer. This is used for models that share parameters between the policy and the baseline. The shared loss is defined as:

    shared_loss = policy_loss + baseline_loss_weight * baseline_loss

where baseline_loss_weight determines the ‘learning rate’ of the baseline loss relative to the policy loss.

Parameters:
- optimizer (tf.train.Optimizer) – An optimizer that is used for the shared loss.
- baseline_loss_weight (float or tf.Tensor) – Determines the weight of the baseline loss relative to the policy loss.

Returns: tf.Operation – An operation that updates both the policy and the baseline.
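A sketch of the shared variant under the signature shown above; with baseline_loss_weight=0.5, the baseline loss contributes half as strongly to the shared loss as the policy loss.

    import tensorflow as tf

    optimizer = tf.train.RMSPropOptimizer(learning_rate=7e-4)
    train_op = objective.optimize_shared(optimizer, baseline_loss_weight=0.5)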
policy_loss
    tf.Tensor – The current loss of the policy of the model.