actorcritic.model

Contains the base class of actor-critic models.

Classes

ActorCriticModel(observation_space, action_space) – Represents a model (e.g. a neural net) that provides the functionality required for actor-critic algorithms.
class actorcritic.model.ActorCriticModel(observation_space, action_space)[source]

Bases: object

Represents a model (e.g. a neural net) that provides the functionality required for actor-critic algorithms: a policy, a baseline (which is subtracted from the target values to compute the advantage), the values used for bootstrapping from next observations (ideally the values of the baseline), and the placeholders for the sampled data.
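The role of the baseline and of the bootstrap values can be made concrete with a small sketch (not part of this class); the function name, the array layout and gamma are assumptions used only for illustration:

    import numpy as np

    # Hypothetical sketch: the bootstrap value estimates the return after the
    # last observed step, and the baseline is subtracted from the targets to
    # form the advantages used by actor-critic updates.
    def n_step_targets_and_advantages(rewards, terminals, baseline_values,
                                      bootstrap_value, gamma=0.99):
        """rewards, terminals, baseline_values: arrays of shape [T];
        bootstrap_value: value estimate of the observation after step T - 1."""
        targets = np.zeros(len(rewards), dtype=np.float64)
        ret = bootstrap_value
        for t in reversed(range(len(rewards))):
            ret = rewards[t] + gamma * ret * (1.0 - terminals[t])
            targets[t] = ret
        advantages = targets - np.asarray(baseline_values)  # baseline reduces variance
        return targets, advantages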

__init__(observation_space, action_space)[source]
Parameters:
  • observation_space – The space of the observations.
  • action_space – The space of the actions.
actions_placeholder

tf.Tensor – The placeholder for the sampled actions.

baseline

Baseline – The baseline used by this model.

bootstrap_observations_placeholder

tf.Tensor – The placeholder for the sampled next observations. These are used to compute the bootstrap_values.

bootstrap_values

tf.Tensor – The bootstrapped values that are computed based on the observations passed to the bootstrap_observations_placeholder.

observations_placeholder

tf.Tensor – The placeholder for the sampled observations.

policy

Policy – The policy used by this model.

register_layers(layer_collection)[source]

Registers the layers of this model (neural net) in the specified kfac.LayerCollection (required for K-FAC).

Parameters: layer_collection (kfac.LayerCollection) – A layer collection used by the KfacOptimizer.
Raises: NotImplementedError – If this model does not support K-FAC.
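A hedged sketch of what a concrete implementation might register, assuming a model with a single fully connected layer; the self._hidden_* tensors are hypothetical attributes of the subclass, while register_fully_connected is a standard registration method of kfac.LayerCollection:

    def register_layers(self, layer_collection):
        # Hypothetical example for a model with one dense layer; the
        # 'self._hidden_*' tensors are assumed attributes, not part of
        # this base class.
        layer_collection.register_fully_connected(
            params=(self._hidden_weights, self._hidden_bias),
            inputs=self._hidden_inputs,    # activations fed into the layer
            outputs=self._hidden_preacts)  # pre-activation outputs of the layer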
register_predictive_distributions(layer_collection, random_seed=None)[source]

Registers the predictive distributions of the policy and the baseline in the specified kfac.LayerCollection (required for K-FAC).

Parameters:
  • layer_collection (kfac.LayerCollection) – A layer collection used by the KfacOptimizer.
  • random_seed (int, optional) – A random seed used for sampling from the predictive distributions.
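A hedged sketch of an implementation, assuming a discrete (categorical) policy and a scalar state-value baseline; self._policy_logits and self._baseline_values are hypothetical attributes, while the two registration calls belong to kfac.LayerCollection:

    def register_predictive_distributions(self, layer_collection, random_seed=None):
        # Hypothetical: the policy output is treated as a categorical
        # distribution over action logits, the baseline as a normal
        # distribution around its value estimate.
        layer_collection.register_categorical_predictive_distribution(
            logits=self._policy_logits, seed=random_seed)
        layer_collection.register_normal_predictive_distribution(
            mean=self._baseline_values, seed=random_seed)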
rewards_placeholder

tf.Tensor – The placeholder for the sampled rewards (scalars).

sample_actions(observations, session)[source]

Samples actions from the policy based on the specified observations.

Parameters:
  • observations – The observations that will be passed to the observations_placeholder.
  • session (tf.Session) – A session that will be used to compute the values.
Returns:

list of list – A list of lists of actions. The shape equals the shape of observations.
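A hedged usage sketch; MyActorCriticModel is an assumed concrete subclass, and the double list nesting is an assumption that mirrors the shape of the observations passed in:

    import gym
    import tensorflow as tf

    # Hypothetical usage: one environment, stepped with sampled actions.
    env = gym.make('CartPole-v1')
    model = MyActorCriticModel(env.observation_space, env.action_space)  # assumed subclass

    with tf.Session() as session:
        session.run(tf.global_variables_initializer())
        observation = env.reset()
        for _ in range(100):
            actions = model.sample_actions([[observation]], session)
            observation, reward, done, _ = env.step(actions[0][0])
            if done:
                observation = env.reset()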

select_max_actions(observations, session)[source]

Selects the actions with the highest probability (the mode of the policy) for the specified observations.

Parameters:
  • observations – The observations that will be passed to the observations_placeholder.
  • session (tf.Session) – A session that will be used to compute the values.
Returns:

list of list – A list of lists of actions. The shape equals the shape of observations.
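For evaluation, the sampling sketch above applies unchanged with greedy (mode) actions instead of samples:

    # Hypothetical evaluation step, continuing the sketch above.
    actions = model.select_max_actions([[observation]], session)
    observation, reward, done, _ = env.step(actions[0][0])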

terminals_placeholder

tf.Tensor – The placeholder for the sampled terminals (booleans).
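Taken together, the placeholders are typically fed with one batch of sampled experience per training step. A hedged sketch, where train_op and the *_batch arrays are assumptions, not part of this module:

    # Hypothetical training step: feed sampled experience into the placeholders.
    feed_dict = {
        model.observations_placeholder: observations_batch,
        model.actions_placeholder: actions_batch,
        model.rewards_placeholder: rewards_batch,
        model.terminals_placeholder: terminals_batch,
        model.bootstrap_observations_placeholder: next_observations_batch,
    }
    session.run(train_op, feed_dict=feed_dict)  # 'train_op' is an assumed update op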