actorcritic.agents

Contains agents, which are an abstraction from environments.

Functions

transpose_list(values) Transposes a list of lists.

Classes

Agent Takes environments and a model (containing a policy) and provides interact(), which manages operations such as selecting actions from the model and stepping in the environments.
MultiEnvAgent(multi_env, model, num_steps) An agent that maintains multiple environments (via MultiEnv) and samples multiple steps.
SingleEnvAgent(env, model, num_steps) An agent that maintains a single environment and samples multiple steps.
class actorcritic.agents.Agent[source]

Bases: object

Takes environments and a model (containing a policy) and provides interact(), which manages operations such as selecting actions from the model and stepping in the environments.

See also

This base class allows the creation of multi-step agents, such as SingleEnvAgent and MultiEnvAgent.

interact(session)[source]

Samples actions from the model and steps in the environments.

Parameters: session (tf.Session) – A session that will be used to compute the actions.
Returns: tuple – A tuple (observations, actions, rewards, terminals, next_observations, infos).

All values are in batch-major format, meaning that the rows determine the batch and the columns determine the time: [batch, time]. In our case the rows correspond to the environments and the columns correspond to the steps: [environment, step]. The opposite is the time-major format: [time, batch] or [step, environment].

Example:

If the agent maintains 3 environments and samples for 5 steps, the result would consist of a matrix (list of list) with shape [3, 5]:
[ [step 1, step 2, step 3, step 4, step 5],   # environment 1
  [step 1, step 2, step 3, step 4, step 5],   # environment 2
  [step 1, step 2, step 3, step 4, step 5] ]  # environment 3

observations, actions, rewards, terminals, and infos are collected during sampling and have the shape [environments, steps].

next_observations contains the observations that the agent received last but has not yet used for selecting actions. These can be used, e.g., to bootstrap the remaining returns. Has the shape [environments, 1].
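Bootstrapping the remaining returns with next_observations can be sketched as follows. This is a minimal, pure-Python illustration, not part of this API: discounted_returns and bootstrap_values (standing in for the critic's value estimates of next_observations) are hypothetical names, and the shapes mirror the batch-major [environments, steps] layout returned by interact().

```python
def discounted_returns(rewards, terminals, bootstrap_values, gamma=0.99):
    """Computes discounted n-step returns in batch-major format.

    rewards, terminals: [environments, steps] as returned by interact().
    bootstrap_values: one value estimate per environment, assumed to come
    from the critic applied to next_observations (hypothetical here).
    """
    all_returns = []
    for env_rewards, env_terminals, bootstrap in zip(rewards, terminals, bootstrap_values):
        # If the last step did not terminate the episode, continue the
        # return with the bootstrap value; otherwise there is no future reward.
        running = 0.0 if env_terminals[-1] else bootstrap
        env_returns = []
        for reward, terminal in zip(reversed(env_rewards), reversed(env_terminals)):
            if terminal:
                running = 0.0  # episode boundary: discard rewards beyond it
            running = reward + gamma * running
            env_returns.append(running)
        env_returns.reverse()
        all_returns.append(env_returns)
    return all_returns

# Two environments, two steps each; the second environment terminates at its
# last step, so only the first environment is bootstrapped.
returns = discounted_returns(
    rewards=[[1.0, 1.0], [0.0, 1.0]],
    terminals=[[False, False], [False, True]],
    bootstrap_values=[10.0, 10.0],
)
```
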

class actorcritic.agents.MultiEnvAgent(multi_env, model, num_steps)[source]

Bases: actorcritic.agents.Agent

An agent that maintains multiple environments (via MultiEnv) and samples multiple steps.

__init__(multi_env, model, num_steps)[source]
Parameters:
  • multi_env (MultiEnv) – The environments.
  • model (ActorCriticModel) – A model to sample actions.
  • num_steps (int) – The number of steps to take in interact().
interact(session)[source]

Samples actions from the model and steps in the environments.

Parameters: session (tf.Session) – A session that will be used to compute the actions.
Returns: tuple – A tuple (observations, actions, rewards, terminals, next_observations, infos).

All values are in batch-major format, meaning that the rows determine the batch and the columns determine the time: [batch, time]. In our case the rows correspond to the environments and the columns correspond to the steps: [environment, step]. The opposite is the time-major format: [time, batch] or [step, environment].

Example:

If the agent maintains 3 environments and samples for 5 steps, the result would consist of a matrix (list of list) with shape [3, 5]:
[ [step 1, step 2, step 3, step 4, step 5],   # environment 1
  [step 1, step 2, step 3, step 4, step 5],   # environment 2
  [step 1, step 2, step 3, step 4, step 5] ]  # environment 3

observations, actions, rewards, terminals, and infos are collected during sampling and have the shape [environments, steps].

next_observations contains the observations that the agent received last but has not yet used for selecting actions. These can be used, e.g., to bootstrap the remaining returns. Has the shape [environments, 1].

class actorcritic.agents.SingleEnvAgent(env, model, num_steps)[source]

Bases: actorcritic.agents.Agent

An agent that maintains a single environment and samples multiple steps.

__init__(env, model, num_steps)[source]
Parameters:
  • env (gym.Env) – An environment.
  • model (ActorCriticModel) – A model to sample actions.
  • num_steps (int) – The number of steps to take in interact().
interact(session)[source]

Samples actions from the model and steps in the environment.

Parameters: session (tf.Session) – A session that will be used to compute the actions.
Returns: tuple – A tuple (observations, actions, rewards, terminals, next_observations, infos).

All values are in batch-major format, meaning that the rows determine the batch and the columns determine the time: [batch, time]. In our case we have one environment so the row corresponds to the environment and the columns correspond to the steps: [1, step]. The opposite is the time-major format: [time, batch] or [step, 1].

observations, actions, rewards, terminals, and infos are collected during sampling and have the shape [1, steps].

next_observations contains the observation that the agent received last but has not yet used for selecting an action. This can be used, e.g., to bootstrap the remaining return. Has the shape [1, 1].

actorcritic.agents.transpose_list(values)[source]

Transposes a list of lists. Can be used to convert from time-major format to batch-major format and vice versa.

Example

Input:

[[1, 2, 3, 4],
 [5, 6, 7, 8],
 [9, 10, 11, 12]]

Output:

[[1, 5, 9],
 [2, 6, 10],
 [3, 7, 11],
 [4, 8, 12]]
Parameters: values (list of list) – Values to transpose.
Returns: list of list – The transposed values.
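A sketch of how such a transpose can be implemented, assuming a rectangular list of lists (the library's actual implementation may differ):

```python
def transpose_list(values):
    """Transposes a list of lists, e.g. [time, batch] -> [batch, time].

    zip(*values) pairs up the i-th elements of every row; each resulting
    tuple becomes one row of the transposed result. Assumes all rows have
    equal length (a rectangular list of lists).
    """
    return [list(row) for row in zip(*values)]

transposed = transpose_list([[1, 2, 3, 4],
                             [5, 6, 7, 8],
                             [9, 10, 11, 12]])
# -> [[1, 5, 9], [2, 6, 10], [3, 7, 11], [4, 8, 12]]
```

Applying the function twice returns the original values, so it converts between time-major and batch-major formats in either direction.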