actorcritic.agents¶
Contains agents, which provide an abstraction over environments.
Functions
transpose_list(values) – Transposes a list of lists.
Classes
Agent – Takes environments and a model (containing a policy) and provides interact(), which manages operations such as selecting actions from the model and stepping in the environments.
MultiEnvAgent(multi_env, model, num_steps) – An agent that maintains multiple environments (via MultiEnv) and samples multiple steps.
SingleEnvAgent(env, model, num_steps) – An agent that maintains a single environment and samples multiple steps.
class actorcritic.agents.Agent[source]¶
Bases: object

Takes environments and a model (containing a policy) and provides interact(), which manages operations such as selecting actions from the model and stepping in the environments.

See also
This makes it possible to create multi-step agents, like SingleEnvAgent and MultiEnvAgent.
interact(session)[source]¶
Samples actions from the model and steps in the environments.

Parameters: session (tf.Session) – A session that will be used to compute the actions.
Returns: tuple – A tuple of (observations, actions, rewards, terminals, next_observations, infos).

All values are in batch-major format, meaning that the rows index the batch and the columns index time: [batch, time]. In our case the rows correspond to the environments and the columns correspond to the steps: [environment, step]. The opposite is the time-major format: [time, batch] or [step, environment].

Example:
If the agent maintains 3 environments and samples for 5 steps, the result consists of a matrix (a list of lists) with shape [3, 5]:

[[step 1, step 2, step 3, step 4, step 5],   # environment 1
 [step 1, step 2, step 3, step 4, step 5],   # environment 2
 [step 1, step 2, step 3, step 4, step 5]]   # environment 3

observations, actions, rewards, terminals, and infos are collected during sampling and have the shape [environments, steps].

next_observations contains the observations that the agent received last but has not yet used for selecting actions. They can be used, for example, to bootstrap the remaining returns. Has the shape [environments, 1].
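The batch-major result can be brought into time-major format with transpose_list() from this module. A minimal sketch of the layout; the string values below are placeholders standing in for sampled data, not real observations or actions:

from actorcritic.agents import transpose_list

# Batch-major layout [environment, step]: 3 environments, 5 steps each.
batch_major = [
    ["e1_s1", "e1_s2", "e1_s3", "e1_s4", "e1_s5"],  # environment 1
    ["e2_s1", "e2_s2", "e2_s3", "e2_s4", "e2_s5"],  # environment 2
    ["e3_s1", "e3_s2", "e3_s3", "e3_s4", "e3_s5"],  # environment 3
]

# Time-major layout [step, environment]: 5 rows (one per step) of 3 entries each.
time_major = transpose_list(batch_major)
assert time_major[0] == ["e1_s1", "e2_s1", "e3_s1"]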
class actorcritic.agents.MultiEnvAgent(multi_env, model, num_steps)[source]¶
Bases: actorcritic.agents.Agent

An agent that maintains multiple environments (via MultiEnv) and samples multiple steps.
__init__(multi_env, model, num_steps)[source]¶

Parameters:
- multi_env (MultiEnv) – Multiple environments.
- model (ActorCriticModel) – A model to sample actions from.
- num_steps (int) – The number of steps to take in interact().
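A rough usage sketch. The constructors of MultiEnv and ActorCriticModel are not covered on this page, so multi_env and model below are placeholders assumed to have been built elsewhere:

import tensorflow as tf
from actorcritic.agents import MultiEnvAgent

# Placeholders: construction of these objects is documented elsewhere.
# multi_env = ...  # a MultiEnv wrapping, say, 3 environments
# model = ...      # an ActorCriticModel providing the policy

agent = MultiEnvAgent(multi_env, model, num_steps=5)

with tf.Session() as session:
    session.run(tf.global_variables_initializer())
    observations, actions, rewards, terminals, next_observations, infos = agent.interact(session)
    # With 3 environments and num_steps=5:
    #   observations, actions, rewards, terminals, infos have shape [3, 5]
    #   next_observations has shape [3, 1]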
interact(session)[source]¶
Samples actions from the model and steps in the environments.

Parameters: session (tf.Session) – A session that will be used to compute the actions.
Returns: tuple – A tuple of (observations, actions, rewards, terminals, next_observations, infos).

All values are in batch-major format, meaning that the rows index the batch and the columns index time: [batch, time]. In our case the rows correspond to the environments and the columns correspond to the steps: [environment, step]. The opposite is the time-major format: [time, batch] or [step, environment].

Example:
If the agent maintains 3 environments and samples for 5 steps, the result consists of a matrix (a list of lists) with shape [3, 5]:

[[step 1, step 2, step 3, step 4, step 5],   # environment 1
 [step 1, step 2, step 3, step 4, step 5],   # environment 2
 [step 1, step 2, step 3, step 4, step 5]]   # environment 3

observations, actions, rewards, terminals, and infos are collected during sampling and have the shape [environments, steps].

next_observations contains the observations that the agent received last but has not yet used for selecting actions. They can be used, for example, to bootstrap the remaining returns. Has the shape [environments, 1].
class actorcritic.agents.SingleEnvAgent(env, model, num_steps)[source]¶
Bases: actorcritic.agents.Agent

An agent that maintains a single environment and samples multiple steps.
__init__(env, model, num_steps)[source]¶

Parameters:
- env (gym.Env) – An environment.
- model (ActorCriticModel) – A model to sample actions from.
- num_steps (int) – The number of steps to take in interact().
interact(session)[source]¶
Samples actions from the model and steps in the environment.

Parameters: session (tf.Session) – A session that will be used to compute the actions.
Returns: tuple – A tuple of (observations, actions, rewards, terminals, next_observations, infos).

All values are in batch-major format, meaning that the rows index the batch and the columns index time: [batch, time]. In this case there is a single environment, so the one row corresponds to the environment and the columns correspond to the steps: [1, step]. The opposite is the time-major format: [time, batch] or [step, 1].

observations, actions, rewards, terminals, and infos are collected during sampling and have the shape [1, steps].

next_observations contains the observation that the agent received last but has not yet used for selecting an action. It can be used, for example, to bootstrap the remaining return. Has the shape [1, 1].
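A rough usage sketch that repeatedly calls interact() to collect consecutive batches of steps. The environment can be any gym.Env; the ActorCriticModel is not covered on this page, so model below is a placeholder assumed to have been built elsewhere:

import gym
import tensorflow as tf
from actorcritic.agents import SingleEnvAgent

env = gym.make("CartPole-v1")  # any gym.Env
# model = ...  # an ActorCriticModel (placeholder, construction documented elsewhere)

agent = SingleEnvAgent(env, model, num_steps=5)

with tf.Session() as session:
    session.run(tf.global_variables_initializer())
    for _ in range(10):
        observations, actions, rewards, terminals, next_observations, infos = agent.interact(session)
        # observations, actions, rewards, terminals, infos: shape [1, 5]
        # next_observations: shape [1, 1]
        # ... compute returns and update the model here ...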
actorcritic.agents.transpose_list(values)[source]¶

Transposes a list of lists. Can be used to convert from time-major format to batch-major format and vice versa.

Example

Input:
[[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]

Output:
[[1, 5, 9], [2, 6, 10], [3, 7, 11], [4, 8, 12]]

Parameters: values (list of list) – Values to transpose.
Returns: list of list – The transposed values.
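The behavior is a plain transpose of the nested list. A minimal sketch of an equivalent implementation (not necessarily the library's exact code):

def transpose_list(values):
    # Element [i][j] of the input becomes element [j][i] of the output.
    return [list(row) for row in zip(*values)]

print(transpose_list([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]))
# [[1, 5, 9], [2, 6, 10], [3, 7, 11], [4, 8, 12]]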