docs/homepage/blog/an_introduction_to_reinforcement_learning_jl_design_implementations_thoughts/index.md (1 addition, 1 deletion)
@@ -62,7 +62,7 @@ Although most existing reinforcement learning related packages are written in Py
Many existing packages inspired the development of ReinforcementLearning.jl. The following are some important ones.
-[Dopamine](https://google.github.io/dopamine/)\dcite{dayan2009dopamine} provides a clear implementation of the **Rainbow**\dcite{hessel2018rainbow} algorithm. The [gin](https://github.com/google/gin-config) config file template and the concise workflow are the origin of the `Experiment` in ReinforcementLearning.jl.
--[OpenSpiel](https://github.com/deepmind/open_spiel)\dcite{LanctotEtAl2019OpenSpiel} provides a lot of useful functions to describe many different kinds of games. They are turned into traits in our package.
+-[OpenSpiel](https://github.com/google-deepmind/open_spiel)\dcite{LanctotEtAl2019OpenSpiel} provides a lot of useful functions to describe many different kinds of games. They are turned into traits in our package.
-[Ray/rllib](https://docs.ray.io/en/master/rllib.html)\dcite{liang2017ray} has many nice abstraction layers in the policy part. We also borrowed the definition of environments here. This is explained in detail in section 2.
-[rlpyt](https://github.com/astooke/rlpyt)\dcite{stooke2019rlpyt} has a nice code structure and we borrowed some implementations of policy gradient algorithms from it.
-[Acme](https://github.com/deepmind/acme)\dcite{hoffman2020acme} offers a framework for distributed reinforcement learning.
docs/src/How_to_implement_a_new_algorithm.md (1 addition, 1 deletion)
@@ -45,7 +45,7 @@ end
Implementing a new algorithm mainly consists of creating your own `AbstractPolicy` (or `AbstractLearner`, see [this section](#using-resources-from-rlcore)) subtype, its action sampling method (by overloading `Base.push!(policy::YourPolicyType, env)`) and implementing its behavior at each stage. However, ReinforcementLearning.jl provides plenty of pre-implemented utilities that you should use to 1) have less code to write, 2) lower the chances of bugs, and 3) make your code more understandable and maintainable (if you intend to contribute your algorithm).
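To make that concrete, here is a minimal sketch of such a subtype. The struct name, the random action choice, and the empty update hook are assumptions made up for this illustration, not the package's prescribed template.

```julia
# A minimal sketch (hypothetical names): a policy that picks a random legal
# action and ignores updates.
using ReinforcementLearning

struct MyRandomishPolicy <: AbstractPolicy end

# Action selection: return the action to take in `env`
# (assumes the action space supports `rand`).
RLBase.plan!(::MyRandomishPolicy, env::AbstractEnv) = rand(legal_action_space(env))

# Stage hook: a real algorithm would record the transition or update its
# estimates here; this sketch does nothing.
Base.push!(::MyRandomishPolicy, ::PostActStage, env::AbstractEnv) = nothing
```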
## Using Agents
-The recommended way is to use the policy wrapper `Agent`. An agent is itself an `AbstractPolicy` that wraps a policy and a trajectory (also called Experience Replay Buffer in reinforcement learning literature). Agent comes with default implementations of `push!(agent, stage, env)` and `plan!(agent, env)` that will probably fit what you need at most stages so that you don't have to write them again. Looking at the [source code](https://github.com/JuliaReinforcementLearning/ReinforcementLearning.jl/blob/main/src/ReinforcementLearningCore/src/policies/agent.jl/), we can see that the default Agent calls are
+The recommended way is to use the policy wrapper `Agent`. An agent is itself an `AbstractPolicy` that wraps a policy and a trajectory (also called Experience Replay Buffer in reinforcement learning literature). Agent comes with default implementations of `push!(agent, stage, env)` and `plan!(agent, env)` that will probably fit what you need at most stages so that you don't have to write them again. Looking at the [source code](https://github.com/JuliaReinforcementLearning/ReinforcementLearning.jl/blob/main/src/ReinforcementLearningCore/src/policies/agent/agent_base.jl), we can see that the default Agent calls are
```julia
function Base.push!(agent::Agent, ::PreEpisodeStage, env::AbstractEnv)
docs/src/How_to_write_a_customized_environment.md (6 additions, 10 deletions)
@@ -6,7 +6,7 @@ write many different kinds of environments based on interfaces defined in
[ReinforcementLearningBase.jl](@ref).
The most commonly used interface to describe reinforcement learning tasks is
-[OpenAI/Gym](https://gym.openai.com/). Inspired by it, we expand those
+[OpenAI/Gym](https://gymnasium.farama.org). Inspired by it, we expand those
interfaces a little to utilize multiple-dispatch in Julia and to cover
multi-agent environments.
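Before the full LotteryEnv example below, here is a minimal sketch of what implementing those interfaces can look like for a single-agent environment. The `CoinFlipEnv` name, its fields, and the reward scheme are invented for illustration; the method names follow the RLBase interface used throughout this page.

```julia
# A hypothetical single-agent environment: guess the outcome of a coin flip.
using ReinforcementLearning

mutable struct CoinFlipEnv <: AbstractEnv
    reward::Float64
    done::Bool
end
CoinFlipEnv() = CoinFlipEnv(0.0, false)

RLBase.action_space(::CoinFlipEnv) = (:heads, :tails)
RLBase.state(env::CoinFlipEnv) = env.done ? 2 : 1
RLBase.state_space(::CoinFlipEnv) = Base.OneTo(2)
RLBase.reward(env::CoinFlipEnv) = env.reward
RLBase.is_terminated(env::CoinFlipEnv) = env.done
RLBase.reset!(env::CoinFlipEnv) = (env.reward = 0.0; env.done = false; nothing)

# One step: +1 for a correct guess, -1 otherwise, then the episode ends.
function RLBase.act!(env::CoinFlipEnv, action)
    flip = rand(Bool) ? :heads : :tails
    env.reward = action == flip ? 1.0 : -1.0
    env.done = true
end
```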
@@ -30,7 +30,7 @@ act!(env::YourEnv, action)
## An Example: The LotteryEnv
Here we use an example introduced in [Monte Carlo Tree Search: A
-Tutorial](https://www.informs-sim.org/wsc18papers/includes/files/021.pdf) to
+Tutorial](https://ieeexplore.ieee.org/document/8632344) to
demonstrate how to write a simple environment.
The game is defined like this: assume you have \$10 in your pocket, and you are
@@ -168,7 +168,7 @@ policy we defined above. A [`QBasedPolicy`](@ref)
contains two parts: a `learner` and an `explorer`. The `learner` *learns* the
state-action value function (aka *Q* function) during interactions with the
`env`. The `explorer` is used to select an action based on the Q value returned
-by the `learner`. Inside of the [`MonteCarloLearner`](@ref), a
+by the `learner`. Inside of the [`TDLearner`](@ref), a
[`TabularQApproximator`](@ref) is used to estimate the Q value.
That's the problem! A [`TabularQApproximator`](@ref) only accepts states of type `Int`.
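To show how these pieces fit together, here is a rough sketch of constructing such a policy. The constructor keywords, the `TDLearner` arguments, and the hyper-parameter values are assumptions for illustration only; check the current API before copying.

```julia
# A hedged sketch of wiring a learner and an explorer into a QBasedPolicy.
# Keyword names and arguments below are assumptions, not verified API.
using ReinforcementLearning

policy = QBasedPolicy(
    learner = TDLearner(
        TabularQApproximator(n_state = 10, n_action = 3),  # assumed keywords
        :SARS,   # assumed: one-step TD update on (s, a, r, s') tuples
    ),
    explorer = EpsilonGreedyExplorer(0.1),  # explore with probability 0.1
)
```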
@@ -304,11 +304,7 @@ legal_action_space_mask(ttt)
```
For some simple environments, we can simply use a `Tuple` or a `Vector` to
-describe the action space. A special space type [`Space`](@ref) is also provided
-as a meta space to hold the composition of different kinds of sub-spaces. For
-example, we can use `Space(((1:3),(true,false)))` to describe the environment
-with two kinds of actions, an integer between `1` and `3`, and a boolean.
-Sometimes, the action space is not easy to be described by some built in data
+describe the action space. Sometimes, the action space is not easy to be described by some built in data
structures. In that case, you can define a customized one with the following
interfaces implemented:
@@ -370,7 +366,7 @@ to the perspective from the `current_player(env)`.
In multi-agent environments, sometimes the sum of rewards from all players is
always `0`. We call the [`UtilityStyle`](@ref) of these environments [`ZeroSum`](@ref).
-`ZeroSum` is a special case of [`ConstantSum`](@ref). In cooperational games, the reward
+`ZeroSum` is a special case of [`ConstantSum`](@ref). In cooperative games, the reward
of each player is the same. In this case, they are called [`IdenticalUtility`](@ref).
Other cases fall back to [`GeneralSum`](@ref).
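These styles are declared as traits on your environment type. A minimal sketch, using a hypothetical placeholder type:

```julia
# Trait declaration sketch: tell RLBase the utility style of your environment.
# `MyZeroSumEnv` is a hypothetical placeholder, not a type from the package.
using ReinforcementLearning

struct MyZeroSumEnv <: AbstractEnv end

RLBase.UtilityStyle(::MyZeroSumEnv) = ZeroSum()   # rewards of all players sum to 0
```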
@@ -403,7 +399,7 @@ each action, then we call the [`ChanceStyle`](@ref) of these environments are of
default return value. One special case is that,
in [Extensive Form Games](https://en.wikipedia.org/wiki/Extensive-form_game), a
chance node is involved. And the action probability of this special player is
-determined. We define the `ChanceStyle` of these environments as [`EXPLICIT_STOCHASTIC`](https://juliareinforcementlearning.org/docs/rlbase/#ReinforcementLearningBase.EXPLICIT_STOCHASTIC).
+determined. We define the `ChanceStyle` of these environments as [`EXPLICIT_STOCHASTIC`](@ref).
For these environments, we need to have the following methods defined:
docs/src/rlcore.md (3 additions, 3 deletions)
@@ -8,8 +8,8 @@ In addition to containing the [run loop](./How_to_implement_a_new_algorithm.md),
## QBasedPolicy
-`QBasedPolicy` is an `AbstractPolicy` that wraps a Q-Value _learner_ (tabular or approximated) and an _explorer_. Use this wrapper to implement a policy that directly uses a Q-value function to
-decide its next action. In that case, instead of creating an `AbstractPolicy` subtype for your algorithm, define an `AbstractLearner` subtype and specialize `RLBase.optimise!(::YourLearnerType, ::Stage, ::Trajectory)`. This way you will not have to code the interaction between your policy and the explorer yourself.
+[`QBasedPolicy`](@ref) is an [`AbstractPolicy`](@ref) that wraps a Q-Value _learner_ (tabular or approximated) and an _explorer_. Use this wrapper to implement a policy that directly uses a Q-value function to
+decide its next action. In that case, instead of creating an [`AbstractPolicy`](@ref) subtype for your algorithm, define an [`AbstractLearner`](@ref) subtype and specialize `RLBase.optimise!(::YourLearnerType, ::Stage, ::Trajectory)`. This way you will not have to code the interaction between your policy and the explorer yourself.
RLCore provides the most common explorers (such as epsilon-greedy, UCB, etc.). You can find many examples of QBasedPolicies in the DQNs section of RLZoo.
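As a minimal sketch of that pattern, assuming the `optimise!` signature quoted above (the learner name, its field, and the no-op body are hypothetical placeholders):

```julia
# Sketch of a custom learner that plugs into a QBasedPolicy; names are placeholders.
using ReinforcementLearning

struct MyLearner <: AbstractLearner
    approximator::TabularQApproximator
end

# Called by the run loop at the given stage with the collected trajectory; a real
# implementation would sample transitions and update `learner.approximator`.
function RLBase.optimise!(learner::MyLearner, ::PostActStage, trajectory::Trajectory)
    nothing
end
```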
## Parametric approximators
@@ -29,4 +29,4 @@ The other advantage of `TargetNetwork` is that it uses Julia's multiple dispatch
## Architectures
-Common model architectures are also provided such as the `GaussianNetwork` for continuous policies with diagonal multivariate policies; and `CovGaussianNetwork` for full covariance (very slow on GPUs at the moment).
+Common model architectures are also provided such as the `GaussianNetwork` for continuous policies with diagonal multivariate policies; and `CovGaussianNetwork` for full covariance (very slow on GPUs at the moment).