ValueFunctionIterations.jl
ValueFunctionIterations.jl is a package for solving Markov decision processes (MDPs) using the value function iteration algorithm. The library is designed to solve MDPs with continuous state variables, although it can also accommodate discrete variables.
Markov decision processes
Markov decision processes are a type of optimization problem where a sequence of actions $u_t$ is chosen to maximize a sequence of rewards $R$ discounted into the future by a factor $\delta$. The rewards $R(s_t,u_t,X_t)$ depend on the action taken $u_t$, random events $X_t$, and a third set of variables $s_t$ called state variables
\[V = \underset{\{u_t\}}{\text{max}}\left\{\sum_{t=0}^{\infty}\delta^t R(s_t,u_t,X_t) \right\}.\]
The state variables are influenced by the actions $u_t$, creating cause-and-effect relationships between actions taken in the present and future rewards. These relationships are captured by a function called the state transition function $F(s_t,u_t,X_t)$, which determines the value of the state variables in the next period $s_{t+1}$.
Markov decision processes find the optimal sequence of actions $u_t$ to maximize the discounted sum of rewards, given the effect of each action on rewards in the present and on rewards in the future. Markov decision processes also allow for uncertainty in the relationship between current actions and future rewards, captured by the random variable $X_t$.
Markov decision processes can be solved using the Bellman equation, which recursively calculates the expected value of the objective $V$ as a function of the current value of the state variables $s_t$
\[V(s_t) = \underset{u_t}{\text{max}}\left\{E_X[R(s_t,u_t,X_t)]+\delta E_X[V(F(s_t,u_t,X_t))] \right\}.\]
The Bellman equation can be solved by approximating the value function $V$ and the expectation operators $E_X$ numerically. This software package uses B-splines to approximate the value function and allows users to choose between Monte Carlo integration and quadrature to evaluate the expectations.
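To make the recursion concrete, the following sketch (which does not use this package's API) applies value function iteration to a small discretized problem: the value function is stored on a coarse grid, expectations are taken over a two-point shock, and the Bellman operator is applied until successive iterates stop changing. ValueFunctionIterations.jl replaces the grid lookup with a B-spline approximation and the two-point shock with Monte Carlo integration or quadrature; the names below (reward, transition, sgrid, and so on) are illustrative only.
β = 0.95                                   # discount factor
sgrid  = 0.0:0.1:1.0                       # discretized state grid
ugrid  = [0.0, 1.0]                        # finite action set
shocks = [0.0, 1.0]; probs = [0.9, 0.1]    # two-point random shock X
reward(s, u, x) = u*(s - 0.25)                                          # illustrative R(s,u,X)
transition(s, u, x) = (x == 1.0 || u == 1.0) ? 0.0 : min(s + 0.1, 1.0)  # illustrative F(s,u,X)
Vgrid = zeros(length(sgrid))               # value function stored on the grid
nearest(s) = argmin(abs.(sgrid .- s))      # index of the grid point closest to s
for iter in 1:1000
    # Bellman operator: maximize expected reward plus discounted continuation value
    Vnew = [maximum(sum(probs[k]*(reward(s, u, shocks[k]) +
                        β*Vgrid[nearest(transition(s, u, shocks[k]))])
                        for k in eachindex(shocks)) for u in ugrid) for s in sgrid]
    converged = maximum(abs.(Vnew .- Vgrid)) < 1e-8
    Vgrid .= Vnew
    converged && break                     # stop once the value function stops changing
end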
Markov decision processes are common in natural resource and financial economics, where individuals make decisions about how to operate in uncertain environments in which actions in the present influence future possibilities. For example, a retired person might have to choose how to spend their savings over time given uncertain returns on their investments and health expenses.
How to
MDPs can be defined and solved using the DynamicProgram function. This function takes an AbstractValueFunction object V to approximate the value function, a function that defines the rewards R(s,u,X,p), the state transition function F(s,u,X,p), a component vector of parameters p defined using the ComponentArrays library, a matrix u with all possible actions, a random variable object X, and the discount factor $\delta$. The random variable X is defined as an AbstractRandomVariable; constructors for AbstractRandomVariables are included in the ValueFunctionIterations.jl package.
The AbstractValueFunction objects used to approximate the value function can be created with the RegularGridBspline function or with the DiscreteAndContinuous function. The DiscreteAndContinuous function is primarily designed to be used when a discrete variable determines the possible values of one or more continuous state variables.
The following example defines a dynamic program to estimate the optimal rotation time for a timber stand that has some probability of experiencing damage before harvest.
using Plots, ValueFunctionIterations, ComponentArrays, Distributions
# Income from harvesting trees
function R(s,u,X,p)
    if X[1] == 0.0 && u[1] == 0 # if neither damage nor harvest occurs, receive nothing
        return 0.0
    elseif u[1] == 1 # if harvesting occurs, receive the net revenue
        return s[1]-p.c
    else # X[1] == 1.0: if damage occurs, harvest and receive the salvage value (X[2])
        return X[2]*s[1]-p.c
    end
end
# State update function
function F(s,u,X,p)
    if X[1] == 0.0 && u[1] == 0 # if neither damage nor harvest occurs, allow growth
        return s[1]*exp(p.r*(1-s[1]/p.k))
    else # if harvesting or damage occurs, go to the replanted biomass
        return 0.02
    end
end
# Parameters
# r: growth rate, k: maximum stand size, c: cost of harvest
p = ComponentArray(r = 0.15, k = 1.0, c = 0.25)
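# Optional sanity check (not part of the package workflow): evaluate R and F at a few
# points to confirm the reward and transition functions behave as intended
R([0.8], [1.0], [0.0, 0.0], p)  # harvest with no damage: 0.8 - 0.25 = 0.55
F([0.5], [0.0], [0.0, 0.0], p)  # growth with no harvest or damage: 0.5*exp(0.075) ≈ 0.539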
# Harvest decision (0 = no harvest, 1 = harvest)
u = action_space([0.0,1.0])
# Damage indicator (0 = no damage, 1 = damage) with probabilities 0.99 and 0.01
X1 = RandomVariable([0.0 1.0;], [0.99, 0.01])
# Gauss-Hermite quadrature scheme (10 nodes) for normally distributed salvage values
X2 = GaussHermiteRandomVariable(10,[0.5],[0.1^2;;])
# Combine the two random variables
X = product(X1,X2)
# Discount factor
δ = 0.99
# Grid of stand sizes
grid = 0.01:0.01:1.00
V = ValueFunctionIterations.RegularGridBspline(grid)
# Define and solve the dynamic program
sol = DynamicProgram(V, R, F, p, u, X, δ; tolerance = 1e-5)
# Plot the policy function
Plots.plot(grid,broadcast(s -> s - s*sol.P(s)[1],grid), xlabel = "Standing timber",
ylabel = "Board feet", label = "Standing timber",linewidth = 2)
Plots.plot!(grid,broadcast(s -> s*sol.P(s)[1],grid), label = "Harvest", linewidth = 2)
plot!(size = (400,250))
Once we have a solved problem, we can run simulations under the optimal policy using the simulate function.
states, actions, rewards, vals = simulate(sol, 100)
Plots.plot(states[1,:], label = "Standing timber", linewidth = 2)
Plots.plot!(actions[1,:].*states[1,1:(end-1)], label = "Harvest", linewidth = 2)
Plots.plot!(vals[1:(end-1)], label = "Present value", linewidth = 2)
plot!(size = (400,250), xlabel = "Time", ylabel = "Value")
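As a rough consistency check (assuming rewards holds the realized per-period rewards and vals[1] is the value function evaluated at the initial state, as used in the plot above), the realized discounted sum of rewards along one simulated path can be compared with vals[1]; the two agree on average across many simulated paths, although any single path will differ because of the random damage events.
# Realized discounted reward along the simulated path (assumes rewards is per-period)
sum(δ .^ (0:length(rewards)-1) .* rewards)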
Dynamic programs can take a long time to run, so you can build a problem without running it and evaluate the computation time by setting the solve keyword argument of the DynamicProgram function to false. You can then run estimate_time on the model to evaluate performance bottlenecks, and use solve! to run the model once you are happy with the predicted run time.
prob = DynamicProgram(V, R, F, p, u, X, δ; solve = false, tolerance = 1e-5)
estimate_time(prob)
solve!(prob)