Optimizing high-dimensional stochastic forestry via reinforcement learning

In proceeding beyond the generic optimal rotation model, forest economic research has applied various speciﬁcations that aim to circumvent the problems of high dimensionality. We specify an age-and size-structured mixed-species optimal harvesting model with binary variables for harvest timing, stochastic stand growth, and stochastic prices. Rein-forcement learning allows solving this high-dimensional model without simpliﬁcations. In addition to presenting new features in reservation price schedules and effects of stochasticity, our setup allows evaluating the simpliﬁcations in the existing research. We ﬁnd that one-or two-dimensional models lose a high fraction of attainable economic output while the commonly applied size-structured matrix model overestimates economic proﬁtability, yields deviations in harvest timing, including optimal rotation, and dilutes the effects of stochasticity. Reinforcement learning is found to be an eﬃcient and promising method for detailed age-and size-structured optimization models in resource economics.


Introduction
The classic economic approach to forestry aims to maximize stand value by the choice of rotation periods.This Samuelson (1976) interpretation of the Faustmann (1849) forest valuation equation is generic in revealing the economics of clear-cut timing.Research has proceeded from this basis, but the development has reached a specific limit; optimal solutions become hard or impossible to find after expanding the model details toward a level that appears necessary for emerging extensions.Our aim is to show that recent achievements in reinforcement learning (RL) algorithms 1 can be applied when extending important model details and realism without forgoing optimization and economically justifiable model structure.

Related literature
The problems of high dimensionality are already present in deterministic models but are most pressing when specifications include stochastic factors.This is more or less the reason why a major part of stochastic forest economic literature

Our setup
We apply, for the first time, a mixed-species individual-tree model for deterministic and stochastic problems and with smooth optimization between regimes relying entirely on partial harvesting or thinning, i.e. continuous cover forestry (CCF) and regimes with both thinning and finite rotation, i.e. rotation forestry (RF).Earlier economic studies ( Haight and Monserud, 1990a;Tahvonen, 2011 ) based on individual-tree structure are deterministic, restrict the CCF/RF regime choice, include fixed harvest timing, and reduce optimized variables by grouping harvested trees.In our model the revenue formation is based on tree diameters and a detailed model for variable and fixed harvesting costs separately on thinning and clear-cuts.Stochasticity is included in (log-normally distributed) wood prices, in individual-tree diameter growth and in natural regeneration.Individual-tree growth and regeneration are based on growth models estimated using empirical data with conditionally Gaussian innovations ( Pukkala et al. 2013 ).
Our Theorem 1 proves that a deterministic stationary harvesting policy can be found that maximizes the objective functional.The result follows from the general theory on dynamic decision processes by Schäl (1983) .Additionally, our model can be extended to cover more general price dynamics than the log-normally distributed prices used here.By the existence of a deterministic stationary Markov policy, we are able to extend general reinforcement learning algorithms, such as proximal policy optimization ( Schulman et al., 2017 ), to solve the stochastic harvesting problem.
First, we show how the model performs if reduced to the unidimensional tree paradigm/S-F setup.This serves as a benchmark for the gains obtained with expanded detailed structure but additionally produces results that are most easily comparable with existing research.Our i.i.d.price stochasticity produces one-and two-dimensional reservation price schedules, but pure growth stochasticity ála the tree paradigm does not cause any effects on optimal rotation.
Secondly, we compute solutions where the optimal management regime consists of optimized partial harvesting followed by clear-cuts.The results reveal the losses and limitations of the S-F setup and that the reservation price schedules become dependent on both price and multidimensional forest state.The lower-dimensional matrix model in existing literature is shown to overestimate growth, the size of felled trees, and the value of the forest, in addition to deviations in harvest frequency and rotation period length.
Third, we obtain solutions relying solely on partial harvesting (the CCF regime) and show their properties under stochastic growth and price and compute reservation price schedules, none of which have been presented in existing studies.We find that including a wider range of harvesting alternatives decreases the overall level of economic risk.The inclusion of thinning and the CCF regime turns out to be essential for profitability, implying that optimizing rotation timing and clearcuts solely, as in the S-F and tree paradigm models, serves for theoretical purposes but in any wider contexts calls for expanded dimensions.
The model with four tree species includes 1900 state and 950 control variables and binary harvest timing choices.Extending harvesting alternatives from clear-cuts to thinning and from a single-species setup to mixed species suggest that simpler specifications dilute the effects of stochasticity.The lower-dimensional matrix model has a similar effect.Additionally, it leads to overestimation of stand growth and of the economic outcome compared with the detailed individual-tree model.
Obtaining high-dimensional results based on individual-tree structure is possible by RL despite being beyond reach by earlier methods.This allows us to evaluate the accuracy of the feedback thinning function applied in the existing literature.This idea to circumvent the curse of dimensionality turns out to be inaccurate but based on our results we suggest a modification that may more closely mimic optimal harvests.
The next section specifies the optimization model.This is followed by an existence theorem for the optimal solutions as stationary Markov policy.Next, we present the computation results and finally a discussion and conclusions.Proofs, details on empirical model properties and the reinforcement algorithm are provided in the appendix.

Stochastic Optimization based on a Mixed Individual-Tree Model
Let ˜ x j,q,w,t denote the number of trees (per ha) of species j ∈ { 1 , . . ., l} in size classes q ∈ { 1 , . . ., m } and age cohorts w ∈ { 1 , . . ., n } at the beginning of period t = 0 , 1 , 2 , . . .and h j,q,w,t the number of trees harvested at the end of the periods, respectively.The variables ˜ d j,q,w,t denote the diameter (centimeters) of size class q of age w trees at the beginning of period t.To allow any initial stand state and utilization of per-period information on growth stochasticity, we introduce variables x j,q,w,t and d j,q,w,t to represent the state variables at the end of the periods before harvesting.The stand state showing the number of trees in various species, size, and age classes at the beginning and end of the periods but before harvesting are then given by ˜ x t ∈ R l×m ×n and x t ∈ R l×m ×n , respectively.Similarly, we use d t , ˜ d t , and h t to denote the corresponding diameter data and the number of harvested trees by species, size, and age class.During each period t, the fraction of trees of species j that survive from age class w to the next age class w + 1 is denoted by 0 ≤ α j ( ˜ , where the development of ε t is stochastic.Natural regeneration of species j and size class q is given by the ingrowth function φ j,q ( ˜ x t , ˜ d t , ω t ) ≥ 0 , where ω t is the stochastic variation.The diameters of the new naturally regenerated trees are d j,q, 0 .
Let r be the rate of interest p.a., and γ = 1 / (1 + r) 5 be the discount factor for a 5-year period.Let (p t ) ∞ t=0 , p t ∈ R 2 l denote a species-specific stochastic price process for saw timber and pulp, and let δ th t ∈ { 0 , 1 } and δ cc t ∈ { 0 , 1 } denote boolean vari-ables that determine periods when a thinning or a clear-cut take place, respectively.When either a thinning or a clear-cut takes place, we include a fixed harvesting cost C f (planning and transporting the harvester to the site), implying that harvesting may not be carried out at each period.In the case of a clear-cut, the stand state is always reset to bare land that is immediately artificially regenerated to the state x j,q, 0 , d j,q, 0 , where the former is the number of seedlings and the latter their initial size.Given that the regeneration cost is C r , the per-period net revenues are (1) Let the expected present value of net revenues be denoted as J(x 0 , d 0 , p 0 ) , where an initial stand state is x 0 , diameter data d 0 , and market prices p 0 .The stochastic problem is written as (2) subject to

˜
x j,q,w +1 ,t+1 = x j,q,w,t − h j,q,w,t δ th t d j,q,w,t = ˜ d j,q,w,t + I j,q,w ( ˜ where t = 0 , 1 , 2 , . . . .In addition, our existence proof applies a technical assumption x t , d t ≤ K element-wise for some finite constant K. Thus, the number of trees or the diameter of each size, age, and species class cannot exceed some arbitrarily large value that never becomes binding and exceeds levels that are biologically attainable in a one-hectare forest. In (1)-( 10), the problem is to choose the number of harvested trees h t and the timings for thinning and clear-cuts, δ th t and δ cc t , over the time periods t = 0 , 1 , 2 , . . . .Equation (3) defines the number of trees in new cohorts at the end of each period before harvesting in the cases with and without clear-cuts.Equation (4) gives the number of trees in various cohorts at the beginning of the periods.Equation (5) includes the survivability of trees over periods.If any cohort reaches age n , it will die naturally.Equation (6) determines the initial tree diameters depending on whether regeneration is natural or artificial, and equations ( 7) -(8) determine their development thereafter.
One possible solution regime includes only thinning (δ cc t = 0 , t = 0 , 1 , 2 , . . . ) , i.e. it is the CCF regime.Solutions with a chain of clear-cuts and possibly thinnings are RF regime solutions.In intermediate cases, clear-cuts only occur occasionally (e.g. at the initial state) although thinning is the dominant harvesting method.Empirical details for the stand growth models are given in Appendix A including the interaction between tree species in growth, mortality and regeneration as well as the connection between the matrix and the individual-tree model.The economic parameter values, including the estimated model for stochastic prices, are in Appendix B.

Existence of an optimal solution as a stationary Markov policy
The forest management problem (1)-( 10) can be described as a dynamic discrete-time decision process (S, { D (s ) : s ∈ S} , q, q 0 , π , γ ) , where the components of the tuple are defined as follows: (i) S stands for the state space, where s t = (x t , d t , p t ) ∈ S represents the number of trees x t , tree diameters d t , and market prices p t .
(ii) A is the action space, where each action a t = (h t , δ th t , δ cc t ) ∈ A represents a harvesting decision.We note that δ th t and δ cc t are independent variables and not functions of h t and x t .This ensures that when studying continuity with respect to δ cc t and δ th t , taking the limit ensures that both of these variables remain 1 or 0 throughout the limiting procedure.
We define D (s ) as a non-empty subset of A , representing the set of actions feasible at state S (i.e., the set of implementable harvesting decisions).Requirement a t ∈ D (s t ) ensures that the number of trees harvested cannot exceed the number of trees available.We also denote gph D = { (s, a ) ∈ S × A : a ∈ D (s ) } .(iii) The transition probability is q : gph D → P (S) , where P (S) denotes the set of all probability measures on S. In this notation, the transition probability q is directly defined by the ecological growth model equations presented in Appendix A and the price dynamics given in Appendix B. That is, q describes the probability measure of variables x j,k,t+1 and p t given (s t , a t ) ∈ gph D .(iv) q 0 ∈ P (S) is the so-called initial state distribution.
A policy is understood as a mapping from past observations to a distribution over the actions (harvesting decisions).
Formally, a policy is defined as a sequence μ = (μ t ) of transition probabilities, where μ t (•| τ ) provide the conditional probabilities of various actions given a history of forest management decisions τ = (s 0 , a 0 , . . ., s t ) .

We write
for the set of all policies and F for the subset representing all deterministic policies (decision functions), i.e.
A policy is called Markov if the distribution depends only on the last state.We write M ⊂ for the set of all deterministic Markov policies.Finally, if the policy does not change over time, we define deterministic stationary policy as a Markov policy ( f t ) , where f t = f is independent of t.
We identify F with the set of all deterministic stationary policies.
The value of a policy μ is defined as the total expected discounted reward that is encountered while the policy is executed starting from initial state s : where the expectation is taken with respect to the probability measure defined by policy μ.The goal is then to find a policy μ that maximizes the expected reward, for any initial state s ∈ S. The optimal policy μ is related to V through μ ∈ argmax μ∈ T μ V , where is the Bellman operator.While μ ∈ can be a randomized policy, the existence of a corresponding optimal deterministic stationary policy means that for each s ∈ S the supremum in sup can be achieved by selecting the actions according to a deterministic function f : S → A , f ∈ F .The existence of a solution is guaranteed by the following result: Theorem 1.The stochastic optimization problem (1)-( 10) corresponds to a dynamic discrete-time decision process (S, { D (s ) : s ∈ S} , q, q 0 , π , γ ) , which has an optimal solution that can be represented as a deterministic stationary Markov policy.That is, there exists a function f The proof of the proposition is given in Appendix C along with a discussion of technical details.The result has a number of important implications for the solvability of the problem.In particular, this allows us to use efficient reinforcement learning (RL) algorithms, such as proximal policy optimization (PPO) ( Schulman et al., 2017 ), that have been popularized in machine learning literature to solve stochastic optimization problems that can be represented as Markov decision processes.
In reinforcement learning, the process of finding the optimal policy parameters is conceptualized as an agent that learns by interacting with its environment and by gathering experiences that will help the agent evaluate what was successful and explore what potentially optimal actions are available in various situations ( Sutton and Barto, 2018 ).The interaction between a learning agent and its environment is defined using the formal framework of Markov decision processes, but in contrast to dynamic programming, the transition probabilities do not need to be known in exact form when learning the policy parameters.The RL methods also avoid an exhaustive search over the entire state and action spaces, which makes them computationally more efficient than classical methods, such as policy and value iteration, for large state and action spaces.
The environment is commonly defined as a large simulation model representing how the actual environment would respond to the actions taken by the agent.In our case, the environment is essentially represented by the forest growth model.The agent and the environment interact at discrete time steps t = 0 , 1 , 2 , . . . .At each time step t, the agent receives a description of the forest stand state s t and, based on the state, selects an action a t = (h t , δ th t , δ cc t ) , where the agent chooses between thinning, clear-cutting, and doing nothing, and how much is thinned.As a consequence of its action, the agent receives a reward, i.e. a per-period profit, π (s t , a t ) , and observes a new stand state s t+1 one time step later.The Markov decision process underlying the environment and agent together thereby give rise to a trajectory of states, forest management decisions, and gross profits: s 0 , a 0 , π 0 , s 1 , a 1 , π 1 , . . . .In RL, each of these trajectories begins independently of how the previous one ended.As the objective of the agent is to maximize the expected net present value (NPV) over each trajectory, the agent can learn from the rewards it has received by pursuing various forest management policies, as represented by the sequence of actions it has taken.
To find a policy that maximizes the expected discounted reward, we apply a modified version of proximal policy optimization (PPO) to sequentially maximize a surrogate objective function that serves as an approximate lower bound for the expected reward ( Schulman et al., 2017 ).While the original PPO algorithm treats all actions as continuous, the algorithm can easily be modified to handle a mixture of discrete and continuous actions, where the continuous actions are viewed as parameters of the discrete actions ( Fan et al., 2019 ).Therefore, similar to Malo et al. (2021) , we approach the problem of continuous action and state spaces using the notion of parameterized action spaces, where the choice between thinning, clear-cutting, and doing nothing is modeled as discrete actions and the actual harvesting quantities are modeled as their continuous parameters.Further details on the algorithm are discussed in Appendix D .

Results
Model ( 1)-( 10) can be solved under a various number of details.To reveal their role, we proceed from a simple setup toward more detailed ones.For this purpose, we first remove thinning (but maintain all other model properties) and solve the rotation as a single optimized variable in spirit of the Samuelson (1976) interpretation of the Faustmann (1849) land value analysis and the closely related "tree paradigm" literature ( Miller and Voltaire, 1983 ).Next, we include optimized thinning and finally both thinning and multiple tree species.

Samuelson-Faustmann specification
In purely deterministic solutions for pine ( Fig. 1 a), the maximum sustainable yield rotation is 70 years (not shown), while the economic rotation is 80 (60) years when interest rate is 1% (3%, not shown).In Fig. 1 a, the matrix model yields a 20-year shorter rotation and overestimates bare land value (BLV).In Figs. 1 b and c, the interest rate is 3%, the stand consists of four tree species (which in the S-F setup that ignore partial harvests, are all clear-cut simultaneously), and the individual-tree based optimal rotation is 65 years, while it is 10 years shorter in the matrix model and again BLV is overestimated.These are the first signs of the matrix model tendencies to overestimate volume growth leading to differences in both optimal harvests and profitability.Following the tree paradigm, we next include stochastic growth into the S-F setup.This yields expected clear-cut harvests and BLVs within 0.1% of their deterministic levels with very low standard deviations (SD) ( ∼ 1% ).Additionally, growth stochasticity is not high enough to produce any deviation from the optimal deterministic rotations.These outcomes follow from the low stochasticity in tree diameter growth and from the minor role of stochasticity in natural regeneration in the S-F solutions, where rotation is rather short.
In contrast to stochastic stand growth, i.i.d.stochastic pulp and saw timber prices (Appendix B) imply a deviation of up to 30 years compared with the expected rotation (Fig 2 a).Given 1% and 3% interest rates and 50 0 0 stochastic realizations, the expected rotations are slightly below their deterministic levels (77.9 and 59.9 years vs. 80 and 60 years).Figs. 2 a,b show that the optimal cutting is not determined by barrier rule for age or cutting value.The BLV distributions in Fig. 2 c show that the optimal policies yield higher expected BLVs and lower SDs compared with the policy computed under deterministic mean prices.
The optimal reservation price schedule in Fig. 3 a shows that with a 1% interest rate, a price deviation of ca.+6% is enough to advance clear-cutting by 15 years and a deviation of ca.-3% causes a postponement of up to 30 years.However, clear-cut timing is less flexible with a higher interest rate.In Fig. 3 a, the saw timber and pulp price deviations from the mean are of equal percentage, reflecting the positive covariance of these variables in our price model estimation (Table 4, Appendix B).Fig. 3 b relaxes this coupling, implying that the reservation price schedule forms a surface with its diagonal as the two-dimensional schedule in Fig. 3 a.A closer examination reveals that optimal timing reacts more sensitively to saw timber compared with pulp price.
In the stochastic realizations of Figs. 2 b and c, the optimal stochastic policy produces a higher expected BLV compared with the policy computed under deterministic mean prices, but the differences are 3.6% (3.2%) with an interest rate of 1% (3%), i.e. the benefit from a flexible stochastic policy is rather low.Closely related to this, the realized prices at clear-cuts are € 61 and € 31.7 for saw and pulp logs, respectively (interest rate 1%), while the expected prices are € 58.6 and € 30.05.The explanation for the rather low benefit from this flexible policy is the low SD ( ∼ 4% ) of the saw and pulp log prices compared with expected prices in our estimated price model.To demonstrate these effects, we increased the SD in our price model to 30%.Given this change, the variation in optimal rotation realizations becomes larger, the benefit from optimizing under stochastic prices increases to ca. 40%, and reservation price levels always exceeded the expected price levels.

Optimized rotation, management regime choice, and thinning
Optimized thinning, deterministic setups, and interest rates of 1-2% imply that the RF regime with two or three thinnings is optimal for pine ( Figs. 4 a,b).Compared with the S-F model ( Figs. 1 a,b), thinning lengthens the rotation by 30-45 years and increases BLV by 19-28%.The matrix model overestimates the BLV by 28-42%, yields different numbers and timings of thinning, and different rotations compared with the individual-tree model.Volume developments in the models are additionally rather different.Similarly as in the S-F setup, stochasticity in stand growth alone does not cause any variations in harvest timing and only minor changes in cutting value.
In Fig. 5 a, stochasticity in prices changes both the timing (5-10 years) and intensity of thinnings along with the timing of clear-cuts.This differs sharply from deterministic growth and no thinning ( Fig. 3 a), where stand state is a deterministic function of time and reservation price schedule is a priori determined in the time-price plane.In Fig. 5 a, thinning reacts to stochastic prices, implying that the stand state varies at the moments of clear-cuts.Thus, harvest timing becomes dependent on varying stand state in addition to price.To obtain a comparable figure with schedules without thinning, the reservation price schedules in Fig. 5 b assume the state from the deterministic solution after the second thinning ( Fig. 5 a) as the initial state.Comparing with the S-F model ( Fig. 3 a), clear-cut timing is more sensitive to timber price variations with optimized thinning.With a 1% interest rate, ca.38% of clear-cuts are expected to deviate by 10 years or more from the deterministic 110-year rotation ( Fig. 5 c).50 0 0 realizations with simultaneous price and growth stochasticity show an increase from the deterministic BLV of € 5437 to an expected BLV of € 5569 (SD € 142).Fig. 5 d compares the policy computed under deterministic environment and the policy computed under stochastic prices and growth.Comparing with BLVs in the S-F model shows that the share of SD on BLV is lower under thinning: 1.7% vs. 2.2%, with an interest rate of 1% and 2.5% vs. 3.7% with an interest rate of 3%.Thus, flexibly optimized thinning acts as risk-spreading.
Including optimized thinning for deterministic Norway spruce implies an infinitely long rotation, i.e. optimality of the CCF regime and convergence toward a cyclical steady state ( Fig. 6 ).The differences with solutions without thinning are remarkable, as the rotations in the S-F solutions are 90 and 70 years and BLVs are 25% and 96% lower (interest rates 1% and 3%).In Fig. 6 b (lower panel), the individual-tree size structure before the steady-state harvest is shown by 416 (nonzero) state variables (tree size classes and diameters).Accordingly, the number of (nonzero) control variables is 208 in addition to the binary harvest timing variables.
Pure growth stochasticity (not shown) causes 10-year deviations from the deterministic 25-year harvesting interval and decreases the expected harvest interval to 23 years (interest rate 1%).The SD in harvest levels is 35% (20%) with a 1% (3%) interest rate.Compared with the S-F model, where growth stochasticity has negligible effects, these variations are high but still low compared with stochasticity and SD in natural regeneration, which is ca.2.8 times the expected value.
To study price stochasticity in the CCF regime, we first assume an initial state from the deterministic steady-state solution five years after the optimal harvest.In Figs. 7 a and b, the mean price would imply a harvesting rate of 25 (20) years with an interest rate of 1% (3%) but a 15% higher price advances the harvest by 10 years.Similarly as in the RF regime ( Fig. 5 b), the change in harvest timing is more sensitive to low price realizations (and low interest rate), and prices that remain below 10% of the mean postpone the harvest by 20 years or more.Fig. 7 c is based on 57 500 price realizations and the stand state after previous harvest varies between realizations, implying that at each date the minimum price with harvest realization varies as a consequence of varying stand state.This allows computing both the mean reservation price and SD.In Fig. 7 d, both price and stand growth are stochastic, implying that the SD from the mean timing is larger (575 0 0 0 realizations).Thus, thinning and/or including both growth and price stochasticity imply that the reservation prices are dependent on stochastically evolving stand growth in contrast to the simplified schedules in Figs.7 a and 5 b.
Despite the differences in timing, the size variation in the harvested trees is negligible, i.e. 19.0 cm with an SD of 4 mm (interest rate 3%) and 19.1 cm with an SD of 1 mm (interest rate of 1%).Instead, the variation occurs in the number of trees harvested, implying an SD of 31 trees when the expected number of harvested trees per harvest is 273.
A run with 50 0 0 realizations produces an expected BLV equal to € 622 with an SD of € 41.Mean BLV decreases to € 601 and SD increases to € 43 if the harvesting policy is trained based on deterministic mean prices ( Fig. 7 e).In Fig. 7 f, the deterministic solution is contrasted to 10 realizations under stochastic price and stand growth.In stochastic solutions, the mean steady-state volume 112 m 3 is above the deterministic average steady-state volume of 107 m 3 .Similarly as with RF solutions for pine, optimized thinning decreases the relative SD of BLVs compared with solutions without thinning, i.e. 2.3%, 4.7%, and 7.4% vs. 1.6%, 2.5%, and 6.6% (interest rates 1%, 2%, and 3%).Intuitively, more frequent fellings and a lower number of trees per felling decrease the overall stand-level risk.
Price stochasticity increases the BLV in the individual-tree model by 3.3% and by 1.4% in the matrix model.Accordingly, the steady-state net revenues increase by 7.3% in the individual-tree model and by 3.5% in the matrix model.In both models, price stochasticity increases the mean steady-state volume, but the increase is 5% in the individual-tree model and 1% in the matrix model.The simpler matrix model thus tends to dilute the effects of stochasticity.

Continuous cover forestry for mixed-species stands
With four tree species, the number of nonzero tree classes in the individual-tree model is 950, implying, along with the variables for tree diameter, that the number of (nonzero) state variables is 1900 and the number of control variables is 950 in addition to the binary harvest timing variables.Assuming interest rates of 1-3%, the optimal deterministic solution in both the matrix and individual-tree models is a spruce-dominated CCF solution ( Fig. 8 a).However, the matrix model solution is more even over time, and harvesting interval is shorter.Additionally, the matrix model overestimates the number of trees in both the small and large size classes ( Fig. 8

b).
A major difference between the models' outputs are in their objective values, and the matrix model produces more than three times higher BLV compared with the individual-tree model ( Table 1 ).This is caused by differences in harvest timing and the levels of yield and net revenues.The first harvest is 15 years sooner in the matrix model solution, and the largest harvested tree is 32.5 cm in the steady-state harvests and 25.6 cm in the individual-tree model.Additionally, the matrix model suggests that a mixed-species stand clearly outperforms single-species stands, while a similar outcome is less clear in the individual-tree output ( Table 1 ).The models' differences at the steady-state cycle are less dramatic but still remarkable: the matrix model net revenues are 25-41% and the timber output is 18-24% higher compared with the individual-tree model.
The main effect of growth stochasticity occurs in natural regeneration, and the SD averages ca.2.5 times the mean number of ingrowth seedlings.This causes variability in steady-state net revenues per harvest with an SD of 22% ( Fig. 9 b).To a large extent, this high variability is cancelled out in the BLV distribution, where the fraction of SD from the expected BLV is 6.3% ( Fig. 9 c).Growth stochasticity lengthens the expected harvesting interval from its deterministic 20 years to 21.3 years ( Fig. 9 a).As a consequence, growth stochasticity increases the average stand volume and average annual steady-state harvest yield ( Figs 10 ).Given the S-F setup, growth stochasticity does not have any effects on clearcut timing implying that Fig. 1 b shows the optimal S-F solutions for mixed species forest with and without stochasticity.Thus, the comparison with Fig. 10 reveals the effects of fully utilizing the model dimensions.
By comparison, the stochastic matrix model results show lower SDs in BLV (2.2%) while the SD with the individual-tree model is 6.3%.In the matrix model, stochasticity shortens the expected harvesting interval from 15 to 12.8 years ( Fig. 9 a).Additionally, the SDs of the steady-state harvest, stand volume, and net revenues are lower in the matrix model, suggesting that it dilutes the effects of growth stochasticity, as in the case with single-species spruce CCF solutions.

Discussion and conclusions
The most commonly applied stochastic approach for forest resources, i.e. the "tree paradigm", only optimizes rotation length and specifies stand clear-cut value as geometric Brownian motion, with the result that stochasticity increases both expected rotation and BLV ( Chang, 2005;Miller and Voltaire, 1983;Willassen, 1998 ).Brownian motion with a fixed initial state after each harvest implies that stochasticity originates from stand growth.In our empirically detailed model and data, including only growth stochasticity into this Samuelsson-Faustmann model (i.e.no partial harvests) does not lead to any departures from the deterministic optimal rotation.
Given the specification with no partial harvests, our on i.i.d.price stochasticity are qualitatively similar to Brazee and Mendelsohn (1988) and Insley and Rollins (2005) , but we find that clear-cut timing reacts more sensitively to saw timber compared with pulp price, and that timing is more flexible under a low interest rate.In our results, the stochastic policy increases the expected BLV ca.3.5% above the outcome of the deterministic policy, while the gain is 42-88% in Brazee and Mendelsohn (1988) and as high as 550% in Insley and Rollins (2005) .In our setup, the low increase in expected BLV follows from the low SD in our price data.When the SD in prices is increased to similar levels as in Brazee and Mendelsohn (1988) , the gain from optimal policy in the expected BLV ( ∼ 40% ) is closer to their results.Notes: Interest rate 3%, Expected BLV € 628.
While Clarke and Reed (1989) and Reed and Clarke (1990) make a sharp distinction between age-dependent and sizedependent models, our setup includes these features simultaneously.Steps toward this direction are taken in Helmes andStockbridge (2011) , Sødal (2002) , and Alvarez and Koskela (2007) by the inclusion of thinning (partial harvests) but without expanding the model dimensions.In our size-and age-structured model, the inclusion of thinning lengthens the rotation period, which, depending on tree species, may become infinitely long.In these cases, the solution represents continuous cover forestry, i.e. harvesting based entirely on thinning instead of clear-cuts.Additionally, after including thinning, the maximized stand value increases (19-900%) and optimal solutions react more sensitively to both stochastic forest growth and stochastic wood prices.However, with thinning, SD becomes lower, suggesting that more frequent smaller harvests act as risk-spreading.Together these results reveal the clear reason for proceeding from the classic S-F setup and the tree paradigm to models with extended dimensions.
The optimization of intermediate harvests or thinning brings out the model dimensions and the curse of dimensionality.The individual-tree structure applied here (with 20 0-20 0 0 state and control variables) is an exception even without stochasticity.Haight and Monserud (1990a,b) apply an individual-tree model, solve both RF and CCF solutions but group harvested trees into seven size classes, and do not apply flexible optimization of harvest timing.Tahvonen (2011) uses a fully continuous diameter variable but assumes only one ingrowth tree class per cohort and circumvents the mixed-integer control variables by assuming a fixed harvesting interval.Fixed harvest timing is a clear departure from the optimal timing/stopping spirit and distorts the solutions when the initial state is far from the CCF steady state and when the question is CCF/RF optimal choice.
In previous research with i.i.d.price stochasticity ( Brazee and Mendelsohn, 1988;Insley and Rollins, 2005 ), the reservation price schedules are two-dimensional, decreasing the relations between price and clear-cut age.We show that including partial harvests with price stochasticity or both price and growth stochasticity changes this established result and the price/timing combination becomes dependent on the forest state that evolves stochastically.Computation realizations suggest that the effects of a stochastically evolving stand state make the guidance of the mean timing schedule inaccurate.Optimal timing must be based on observing both stand state and prices.
Within stochastic Markov decision models, Haight (1990Haight ( , 1991) ) include i.i.d.stochastic timber price within the matrix model approach and assume a feedback thinning function that maps the stand cutting value with the number of harvested trees.The function parameters are optimized by Monte Carlo simulation, and the results suggest no harvest when the stand cutting value is low and a positive increasing harvest with a higher value of standing trees.Later, similar methods motivated by the aim to circumvent the curse of dimensionality are applied in Lu and Gong (20 03, 20 05) and Pukkala (2015) , among others. 4e can analyze the accuracy of the thinning function approximation by running the optimal policy (that depends on both stand state and price) with a sufficiently high number of stochastic price realizations and by plotting the stand cutting value with the number of trees harvested ( Fig. 11 ).The outcomes show that neither the steady-state solutions nor the solutions for the transient period can be accurately represented in our model by a function starting horizontally from the origin and continuing as an increasing concave function after some cutting value of standing trees, as in Haight (1990Haight ( , 1991) ) .This outcome is surely case-specific, but at least in our setup a single function cannot include the feature where waiting may be optimal, even with a high stand cutting value, if the price realization is (still) low.
However, in our results, price stochasticity changes the number of harvested trees but the minimum diameter of harvested trees is rather stable (albeit depends on the species and interest rate).Based on this, a feedback function that maps the cutting value of trees exceeding the typical minimum diameter and the number of trees harvested may produce a reasonable approximation.
A very different approach to circumventing the curse of dimensionality is developed in Kaya and Buongiorno (1987) , where a matrix model is used to compute transition probabilities for a reduced (linear) system with a finite number of discrete stand states.Given 41-64 states and three discrete prices, dynamic programming is used to obtain a feedback harvest policy for stochastic problems.This approach is applied widely, and e.g.Buongiorno and Zhou (2015) study mixed-species CCF forestry with price and growth stochasticity and under various objectives.The question remains how well the reduced model is able to represent the original higher-dimensional nonlinear specification.Compared with the feedback thinning function or Buongiorno and his coauthors, we apply a generalization of the matrix model and maintain the relationship between the genuine state variables and harvesting control variables.Additionally, our state variables and stochastic prices continuous and the results suggest that reducing the model dimensions is not harmless.Getz and Haight (1989) compare a two-species matrix model with an individual-tree model and observe that the matrix model provides a method for simplifying individual-tree models with minimum loss of biological details.This is in line with the still prevailing view ( Liang and Picard, 2013 ) that individual-tree models are not suited for optimization because of the excessively large number of state and control variables.In contrast, matrix models are written to contain fewer variables and are computationally suitable for optimization studies.We challenge this view based on optimization output, while these earlier comparisons observe deterministic stand development without harvests ( Picard and Liang, 2014 ) or outcomes from pre-specified non-optimized harvests and do not compare the maximized objective values produced by the two model variants.
One possibility may be to decrease the size-class width in the matrix model to decrease the differences with the individual-tree model.In our model this, however, implies that the fraction of trees moving to the next size class may exceed one.Preventing this generates non-differentiability in the transition and mortality functions; an unfortunate feature for the optimization algorithms typically used with the matrix models.Similar problems occur in a specification without the "Usher property" ( Picard and Liang, 2014 ), i.e. when some fraction of individuals are allowed to move up more than one class.These problems could perhaps be smaller if the period length in forest growth models were less than the typical 5 or 10 years.
The curse of dimensionality is a barrier to model development in many fields of dynamic economics ( Scheidegger and Bilionis, 2019 ), and optimizing the use of forest resources is not an exception.Specifying our model under a varying degree of dimensionality reveals that increasing model structure and details changes the optimization solutions both qualitatively and quantitatively.Methods developed in reinforcement learning are promising in proceeding beyond the existing boundaries and should have high potential in all sub-fields of resource economics, including the economics of climate change ( Cai and Lontzek, 2019;Rolnick et al., 2022 ).

Data availability
Optimizing high-dimensional stochastic forestry via reinforcement learning Appendix A. Forest Growth Following Pukkala et al. (2013) , the survivability functions in ( 5) are given as where functions B j,q,w , j = p, sp, b, a , represent the basal area ( m 2 ha −1 ) of trees larger than ˜ d j,q,w,t diameter of Scots pine (Pinus sylvestris L.), Norway spruce (Picea abies (L.) Karst.), silver birch (Betula pendula Roth), and European aspen (Populus tremula L.) trees respectively.Functions B are continuous and bounded.Parameter denotes the length of period, which is 5 years.The diameter growth in ( 7) is given as where B denotes the total stand basal area ( m 2 ha −1 ), T S the temperature sum (degree days, dd), SX an indicator variable if site fertility is "sub-xeric" instead of the more fertile "mesic", and SD is the standard deviation of the diameter (cm).The last term ε j,w,t denotes the stochastic variation around the expected diameter increment and is obtained by aggregating the stochastic tree-level model defined by Pukkala et al. (2013) .For pine, we assume the site to be 1350dd/sub-xeric, and for spruce and mixed-species stands the site is 1100dd/mesic.When the individual-tree model is applied as the matrix model, the right-hand side of (A.2) is divided by the size class width, implying that the modified (A.2) determines the transition rate ( Liang and Picard, 2013 ).The matrix model we use in comparison is based on a 50-mm size class depth suggested by Bollandsås et al. (2008) and applied in many studies such as Malo et al. (2021) .
Let N in, j and p in, j denote the number of ingrowth trees and the probability of ingrowth, respectively.The stochastic ingrowth function in Pukkala et al. (2013) is given by .4)where B j denotes the total basal area of trees of species j, and ω j,t denotes the residual variation in ingrowth for species j and a 5-year period t.The relative growth variation is especially large among small trees and in the ingrowth estimates.The ingrowths of consecutive 5-year periods and the residuals of predicted ingrowths are positively correlated in the Pukkala et al. (2013) model.The species-specific temporal autocorrelation coefficient is given by ˜ ρ j .This is largely explained by the fact that one good regeneration year tends to generate ingrowth for several years.In addition to autocorrelation, the  A1.When the initial state is bare land, it takes 20 years in mesic sites (25 years in sub-xeric sites) until the stand is artificially regenerated and the number of trees is 1750 (spruce) and 250 (other species in mixed-stands cases) or 2100 (pure pine stands).At these states, trees are equally distributed into ten classes with diameters 5. 25, 5.75,..., 9.75 (in cm).The number of ingrowth trees are equally distributed into ten classes with diameters 0.25, 0.75,..., 4.75 (in cm), meaning that the ingrowth for individual . In the matrix model, the artificially generated state is 1750 spruce trees and 250 other species in the 7.5-cm size class.Naturally regenerated trees enter the 2.5-cm size class.

Appendix B. Harvesting revenues and costs under stochastic prices
Harvesting revenues are defined separately for saw logs and pulpwood using species-specific market prices.The gross revenue per period is given by (p 1 , j,t v 1 , j,w (d j,q,w,t ) + p 2 , j,t v 2 , j,w (d j,q,w,t )) h j,q,w,t , (B.1) where p 1 , j,t and p 2 , j,t are the inflation-adjusted real prices of saw timber and pulpwood calculated using 2019 as a base year, while v 1 , j,w (d j,q,w,t ) and v 2 , j,w (d j,q,w,t ) are saw timber and pulpwood volumes per tree ( Table B.3 ).Throughout, functions v are continuous and bounded functions of d.
The per-period species-specific saw timber and pulpwood prices are modeled as iid.random variables with a multivariate log-normal distribution, ln (p t ) ∼ N(μ p , p ) , (B.2) where μ p ∈ R 2 l represents long-term equilibrium prices and p ∈ R 2 l×2 l is the covariance matrix for normally distributed random innovations.While the long-term equilibrium prices can be specified as long-term means, we estimate the covariance structure using historical price data from the last 30 years obtained from the statistics of Natural Resources Institute, Finland.Even though the observed annual prices have short-term dependencies, we find them significant only within each 5-year period.Therefore, when modeling prices in steps of 5-year periods, it is justified to assume that the per-period prices are independent and identically distributed variables with a finite covariance matrix.To account for short-term effects when estimating p from annual data, we apply a Vector Error Correction Model (VECM) suggested by Engle and Granger (1987) , which is essentially a vector autoregression that contains an equilibrium correction term.Recently, similar modeling approaches have been applied e.g. by Kuuluvainen et al. (2018) , who studied price integration between domestic and imported saw logs in Finland.The annual price model is then given by ln (p τ ) − ln (p τ −1 , for all i = 1 , . . ., k , and ˜ p ∈ R 2 l×2 l is the covariance matrix for normally distributed random innovations.The number of lags k is specified based on information criteria.The remaining parameters can be estimated using Johansen's maximum likelihood procedure ( Johansen, 1995 ).The first term p τ −1 on the right-hand side of (B.3) is known as the error correction term.The term can be decomposed as

Spruce
Pine Birch Aspen is the saw timber volume (m 3 ) per tree.Aspen and birch volumes are assumed to be equal.We use linear interpolation to obtain volumes for arbitrary tree diameters.

Table B2
The covariance matrix for the logarithms of the wood prices.denote the cointegration matrix and the loading matrix, respectively.The cointegration matrix β contains information on the equilibrium relationships between the variables in levels.Vector β p τ −1 can be interpreted as the distance of the price variables from their equilibrium values.The loading matrix β describes the speed at which the wood prices converge back to their equilibrium values.Given that the effects of price shocks vanish already during a 5-year period, we can also use the error covariance structure, estimated using (B.3) , in the simplified period-level model (B.2) without loss of rigor.The parameter values used for the stochastic price model are given in Table B.4 .The mean prices are according to Table B.5 .
Harvesting costs depend on species, tree diameter, and harvested wood quantity.The variable harvesting and hauling costs are derived from the detailed empirical model by Nurminen et al. (2006) for both clear-cut z = cl and thinning z = th :

Table B4
Species-specific parameters for harvesting-cost functions.Species q c j,q, 0 c j,q, 1 c j,q, 2 c j,q, 3 c j, 4 c j, is closed in the state space S for every closed M ⊂ A .For a given state s = (x, d, p) ∈ S, we have x j,q,w ≥ 0 and for the possible harvesting decision h ∈ R l×m ×n we have h j,q,w ∈ [0 , x j,q,w ] , for all 1 ≤ j ≤ l, 1 ≤ q ≤ m and 1 ≤ w ≤ n .We also have where each M k ⊂ R + is closed.Consequently, the claim follows directly by considering the one-dimensional case and observing that the set { and the reward function R z (h t , d t , p t ) are continuous functions of states s t and h t , and the reward function can be expressed as the linear combination of these continuous functions given δ th t and δ cc t , the upper semicontinuity follows.
Condition (W 3) : Let θ : X → R be bounded and continuous.We have to show that the mapping is continuous.Note that now the transition law q is the probability distribution of s t+1 given (s t , a t ) .Moreover, using defining equations ( 1) -( 10) we observe that s t+1 contains two random innovations ω t and ε t , but otherwise is a deterministic continuous transformation of (s t , a t ) .As ω t and ε t are independent of (s t , a t ) , ε t is an independent sequence and ω t is an AR(1) Gaussian sequence, condition (W 3) is obviously satisfied.
It remains to verify Assumption 3 .We will show that the negative part This verifies Assumption 3 .Thus, since Assumptions 1 -3 hold, the statement of Theorem 1 follows from Theorem 2 .This concludes the proof.

Appendix D. Finding the Markov policy using proximal policy optimization
Reinforcement learning algorithms, such as proximal policy optimization, are closely related to dynamic programming.Similar to learning methods, many dynamic programming algorithms are iterative and incremental procedures that find the correct solution through successive approximations ( Sutton and Barto, 2018 ).Reinforcement learning methods can basically be seen as attempts to achieve the same objective as dynamic programming but with less computation and without assuming a perfect model of the environment.Though the forest environment considered in this study is known, it is still too complicated because of the large number of states and actions (i.e., the curse of dimensionality) to be solved using classical dynamic programming techniques.To deal with large state and action spaces, we apply policy gradient methods that learn a parameterized policy directly via gradient ascent.

D1. Proximal policy optimization as a policy gradient algorithm
Assume that a policy is parameterized by a vector θ , and J(θ ) = V μ θ (s ) is a measure for the performance of the parameterized policy μ θ .The idea of policy gradient methods is to maximize performance by updating the policy parameters using approximate gradient ascent in J: where β denotes the learning rate parameter for the current update.
Most of the policy gradient algorithms would approximate the gradient by where the expectation ˆ E t denotes the empirical average over a finite batch of sample trajectories, and ˆ A t is an estimate for the advantage function at time t.The trajectories are obtained by interacting with the environment.These trajectories are also known as policy rollouts, as they are produced by repeatedly applying the current policy.The advantage for a policy μ is defined as ) is the state-action value function for policy μ.This class of algorithms that use the advantage function while approximating the gradient are known as actorcritic methods.
To prevent destructively large parameter updates, the policy gradient algorithms commonly use surrogate objectives instead of J(θ ) directly.For instance, trust region algorithms have applied a surrogate objective, max where KL denotes Kullback-Leibler divergence and θ old is the parameter vector before the update, and κ is a constant that limits the size of the policy updates.
The proximal policy optimization algorithm (PPO) can be seen as a policy gradient algorithm that is based on the wellknown actor-critic architecture Schulman et al., 2017 ).Similar to trust region methods, PPO maximizes a surrogate objective function that forms a lower bound on the performance of the policy.This corresponds to performing approximate gradient ascent in where the penalty parameter β is assumed to be updated using a suitable adaptive strategy.Hence, instead of attempting to constrain the size of the policy updates using a strict constraint, PPO applies clipped surrogate functions or a soft penalty that limits the KL divergence of the policy update.Most empirical studies suggest that clipping seems to work at least as well as PPO with KL penalty, but it is considerably easier to implement.The clipped surrogate objective that is used in most PPO applications is given by A t ) , (D.5) where the clipping function g is defined as (D.6)

D2. Proximal policy optimization with parameterized action spaces
Following Malo et al. (2021) , we use a modified version of PPO known as hybrid proximal policy optimization algorithm ( Fan et al., 2019 ), where the surrogate objective function is extended to include both continuous and discrete actions.The algorithm is essentially similar to the original PPO, except that the policy is represented by two parallel actor networks, one for the discrete actions and one for the continuous actions (see Fig. D.12 ).The implementation leverages the special hierarchical structure of the action space, where the continuous actions (i.e.harvesting amounts) can be interpreted as parameters of the discrete actions (i.e.choice between thinning, clear-cut, and waiting).
To avoid the computationally expensive use of KL divergence directly, the policy parameters are updated via a clipped PPO objective function J H−P P O (θ ) = ˆ E t [ L (s t , a t , θ old , θ ) ] , (D.7) where the policy is parameterized by vector θ = (θ c , θ d ) , where θ c and θ d are parameters that correspond to continuous and discrete decisions, and the surrogate objective L is defined as the sum of losses corresponding to the discrete and continuous decisions a = (h, δ th , δ cc ) , i.e.
Algorithm 1 PPO with hybrid actions and a clipped objective function.

procedure PPO-Clip
Input: initial policy parameters θ 0 , initial value function parameters φ 0 L (s t , a t , θ k , θ ) using (stochastic) gradient ascent.Estimate value function by minimizing mean-squared error: using gradient descent.end for end procedure The policy function μ θ (i.e. the "actor") is implemented as two neural networks with two heads, one for discrete actions and one for continuous actions, i.e. harvest levels.A softmax-layer is used to compute the action probabilities for each stochastic policy.The continuous part is implemented as a Gaussian policy, which outputs the harvest levels that are transformed to relative proportions of trees belonging to predefined bins.The expected values of the Gaussian are the outputs of the continuous head of the neural network.The standard deviations are treated as additional parameters that are optimized using gradient descent together with the neural network parameters.In our implementation, we used 1-cm bins, with the exception of combining the bins with diameters less than 5cm and bins above 35 cm.An accuracy of 1cm is considered sufficient to approximate the optimal solutions without significant sacrifice in the objective function value.Moreover, harvesting instructions with higher precision in diameter values would hardly be feasible in real-world harvesting.
The performance of RL algorithms is often sensitive to the state representation.Thus, the actor and critic apply preprocessing to the original state variables before feeding the observations to the neural network inputs.As a preprocessing step, we use the tree counts and diameters for each size and age class to compute a histogram with predefined fixed bins for the diameter distribution of each species.The tree frequencies for each bin are fed to the neural network input nodes.Learning policy and value functions is considerably more efficient using this alternative state representation compared with feeding raw tree count and diameter pairs to the network.As the PPO algorithm is prone to local optima, we run the training multiple times.The policy with best performance was chosen in the evaluation.Calculations were performed using computer resources within the Aalto Science-IT project.For further details considering the PPO algorithm used, see Fan et al. (2019) ; Schulman et al. (2015) .Our H-PPO implementation follows the details presented in Malo et al. (2021) .
Fig. 6  a shows how the matrix model produces faster initial growth, earlier and more frequent thinning (20 vs. 15 years), and smoother volume development compared with the individual-tree model.In Fig.6b (upper panel), the tree classes in the individual-tree model are grouped into 5-cm size classes.The matrix model overestimates the number of trees at both ends and particularly the size of the largest trees.Together, these differences imply that the matrix model overestimates the BLV by a factor of three, although the overestimation is lower for the steady-state mean annual net revenues (26%).

Fig. 7 .
Fig. 7. Stochastic continuous cover solutions (spruce).a) A steady-state reservation price schedule under fixed previous harvest state (stochastic price), b) Harvest timing distributions as deviations from deterministic timing (stochastic price), c) A steady-state reservation price schedule under stochastic previous harvest state (stochastic price), d) A steady-state reservation price schedule under stochastic previous harvest state (stochastic price and growth), e) Bare land value (BLV) distributions (stochastic price), f) Deterministic solution and stochastic realizations (stochastic price and growth), interest rate 3% in c-f.

Fig. D1 .
Fig. D1.The neural network structure.The stand structure is given as an input.The actor networks output the discrete harvesting decision and the continuous harvest levels.The value network is used in the advantage estimation.
9) L d (s, δ th , δ cc , θ old,d , θ d ) = min μ θ d (δ th , δ cc | s ) μ θ old,d (δ th , δ cc | s ) The gross revenues from clear-cutting and thinning are R cc (h t , d t , p t ) and R th (h t , d t , p t ) and the clear-cut and thinning costs C cc (h t , d t ) and C th (h t , d t ) , respectively.

Table 1
Comparison of matrix and individual-tree deterministic models.
Note: M = matrix model, I = individual-tree model, interest rate 3%.All figures per hectare.Excluding BLV, figures are for a CCF steady-state cycle.

Table A1
Pukkala et al. (2013)the growth model(Pukkala et al. 2013).ingrowthsarecross-correlated across species, which is captured by the random factors u t that are assumed to follow a multivariate normal distribution and be independent within time periods t with a constant covariance matrix u .Parameter values are based on values presented inPukkala et al. (2013), and the deterministic ingrowth model is obtained by averaging.The numerical values for regression coefficients b •, j in (A .1)-(A .3)can be found in Table d)Parameter d is tree diameter at breast height, v 1 , j (d) is the pulpwood volume (m 3 ) per tree, and v 2 , j(d) Prices for spruce, pine, and birch are based on data from the Natural Resources Institute Finland, while prices for aspen are based on personal communication with the Central Union of Agricultural Producers and Forest Owners. a (c j,z, 1 + c j,z, 2 v j,s (d j,z,s,t ) + c j,z, 3 v j,s (d j,z,s,t ) 2 ) v j,s (d j,z,s,t ) is the total tree volume, again assumed continuous and bounded, and c j,z,s are parameters given inTable B.6 .The model defines cutting (B.4) and hauling (B.5) costs separately.According to this model, the variable harvesting costs increase with total harvested volume but decrease with tree volume.Only cutting costs have species-specific parameters, while hauling and felling costs are determined without separating between tree species. where s is finite, allowing to drop the price dynamics p from the considerations.Let x t , d t ≤ K elementwise for some finite constant K.That is, the number of trees x t or the diameter d t of each size, age, and species classes cannot exceed a certain number determined by the area of the forest.Using elementary fact (a − b) − ≤ b and the definition (1) of π , we see that Sample a set of trajectories D k = { τ i } each with T time steps by running policy μ θ k in the environment.Compute rewards to go ˆ U t = T t= t π (s t , a t ) γ t .based on the current value function V φ k .Update the policy by maximizing the clipped surrogate objective: