Optimum trajectory learning in musculoskeletal systems with model predictive control and deep reinforcement learning

From the computational point of view, musculoskeletal control is the problem of controlling a high-degrees-of-freedom, dynamic multi-body system that is driven by redundant muscle units. A critical challenge in controlling skeletal joints with antagonistic muscle pairs is finding methods that robustly address this ill-posed nonlinear problem. To this end, we implemented a twofold optimization and learning framework specialized in resolving the redundancies in muscle control. In the first part, we used model predictive control to obtain energy-efficient skeletal trajectories that mimic human movements. In the second part, we used deep reinforcement learning to obtain the sequence of muscle stimuli that reproduces these skeletal trajectories under muscle control. We observed that the desired muscle stimuli can only be constructed efficiently by integrating state and control input in a closed-loop setting, much as proprioceptive integration operates in the spinal cord circuits. In this work, we showed how a variety of reference trajectories can be obtained with optimal control and how these reference trajectories are mapped to musculoskeletal control with deep reinforcement learning. From the characteristics of human arm movement to an obstacle-avoidance experiment, our simulation results confirm the capabilities of our optimization and learning framework for a variety of dynamic movement trajectories. In summary, the proposed framework offers a pipeline that can complement scarce human motion-capture recordings and can be used to study the activation ranges of the muscles required to replicate a specific trajectory of interest.
Using the trajectories from optimal control as reference signals for the reinforcement learning implementation allowed us to acquire optimal, human-like behaviour of the musculoskeletal system, providing a framework for in-silico studies of human movement. The framework can also support research on upper-arm rehabilitation with assistive robots: recordings of healthy subjects' movements can serve as references for designing the control architecture of assistive robotics that compensates for behavioural deficiencies. Hence, the framework opens the possibility of replicating or complementing labour-intensive, time-consuming and costly experiments with human subjects in the fields of movement studies and digital twins of rehabilitation. Supplementary Information The online version contains supplementary material available at 10.1007/s00422-022-00940-x.


Muscle Dynamics
The musculoskeletal arm model is controlled by 14 extensor and flexor muscle-tendon units; the abbreviations of these muscles and the nomenclature used in the paper are given in Table 1.
Incorporating the properties of muscle dynamics into mathematical models paves the way for musculoskeletal simulations in which different hypotheses about the underlying neural motor control can be examined. There exists a large number of different muscle models, which differ in their level of description, their complexity and the behaviour being modelled. Most models either focus on how microscopic processes at the level of the tissue generate function, or are generic functional models that describe the input-output relationship between muscle activation and force generation [2]. The microscopic modelling approach takes into account the cross-bridge dynamics and the sliding activity of the filaments, e.g. the Huxley muscle model [3], whereas generic macroscopic models focus on higher-level properties of the muscles such as the force-length and force-velocity relationships, e.g. Hill-type muscles [4,5,6,7]. Since the microscopic properties of the muscle tissue are not relevant to biomechanical simulation studies, so-called Hill-type muscle models are widely used instead of microscopic models. Sensory information can also be extracted from these models, because length, velocity and force are all calculated within them.
The muscle model consists of several so-called muscle-tendon units (MTUs). An MTU defines the dynamic properties of a muscle and has two serially connected parts: an elastic part (the tendon) and a contractile part (the muscle unit). The muscle unit is in turn composed of a contractile element and an elastic element, this time arranged in parallel. The combined activity of these sub-units generates the force that causes movement in the skeletal system; hence the muscle force is generated by the serial and parallel elements of the MTU. The contractile element f_L(l^M) f_V(v^M) together with the parallel elastic element f_PE(l^M) composes the muscle unit, which in turn generates the muscle force f^M. The parallel elastic element becomes active in the case of excessive stretch of the contractile element, or of shrinkage beyond the acceptable range, in order to prevent the muscle-tendon unit from collapsing. In addition, the serially connected tendon provides the series elasticity f^T(l^T), which creates the tendon force f^T and contributes to the force-length and force-velocity profile of the muscle.
The activation of the muscle is driven by the stimulus of the motor neurons through a first-order differential equation whose time constant τ accounts for the neural delay. The contraction of the muscle depends on the joint states of the musculoskeletal dynamics, which in turn depend on the states of the muscles; we are therefore faced with a recurrent nonlinear dynamical system. In Hill-type muscle modelling, the force is calculated from the force-velocity and force-length relationships together with the active stimulation of the muscle, as given below.
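The two relations referred to here do not survive in this extract. Assuming the standard first-order activation dynamics and the usual Hill-type force decomposition (a sketch for concreteness, not necessarily the paper's exact notation), they take the form:

```latex
\tau \, \dot{a}(t) = u(t) - a(t)
```

```latex
f^{M} = f^{M}_{0}\left[\, a \, f_{L}(l^{M}) \, f_{V}(v^{M}) + f_{PE}(l^{M}) \,\right]
```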
where f^M_0 is the peak force at length l^M_0. After the integration of the muscle and joint positions and velocities, the continuous time series of the stimulation is translated into the force generated by the muscle sub-units. Hence, the activation of the muscle by a stimulus generates the active tension in the muscle model through the evaluation of the active contractile element, the passive elastic element and the tendon dynamics.
The force-length relationship of the muscle results from the properties of the sarcomeres, which are the main source of the active force generated by the muscle. It has a bell-shaped curve and produces passive force beyond a certain amount of stretching. The relationship is parameterized by the optimum length l_opt of the muscle and the width w of the sarcomere force-length curve.
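A commonly used bell-shaped form consistent with the parameters l_opt and w is the following (an assumed reconstruction of the missing equation, shown for concreteness):

```latex
f_{L}(l^{M}) = \exp\!\left( \ln(0.05) \left| \frac{l^{M} - l_{opt}}{l_{opt}\, w} \right|^{3} \right)
```

so that f_L equals 1 at l^M = l_opt and decays to 0.05 at the edges of the bell, |l^M − l_opt| = l_opt w.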
The total force exerted by the muscle is thus the combination of the active and passive forces generated by the contractile element and the passive elastic element. Muscles also have dynamic properties: there is a relationship between the shortening velocity of a contraction and the force generation, reflecting the actin-myosin cycle, which affects force generation during non-isometric contractions. Instead of a bell-shaped curve, the force-velocity relation has an inverted-sigmoidal shape. One important aspect of force generation during non-isometric contraction is that force is produced whenever the muscle is stretched beyond its threshold length, regardless of whether the muscle is activated or not. The force-velocity relationship of the muscle is given by f_V(v^M), which represents the concentric contraction as an inverted sigmoid.
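The concentric branch of a Hill-type force-velocity relation consistent with the parameters v_max, N and K defined below is (an assumed standard form, with v^M < 0 denoting shortening):

```latex
f_{V}(v^{M}) = \frac{v_{max} + v^{M}}{v_{max} - K\, v^{M}}, \qquad -v_{max} \le v^{M} \le 0
```

which equals 1 at zero velocity and 0 at v^M = −v_max; the eccentric branch (not shown) saturates towards N for lengthening velocities.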
where v_max is the maximum contraction velocity and N and K are shape parameters. The passive force-length curve f_PE(l^M) is given below.
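A common quadratic choice for the passive element, engaging only beyond the optimum length, is (again an assumed form, not necessarily the paper's exact expression):

```latex
f_{PE}(l^{M}) =
\begin{cases}
\left( \dfrac{l^{M} - l_{opt}}{l_{opt}\, w} \right)^{2}, & l^{M} > l_{opt} \\[2ex]
0, & \text{otherwise}
\end{cases}
```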
The new length of the muscle after one time interval [t_n, t_{n+1}] depends on the contraction velocity of the muscle. To compute the new length numerically, the length-velocity equation is solved with the trapezoidal rule,
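With step size Δt = t_{n+1} − t_n, the trapezoidal update of the muscle length reads:

```latex
l^{M}(t_{n+1}) = l^{M}(t_{n}) + \frac{\Delta t}{2}\left[ v^{M}(t_{n}) + v^{M}(t_{n+1}) \right]
```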
and the velocity of the muscle is found by inverting the muscle force-velocity relation.
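The relations above can be sketched in Python. This is a minimal illustration assuming the bell-shaped force-length curve and the concentric Hill hyperbola given earlier; the parameter values are hypothetical placeholders, not the values from Table 2.

```python
import math

# Hypothetical placeholder parameters (the paper's values are in Table 2)
L_OPT = 0.11   # optimum muscle length l_opt [m]
W = 0.56       # width of the force-length bell curve
V_MAX = 1.2    # maximum contraction velocity v_max [m/s]
K = 5.0        # force-velocity curvature parameter

def f_l(l):
    """Bell-shaped active force-length relation: 1 at l_opt, 0.05 at the edges."""
    return math.exp(math.log(0.05) * abs((l - L_OPT) / (L_OPT * W)) ** 3)

def f_v(v):
    """Concentric (shortening, v <= 0) branch of the force-velocity relation."""
    if v > 0:
        raise ValueError("only the concentric branch (v <= 0) is sketched here")
    return (V_MAX + v) / (V_MAX - K * v)

def inv_f_v(f):
    """Inverse of the concentric branch, used to recover the velocity from force."""
    return V_MAX * (f - 1.0) / (1.0 + K * f)

def step_length(l_n, v_n, v_n1, dt):
    """Trapezoidal update of the muscle length over [t_n, t_{n+1}]."""
    return l_n + 0.5 * dt * (v_n + v_n1)
```

By construction, f_v equals 1 at zero velocity and drops to 0 at −v_max, and inv_f_v(f_v(v)) recovers v on the concentric branch, which is how the simulation closes the loop from force back to contraction velocity.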
The muscles are represented by multiple compartments in the simulation, and the attachment sites are consistent with digitized muscle insertions and anatomical descriptions [8,9]. The parameters of each muscle in these simulations are given in Table 2.

Simulation Details

Model Predictive Control
The objective function of the skeletal control in Fig. 6 for the precise timing control problem is given as follows:
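The objective itself is not reproduced in this extract. A typical quadratic tracking objective for such a precise-timing MPC problem, with hypothetical weight matrices Q, R and P (illustrative only, not the paper's exact cost), would be:

```latex
\min_{u_0,\dots,u_{N-1}} \;
\sum_{k=0}^{N-1} \left( \lVert x_k - x^{ref}_k \rVert_Q^2 + \lVert u_k \rVert_R^2 \right)
+ \lVert x_N - x^{goal} \rVert_P^2
\quad \text{s.t.} \quad x_{k+1} = f(x_k, u_k)
```

where the horizon N is chosen so that the terminal state is evaluated exactly at the prescribed arrival time.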

Deep Reinforcement Learning
The derivation of the deep reinforcement learning objective from Equation 5 in the main manuscript can be written as follows [10]:

∇_θ J(θ) = E_{τ∼π_θ} [ Σ_t ∇_θ log π_θ(a_t|s_t) Q^{π_θ}(s_t, a_t) ]

The right-hand side of this equation is usually estimated, rather than solved analytically, from sample trajectories obtained while executing the policy:

∇_θ J(θ) ≈ (1/N) Σ_{i=1}^{N} Σ_t ∇_θ log π_θ(a_t|s_t) ( Q^{π_θ}(s_t, a_t) − V^{π_θ}(s_t) )

where a baseline function (here the natural choice is the value function) V^{π_θ}(s_t) is subtracted from the expected return Q^{π_θ}(s_t, a_t) in order to alleviate the possibility of very large variance in the expectation [11]. The difference is also referred to as the advantage function:

A^{π_θ}(s_t, a_t) = Q^{π_θ}(s_t, a_t) − V^{π_θ}(s_t)

Among several other possibilities (e.g. the total reward of the trajectory, the reward following action a_t, or the state-action value function), it has been shown empirically that the advantage function yields the lowest variance in the expectation. This function can be interpreted as shaping the gradients so that they increase the probability of better-than-average decisions while decreasing the probability of worse-than-average decisions. Given that the expected return is the sum of discounted rewards, one can rewrite the advantage function as follows:

Â_t = Σ_{l=0}^{∞} γ^l δ_{t+l}

which corresponds to choosing λ = 1 in the generalized advantage estimation of [12], whose general form is

Â_t = Σ_{l=0}^{∞} (γλ)^l δ_{t+l}

where δ_t = r_t + γ V^{π_θ}(s_{t+1}) − V^{π_θ}(s_t) is the temporal-difference error. One way to simplify the maximization of the expected return is to write down a surrogate optimization problem whose focus is the maximization of the true reward [11]:

maximize_θ E_t [ ρ_t(θ) Â_t ]

where ρ_t(θ) = π_θ(a_t|s_t) / π(a_t|s_t), with the current policy π_θ(a_t|s_t) and the policy before the update π(a_t|s_t). Two main algorithms have been suggested to solve this maximization problem: trust-region policy optimization (TRPO) [11] and PPO [12]. Although TRPO is effective, it is computationally costly because it relies on the nonlinear conjugate gradient method, in which multiple Hessian-vector products have to be calculated.
Instead, PPO directly clips the objective function, replacing the Kullback-Leibler divergence constraint, so that no conjugate gradient calculation is required. We used PPO in our simulations in order to obtain muscle activities given desired joint trajectories. The maximization problem in the context of PPO can be given as follows:

maximize_θ E_t [ min( ρ_t(θ) Â_t, clip(ρ_t(θ), 1 − ε, 1 + ε) Â_t ) ]

where ε determines the width of the clip interval. The first term is the conservative policy iteration objective, which could lead to very large updates in the policy optimization. To solve this problem, the second term is introduced to clip the probability ratio to the interval [1 − ε, 1 + ε].
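The per-step clipped surrogate can be sketched as follows; a minimal illustration with the ratio, advantage and clip half-width eps as generic inputs, not quantities taken from the paper's Table 3.

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Per-step PPO objective: minimum of the unclipped and clipped terms.

    ratio     -- probability ratio pi_theta(a_t|s_t) / pi_old(a_t|s_t)
    advantage -- advantage estimate A_t (e.g. from GAE)
    eps       -- clip half-width, giving the interval [1 - eps, 1 + eps]
    """
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)
```

For a positive advantage the objective stops growing once the ratio exceeds 1 + eps, and for a negative advantage it stops shrinking below 1 − eps, which is what bounds the size of each policy update without any conjugate gradient step.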
We summarized all the hyperparameter settings for the PPO training in the 2D human musculoskeletal arm simulations in Table 3. We used Adam [13], a first-order gradient-based optimizer for stochastic objective functions, to estimate the gradients of the PPO.

Figure caption (joint trajectories): Results of the precise timing control experiments, see Figure 6 in the main paper. While the elbow target is indicated by a dashed purple line, the target of the shoulder joint is the x-axis, since the shoulder joint is forced to stay stable. For each experiment, the difference between the desired trajectory and the simulation results is given in the inset figures, where it can be seen that the resulting trajectories are close to the optimal trajectories. B. The averaged error of both trajectories, elbow and shoulder, is shown in blue for both experimental settings. The magnitude of the initial error at the elbow joint is highly correlated with the targeted time of the goal position: the more time the elbow joint has to reach the goal position, the smaller the error at the beginning of the experiment.

Figure caption (flexor muscles): Comparison of flexor muscle activities for the experiments of precise timing control, see manuscript Figure 6. Red lines represent the muscle activity observed for the experiment of reaching the target within 0.4 s, whereas black lines represent the muscle activity observed for the experiment of reaching the target within 0.7 s. Since the target goal is identical and only the timing differs, one can observe similar activity patterns across the muscles, albeit with more activity for 0.4 s, since it requires a faster movement.

Figure caption (extensor muscles): Comparison of extensor muscle activities for the experiments of precise timing control, see manuscript Figure 6. Red lines represent the muscle activity observed for the experiment of reaching the target within 0.4 s, whereas black lines represent the muscle activity observed for the experiment of reaching the target within 0.7 s. Similar to the flexor muscles, there are significant similarities in the extensor activities, and all extensor muscles show high correlation with each other.