Practical Algorithmic Trading Using State Representation Learning and Imitative Reinforcement Learning

Algorithmic trading allows investors to avoid emotional and irrational trading decisions and helps them make profits using modern computer technology. In recent years, reinforcement learning has yielded promising results for algorithmic trading. Two prominent challenges in algorithmic trading with reinforcement learning are (1) extracting robust features and (2) learning a profitable trading policy. Another challenge is that it was previously often assumed that both long and short positions are always possible in stock trading; however, taking a short position is risky or sometimes impossible in practice. We propose a practical algorithmic trading method, SIRL-Trader, which achieves good profit using only long positions. SIRL-Trader uses offline/online state representation learning (SRL) and imitative reinforcement learning. In offline SRL, we apply dimensionality reduction and clustering to extract robust features whereas, in online SRL, we co-train a regression model with a reinforcement learning model to provide accurate state information for decision-making. In imitative reinforcement learning, we incorporate a behavior cloning technique with the twin-delayed deep deterministic policy gradient (TD3) algorithm and apply multistep learning and dynamic delay to TD3. The experimental results show that SIRL-Trader yields higher profits and offers superior generalization ability compared with state-of-the-art methods.


I. INTRODUCTION
Algorithmic trading, which enables investors to trade stocks without human intervention, has come to play an important role in modern stock markets. Algorithmic trading is a subset of quantitative trading, which relies heavily on quantitative analysis and machine-learning methods. In particular, reinforcement learning methods can be used to learn trading strategies in the process of interacting with the stock market environment. The application of reinforcement learning to algorithmic trading is not trivial because financial time-series data are nonstationary and contain considerable noise. Two prominent challenges presented by algorithmic trading with reinforcement learning are (1) extracting robust features and (2) learning a profitable trading policy. To address these challenges, recent methods [1]–[6] use deep reinforcement learning and deliver good performance in terms of profitability.
Previously, an assumption was often made that both long and short positions are always possible in stock trading. These positions represent the direction of the bets that a stock price is expected to either rise or fall. A long position involves buying stocks with the intention of future selling, whereas a short position involves selling stocks borrowed from a stockbroker first and then buying them back to close the position.
We note that taking a short position is risky or sometimes impossible in practice, particularly for individual investors. First, the maximum gain in a short position is 100%, reached when the stock price falls to zero, whereas the potential loss is theoretically infinite because the price has no upper bound. By contrast, the maximum loss in a long position is limited because the price can only decrease to zero, but the possible gain is theoretically infinite. Second, the stockbroker may close a short position immediately, without the investor's consent. The borrowed stocks are essentially loans from stockbrokers. If the stock price rises, the stockbroker requires additional capital, and if the investor cannot provide it, the stockbroker can close the position, resulting in a loss. Third, the stock market generally rises over time, which puts short positions at a structural disadvantage. Fourth, economic regulators can severely restrict or temporarily ban short-selling during an economic crisis. In summary, a short position is not recommended for individual investors. However, obtaining good profits using only long positions is challenging in algorithmic trading.
In this paper, we propose a practical algorithmic trading method named SIRL-Trader, which generates good profits using only long positions. First, we devise an offline/online state representation learning (SRL) method. The offline unsupervised SRL reduces the dimensionality and applies clustering to extract robust features from observations. The online supervised SRL co-trains a regression model that predicts the next price with a reinforcement learning model to provide accurate state information for decision-making. Second, we combine imitation learning with reinforcement learning by cloning the behavior of a prophetic expert who has information about subsequent price movements. Third, we extend the twin-delayed deep deterministic policy gradient (TD3) algorithm [7], which is a state-of-the-art reinforcement learning method, to incorporate offline/online SRL, behavior cloning, multistep learning, and dynamic delay. Fourth, compared with state-of-the-art algorithmic trading methods, SIRL-Trader yields higher profit and has superior generalization ability for different stocks.
The remainder of this paper is organized as follows. Section II introduces reinforcement learning methods, and Section III reviews existing work. Sections IV and V present SIRL-Trader and experimental results, respectively. Finally, Section VI presents our conclusions and suggestions for future work. For ease of reading, Table V in the Appendix lists the abbreviations used in this paper.

A. REINFORCEMENT LEARNING
In reinforcement learning, an agent learns to act in an environment to maximize the total reward. At each time step, the environment provides a state s to the agent, the agent selects and takes an action a, and then the environment provides a reward r and the next state s′. This interaction can be formalized as a Markov decision process (MDP), which is a tuple ⟨S, A, P, R, γ⟩, where S is a finite set of states, A is a finite set of actions, P(s, a, s′) is a state transition probability, R(s, a) is a reward function, and γ ∈ [0, 1] is the discount factor, which trades off immediate against long-term rewards. In reinforcement learning for algorithmic trading, the state is not directly given and needs to be constructed from a history of observations. To accommodate this, the MDP model was extended with an observation probability P(o|s, a). The extended model is referred to as the partially observable MDP (POMDP) model [8].
The agent selects an action using a deterministic policy µ(s) or a stochastic policy π(a|s), which defines a probability distribution over actions for each state. The discounted sum of future rewards collected by the agent from state s_t is the discounted return G_t, defined in Equation (1):

G_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... = Σ_{k=0}^{∞} γ^k r_{t+k+1}.  (1)
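The discounted return in Equation (1) can be computed for a finite reward sequence as follows (an illustrative sketch; the function name is ours, not part of the paper):

```python
def discounted_return(rewards, gamma):
    """Discounted return G_t = sum_k gamma^k * r_{t+k+1} (Equation (1)),
    truncated to a finite list of future rewards."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g
```

For example, three unit rewards with γ = 0.5 give G = 1 + 0.5 + 0.25 = 1.75; with γ = 0 only the immediate reward counts, illustrating the trade-off the discount factor controls.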

B. DEEP Q-NETWORK (DQN)
In value-based reinforcement learning, the agent learns an estimate of the expected discounted return, or value, for each state (Equation (2)) or for each state and action pair (Equation (3)).
A common way of deriving a new policy π′ from Q^π(s, a) is to act ε-greedily with respect to the actions. With probability (1 − ε), the agent takes the action with the highest Q-value (the greedy action), that is, π′(s) = argmax_{a∈A} Q^π(s, a). With probability ε, the agent takes a random action to introduce exploration.
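The ε-greedy rule can be sketched as follows (an illustrative fragment; the function name and interface are ours):

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon take a random action (exploration);
    otherwise take the greedy action argmax_a Q(s, a) (exploitation)."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

With ε = 0 the rule is purely greedy; with ε = 1 it is purely random, and typical schedules decay ε during training.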
The deep Q-network (DQN) [9] introduces deep neural networks to approximate the Q-value for large state and action spaces. DQN uses a replay buffer to store past experiences as tuples ⟨s, a, r, s′⟩ and learns by sampling batches from the replay buffer. DQN uses two neural networks: an online network Q_θ and a target network Q_θ′. The parameters θ of the online network are periodically copied to the target network during training. The loss function of DQN in Equation (5) is the mean squared error (MSE) between Q_θ(s, a) and the target value Y_DQN in Equation (4), which uses the Bellman equation [10]. The techniques of experience replay and the target network enable stable learning.
Because we take the maximum value for the target network in Equation (4), we often obtain an overestimated value. Double DQN (DDQN) [11] solves this overestimation problem of DQN by decoupling the selection of the action from its evaluation, as shown in Equation (6).
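The difference between the DQN and Double DQN targets can be made concrete with a small sketch (illustrative code; function names are ours), following Equations (4) and (6): DQN both selects and evaluates the next action with the target network, whereas DDQN selects with the online network and evaluates with the target network:

```python
def dqn_target(r, q_next_target, gamma, done=False):
    """DQN target: Y = r + gamma * max_a' Q_target(s', a') (Equation (4))."""
    if done:
        return r
    return r + gamma * max(q_next_target)

def ddqn_target(r, q_next_online, q_next_target, gamma, done=False):
    """Double DQN target: select a' with the online network,
    evaluate it with the target network (Equation (6))."""
    if done:
        return r
    a_star = max(range(len(q_next_online)), key=lambda a: q_next_online[a])
    return r + gamma * q_next_target[a_star]
```

When the target network overestimates an action the online network would not choose, the DDQN target is smaller, which is exactly the decoupling that mitigates overestimation.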

C. ACTOR-CRITIC METHODS
In policy-based reinforcement learning, the agent directly learns a policy function π(a|s). Actor-critic methods combine policy-based and value-based reinforcement learning. In these methods, the critic network (V_θ) learns the value function, and the actor network (π_φ) learns the policy in the direction suggested by the critic. In general, the loss function of the critic network is defined in Equation (8), which uses the Bellman equation; the loss function of the actor network is defined in Equation (9), which uses the stochastic policy gradient theorem [18].
Asynchronous advantage actor-critic (A3C) is an actor-critic method that asynchronously executes multiple agents in parallel instead of using experience replay. In A3C, the critic network estimates the advantage of action a in state s, that is, A(s, a) = Q(s, a) − V(s). A3C uses n-step returns to accelerate convergence and adds the entropy of the policy π(a|s) to the loss function of the actor network to encourage exploration.

D. DEEP DETERMINISTIC POLICY GRADIENT (DDPG)
The deterministic policy gradient (DPG) [19] is an actor-critic method that learns a deterministic policy µ(s) rather than a stochastic policy π(a|s). It is a special case of the stochastic policy gradient in which the variance approaches zero. DPG is efficient and effective for high-dimensional continuous action spaces.
The deep DPG (DDPG) [20] is an actor-critic method that combines DPG and DQN. As in DQN, DDPG uses a replay buffer and target networks. For exploration, DDPG adds noise N to the policy, as shown in Equation (10). The loss function of the critic network is defined in Equation (12), which uses the Bellman equation, and that of the actor network in Equation (13), which uses the deterministic policy gradient theorem [19]. After updating the online actor and critic networks, the target actor and critic networks are soft-updated from the online networks, as in Equation (14).
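The soft update of Equation (14) blends online parameters into the target parameters with a small coefficient τ rather than copying them wholesale, as in this sketch (illustrative; parameters are represented as plain lists here rather than network weights):

```python
def soft_update(target_params, online_params, tau):
    """Polyak soft update (Equation (14)): theta' <- tau*theta + (1-tau)*theta'.
    Returns the new target parameters."""
    return [tau * o + (1.0 - tau) * t
            for t, o in zip(target_params, online_params)]
```

With a typical τ such as 0.005, the target network tracks the online network slowly, which stabilizes the bootstrapped targets.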

E. TWIN-DELAYED DDPG (TD3)
The twin-delayed DDPG (TD3) improves DDPG in three ways. First, TD3 adds Gaussian noise to the target action, as in Equation (15), which is used to obtain the target value Y in Equation (16). This technique, known as target policy smoothing, serves as a regularizer to avoid overfitting to sharp peaks in the Q-value estimate. The noise is clipped to limit its impact.
Second, TD3 uses two critics (and two target critics) to solve the overestimation problem of DDPG. The minimum value in the pair of target critics is used to compute the target value, as shown in Equation (16). This technique is known as clipped double Q-learning.
Third, TD3 updates the actor network µ_φ (along with the target actor network µ_φ′ and the target critic networks Q_θ′j) less frequently than the critic networks Q_θj. This technique, known as delayed policy updates, delays policy updates until the Q-value estimates converge.
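The first two improvements, target policy smoothing (Equation (15)) and clipped double Q-learning (Equation (16)), can be combined in one target computation, sketched below (illustrative code; the function name and the representation of the critics as callables are ours):

```python
import random

def td3_target(r, next_action, q1_target, q2_target, gamma,
               sigma=0.2, c=0.5, rng=random):
    """TD3 target value: add clipped Gaussian noise to the target action
    (policy smoothing, Equation (15)), then bootstrap from the minimum of
    the two target critics (clipped double Q-learning, Equation (16))."""
    noise = max(-c, min(c, rng.gauss(0.0, sigma)))  # clip noise to [-c, c]
    a_tilde = next_action + noise                   # smoothed target action
    return r + gamma * min(q1_target(a_tilde), q2_target(a_tilde))
```

Taking the minimum of the two critics biases the target downward, counteracting the overestimation that a single maximizing critic produces.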

III. RELATED WORK
Considerable research has been devoted to algorithmic trading, including supervised learning methods [21]- [27] that predict price trends and reinforcement learning methods [1]- [6] that directly learn a profitable trading policy. In recent years, deep reinforcement learning methods have shown promising results in algorithmic trading. There are two challenges in deep reinforcement learning for algorithmic trading: extracting robust features from observations to represent the states (i.e., SRL) and learning a profitable trading policy.
Deng et al. [1] used a fuzzy deep neural network for SRL and proposed a policy-based reinforcement learning method with a recurrent neural network. Li et al. [2] used a stacked denoising autoencoder for SRL and proposed DDQN-extended and A3C-extended methods with a long short-term memory (LSTM) network [28]; the A3C-extended method yields more profit than the DDQN-extended method. Fengqian and Chao [3] decomposed candlesticks (or K-lines) into components such as the lengths of the upper shadow line, lower shadow line, and body. Each component is then clustered, and the cluster centers and the color of the body are used to represent the state. For deep reinforcement learning, [3] used a policy gradient method with ε-greedy exploration. Wu et al. [6] used a gated recurrent unit (GRU) network [29] for SRL and proposed DDQN-extended and DDPG-extended methods, GDQN and GDPG, respectively; GDPG provides more stable returns than GDQN. Lei et al. [4] used a gate structure to select features, GRU to capture long-term dependencies, and a temporal attention mechanism to weight past states based on the current state. They proposed a policy gradient method, known as time-driven feature-aware jointly deep reinforcement learning (TFJ-DRL), which combines SRL and a policy gradient method using an autoencoder. The decoding part of the SRL model is used to predict the next closing price, with the real price as the feedback signal; the encoding part is used as the state representation for reinforcement learning. Liu et al. [5] used GRU for SRL and introduced imitation learning techniques, such as a demonstration buffer and behavior cloning, into the DPG algorithm.
The above-mentioned studies have produced several significant improvements in algorithmic trading. However, it is unclear whether these improvements are complementary and can be combined to obtain positive results. This study provides a comprehensive solution that integrates existing improvements with new ideas, as explained in the next section.

IV. SIRL-TRADER
In this section, we present the proposed algorithmic trading method, SIRL-Trader.

A. ARCHITECTURE
We propose an actor-critic reinforcement learning method that extends TD3 to incorporate offline/online SRL, imitation learning, multistep learning, and dynamic delay. Fig. 1 shows the proposed architecture. Each component is explained in detail in the following subsections.

B. STATE REPRESENTATION LEARNING
SRL models learn state representations to help the agent learn a good policy. In other words, SRL models learn how to map observations to states. We use the candlestick components and technical indicators in Table I as observations. Our SRL method consists of (1) offline unsupervised SRL that occurs before training the reinforcement learning model and (2) online supervised SRL that occurs while the reinforcement learning model is being trained.
The offline SRL extracts a low-dimensional robust representation from high-dimensional observations, as shown in Fig. 2. First, we normalize each input feature using the z-score standardization method. Second, for each feature group in Table I with high dimensionality, we reduce the dimensionality of the feature space to the threshold F in Fig. 2 using principal component analysis (PCA). Third, we cluster each feature using fuzzy c-means clustering (FCM) [31]. After clustering, each feature value except the body color is represented by the center of the cluster to which it belongs.

The online SRL, whose architecture is shown in Fig. 3, takes as input a sliding window of outputs from the offline SRL to capture historical context. To focus on the important features in each window, we weight the features using the gate structure [4]. The gate g shown in Fig. 4 uses a sigmoid activation function σ, as in Equation (17), where f denotes the input feature vector; the parameters W and b are learned through end-to-end training. In subsequent steps, we use the weighted feature vector f′ in Equation (18), where ⊙ denotes the element-wise multiplication of g and f. After weighting the features, we apply an LSTM layer to learn temporal characteristics. The input of the LSTM layer is a sliding window of weighted feature vectors, and its output is the hidden state of the last time step, as shown in Fig. 3. We refer to the network up to the LSTM layer as the online SRL network. Each actor and critic network has a corresponding online SRL network, and we co-train the online SRL networks with the actor and critic networks, respectively.
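The gate of Equations (17)–(18) can be sketched as follows (an illustrative fragment; for simplicity W is treated here as a per-feature weight vector, i.e., a diagonal matrix, whereas the paper's W may be a full matrix):

```python
import math

def gated_features(f, W, b):
    """Feature gate: g = sigmoid(W f + b) (Equation (17)),
    then f' = g ⊙ f, the element-wise product (Equation (18))."""
    g = [1.0 / (1.0 + math.exp(-(w * x + bias)))
         for w, x, bias in zip(W, f, b)]
    return [g_i * x_i for g_i, x_i in zip(g, f)]
```

With zero weights and biases every gate value is 0.5, so each feature is simply halved; training moves the gates toward 1 for informative features and toward 0 for noisy ones.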
In addition to the online SRL networks, we train a regression network (its structure is shown in Fig. 5(a)) that predicts the next closing price to provide accurate state information for the actor network. The MSE between the real and predicted prices is used as the loss function for the regression network. The predicted price does not participate in training the actor network, but the underlying online SRL network of the regression network does, because that online SRL network is shared with the actor network, as explained in the next section.

C. IMITATIVE REINFORCEMENT LEARNING
1) Action
The trading action a_t ∈ {buy, hold, sell} = {1, 0, −1} is taken on each trading day. Because we use only long positions, a buy action must precede a sell action. The agent of SIRL-Trader starts with a certain amount of capital. For the sake of simplicity, the agent buys the maximum number of shares that its cash balance allows when taking a buy action.

2) Reward
We define the reward r_t for action a_t as the change rate of the portfolio value, as in Equation (19). The portfolio value V^p_t is the sum of the stock value V^s_t and the remaining cash balance V^c_t, as in Equation (20).
The stock value V^s_t is the current value of the held stock, computed by multiplying the closing price p^c_t of the stock by the number of shares n^s owned, as in Equation (21). To simulate a real trading environment, we include a transaction cost term ζ.
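Equations (19)–(21) can be sketched as follows (illustrative code; the exact placement of the transaction cost term ζ within Equation (21) is an assumption here, applied as a proportional deduction from the stock value):

```python
def stock_value(price_close, n_shares, zeta=0.0025):
    """V^s_t = p^c_t * n^s, net of the transaction cost zeta
    (a plausible reading of Equation (21))."""
    return price_close * n_shares * (1.0 - zeta)

def portfolio_reward(v_portfolio_prev, v_stock, v_cash):
    """r_t = change rate of the portfolio value V^p_t = V^s_t + V^c_t
    (Equations (19)-(20))."""
    v_portfolio = v_stock + v_cash
    return (v_portfolio - v_portfolio_prev) / v_portfolio_prev
```

For instance, a portfolio that moves from $100 to $110 (stock plus cash) yields a reward of 0.1 for that day.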

3) Algorithm
We incorporate offline/online SRL, behavior cloning, multistep learning, and dynamic delay into TD3. Algorithm 1 presents the proposed reinforcement learning algorithm. As shown in Fig. 6, we use two critic networks Q_θ1 and Q_θ2, two online SRL networks o_η1 and o_η2 for Q_θ1 and Q_θ2, an actor network µ_φ, an online SRL network λ_υ, and a regression network for λ_υ. The input for the online SRL networks is a sliding window w_t of weighted feature vectors {x_{t−Nw+1}, ..., x_{t−1}, x_t} obtained from the offline SRL.
To accelerate the training process, we use multistep learning, which collects transitions ⟨w_t, a_t, r_t, w_{t+1}⟩ using the N-step buffer D (lines 9 to 11). An N-step transition ⟨w_{t−N+1:t+1}, a_{t−N+1:t}, r_{t−N+1:t}⟩ is constructed from the transitions stored in D (line 13) and used to compute the target value Y (line 16).
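The N-step transition collection can be sketched as follows (an illustrative fragment; the class name and interface are ours, not the paper's):

```python
from collections import deque

class NStepBuffer:
    """Accumulates 1-step transitions <w_t, a_t, r_t, w_{t+1}> and, once N
    are available, emits an N-step transition
    <w_{t-N+1:t+1}, a_{t-N+1:t}, r_{t-N+1:t}>."""
    def __init__(self, n):
        self.n = n
        self.buf = deque(maxlen=n)

    def push(self, w, a, r, w_next):
        self.buf.append((w, a, r, w_next))
        if len(self.buf) < self.n:
            return None  # not enough history yet
        windows = [tr[0] for tr in self.buf] + [self.buf[-1][3]]
        actions = [tr[1] for tr in self.buf]
        rewards = [tr[2] for tr in self.buf]
        return windows, actions, rewards
```

Each emitted tuple spans N consecutive steps, so one critic update propagates reward information N steps back at once, which is what accelerates convergence.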
Algorithm 1 (excerpt; step numbers are those referenced in the text):
6. Initialize an N-step buffer D.
7. Compute the delay value for each epoch: d ← (e mod α) + β.
9. Select an action a_t with exploration noise.
10. Observe a reward r_t and the next input w_{t+1}.
11. Store the transition ⟨w_t, a_t, r_t, w_{t+1}⟩ in D.
13. Obtain an N-step transition ⟨w_{t−N+1:t+1}, a_{t−N+1:t}, r_{t−N+1:t}⟩ from D and store it in B.
14. Sample a mini-batch of B transitions of length N from B.
15. Compute the smoothed target action.
16-17. Compute the target value Y and update the critics θ_j by the MSE loss.
18. Update the actor φ by the deterministic policy gradient.
20. Update the regression network by the MSE loss.
21. Update the actor φ by the CE loss for the behavior cloning.
22. Soft-update the target networks.
25-26. end if; end for; end for.

To select an action with exploration noise, we add a different amount of noise to each output of the softmax layer (line 9) and then apply argmax. As in the original TD3 algorithm, we use target policy smoothing as a regularization strategy (line 15). When the actor network µ_φ is updated, the online SRL network λ_υ is also updated by backpropagation.
The critic networks Q_θ1 and Q_θ2, whose structure is shown in Fig. 5(c), are combined with the online SRL networks o_η1 and o_η2, respectively. To solve the overestimation problem, we use the clipped double Q-learning technique of TD3 with multistep learning (lines 16 to 17). When the critic networks are updated, the corresponding online SRL networks are also updated by backpropagation.
To provide accurate state information for the actor network, we update the regression network using the MSE between the real and predicted prices (line 20). When the regression network is updated, the online SRL network λ_υ is also updated by backpropagation. The actor network is indirectly affected by this update through the underlying online SRL network λ_υ, which is shared with the regression network, as shown in Fig. 6.
For imitation learning, we introduce a behavior-cloning technique to guide the training of the actor network. We create a prophetic trading expert who selects an action on day t−N+1 using information about that day's closing price close_{t−N+1} and the next day's closing price close_{t−N+2}. The expert buys when close_{t−N+2} > h × close_{t−N+1} and sells when close_{t−N+2} < h × close_{t−N+1}, where h ≥ 1 is a hyperparameter; otherwise, the expert holds the stock. We train the actor network to minimize the cross-entropy (CE) loss between the softmax output vector over ⟨buy, hold, sell⟩ and the action a_expert of the expert, represented as a one-hot vector (line 21).
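The prophetic expert's labeling rule can be written directly from the text (illustrative code; the function name and the {1, 0, −1} encoding follow the action definition earlier in this section):

```python
def expert_action(close_today, close_tomorrow, h=1.001):
    """Prophetic expert used for behavior cloning: buy (1) if tomorrow's
    close exceeds h * today's close, sell (-1) if it falls below that
    threshold, and hold (0) otherwise, as stated in the paper."""
    threshold = h * close_today
    if close_tomorrow > threshold:
        return 1   # buy
    if close_tomorrow < threshold:
        return -1  # sell
    return 0       # hold
```

These expert actions, one-hot encoded, serve as the targets of the cross-entropy loss in line 21 of Algorithm 1.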
For more stable and efficient training, we propose a dynamic delay technique for updating the actor and target networks. In the original TD3 algorithm, the delay is fixed to a constant value, and finding an optimal value is difficult. The dynamic delay technique allows us to try various delay values while the reinforcement learning model is being trained. For each epoch, we compute the delay value d using Equation (22) (used in line 7). In Equation (22), α is a constant for adjusting the variance in delay values, and β is a constant for setting the minimum delay value.
When short positions are allowed, the stock value V^s_t used in Equation (20) is redefined as in Equation (23). The stock value V^long_t for a long position is computed using Equation (24), which is identical to Equation (21). The stock value V^short_t for a short position is computed by multiplying the difference between p^c_t and p^c_opened by the number of shares n^short, as in Equation (25), where p^c_opened is the closing price of the stock when the short position was opened, and ζ is the transaction cost. For the sake of simplicity, we take either a long or a short position at a time.
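A sketch of the short-position value of Equation (25) follows (illustrative code; the sign convention, under which the position gains when the price falls, and the proportional application of ζ are assumptions here, not taken verbatim from the paper):

```python
def short_value(p_close, p_opened, n_short, zeta=0.0025):
    """V^short_t per Equation (25): the price difference since the short
    position was opened, times the number of shorted shares, net of the
    transaction cost zeta. Gains when p_close < p_opened (assumed sign)."""
    return (p_opened - p_close) * n_short * (1.0 - zeta)
```

For example, shorting 10 shares at $100 and covering at $90 yields $100 before costs, while a price rise produces a symmetric loss.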

V. EXPERIMENTS
In this section, we present experiments designed to answer the following questions.
• Can SIRL-Trader outperform state-of-the-art methods?
• What leads to the gain obtained by SIRL-Trader?
• How do hyperparameters affect the performance of SIRL-Trader?
• Is SIRL-Trader robust to high transaction costs?

A. EXPERIMENTAL SETUP 1) Datasets
We test the algorithmic trading methods using two datasets with different numbers of stocks drawn from the S&P 500 index. We obtain stock data consisting of opening, high, low, and closing prices and trading volume from Yahoo Finance [32]. We use a smaller dataset to compare SIRL-Trader with the other methods in detail. To ensure the diversity of price trends, we select six stocks with different price trends (upward, sideways, and downward), as shown in Fig. 7. We use a larger dataset to verify the generalization ability of the methods. Similar to [4], we select 56 stocks from twelve different sectors, as shown in Table IV. For both datasets, the training period is from Jan. 2014 to Dec. 2018, and the test period is from Jan. 2019 to Dec. 2020. To simulate a real trading environment, we do not use any information from the current trading day for testing.

2) Evaluation Metrics
For each method, we evaluate the rate of return (V_end − V_start)/V_start obtained using a starting capital V_start of $10,000 over the trading test period. We also evaluate the Sharpe ratio [33], which measures the return of an investment relative to its risk. In Equation (26), E[R] is the expected return, and σ[R] is the standard deviation of the return, a measure of fluctuation, that is, of risk. A greater Sharpe ratio indicates a higher risk-adjusted rate of return. We use the change rate of the portfolio value as the return in Equation (26).
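Both metrics can be computed in a few lines (illustrative code; the risk-free rate is omitted from the Sharpe ratio here, matching the definition E[R]/σ[R] given for Equation (26)):

```python
import math

def return_rate(v_start, v_end):
    """Rate of return: (V_end - V_start) / V_start."""
    return (v_end - v_start) / v_start

def sharpe_ratio(returns):
    """Sharpe ratio E[R] / sigma[R] (Equation (26)), with the daily change
    rate of the portfolio value as R and no risk-free rate subtracted."""
    mean = sum(returns) / len(returns)
    var = sum((r - mean) ** 2 for r in returns) / len(returns)
    return mean / math.sqrt(var)
```

For example, a portfolio growing from $10,000 to $11,000 has a return rate of 0.1; two strategies with the same mean return but different volatility get different Sharpe ratios.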

3) Baseline Methods
The state-of-the-art methods compared with SIRL-Trader are listed below. For each method, the network structures and hyperparameters are manually optimized as stated below. We use the same reward function in Equation (19) and the Adam optimizer for all methods.
• Buy and Hold (B&H) buys the stock on the first day of the test period and holds it throughout. This directly reflects the price trend.
• K-line [3] clusters candlestick components using the FCM method. The policy network comprises three ReLU dense layers with 128 units and a softmax output layer. We set the number of clusters to 5, the sliding window size to 10, the learning rate to 0.0002, and ε to 0.9 with a decay of 0.95.
• TFJ-DRL [4] uses a gate structure, GRU, a temporal attention mechanism, and a regression network for SRL. It combines SRL with a policy gradient method. The policy and regression networks comprise two ReLU dense layers with 128 and 64 units, a dropout layer with an elimination fraction of 0.3, and a softmax output layer. We set the sliding window size to 3, the mini-batch size to 32, the learning rate to 0.0001, and ε to 0.7 with a decay of 0.9.
• iRDPG [5] uses GRU and imitative actor-critic reinforcement learning. The actor network comprises a ReLU dense layer with 16 units and a softmax output layer. The critic network comprises a ReLU dense layer with 16 units and a linear output layer. We set the sliding window size to 10, the standard deviation of the exploration noise (the noise size) to 0.9, the mini-batch size to 32, and the learning rate to 0.0001.
• GDPG [6] uses GRU and DDPG. The actor network comprises two GRU layers with 20 and 24 units, a dropout layer with an elimination fraction of 0.3, and a softmax output layer. The critic network comprises a concatenation layer, two ReLU dense layers with 64 and 16 units, and a linear output layer. We set the sliding window size to 5, the noise size to 0.6, the mini-batch size to 64, and the learning rate to 0.001.

4) Implementation Details of SIRL-Trader
In the offline SRL, we set the tumbling window size for the z-score normalization to 20, the dimensionality threshold F in Fig. 2 to eight, and the number of clusters to 20. In the online SRL, we set the size of the input sliding window to five and the number of units of the LSTM layer to 128. In the reinforcement learning, we set the transaction cost ζ in Equation (21) to 0.25%, the multistep length N to two, the expert threshold h to 1.001, the dynamic-delay constants α and β to four and two, the exploration noise size σ to 0.7, the regularization noise size for target policy smoothing to 0.7, the clipping size c to 1, the mini-batch size to 64, and the learning rate to 0.0001.
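The tumbling-window z-score normalization used in the offline SRL can be sketched as follows (illustrative code; the function name is ours, and the handling of a zero-variance window as all zeros is an assumption):

```python
import math

def tumbling_zscore(series, window=20):
    """Z-score standardization over non-overlapping (tumbling) windows of
    the given size, as in the offline SRL; each window is standardized
    with its own mean and standard deviation."""
    out = []
    for i in range(0, len(series), window):
        chunk = series[i:i + window]
        mean = sum(chunk) / len(chunk)
        sd = math.sqrt(sum((x - mean) ** 2 for x in chunk) / len(chunk))
        out.extend([(x - mean) / sd if sd > 0 else 0.0 for x in chunk])
    return out
```

Because each window is standardized independently, slow drifts in the price level are removed locally, which helps with the nonstationarity of financial series.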

B. EXPERIMENTAL RESULTS
1) Comparison with Other Methods
We compare SIRL-Trader with the other methods in detail using the smaller dataset with various price trends. As shown in Table III, for stocks trending upward, such as AAPL and AMD, all methods make good profits, but SIRL-Trader achieves the best return rate and Sharpe ratio. For stocks trending sideways, such as DUK and K, SIRL-Trader performs best on K; GDPG performs best on DUK, but the difference between the return rates of GDPG and SIRL-Trader is small. For stocks trending downward, such as CCL and OXY, most methods suffer losses, but SIRL-Trader makes a good profit.
Particularly for CCL, all other methods suffer significant losses, whereas SIRL-Trader makes a large profit. Fig. 8, which covers part of the test period for CCL, shows that SIRL-Trader fully exploits the price fluctuations in the dotted area compared with the other methods. We can also observe that TFJ-DRL trades too frequently and iRDPG too rarely. In summary, SIRL-Trader yields significant profits on stocks with different trends by integrating all the techniques in Table II.

We verify the generalization ability of all methods using the larger dataset. As shown in Table IV and Fig. 9, SIRL-Trader outperforms all other methods in terms of the minimum, maximum, and average return rate and Sharpe ratio. SIRL-Trader achieves an average return rate of 57.8%, which is 14.1 percentage points higher than that of the second-best method, iRDPG. The average Sharpe ratio of SIRL-Trader is 1.06, which is 0.25 higher than that of iRDPG. These results indicate that the reinforcement learning algorithm of SIRL-Trader, integrated with offline/online SRL, generalizes well across stocks.

2) Ablation Study
We conduct an ablation study to demonstrate the contribution of each component of SIRL-Trader. We exclude the components one by one and report the results in Fig. 10. The excluded components are dimensionality reduction (Dim), clustering (Clu), multistep learning (Mul), the regression model (Reg), and imitation learning (Imi); 'All' denotes SIRL-Trader with all components. We evaluate the minimum, maximum, and average performance on the smaller dataset. The results show that every component contributes to the performance; in particular, the offline SRL (Dim and Clu) is crucial. To demonstrate the effectiveness of the dynamic delay, we compare it with static delays ranging from two to five. As shown in Fig. 11, the dynamic delay significantly improves the performance compared with the static delays.

3) Hyperparameter Study
From Figs. 12(a) and 12(b), we can see that a dimensionality threshold (or a number of clusters) that is too small or too large degrades the performance. This is because of the trade-off between information loss (if too small) and noise inclusion (if too large). Fig. 12(c) shows that the performance decreases as the sliding window size increases; price information older than one week does not aid decision-making and only adds noise. Fig. 12(d) shows that a noise size that is too small or too large also degrades the performance because of the exploration-exploitation trade-off: greater noise means more exploration and less exploitation, and smaller noise the opposite.

4) Robustness Study
In a real trading environment, transaction costs such as transaction fees, taxes, and slippage (the difference between the expected price of a trade and the price at which it is actually executed) exist. We study the robustness of SIRL-Trader by varying the transaction cost ζ in Equation (21). We evaluate the average return rate on the smaller dataset. Fig. 13 shows that the average return decreases for all methods as the transaction cost increases. However, SIRL-Trader achieves the best results regardless of the transaction cost. Even at a transaction cost of 0.35%, which is much higher than that of real trading environments, SIRL-Trader still makes a good profit.

VI. CONCLUSIONS AND FUTURE WORK
In this paper, we proposed a practical algorithmic trading method, SIRL-Trader, which achieves good profit using only long positions. We used offline/online SRL and imitative reinforcement learning to learn a profitable trading policy from nonstationary and noisy stock data. In the offline SRL, we used dimensionality reduction and clustering to extract robust features. In the online SRL, we co-trained a regression model with a reinforcement learning model to provide accurate state information for decision-making. In the imitative reinforcement learning, we incorporated a behavior cloning technique with the TD3 algorithm and applied multistep learning and dynamic delay to TD3. The experimental results showed that SIRL-Trader yields significantly higher profits and has superior generalization ability compared with state-of-the-art methods. We expect our approach to be generalizable to other complex sequential decision-making problems. Finally, we plan to apply our work to future markets where both long and short positions are always possible.