Prison Term Prediction on Criminal Case Description with Deep Learning

The task of prison term prediction is to predict the term of penalty based on textual fact description for a certain type of criminal case. Recent advances in deep learning frameworks inspire us to propose a two-step method to address this problem. To obtain a better understanding and more specific representation of the legal texts, we summarize a judgment model according to relevant law articles and then apply it in the extraction of case feature from judgment documents. By formalizing prison term prediction as a regression problem, we adopt the linear regression model and the neural network model to train the prison term predictor. In experiments, we construct a real-world dataset of theft case judgment documents. Experimental results demonstrate that our method can effectively extract judgment-specific case features from textual fact descriptions. The best performance of the proposed predictor is obtained with a mean absolute error of 3.2087 months, and the accuracy of 72.54% and 90.01% at the error upper bounds of three and six months, respectively.


Introduction
For the past few years, the amount of data in the judicial field has grown rapidly. The data involves various legal cases, supplementary extensions of the law and judicial interpretations. Legal professionals, such as judges, lawyers and prosecutors, not only have to handle numerous cases, but also need to consult a large number of files for reference or analyze the data related to each case. This leads to a growing burden on legal professionals, which may result in lower efficiency and an increased risk of making mistakes in judicial work. To help safeguard judicial fairness and public security, a legal assistant system based on information technology (e.g., artificial intelligence and data mining) should be employed to facilitate the judgment of legal cases. The task of prison term prediction (PTP) differs from the charge prediction task in that, instead of aiming to determine appropriate charges (e.g., the crime of theft, fraud, robbery and intentional injury) for a given case, its objective is to predict the term of penalty (e.g., fixed-term imprisonment counted by year/month, life imprisonment or death penalty) for a certain type of criminal case by analyzing the textual fact description. In mainland China, which is a civil law jurisdiction, courts deal with legal cases based on statutory laws and the fact description, rather than with reference to decisions in precedent cases. The judge makes a final decision by combining the analysis of the specific situation of the current case with the understanding and interpretation of relevant law articles. Although we can expect a traditional classification model that learns from previous similar cases to play a role in the legal assistant system, it is always more convincing to make the prediction with a legal basis. However, it is not trivial to train a machine judge to predict an appropriate prison term based on law articles and fact descriptions. 
There are two crucial issues to be addressed: 1) how to effectively extract features that represent a case well from textual fact descriptions, and 2) how to implement a refined model which outputs an integer as the predicted prison term. The majority of existing works attempt to resolve the judgment prediction task by formalizing it as a text classification problem. These efforts either employ off-the-shelf classification models [Hachey and Grover (2006); Goncalves and Quaresma (2005); Palau and Moens (2018)] with shallow features extracted from text [Liu, Chang and Ho (2004); Liu and Hsieh (2006)] or case profiles [Katz, Bommarito II and Blackman (2017)], or attain deeper semantic understanding of case descriptions by manually annotating cases and designing specific features [Lin, Kuo and Chang (2012)]. Although the introduction of machine learning and natural language processing (NLP) methods can advance the analysis of legal texts [Xiong, Shen, Wang et al. (2018)], it remains unsolved how to learn better semantic representations from case fact descriptions with fewer human annotations and how to make a refined prediction of the prison term for a case with a certain charge. In this paper, we aim to address the PTP problem by incorporating appropriate mechanisms to integrate the textual fact descriptions of criminal cases with legal basis. To obtain a better understanding and more specific representation of the legal texts, we first summarize the corresponding judgment model through comprehensive analysis of relevant law articles and the structure of the judgment document. Then we employ state-of-the-art neural network models to build multiple sentence-level binary classifiers, each of which focuses on a specific feature based on the judgment model. After merging sentence-level features into a case-level feature, we adopt the linear regression model and the neural network model to solve the PTP problem. 
For experiments, we collect and construct a real-world dataset containing more than 40,000 judgment documents of theft cases published by the Supreme People's Court of the People's Republic of China. Experimental results demonstrate that our method can effectively extract judgment-specific case features from textual fact descriptions. The proposed predictor obtains the best performance of 3.2087 months in mean absolute error, and 72.54% and 90.01% in accuracy when the error upper bound is set to three and six months, respectively. The contributions of this paper are summarized as follows: 1) A two-step deep learning method is proposed to address the PTP problem by integrating the textual fact descriptions of criminal cases with legal basis; 2) To obtain a better understanding and more specific representation of the legal texts, the judgment model is summarized and then applied in the extraction of case features from judgment documents; 3) We build a real-world dataset of theft case judgment documents. Experimental results on this dataset demonstrate the effectiveness of our method. The rest of this paper is organized as follows. Section 2 briefly reviews the related work. The judgment model of theft cases is described in Section 3. In Section 4, we propose the methods of case feature extraction and prison term prediction. Experimental results are presented in Section 5. Finally, Section 6 contains the concluding remarks.

Related work
The research of judgment prediction has attracted increasing attention in recent years. Relevant issues in the field of artificial intelligence and law have been studied as well. In earlier studies on judgment prediction, most researchers tended to formalize it as a text classification problem. Hachey et al. [Hachey and Grover (2006)] proposed a method of classifying legal sentences for automatic court rulings. Goncalves et al. [Goncalves and Quaresma (2005)] classified legal text into 3,000 categories based on a taxonomy of legal concepts, and reported an F1 score of 79%. Liu et al. [Liu, Chang and Ho (2004)] presented a case-based reasoning system and used a KNN model to classify 12 common criminal charges in Taiwan. Katz et al. [Katz, Bommarito II and Blackman (2017)] built randomized trees with features extracted from case profiles and reported an accuracy of 70.9% in predicting the US Supreme Court's behavior. Lin et al. [Lin, Kuo and Chang (2012)] exploited machine learning methods to identify robbery and intimidation cases and predict their sentencing by considering 21 manually defined legal factor labels. More recently, Aletras et al. [Aletras, Tsarapatsanis, Preoţiuc-Pietro et al. (2016)] aimed to predict decisions of the European Court of Human Rights (ECHR), and reported an accuracy of 78%. Sulea et al. [Sulea, Zampieri, Malmasi et al. (2017)] applied a linear SVM classifier to predict the law area and case judgments of the French Supreme Court, and reported F1 scores of 96%, 90% and 75.9% in case ruling prediction, law area prediction, and estimating the time span in which a ruling was issued, respectively. Although these efforts took full advantage of supervised learning methods, they can hardly be applied to other scenarios because they rely heavily on manual annotation. Besides judgment prediction, some researchers investigated methods of identifying applicable law articles for a given legal case. 
Liu et al. [Liu and Liao (2005)] proposed an intuitive solution of converting the multi-label problem into a multi-class classification problem, and obtained satisfactory results in the classification of larceny and gambling crimes. To solve the scalability issue of Liu et al. [Liu and Liao (2005)], Liu et al. [Liu, Chen and Ho (2015)] reported a two-step strategy consisting of preliminary article classification by SVM and re-ranking of the results using word-level features and the co-occurrence tendency among articles. Luo et al. [Luo, Feng, Xu et al. (2017)] proposed an attention-based neural network to jointly implement charge prediction and relevant article extraction, which has reasonable generalization ability on multiple fact descriptions. Other works focus on different text analysis problems. Boella et al. [Boella, Caro and Humphreys (2011)] used TF-IDF and information gain for feature selection, and then built an SVM classifier to identify the relevant domain to which a given legal text belongs. Farzindar et al. [Farzindar and Lapalme (2004)] and Galgani et al. [Galgani, Compton and Hoffmann (2012)] studied approaches to automatic text summarization of legal documents, which can improve the work efficiency of legal professionals. De Araujo et al. [De Araujo, Rigo and Barbosa (2017)] studied the problem of domain ontology-based information extraction from natural language texts, and reported an average accuracy of 96%. Based on relevant law articles, sentiment analysis of crime facts and the prison term, Liu et al. [Liu and Chen (2018)] used the SVM algorithm to classify judgment texts automatically. In summary, previous studies have considerably facilitated the advance in the field of artificial intelligence and law. Nevertheless, it remains a challenge to learn abundant semantic representations from case fact descriptions with fewer human annotations and make a refined prediction of the prison term for a certain type of case. 
Our work in this paper aims to fill this gap.

Case modeling
The extensive application of NLP methods (such as word segmentation, named entity recognition, part-of-speech tagging, etc.) has remarkably advanced the processing and analysis of general textual data including news, online reviews and various social network data, and these techniques can still play a huge role in the context of legal data. However, to achieve better understanding and more effective mining of the case fact description in judgment documents, expert knowledge of relevant law articles is indispensable. As one of the most common types of crime in judicial practice, theft cases account for over 20% of all criminal judgment documents that the Supreme People's Court of the People's Republic of China has made publicly available. Taking the theft case as the research object in this paper, we first need to build its judgment model according to relevant law articles. According to Article 264 of the Criminal Law of the People's Republic of China, which lays out the basic principles and framework for judging a theft case, there are four constitutive elements of the crime of theft, as underlined in Appendix A: 1) Subject element: the nature of the criminal suspect that determines criminal liability, such as age, health condition or mental status, etc.; 2) Subjective element: the subjective intention of committing the crime, and the foresight of its consequences; 3) Object element: the nature of the articles involved in the crime, such as economic value, appropriability, mobility, etc.; 4) Objective element: the concealment of the criminal act (to differentiate theft from other crimes of property violation such as the crime of forcible seizure of money or property). A judgment document consists of four main parts: the basic information about the defendant(s), the fact description, the court's view including relevant law articles, and the judgment decision including the charge and prison term. 
Through the comprehensive analysis of Article 264 and the structure of the judgment document, we can describe a theft case with 11 dimensions: the value of stolen items, whether the defendant is juvenile, whether the defendant is disabled, whether the crime can be deemed burglary (breaking into a home), whether the defendant carried lethal weapons, whether the defendant is a pickpocket, whether the crime involves other serious circumstances (including but not limited to collision, arson, resistance to arrest, etc.), whether the defendant is a recidivist, whether the defendant returned stolen items or compensated the victim, whether the defendant voluntarily surrendered, and the prison term. Specifically, the value of stolen items is the primary consideration from the perspective of judgment; juvenile or disabled defendants who are convicted of theft may have their penalty commuted compared with ordinary defendants; burglary, carrying lethal weapons, pickpocketing and other serious circumstances shall result in a heavier punishment; a defendant who is a recidivist shall be punished severely as well; while surrender or compensation, which can be regarded as remedial measures, shall contribute to a mitigated punishment. Formally, the judgment model of theft cases can be expressed as:

$$C = (a, j, d, b, w, p, o, r, c, s, t) \tag{1}$$

where the description of each dimension is shown in Tab. 1. By integrating the structure of judgment documents with legal basis, the judgment model facilitates an in-depth description of case details. In the next section, we describe the neural network method of feature extraction used to obtain a more specific representation of the case facts, and then solve the PTP problem.
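As an illustration of how the tuple in Eq. (1) might be held in code, the following minimal sketch uses a Python dataclass. The mapping of each letter to a dimension is an assumption inferred from the order in which the dimensions are listed above (Tab. 1 is the authoritative definition), and the class name `TheftCase`, the method `binary_features` and the sample values are hypothetical.

```python
from dataclasses import dataclass, astuple
from typing import List


@dataclass(frozen=True)
class TheftCase:
    """One theft case C = (a, j, d, b, w, p, o, r, c, s, t).

    The letter-to-dimension mapping below is assumed from the order of the
    dimensions in the text; it is not taken from Tab. 1.
    """
    a: float  # value of stolen items
    j: bool   # defendant is juvenile
    d: bool   # defendant is disabled
    b: bool   # burglary
    w: bool   # carried lethal weapons
    p: bool   # pickpocket
    o: bool   # other serious circumstances
    r: bool   # recidivist
    c: bool   # returned stolen items or compensated the victim
    s: bool   # voluntarily surrendered
    t: int    # prison term in months (the prediction target)

    def binary_features(self) -> List[int]:
        """Return the nine yes/no dimensions (j..s) as 0/1 values."""
        return [int(v) for v in astuple(self)[1:-1]]


# Hypothetical example case, not taken from the dataset.
case = TheftCase(a=3200.0, j=False, d=False, b=True, w=False,
                 p=False, o=False, r=True, c=True, s=False, t=8)
print(case.binary_features())
```

Keeping the value of stolen items and the prison term separate from the nine binary dimensions mirrors the way the two groups are extracted by different means later in the paper.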

Method
In this section, we propose a two-step method to solve the PTP problem, as shown in Fig. 1. After data preprocessing, the input fact description is transformed sentence by sentence into a distributed representation and fed into the sentence-level sequence encoder. The case-level feature, constructed from the sentence-level features of each dimension, is then used to train the prison term predictor.

Preliminary work

Text preprocessing
According to the judgment model described in Section 3, each dimension needs to be extracted from the judgment document. As all the judgment documents in our dataset are written in Chinese, word segmentation is carried out first. After word segmentation, we remove all inessential parts from the documents, keeping only the basic information description of the defendant(s), the fact description and the judgment decision. The value of stolen items and the prison term are extracted by regular expressions from the fact description part and the judgment decision part, respectively. To avoid possible interference with the process of feature extraction, some insignificant words (e.g., names of people, places, organizations) are filtered out by employing the part-of-speech tagging and named entity recognition supported by the Language Technology Platform (LTP) [Che, Li and Liu (2010)].

Text distributed representation
After text preprocessing, the fact description part is transformed into a word sequence. To make these Chinese words computable, each word must be mapped into a vector space through the distributed representation process [Mikolov, Sutskever, Chen et al. (2013)]. In this paper, we use Word2Vec with the CBOW (Continuous Bag-of-Words) model, optimized by the negative sampling technique, to obtain the distributed representation of the text and map all words in the text into the same vector space.

Feature extraction
In this subsection, we aim to extract the nine-dimensional feature (all dimensions except the value of stolen items and the prison term) from a judgment document of a theft case. As each sentence in the input data has been represented as a sequence of word vectors, we can first build a sentence-level sequence encoder to embed each sentence and then merge the embeddings into the case-level feature. An RNN (Recurrent Neural Network) is a class of artificial neural network in which connections between nodes form a directed graph along a sequence, which allows it to exhibit temporal dynamic behavior for a time sequence. Typical RNNs include the traditional RNN, LSTM (Long Short-Term Memory), GRU (Gated Recurrent Unit) and their variants. The ability of an RNN to process variable-length sequences lies in its unique neuronal structure. Taking the traditional RNN as an example, when processing sequence information, each item of the sequence is fed into the network in turn, and the network generates an output at each moment. That output is then processed jointly with the input at the next moment to generate the next output, which enables the output at each moment to carry information from all previous inputs. The above process can be depicted as

$$h_t = f(w x_t + b h_{t-1})$$

where $h_t$ is the output at time $t$, $x_t$ is the input at time $t$, $h_{t-1}$ is the output at time $t-1$, $w$ and $b$ are the parameters corresponding to $x$ and $h$, respectively, and $f$ is the activation function.
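The recurrence described above can be illustrated with a scalar toy example. This is a minimal sketch, assuming tanh as the activation function (the text does not name one) and scalar parameters `w` and `b`; the function name `rnn_forward` is hypothetical.

```python
import math


def rnn_forward(xs, w, b, h0=0.0):
    """Run a scalar traditional RNN over a sequence.

    Each output is h_t = tanh(w * x_t + b * h_{t-1}), so every h_t depends
    on all inputs seen so far. tanh is an assumed activation.
    """
    outputs = []
    h = h0
    for x in xs:
        h = math.tanh(w * x + b * h)  # combine current input with previous output
        outputs.append(h)
    return outputs


# The same function handles sequences of any length.
hs = rnn_forward([1.0, 0.0], w=1.0, b=1.0)
print(hs)
```

In the second step the input is zero, yet the output is non-zero because it still carries information from the first input through the previous output.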

LSTM sequence encoder
An RNN composed of LSTM units is often called an LSTM network. LSTM was developed to deal with the exploding and vanishing gradient problems that can be encountered when training traditional RNNs. The structure of a common LSTM unit [Hochreiter and Schmidhuber (1997)] is shown in Fig. 2. It consists of a memory cell and three gates: an input gate, an output gate and a forget gate. The memory cell remembers values over arbitrary time intervals and the three gates regulate the flow of information into and out of the cell. Specifically, the input gate controls the extent to which a new value flows into the cell, the forget gate controls the extent to which a value remains in the cell, and the output gate controls the extent to which the value in the cell is used to compute the output activation of the LSTM unit. At time step $t$, the forward pass of a common LSTM unit is executed as follows:

$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$$
$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$$
$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$$
$$\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)$$
$$c_t = f_t \circ c_{t-1} + i_t \circ \tilde{c}_t$$
$$h_t = o_t \circ \tanh(c_t)$$

where the matrices $W_q$ and $U_q$ contain the weights of the input and recurrent connections, respectively, $b_q$ is the corresponding bias, and $q$ can be the input gate $i$, the output gate $o$, the forget gate $f$ or the memory cell $c$, depending on the activation being calculated. $\sigma$ denotes the sigmoid function. The operator $\circ$ denotes the Hadamard product, that is, the element-wise product of two matrices of the same shape.
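To make the roles of the three gates concrete, the following sketch performs one LSTM step with a one-dimensional hidden state, so that the Hadamard product reduces to ordinary multiplication. It follows the commonly used LSTM formulation with sigmoid gates and a tanh candidate; the dictionary layout of the parameters and the function names are assumptions made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, W, U, b):
    """One scalar LSTM step; W, U, b are dicts keyed by 'i', 'f', 'o', 'c'."""
    i = sigmoid(W["i"] * x + U["i"] * h_prev + b["i"])  # input gate
    f = sigmoid(W["f"] * x + U["f"] * h_prev + b["f"])  # forget gate
    o = sigmoid(W["o"] * x + U["o"] * h_prev + b["o"])  # output gate
    c_tilde = math.tanh(W["c"] * x + U["c"] * h_prev + b["c"])  # candidate value
    c = f * c_prev + i * c_tilde  # keep part of the old cell, admit part of the new value
    h = o * math.tanh(c)          # expose part of the cell as the output
    return h, c


# With all parameters zero, every gate equals 0.5 and the candidate is 0,
# so the cell simply halves its previous value.
zeros = {k: 0.0 for k in "ifoc"}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=2.0, W=zeros, U=zeros, b=zeros)
print(h, c)
```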

GRU sequence encoder
GRU is a variant of LSTM whose unit structure [Cho, Van Merriënboer, Gülçehre et al. (2014)] is similar to that of LSTM but simpler, as shown in Fig. 3. Compared to LSTM, GRU removes the memory cell and the output gate, and replaces the input gate and the forget gate with a reset gate and an update gate. At time step $t$, a GRU unit is updated as follows:

$$r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)$$
$$z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)$$
$$n_t = \tanh(W_n x_t + r_t \circ (U_n h_{t-1}) + b_n)$$
$$h_t = (1 - z_t) \circ n_t + z_t \circ h_{t-1}$$

where $r_t$ is the result of the reset gate, $z_t$ is the result of the update gate, and $n_t$ is the intermediate result used when calculating the output vector $h_t$. Bi-LSTM (Bi-directional LSTM) and Bi-GRU (Bi-directional GRU) are based on LSTM and GRU, respectively. They predict or label each element of the sequence based on the element's past and future contexts, by concatenating the outputs of two LSTMs or GRUs, one processing the sequence forward and the other backward. Besides LSTM, GRU, Bi-LSTM and Bi-GRU, a CNN (Convolutional Neural Network) can also be adopted to build the sequence encoder as a reference.
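A scalar sketch of the GRU update and of the bi-directional arrangement is given below. It assumes one common GRU convention, in which the update gate weights the previous output and its complement weights the intermediate result; other references swap the two. The parameter layout and function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gru_step(x, h_prev, W, U, b):
    """One scalar GRU step; W, U, b are dicts keyed by 'r', 'z', 'n'."""
    r = sigmoid(W["r"] * x + U["r"] * h_prev + b["r"])  # reset gate
    z = sigmoid(W["z"] * x + U["z"] * h_prev + b["z"])  # update gate
    n = math.tanh(W["n"] * x + r * (U["n"] * h_prev) + b["n"])  # intermediate result
    return (1.0 - z) * n + z * h_prev


def gru_run(xs, W, U, b):
    """Run the GRU over a sequence and return the output at every step."""
    h, outputs = 0.0, []
    for x in xs:
        h = gru_step(x, h, W, U, b)
        outputs.append(h)
    return outputs


def bi_gru_run(xs, fwd, bwd):
    """Pair each position's forward output with its backward output.

    fwd and bwd are (W, U, b) tuples for two independent GRUs.
    """
    forward = gru_run(xs, *fwd)
    backward = gru_run(xs[::-1], *bwd)[::-1]  # re-align to the original order
    return list(zip(forward, backward))


zeros = {k: 0.0 for k in "rzn"}
half = gru_step(x=1.0, h_prev=0.8, W=zeros, U=zeros, b=zeros)
pairs = bi_gru_run([1.0, 2.0, 3.0], (zeros, zeros, zeros), (zeros, zeros, zeros))
print(half, pairs)
```

Each element of `pairs` plays the role of the concatenated forward and backward outputs for one position in the sequence.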

Case-level feature extraction
For a judgment document, the embedding layer in our model first transforms it into a sequence of word vectors. The sentence-level feature is then generated via the configurable sequence encoder, and the dropout layer randomly discards some neurons in the network to prevent over-fitting. By averaging all sentence-level features, the final case-level feature vector $F_c$ is calculated as follows:

$$F_c = \frac{1}{N} \sum_{i=1}^{N} F_{s_i}$$

where $F_{s_i}$ is the sentence-level feature vector for sentence $i$, and $N$ is the total number of sentences.
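The averaging step can be sketched in a few lines. The function name `case_level_feature` and the toy sentence features are hypothetical; each inner list stands for the sentence-level feature vector of one sentence.

```python
def case_level_feature(sentence_features):
    """Average N sentence-level feature vectors into one case-level vector."""
    if not sentence_features:
        raise ValueError("a case needs at least one sentence")
    n = len(sentence_features)
    dim = len(sentence_features[0])
    return [sum(f[k] for f in sentence_features) / n for k in range(dim)]


# Four sentences, two feature dimensions (toy values).
sentences = [[1.0, 0.0], [0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
print(case_level_feature(sentences))
```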

Prison term prediction
After the process of feature extraction, we are ready to train the prison term predictor. Taking month as unit, the value of prison term is a non-negative integer, so the PTP task can be formalized as a regression problem. Here, we adopt the linear regression (LR) model and the neural network (NN) model to train the prison term predictor.

LR predictor
LR is a linear approach to modelling the relationship between a dependent variable and one or more independent variables. If there is only one independent variable, the process is called simple linear regression, while for more than one independent variable, it is called multiple linear regression.
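For the PTP problem, there are 9 independent variables, so it is a multiple linear regression problem. The model takes the form as follows:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \varepsilon$$

where $\beta_0$ is the intercept, $\beta_1, \ldots, \beta_p$ are the regression coefficients and $\varepsilon$ is the error term. To obtain the linear relationship between the dependent variable $y$ and the $p$-vector of regressors $x$, least-squares estimation is used to fit the linear regression model.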
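Least-squares fitting of a multiple linear regression can be sketched with the normal equations. The sketch below is a pure-Python illustration rather than the implementation used in the experiments; the function names and the toy data (two regressors instead of the paper's feature set) are hypothetical.

```python
def solve(A, y):
    """Solve the square system A x = y by Gaussian elimination with pivoting."""
    n = len(A)
    M = [row[:] + [y[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def fit_linear_regression(X, y):
    """Return [beta_0, beta_1, ..., beta_p] minimizing the squared error."""
    Xb = [[1.0] + list(row) for row in X]  # prepend a column of ones for the intercept
    p = len(Xb[0])
    XtX = [[sum(r[i] * r[j] for r in Xb) for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * t for r, t in zip(Xb, y)) for i in range(p)]
    return solve(XtX, Xty)


def predict(beta, x):
    return beta[0] + sum(b * v for b, v in zip(beta[1:], x))


# Toy data generated from y = 6 + 4*x1 - 2*x2 (months).
X = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]]
y = [6, 10, 4, 8, 12]
beta = fit_linear_regression(X, y)
print(beta, predict(beta, [1, 1]))
```

Because the toy targets are exactly linear in the inputs, the fitted coefficients recover the generating values up to rounding error.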

NN predictor
The NN is suitable for dealing with nonlinear problems. As there can be multiple dependent variables and independent variables, it is often used for multi-label classification. It is also feasible to employ NN to solve the regression problem by removing the activation function, setting one node in the output layer, and changing the loss function to the mean square error.
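The adaptation described above (a single output node without activation, trained with the mean square error) can be sketched with a tiny one-hidden-layer network. This is an illustrative toy, not the network used in the experiments: the hidden size, tanh hidden activation, learning rate, epoch count and the toy data are all assumptions, and the targets are expressed in years purely to keep this small training loop stable.

```python
import math
import random


def init_params(n_in, n_hidden, seed=0):
    rng = random.Random(seed)
    W1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    b1 = [0.0] * n_hidden
    W2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    return [W1, b1, W2, 0.0]


def forward(params, x):
    W1, b1, W2, b2 = params
    hidden = [math.tanh(sum(w * v for w, v in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    y_hat = sum(w * h for w, h in zip(W2, hidden)) + b2  # one linear output node
    return hidden, y_hat


def mse(params, X, y):
    return sum((forward(params, x)[1] - t) ** 2 for x, t in zip(X, y)) / len(X)


def train(params, X, y, lr=0.05, epochs=500):
    """Stochastic gradient descent on the squared error."""
    W1, b1, W2, b2 = params
    for _ in range(epochs):
        for x, t in zip(X, y):
            hidden, y_hat = forward([W1, b1, W2, b2], x)
            g = 2.0 * (y_hat - t)  # derivative of (y_hat - t)^2 w.r.t. y_hat
            for k, h in enumerate(hidden):
                gh = g * W2[k] * (1.0 - h * h)  # back-propagate through tanh
                W2[k] -= lr * g * h
                b1[k] -= lr * gh
                for j, v in enumerate(x):
                    W1[k][j] -= lr * gh * v
            b2 -= lr * g
    return [W1, b1, W2, b2]


X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
y = [0.5, 1.0, 1.5, 2.5]  # toy targets in years
params = init_params(n_in=2, n_hidden=4)
loss_before = mse(params, X, y)
params = train(params, X, y)
loss_after = mse(params, X, y)
print(loss_before, loss_after)
```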

Dataset
We collect and construct a real-world dataset containing 41,481 judgment documents of theft cases published by China Judgments Online. The dataset covers 527 grass-roots courts in eight provinces and cities, as shown in Fig. 4.

Feature extraction
To evaluate the performance of our feature extraction method, we first manually annotate 7,079 judgment documents as the training data. To test the effect of word vector dimension on the performance of feature extraction, the word vectors are trained in dimensions of 100, 150, 200, 250 and 300, respectively, after the text preprocessing. Then we perform a 10-fold cross-validation on the training set with LSTM, GRU, Bi-LSTM, Bi-GRU and CNN, respectively. Tab. 2 shows the results of feature extraction with different neural network sequence encoders and word vector dimensions. Fig. 5 provides a more intuitive view of the performance difference among the five neural network sequence encoders, from which we can observe that GRU slightly outperforms the other models. From the perspective of word vector dimension, CNN and LSTM obtain their highest accuracy of 98.27% and 99.23% with 150-dimension word vectors, GRU and Bi-LSTM obtain their highest accuracy of 99.45% and 99.12% with 300-dimension word vectors, and Bi-GRU obtains its highest accuracy of 99.23% with 250-dimension word vectors. The highest accuracy of feature extraction is achieved using the GRU sequence encoder with 300-dimension word vectors, so we use this setting in the following evaluation.
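The 10-fold cross-validation protocol can be sketched as an index split over the 7,079 annotated documents. The function name `k_fold_indices` is hypothetical, and the sketch splits the indices in their given order; whether the documents were shuffled first is not stated in the paper.

```python
def k_fold_indices(n, k=10):
    """Yield (train_indices, validation_indices) for each of k folds."""
    # Spread the remainder over the first n % k folds so sizes differ by at most one.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        training = list(range(0, start)) + list(range(start + size, n))
        yield training, validation
        start += size


folds = list(k_fold_indices(7079, k=10))
print([len(val) for _, val in folds])
```

Each document serves as validation data exactly once, and the accuracy reported for an encoder would be the average over the ten folds.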

Prison term prediction
In this subsection, we aim to evaluate our method of prison term prediction from two perspectives: the prediction model, and the dataset.

Performance of different prediction models
Our method is characterized by the incorporation of a neural network predictor and feature extraction based on the judgment model. To evaluate the effectiveness of the judgment model, we need to build contrast predictors that simply use word vectors as the text feature without legal basis. Here, we adopt LSTM, Bi-LSTM, GRU and Bi-GRU models, respectively, to train the word vectors and make the prediction. They are trained over the same dataset as the proposed GRU+LR and GRU+NN predictors, of which 60% is used for training and 20% each for validation and testing. We employ three evaluation metrics: 1) MAE (lower is better): the mean absolute error between the predicted and observed prison terms in months, 2) Acc_e3 (higher is better): the percentage of predicted results with errors of not more than three months (i.e., the error upper bound is three months), and 3) Acc_e6 (higher is better): the percentage of predicted results with errors of not more than six months (i.e., the error upper bound is six months).
Results are shown in Tab. 3, from which we can infer that both the GRU+LR and GRU+NN predictors consistently and significantly outperform all the contrast models, and GRU+NN obtains the best performance of 3.2087 months in MAE, 72.54% in Acc_e3, and 90.01% in Acc_e6. The experimental results demonstrate the effectiveness of our method, which empowers the feature extraction with the judgment model and legal basis.
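The three metrics defined above can be computed as in the following minimal sketch. The function name `evaluate` and the toy predictions are hypothetical; accuracies are returned as fractions rather than percentages.

```python
def evaluate(predicted, observed):
    """Compute MAE, Acc_e3 and Acc_e6 for prison terms given in months."""
    errors = [abs(p - t) for p, t in zip(predicted, observed)]
    n = len(errors)
    return {
        "MAE": sum(errors) / n,
        "Acc_e3": sum(e <= 3 for e in errors) / n,  # error of at most three months
        "Acc_e6": sum(e <= 6 for e in errors) / n,  # error of at most six months
    }


# Toy example: absolute errors are 0, 4, 7 and 0 months.
scores = evaluate(predicted=[12, 10, 20, 30], observed=[12, 6, 27, 30])
print(scores)
```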

Performance over different datasets
It is intuitive that the judicial decision may be affected by factors that vary among courts or regions. In this group of experiments, we further explore the effect of the dataset on the performance of our method. We divide the universal set containing 41,481 judgment documents into 8 subsets by the regions shown in Fig. 4, and then retrain and evaluate the GRU+LR predictor and the GRU+NN predictor over the 8 subsets, respectively. Tab. 4 shows the comparison of PTP results among the universal set and the 8 subsets. We have the following observations: 1) our model obtains relatively better accuracy over the Guangzhou and Shenzhen subsets than over the other subsets and even the universal set; 2) the performance drops considerably over the Shandong and Jilin subsets, which account for the two smallest proportions of the universal set. This demonstrates that a large-scale dataset would in general facilitate the understanding of legal texts and benefit the training of the prediction model. However, it should be noted that the format of judgment documents may vary among regions, which can lead to inaccurate feature extraction; this is why the growth of dataset scale does not always boost the performance of our method.

Conclusion
In this paper, we investigated an approach to prison term prediction on criminal case description. To obtain a better understanding and more specific representation of the legal texts, we summarized the judgment model of theft cases according to relevant law articles. Several state-of-the-art neural networks were employed to implement the extraction of judgment-specific case feature. We adopted the linear regression model and the neural network model to build the prison term predictor. Experimental results on the real-world dataset demonstrated the effectiveness of our method.
In future work, we will expand the dataset, and further validate and advance the proposed method on various types of criminal cases.