A dataset for predicting Supreme Court judgments in Nigeria

It has been widely argued among researchers that the application of big data analytics promises to reduce human bias and provide a scientific and evidence-based approach to the judicial process. In this dataset, historical data consisting of appeal cases presented at the Supreme Court of Nigeria (SCN) were collected from an online repository (Primsol Law Pavillion). A total of 5585 appeal cases brought before the SCN were collected from the archive. The dataset consisted of both criminal and civil appeal cases brought before the SCN. Variables that are related to court case proceedings were identified from related literature, verified by legal experts and used as a basis for generating an electronic structured version of the dataset stored as a spreadsheet file from the unstructured data. From the collected data, thirteen input variables were identified with one output/decision variable. The distribution of the numerical variables was presented as a descriptive statistical summary in terms of the minimum, maximum, mode, mean and standard deviation. The developed dataset can assist researchers to build predictive systems by training their models. Various feature extraction techniques can also be applied on the dataset to remove irrelevant or redundant features for increased performance of such classifiers that are needed to predict the outcome of legal cases.


© 2023 The Author(s

Value of the Data
• The dataset consists of information about approved and rejected appeal cases that were presented at the Supreme Court of Nigeria (SCN). The data can be useful in developing predictive systems that provide the decision support needed to facilitate the efficient handling of appeal cases brought before the SCN.
• The data will prove useful to data scientists and machine learning enthusiasts for the application of supervised and unsupervised machine learning algorithms, which can reveal previously unseen patterns and relationships between the identified variables.
• The dataset can be useful to lawyers and machine learning experts for the development of classification models that aid decisions affecting the outcome of appeal cases.
• Algorithmic decision predictors, which are an important part of the dataset, tend to improve the predictability and consistency of judicial decision making as demanded by the principle of equity.

Objective
The main goal of generating the dataset is to provide a means of improving the judicial process in Nigeria. The information contained in the dataset can be used to assess the relationship between the identified factors and the outcome of the judicial process. The data can also support the decision-making process that may affect the outcome of an appeal case brought before the SCN. The ability to determine the potential judgment of a case from the information provided could help a lawyer identify the best possible strategy to apply. Analysis of the data by lawyers can help in understanding the relationships among the variables and their bearing on the outcome of judgments. Use of the data by lawyers can also reduce the time and money spent searching through voluminous texts for the exact and accurate information needed to understand the distribution of certain elements of court cases.

Data Description
Technology has played a vital role in creating a foundation for the adoption of artificial intelligence (AI) [1]. Introducing AI to the justice system promises to improve procedural and administrative efficiency, aid the decision-making processes of judges, lawyers and litigants, and predict outcomes consistent with past precedents [2][3][4]. Refs. [5,6] are among the studies that have explored this area.
The dataset, named appeal_cases.csv, consists of previous criminal and civil appeal cases and the judgments delivered on them by the Supreme Court of Nigeria between 1962 and 2022, a period of 60 years. The secondary data were provided by Primsol Law Pavillion, which owns an independent online archive of court cases that can be accessed via subscription. A total of 5585 appeal cases brought before the SCN were collected from the archive. The case files in the archive were stored as text-based documents in the .docx format and therefore constituted an unstructured dataset. The unstructured dataset was converted into a structured format and stored as a spreadsheet file. The dataset consists of 14 variables that were painstakingly extracted from the case files, which gives it its distinct nature. Table 1 presents the categories of offence and the respective sentences for the appeal cases, together with the numerical value used to represent each category in the dataset. Table 2 presents the categorization and transformation of the districts of trial and appeal. The table shows the state to which each city belongs, while each senatorial district is composed of a number of states. Each senatorial district was coded with an integer value that was used to replace the categorical values of each feature.

Number of appellants
the number of persons who make an appeal to a higher court against a judgment made in a lower court. Stored as a non-negative integer, with -1 representing missing values.

Number of male appellants
the number of male persons who make an appeal to a higher court against a judgment made in a lower court. Stored as a non-negative integer, with -1 representing missing values.

Number of female appellants
the number of female persons who make an appeal to a higher court against a judgment made in a lower court. Stored as a non-negative integer, with -1 representing missing values.

Number of public witness(es)
the number of person(s) not a party and not called by a party to testify at a hearing. Stored as a non-negative integer, with -1 representing missing values.

Number of eye witness(es)
the number of person(s) who have seen something happen and can give a first-hand description of it. Stored as a non-negative integer, with -1 representing missing values.

Number of defense witness(es)
the number of witness(es) whom the appellant intends to call at a hearing or at trial. Stored as a non-negative integer, with -1 representing missing values.

Final decision held (Judgment)
the judgment made by the SCN regarding the appeal as either approved or disapproved. Stored as a non-negative integer, with -1 representing missing values.

Table 4 shows the statistical distribution of the numeric features in the dataset. The analysis was done using descriptive statistics, by estimating the mean, minimum, maximum, median, mode, standard error, standard deviation and other related statistics.
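The kind of summary reported in Table 4 can be sketched with the Python standard library. This is a minimal illustration rather than the procedure used to produce the table: the function name `describe` and the example values are hypothetical, and it assumes that records holding the dummy value -1 should be excluded before any statistic is computed.

```python
import math
import statistics

MISSING = -1  # dummy value used in the dataset for missing entries


def describe(values, missing=MISSING):
    """Summarise one numeric column, skipping the missing-value marker."""
    observed = [v for v in values if v != missing]
    n = len(observed)
    if n == 0:
        return {"count": 0}
    # The sample standard deviation needs at least two observations.
    sd = statistics.stdev(observed) if n > 1 else 0.0
    return {
        "count": n,
        "min": min(observed),
        "max": max(observed),
        "mean": statistics.mean(observed),
        "median": statistics.median(observed),
        "mode": statistics.mode(observed),
        "std_dev": sd,
        "std_error": sd / math.sqrt(n),  # standard error of the mean
    }


# Hypothetical values for a column such as the number of eye witnesses.
example_column = [2, -1, 1, 3, 1, -1, 1, 4]
summary = describe(example_column)
```

Leaving the -1 entries in place would pull the mean and minimum downwards, which is why they are filtered out first in this sketch.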

Data acquisition
Before developing the dataset, a thorough review of the literature was conducted in order to gain a good understanding of the underlying concepts. It was observed that every case consists of a number of components, namely: the case identity number, date of the case, location of trial and appeal, information about the appellants, complainants and witnesses, offence committed, sentence declared, determination of appeals, introduction to the appeal, facts covered, issues and the decision held by the judges regarding the appeal. The features that were identified include the following:
i. Appeal district - the senatorial district of the appeal court;
ii. Trial district - the senatorial district of the trial court;
iii. Offence - conduct or omissions that violate and are punishable under criminal law;
iv. Sentence - formal judgment of a convicted defendant in a criminal case setting the punishment to be meted out;
v. Number of complainants - the number of person(s) that report wrongdoing to law enforcement;
vi. Number of male complainants - the number of male person(s) that report wrongdoing to law enforcement;
vii. Number of female complainants - the number of female person(s) that report wrongdoing to law enforcement;
viii. Number of appellants - the number of persons who make an appeal to a higher court against a judgment made in a lower court;
ix. Number of male appellants - the number of male persons who make an appeal to a higher court against a judgment made in a lower court;
x. Number of female appellants - the number of female persons who make an appeal to a higher court against a judgment made in a lower court;
xi. Number of public witness(es) - the number of person(s) not a party and not called by a party to testify at a hearing;
xii. Number of eye witness(es) - the number of person(s) who have seen something happen and can give a first-hand description of it;
xiii. Number of defense witness(es) - the number of witness(es) whom the appellant intends to call at a hearing or at trial;
xiv. Final decision held - the judgment made by the SCN regarding the appeal as either approved or disapproved.
The information extracted from the literature review was validated by professional lawyers, after which data on these factors were collected.
These variables were used as a basis for extracting the data stored in an online archive containing details about the outcomes of various appeals brought before the Supreme Court of Nigeria (SCN). The repository consists of electronic summaries of the proceedings of cases, divided into sections such as the introduction, facts, issues and the decision held by the SCN. Figs. 1 and 2 show screenshots of the judgments of criminal and civil cases, respectively, brought before the SCN. Each case has a serial number, together with the district of appeal, district of trial, offence and sentence, which were all captured as categorical values from the introduction, facts and issues components of the raw file. Information about the number of complainants, male complainants, female complainants, appellants, male appellants, public witnesses, eye witnesses and defense witnesses was extracted from the facts component of the raw file and stored as numeric values. Information about the outcome of each appeal according to the SCN was also collected and stored as a string variable called decision, which contained the verdict of the SCN on every case. An additional variable named decision binary was then created to classify the decision based on granted and dismissed appeals alone. All other verdicts, such as sustained, suspended and re-appeal, were removed from the dataset, alongside cases that were lost as a result of damaged files.
The unstructured data stored in each text file were converted into structured data describing a set of variables, all of which were extracted from the documents collected from the online archive. Each part of a document was used to extract the information about the identified variables associated with the outcome of the case, while the decision held was used to determine whether the appeal was dismissed or granted. The cities of appeal and trial were classified according to the senatorial district in Nigeria to which their states belong. This was done to reduce the number of distinct values for each feature and thereby the complexity of analysing it. Each senatorial district was then coded with an integer value between 1 and 7, as depicted in Table 2, and this code replaced the categorical value of each feature.
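The city-to-district coding described above amounts to a chain of lookups, which can be sketched as follows. All names and codes in this example are hypothetical placeholders: the actual cities, states, senatorial districts and their integer codes (1 to 7) are those listed in Table 2. Returning -1 for an unrecognised city is also an assumption, chosen to match the dummy value used for missing entries.

```python
# Hypothetical lookups; the real names and codes are given in Table 2.
CITY_TO_STATE = {"City A": "State X", "City B": "State X", "City C": "State Y"}
STATE_TO_DISTRICT = {"State X": "District North", "State Y": "District South"}
DISTRICT_TO_CODE = {"District North": 1, "District South": 2}

MISSING = -1  # dummy value for cities that cannot be placed


def district_code(city):
    """Map a city to the integer code of its district, or -1 if unknown."""
    state = CITY_TO_STATE.get(city)
    district = STATE_TO_DISTRICT.get(state)
    return DISTRICT_TO_CODE.get(district, MISSING)


# Two cities in the same state collapse to the same district code.
codes = [district_code(c) for c in ["City A", "City B", "City C", "Elsewhere"]]
```

Collapsing many city names into a handful of district codes is what reduces the number of distinct values each location feature can take.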
Similarly, the offences and sentences declared for each appeal case, which were represented as categorical string values, were converted into numeric values. The target variable, which contained the judgment of the SCN on each appeal case, was classified into two classes that were converted to binary values: 0 for dismissed and 1 for granted appeal cases.
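The construction of the binary target can be illustrated with a short sketch. It assumes that the verdict strings read "granted" and "dismissed" (ignoring case and surrounding spaces) and that the new field is spelled `decision_binary`; both are assumptions made for illustration, as is the function name `add_decision_binary`.

```python
# 0 and 1 follow the coding stated in the text; the verdict spellings are assumed.
DECISION_CODES = {"dismissed": 0, "granted": 1}


def add_decision_binary(cases):
    """Keep only granted or dismissed appeals and attach a 0/1 target."""
    encoded = []
    for case in cases:
        code = DECISION_CODES.get(case["decision"].strip().lower())
        if code is None:
            continue  # other verdicts (e.g. sustained, suspended) are dropped
        encoded.append({**case, "decision_binary": code})
    return encoded


# Hypothetical records; only the decision field matters here.
cases = [
    {"case_id": 1, "decision": "Dismissed"},
    {"case_id": 2, "decision": "Granted "},
    {"case_id": 3, "decision": "Suspended"},
]
encoded_cases = add_decision_binary(cases)
```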

Data pre-processing
Pre-processing of the dataset was needed to eliminate the inconsistencies caused by missing values across the features. Fig. 3 shows, for each feature extracted from the case files, the proportion of the total records for which the value was missing. According to the figure, up to 72% of the records had missing values for the number of male and female complainants, while up to 44% of the records had missing values for the total number of complainants. Since all the features in the dataset had been converted to numeric values, all the missing values within the feature set were replaced with a dummy value of -1. Any missing values encountered subsequently can be replaced with the same dummy value, which makes the dataset more suitable for use.
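Both steps, measuring the share of missing records per feature and substituting the dummy value, can be sketched in a few lines. The field names and records below are hypothetical, and the sketch assumes that a missing entry appears as an empty string or as None before replacement.

```python
MISSING = -1  # dummy value that replaces missing entries


def is_missing(value):
    """Treat None and blank strings as missing."""
    return value is None or str(value).strip() == ""


def missing_share(records, field):
    """Proportion of records whose value for `field` is missing."""
    if not records:
        return 0.0
    return sum(is_missing(r.get(field)) for r in records) / len(records)


def fill_missing(records, fields, sentinel=MISSING):
    """Return copies of the records with missing numeric fields set to -1."""
    filled = []
    for record in records:
        row = dict(record)
        for field in fields:
            value = record.get(field)
            row[field] = sentinel if is_missing(value) else int(value)
        filled.append(row)
    return filled


# Hypothetical raw records before replacement.
rows = [
    {"complainants": "2", "male_complainants": ""},
    {"complainants": "", "male_complainants": ""},
    {"complainants": "1", "male_complainants": "1"},
    {"complainants": "3", "male_complainants": None},
]
fields = ["complainants", "male_complainants"]
shares = {f: missing_share(rows, f) for f in fields}
clean = fill_missing(rows, fields)
```

Because -1 is an ordinary integer, a model trained directly on the filled columns would treat it as a real count, so users may prefer to mask or impute these entries before training.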