Introduction

Local authorities in England have statutory responsibility for protecting the welfare of children and delivering children’s social care. The COVID-19 pandemic has put pressure on the children’s social care sector and exacerbated existing challenges in risk and other assessments [1, 2]. Early Help is a service provided by local authorities that offers social care and support to families, including early intervention services for children, young people, or families facing challenges beyond the scope of universal services such as schools or general practitioners. Early Help addresses lower level support needs, aiming to prevent problems from escalating into the statutory social care system (e.g., child protection). Furthermore, Early Help can offer children the support needed to reach their full potential; improve the quality of a child’s home and family life; enable them to perform better at school; support their mental health; and help a child develop the strengths and skills that prepare them for adult life.

Increase in children in need. Data from the most recent Child in Need Census 2022–2023 revealed 404 310 children in need, up 4.1 % from 2021 and up 3.9 % from 2020, the highest number since 2018. Furthermore, there were 650 270 referrals, up 8.8 % from 2021 and up 1.1 % from 2020, the highest number since 2019. In 2022, compared with 2021, when restrictions on school attendance were in place for parts of the year due to COVID-19, referrals from schools increased, in turn driving the overall rise in referrals. The Department for Education (DfE) is collecting data on early help provision as part of the 2022–2023 and 2023–2024 Child in Need Censuses [3, 4]. These census data are submitted by local authorities to the DfE between early April and the end of July each year. Information about early help enables the DfE to understand more about the contact that children in need have with the Early Help services that local authorities provide.

Need for data-driven tools and machine learning. The increasing numbers of children in need and referrals have highlighted the need for data-driven tools that can analyse large datasets to aid local authorities in making informed decisions for individuals at risk, alleviating pressure on a chronically overstretched service. Machine learning (ML), an application of artificial intelligence (AI), can efficiently analyse vast amounts of data from diverse sources. In the context of children’s services, this capability allows for the identification of risk factors of which social workers may not otherwise have been aware, such as a family falling behind on rent payments. Such data can be combined with other relevant information, such as school attendance records, to provide a more comprehensive view of the situation.

The effectiveness of ML is currently limited by the lack of transparency in the decision-making process of ML models and the data they use [5,6,7,8,9]. It is therefore crucial to increase transparency in ML decision-making to ensure fair and equitable outcomes for individuals and communities. An ML model may be biased if it systematically performs better on certain socio-demographic groups [10, 11]. This can occur when the model has been developed on unrepresentative, incomplete, faulty, or prejudicial data [12,13,14]. Given the potential impact of bias on individuals and society, there is growing interest among businesses, governments, and organizations in tools and techniques for detecting, visualising, and mitigating bias in ML. These tools have gained popularity in addressing bias-related issues and are increasingly recognized as important means of promoting fairness [15,16,17,18,19,20,21,22]. The use of ML in social care raises both technical and ethical considerations [23]. For example, if used responsibly and fairly, these models have the potential to assist in protecting young people [24], particularly when combined with successful early intervention programs such as the Early Start program developed in New Zealand [25]. Responsible use of ML models has the potential to enhance the usefulness of risk assessment tools in child welfare [26].

This paper evaluates the suitability of ML models for identifying young people who may require Early Help Services (EHS) and applies methods for identifying and mitigating bias. For the purposes of this work, Leicestershire County Council (LCC)’s locality triage is categorized as: (1) eh support (Early Help support), the most intense type of intervention; (2) some action, referral to less intensive services, such as group activities or schemes that run during the school holidays, or to external services; and (3) no action, where additional support is not currently required. Specifically, the contributions of this paper are: (a) ML models were implemented and their performance was evaluated across different validation and test sets; and (b) bias analysis was conducted and mitigation algorithms were applied to reduce bias in the ML models. This study revealed that certain educational indicators, such as fixed-term exclusion and free school meals, may predict the need for EHS.

Methods

Dataset

The dataset, provided by LCC, contains records of young people under 18 years of age. The data relate to families and individuals assessed for Early Help support between April 2019 and August 2022. The time period covered by the features varies with the age of the young person, with older individuals having data across a longer time-frame. The initial dataset contained 15 976 records and 149 features. The total percentage of missing values was 5.41 %, and the total percentage of NA values was 20.33 %.

To pre-process the dataset, missing values were replaced with 0, and records with more than 30 % missing values were removed (10 % of the data). For cells that contained NA values, each relevant feature was paired with an indicator feature named feature name NA, which received the value 1 if the original feature was not applicable to the record. For example, the feature Not in Education, Employment or Training (neet) is not applicable to those under 16 years, resulting in the presence of NA values. Supplementary Table S11 contains the statistics of the features before one-hot encoding.

After pre-processing, the dataset contained 14 360 records and 149 features. The number of features with less than 5 % missing values is 64, while the number of features with less than 20 % NA cells is 91. After one-hot encoding, the pre-processed dataset had 363 features. The feature locality decision represents the target variable with three categories: some action (56.59 %), eh support (33.10 %), and no action (10.31 %). Those who received EHS belong to the eh support category. Any other type of service provided by LCC to a child, or signposting to an external organization, is considered some action, and young people who did not receive any action are labelled no action. The remaining input features represent educational indicators and are grouped into topics such as Absence, Exclusion, School Transfer, Free School Meal (fsm), Special Educational Needs and Disabilities (send), Pupil Referral Unit (pru), Home Education, Missing, Not in Education, Employment or Training (neet), Early Years Funding (eyf), and the Income Deprivation Affecting Children Index (idaci). A description of these features can be found in Supplementary Table S5.
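
For illustration, the pre-processing steps described above might look as follows in pandas; the "NA" sentinel value, the example column neet, and the DataFrame df are assumptions made for this sketch, not details taken from the paper.

```python
import pandas as pd

def preprocess(df: pd.DataFrame, max_missing: float = 0.30) -> pd.DataFrame:
    # Drop records with more than 30 % missing values (10 % of the data).
    df = df[df.isna().mean(axis=1) <= max_missing].copy()
    # Pair each relevant feature with a "feature name NA" indicator that is 1
    # when the feature is not applicable (assumed encoded with an "NA" sentinel).
    for col in ["neet"]:  # illustrative; the paper pairs every relevant feature
        df[f"{col} NA"] = (df[col] == "NA").astype(int)
        df[col] = df[col].replace("NA", 0)
    # Replace the remaining missing values with 0.
    df = df.fillna(0)
    # One-hot encode categorical features (149 -> 363 columns in the paper).
    return pd.get_dummies(df)
```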

Machine learning model evaluations

The aim is to identify the best performing ML model for predicting three locality decision outcomes: eh support, some action, and no action. The dataset was divided into two sets: a Training/validation set with 10 052 records (70 %), and a Test set with 4 308 records (30 %). Then, the following ML techniques were evaluated using stratified tenfold cross-validation (CV): Ridge Classification, Logistic Regression, Support Vector Classification (Linear and Kernel), K-Nearest Neighbors (KNN) Classifier, Gaussian Naive Bayes, Decision Tree, Random Forest Classifier, Gradient Boosting Classifier, Extreme Gradient Boosting, Ensemble Methods (AdaBoost, CatBoost), and Discriminant Analysis (Linear and Quadratic). Supplementary Tables S1–S3 show the results of evaluating the above-mentioned models for each locality decision outcome.
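
A minimal sketch of this comparison using scikit-learn is shown below; the placeholder data and the subset of classifiers are assumptions, standing in for the full model list and the LCC dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Placeholder data standing in for the 70 % training/validation split.
X_train, y_train = make_classification(n_samples=1000, n_features=20,
                                       weights=[0.67], random_state=0)

models = {
    "LR": LogisticRegression(max_iter=1000),
    "Ridge": RidgeClassifier(),
    "GBC": GradientBoostingClassifier(),
    "RF": RandomForestClassifier(),
    # ... remaining classifiers evaluated in the paper
}

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for name, model in models.items():
    scores = cross_validate(model, X_train, y_train, cv=cv,
                            scoring=["roc_auc", "recall", "precision"])
    print(f"{name}: AUC={scores['test_roc_auc'].mean():.2f} "
          f"recall={scores['test_recall'].mean():.2f} "
          f"precision={scores['test_precision'].mean():.2f}")
```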

The hyperparameter settings of each model are shown in Supplementary Table S4 for reproducibility purposes. The best models were chosen based on their area under the curve (AUC), recall, and precision scores on the validation sets. To confirm that the performance of the best models was stable (i.e., had low variance), each was evaluated using the tenfold CV approach repeated 30 times on the training set, with a different random seed for each repetition so that the fold splits varied. The test set remained the same across the 30 iterations. The average and standard deviation values across the iterations were recorded for the validation and test sets.
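
The repeated-CV protocol could be sketched as follows (scikit-learn assumed); X_train and y_train are the placeholders from the previous sketch, and the choice of GBC as the best model is illustrative.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

best_model = GradientBoostingClassifier()  # illustrative choice of best model
aucs = []
for seed in range(30):  # a different seed per repetition varies the fold splits
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    aucs.append(cross_val_score(best_model, X_train, y_train,
                                cv=cv, scoring="roc_auc").mean())
print(f"validation AUC: {np.mean(aucs):.2f} (sd {np.std(aucs):.2f})")
```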

Multi-class models were implemented, but these did not perform to a satisfactory standard (see Supplementary Tables S6–S7). As a solution, separate binary models were implemented for predicting each outcome (i.e., one model per outcome), and these achieved better results. Hence, this paper presents the analysis of the separate models; the results of the multi-class models can be found in the supplementary material.

Evaluation metrics

The AUC, recall, and precision evaluation metrics were used to evaluate and compare the predictive performance of the ML models. Recall (also known as sensitivity or true-positive rate) measures the proportion of actual positive cases that a model correctly identifies. Precision (also known as positive predictive value) measures the proportion of positive predictions made by the model that are correct. Recall and precision are calculated as \(\text{Recall} = \mathrm{TP}/(\mathrm{TP} + \mathrm{FN})\) and \(\text{Precision} = \mathrm{TP}/(\mathrm{TP} + \mathrm{FP})\), where:

  • true positive (TP) refers to a young person who required the outcome under consideration (eh support, some action, or no action) and was predicted by the model as such.

  • true negative (TN) refers to a young person who did not require the outcome and was predicted as such.

  • false negative (FN) refers to a young person who required the outcome but was predicted as not requiring it.

  • false positive (FP) refers to a young person who did not require the outcome but was predicted as requiring it.
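
As a small worked example of these definitions, the following sketch derives recall and precision from a confusion matrix (scikit-learn assumed; the labels are illustrative).

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 0, 0, 1, 0, 1]  # 1 = required the outcome
y_pred = [1, 1, 0, 0, 1, 1, 0, 1]  # model predictions
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("recall:", tp / (tp + fn))     # 4 / (4 + 1) = 0.8
print("precision:", tp / (tp + fp))  # 4 / (4 + 1) = 0.8
```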

The receiver-operating characteristic (ROC) curve illustrates the trade-off between the true-positive rate (TPR) and the false-positive rate (FPR), and the AUC is an aggregate metric of a model's classification performance: the closer the AUC is to 1, the better the classifier. The distribution of the classes eh support, some action, and no action in the target variable locality decision is imbalanced. In such cases, adjusting the decision threshold can be a useful technique for improving the performance of ML models and reducing the occurrence of FNs, which are often the most costly errors in imbalanced classification problems. By default, the threshold is set to 0.5. However, an appropriate threshold was chosen based on the trade-off between the costs associated with FPs and FNs. This was achieved by calculating the precision–recall curve and selecting the threshold that maximizes recall. The process of threshold selection is described in the Threshold adjustment section.

Bias mitigation in ML models

The threshold optimizer and exponentiated gradient algorithms were applied to mitigate bias in the ML models. The bias evaluation considered the following sensitive features: gender, age at locality decision, attendance, and idaci. The false-negative rate (FNR) was used as the bias metric, since it represents those who would benefit from eh support (or some action or no action) but were not predicted as such. The two-sample Z-test for proportions was applied to evaluate whether there is a significant difference between the FNRs of the categories of a given sensitive feature (e.g., gender, age at locality decision, attendance, and idaci). The null hypothesis is that there is no significant difference between the two proportions (i.e., FNRs), while the alternative hypothesis is that there is a significant difference between them. Under the null hypothesis, the test statistic follows a standard normal distribution, and the p value can be calculated using this distribution. If the p value is less than the chosen significance level (\(\alpha = 0.05\)), the null hypothesis is rejected; it can then be concluded that there is a significant difference between the two FNR values and that the ML model may be biased with respect to the sensitive feature under scrutiny. In this case, bias-mitigation algorithms are considered to reduce the bias for that feature. Otherwise, if the p value is greater than the significance level, the null hypothesis is not rejected, and it is concluded that there is not enough evidence to suggest a significant difference between the two FNRs; in this case, the ML model is taken not to present bias for the sensitive feature under scrutiny.
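
A minimal sketch of this test is given below (SciPy assumed); the group counts are hypothetical and chosen only to reproduce FNRs of 0.19 and 0.16.

```python
from math import sqrt
from scipy.stats import norm

def fnr_z_test(fn_a: int, pos_a: int, fn_b: int, pos_b: int, alpha: float = 0.05):
    """Two-sample Z-test for proportions comparing the FNRs of groups A and B."""
    p_a, p_b = fn_a / pos_a, fn_b / pos_b     # group-wise FNRs
    pooled = (fn_a + fn_b) / (pos_a + pos_b)  # pooled proportion under H0
    z = (p_a - p_b) / sqrt(pooled * (1 - pooled) * (1 / pos_a + 1 / pos_b))
    p_value = 2 * norm.sf(abs(z))             # two-sided p value
    return z, p_value, p_value < alpha        # True -> significant difference

# Hypothetical counts giving FNRs of 0.19 (female) and 0.16 (male):
print(fnr_z_test(fn_a=95, pos_a=500, fn_b=80, pos_b=500))
```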

Threshold adjustment

Threshold adjustment is applied to find the threshold that maximizes the performance of the model in terms of precision and recall. The process of adjusting the threshold for the eh support model is as follows. After training, the model returns a probability score reflecting its confidence in the prediction. The threshold value is then used to decide whether the prediction should be classified as eh support (class 1) or not (class 0). For example, without adjustment, when the model predicts a probability greater than 0.5, the record is labelled as class 1; otherwise, it is labelled as class 0. With the adjustment, the threshold value was set based on the threshold analysis illustrated in Fig. 1. This analysis revealed that the optimal cutoff for classification purposes (i.e., the point with the best balance between precision and recall) occurs at a threshold of 0.27 for the gradient-boosting classifier (GBC) model and at a threshold of 0.25 for the logistic regression (LR) model. The F1 curve represents the harmonic mean of the precision and recall rates. The performance of these models (with and without the threshold) is then evaluated using stratified tenfold CV over 30 iterations.
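
The threshold analysis can be sketched as follows (scikit-learn assumed); the synthetic data and LR model are placeholders for the CV fold predictions used in the paper, and the F1-maximizing cutoff stands in for the precision/recall balance shown in Fig. 1.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve

# Placeholder data and model; in the paper these are the CV fold predictions.
X, y_true = make_classification(n_samples=1000, weights=[0.67], random_state=0)
probs = LogisticRegression(max_iter=1000).fit(X, y_true).predict_proba(X)[:, 1]

precision, recall, thresholds = precision_recall_curve(y_true, probs)
f1 = 2 * precision * recall / (precision + recall + 1e-12)  # avoid 0/0
best = np.argmax(f1[:-1])           # the final PR point has no threshold attached
print(f"optimal threshold: {thresholds[best]:.2f}")
y_pred = (probs >= thresholds[best]).astype(int)  # apply the adjusted cutoff
```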

Fig. 1 Threshold analysis for the GBC and LR eh support models

Table 1 presents the average and standard deviation values for recall and precision for the GBC and LR models (with and without threshold).

Table 1 Threshold adjustment for the eh support model: predictive performance for the GBC and LR models (with and without the threshold). Average recall and precision using stratified tenfold CV over 30 iterations

The same process was applied to the other models. No threshold adjustment was needed for the some action model, and the threshold was therefore left at 0.5. For the no action model, the optimal threshold was 0.46. Supplementary Figure S4 illustrates these results.

Results

This section describes the performance of ML models for predicting whether a young person requires EHS. Specifically, it evaluates binary classification models for predicting each of the following outcomes: (1) eh support; (2) some action; (3) no action. The LIME [27] method is then applied to explain the model predictions by identifying the features that were most important for correct or incorrect classifications. Supplementary Table S5 provides a description of all features used in this work. Finally, the threshold optimizer [28] and exponentiated gradient reductions [29] techniques are applied to mitigate bias, and their suitability is assessed.
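
A minimal sketch of how LIME can be applied to a tabular classifier is shown below (the lime package assumed); the data, feature names, and GBC model are placeholders rather than the paper's actual pipeline.

```python
import numpy as np
from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Placeholder data and model standing in for the paper's pipeline.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
feature_names = [f"feature {i}" for i in range(10)]
model = GradientBoostingClassifier().fit(X, y)

explainer = LimeTabularExplainer(X, feature_names=feature_names,
                                 class_names=["not eh support", "eh support"],
                                 mode="classification")
# Explain one young person's prediction: positive weights push the prediction
# towards class 1 (eh support), negative weights towards class 0.
exp = explainer.explain_instance(X[0], model.predict_proba, num_features=10)
print(exp.as_list())
```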

Can we predict whether a young person needs Early Help (eh) support? Model for eh support

Model performance. Supplementary Table S1 lists the predictive performance of the different ML models sorted by AUC. GBC presents the highest AUC and precision values, while LR presents the second highest AUC and precision values. In terms of computational time, LR is 6.5 times faster than CatBoost (CB), and 4.3 times faster than AdaBoost (ADA). Supplementary Table S8 lists the validation performance of the GBC and LR models across 30 iterations. The LR model outperformed the GBC on the validation sets (i.e., across 30 iterations) with an average AUC of 0.62 (standard deviation \(\hat{\sigma } = 0.01\)), an average recall of 0.82 (\(\hat{\sigma } = 0.01\)) and an average precision of 0.38 (\(\hat{\sigma } = 0.00\)). On the test set, LR reached an AUC of 0.63, recall of 0.83, and precision of 0.38. Figure 2 illustrates the ROC curve for the LR classifier and Supplementary Table S10 presents the test performance metrics and the optimal ROC point.

Fig. 2 eh support model: ROC curve, AUC, and optimal ROC point for the LR classifier on the test set

Interpretation of results. The LR model demonstrates a moderate ability to differentiate between young people who require early help and those who do not, as indicated by the AUC of 0.63. However, it has a relatively high recall of 0.83, meaning that it captures a good proportion of the young people who require early help. On the other hand, the precision of 0.38 suggests that the model generates a considerable number of FPs, incorrectly identifying some young people as needing early help.

Factor analysis. The findings from the LIME analysis revealed the factors related to young people correctly classified by the eh support model. Those who did not require eh support, i.e., the true-negative (TN) group, had a median age of 14 years, compared to a median age of 8 years for those who did require eh support, i.e., the true-positive (TP) group. A higher proportion of young people who required eh support (91 %) attended a pupil referral unit (pru) and received special educational needs and disabilities education, health, and care (send ehc) support, compared to those who did not require eh support (77 %). Those for whom the special educational needs and disabilities (send needs) or special educational needs and disabilities referral (send referral) features were not applicable (NA) were less likely to require eh support. Supplementary Fig. S1 illustrates these findings. For those belonging to the TP group, i.e., those who required eh support and were correctly classified by the model, the most relevant features were permanent exclusion, Not in Education, Employment or Training (neet), School Transfer Phased (transfer phased), send referral, and pru. Figure 3 shows all important features for those young people who were correctly classified by the eh support model. The negative LIME values in Fig. 3a and the positive LIME values in Fig. 3b show the most important features of the TN and TP groups, respectively.

Fig. 3 LIME analysis of young people correctly classified by the eh support model (LR classifier) on the test set

LIME analysis also revealed that the model prioritized features with lower percentages of missing data. For example, the feature neet (year 4) has 0.1 % missing values, whereas neet (prev 3terms) and neet (year 1) have 3.9 % and 2.3 % missing values, respectively. Another example is pru (year 5), with 1.6 % missing values, compared with pru (year 1), with 6.3 % missing values (see Supplementary Table S11). The LR model for eh support had an FNR of 0.19 for females, 0.16 for males, and 0.33 for the category other, which comprises only 0.4 % of the data. The feature Income Deprivation Affecting Children Index (idaci) was categorized into five classes (idaci class) according to the ranges IDACI 1: [0.0, 0.2), IDACI 2: [0.2, 0.4), IDACI 3: [0.4, 0.6), IDACI 4: [0.6, 0.8), and IDACI 5: [0.8, 1.0], and the FNR values oscillated between 0.13 and 0.22.

Bias analysis. Figure 4 shows the application of the bias-mitigation algorithms. The threshold optimizer (labelled ‘Post-processing’) and the exponentiated gradient reductions algorithm (labelled ‘Reductions’) reduced the FNR differences across categories, but at the cost of an overall increase in the FNR.
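
Both mitigation strategies are implemented in the fairlearn library; a minimal sketch is given below, with placeholder data and an illustrative LR estimator. Since the FNR equals one minus the true-positive rate, FNR parity corresponds to fairlearn's true-positive-rate parity constraint.

```python
import numpy as np
from fairlearn.postprocessing import ThresholdOptimizer
from fairlearn.reductions import ExponentiatedGradient, TruePositiveRateParity
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Placeholder data; `sensitive` stands in for a feature such as gender.
X, y = make_classification(n_samples=1000, random_state=0)
sensitive = np.random.default_rng(0).choice(["a", "b"], size=1000)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Post-processing: group-specific decision thresholds on the fitted model.
post = ThresholdOptimizer(estimator=model,
                          constraints="false_negative_rate_parity",
                          prefit=True, predict_method="predict_proba")
post.fit(X, y, sensitive_features=sensitive)
y_post = post.predict(X, sensitive_features=sensitive)

# Reductions: retrain under an (approximate) FNR-parity constraint.
red = ExponentiatedGradient(LogisticRegression(max_iter=1000),
                            constraints=TruePositiveRateParity())
red.fit(X, y, sensitive_features=sensitive)
y_red = red.predict(X)
```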

Fig. 4 LR model for eh support: FNR of the unmitigated and mitigated models (on the test set) for features gender and idaci. Note that ‘Reductions’ refers to the exponentiated gradient reductions algorithm

A two-sample Z-test for proportions compared the FNR values across the categories of the sensitive features gender and idaci. The test was not performed for the category other, because it comprises only 0.4 % of the data. According to the Z-test, there were no significant differences between the FNR values of female (0.19) and male (0.16), nor between most of the categories of idaci, where the FNR oscillated between 0.13 and 0.22. Supplementary Table S12 details these results and concludes that the LR model does not present bias with respect to the sensitive features gender and idaci.

Can we predict whether some action is needed? Model for some action

Model performance. Supplementary Table S2 provides the predictive performance of the ML models, and the GBC presented the best AUC and recall values. LR presented the second highest AUC and recall values and a computational time 6.2 times faster than the CB model. The GBC reached an average AUC of 0.60 (\(\hat{\sigma } = 0.01\)), an average recall of 0.79 (\(\hat{\sigma } = 0.02\)), and an average precision of 0.61 (\(\hat{\sigma } = 0.01\)) across 30 iterations. On the test set, the model had an AUC of 0.60, recall of 0.81, and a precision of 0.61 (Fig. 5).

Fig. 5 some action model: ROC curve, AUC, and optimal ROC point for the GBC on the test set

Interpretation of results. The GBC model demonstrates a modest ability to differentiate between young people who require some action and those who do not, as indicated by the AUC of 0.60. It has a relatively high recall of 0.79, indicating its capability to capture a good proportion of young people who require action. However, the precision of 0.61 suggests that the model still generates a considerable number of FPs, incorrectly identifying some young people as requiring action when they do not.

Factor analysis. The LIME analysis revealed that the median age of those who did not receive some action was 6 years, versus 12 years for those who did. In the former group, a higher proportion received early years funding (eyf) and free school meals (fsm). The average number of fixed-term exclusion sessions was higher among those who received some action (0.184) than among those who did not (0.065). Supplementary Figure S2 illustrates these results. Moreover, among those who received some action and were correctly classified by the model (the TP group), the most relevant features were permanent exclusions, fsm, send referral, age at locality decision (11–13 years), and missing education. The negative LIME values in Fig. 6a represent the most important features of the TN group, and the positive LIME values in Fig. 6b show the most important features of the TP group.

Fig. 6 LIME analysis for those correctly classified (test set) by the some action model (GBC)

Young people who received some action had a higher average age at locality decision (9.3 years) than those who did not (8.5 years). The attendance (year 4) feature (i.e., the percentage of total school attendance sessions in the academic year four years previously) also differed between those who received some action (61.5 %) and those who did not (56 %). For the other features, only a small difference between the groups was observed (less than 2 %). Since age at locality decision and attendance (year 4) are numerical features, it was necessary to create new categorical features (class age and attendance bin, respectively) for evaluating the presence of bias, as sketched below. Supplementary Figure S5 shows histograms for both features. The feature class age was categorized into three groups: below 7.5 years (group A), 7.5–12.5 years (group B), and above 12.5 years (group C). The new binary feature attendance bin has two categories: \(\le 0.5\) and \(> 0.5\).
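
A minimal sketch of deriving these categorical audit features with pandas is shown below; the tiny example frame is a placeholder, and the bin boundary handling (which side each cut point falls on) is an assumption, as the paper does not specify it.

```python
import pandas as pd

# Placeholder frame; column names follow the paper's feature names.
df = pd.DataFrame({"age at locality decision": [3.0, 9.0, 15.0],
                   "attendance (year 4)": [0.40, 0.62, 0.95]})

# class age: A (< 7.5), B (7.5-12.5), C (> 12.5) years.
df["class age"] = pd.cut(df["age at locality decision"],
                         bins=[0, 7.5, 12.5, 18], labels=["A", "B", "C"])

# attendance bin: <= 0.5 versus > 0.5, treating attendance as a proportion.
df["attendance bin"] = pd.cut(df["attendance (year 4)"],
                              bins=[0, 0.5, 1.0], labels=["<=0.5", ">0.5"])
print(df)
```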

The GBC model for some action had an FNR of 0.22 for the category female and 0.20 for the category male. The category other had the lowest FNR, 0.08. The FNR values across the categories of idaci oscillated between 0.18 and 0.25. Two-sample Z-tests were carried out to compare differences between categories and identify potential biases within the GBC model. According to the Z-test, there was no significant difference between the FNR values of male and female, nor between the FNR values of most of the categories of idaci. With regard to the class age feature, the GBC model had an FNR of 0.03 in group C (above 12.5 years), 0.21 in group B (7.5–12.5 years), and 0.38 in group A (below 7.5 years). There was a significant difference between the FNR values of all the groups. Similarly, for the attendance bin feature, the FNR was 0.12 in the group \(> 0.5\) and 0.36 in the group \(\le 0.5\); the test concluded that this difference was significant. Supplementary Table S12 details these findings. Hence, these results reveal the presence of bias in the features class age and attendance bin.

Bias analysis. The bias-mitigation post-processing algorithm threshold optimizer reduced the gap between the categories, but it increased the FNR by more than 0.30 in all groups. The exponentiated gradient reductions algorithm, in contrast, decreased the FNR in most groups and produced the closest FNR values between the groups, giving a better outcome. Figure 7 illustrates these findings. The GBC with the exponentiated gradient reductions algorithm had a precision of 0.59 and a recall of 0.85 on the test set, whereas the unmitigated GBC model had a precision of 0.61 and a recall of 0.80. The mitigated model thus outperforms the unmitigated model in terms of recall (0.85 versus 0.80), while the unmitigated model has a slightly higher precision (0.61 versus 0.59). Moreover, the exponentiated gradient algorithm reduced the overall FNR by 0.06 points.

Fig. 7 GBC model for some action: FNR of unmitigated and mitigated models (on the test set) for features class age and attendance bin. On the x-axis, each bracket contains two values: the first refers to class age and the second to attendance bin

Can we predict whether no action is needed? Model for no action

Model performance. Supplementary Table S3 provides the predictive performance of different ML models. GBC presented the best AUC but poor predictive performance in terms of precision and recall. CB and LR reached better precision values than GBC, while RF presented better recall among the models with the highest AUC values. Based on these aspects, LR, RF, and CB were chosen as the best performing models. The use of the L1-norm improved the predictive performance of the LR model. Supplementary Table S9 provides the validation performance of the ML models over 30 iterations. The LR model reached an average validation AUC of 0.56 (\(\hat{\sigma }\) = 0.01), an average recall of 0.60 (\(\hat{\sigma }\) = 0.03), and an average precision of 0.11 (\(\hat{\sigma }\) = 0.01). On the test set, the model had an AUC of 0.56, a recall of 0.63, and a precision of 0.12. None of the models performed well on this task; the pronounced class imbalance explains the low recall and precision values. Figure 8 illustrates the ROC curve, and Supplementary Table S10 presents the predictive performance metrics and the optimal ROC point for the LR classifier on the test set.

Fig. 8 no action model: ROC curve, AUC, and optimal ROC point for the LR classifier on the test set

Interpretation of results. The LR model has a relatively low AUC score, suggesting poor performance in distinguishing between positive and negative instances. Although it exhibits a decent recall rate, capturing 63% of the young people who require no action, its precision of 12% is quite low, indicating a high number of FPs. These results suggest that the model may have difficulty accurately identifying individuals who require no action.

Factor analysis. The LIME results for the LR classifier are illustrated in Fig. 9. The negative LIME values in Fig. 9a represent the most important features of the TN group, and the positive LIME values in Fig. 9b show the most important features of the TP group. For those in the no action category who were correctly classified by the model (the TP group), the most relevant features were exclusion lunchtime, missing education, neet, pru, and home educated.

Fig. 9 LIME analysis for young people correctly classified by the no action model (LR classifier) on the test set

Moreover, a high proportion of those who had NA values in the send need feature did not belong to the no action category (hence, they had received eh support or some action). This result aligns with the finding obtained by the eh support model. Young people in the TN group had a median age of 9 years, compared to a median of 12 years in the TP group, although the box plot suggests that the age distributions of the two groups largely overlap (see Supplementary Fig. S3). The LR model for no action had an FNR of 0.36 for the category female and 0.39 for the category male. The FNR values across the categories of idaci oscillated between 0.30 and 0.41. According to the two-sample Z-test, there was no significant difference between the FNR values of female and male, nor between any of the categories of idaci, suggesting that the LR model is not biased with respect to these features. Regarding the feature class age, the LR model had an FNR of 0.36 in group A (below 7.5 years), 0.56 in group B (7.5–12.5 years), and 0.23 in group C (above 12.5 years). The two-sample Z-test concluded that there was a significant difference between the FNR values of all groups, suggesting the presence of bias in the feature class age. Supplementary Table S12 details these results.

The bias-mitigation algorithms, threshold optimizer (post-processing) and exponentiated gradient (reductions), decreased the difference between the categories but increased the FNR for all groups, which is not ideal for classification purposes. Figure 10 illustrates these findings.

Fig. 10 LR model for no action: FNR of unmitigated and mitigated models (on the test set) for class age

The LR classifier demonstrated the best predictive performance on the test set, correctly identifying 63 % of all young people within the no action category. However, the bias-mitigation algorithms did not reduce the FNR, and the strong imbalance in the target variable limited the predictive performance of the model.

Discussion

Overview. For the social care task described in this paper, models were developed to determine whether ML can assist human decision-makers in identifying families whose young people may require EHS. Young people who could benefit from EHS but are not identified or offered such services can be disadvantaged by the social care system. Therefore, it was important to identify the characteristics of those who could be disproportionately negatively impacted by the ML models, as a strategy for understanding and communicating the models' limitations. Since the dataset was sparse and noisy, adequate data treatment was required before the ML models could be used. Imputation techniques were considered, as was the use of one-hot encoded features, which allows the ML algorithms to distinguish between NA, missing, and filled-out cells. The pre-processed dataset was thereafter used for data analysis and ML tasks.

Model testing. During testing, while the models show some capability in capturing young people who require intervention or early help, they also generate a significant number of FPs, incorrectly flagging individuals who do not actually need an early help referral. This indicates room for improvement in precision and in overall performance in accurately identifying those in need of action or help.

Bias analysis. The bias analysis revealed that the sensitive features gender and idaci did not bias the models with regard to predicting locality decision (i.e., eh support, some action, no action). However, the analysis identified age at locality decision and attendance (year 4) as sensitive features for which the difference between the FNR values of some groups was statistically significant: in the some action model, for example, the FNR was 0.38 for young people below 7.5 years but only 0.03 for those above 12.5 years. The use of bias-mitigation algorithms reduced the FNR differences across these groups and, for the some action model, also improved predictive performance. The data imbalance in the no action category affected that model’s predictive performance. Those correctly classified by the eh support model as not requiring eh support had a median age of 14 years, compared to a median age of 8 years for those who did require it. A higher proportion of young people who required eh support attended pru services or received send support. In contrast, the median age of those who did not receive some action was 6 years, versus 12 years for those who did, and a higher proportion of those who did not receive some action received benefits such as eyf or fsm. Although a variety of ML algorithms and bias-mitigation techniques were considered, fairness is a socio-technical challenge; mitigations are therefore not all technical and need to be supported by processes and practices. The use of sensitive features, including demographic information, during the analysis of ML results can enhance the understanding of model behaviour and aid the identification of groups that could be subject to bias. It is important to assess ML performance differences across groups and the likelihood of bias.

Conclusion. The findings from our study demonstrate that ML has the potential to support decision-making in social care, and the results highlight that further research is needed to develop methods that work on such complex datasets. In particular, further research is needed on methods and strategies for dealing with missing, uncertain, and sparse data, and on ML models that can provide clear explanations for their predictions. Research is also required into how best to visualise and communicate the outputs of ML models to end-users in a way that supports decision-making.

Limitations

The limitations of the study are as follows.

  • A limitation of this study is that a decision was taken early on to restrict the study to young people aged 18 and under at the point of assessment. However, this leaves a blind spot: young people (n = 23 420) aged 18 and under who were never referred for assessment but possibly should have been.

  • The original dataset contained information about first language. However, this feature was excluded from the study due to a high proportion (29 %) of missing values.

  • The dataset also contained information on ethnicity, which was categorized as either white or non-white due to the low frequency of other ethnic categories (see Supplementary Table S13). Analysis of the correctly classified and misclassified results revealed that the performance of the ML models was similar across the ethnic groups (see Supplementary Table S14). More specifically, based on a two-sample Z-test for the difference of proportions (p value > 0.05), there were no statistically significant differences in the performance of these ML models between the groups. The team is currently pursuing a follow-up collaboration with LCC to expand this study. Future work involves analysing the data of primary and secondary school young people who require eh support and incorporating new demographic features to uncover new insights and findings.

  • Imputation methods were explored in our previous study and are not reported in this paper. None of the imputation methods explored was suitable for imputing missing values in this dataset, given the sensitivity of the data, and it was therefore considered ethical not to impute the missing values. Instead, the missing-not-at-random values were treated using one-hot encoding. Further study is needed to develop algorithms suitable for imputing randomly missing values.

  • Class imbalance appears to be a contributor to the models’ low precision values. The some action model, which was trained on a near-balanced dataset, achieved the highest precision (61 %), compared to the models that were not trained on balanced datasets (eh support: 38 % precision; no action: 12 % precision).

  • This paper focuses on the needs and characteristics of individuals and models their service requirements accordingly. This approach does not take into consideration the broader family group and how complex interrelationships within it may impact the requirements of each individual. With EHS being a whole-family intervention service, there is a future piece of work to understand those interrelationships and identify requirements at the family level.

Impact on social care

The findings of this paper provide an entry point for local authorities into using AI to support the optimal provision of EHS. Whilst acknowledging the limitations and the need to approach implementation very carefully, this is a positive step on the long road to incorporating AI into the decision-making process within EHS and, potentially, the broader remit of children’s social care. At this early stage, a suitable use-case for the model would be to provide additional, data-driven support to the triage process, placing the AI outputs alongside the descriptive referral case notes and information collected by front-line workers. With the focus throughout this paper on providing explainable AI models, a softer benefit would be to build practitioners’ confidence in, and understanding of, AI and the benefits it could bring to their daily decision-making.

In addition to providing a more complete understanding of the needs of those referred to EHS, the model also has the potential to help identify those who need support but have not been referred. With the focus of EHS being to provide support and intervention before issues escalate, identifying this group and acting accordingly would be expected to reduce the requirement for higher intensity support later on. Given the current limitations of the model, such an approach would need careful consideration as to how it would fit into existing referral processes. It is not considered justifiable, certainly at this stage, for referrals and allocation of provision to be driven by AI.