Estimation of default and pricing for invoice trading (P2B) on crowdlending platforms

This study developed several machine learning models to predict defaults in the invoice-trading peer-to-business (P2B) market. Default prediction was evaluated using logistic regression, conditional inference trees, random forests, support vector machines, and neural networks. The results showed that these techniques can effectively improve the detection of defaults by up to 56% while maintaining levels of specificity above 70%. Unlike other studies on the same topic, this study used sampling techniques to address class imbalance and different time periods for the training and test datasets to ensure intertemporal validation and realistic predictions. For the first time, default explainability in the invoice-trading market was studied by examining the impact of macroeconomic factors and invoice characteristics. The findings highlighted that gross domestic product, exports, trade type, and trade bands are significant factors that explain defaults. Furthermore, the pricing mechanisms of P2B platforms were evaluated with the observed and implicit probabilities of default to analyze the price risk adjustment. The results showed that price reflects a significantly higher implicit probability of default than observed default, which in turn suggests that underlying factors exist besides the borrowers’ probability of default.


Introduction
Crowdlending has emerged primarily in the last decade as a prominent topic in FinTech-related studies and is set to continue growing (Liu et al. 2020). Online invoice trading is a subfield of crowdlending, a digital market that has experienced exponential growth (Ziegler et al. 2017) and helps businesses finance their working capital. It is a niche segment of the broader P2B market and consists of the financial discount on an invoice via a platform (lender) in exchange for the payment of commissions or fees. Usually, invoices are financed by many investors (crowdlenders), and the platform analyzes the risk of the transaction, establishing a rating and price. The pricing mechanism can be decided in an auction style or by fixed prices set by the platforms. Most of these platforms have evolved towards a fixed-price regime (Dorfleitner et al. 2017). Estimating the default probability is essential for investors and the pricing mechanism of securities (Carmichael 2014). The increased availability of open-source datasets and advances in new techniques have boosted interest in default predictions (Turiel and Aste 2020). Unlike peer-to-peer (P2P) platforms, which provide a great deal of information about debtors, the main invoicing platform (Kriya) offers more limited information, complicating external default evaluation. Additionally, invoice trading is mainly dedicated to short-term financing, an essential characteristic that differentiates it from other lending operations in terms of risk.
Most research on crowdlending has focused on P2P financing (Carmichael 2014; Serrano-Cinca et al. 2015; Zhu et al. 2019), whereas few studies have addressed the P2B market. Therefore, the first objective of this study is to help investors determine the probability of default (PD) by developing models using publicly available information. To the best of our knowledge, only Dorfleitner et al. (2017) have published a study that focused on estimating the default of an invoice trading platform (Kriya) using logit and Tobit models. Unlike Dorfleitner et al. (2017), our approach focuses on rectifying the imbalance of classes using sampling techniques that allow for the correct classification of defaults with unbiased models. Any model that does not consider this problem is biased towards the majority class (Kotsiantis and Pintelas 2004; Bastani et al. 2019), making it unsuitable for default prediction. Furthermore, greater robustness of the models was provided by ensuring intertemporal validation. This enabled the generation of realistic predictions for future test samples. To the best of our knowledge, this study represents a pioneering application of machine-learning methodologies in the invoice-lending market. We compared the performance of logistic regression, conditional inference trees, conditional inference forests, random forests, support vector machines, and neural networks for predicting defaults, similar to Li et al. (2020) for credit ratings.
This study is also the first to examine the influence of new macroeconomic factors and invoice characteristics on the default rate in the invoice trading market. Through this approach, a new set of variables that are significant determinants for predicting default was identified: gross domestic product (GDP), exports, trade type, and trade band. The current findings demonstrated that machine-learning techniques can improve default detection by over 56%. Furthermore, the use of sampling techniques to address the imbalance between the two classes produces good results in the detection of loan default, with sampling improving the correct prediction of defaults by more than 50% compared with scenarios without sampling.
Price setting by the platform, using the implicit and observed probability of default (IPD and OPD, respectively) to analyze the overcharging of sellers, is also discussed for the first time in this study. As such, this study assessed whether the platform's prices correspond to the credit risk inherent to transactions in the context of a diversified retail portfolio. If the charges are not adjusted to the risk of the transactions, this indicates that companies are willing to pay a high interest rate because they are unable to obtain competitive financing from other financial sources, or that they greatly value the flexibility of these platforms. The results revealed a significant distortion between OPD and IPD.
The remainder of this study is structured as follows. Sect. "Literature review" reviews the related literature. Sect. "Dataset and descriptive analysis" provides a descriptive analysis and details of the data preprocessing performed. Sect. "Predicting invoice-lending default" discusses the determinants used to predict default using logistic regression, including an evaluation comparing the predictive capabilities of the model across various sampling techniques. It also compares the performance of different machine learning alternatives in terms of predicting defaults. Sect. "Pricing mechanism with implicit probability of default" examines the platform's credit-pricing method using IPD and OPD. Finally, Sect. "Discussion of results" discusses the findings, and conclusions are presented at the end of this paper.

Literature review
P2P lending has attracted considerable attention from the academic community over the past decade. Most research on this topic has used logistic regression to predict the probability of default and/or profitability of loans. Most of the research was performed using publicly available data from the Lending Club platform, with the remainder obtained from a small set of platforms.
Early research was conducted by Carmichael (2014), who predicted the probability of default and expected returns from P2P lending. The former was estimated using a dynamic logistic regression, and the log of income, recent credit inquiries, loan purpose, loan amount, credit score, and subgrade were the most significant explanatory factors. Variables gathered from borrowers' loan descriptions, such as whether the description lacked complete sentences or claimed that the author was creditworthy, were also significant in explaining default. The full model performed better than the Lending Club subgrade. Meanwhile, the model that did not include the subgrade as an explanatory variable performed similarly to the Lending Club subgrade. To estimate the expected returns, the probability of early repayment and the principal repaid given default were calculated. The first was modelled with a dynamic logistic regression using the same regressors as for default, whereas the second was estimated with ordinary least squares. The expected return on the lowest-risk subgrade (A1) loans was 5%. This increased steadily to a maximum of 11% for mid-risk loans (D2), and then decreased to 10% for the highest-risk loans (E5). Similarly, Li et al. (2016) estimated the probability of prepayment in addition to the probability of default. They used multivariate logistic regression incorporating macroeconomic factors and found that the factors explaining default and prepayment were very similar. Specifically, loan features, macroeconomic factors, and most borrower characteristics were significant. Their results showed that high interest rates are not only associated with higher probabilities of default but also with higher probabilities of prepayment, as borrowers do not want to bear the costs associated with them.
Serrano-Cinca et al. (2015) studied the determinants of default in P2P lending. Using a hypothesis test and survival analysis, they first determined the most significant factors for estimating default, namely, loan purpose, annual income, current housing situation, credit history, and indebtedness. Subsequently, several logistic regressions were performed to predict the probability of default based on the previously determined factors. Their results showed a clear relationship between the Lending Club subgrade and the probability of default, where the subgrade was the variable with the highest predictive capability. Furthermore, interest rates appeared to depend on the grade assigned: the higher the interest rate, the higher the probability of default. These results are similar to those of Möllenkamp (2017), who used binary logistic regression to investigate the determinants of P2P loan performance based on different credit grades. The results showed a positive relationship between credit grade and loan performance, wherein a higher credit grade was related to a lower probability of default. As for loan performance, credit grade was the most influential factor. Loan amount and annual income were also significant predictors, whereas all other variables lost significance in forecasting. Regarding the determinants of default, Avgeri and Psillaki (2023) explored borrower-related and macroeconomic factors in the US P2P market using logistic regression. Their study suggests that the higher the percentage change in the house price, consumer sentiment, and S&P500 indices, the lower the delinquency. Unemployment and GDP affected the rate of default. Nigmonov et al. (2022) used a probit model and found that for the same market, the higher the interest rates and inflation, the higher the probability of default. Moreover, the effect of interest rates on default is significantly higher for loans with lower ratings.
Other authors have attempted to predict the probability of default using a different set of methodologies. For example, Zhu et al. (2019) compared a random forest model with other statistical techniques and discovered that the former outperforms decision trees, logistic regression, and support vector machines with an outstanding 98% level of accuracy. They used a synthetic minority oversampling technique (SMOTE) to solve the imbalance problem in the sample. Similarly, Malekipirbazari and Aksakalli (2015) compared different statistical techniques, such as random forests, support vector machines, logistic regression, and the nearest neighbor algorithm, to predict the default rate of P2P lending. Their results showed that random forests outperformed Fair Isaac Corporation (FICO) credit scores, the Lending Club subgrade, and other methodologies for identifying good borrowers, with an accuracy level of 78%. Moreover, based on their findings, although this technique was highly suitable for identifying good borrowers, misclassifications existed for some borrowers who were erroneously deemed bad. Borrower status was also studied by Fu (2017), who combined random forests with neural networks and compared the performance of each model alone and together, where the probability of default was estimated by one model and then given to the other for re-estimation. Preprocessing was also performed, where the data were first normalized, and old observations were discarded. This combined technique, along with data preprocessing, considerably improved accuracy and outperformed the Lending Club subgrade. The highest accuracy was achieved by first using a neural network and then using random forests. A similar method was used by Kim and Cho (2018), who proposed a deep dense convolutional network (DenseNet) for default predictions in P2P lending. Their findings revealed that this model could achieve a relatively high level of accuracy (79.6%) and reduce overfitting compared with other convolutional neural networks.
Furthermore, Turiel and Aste (2020) developed several artificial intelligence models to predict loan rejection and estimate the probability of default, proving that artificial intelligence can increase accuracy and default risk by 70%. They also proposed the separation of small business subsets to increase the performance of default predictions. Ko et al. (2022) proposed a wide range of prediction models to mitigate the risk of default and asymmetric information on P2P lending platforms, stating that LightGBM outperformed the other methodologies, with a model accuracy of 68.57% and a revenue improvement of 23.8 million US dollars. They argue that the Lending Club, despite being the largest P2P lending platform in the USA, still has a high rate of default, which proves its ineffectiveness when classifying debt. Similarly, Muslim et al. (2022) applied an improved LightGBM based on swarm algorithms to predict default rates on P2P platforms. Their study indicated that the performance increased after feature selection using a swarm algorithm, with LightGBM + ACO achieving the highest level of accuracy (95.64%).
In terms of profitability, the work of Serrano-Cinca and Gutiérrez-Nieto (2016) is worthy of mention; they used a multivariate regression along with a decision tree model (CHAID) to develop a profit scoring system. The internal rate of return (IRR) of each loan was used as a profitability measure. Using exploratory data analysis and multivariate linear regression, they found that the explanatory factors for predicting default and profitability differed. Similar to other studies, the Lending Club subgrade was found to be significant. Although the loan purpose and housing characteristics were generally significant, they seemed to be more significant in predicting profitability. Moreover, a nonlinear relationship (inverted and U-shaped) existed between the internal rate of return and its factors. With respect to profitability, the decision tree had a mean internal rate of return of 5.98%, outperforming the Lending Club's mean of 3.92%. Their study also suggested that nonlinear data mining techniques can be useful for developing profit-scoring systems.
Similarly, Guo et al. (2016) predicted the expected return on a loan using an instance-based model (IOM), by considering the investment decision on a P2P platform as a portfolio optimization problem with boundary constraints. They compared the results of this instance-based model with those of two rating-based models, the restricted Boltzmann machine (RBM) and RBM+, and showed that the IOM outperformed them in terms of prediction accuracy and investment performance. The probability of default was estimated using logistic regression, which was then applied to the IOM to calculate the expected returns. Similarly, Bastani et al. (2019) proposed a two-stage system for credit and profit scoring. In Stage 1, they attempted to identify non-default loans, which were then moved to Stage 2, where they estimated profitability using the internal rate of return. Wide and deep learning, an approach developed by Google, was used to build predictive models in both stages. This model can achieve both memorization and generalization, avoiding the overgeneralization frequently observed in deep learning modelling techniques. The factors predicting default and the internal rate of return were similar to each other and those studied by Serrano-Cinca et al. (2015) and Serrano-Cinca and Gutiérrez-Nieto (2016), respectively. These two studies used different statistical techniques to address the imbalance problem of the sample, namely, random undersampling, random oversampling, and SMOTE. Nonetheless, the latter method appeared to perform consistently better than the other two. Their results indicated that the proposed scoring approach outperformed existing credit and profit scoring approaches, and the combination of wide and deep learning with the SMOTE method achieved the highest performance.
Elliott and Curet (1999) devised the first framework for invoice discounting, a generic term for financing solutions that use invoices as collateral for loans. They suggested using an inductive algorithm such as case-based reasoning (CBR) and noticed a lack of knowledge regarding invoice-discounting cases. To the best of our knowledge, few similar studies have been published to date. Dorfleitner et al. (2017) estimated default events in the online invoice trading market using logit models. Their study suggests that interest rates, duration, and advance rates are determinant factors in predicting default, where the higher the gross yield and the longer the time to maturity, the higher the probability of default and the loss rate. However, the advance rate is negatively related to default, reflecting the platform's ability to assess sellers' creditworthiness. This implies that the larger and more creditworthy a company is, the more easily it can obtain greater loans. It has also been concluded that risk can be effectively reduced by diversifying the portfolio of invoices and that investors prefer to deal with a higher risk with higher interest rates rather than lower advance rates. Furthermore, the default and average net returns are lower in a fixed-price regime than in an auction price mechanism. Zhang and Thomas (2015) considered the merits of including economic variables in a logistic regression-based credit scorecard in an invoice-discounting context. By doing so, they wanted to directly estimate a short-term, dynamic version of the probability of default (i.e., Point-in-Time (PIT) probability of default). Perko (2017) also adapted the definition of "probability of default" as a more short-term and fine-grained concept; the outcome should be predicted at a fixed timeframe of 30 days in advance, rather than overdue days.
Other authors have extensively evaluated credit scoring systems and the estimation of default in commercial credit. However, this study focuses on the estimation of the probability of default and pricing optimization on crowdlending platforms; therefore, those authors are not covered in this literature review. A summary of related bodies of research can be found in Table 9 in the Appendix, which includes additional information regarding the accuracy of the models and the samples used.

Institutional background and data
We used publicly available data from the crowdlending platform Kriya, whose headquarters are in the United Kingdom (UK). The company was formerly known as MarketInvoice and was later rebranded as MarketFinance before being given its current name; it specializes in factoring and P2B. According to the information provided on their website, as of November 2023, they had funded around 3.4 billion pounds sterling (GBP) to small and medium-sized enterprises (SMEs), with an expected net yield of between 4 and 6% and a loan collection rate of 78.01%. After data cleansing, the dataset contained 46,761 observations and 26 variables with loans from March 2011 to December 2019. Some original variables, such as trade payment, trade settlement, and delinquent dates, were dropped because using them for predictive purposes was not sensible. This was also the case for the currency variables; this study focused on GBP given the low number of observations in other currencies. Meanwhile, the advance date was subtracted from the trade expected pay date, creating the variable "days", which represents the maturity of the transaction. In addition, the following macroeconomic variables were tested for model fitting: GDP, exports, import price index, producer price index, consumer price index, money supply, M1, US/GBP spot exchange rate, and EUR/GBP spot exchange rate. However, after feature selection, only the first three showed significance and were considered for further analysis. The information provided by the platform was rather limited: no variables existed regarding financial grade, income, or solvency, and financial information about the borrower, seller, or protection provider was missing. As such, it was necessary to study the extent to which the available information allowed for an analysis of the probability of default and, consequently, whether investors could accurately assess risk, as can be done on other crowdlending platforms that do not restrict this type of information.
Table 10 lists all the variables used in the present study. In addition to the invoice variables (advance rate, annualized gross yield, total face value, days) sourced from the Kriya database, a set of macroeconomic variables (GDP, exports, and import price index) were considered for the model fitting, while the remaining variables were only used in the descriptive analysis.

Descriptive analysis
A summary of the variables is presented in Tables 1 and 2. Considering the mean values, a typical invoice was funded at a 76% rate, with an annualized gross yield of 11% and a maturity of 55 days. The platform advance rate ranged from 2.5% to the total value of the invoice, with yields reaching as high as 49% and maturities of over a year in some cases. However, the invoice amount varied from a few thousand to several million GBP, indicating that the platform financed a wide range of businesses in terms of income. The total crystallized losses were very low, with a mean absolute value of 108 GBP, mainly because of the low default rate of 1.7%. Regarding macroeconomics, the UK had a steady mean GDP growth rate of 0.89%. Exports experienced a slightly higher growth rate of 1.86%, and the import price index did not change significantly, considering the mean values. Regarding the categorical variables, the most common trade operation was the standard discount (80.9%), wherein the platform discounted a single invoice with negotiated conditions for the seller, followed by the entire ledger (12.8%), wherein the sellers had a special agreement with the platform to discount their entire portfolio of invoices. This is probably why some of the funded invoices had very low face values and maturity. Most loans were traded in Bands 5 (23.7%), 4 (15.3%), and 1 (13.3%), which may be due to the industry in which the borrower is positioned. In addition, 88% of these loans were in "repaid" status and only 11.4% were repurchased, while other statuses were virtually non-existent.
Table 3 presents the correlation matrices between variables. In general, the data did not exhibit multicollinearity. Exports and the import price index were the most noticeable pair of correlated variables. When the import price index increased, the production costs of companies increased, which in turn affected their competitiveness and reduced their export levels, thus creating a negative correlation. Maturity was also negatively correlated with advance rate and annualized gross yield. In the first case, the platform may have been trying to reduce the risk involved in a higher fund rate by providing a lower maturity period, limiting investors' exposure to it. In the second case, because high interest rates with larger periods of maturity result in greater costs for the seller, many of these loans were repaid as soon as possible, sometimes before the expected payment date. This allows for invoices with long periods of maturity and low interest rates to be obtained, mainly because they are repaid earlier than expected. However, those repaid over longer timeframes usually have lower interest rates to compensate for the costs incurred. Financially healthy companies can do this if they are not overcharged interest when assuming a longer period of maturity, thus creating this negative relationship. The platform's correct credit risk assessment is also supported by the fact that the advance rate and annualized gross yield are negatively correlated, such that higher fund rates are granted with lower interest rates to financially healthy companies, and vice versa.
Outliers primarily affected the invoice variables, as shown in Fig. 1. The dataset contains several extreme outliers that had to be treated in the training dataset, because logistic regression can be very sensitive to them, so that no bias was introduced during model fitting. Nonetheless, the test dataset remained untouched so that realistic predictions could be made using any type of data point. Consequently, the top and bottom 10% of the training data were winsorized for model fitting.
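The training-only winsorization described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the linear-interpolation percentile rule, and the face values are all hypothetical. The clipping limits are learned from the training data alone, and the test data are left as they are.

```python
def percentile(sorted_vals, q):
    """Linear-interpolation percentile of an already sorted list, q in [0, 1]."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def fit_winsor_limits(train_values, tail=0.10):
    """Learn the lower and upper clipping limits from the training data only."""
    ordered = sorted(train_values)
    return percentile(ordered, tail), percentile(ordered, 1 - tail)


def winsorize(values, limits):
    """Clip every value to the [lower, upper] limits."""
    lower, upper = limits
    return [min(max(v, lower), upper) for v in values]


# Hypothetical face values (GBP); the last training value is an extreme outlier.
train_face_value = [1_000, 2_000, 3_000, 4_000, 5_000, 6_000,
                    7_000, 8_000, 9_000, 10_000, 5_000_000]
test_face_value = [500, 7_500, 9_000_000]

limits = fit_winsor_limits(train_face_value, tail=0.10)
train_winsorized = winsorize(train_face_value, limits)
# The test set is deliberately not winsorized, mirroring the text.
print(limits, train_winsorized)
```

Fitting the limits on the training data only keeps information about the 2019 test period out of model fitting.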

Pre-processing and sampling
First, the data were cleaned by rejecting missing or irrelevant observations. Loans pending repayment were discarded; only completed loans were considered, that is, those that had been fully paid, partially recovered, or completely lost. Two levels of the categorical variable trade type were excluded because they had very low representation in the dataset. Given the divergence in variable scales, we standardized the variables by subtracting their mean and dividing by their standard deviation. Following these adjustments, feature selection was performed using the information gain ratio and chi-squared filters. Given a training set $S$ that is partitioned into $V$ subsets $S_1, \dots, S_V$ according to the $V$ different values of feature $X$, the mutual information of feature $X$ and class $Y$ was defined based on Kotsiantis and Pintelas (2004, p. 50) as

$$I(Y; X) = H(S) - \sum_{v=1}^{V} \frac{|S_v|}{|S|}\, H(S_v) \qquad (1)$$

where $H(\cdot)$ denotes the entropy of the class $Y$ within a set. However, the information-gain filter is strongly biased toward features with many different values. This can be corrected using the following calculation, which represents the potential information generated by partitioning $S$ into $V$ subsets:

$$\mathrm{SplitInfo}(X) = -\sum_{v=1}^{V} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}$$

The information gain ratio expresses the proportion of information gained by the partition:

$$\mathrm{GainRatio}(X) = \frac{I(Y; X)}{\mathrm{SplitInfo}(X)}$$

The chi-square filter, $\chi^2_c$, measures the divergence of the feature distribution by comparing the observed and expected values:

$$\chi^2_c = \sum_{i} \frac{(O_i - E_i)^2}{E_i}$$

where $c$ is the number of degrees of freedom, $O_i$ is the observed value, and $E_i$ is the expected value. The statistics were compared using a chi-square table.
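The two filters described above can be sketched in a few lines of standard-library Python. This is an illustrative sketch only: the function names, the toy `trade_type` and `noise` features, and the four-row sample are assumptions made for the example and do not come from the study's data.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gain_ratio(feature, labels):
    """Information gain of `feature` about `labels`, divided by the split information."""
    n = len(labels)
    groups = {}
    for x, y in zip(feature, labels):
        groups.setdefault(x, []).append(y)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - conditional
    split_info = -sum(len(g) / n * math.log2(len(g) / n) for g in groups.values())
    return info_gain / split_info if split_info > 0 else 0.0


def chi_square(feature, labels):
    """Pearson chi-square statistic and degrees of freedom of the contingency table."""
    n = len(labels)
    observed = Counter(zip(feature, labels))
    feature_totals = Counter(feature)
    label_totals = Counter(labels)
    stat = 0.0
    for x in feature_totals:
        for y in label_totals:
            expected = feature_totals[x] * label_totals[y] / n
            stat += (observed[(x, y)] - expected) ** 2 / expected
    dof = (len(feature_totals) - 1) * (len(label_totals) - 1)
    return stat, dof


# Toy data (hypothetical): one feature that separates the classes and one that does not.
default = ["No", "No", "Yes", "Yes"]
trade_type = ["standard", "standard", "ledger", "ledger"]
noise = ["x", "y", "x", "y"]

print(gain_ratio(trade_type, default), gain_ratio(noise, default))
print(chi_square(trade_type, default))
```

In this toy sample the separating feature reaches the maximum gain ratio and the uninformative one scores zero, which is the ranking behavior a filter of this kind relies on.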
All the metrics above reported similar results. The variables that were not significant were excluded from the analysis. Those that were finally included in the model development are shown in Fig. 2.
Fig. 2 Feature selection. Note. The results were obtained using the information gain ratio and chi-square filters. The definitions of the variables are listed in Table 3

The data were split into training and testing sets with a 75-25% ratio, respectively. The test dataset collected only loans from 2019, whereas the training dataset gathered a wider selection of older data to ensure intertemporal validation of the models (Lau 1987; Serrano-Cinca et al. 2015). Outliers were handled in the training dataset only by winsorizing the top and bottom 10% of the invoice variables, as mentioned previously. The target variable of this study represented loan status and had two classes: "Yes" if the loan was in default and "No" if the loan was not in default. The positive class, also known as the minority class, represented 1.7% of the total loans, while the negative class, also known as the majority class, accounted for 98.3%.
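The intertemporal split can be illustrated with a short sketch. The record layout, the dates, and the `temporal_split` helper are hypothetical; the only element taken from the text is that the test set holds 2019 loans and the training set holds older ones.

```python
from datetime import date

# Hypothetical loan records: (advance_date, defaulted).
loans = [
    (date(2016, 5, 1), False),
    (date(2017, 8, 15), True),
    (date(2018, 11, 30), False),
    (date(2019, 1, 10), False),
    (date(2019, 6, 20), True),
]


def temporal_split(records, cutoff):
    """Train on loans advanced before `cutoff`, test on loans from `cutoff` onward."""
    train = [r for r in records if r[0] < cutoff]
    test = [r for r in records if r[0] >= cutoff]
    return train, test


train, test = temporal_split(loans, cutoff=date(2019, 1, 1))
print(len(train), len(test))
```

Unlike a random split, every test observation here is later in time than every training observation, so the evaluation mimics predicting future loans.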
The dataset suffered from a severe class imbalance problem that would have biased the predictive models towards the majority class (Bastani et al. 2019). "A classifier derived from an imbalanced dataset typically has a low error rate for the majority class and an unacceptable error rate for the minority class" (Kotsiantis and Pintelas 2004, p. 48). However, in this case, the misclassification cost for the minority class was higher than for the majority class. That is, the cost of accepting a defaulting loan was higher than that of rejecting a non-defaulting loan because the former entails an actual loss, whereas the latter implies an opportunity cost. To resolve the class distribution issue, three well-known sampling techniques (random undersampling, random oversampling, and SMOTE) were applied and compared.
The random undersampling technique balances classes by randomly eliminating observations from the majority class. This technique can eliminate potentially useful information from the analysis, whereas in random oversampling, the replication of observations in the minority class increases the likelihood of overfitting (Kotsiantis and Pintelas 2004). In both techniques, a parameter controls the final ratio of the classes. Finally, in the synthetic minority oversampling technique, new observations are not replicated but are synthetically derived from the original observations in the minority class. They are created along an imaginary segment connecting each minority class observation with its K nearest neighbors (K-NN). K-NN calculates the distance between the current observation and all other observations and selects the K observations with the smallest distances. This study involved both categorical and numerical data; thus, the Gower distance (Gower 1971) was used. It is estimated by comparing two individuals, $i$ and $j$, on a character $k$. A score $S_{ijk}$ of zero is assigned when they differ, and a positive fraction or unity when they have some degree of similarity. The possibility of making comparisons is represented by a quantity $\delta_{ijk}$, which is 1 when character $k$ can be compared for $i$ and $j$ and 0 otherwise. The similarity between $i$ and $j$, $S_{ij}$, can be expressed as the average score obtained over all possible comparisons:

$$S_{ij} = \frac{\sum_{k} \delta_{ijk}\, S_{ijk}}{\sum_{k} \delta_{ijk}}$$
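A minimal sketch of the Gower similarity for mixed data follows. Several choices here are assumptions made for illustration: numeric characters are scored as one minus the absolute difference divided by the feature's range, categorical characters score 1 on a match and 0 otherwise, and a missing value (`None`) marks a character that cannot be compared. The function name, feature ranges, and invoice values are hypothetical.

```python
def gower_similarity(a, b, ranges):
    """Average per-character similarity over all comparable characters.

    `ranges[k]` is the numeric range of character k, or None if k is categorical.
    """
    total, comparable = 0.0, 0
    for x, y, r in zip(a, b, ranges):
        if x is None or y is None:
            continue  # delta_ijk = 0: the character cannot be compared
        if r is None:
            score = 1.0 if x == y else 0.0       # categorical character
        else:
            score = 1.0 - abs(x - y) / r if r > 0 else 1.0  # numeric character
        total += score
        comparable += 1
    if comparable == 0:
        raise ValueError("no comparable characters")
    return total / comparable


# Hypothetical invoices: (advance rate, days to maturity, trade type).
ranges = (0.5, 100, None)
invoice_i = (0.80, 30, "standard")
invoice_j = (0.60, 60, "ledger")

similarity = gower_similarity(invoice_i, invoice_j, ranges)
distance = 1.0 - similarity  # the distance used for the neighbor search
print(similarity, distance)
```

Because each character contributes a score between 0 and 1, numeric and categorical variables are placed on a common scale before the nearest neighbors are selected.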
After obtaining the K nearest neighbors, the difference between the observation under consideration and its nearest neighbor is multiplied by a random number between 0 and 1 and added to the original observation (Chawla et al. 2002), which "causes the selection of a random point along the line segment between two specific features" (Chawla et al. 2002, p. 328).
The equation below (Zhu et al. 2019) represents the SMOTE calculation, where $x_i$ is the original observation, $x_n$ is one of its K nearest neighbors, $R \in [0, 1]$ is a random number, and $x_{new}$ is the resulting artificial observation:

$$x_{new} = x_i + (x_n - x_i) \times R \qquad (5)$$

To select the best ratio between the minority and majority classes, multiple scenarios were tested using logistic regression as a reference and the F1-score as the maximization target. Specifically, logistic regression with SMOTE was evaluated using the five nearest neighbors at a rate of 29 with different balancing rates (ranging from 1 to 50%). The F1-score peaked at balancing rates of approximately 10% and 36%, although the latter reported a considerably higher sensitivity.
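The interpolation step of SMOTE can be sketched as below. This is a simplified illustration and not the procedure used in the study: it handles numeric features only, uses squared Euclidean distance instead of the Gower distance, and draws the base observation at random rather than applying a fixed oversampling rate. The `smote` function, its parameters, and the sample points are hypothetical.

```python
import random


def smote(minority, k=5, n_new=10, seed=0):
    """Create `n_new` synthetic points on segments between minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x_i = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to x_i.
        others = [p for p in minority if p is not x_i]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(p, x_i)))
        x_n = rng.choice(others[:k])   # one of the k nearest neighbors
        r = rng.random()               # random number in [0, 1)
        # x_new = x_i + (x_n - x_i) * R, applied coordinate by coordinate.
        synthetic.append(tuple(a + (b - a) * r for a, b in zip(x_i, x_n)))
    return synthetic


# Hypothetical standardized minority observations (two numeric features).
minority = [(0.0, 0.0), (1.0, 0.5), (0.5, 1.5), (2.0, 2.0)]
synthetic = smote(minority, k=2, n_new=20, seed=1)
print(synthetic[:3])
```

Every synthetic point lies on a segment between two existing minority observations, so no new point falls outside the region spanned by the observed defaults.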
Typically, in an imbalanced dataset, accuracy is not a desirable metric for model comparisons (Bastani et al. 2019).Therefore, this study focused on precision3 and sensitivity4 ; the former penalizes the misclassification of negatives, whereas sensitivity is only to the detriment of the misclassification of positives.The F1-score is the harmonic mean of both metrics and gives them equal importance, thus influencing our decision to use it for model parameter optimization and final selection between models.
where TP is the actual number of positives, FP the number of false positives, and FN the number of false negatives.
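For easier comparison with other studies, other commonly used metrics were also provided (specificity, accuracy, and McFadden's $R^2$):

$$\text{Specificity} = \frac{TN}{TN + FP}, \qquad \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad R^2_{\text{McFadden}} = 1 - \frac{\ln L_c}{\ln L_{null}}$$

where TN is the number of true negatives, $L_c$ is the (maximized) likelihood value from the current fitted model, and $L_{null}$ is the corresponding value for the null model (with only an intercept and no covariates).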
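The metrics above can be computed directly from the four cells of a confusion matrix. The sketch below uses the standard definitions; the function names and the counts in the example are hypothetical and are not taken from the study's tables.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Standard confusion-matrix metrics; default is the positive class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "f1": f1,
    }


def mcfadden_r2(loglik_model: float, loglik_null: float) -> float:
    """McFadden's pseudo R-squared from two maximized log-likelihoods."""
    return 1.0 - loglik_model / loglik_null


# Hypothetical counts: 100 defaults and 1,000 non-defaults in a test set
metrics = classification_metrics(tp=56, fp=240, tn=760, fn=44)
```

With heavily imbalanced classes, such an example shows how sensitivity and specificity can both look reasonable while precision, and therefore the F1-score, stays low.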
As previously mentioned, different balancing rates were considered using logistic regression and SMOTE (using the five nearest neighbors at an oversampling rate of 29). The best results for F1-score and sensitivity were obtained at a balancing rate of 36%. To provide a fair comparison between the undersampling and oversampling techniques, we used rates of 0.345 and 29 for the former and latter, respectively. At these rates, the three sampling techniques yielded three datasets, all with a minority class ratio of 36%. These datasets were used to evaluate the model performance of the sampling techniques with the logistic regression (Table 5). After the entire preprocessing, the final training dataset contained 1,830 observations with the undersampling technique and 53,064 observations with the oversampling techniques. The experimental design is illustrated in Fig. 3.

Logistic regression
This section analyzes the determinants of default using a binary logistic regression model, which relates the predictor variables, $X = \{x_1, \ldots, x_i\}$, to the categorical response variable, $Y = \{0, 1\}$. The predicted scores ($\pi$) and observed outcomes ($Y$) are compared through the maximization of a log-likelihood (LL) function. Table 4 presents the coefficient estimates, standard errors, and significance of the logistic model. Unstandardized coefficients were used to study the direct effects of the variables on default status and standardized coefficients to rank the predictors. These results show that annualized gross yield, advance rate, and maturity were significant in determining default, in line with Turiel and Aste (2020) for the P2P market and Dorfleitner et al. (2017) for the P2B market. Gross yield has a positive effect on the probability of default, which is consistent with the view that borrowers who accept higher interest rates thereby acknowledge their lower creditworthiness (Stiglitz and Weiss 1981; Dorfleitner et al. 2017). Furthermore, a negative relationship exists between the advance rate and default, which was also found by Dorfleitner et al. (2017), showing that the advance rate can be an effective mechanism to dissuade debtors that the platform considers riskier from defaulting. Furthermore, the nominal value was negatively related to default; thus, lower face-value invoices were more likely to be unpaid. As in the previous case, the platform correctly assesses risk by granting higher capital to companies that are less likely to default, thereby creating a negative relationship. However, maturity is positively linked to default, with the probability of default increasing in longer-term transactions, similar to Dorfleitner et al. (2017). Trade type is highly significant in determining default when it is a standard or multi-debtor transaction. A strong positive relationship exists between these types of transactions and defaults. Furthermore, GDP and exports have positive and significant relationships with default. This may seem counterintuitive at first, because it implies that positive economic cycles are associated with a higher rate of default. Nonetheless, because lending standards usually fluctuate with the economic cycle, in stable cycles, lending standards may decrease, making it easier to access credit even for borrowers with higher probabilities of default. As for the trade bands, band 10 was highly significant, while bands 4 and 9 were significant only at the 10% level. Considering the standardized coefficients, the following variables were ranked as the most important based on the size of their effect: trade type (multi-debtor and standard), trade band 10, total face value, annualized gross yield, and maturity.

Table 4 Logistic regression estimations for probability of default
This table reports the results of the logistic regression estimates for the model presented at the beginning of this section. The advance rate is the percentage of an invoice's face value that a factor pays upon its purchase. Annualized gross yield is the actual yearly rate of return earned on factoring, considering the effects of compounding interest. Total face value (with log transformation) is the total amount of the invoice (in GBP). Days, or maturity, is the number of days elapsed from the advance date to the expected pay date. GDP, Exports and IPI are the gross domestic product, level of exports, and import price index in the UK, respectively. Trade type reflects the type of operation (standard, multi-debtor, whole ledger, or license fee). The trade band reflects the borrower's industry. The intercept is the constant term. The total number of observations was 46,761. The McFadden's $R^2$ is 0.072 and 0.073, respectively. *Significant at 10%; ** significant at 5%; *** significant at 1%
For the model predictions, Table 5 shows the results in terms of performance metrics with a 36% balanced minority class ratio. Better models were achieved with greater balance, at a minority class share below 50%. Moreover, increasing the share to higher ratios can bias the results towards very high sensitivity rates at the cost of poor specificity. The data were randomly sampled; thus, the mean of 200 iterations was calculated.
In general, a high level of overall accuracy was obtained when no sampling techniques were used; however, the resulting sensitivity rate was not acceptable. A clear difference existed between the results using sampling techniques and those that did not, especially in terms of sensitivity, where the former correctly classified 50% more loan defaults. All sampling techniques reported similar results, although SMOTE slightly outperformed the others for almost all metrics, which is in accordance with Bastani et al. (2019). It correctly classified 56% of the loans in default, while preserving its capacity to do so for 76% of the non-defaulted loans. A relatively good area under the curve (AUC) score of 73.2% was achieved using SMOTE, higher than that of any of the alternative models used, as shown in Table 5 and Fig. 5.

Predicting default with machine learning alternatives
This section analyzes the performance of alternative models in predicting the default rate in the invoice-lending segment. Our objective was to evaluate the precision of alternatives to logistic regression, such as conditional inference trees, random forests, support vector machines, and neural networks. Logistic regression reported a higher F1-score on the SMOTE dataset; thus, this dataset was used for fitting the alternative models. All models were validated using the same data split as in the logistic regression (75% training, 25% testing); however, the R mlr package used for modelling (Bischl et al. 2016) allowed for tenfold cross-validation within the training dataset, which we used to validate the models. This procedure splits the training dataset into ten folds, each of which is excluded in turn from model fitting and used to validate the performance of the model fitted on the remaining folds. Hyperparameter optimization was performed using a random search, with the objective of maximizing the F1-score (see Table 11 for the hyperparameter selection).
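The tenfold procedure described above can be illustrated with a short Python sketch that only generates the train and validation indices. It is not the mlr implementation; the function name, the shuffling seed, and the way observations are assigned to folds are assumptions.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_splits(n: int, k: int = 10,
                  seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # k roughly equal folds
    for held_out in range(k):
        validation = folds[held_out]
        train = [idx for f, fold in enumerate(folds) if f != held_out
                 for idx in fold]
        yield train, validation


# Each observation is used for validation exactly once across the ten folds
splits = list(k_fold_splits(25, k=10, seed=1))
```

In a random search, each sampled hyperparameter configuration would be fitted on every training part and scored (here, by F1-score) on the matching validation part, and the configuration with the best mean score would be retained.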

Conditional inference tree
Decision trees are nonparametric supervised learning algorithms that do not require special assumptions, making them versatile. Their main characteristic is that the feature space is recursively partitioned by grouping observations that have similar response values (Strobl et al. 2009). In other words, they attempt to describe the conditional distribution of the response variable $Y$, given a set of $m$ covariates $X$, by partitioning the feature space $\mathcal{X}$ of the covariates into $r$ disjoint nodes $B_1, \ldots, B_r$, where $\mathcal{X} = \bigcup_{k=1}^{r} B_k$ (Hothorn et al. 2006). The algorithm selects the variable from the covariate vector $X$ with the strongest association with $Y$, searches for the best split point, and splits the feature space on that variable into disjoint nodes. As the splitting process continues, the level of purity⁸ is observed. "In each node, the variable that is most strongly associated with the response variable (i.e., that produces the highest impurity reduction or the lowest p value) is selected for the next split" (Strobl et al. 2009, p. 327). For conditional inference trees, p values are used instead of entropy measures. In this study, the association between $Y$ and $X_j$, where $j = 1, \ldots, m$, was measured using the Bonferroni test, which formulates a global hypothesis of independence in terms of $m$ partial hypotheses, $H_0^j : D(Y \mid X_j) = D(Y)$. The variable with the lowest p value was selected. When insufficient evidence existed to reject $H_0$ at a prespecified level $\alpha$, the recursion was halted (Hothorn et al. 2006). Permutation tests were performed because the distribution $D(Y \mid X_j)$ is usually unknown. The split point $A^*$ was obtained with a test statistic maximized over all possible subsets $A$ (Hothorn et al. 2006), which measured the discrepancy between $\{Y_i \mid X_{ji} \in A\}$ and $\{Y_i \mid X_{ji} \notin A\}$.
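The variable-selection step can be illustrated with a simplified Python sketch. It is not the algorithm of Hothorn et al. (2006): the test statistic (absolute difference in group means), the Monte Carlo permutation scheme, the function names, and the toy data are all assumptions chosen to keep the example short. It only shows the logic of testing each covariate against a binary response, applying a Bonferroni adjustment, and stopping when no adjusted p value falls below $\alpha$.

```python
import random
from typing import Dict, List, Optional, Sequence, Tuple


def permutation_p_value(x: Sequence[float], y: Sequence[int],
                        n_perm: int = 2000, seed: int = 0) -> float:
    """Permutation p value for association between numeric x and binary y."""
    def statistic(labels: Sequence[int]) -> float:
        ones = [v for v, lab in zip(x, labels) if lab == 1]
        zeros = [v for v, lab in zip(x, labels) if lab == 0]
        return abs(sum(ones) / len(ones) - sum(zeros) / len(zeros))

    observed = statistic(y)
    rng = random.Random(seed)
    labels = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)                      # break any x-y association
        if statistic(labels) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def select_split_variable(covariates: Dict[str, List[float]], y: Sequence[int],
                          alpha: float = 0.05) -> Tuple[Optional[str], float]:
    """Pick the covariate with the lowest Bonferroni-adjusted p value.

    Returns (None, p) when the global null cannot be rejected at level
    alpha, which corresponds to halting the recursion.
    """
    m = len(covariates)
    adjusted = {name: min(1.0, m * permutation_p_value(col, y))
                for name, col in covariates.items()}
    best = min(adjusted, key=adjusted.get)
    if adjusted[best] <= alpha:
        return best, adjusted[best]
    return None, adjusted[best]


# Toy data: "x_good" separates the classes, "x_noise" does not
y = [0] * 10 + [1] * 10
data = {
    "x_good": [float(v) for v in range(20)],
    "x_noise": [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9, 3, 2, 3, 8, 4],
}
chosen, p_adj = select_split_variable(data, y)
```

Using p values as the selection criterion puts covariates measured on different scales on a common footing, which is the motivation given for conditional inference trees.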

Random forest
"A random forest is a classifier consisting of a collection of tree-structured classifiers $\{h(x, \Theta_k), k = 1, \ldots\}$, where $\{\Theta_k\}$ are independent identically-distributed random vectors and each tree casts a unit vote for the most popular class at input $x$" (Breiman 1996, p. 6). Random forests try to predict the response variable with a predictor $\varphi(x, L)$, where $x$ is the input vector and $L$ is a learning dataset. To obtain a better predictor, repeated bootstrap samples $L^{(B)}$ are taken from the learning dataset, each of them consisting of $N$ cases drawn at random with replacement, forming a sequence of predictors $\varphi(x, L^{(B)})$ (Breiman 1996, p. 123). Each predictor forms a decision tree that votes for the most popular class (mode) at input $x$. This procedure is known as bootstrap aggregation or bagging. In addition, random forests use feature bagging, in which a random subset of features is selected for each built tree. The trees in an ensemble tend to be correlated when a few strong predictors dominate; thus, feature bagging can reduce this correlation, making random forests more accurate. Conditional random forests are a special type of random forest wherein the trees used for bagging are conditional inference trees.
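The three ingredients described above (bootstrap samples, random feature subsets, and majority voting) can be sketched in Python as follows. Tree fitting itself is omitted; the function names and the value of `mtry` in the example are illustrative assumptions.

```python
import random
from collections import Counter
from typing import List, Sequence


def bootstrap_rows(n: int, rng: random.Random) -> List[int]:
    """Draw n row indices at random with replacement."""
    return [rng.randrange(n) for _ in range(n)]


def feature_subset(n_features: int, mtry: int, rng: random.Random) -> List[int]:
    """Draw mtry distinct feature indices (feature bagging)."""
    return sorted(rng.sample(range(n_features), mtry))


def majority_vote(votes: Sequence[int]) -> int:
    """Return the most popular class; ties go to the class seen first."""
    return Counter(votes).most_common(1)[0][0]


rng = random.Random(1)
rows = bootstrap_rows(50, rng)      # training rows for one tree
cols = feature_subset(9, 3, rng)    # features available to that tree
prediction = majority_vote([1, 0, 1, 1, 0])   # votes from five trees
```

Each tree would be fitted on its own `rows` and `cols`, and the forest prediction for a new invoice is the mode of the individual tree predictions.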

Support vector machine
In support vector machines, the input vectors are nonlinearly mapped to a high-dimensional feature space, where a hyperplane or a set of hyperplanes is constructed to provide a linear decision function with a maximal margin between the vectors of the two classes (Cortes and Vapnik 1995). A set of labelled training patterns

$$(y_1, x_1), \ldots, (y_\ell, x_\ell), \qquad y_i \in \{-1, 1\}$$

is linearly separable if there exist a vector $w$ and a scalar $b$ such that

$$w \cdot x_i + b \geq 1 \ \text{ if } y_i = 1, \qquad w \cdot x_i + b \leq -1 \ \text{ if } y_i = -1$$

The optimal hyperplane that separates the data with maximal margin is given by

$$w_0 \cdot x + b_0 = 0, \qquad w_0 = \sum_{i=1}^{\ell} y_i \alpha_i^0 x_i$$

where $\alpha_i^0 > 0$ and $\Lambda_0^T = (\alpha_1^0, \ldots, \alpha_\ell^0)$ form a vector of parameters, and the Lagrangian problem that needs to be minimized is

$$L(w, b, \Lambda) = \frac{1}{2}\, w \cdot w - \sum_{i=1}^{\ell} \alpha_i \left[ y_i (x_i \cdot w + b) - 1 \right]$$

A kernel function is often used to transform the feature space when the data cannot be split by a hyperplane without error, which occurs when points overlap.
⁸ The number of observations with a majority for a response class that becomes isolated.
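To illustrate how a kernel replaces the inner product in the decision function, the sketch below uses a radial basis function (RBF) kernel. The choice of kernel, the value of `gamma`, the function names, and the toy support vectors are assumptions for illustration and are not stated in the text above.

```python
import math
from typing import Sequence


def rbf_kernel(u: Sequence[float], v: Sequence[float], gamma: float = 0.5) -> float:
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def decision_value(x: Sequence[float],
                   support_vectors: Sequence[Sequence[float]],
                   labels: Sequence[int],
                   alphas: Sequence[float],
                   b: float, gamma: float = 0.5) -> float:
    """Kernelized decision function: sum_i alpha_i * y_i * K(x_i, x) + b."""
    return sum(a * y * rbf_kernel(sv, x, gamma)
               for sv, y, a in zip(support_vectors, labels, alphas)) + b


# Toy model with one support vector per class (hypothetical values)
svs = [[0.0, 0.0], [2.0, 0.0]]
near_first = decision_value([0.1, 0.0], svs, [1, -1], [1.0, 1.0], 0.0, gamma=1.0)
near_second = decision_value([1.9, 0.0], svs, [1, -1], [1.0, 1.0], 0.0, gamma=1.0)
```

The sign of the decision value gives the predicted class, so a point close to the positive support vector yields a positive value and a point close to the negative one yields a negative value.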

Artificial neural networks
The basic elements of an artificial neural network are the nodes, which represent the neurons of the biological brain. These nodes are connected and receive stimulation signals from the input variables. This is not performed directly but through weights and activation functions. The neuron output signal $O$ is expressed by the following relationship (Abraham 2005):

$$O = f(net) = f\left(\sum_{j=1}^{n} w_j x_j\right)$$

where $w_j$ is the weight attached to input $x_j$ and the function $f(net)$ is referred to as an activation (transfer) function. The variable $net$ is defined as the scalar product of the weight and input vectors:

$$net = w^T x = w_1 x_1 + \cdots + w_n x_n$$

where $T$ is the transpose of the matrix and the output value $O$ is computed as

$$O = f(net) = \begin{cases} 1 & \text{if } w^T x \geq \theta \\ 0 & \text{otherwise} \end{cases}$$

where $\theta$ is the threshold level. If the net input reaches the threshold level, the signal is transmitted to the connected node. Figure 4 illustrates the general structure of an artificial neural network.
Our network architecture was similar to that of the extreme-learning machine used by Pang et al. (2021), differing only in the learning method. Specifically, the network had nine input variables (Fig. 2) with their respective weights, one hidden layer, and one output node (default). A logistic activation function was used, with resilient backpropagation with weight backtracking as the learning algorithm and cross-entropy as the error function.
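A forward pass through a network of this shape (nine inputs, one hidden layer, one logistic output) and the cross-entropy error for a single observation can be sketched as follows. The number of hidden nodes, the bias terms, the function names, and the zero weights in the example are hypothetical; the resilient backpropagation training step is not shown.

```python
import math
from typing import Sequence


def logistic(z: float) -> float:
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(x: Sequence[float],
            hidden_weights: Sequence[Sequence[float]],
            hidden_bias: Sequence[float],
            output_weights: Sequence[float],
            output_bias: float) -> float:
    """Predicted probability of default for one input vector."""
    hidden = [logistic(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(hidden_weights, hidden_bias)]
    return logistic(sum(w * h for w, h in zip(output_weights, hidden))
                    + output_bias)


def cross_entropy(y: int, p: float) -> float:
    """Cross-entropy error for a binary target y and predicted probability p."""
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


# Nine inputs, two hidden nodes (hypothetical size), all weights set to zero
x = [0.0] * 9
p_default = forward(x, [[0.0] * 9, [0.0] * 9], [0.0, 0.0], [0.0, 0.0], 0.0)
error = cross_entropy(1, p_default)
```

With all weights at zero the network is uninformative and predicts a probability of 0.5, which gives a useful baseline value for the error function.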

Benchmark predictions
Table 6 presents the results for the performance measures. Apart from the neural network, the alternative models did not show any skill, with AUC scores below or only slightly above 50% (Fig. 5). Using the DeLong test to compare the AUC scores, we developed the following two-sided contrast based on DeLong et al. (1988):

$$H_0: AUC_p = AUC_k \qquad \text{versus} \qquad H_1: AUC_p \neq AUC_k$$

where $AUC_p$ and $AUC_k$ represent the AUC of the two models. Table 7 presents the estimates and significance levels for the scores. Our results showed that the AUC score of the logistic regression differed significantly from that of every other model. The neural network also showed significant differences at the 5% and 10% levels. However, when it was compared with the logistic regression, the difference was significant in favor of the latter at the 5% level. The other methods did not show statistical significance. The conditional inference forest and SVM achieved greater specificity at the expense of sensitivity, whereas the conditional inference tree and random forest had acceptable sensitivity rates but poor specificity. The neural network is a more balanced model and has a greater F1-score and AUC percentage. Indeed, it achieved the greatest F1-score of all the models assessed, including logistic regression. It also had a greater specificity rate and overall accuracy than logistic regression. However, its AUC was slightly lower and it detected 20% fewer defaulted loans. These results contained some misclassifications, one of particular importance being that some of the non-default loans were classified as default, which negatively impacted specificity rates. This was due to a trade-off between sensitivity and specificity, which affected all the models. However, the decrease in specificity rates is more than compensated for by the greater detection of default loans.
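The reading of an AUC near 50% as "no skill" follows from the interpretation of the AUC as the probability that a randomly chosen defaulted loan receives a higher score than a randomly chosen non-defaulted loan. The sketch below computes the AUC from that definition by comparing all pairs; the function name and the example scores are hypothetical, and the variance estimation needed for the DeLong test is not shown.

```python
from typing import Sequence


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the share of (default, non-default) pairs ranked correctly.

    Ties between a default and a non-default score count as half.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes are required")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Hypothetical predicted probabilities for two non-defaults and two defaults
example_auc = auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
# A model that gives every loan the same score cannot rank them at all
constant_auc = auc([0.5, 0.5, 0.5, 0.5], [0, 0, 1, 1])
```

The pairwise loop is quadratic in the number of observations, so for large test sets a rank-based formulation would be preferable.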

Pricing mechanism with implicit probability of default
This section evaluates the pricing of the platform by comparing the implicit and observed probabilities of default. Many factors can affect invoice prices. However, only the platform knows the extent to which each is considered. The Discussion section revisits this debate. The price paid by the seller is assumed to exclusively represent the borrower's probability of default; thus, the valuation of an invoice is based on the expected payoff, which depends on the price $P$, the interest rate $i$, the recovery rate $r$, the total advance amount of the invoice $C$, and the probability of default $PD$. By algebraic manipulation of this relationship, the implicit probability of default in the price is obtained. This is the borrower's probability of default, intrinsic to the price charged to the seller. Table 8 compares the implicit and observed probabilities of default. Implicit default experienced a slight uptrend and is consistently higher than the observed default for each year. This trend remains even with a lowering of the observed probability of default in later years. Notably, the implicit probability of default does not decrease like the observed default. Possible explanations for this decline in the observed probability of default include greater experience in loan selection, which may have led to a higher proportion of non-performing loans gradually being rejected; the increase in business volume, because the higher the number of observations, the higher the convergence of the default rate towards a lower level (law of large numbers); and the positive economic cycle experienced during this period (2011–2019). These results may indicate a low correlation between the implicit probability of default and observed default, with a Pearson correlation coefficient of 0.31. The Mann-Whitney-Wilcoxon test was used to compare the two medians, given that they did not follow a normal distribution and were not paired samples. The null hypothesis, $H_0$, was that the medians of the two samples are equal, and the alternative, $H_1$, that they differ. The statistics were given by:

$$U_1 = n_1 n_2 + \frac{n_1 (n_1 + 1)}{2} - R_1, \qquad U_2 = n_1 n_2 + \frac{n_2 (n_2 + 1)}{2} - R_2$$

where $n_1$ and $n_2$ are the sample sizes, $R_1$ and $R_2$ are the rank sums, and $U_1$ and $U_2$ are the statistics of samples 1 and 2, respectively. The Mann-Whitney-Wilcoxon test reported a p-value of 4.1e-05, which led us to reject the null hypothesis of equal medians between samples at the 1% significance level. By computing the mean difference between the two probabilities of default, implicit default was found to be four times higher than observed default. There is evidence to consider this difference significant, especially if the platform did not consider its own probability of default, the probability of default of the seller, or if some of these transactions were guaranteed by insurance. Given that the borrower's probability of default is the most relevant factor to consider when pricing an invoice and implicit default did not decrease in the same fashion as the observed default, these results may indicate that sellers pay a premium not realistically adjusted to the real probability of default. An explanation for paying such a premium may be that many of these borrowers must deal with credit rationing (Cowling 2010; Lee et al. 2015) in traditional credit markets, either by not having access to credit or having it at an even higher interest rate. In addition, the high flexibility of this lending method, which can remotely finance an invoice within a few hours, could explain why a higher price is charged. Hence, the platform may take advantage of this situation by charging a premium based on market conditions rather than on operational risk, leading to an imbalance between price and actual risk.
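The rank-sum statistics of the Mann-Whitney-Wilcoxon test can be computed with a short Python sketch. It assumes the common textbook convention $U_1 = n_1 n_2 + n_1(n_1+1)/2 - R_1$ (and analogously for $U_2$) and uses average ranks for ties; the function names and sample values are hypothetical, and the p-value calculation is not shown.

```python
from typing import List, Sequence, Tuple


def average_ranks(values: Sequence[float]) -> List[float]:
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1          # average of ranks pos+1 .. end+1
        for j in range(pos, end + 1):
            ranks[order[j]] = shared
        pos = end + 1
    return ranks


def mann_whitney_u(sample1: Sequence[float],
                   sample2: Sequence[float]) -> Tuple[float, float]:
    """Return (U1, U2) computed from the rank sums of the pooled samples."""
    n1, n2 = len(sample1), len(sample2)
    ranks = average_ranks(list(sample1) + list(sample2))
    r1, r2 = sum(ranks[:n1]), sum(ranks[n1:])
    u1 = n1 * n2 + n1 * (n1 + 1) / 2 - r1
    u2 = n1 * n2 + n2 * (n2 + 1) / 2 - r2
    return u1, u2


# Hypothetical yearly default probabilities (observed vs. implicit)
u_separated = mann_whitney_u([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
u_with_ties = mann_whitney_u([1.0, 2.0], [2.0, 3.0])
```

The two statistics always sum to $n_1 n_2$, so a value of zero for one of them indicates that the two samples do not overlap at all.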

Discussion of results
Research on the prediction of default for invoice-trading platforms is a novel topic (Dorfleitner et al. 2017), which is investigated in this study. However, unlike previous papers, the current study combined artificial intelligence techniques with sampling techniques to enhance the accuracy and effectiveness of predictions. The results show that these techniques can effectively improve the detection of defaults by up to 56% while maintaining levels of specificity above 70%. Not addressing the problem of an imbalanced dataset results in models with very high overall accuracy and specificity rates but unacceptable sensitivity rates. Therefore, by applying sampling techniques, the robustness of our models increased since they helped the learning process with artificial data. This in turn improved the reaction to unseen data. In addition, intertemporal validation introduced another layer of robustness by using predictions made with a 1-year lag. Feature selection in the preprocessing stage allowed for the construction of models with the relevant factors determining default, which creates less complex models that are easier to interpret, less prone to overfitting, and more robust. Of the models evaluated, the neural network was the best in terms of the F1-score, whereas the logistic regression had a similar score with 20% more sensitivity, which could be more appropriate from an investor's perspective. Bearing this in mind, limited disclosure on the Kriya platform prevents the advantages of more advanced techniques from being exploited and asymmetric information from being reduced. This negatively affects the performance of any model trying to predict default probability and limits the possibility of investors correctly assessing credit risk. Disclosing the borrower's and seller's financial information on the invoice, as well as information on the protection provider, is essential to permit a complete assessment of transaction risk.
Furthermore, the current study of the platform's credit pricing with an implicit probability of default indicates a consistent discrepancy between the observed and implicit probabilities of default. The price charged to the sellers represented a probability of default four times higher than the observed probability of default, and the differences between the two were statistically significant. Consequently, this study assessed whether the invoice price was realistically adjusted to the borrower's probability of default, or whether the platform was overcharging sellers for other reasons. Standard valuation models for invoice lending usually consider a borrower's probability of default as the only relevant event (Nava et al. 2019). This is a simplification of reality, since the price that the seller pays reflects not only that but also the probability of default of the seller (Nava et al. 2019), the value of a flexible method (where an invoice can be sold in just a few hours [Dziuba 2018]), and situations wherein sellers with lower creditworthiness may encounter credit rationing or higher interest rates in traditional commercial credit (Li 2016). The price also reflects the operational risk of this form of financing (Liang et al. 2022). Factors contributing to the apparent discrepancy may include risks associated with invoice factoring such as fraud, the authenticity of sold invoices (fraudulent activities from debtors), and/or operational issues impacting the platform's reliability and investors' confidence. This uncertainty may prompt investors to demand higher interest rates to offset the perceived risks. Other factors, such as the economic cycle or fierce competition between platforms, may also increase a platform's probability of default (Yoon et al. 2019), thereby influencing the pricing mechanism.
This study had two practical implications. First, although the neural network model exhibited a high F1-score, logistic regression demonstrated comparable overall performance with a 20% higher sensitivity. From an investor's standpoint, prioritizing sensitivity, which entails identifying true-positive cases, may be more appropriate because it mitigates the risk of overlooking potential defaults. When selecting a predictive model, investors should carefully evaluate the tradeoff between specificity and sensitivity. Second, the invoice-trading P2B market may appeal to small businesses that have limited collateral or lack a credit history and that have trouble being granted traditional bank loans. However, during the examined period, the discrepancy between the observed and implicit probabilities may indicate an inefficiency in the market, likely due to limited competition, resulting in small and medium enterprises (SMEs) being financed at higher interest rates than their actual risk profile suggests. To make informed decisions regarding their investments in online invoice-trading platforms, investors should be aware of the disparity between the observed and implicit probabilities of default in the pricing mechanism of these platforms.

Conclusions
This study developed several machine-learning models to predict default in the invoice-trading P2B market. We used publicly available data from the crowdlending platform Kriya and applied several techniques such as logistic regression, conditional inference trees, random forests, support vector machines, and neural networks. The current findings demonstrate that implementing these techniques leads to a substantial enhancement in the detection of defaults, achieving an improvement of up to 56%. Remarkably, these improvements were achieved while maintaining specificity levels above 70%. Furthermore, these results were obtained despite the limited information provided by the platform. The financial information of neither the borrower nor the invoice seller could be included, which is a limitation of this study. Solvency ratios, insurance, and collateral information are variables that can significantly increase model performance; hence, greater information disclosure in line with what peer-to-peer lending platforms offer is desirable to increase transparency and correctly assess risk.
In addition, this study examined the platform's credit pricing using the implicit probability of default. We discovered a consistent disparity between the observed and implicit default probabilities. Notably, the price charged to sellers reflects a significantly higher probability of default than the observed probability. Several factors could explain this difference, in addition to the borrowers' risk of default, such as inclusion in the pricing mechanism of the platform's and seller's probabilities of default, the flexibility of this form of financing, or credit rationing. However, if the borrower's probability of default is considered the main factor when pricing an invoice, this may indicate that sellers pay a premium that does not accurately reflect the actual risk involved.
The current study focused on the problem of estimating defaults in the P2B invoice market, which is considered short-term lending. This study can be applied to medium- and long-term lending in P2B. Other methodologies proven effective in related studies include LightGBM with and without swarming techniques (Ko et al. 2022; Muslim et al. 2022) and learning-to-rank methodologies. Future studies in this area should change their focus from the borrower's probability of default to the platform's risk of default, as in Ahelegbey et al. (2019), Chen et al. (2022), and Liang et al. (2022). In addition, niche segments in well-studied P2P markets, such as equity and real estate, can be explored.

Fig. 1 Boxplots of the main variables considered in the analysis. Note: Trade type and trade band are categorical variables; therefore, we exclude them from the boxplot analysis. The definitions of the variables are listed in Table 3

Fig. 3 Summary of the research procedure

Fig. 4 Illustration of an artificial neural network. Note: Image by Geetika Saini, Artificial neural network, CC BY-SA 4.0

Table 1
Summary of numerical variables
"IPI" stands for Import Price Index

Table 2
Summary of categorical variables

Table 3
Correlation matrix
This table reports Spearman's correlation coefficients with the significance of the correlation test. Advance rate (AR) is the percentage of an invoice's face value that a factor pays upon its purchase. Annualized gross yield (AGY) is the actual yearly rate of return accrued from factoring, considering the effect of compounding interest. Total face value (TFV) is the total amount of the invoice (in GBP). Days, or maturity, is the number of days elapsed from the advance date to the expected pay date. GDP, Exports and IPI are the gross domestic product, level of exports, and import price index in the UK, respectively. Trade type and band are categorical variables, and we exclude them from the correlation matrix. *Significant at 10%; ** significant at 5%; *** significant at 1%

Table 5
Logistic regression prediction on a balanced minority class ratio of 36%
The mean results are computed over 200 iterations of the random sampling procedure. McFadden's pseudo-R 2

Table 6
Performance measures for a balanced minority class ratio of 36%
Hyperparameters were searched only for the conditional inference trees and applied to the forests. Results on conditional inference trees may have varied as they have high variance by default. The forests were built with 200 trees each

Table 7
DeLong test estimates and significance of AUC scores
This table states the estimates and significance levels of the DeLong test between paired AUC scores. Log.

Table 8
Annualized gross yield and implicit and observed probabilities of default mean evolution
AGY denotes annualized gross yield. IPD is the implicit probability of default. OPD is the observed probability of default

Table 9
Summary of related studies

Table 10
List of variables

Advance rate (AR): Percentage of an invoice's face value which a factor pays upon its purchase
Annualized gross yield (AGY): Actual yearly rate of return earned on factoring considering the effect of compounding interest
Trade type: Factor that reflects the type of trade. May be standard, whole ledger, multi debtor or license fee
Trade band: Factor that reflects the industry/sector of the borrower
Default: Factor that reflects if the loan was default or not. Target variable of the study
Total crystallized loss: Real loss perceived after discounting recovered amount (in pounds sterling)
Trade payment state: Factor that reflects the status of the loan. May be paid, seller-repurchased, repurchase-requested, overdue or partially-paid

Table 11
Hyperparameter tuning
Any hyperparameter not specified in this table was considered with its default value by the R mlr package. Hyperparameters were searched only for the conditional inference trees and applied over the forests for comparison purposes. Results on conditional inference trees may have varied as they have high variance by default. Teststat is the type of test statistic applied for variable selection. Testtype specifies the distribution of the test statistic. Mtry is the number of input variables randomly sampled at each node. Minrow is the minimum number of observations in a node. Minsplit is the minimum sum of weights in a node considered for splitting. Minbucket is the minimum sum of weights in a terminal node. Maxdepth is the maximum depth of the tree. Ntree is the number of trees. C is the cost of constraints violation. Epsilon is the insensitive-loss function used. Hidden is the number of hidden layers. Stepmax is the maximum number of steps for training the net. Learningrate is the rate at which the net learns. Algorithm specifies the type of algorithm used to calculate the net. Err.fct is the error function. Act.fct is the activation function