A critical review of machine learning for lignocellulosic ethanol production via fermentation route

In this work, machine learning (ML) applications in lignocellulosic bioethanol production were reviewed. First, the pretreatment-hydrolysis-fermentation route, the most commonly studied alternative, was summarized. Next, a bibliometric analysis was performed to identify the current trends in the field; it was found that ML applications in the field are not only increasing but also expanding their relative share in publications, with bioethanol seeming to be the most frequently researched topic while biochar and biogas are also receiving increased attention in recent years. Then, the implementation of ML for lignocellulosic bioethanol production via this route was reviewed in depth. It was observed that artificial neural network (ANN) is the most commonly used algorithm (appearing in almost 90% of articles), followed by response surface methodology (RSM) (in about 25% of articles) and random forest (RF) (in about 10% of articles). Bioethanol concentration is the most common output

Table 1 summarizes the reviews on bioethanol production through fermentative pathways (including the present paper) to illustrate each work's contribution to the field and show the gap that may be filled with the present study. As the table indicates, the present work can be differentiated from the others in two respects. First, the previously published literature concentrated either on specific value chain steps or on specific ML algorithms, whereas our work covers all ML applications in all steps of bioethanol production via the fermentation route. Second, we performed an exploratory analysis of the literature so that the shifting trends could also be seen, putting the reviewed material in time perspective. To the best of our knowledge, there are no review papers with such coverage. In light of these, we first summarize the lignocellulosic bioethanol production process through the pretreatment-hydrolysis-fermentation route, followed by an extensive text mining analysis. Next, the manuscript reviews and evaluates ML utilization in the field and, finally, provides a comprehensive perspective for future applications.

Lignocellulosic biomass and its conversion to bioethanol via fermentation
Compared to sugar and starch-based feedstock, LCB is more complex, and understanding its structure, especially at a molecular level, is critical (Liu et al., 2019). It mainly consists of cellulose, hemicellulose, and lignin; all of which are tangled up in one another to make lignin-carbohydrate complexes (Fig. 3). Cellulose is the primary component of plant cell walls giving them rigidity and strength, and it is the largest carbohydrate in the LCB accounting for 40-60% of the weight. Hemicelluloses are the second most abundant carbohydrate in LCB (about 20-35% of weight); they are heterogeneous polysaccharides, including several hexose sugars (e.g., glucose, mannose, and galactose) and pentose sugars (e.g., arabinose and xylose) (Brandt et al., 2013). Finally, as the remaining part, lignin is an aromatic, water-insoluble polymer that provides the plant with water-proofing ability, structural strength, and resilience (Zoghlami and Paes, 2019). Compared to cellulose and hemicellulose, lignin (the protective structure) is especially resistant to biological breakdown.
Unfortunately, due to the complex nature of LCB, its conversion to ethanol is not straightforward. The fermentation pathway is carried out in three steps: pretreatment, hydrolysis, and fermentation. The components of LCB are held together by strong covalent bonds, van der Waals forces, and various intermolecular bridges, forming a strong and complex structure that is highly stable against hydrolysis (Kumar et al., 2010). Hence, pretreatment is required to separate the lignin and recover cellulose and hemicellulose for conversion to ethanol (Cheah et al., 2020). The separated lignin can be combusted to generate heat or sold in the market (Aui et al., 2021). The remaining complex polymer structures (i.e., cellulose and hemicellulose) are converted into simple sugar molecules via hydrolysis. Then, in the final step, the fermentable simple sugar molecules are converted into ethanol (Charte et al., 2017). The pretreatment of LCB is the most expensive process in the pathway, accounting for approximately 20% of the total cost (Yang and Wyman, 2008); the hydrolysis and fermentation steps are also not easy due to the presence of a complex mixture of different sugars, which has a detrimental impact on the economic feasibility of the process.

Pretreatment
Several pretreatment processes exist, such as physical/physicochemical, chemical, and biological treatments. Mechanical size reduction, such as chopping, is one of the physical methods for increasing the surface area of LCB; the ultra-fine milling process can also be used to reduce cellulose polymerization and crystallinity, although it is costly (Liu et al., 2019). Microwave heating is another potential alternative pretreatment for lignocellulosic materials: it eliminates the need for solvents, separating agents, and other auxiliary chemicals, produces no smoke or waste, and reduces the processing time and energy compared to other heating systems (Aguilar-Reynosa et al., 2017). Additionally, liquid hot water pretreatment (Yan et al., 2016) and steam explosion (Liu et al., 2014) are two other examples of physical/physicochemical pretreatment methodologies.
Even though the concentrated acid can almost completely break down cellulose at a lower temperature, the process is not practical as it produces significant waste that is harmful to the environment. Hence, the dilute acid pretreatment is employed as a favorable approach over other pretreatment methods due to its low cost, high efficiency in hydrolyzing hemicellulose into monomeric components, and generating structural modifications for improved enzyme accessibility and cellulose conversion (Loow et al., 2016). Sulfuric acid (H2SO4) is the most frequently used acid for dilute acid pretreatment, while nitric acid (HNO3), hydrochloric acid (HCl), or phosphoric acid (H3PO4) can be used as well (Xu and Huang, 2014). However, this process also has several disadvantages, such as the need for expensive corrosion-resistant equipment or the neutralization of acidic hydrolyzates before the fermentation of sugars (Zheng et al., 2009).
Alkaline pretreatment can eliminate the need for costly materials and specialized designs for corrosion resistance or strong reaction conditions; it is performed at moderate conditions, sometimes even at room temperature, by soaking the material in a sodium hydroxide (NaOH) or ammonium hydroxide (NH4OH) solution. Some alkaline pretreatment techniques can also allow the recovery and reuse of chemical reagents (Kim et al., 2016). However, the efficiency of alkaline pretreatment depends on the substrate; generally, it is more successful on hardwood, herbaceous crops, and agricultural leftovers with low lignin content (Zheng et al., 2009). The primary downside of this technique is the formation of considerable amounts of salts, which limit microbial growth and ethanol fermentation in the next stages if they are not effectively removed (Liu et al., 2019). Methods based on ionic liquids (IL, high amount of organic cation with a small amount of inorganic anion), deep eutectic solvents (DES, mixtures of Lewis and Brønsted acids and bases), and organosolv (organic solvents, e.g., ethanol, methanol, butanol, acetone) can also be used (Galbe and Wallberg, 2019).
Biological pretreatment is also an environmentally friendly and economically promising alternative; in this method, the microorganisms such as brown, white, and soft rot fungi can degrade lignin and hemicelluloses from LCB (Sindhu et al., 2016). The main advantage of this process is that there is no requirement for chemical recycling, and no harmful substances are released into the environment. However, it has disadvantages like the need for very long reaction times due to slow degradation rate and the loss of significant amounts of biomass during the process (Liu et al., 2019). Although the overall aim of the pretreatment process, which is to maximize the release of fermentable sugars while limiting the inhibitor formation, is common, the best way to achieve this is highly dependent on the type of biomass (i.e., chemical and physical properties of the biomass) (Ravindran and Jaiswal, 2016; Vollmer et al., 2022).

Hydrolysis
Following the pretreatment phase, the hydrolysis process occurs, which can be classified as acid and enzymatic hydrolysis (Lugani et al., 2020). Enzymatic hydrolysis has a lower environmental impact and inhibitor formation, while acid hydrolysis is faster (Vani et al., 2015). In enzymatic hydrolysis, because LCB is composed of cellulose, hemicellulose, and lignin, a cocktail of enzymes containing cellulase (i.e., cellobiohydrolases, endo-glucanases, β-glucosidases), hemicellulase (i.e., endo-xylanases, β-xylosidases, xyloglucanases), and lignin-degrading enzymes is required (Agrawal et al., 2021). Since the process uses soluble enzymes to break down insoluble substrates, it is a heterogeneous reaction system that is influenced by a variety of parameters such as lignin and hemicellulose content, cellulose crystallinity, degree of polymerization, accessible surface area, and pore volume (Zhao et al., 2021).
During the hydrolysis process, the cellulose is broken down into glucose while hemicellulose is separated into 5-carbon sugars (i.e., arabinose and xylose); however, acetic acid is also produced as a byproduct of the hydrolysis of hemicellulose limiting the microbial development and ethanol production; this can be considered as the major disadvantage of hydrolysis process (Scheller and Ulvskov, 2010). Another disadvantage is the high energy consumption of the process because of the lignin present in the reaction, which consumes reactor space and creates a need for extra mixing to homogeneously suspend the fermentation broth during the enzymatic hydrolysis and fermentation stages (Liu et al., 2019).

Fermentation
A soup of hexose and pentose sugars is produced at the end of the hydrolysis process. The conversion of glucose to ethanol is simple and uncomplicated, but the others are not. Various microbial populations are needed for the fermentation of different sugars; however, each has different optimum growth conditions (Kucharska et al., 2018). Microbes that naturally ferment all these sugars also have a low tolerance for bioreactor conditions due to toxin buildup. In addition, during the process of sugar fermentation, microorganisms tend to utilize one type of sugar (usually glucose) over others (Kim et al., 2010). For example, Saccharomyces cerevisiae is one of the most commonly used microorganisms in fermentation that cannot naturally utilize xylose (Jahanbakhshi and Salehi, 2019). Even though those microorganisms may utilize pentose sugars, the glucose generated from cellulose often inhibits the catabolism of these sugars (Zhao et al., 2021).
All these make it difficult to develop and control the fermentation of LCB as a feedstock; the incomplete conversions and slow enzyme reactions also complicate the process and reduce the ethanol yield. The efficiency of the pretreatment, hydrolysis, and fermentation processes, together with LCB characteristics, is important for producing bioethanol at competitive prices (Qiao et al., 2022). High efficiency, low cost, and low levels of inhibitory byproducts using greener pretreatment solutions are among the current research focuses (Sidana and Yadav, 2022). Better fermentative microorganisms are also needed in the field, requiring more research to discover efficient and robust microbial consortia or to construct genetically engineered strains (Culaba et al., 2022). The selection of raw material is another important factor affecting the cost of the bioethanol production process, as the composition has a direct effect on pretreatment cost and fermentable sugar content, as does the abundance of the biomass in the region of consideration (Smuga-Kogut et al., 2021). When all factors are taken into account, it becomes hard to find the optimal solution for bioethanol production for different regions of the world. Traditional mathematical models can solve optimization problems with first-order equations (Sousa Jr et al., 2011). However, the variability is always high in biological systems; hence the generalization capacities of these models are not always sufficient. On the other hand, more generalizable solutions can be developed with the use of ML algorithms, which can overcome the nonlinearities and high level of complexity of the biological processes, especially if more research and experimental effort is dedicated to producing high-quality, reproducible data (Wang et al., 2022).

Bibliometric evaluation of lignocellulosic biofuel area
Bibliometric evaluation of scientific literature is widely performed in different areas of science and becoming a common research tool in specified research fields to connect relations among various concepts and research disciplines as well as to discover global research trends (Yaoyang and Boeing, 2013). To understand the trends in the "lignocellulosic biofuel" research field, bibliometric evaluation was done by analyzing the "author keywords" of the articles in the literature. For this purpose, the Web of Science database was used with the search term lignocellulosic biofuel, and a bibliometric study was carried out with a total of 6853 publications.
Research interest in the field was assessed by the number of articles published yearly. It was found that articles related to lignocellulosic biofuel are increasing year by year, as expected (Fig. 4a). However, this trend is common in different research fields, as the total number of SCI-indexed publications is also increasing (Yaoyang and Boeing, 2013). To discover the contribution of ML in the field, another search was conducted with the term lignocellulosic machine learning, and the results were compared with those for lignocellulosic biofuel, as shown in Figure 4a. ML inclusion in the field was observed to be increasing in number and expanding its relative share in total publications. "Author keywords" were extracted from the publications and categorized with respect to the type of biofuel, feedstock, and conversion method to understand the trends in the area. A data cleaning step was conducted by combining the duplicated and synonymous keywords. Also, four-year moving averages were analyzed to smooth out year-to-year fluctuations. The categorized keyword distributions for four-year averages are presented in Figures 4b-d to uncover shifts in the research trend in the field.
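The keyword cleaning and smoothing steps described above can be illustrated with a minimal Python sketch. The synonym map, the toy records, and the helper names (`clean_keyword`, `yearly_counts`, `moving_average`) are hypothetical and are not taken from the reviewed study; a trailing four-year window is assumed here, which may differ from the exact averaging used for Figure 4.

```python
from collections import Counter, defaultdict

# Hypothetical synonym map used to merge duplicated/synonymous author keywords.
SYNONYMS = {
    "ethanol": "bioethanol",
    "bio-ethanol": "bioethanol",
}


def clean_keyword(keyword):
    """Normalize case/whitespace and map synonyms to one canonical keyword."""
    key = keyword.strip().lower()
    return SYNONYMS.get(key, key)


def yearly_counts(records):
    """records: iterable of (year, [author keywords]) -> {keyword: Counter(year -> count)}."""
    counts = defaultdict(Counter)
    for year, keywords in records:
        # Count each cleaned keyword at most once per article.
        for kw in {clean_keyword(k) for k in keywords}:
            counts[kw][year] += 1
    return counts


def moving_average(series, years, window=4):
    """Trailing moving average over `window` consecutive years (missing years count as 0)."""
    result = {}
    for i in range(window - 1, len(years)):
        span = years[i - window + 1 : i + 1]
        result[years[i]] = sum(series.get(y, 0) for y in span) / window
    return result


# Toy records standing in for the Web of Science export (assumed format).
records = [
    (2018, ["Ethanol", "pretreatment"]),
    (2019, ["bio-ethanol", "biochar"]),
    (2020, ["bioethanol", "Bioethanol"]),
    (2021, ["biochar"]),
    (2022, ["bioethanol", "biochar"]),
]
years = list(range(2018, 2023))
counts = yearly_counts(records)
bioethanol_trend = moving_average(counts["bioethanol"], years)
biochar_trend = moving_average(counts["biochar"], years)
print(bioethanol_trend)
print(biochar_trend)
```

Counting a keyword once per article keeps a paper that lists both "bioethanol" and "Bioethanol" from being counted twice after cleaning.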
Second-generation biofuels can be produced using various conversion methods, including hydrolysis-fermentation, pyrolysis, hydrothermal conversions, and other biological processes. Biofuels such as biogas, biohydrogen, bioethanol, biomethanol, and biodiesel can be produced using these conversion processes (Kucharska et al., 2018). To understand the trends in the lignocellulosic-based biofuel type, keywords were categorized with respect to the main biofuel categories: bioethanol, biogas/biohydrogen, biodiesel, biobutanol, biochar, and fermentable sugar. As shown in Figure 4b, bioethanol is the most studied lignocellulosic biofuel in each period. It is also observable that biochar and biogas are gaining more attention, as almost half of the related papers in these fields have been published in the last 4 years. As shown in Figure 4b, the trend of the conversion methods also agrees with the product-related keywords. Pretreatment, hydrolysis, and fermentation are the most frequently used keywords, as they constitute the main pathway for bioethanol production. However, although fermentation is more studied in total than pyrolysis, this gap is narrowing each year. Also, each year, anaerobic digestion, hydrothermal liquefaction, and hydrothermal carbonization increase their individual share in keywords.
The type of lignocellulosic feedstock utilized for biofuel production is critical for efficient and economic conversion. LCB can be categorized as agricultural and forest residues, forestry products, dedicated energy crops, municipal solids, and industrial waste (Qiao et al., 2022). The most commonly studied feedstocks in the literature are categorized and shown in Figure 4c. The agricultural residue category receives the most attention in the research area. Microalgae, although not a lignocellulosic material in nature, also has a strong presence in lignocellulosic biofuel-related articles and has been gaining more attention in recent years. It was also found that the number of different feedstocks tested increases yearly.
Categories and keywords related to bioethanol were analyzed further, and it was found that the use of ionic liquids (IL) in pretreatment is the most frequently appearing keyword, indicating its recent popularity (Fig. 4d). Dilute-acid pretreatment (DA) was the second most studied one between 2011 and 2018; however, in the last 4 years, organosolv (OV), steam explosion (SE), and microwave (MW) treatment received more attention than dilute-acid pretreatment. The increase in the research interest in deep eutectic solvents (DESs) is also worth mentioning, as their frequency has almost doubled in the last four years. The most commonly studied microorganisms are also presented in Figure 4d; for hydrolysis, the research interest is focused mainly on Trichoderma reesei, Clostridium thermocellum, and Aspergillus niger, while the interest in fermentation is more diverse, even though S. cerevisiae has been the preferred choice for fermentation throughout the years.

Machine learning in lignocellulosic ethanol
As the review articles presented in Table 1 (Coşgun et al., 2022), and modeling biodiesel properties (i.e., cetane number, cold filter plugging point and oxidation stability) over biodiesel samples. All those works indicate that, as far as ML is concerned, biofuel research is too diverse to analyze in a single communication; hence, in this review, we cover only the works directly related to lignocellulosic bioethanol production through the fermentation route.
For consistency, academic databases (i.e., Web of Science, Scopus, and Google Scholar) were searched with keywords lignocellulosic bioethanol and machine learning (supplemented by keywords such as data mining and names of ML algorithms). After the preliminary and comprehensive examination, 43 articles were retrieved to represent the subject. It is also worth mentioning that this study is limited to ML studies focused on bioethanol production from LCB. Figure 5 summarizes the articles presented in this work, with the numbers in the figure denoting the number of articles. Figure 5a shows the publication years of the articles, which shows an increasing trend in ML studies in the lignocellulosic bioethanol field, even though there are fluctuations due to the small data size. The distribution of data size used in these works is given in Figure 5b. It is indicated that the majority of the studies have data sizes ranging from 10 to 30 data points, likely due to the time-consuming nature of experimental work in the field. As a consequence of the small data size, the number of descriptors is also small (i.e., 2 to 5) in many works, as depicted in Figure 5c; those are the variables related to biomass characteristics, and operational conditions such as time, temperature and pH depending on the steps (i.e., pretreatment, hydrolysis, and fermentation) involved and technology used. Figure 5d shows the choice of ML algorithm in the studies. The output variables are also given in Figure 5e; although most studies focus on direct outputs, such as bioethanol, fermentable sugar, and glucose, some studies concentrate on process efficiency-related outputs. Studies conducted throughout the fermentation process primarily focus on predicting and optimizing the input variables for bioethanol production, while fermentable sugar and glucose are the common output variables for the hydrolysis process.
The details of the reviewed papers involving the pretreatment, hydrolysis, and fermentation steps are presented in Tables 2-4, respectively, while the major patterns observed in these papers are briefly discussed below with representative examples. The tables are organized according to the process step from which the input variables of the ML models were taken. Articles with variables only from the pretreatment step are summarized in Table 2. Articles that include inputs from the hydrolysis step but exclude the fermentation step are summarized in Table 3, where studies are divided into models that include variables from both the pretreatment and hydrolysis steps and models that include only hydrolysis step variables. In Table 4, articles that include the fermentation step inputs in the ML models are summarized, with categorization performed in the same way as in Table 3.
For a comprehensive picture, the unitless performance metrics (i.e., R²) of all ML models studied in the research articles mentioned in Tables 2-4 are summarized in Figure 6. The average performance of the models is above an R² value of 0.90, which suggests high predictive performance. Although the number of models (represented as n) is not enough to make clear inferences, the results suggest that adding pretreatment step inputs to models built only with hydrolysis step inputs increases the model performance. To understand the maximum achievable results of the major output variables (i.e., bioethanol, fermentable sugar, and glucose concentration), the prediction results of ML-assisted modeling and optimization studies are given in Figure 7. The results are separated with respect to the LCB used in the models, as the bioethanol, fermentable sugar, and glucose concentrations are highly dependent on the nature of the feedstock.
Table 3. Summary of the studies in which machine learning was used for lignocellulosic bioethanol production, involving the hydrolysis step.

Pretreatment & Hydrolysis
As the initial phase in ethanol production from LCB, several researchers concentrate on enhancing the pretreatment procedure. The output variables are generally cellulose recovery, while the descriptors are biomass characteristics and pretreatment conditions in these works. For instance, Phromphithak et al. (2021) modeled cellulose enrichment factor (CEF) and solid recovery (SR) by support vector machine (SVM), gradient boosting (GB), and random forest (RF) using 45 types of biomass and 80 kinds of solvents with 520 data entries gathered from the literature. It was shown that RF has higher predictive performance for CEF and SR (% w/w), while the other ML algorithms performed better for CEF. Similarly, the effect of pretreatment conditions on cellulose recovery using rice straw was investigated by ANN models with the Levenberg-Marquardt back-propagation algorithm by Parkhey et al. (2020).
The cellulose content of LCB is transformed into fermentable sugars by hydrolysis. Among the studies focused on the conversion efficiency of LCB to fermentable sugars by ML models, some studied pretreatment and hydrolysis steps together. For example, Aruwajoye et al. (2022) studied both fermentable sugar concentration and combined severity factor (CSF), which represents the efficiency of the pretreatment method, using ANN, RF, and decision tree regression (DTR). They used soaking temperature, soaking time, autoclave duration, HCl concentration, and solid loading as descriptor variables and constructed models using 49 experimental data. It was found that the most successful ML method varied depending on the output variable studied. There are also studies focusing on the hydrolysis step alone, even though these works also consider different output variables (single or multiple).
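The kind of algorithm comparison described above can be sketched with scikit-learn as follows. This is only an illustration under stated assumptions: the data are synthetic (49 rows and five unnamed descriptors, chosen to mirror the dataset size mentioned above), the response function is invented, and the model settings are arbitrary rather than those used by Aruwajoye et al. (2022).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for 49 experimental runs with 5 descriptors (assumption).
X = rng.uniform(0.0, 1.0, size=(49, 5))
# Invented nonlinear response with noise; NOT a real sugar-yield model.
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] * X[:, 3] - X[:, 4] ** 2 + rng.normal(0.0, 0.1, 49)

models = {
    # ANN: inputs are standardized because neural networks are scale-sensitive.
    "ANN": make_pipeline(
        StandardScaler(),
        MLPRegressor(hidden_layer_sizes=(8,), solver="lbfgs",
                     max_iter=2000, random_state=0),
    ),
    "RF": RandomForestRegressor(n_estimators=200, random_state=0),
    "DTR": DecisionTreeRegressor(max_depth=4, random_state=0),
}

# The same shuffled 5-fold split is used for every model so scores are comparable.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = {
    name: cross_val_score(model, X, y, cv=cv, scoring="r2")
    for name, model in models.items()
}

for name, fold_scores in scores.items():
    print(f"{name}: mean R2 = {fold_scores.mean():.3f} (std {fold_scores.std():.3f})")
```

Reporting the spread across folds alongside the mean is useful with datasets of this size, because a single train/test split can make any of the three algorithms look best by chance.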

Limitations and practical implications of the current work
Although we attempted to cover a sufficient number of papers involving a variety of aspects to provide an accurate representation of the current status of the pretreatment-hydrolysis-fermentation route for lignocellulosic bioethanol production, limitations and weaknesses are inevitable in such a review. First, we might have missed some significant works, as covering all related studies in a single review is impossible. Our restrictions on the scope and the focus on bioethanol (no other product) from lignocellulose (no other raw material) via fermentation (no other processes) were necessary to see field-specific trends and keep the review to a manageable size; however, there is an obvious trade-off in this approach, in that we may miss the big picture and overlook some trends in biofuel production in general.
We think there are also some limitations and weaknesses arising from the current ML practice. One of the primary issues in the subject is the lack of data; unfortunately, sufficiently large datasets with high-quality data are rarely available. ML relies on statistical inference, requiring large datasets with reasonable accuracy. The researchers in the field either use their own experimental datasets, which are usually limited in size for reliable conclusions, or extract data from the literature, which contain significant levels of noise due to the non-standard nature of experimental conditions. In either case, the knowledge that can be extracted using ML is inevitably limited. There are also some common mistakes in ML applications that may lead to deficient and erroneous conclusions. For example, the ML algorithm is not always chosen by considering the knowledge to be extracted or the dataset's structure. Instead, it may be selected because of recent popularity providing only limited benefit if an unsuitable algorithm is selected. This may also be true for some of the articles we analyzed because it is not always simple to identify such issues unless researchers test alternative methods and describe them in their papers.
Another potential issue is that the models may be too large for the size of the data; with ML models, the signs of overfitting are not always as apparent as they are in simple regression. Overfitting typically occurs and goes unnoticed if an effective validation procedure is not implemented or the details of the procedure are not discussed in the paper. Even with the appropriate dataset size and effective algorithms and validation procedures, it is necessary to test a broad range of model hyperparameters to determine the optimal model structure that accurately represents the data. Occasionally, only a few sets of model hyperparameters are examined, particularly if the initial trials yield a satisfactory level of fitness.
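One common way to address both points, validation and hyperparameter search, is nested cross-validation, in which the hyperparameters are tuned in an inner loop and the tuned model is scored on outer folds it has never seen. The sketch below is a minimal illustration with scikit-learn; the 30-point synthetic dataset, the support vector regressor, and the grid values are assumptions chosen for brevity, not recommendations drawn from the reviewed papers.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(1)

# Synthetic small dataset (30 points, 3 descriptors), an assumption for illustration.
X = rng.uniform(size=(30, 3))
y = np.sin(3.0 * X[:, 0]) + X[:, 1] - 0.5 * X[:, 2] + rng.normal(0.0, 0.05, 30)

pipe = Pipeline([("scale", StandardScaler()), ("svr", SVR(kernel="rbf"))])

# Illustrative grid; in practice the ranges should be wide enough to bracket the optimum.
param_grid = {
    "svr__C": [0.1, 1.0, 10.0, 100.0],
    "svr__gamma": ["scale", 0.01, 0.1],
    "svr__epsilon": [0.01, 0.1],
}

inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)  # tunes hyperparameters
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)  # estimates generalization

search = GridSearchCV(pipe, param_grid, cv=inner_cv, scoring="r2")

# Each outer fold reruns the whole grid search on its training part only.
nested_scores = cross_val_score(search, X, y, cv=outer_cv, scoring="r2")

# Fit on all data to obtain the final model and its training-set (resubstitution) R2.
search.fit(X, y)
train_r2 = search.score(X, y)

print("best hyperparameters:", search.best_params_)
print(f"training R2 = {train_r2:.3f}")
print(f"nested CV R2 = {nested_scores.mean():.3f} (std {nested_scores.std():.3f})")
```

A large gap between the training R² and the nested cross-validation R² is one practical warning sign of the unnoticed overfitting discussed above.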
Nevertheless, the limitations and weaknesses listed above are shared by all review papers of this type, and our paper will still make an important contribution to the field. We think that our review has four major implications in practice. First, it describes the current status, the patterns, and major research findings in the field through bibliometric analysis of the literature. Second, it provides consolidated results of representative works in literature for the readers to deduce their own conclusions. Third, as connected to the first two, our work may help to plan future experimental works by providing insight into the effects of descriptors, such that the focused nature of our work (bioethanol from lignocellulosic material via fermentation route) should help to identify some practical leverage points to improve the relevant processes further. Finally, the present work provides representative examples of ML applications for those who wish to perform similar works. One of the most critical tasks in ML applications is the selection of descriptors; inspecting the descriptors lists and relative significances determined in various works will help the investigators identify the potential descriptors they should use. Additionally, the examples reviewed in this paper also direct the researchers to the data sources and speed up the execution of similar ML analyses.

Challenges and future perspectives
As also stated in the previous section as one of the major limitations in current works, the availability of a sufficiently large number of accurate data is one of the biggest challenges for ML applications in bioethanol production, and this will likely be the case in the near future as well.
ML requires a dataset that describes the physical process well. First, data should contain the desired information (i.e., descriptors like physical and chemical properties of material or operational conditions should explain some critical performance measures). Second, the dataset should be sufficiently large and accurate for statistically reliable inferences. Construction of a sufficiently large and accurate dataset is one of the biggest challenges for ML applications in many fields; this also seems to be the case for lignocellulosic ethanol production, and as we stated in the previous section, it is also one of the major limitations of current applications. In fact, this may be more problematic for complex systems, such as lignocellulosic ethanol production, because a larger number of descriptors is required to represent such systems adequately, necessitating larger datasets for statistically reliable models. Another reason for this challenge in bioethanol production is that various alternative routes (like thermochemical or fermentation routes) or different configurations of the processes in the same route (like sequential or simultaneous hydrolysis and fermentation steps) are considered for lignocellulosic ethanol production, and none of them is regarded as the dominant route. Since the descriptors (and sometimes the performance measures) differ for dissimilar routes, the data from different process combinations differ. Hence, the availability of diverse processes and process configurations divides the efforts among the alternative routes and prevents the accumulation of sufficient data in any of them. Furthermore, new materials or process steps tested for the first time create unique variables not reported by other papers. All these create significant difficulties for implementing ML, which relies on learning from existing relations in the dataset; single or few data points having variables not shared by the others have limited use in ML analysis.
The data seems to be a bigger problem for more complex configurations like performing hydrolysis and fermentation simultaneously because more descriptors will be needed to represent the combined process, which will require more data entries as well.
Another challenge seems to be the non-standard nature of cellulosic raw materials resulting in different products and yields, especially in the pretreatment and hydrolysis steps (Raj et al., 2022). Normally this should not be a problem for ML if all descriptors are clearly identified, and a sufficiently large number of data is available to smooth out the variations; however, in this field, datasets are typically small, and there is a substantial level of uncertainty (or at least variation) associated with the descriptors.
On the other hand, there are also efforts to overcome these challenges, and more can be expected in the future. One of these efforts is an approach called transfer learning, which aims to utilize the ML models and analyses developed for some fields to understand other similar fields (Kaya and Hajimirza, 2019). Although such approaches are not easy to implement in practice, they may be beneficial for lignocellulosic ethanol production as well; for example, experiences and models developed for the fermentation of sugar from corn, which is a more established field, should provide some insights for the ML analysis of the fermentation step in lignocellulosic ethanol production, even though some impurities (including inhibitors) exist in the cases related to lignocellulosic ethanol.
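A very simple form of this idea is to pretrain a neural network on a data-rich source process and then continue training it on the few available target data points. The following sketch illustrates only the mechanics with scikit-learn; the `simulate` function, the "source" and "target" datasets, and the number of fine-tuning passes are hypothetical, and no claim is made that fine-tuning improves accuracy in any real fermentation dataset.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)


def simulate(n, shift, noise):
    """Hypothetical process: 3 operating descriptors -> yield.

    `shift` mimics a feedstock-dependent effect (e.g., inhibitors) so that
    the source and target processes are related but not identical.
    """
    X = rng.uniform(size=(n, 3))
    y = (2.0 * X[:, 0] + np.sin(3.0 * X[:, 1]) - shift * X[:, 2]
         + rng.normal(0.0, noise, n))
    return X, y


X_src, y_src = simulate(500, shift=0.5, noise=0.05)   # data-rich source process
X_tgt, y_tgt = simulate(20, shift=1.0, noise=0.05)    # data-poor target process
X_test, y_test = simulate(200, shift=1.0, noise=0.05)  # held-out target data

# Step 1: pretrain on the source process.
model = MLPRegressor(hidden_layer_sizes=(16,), max_iter=500, random_state=0)
model.fit(X_src, y_src)
r2_source_only = model.score(X_test, y_test)

# Step 2: fine-tune on the small target set, starting from the pretrained weights.
# Each partial_fit call performs one pass over the data given to it.
for _ in range(100):
    model.partial_fit(X_tgt, y_tgt)
r2_fine_tuned = model.score(X_test, y_test)

print(f"target R2 with source-only model: {r2_source_only:.3f}")
print(f"target R2 after fine-tuning:      {r2_fine_tuned:.3f}")
```

Whether the fine-tuned model beats a model trained from scratch on the target data depends on how similar the two processes are, so both should be compared on held-out target data.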
Using computational tools, especially density functional theory (DFT), is another option for creating a dataset for ML analysis, and it is commonly employed in materials research. The standardized nature of the data created this way eases data sharing among researchers; indeed, numerous databases such as the Materials Project (Jain et al., 2016), OQMD (Kirklin et al., 2015), AFLOWLIB (Curtarolo et al., 2012), and the Computational Materials Repository (Landis et al., 2012) were constructed for this purpose. However, these tools and databases are mostly used for crystals and simple molecules; the current state of computational methods may not be sufficient to generate the large number of data entries required for a process like fermentation. Using experimental databases such as the Inorganic Crystal Structure Database (ICSD) (Bergerhoff et al., 1983), Pearson's Crystal Data (Villars and Cenzual, 2007), the Cambridge Structural Database (Allen, 2002), and the Crystallography Open Database (Gražulis et al., 2009), or creating a comparable database for lignocellulosic ethanol production, does not seem to be practical either. However, some form of data-sharing mechanism can still be implemented to improve the benefit of ML because larger datasets with more features generally provide more detailed and accurate information in ML analysis. One way to do this is to develop standard testing and reporting protocols, in collaboration with researchers in the field, so that data from various experimental works can be combined to create a sufficiently large amount of relatively uniform data. In the long run, computational tools like DFT may also be utilized in this field to understand the process and generate data, considering the rapid progress in computational tools and algorithms.
Another approach that can be used for small datasets is reducing the number of descriptors (dimensionality reduction) because fewer descriptors require less data; this can be done by eliminating insignificant descriptors (feature selection) or by combining them into a smaller set of new descriptors (feature extraction) (Alpaydin, 2020). Meanwhile, new ML algorithms and approaches for small datasets have also been investigated in recent years (Zhang and Ling, 2018; Feng et al., 2019; Ma et al., 2020). This trend will likely grow in the future and contribute to the research on lignocellulosic biofuels as well.
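Feature selection can be illustrated with a minimal sketch of a simple filter method, which ranks descriptors by the absolute Pearson correlation with the output and keeps the top k. This particular filter, the descriptor names, and the numbers are assumptions made for illustration; they are not taken from the reviewed studies, and other selection criteria could equally be used.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no linear information
    return sxy / math.sqrt(sxx * syy)


def select_features(columns, target, k):
    """Rank descriptors by |Pearson r| with the target and keep the top k."""
    scores = {name: abs(pearson(values, target)) for name, values in columns.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores


# Hypothetical toy dataset: four descriptors, six experiments.
columns = {
    "temperature_C": [30, 32, 34, 36, 38, 40],
    "pH": [5.0, 5.0, 5.0, 5.0, 5.0, 5.0],  # never varied, so uninformative here
    "enzyme_loading": [10, 20, 10, 20, 10, 20],
    "stirring_rpm": [150, 100, 200, 100, 150, 200],
}
ethanol_g_L = [20.1, 22.0, 23.9, 26.2, 27.8, 30.1]

selected, scores = select_features(columns, ethanol_g_L, k=2)
for name in sorted(scores, key=scores.get, reverse=True):
    print(f"{name}: |r| = {scores[name]:.3f}")
print("selected:", selected)
```

A correlation filter only detects linear, one-at-a-time relations; a descriptor that matters through an interaction or a nonlinear effect may be ranked low, so the result should be treated as a screening step rather than a final decision.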
Finally, explainable ML has been discussed in recent years as a response to the black-box nature of ML models, which is one of the main weaknesses (and criticisms) of current ML applications. Although this concept is also hard to implement (like transfer learning), it is quite appealing because it aims to explain the reasons behind the results obtained by ML models. This approach may be more beneficial for complex systems like lignocellulosic ethanol production because it helps to understand the relations among the descriptors and their impact on the outcome; it also allows the number of descriptors (and hence the required dataset size) to be reduced by eliminating the insignificant ones, which makes the use of small datasets easier.
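One widely used model-agnostic technique in this spirit is permutation importance, in which each descriptor column is shuffled in turn and the resulting increase in prediction error is recorded. The sketch below is only an illustration: the `predict` function is a hypothetical stand-in for a trained black-box model, and the descriptors and values are invented.

```python
import random


def predict(row):
    """Stand-in for a trained black-box model (hypothetical, for illustration).

    It depends on temperature and pH but ignores the stirring speed.
    """
    temperature, ph, _stirring = row
    return 0.8 * temperature - 3.0 * (ph - 5.0) ** 2


def mean_squared_error(y_true, y_pred):
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(model, X, y, n_repeats=20, seed=0):
    """Average increase in MSE when one descriptor column is shuffled."""
    rng = random.Random(seed)
    baseline = mean_squared_error(y, [model(r) for r in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [r[j] for r in X]
            rng.shuffle(column)
            # Rebuild each row with only column j replaced by a shuffled value.
            X_perm = [r[:j] + (v,) + r[j + 1:] for r, v in zip(X, column)]
            mse = mean_squared_error(y, [model(r) for r in X_perm])
            increases.append(mse - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Hypothetical descriptors: (temperature in C, pH, stirring speed in rpm).
X = [(30, 4.5, 100), (32, 5.0, 150), (34, 5.5, 200),
     (36, 4.8, 120), (38, 5.2, 180), (40, 5.0, 160)]
y = [predict(r) for r in X]  # noise-free targets, so the baseline error is zero

names = ["temperature_C", "pH", "stirring_rpm"]
importances = permutation_importance(predict, X, y)
for name, score in zip(names, importances):
    print(f"{name}: {score:.3f}")
```

A descriptor whose shuffling leaves the error unchanged is a candidate for elimination, although with small datasets and correlated descriptors such scores can be unstable and should be interpreted with care.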

Conclusions
Although LCB is the most abundant biomass source, converting it to ethanol is not easy and involves many sophisticated steps because of the nature of LCB. In this article, first, the lignocellulosic bioethanol process was reviewed from several different angles, including the present state of research, underlying mechanisms, challenges, and obstacles. It was revealed that pretreatment is one of the most expensive steps, with numerous approaches, including physical/physicochemical, acid/alkaline, solvent, and biological treatments. During the hydrolysis (which follows the pretreatment process), a cocktail of enzymes containing cellulase, hemicellulase, and lignin-degrading enzymes is necessary to break down the cellulose, hemicellulose, and lignin in the LCB. The hydrolysis process results in a mixture of hexose and pentose sugars. The conversion of glucose (the main hexose sugar) to ethanol is straightforward, while the conversion of the other sugars is challenging.
In the second part of this work, a bibliometric analysis was performed to extract the trends of research interest in the field. This analysis showed that the use of ML in the field is not only increasing in absolute terms but also expanding its relative share of publications. Bioethanol was found to be the most researched lignocellulosic biofuel, while biochar and biogas have received increased attention in recent years, with nearly half of those studies published in the last four years.
Then, the implementation of ML approaches to assist in choosing the most suitable experimental conditions leading to the highest conversion via the most practicable route was reviewed in depth. It was observed that ANNs are the most commonly used algorithms (appearing in almost 90% of articles), followed by RSM (in about 25% of articles) and RF (in about 10% of articles). These numbers also indicate that most of the work in these articles addresses prediction tasks. Bioethanol concentration is the most common output variable in the fermentation step, while fermentable sugar and glucose concentrations are the most common output variables in hydrolysis. No such generalization was possible for pretreatment due to the diversity of the goals and of the pretreatment processes. The datasets used in the analyses are usually small, while the goodness of fit of the models developed is usually high, judging by the R² values reported in the papers.
In addition, major challenges related to ML approaches were discussed in detail under three main steps: constructing the dataset, selecting and implementing ML algorithms, and interpreting the results. It was concluded that, due to the complexity and multi-step nature of the lignocellulosic ethanol production process, the availability of a sufficient amount of data will likely remain a problem in the future. One way to improve data availability is to adopt standardized testing and reporting protocols within the field so that more data can be combined and used for ML analysis. New developments in ML, such as transfer learning, explainable ML, and algorithms designed for small datasets, may also contribute to the development of the field.