A data-driven optimization model for the workover rig scheduling problem: Case study in an oil company

After completion, oil wells often require intervention services to increase productivity, correct oil flow losses, and solve mechanical failures. These interventions, known as workovers, are made using oil rigs, an expensive and scarce resource. The workover rig scheduling problem (WRSP) comprises deciding which wells demanding workovers will be attended to, which rigs will serve them, and when the operations must be performed, minimizing the rig fleet costs and the oil production loss associated with the workover delay. This study presents a data-driven optimization methodology for the WRSP using text mining and regression models to predict the duration of the workover activities and a mixed-integer linear programming model to obtain the solutions for the model. A sensitivity analysis is performed using simulation to measure the impact of the regression error in the solution.


Introduction
Oil and gas production relies on several techniques and associated equipment that are responsible for lifting the oil to the surface of the well. Eventually, equipment failures require intervention services to restore productivity or correct oil flow losses. These interventions, known as workovers, vary from recompletion to restoration, cleaning, stimulation, and others operations that require the use of oil rigs (Chaudhuri, 2011). Oil rigs are expensive and scarce resources that cost between US$ 50,000 and US$ 700,000 per day, depending on their type, market, and operational characteristics (Kaiser and Snyder, 2013;Osmundsen et al., 2010).
An undersized fleet of rigs might lead to delays in oil production, jeopardizing the profitability of the wells. In contrast, an oversized fleet may lead to high idleness and opportunity costs. Consequently, rig fleets must be properly planned and scheduled to ensure that the rigs will be available at the right place at the right time with the lowest possible cost (Santos et al., 2021).
Each well has its characteristics and properties, which usually require a specific type of workover rig to serve it (Fernández Pérez et al., 2018). Moreover, workover operations are of varying complexity; some wells may require a single day for an intervention to be completed, while others can require months. As a result, it might not be possible to execute all workovers operations within a given planned time horizon.
This study addresses the workover rig scheduling problem (WRSP) and proposes a data-driven optimization model that estimates the workover duration and generates rig schedules simultaneously. The duration of the workover is predicted, taking into account the decisiondependent nature of the duration, which depends on the matching between the technical specifications of the well and the rig chosen to perform the workover. We perform such predictions by means of a combination of data science techniques, which allows us to naturally model the decision-dependent nature of the workover activity duration without compromising the linearity of the model. The prediction is made based on a combination of techniques. Specifically, text mining, clustering, and regression models were used on historical data, enabling these predictions to be utilized in a mixed-integer linear programming (MILP) model that minimizes rig fleet costs and the oil production loss of the wells.
Data-driven optimization is a recent trend in the Operation Research community that combines mathematical programming with data science and statistical algorithms. Hence, the proposed combination of mathematical programming with text mining, clustering, and regression models contributes to this trend. Furthermore, there is a lack of data-driven optimization models in the rig scheduling problem, as mentioned by Santos et al. (2021). Therefore, the main contribution of this study is the proposed data-driven methodology to improve the representation of the decision-dependent workover duration using historical data. Another contribution is the proposed mathematical model itself, which is a reformulation of Costa and Ferreira Filho (2004)'s model for WRSP with more realistic assumptions, such as a heterogeneous fleet of rigs, multi-objectives, and rig eligibility. Finally, the model is applied to realistic instances, contributing to the connection between academia and industry. These instances are generated based on historical data of the studied company and are realistic to the extent they can represent the problem's main features. Lastly, the proposed data-driven model is compared with the methodology used in practice to set the rig schedules, and this analysis demonstrated the benefits of more accurate predictions for the workover duration.
The paper is divided into six sections. Section 2 reviews the literature on the rig scheduling problem. Section 3 presents the WRSP under study and the methodology used in this research. Section 4 presents the data treatment methods utilized. This treated data is used in regression models to predict the workover duration in Section 5. Two mathematical programming formulations using the outputs from the data treatment and regression models are proposed and tested for the studied WRSP in Section 6. Section 6.3 performs a simulation of different solutions to measure their sensitivity against the prediction error associated with the regression. Lastly, Section 7 reflects on the final considerations of the research and potential future studies of the WRSP.

Literature review
The workover rig scheduling problem is a particular case of the rig scheduling problem (RSP), the scheduling and allocation of well activities to rigs aiming to avoid delays and optimize the use of resources (Eagle, 1996). According to Santos et al. (2021), the RSP can be divided into four major classes of problems: • Drilling Rig Scheduling Problem (DRSP): drilling and completion rig scheduling problems, where scheduling is an isolated choice from the rest of the field development decisions; • Workover Planning : rig scheduling of workover activities, which is typically separated from the other rig-related decisions as they are planned in the production phase. It can be classified into two sub-groups according to the application of routing: workover rig scheduling problems (WRSP) and workover routing and scheduling problems (WRRSP); • Resource Planning : rig scheduling incorporates the planning of different resources besides rigs, such as offshore support vessels (OSVs), equipment, and crews. An example is the planning of the OSVs used to lay the pipes connecting the wells and platforms; their connections can only begin after well drilling and completion (Abu-Marrul et al., 2020). • Field Planning : when rig scheduling is integrated with other oilfield development decisions, such as field design, reservoir modeling, and production flow scheduling. In these cases, the RSP relies upon or affects other parts of the field development; The first articles about RSP were from Aronofsky and Williams (1962) and Aronofsky (1962). The authors proposed two linear programming models for the planning of oil production. At that time, these mathematical models required considerable computational effort, preventing any functional application (Pittman, 1985). Consequently, most of the developments regarding the RSP were simplified, using approximation techniques (Barnes et al., 1977) or decision-making rules (Cochrane, 1989). With the improvement of computer processing capabilities and optimization techniques in the 1990s, RSP studies began to broaden themselves, as mentioned by Santos et al. (2021).
There are several literature reviews considering the RSP. Bassi et al. (2012) studied the workover rig routing and scheduling problem and presented a literature review about its setting. Bissoli et al. (2016) also performed an extensive review on the workover routing and scheduling problems, focusing on its drivers. According to the authors, the RSP trends were to approximate the problem with real-life scenarios through new objective functions, mathematical formulations, solution methods, and dynamic or stochastic approaches. Santos et al. (2021) expanded on Bissoli et al. (2016)'s study with a systematic literature review covering most variants of the rig scheduling problem. The authors proposed a unique taxonomy for the RSP addressing its key features and reviewed 130 studies, detecting several gaps and trends in the literature, such as a trend for optimization under uncertainty and a lack of data-driven optimization models, which this paper intends to fulfill.
Others authors have provided a general analysis that relates to the RSP. Tavallali and Karimi (2014) and Tavallali et al. (2016) discussed the planning and development of oilfield decisions and associated perspectives, reviewing several studies, including some on rig scheduling. According to Tavallali and Karimi (2014), rig scheduling is an open research topic that needs more attention. Tavallali et al. (2016) focused on reservoir models and their optimization approaches but proposed a general classification for field development problems, in which the rig scheduling is an oilfield operation decision. The authors highlighted the lack of scheduling studies for drilling new wells and suggested that it should be an integral part of well placement models and oilfield development planning. Khor et al. (2017) also performs a review of field development problems but focuses on the optimization methods used rather than the problems.
This study focuses on the workover rig scheduling problem. Therefore, the literature review presented in this section will be limited to workover planning problems and separated according to the use or not of routing: workover rig scheduling problems (WRSP), Section 2.1, and workover rig scheduling and routing problems (WRRSP), Section 2.2.

Workover rig scheduling problem
The workover rig scheduling problem was first addressed by Barnes et al. (1977), proposing two approximation techniques to minimize the loss of oil production and testing them on a small and short-term instance. Pioneering advances in the WRSP were made by Ferreira Filho (2004, 2005). The authors proposed a linear integer programming model and 300 real-life instances for the problem that was used in many other studies later. Thus, different heuristics were Table 1 Summary of the studies approaching the workover rig scheduling problem (WRSP).

Authors (Year)
Field Instances Jobs Fleet Approach Objectives Barnes et al. (1977) -Real data Single Homogeneous Heuristic Single Ferreira Filho (2004, 2005) Onshore  Pérez et al. (2018) Onshore Public data Single Heterogeneous Simu-Optimization Single tested or created for the problem, such as a maximum priority threecriteria heuristic, MPTH (Costa and Ferreira Filho, 2004); a dynamical assemble heuristic, DAH (Costa and Ferreira Filho, 2005). Aiming to address large instances, Ribeiro et al. (2011) proposed a simulated annealing (SA)-based heuristic that uses SA to create a preliminary solution and iteratively enhance it with SA, which allowed it to surpass other methods in the instances of Costa and Ferreira Filho (2004), such as GRASP, GRASP-PR, DAH, BS, SS, MA, and GA-2opt.
A few other variations of the WRSP can be found in the literature. For instance, Lasrado (2008) developed a software application using manual procedures combined with reservoir simulation (de Andrade Filho, 1994) to create schedules minimizing the number of rigs and the traveling distances, which reduces contract and transportation costs. Marques et al. (2014) proposed a decision support system that schedules a homogeneous fleet of offshore rigs aiming to minimize its size and utilization through MILP. Monemi et al. (2015) considered a heterogeneous fleet of rigs, presenting a new MILP model with arc-time-indexed formulations and two techniques: branch-price-and-cut (BPC) and hyper-heuristic (HH) that obtained near-optimal results in a remarkably short time. This same problem was addressed by Danach (2016) with a binary linear programming model and a HH, which was examined in a real case, and presented problems solving the large instances. The researchers suggested future improvements in the efficiency of the mathematical formulation. Pérez et al. (2016) adapted the binary linear model from Costa and Ferreira Filho (2004) to the case of heterogeneous onshore rigs, proposing a decomposed reformulation with fewer variables and constraints, obtaining new exact solutions for Costa and Ferreira Filho (2004)'s large instances and surpassing the heuristic methods. This mathematical model was later reformulated by Fernández Pérez et al. (2018) to take into account uncertainty in the duration of tasks through a stochastic programming model that minimizes the loss of oil production and the costs of the drilling fleet. The model was tested in instances adapted from Paiva et al. (2000), Costa and Ferreira Filho (2004) and Ribeiro et al. (2012a) in terms of the problem's features, using different scenario generation methods, such as Monte Carlo simulation and Quasi-Monte Carlo. Next, Table 1 summarizes the WRSP studies presented in this section.

Workover rig routing and scheduling problem
When the wells demanding workovers are not concentrated near to each other and the traveling time between the wells is not negligible, routing techniques are required, which leads to the workover rig routing and scheduling problem (WRRSP) (Bissoli et al., 2016). The WRRSP discussion began with a SA proposed by Paiva et al. (2000) aiming to minimize the oil production losses and costs of a homogeneous fleet of workover rigs.
Meanwhile, other researchers concentrated on new modeling approaches for the WRRSP with a homogeneous fleet. Duhamel et al. (2012) proposed a MILP model based on Aloise et al. (2006), another method based on the open vehicle routing problem, and a set-covering model using Dantzig-Wolfe decomposition and an alternative column generation method with variable neighborhood descent and GRASP. Finally, Kromodihardjo and Kromodihardjo (2016), in a combinatorial optimization approach, employed discrete simulation to perform an exhaustive search in the problem, which also led to reasonable solutions in small real-life instances.
Similarly to the WRSP, some authors address the WRRSP with heterogeneous rigs. Aloise et al. (2006) designed a VNS heuristic mixing swap (changing the wells allocated to a rig) and insert move (inserting wells to a rig itinerary) and implemented it in a Brazilian company, which led to savings of approximately 2.5 million dollars per year. Using column generation, ng-path relaxation, subset-row inequalities, and TS, Ribeiro et al. (2012a) proposed a BPC algorithm to optimally solve real-life examples with as many as ten rigs and two hundred wells. Ribeiro et al. (2014) compared this BPC from Ribeiro et al. (2012a), the ALNS made by Ribeiro et al. (2012b), and the VNS from Aloise et al. (2006) with a hybrid-GA (HGA) that outperformed the other methods.
Focusing on the data exploration to enhance the solution quality, Vasconcelos et al. (2017) combined a GA and operational historical data to minimize the non-productive time of wells, testing it on a petroleum company and improving 20 to 40% of the operational and navigation time. Another GA was proposed by Tozzo et al. (2020) to minimize multiple objectives (rig fleet costs and oil production loss).
As the business environment has become more dynamic nowadays and many decisions are made without knowing the full picture, there is a trend in the Operations Research community to optimize under uncertainty, which can be observed for the WRRSP in the studies of Bassi et al. (2012), and Silva and Silva (2018). Bassi et al. (2012) developed a method to simulate the duration of the workovers and optimize the schedule with GRASP. Last, Silva and Silva (2018) introduced a WRRSP in which the decision maker does not know beforehand where the workovers will be required (which wells will need maintenance), naming it Dynamic WRRSP (D-WRRSP). The proposed formulation was based on Ribeiro et al. (2012a)'s formulation and tested in short-term instances modified from Costa and Ferreira Filho (2004). Next, Table 2 summarizes the WRRSP discussed in this section.

Review outline and insights
The first RSP studies focused on the DRSP. Research considering workover planning only began to grow in the 2000s, with studies addressing the WRSP, most of them proposing heuristics for the problem. Sometime later, with the advances in techniques for VRP, the WRRSP started to gain attention. Nowadays, several model formulations and  Santos et al. (2021), workover planning is now the most popular subject concerning rig scheduling problems. Currently, the approaches tend to combine mathematical programming, heuristics, and simulation and take into account more realistic assumptions and objective functions, such as fleet availability and eligibility considerations (heterogeneous rigs), multiple objectives (rigs fleet costs and oil production loss), net present value, and costs varying over the scheduling horizon.
Furthermore, the complex and risky workover environment requires techniques that reduce uncertainty and can cope with errors in the data, such as stochastic/robust optimization, simulation optimization, dynamic programming, or data-driven optimization. Most of these techniques have been applied in some way in the WRRSP (Bassi et al., 2012;Silva and Silva, 2018;Vasconcelos et al., 2017). However, the WRSP has received less attention in these types of approaches. Some stochastic and robust models were proposed by Fernández Pérez et al. (2018), but there is no data-driven optimization study for the WRSP.
Another literature gap detected by Santos et al. (2021) is that more studies need to be applied in real instances and validated with the decision-makers, strengthening the integration between the academic and industry perspectives.
Aiming to fulfill these gaps, this study proposes a data-driven optimization framework for the workover rig scheduling problem for a heterogeneous fleet of offshore rigs. This data-driven approach first uses text mining and clustering algorithms to extract information from historical data from a Brazilian oil company. Then, this information is used in regression models to predict the duration of the workover activities according to the rig. Finally, an optimized workover rig schedule is obtained with an MILP model that aims to minimize oil production losses and rig fleet costs. Further details on the problem at hand and the methodology used are given in the next section.

Materials and methods
This section defines the workover rig scheduling problem, proposes a data-driven optimization methodology that tackles some of the literature gaps detected in the last section, and clarifies some key elements of the techniques used in the methodology.

Problem definition
This article considers a Brazilian oil company that operates a large number of oil fields and needs to plan a fleet of rigs to operate its offshore wells. As a result, this case study has some particularities. This large set of wells requires workover activities, and a fleet of rigs must be hired to serve them. The goal is to decide which wells will be served by which rig in the scheduling horizon, minimizing the costs associated with hiring the rigs and the oil production loss of the wells waiting for workover service. The offshore wells are relatively close to each other, and their processing times are much longer than the traveling times between them, making thus traveling times negligible. Therefore, routing considerations can be disregarded, and their scheduling sequence naturally yields a route for the rig. As a result, we can classify this problem as a workover rig scheduling problem (WRSP), which is a particular case of the rig scheduling problem for workover operations.
Workover planning is performed separately from the other operations on a stand-alone planning level. In that, a fleet of heterogeneous rigs is hired to execute them. Each rig has a particular maximum water depth and a drilling depth. Moreover, each well has a water depth and a drilling depth that cannot exceed the rig limits. Rigs have a fixed cost when hired. Others resources, in addition to rigs, are not considered in this case study.
Each well has an oil production associated with it, regardless of whether it is an injector or producer well. Further details on the oil production of the wells are provided later when we describe the instance generation (Section 6.2). Every well requires only one maintenance (or rework) operation (job or task). Basically, it is a single job scheduling problem for which we use the terms well, workover, operation, task, and job interchangeably. Furthermore, every well has a release date related to the date it starts needing workover, and there is a cost associated with the oil production loss of the wells waiting to be served, which extends until the end of the scheduling horizon if the well is not served.
Lastly, the processing time for each workover operation varies for each class of rig. However, these processing times are not known before scheduling a well to a rig. Currently, the company studied uses the average duration for the type of workover. However, historical data from the workover operations is available and can be used to predict the processing time of a particular rig in a well. Details on the historical data will be presented in Section 4.

Methodology
This section proposes a data-driven methodology for the workover rig scheduling problem, which is separated into three major phases: data treatment (in which the workover historical data is cleaned, shortened, and labeled using data science techniques, including text mining and clustering); predictive models (when the treated data is applied into predictive models to estimate the workover duration according to a well and a rig); optimization (a mixed-integer linear programming model is used to determine an optimal workover rig schedule). Fig. 1 summarizes these three phases presented in Sections 4, 5, and 6, respectively.
Data treatment is based on the data science framework from Shcherbakov et al. (2014) and separates data into two types, qualitative and quantitative data, applying text mining, clustering, and statistical techniques. As explained by Srnka and Koeszegi (2007), quantitative data refers to numerical variables, such as duration, costs, and other measures of value. On the other hand, qualitative data are categorical variables, usually represented with text, symbols, codes, and other nominal categories. The quantitative data is cleaned by removing errors, duplicated rows, and empty fields. With the assistance of plots, such as boxplots (with a multiplier of 1.5 × , where is the interquartile range) and histograms, outliers are eliminated, generating numerical variables for the predictive models.  The qualitative data is treated with text mining techniques (responsible for cleaning the data) and clustering models (which propose better groups for the treated data) to generate dummy variables.
The text mining procedures were generated using the R public packages ''tau'', ''tm'', ''SnowballC'', ''wordcloud'', and ''stringdist'' and include: • Data cleaning: which is the removal of symbols (such as: " /,@,',",|, -,_''), the converting of the text to lower case only, and the removal of numbers, accent marks, dots, and extra spaces. • Data simplification: removal of stopwords and use of the stemming technique (adapted for the Portuguese language) (Lang, 2004). Stopwords are uninformative words often common in a text, such as: articles, pronouns, and conjunctions (Sarica and Luo, 2021). The complete list of the Portuguese stopwords used is shown in Appendix A. Meanwhile, the stemming technique reduces inflected or derived words to their respective word stems, simplifying the text and making it easier to identify fields with the same meaning (Jivani et al., 2011). For instance, words such as ''removal'', ''removing'', ''removed'', and ''removes'' are replaced by their word stem ''remov''. Basically, the stemming technique and the data cleaning simplify the data. However, these techniques would still not recognize texts with the same meaning as similar. For instance, the terms ''Removing of equipment'' and ''Equipment removal''. The stopword removal would remove the ''of'' from the first text, and the stemming would transform each one of them into ''Remov equip'' and ''Equip remov'', respectively. A clustering model is used to detect these similar text fragments and group them.
The grouping of the text data was made using the R public packages ''pheatmap'', ''dendextend'', ''ggdendro'', and ''cluster'' and include the following procedures: • Distance measure: which uses string similarity and distance tools to measure how close the sentences of the qualitative data are to each other. After several tests, a custom string similarity measure was created using the Levenshtein (LV) (Yujian and Bo, 2007) and the Longest Common Substring (LCS) (Sun et al., 2015) distances. This custom string similarity measure for two strings is the mean between both these measures: where 1 and 2 in Eq. (1) refer to ''String1'' and ''String2'', respectively. The LV distance is an edit-based string similarity, whereas the LCS similarity is a sequence-based measure. Both similarity measures are efficient for short strings like the task description, and the combination of the two resulted in suitable matches. • Clustering methods: which uses the k-means algorithm (Likas et al., 2003), a partition method that separates the data into a predefined number of mutually exclusive clusters ( ). It is a pointbased clustering method that starts with the cluster centers initially placed in arbitrary positions and proceeds by moving the cluster centers at each step to minimize the clustering error (Likas et al., 2003). A crucial part of the k-means algorithm is the definition of the number of clusters ( ), which is usually defined using the average silhouette analysis. The silhouette score measures how similar objects are to their assigned clusters compared to other clusters. The score varies between −1 and +1, and a higher score indicates that the object is well-matched to its own cluster and poorly matched to other neighboring clusters (Rousseeuw, 1987).
The string similarity measure in Eq.
(1) was used as the distance for clustering algorithms that aim to group textual descriptions according to their similarities.
As illustrated in Fig. 2, linear regression models are applied in the treated data aiming to predict the duration of the workovers. Linear regression models are statistical models used to determine the relationship between a response variable ( ) and its explanatory variables ( ), which can then be used to predict response values for newly observed explanatory variable values. Two types of regression are tested and evaluated: • Generalized linear models (GLMs): it is a generalization of ordinary linear regression models that accepts response variables with errors following an exponential family distribution, not necessarily a normal distribution as the ordinary models (Nelder and Wedderburn, 1972). The value predicted by the GLM for the observation is a linear sum of the effects of one or more explanatory variables , as shown in Eq. (2): where = {1, … , } represents the set for all observations, = {1, … , } denotes the number of explanatory variables (or features) ( ) used, and represents their effect on the response variable (McCullagh and Nelder, 2019). • Ridge regression (RR) models: RR is a multiple regression technique adapted for data with multicollinearity (when the least-squares estimates are unbiased, but their variances are significant, causing them to be far away from the actual value). Ridge regression adds a degree of bias to the regression estimates by adding a penalty in the sum of the squares (L2 normalization), reducing standard errors. This technique is recommended for regression models with near-linear relationships among independent variables or many independent dummy variables (Hoerl and Kennard, 1970). • Lasso regression models: Lasso or least absolute shrinkage and selection operator is another type of multiple regression technique with regularization that adds bias by penalizing the sum of the absolute values (L1 normalization). This technique is also recommended for regression models with a near-linear relationship among independent variables or a large number of dummy variables (Tibshirani, 1996). As mentioned by James et al. (2013), the Lasso regression can sometimes be used for feature selection as it can completely reset the coefficients.  • Elastic net regression models: Elastic nets are another type of regularized linear regression that combines the L1 and L2 normalizations, i.e., the ridge and lasso regression models, resulting in a more stable feature selection from the L1-normalization and grouping correlated variables using the L2-normalization (Zou and Zhang, 2009  In the GLM, the error variable follows a distribution of the exponential family, which includes the Normal, Poisson, Binomial, and Gamma distributions. Linear coefficients are estimated using the maximum likelihood estimation (MLE) method if the residuals are non-Normal or ordinary least squares (OLS) otherwise (Yuan and Yang, 2005;Yan and Su, 2009;Mahmoud, 2019). Several packages are available in the R programming language to estimate generalized linear models. In this study, we used the native library Stats (R Core Team, 2013) and the package olsrr (Hebbali and Hebbali, 2017). These packages allow one to estimate the coefficients of the model that minimize the loss function.
However, if there are many dummy variables (as a result, a large number of coefficients), the model can overfit the training data and might not perform properly on an out-of-sample data set. Aiming to assist in those cases, regularization techniques can be used to reduce the number of features and prevent overfitting results, such as ridge regression (McDonald, 2009). As this study proposes using qualitative data as an input to predict the unknown workover duration, a large number of independent dummy variables may be generated. Therefore, the ridge model has been chosen as an alternative testing method. The ridge, lasso, and elastic net regression models were estimated using the glmnet (Engebretsen and Bohlin, 2019), stats (R Core Team, 2013), and Caret (Kuhn et al., 2020) libraries for the R programming language.
Using the previous libraries for GLMs and ridge regression, a procedure was created, exhaustively testing all possible combinations of response variables to predict each of the regressions mentioned above. Based on the hold-out validation, the procedure separates 80% of the data as an in-sample and the others 20% left as out-of-sample data. Insample data is used to train the regression model, and the out-of-sample data is used to predict and evaluate the trained models. The GLMs are fitted using the iteratively reweighted least squares (IWLS) (Street et al., 1988). Meanwhile, the ridge regression models are trained using a 10-fold cross-validation (Bengio and Grandvalet, 2004) within the insample data. The trained models are then evaluated for their prediction capabilities using the out-of-sample data with the following metrics: root-mean-square error (RMSE), R-squared ( 2 ), and -value fit for residuals normally distributed. The goal is to choose a model with a high R-squared, low error, and possibly low complexity and having residuals normally distributed. The Caret package (Kuhn et al., 2020) was used to train and select the regression models as it automatically selects the optimal features and parameters, allowing to decide the algorithm to choose between ridge, lasso, and elastic nets. Last, the selected model is used to predict the duration. In what follows, we apply the methods described in Section 3.2 and present the results.
With the duration predictions, a MILP model is optimized using the Gurobi solver v. 9.1.2 (Gurobi Optimization, 2018), generating a workover rig schedule. Next, we apply this proposed data-driven optimization methodology to the workover rig scheduling problem. Section 4 presents workover data treatment results. Section 5 tests and selects the regression models for the workover duration. Finally, Section 6 compares different mathematical programming formulations for the WRSP.

Workover data treatment
As mentioned in Section 3.1, the workover duration is unknown before scheduling the workover rigs. Currently, the studied company uses an average duration according to the type of workover. However, there are historical data that can be used to estimate the workovers following the methodology proposed in Section 3.2. Table 3 summarizes this historical data according to the data group (well or rig attributes) and type (qualitative or quantitative data): Most of this information is qualitative data, i.e., non-numerical. Only a few fields are quantitative data (numerical), such as those related to depth and water depth. Furthermore, there are several issues with the qualitative data that require corrections. For instance, the workover groups and workover types are poorly grouped, making it hard to obtain any distribution for the duration using only this information. Aiming to enhance the task grouping, a data treatment methodology based on the data science framework by Shcherbakov et al. (2014) is used to obtain representative task groups and to improve the qualitative data in the case study. The proposed method uses the well data with the task description, which is unstructured, with unnecessary words and letters, and prone to errors. Fig. 2 illustrates the proposed methodology.
An example of the cleaned and simplified data is shown in Appendix B. Word cloud plots were made to check for any patterns in the data. Fig. 3 contains two word-cloud plots, (a) for one word alone (1-g) and (b) for two words together (2-g). We can observe that some words are more common in the task description, such as ''abandon'' (when a well needs to be abandoned), ''troc'' and ''substitu'' (related to the replacement of equipment in the well), and ''bcs'' (which is a Portuguese acronym for Bombeio Centrifugo Submerso, in English: Electrical Submersible Pump, ESP). However, many sentences still have similar meanings and could technically be considered the same sentence. For  instance, ''substitu bcs'' (replacement of ESP) and ''bcs substitu'' (ESP replacement) share the same meaning. This issue also occurs with ''abandon definit'' (abandon definitively), ''definit abandon'' (definitive abandonment), and other sentences. String similarities combined with clustering algorithms can be used as a grouping model to detect text with similar meaning. The string similarity measure in Eq.
(1) was used as the distance measure of a k-means algorithm (Likas et al., 2003) to group cleaned textual descriptions according to their similarities. With the silhouette analysis, two strategies were selected to cluster and classify the workover tasks. The first clustering strategy separates the task description into major groups of tasks ( = 7, fewer clusters). Meanwhile, the second clustering strategy selects smaller groups of tasks description, but not too small ( = 45, more clusters). We have chosen to use the second with (clusters) equal to 45 as they contained more information that was hidden in the historical data, providing 45 new groupings for the workover operations based on the string similarity of the task descriptions.
Overall, the text mining procedures were able to clean the qualitative data, which had several errors, and to extract only the critical information. Furthermore, the clustering algorithms are powerful tools to group the essential knowledge and obtain new data classifying the workovers. Finally, this data with the new grouping is analyzed in a feature engineering perspective, using correlation, standard deviation, and pair plots to carefully select the features that are associated with workover duration and are more likely to improve the regression models. Fig. 4 presents the correlation or strength-of-association of the features in the data set with the workover duration, using Pearson's R for continuous-continuous cases, correlation ratio for categoricalcontinuous cases, and Cramer's V for categorical-categorical cases.
The first features in Fig. 4 are over-correlated with the workover duration as there are not enough observations for its several categories and, therefore, were removed as a possible feature to be used. Nonetheless, many other significant features were detected, such as 'Bloc', 'Rig type', 'Clusters45', and 'Rig Water Depth' have a significant association. With the support of a standard deviation analysis and a complete correlation matrix of the features (presented in the Appendix), 30 features were selected to be used as an input to the duration prediction in the following section, which presents the regression models used to model the workover durations after the data treatment.

Regression models for the workover duration
Statistical techniques play an essential role in the oil and gas upstream. There have been several successful cases using statistics to predict operation times and to support their planning. Desai et al. (2020) reviewed some of these studies and mentioned techniques such as regression models, neural networks, machine learning, and support vector machine models. Motivated by Desai et al. (2020), this study uses the treated workover data (Section 4) to obtain parametric regression models to predict the workover duration, as explained earlier in Section 3.2. Two types of regression are tested and evaluated: GLMs and ridge regression models.
To test for a setting with the better fitting of the regression models, some transformations of the well workover duration, when served by rig ( ), were considered. Specifically, a logarithmic scale ( ( )) and a normalization ( ) were applied to the data. Finally, alternative settings for the regression modes were considered. For example, GLMs were tested using Gaussian and Gamma distributions, and ridge regression (RR) models were tested using Gaussian and Poisson distributions. Using the testing procedure described in Section 5, all combinations of response variables to predict the workover duration were exhaustively tested for each of these regression settings. The best results for each regression model and setting are presented in Table 4. The labels generated with the data treatment and clustering are represented by the field 45 , where each task description is associated with one of these 45 clusters. The other independent variables are the data fields described in Table 3. The column ''R 2 '' is the adjusted R-squared for the regression; ''RMSE'' refers to the root-mean-squared deviation; ''MAE'' refers to the mean absolute error. The subscripts and refer to in-sample and out-of-sample, respectively. Last, the column '' -value'' refers to the hypothesis that the errors of the regression estimation for the duration are normally distributed.
Analyzing Table 4, we can observe that all the best-performing regressions use data related to the well ( ) with some data from the rig. Attributes such as Basin (the basin in which the well is associated) and RigType (the type of rig used) are important dependent variables selected in all the best regressions. The smaller clusters ( 45 ) resulting from the text mining and grouping (Section 4) were also a common attribute in most of the regression models, which indicates that the techniques were successful in revealing the underlying task description. As expected, the number of independent variables is smaller in the ridge regression as this technique penalizes the models for an excess of size and dummy variables. The best-fitted model was the ridge regression using a logarithmic duration for the workover ( ( )). The Gaussian distribution has a good adjusted R 2 (slightly lower than using the Poisson distribution) and a better -value for a normal distribution for the errors, suggesting that it would be easier to fit distributions for them. Therefore, we have chosen to work with the duration log as I.M. Santos et al. Fig. 4. Associations between features and the workover duration and its logarithmic scale.
where is the duration of the well workover performed by rig , ℎ is the depth of the well , represents the subpool responsible for the well , refers to the exploratory basin where the well is located, 45 is the cluster for the descriptions of the operation executed in the well (obtained using k-means for = 45), indicates the rig type, and is the residual or error of the regression.
Using this regression, Eq. (3) can be rewritten and simplified to the following linear regression. . Finally,̃is its approximation,̂is its prediction from the regression, i.e.,

= + +
, and the distribution of can be estimated using the regression residuals.
The following section describes the use of the workover data treated in Section 4 and the workover duration estimated in this section to optimize the workover rig schedule.

Optimization models
As mentioned in the literature review in Section 2, several formulations have been proposed for the rig scheduling problem. Ferreira Filho (2004, 2005) proposed models using a time-indexed formulation for the WRSP, consisting of the first formulations for the WRSP. The authors used routing elements to define the sequence in which the rigs serve the wells and scheduling rules to determine when each workover is performed. Although it was a time-index formulation, the model proposed in Ferreira Filho (2004, 2005) had several routing elements, such as flow balance constraints to ensure the correct sequencing of workover activities in each rig. Their objective function aimed to minimize oil production. As a result, this formulation was easily adapted for this WRSP study, removing the time-index elements and modifying it to a routing formulation with release dates for the operations, rig hiring costs, and the selection of which wells to serve as part of the WRSP. Ferreira Filho (2004, 2005) did not consider any release date for the workover activities, so a new constraint for the release date was created. Their objective function was to minimize oil production loss only, and all wells were required to be served. We modified the objective function to consider the rig hiring costs and a penalty for not performing a workover in a well. Furthermore, we added a fictional depot node 0, in which all hired rigs must start their ''routes'' and return to it at the end of the scheduling horizon. Despite being a routing model, the travel times between the wells were considered to be negligible. However, the formulation can be easily adapted to a workover rig routing and scheduling problem (WRRSP) if the context requires it. This new model, its objective function, and its constraints are presented below. In addition, its sets, parameters, and variables are detailed in Appendix D.
The objective function (5) minimizes the total cost. The first two terms represent the oil production loss, which can be associated with the time until the execution of the task after it is released (first term) or the production loss from the entire time horizon (since the well is released) when the well is not served (second term). The last term of the objective function is related to the fleet size cost.
Constraints (6), (7), and (8) are flow balance rules from the vehicle routing formulation, where the last two constraint guarantees that a well or can only be served once. Constraints (9) calculate each task starting time ( ) according to the previous service of the rig ( + ). Notice that the dependence between workover duration and the allocated rig is represented by the index in the parameter̂. The actual duration of workover activity is then given by ∑

∈̂.
However, in constraint (9), we can remove , which we noticed to make the linear formulation stronger. Constraints (10) guarantee that the task starting time ( ) respects its release date ( ). Constraints (11) connect variables and , forcing the model to hire a rig ( ) to execute a task with this rig . The other constraints (12), (13), and (14) are related to the variables' domains. Note that this model could be easily adapted to a WRRSP by simply adding the duration of the travels between well and using rig with the duration of the intervention in well ( ′ = + ) and replacing it in the model, more specifically in Eqs. (5) and (9). Next, we show how we have reformulated the model (5)-(14) to achieve better computational performance.

Reformulated workover rig scheduling problem model
Aiming to improve the performance of the WRSP model, we propose a reformulation adding new auxiliary variables hoping to help the branching process of the MILP solver employed. The additional auxiliary variables required are detailed in Appendix D. Their use aims to avoid summations inside the constraints, which can then improve the linear programming relaxation of the problem. The objective function terms were equivalently reformulated with the auxiliary variables. As shown in Eq. (15), it minimizes the total costs associated with the oil production losses and the fleet size cost.
Subject to: 1 = 2 ∀ ∈ , ∈ New constraints were added to define the auxiliary variables and simplify the equations. Constraints (16) are flow balance rules. The new auxiliary variables ( 1 , 2 , and ) are defined in constraints (17), , and (20), and they guarantee that a well can only be served once. Constraints (21) calculate the starting time ( ) of each task according to the previous service of the rig ( + ). Constraints (22) guarantee that the task starting time ( ) satisfies its release date ( ). Constraints (23) connect variables and 1 , forcing the model to hire a rig ( ) in order to execute a task with this rig ( 1 ). The other constraints (24) to (29) state the domains of the variables.

Computational experiments
To test the proposed data-driven optimization methodology for the workover rig scheduling problem, data from a major Brazilian oil company were gathered and structured. A total of 74 real-life based  instances were created based on these data. A detailed description of the instance generator is provided in Appendix E. Instances in this study vary according to the number of rigs (2, 3, 5, 10, and 15), the number of wells (15, 25, 50, and 75), the release date density (0.1, 0.5, and 0.9), and the random seed used for drawing numbers and replicating the instance. These instances were used to compare the two formulations, analyze their robustness and the impact of the regression error on the mathematical models, and compare the trade-off between the proposed data-driven model and the current technique used by the company. The computational experiments were performed in a computer with Intel ® Core ™ i7-8565U CPU and a 20.0 GB RAM memory. The models were implemented using the Julia programming language (Bezanson et al., 2012) and optimized with Gurobi solver v. 9.1.2 (Gurobi Optimization, 2018). Table 5 presents a solution comparison between both models, the original model (I) and the reformulated model (II), for different instances with a scheduling horizon of 360 days and using the different seeds (1019, 2657, and 3229) in the instance generator. The terms ''UB'' and ''LB'' are acronyms for ''Upper Bound'' and ''Lower Bound'', respectively, both in million (M) dollars. A time limit of 3600 s was also enforced to solve the models.
The results in Table 5 show the gap and the computational time difference between the two mathematical models. Model I (the original formulation) requires, in most instances, a longer time than Model II (the reformulation with auxiliary variables) to obtain optimal solutions. In the larger instances, both models started reaching the 3600-second time limit, but the GAPs from the original model are consistently higher than those from the reformulated model. These results indicate that, despite the more significant number of constraints and variables in the reformulated model, the auxiliary variables reduce the computational effort required and enable the model to obtain better solutions.
Another important analysis is to compare the proposed data-driven optimization methodology with the current approach of the company. As mentioned in Section 3.1, the company uses the average duration based only on the type of workover. Using instances generated with out-of-sample records, the reformulated model was used to generate optimal schedules according to a given duration. Three types of duration are used: the average planning duration (̄), which is the current strategy used by the studied company and does not consider any information about the rig; the regression estimation (̂), which is the proposed strategy using the data-driven model and depends on the rig and the well; the actual duration of the well (̃), which was obtained from the out-of-sample historical data (as the optimization cannot guarantee that the rig performing the workover is the same from historical records, this duration is not influenced by the rig in this case).
Aiming to analyze the robustness and the flexibility of the model's solution, i.e., the capacity to accomplish what was planned by the model, we performed the following experiment. We considered the actual workover duration for each well (̃), the rig fleet, and the list of served wells obtained when performing the schedule using estimated workover durations (average duration,̄, as currently done by the studied company, or̂, estimated using the proposed data-driven method). This analysis emulates the process of planning the workover resources beforehand in terms of defining which rigs will be hired and how contracts (i.e., which wells are to be served by the hired rigs) are designed in advance. This comparison is presented in Fig. 5. The vertical axis in Fig. 5 is the percentage of deviation of each comparison in terms of the objective function value.
Clearly, the solutions using the duration estimation through the proposed regression model are closer to the ''best possible'' solutions (obtained with the actual duration of the workover) than the solutions generated with the current approach of the studied company (average duration). Furthermore, the regression solutions fit better with the real duration of the workover activities, as the rescheduling not only is closer to the ''best possible'', but also varies much less than the solutions using the average duration.

Sensitivity analysis
In the previous section, the robustness of the data-driven optimization model was tested against the non-data-driven model. Undoubtedly, the data-driven approach generates solutions closer to the ''best possible'' solutions (obtained with the actual duration of the workover) than the traditional method. Nonetheless, this solution is dependent on I.M. Santos et al. Fig. 8. Associations between features and the workover duration and its logarithmic scale. the quality of the regression model selected. The duration predictions can vary due to the error associated with the regression, which might impact the data-driven model results.
To check the impact of the regression error component on the objective function value of the data-driven model, a sensitivity analysis was performed, simulating the workover duration estimated by the regression. As mentioned earlier, the regression estimation and the actual duration of the workover differ from each other according to regression error, i.e.,̃=̂+ , wherẽis the actual workover duration,̂is the regression estimation, and represents the regression error, which follows a normal distribution estimated as ∼ ( = 1.054672, = 7.438810). In this sensitivity analysis, the WRSP is optimized using the duration estimated by the regression (̂). The solution of each optimal schedule is fixed in the number of rigs, and the wells that can be attended to and 500 simulations of the regression error ( ) are made by sampling from the normal distribution ∼ ( = 1.054672, = 7.438810), to determine the actual duration of the workover (̃) for each simulation. With this duration using the regression error, a reschedule is generated according to the rigs and wells selected in the first schedule. The rescheduled solutions are used to obtain a confidence interval for our data-driven optimization model. Fig. 6 presents this sensitivity analysis for each instance according to the number of rigs (horizontal axis), and the number of wells (color labels). The objective function is given by the markers' relative position on the vertical axis, and the error bar represents the confidence interval of this objective function calculated using the t-score ( ℎ 2, −1 ), where alpha is 5%, and is the sample size of 500 replications. Analyzing Fig. 6, we can observe that the objective function and its variability are highly influenced by the instance size. The larger the number of wells needing intervention, the larger the costs associated, as expected. The number of rigs is also important; a small number of rigs reduces the solution flexibility, and when the number of rigs is sufficiently large, increasing the selection of available rigs allows the model to select cheaper and better rigs, reducing the costs.
Overall, these results indicate that, with 95% of confidence, the decision maker will not observe losses greater than 15%, and that is for the most uncertain case, which despite providing some comfort in terms of reliability, does indicate potential benefits from additional uncertainty mitigation measures.

Conclusions
Oil rigs are an expensive and scarce resource used in critical oiland-gas production operations, such as workover activities, which are intervention services made on the wells to recover productivity or correct oil flow losses. To support the planning of these operations, we studied a real-life workover rig scheduling problem (WRSP) in which a company needs to decide which wells must be served by which rig in the scheduling horizon, minimizing the costs associated with hiring the rigs and the oil production loss of the wells waiting for    workover service. This oil company made available historical data on rig schedules and needs to accurately predict the duration of the new workover activities using the new rig. Two mathematical programming formulations are developed, and a data-driven optimization methodology is proposed for the WRSP. Basically, the proposed approach is separated into data treatment (cleaning, simplification, and labeling of the workover historical data through text mining and clustering), predictive models (estimation of the workover duration according to a well and a rig using historical records), and optimization (definition of an optimal workover rig schedule using MILP).
Computational experiments made on real-life-based instances showed the superior performance of a reformulated version of the optimization model (II), obtaining better or equal objective function in 122 of 126 instances when compared with the results of the initial model (I). Furthermore, the use of the auxiliary variables in Model II improves the formulation of the model and, as a result, reduces the computational effort.
Regarding the proposed data-driven approach, the text mining and clustering procedures proved to be an efficient way of labeling historical data and acquiring hidden information. The combination of these data science techniques with the regression model improved the prediction of the workover duration, which is currently poorly estimated by the studied company. As a result, the solution of the proposed datadriven optimization methodology obtained solutions much closer to the ''perfect'' schedule (optimal schedule with the actual duration) than the schedules generated with the company's current methodology. The proposed approach achieves solutions with a deviation of less than 10% and therefore requires considerably less rescheduling. Meanwhile, the current approach employed by the company usually has deviations of 40 to 80%, requiring more frequent rescheduling. This indicates how well the regression model can represent the uncertain workover duration and its dependency on the rig allocation, which in turn leads to more stable and reliable schedules.
Nevertheless, the R 2 values of the selected predictor are still in the order of 0.5, which should prompt further efforts to improve the prediction accuracy. This can be achieved by, e.g., investigating further feature engineering strategies. Furthermore, alternative prediction methods can be tested and compared with the current data-driven methodology, such as gradient-boosted trees and random forest or support vector clustering.
Every regression model has an error associated with the regression residuals, i.e., the difference between the estimation and the actual value of the predicted variable. A sensitivity analysis performed by simulating the regression error showed a low deviation from the objective function, demonstrating that the proposed data-driven optimization methodology is suitable for the problem. Nonetheless, the uncertainty embedded in the duration estimation could be explored in future studies. For example, stochastic programming could be used to consider scenarios and optimize the best average solution, a direction that the authors intend to pursue in future research. Another possibility is to use robust optimization and minimize worst-case scenarios or to use chance-constrained programming to control the level of feasibility of the schedules, potentially reducing the need for rescheduling. Last, the proposed data-driven methodology could be applied to similar scheduling problems.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

•
: duration of the intervention in well using rig (in days). The processing time of any rig in the fictional depot node 0 is equal to 0.
• : integer variable equal to the starting time of task in days.

•
: binary variable representing if rig is hired (used) or not.
Auxiliary variables (only for the reformulated model): • 1 and 2 : respectively, if a rig arrives at or leaves from well .
• : if any rig serves (enters and leaves) well .

Appendix E. Instance generator
The instance generator uses historical data to generate instances for the optimization, as described in Fig. 10 Based on Wigwe et al. (2020), the wells' oil production (in bbl, barrels) were generated randomly according to their type, using the Gamma distribution, as follows in Eqs. (30) and (31): where is the loss of oil production from the well, and are a parameter that makes the oil production loss proportional to the operation type and well type (respectively), and 0 are the random oil production generated using the Gamma distribution Table 9 Proportional scales of oil production according to the type of well and operation. , in which and are the shape and the scale of the distribution (respectively). Fernández Pérez et al. (2018) suggested using the price of the oil barrel as 55 $ / barrel. As a result, the oil production loss cost in dollars is equal to 55 ⋅ . Details on the proportional scales values, and , are provided in Table 9: The rig hiring and operation costs were randomly selected from the Markit (2021) database, which has historical information of the rig average day rates according to the type of rig and the market.
Using the wells and rigs data sets, an instance generation algorithm was developed, such that it creates instances for a desirable number of rigs, wells, scheduling horizon, random seed, and density coefficient (represented by ). The random seed is a number used to initialize the random number generator and to allow the reproduction of the instance. As to the density coefficient, it is a setting parameter between 0 and 1 that controls the release dates of the workover. A small rho ( ) tends to result in latter release dates, reducing the feasible windows of the tasks. However, a large would generate smaller release dates, increasing the window of allocation of the workover. The algorithm selects random samples of the set to generate instance sets and parameters. With the sets and parameters selected, the algorithm calculates an eligibility matrix that indicates which rigs from the sample set can serve the sample wells. This eligibility matrix is calculated according to the rig data (the type of rig, the rig's maximum water depth, and the rig's maximum depth) and the well data (the well's water depth, the well's depth, and the rig type that can attend the well). A rig will only be able to serve a well if the well's maximum water depth, depth, and type are within the rigs specifications. During the construction of the eligibility matrix, the algorithm checks the feasibility of the instance, that is, if there is a rig for every well and if all wells have a rig to serve it. In case of infeasibility, new samples are calculated until a feasible instance is found, outputting this instance to the data-driven models.