Artificial Intelligence (AI)-Aided Structure Optimization for Enhanced Gene Delivery: The Effect of the Polymer Component Distribution (PCD)

Gene therapy has emerged as a significant advancement in medicine in recent years. However, the development of effective gene delivery vectors, particularly polymer vectors, remains a major challenge. Limited understanding of the internal structure of polymer vectors has hindered efforts to enhance their efficiency. This work investigates the impact of polymer structure on gene delivery, using the well-known polymeric vector poly(β-amino ester) (PAE) as a case study. For the first time, we revealed the distinct characteristics of individual polymer components and their synergistic effect, namely how an appropriate combination of different components (high-MW and low-MW) within a polymer influences gene delivery. Additionally, artificial intelligence (AI) analysis was employed to decipher the relationship between the polymer component distribution (PCD) and gene transfection performance. Guided by this analysis, a series of highly efficient polymer vectors that outperform current commercial reagents such as jetPEI and Lipo3000 was developed; among these, the transfection efficiency of the PAE-B1-based polyplex was approximately 1.5 times that of Lipo3000 and 2 times that of jetPEI in U251 cells.

Size Exclusion Chromatography (SEC)
20 µL of the reaction mixture was collected at different time points, diluted with 1 mL of DMF, filtered through a 0.2 µm filter and then measured by SEC. The columns (PolarGel-M, Edinburgh, UK, 7.5 mm × 300 mm, two in series) were eluted with DMF containing 0.1% LiBr at a flow rate of 1 mL/min at 60 °C. Columns were calibrated with linear poly(methyl methacrylate) (PMMA) standards.

Nuclear Magnetic Resonance (NMR)
The chemical structure and composition of the polymers were confirmed with one- and two-dimensional NMR spectra: ¹H NMR, ¹H,¹H-COSY, ¹³C,¹H-HSQC, ¹H,¹H-TOCSY, and ¹³C NMR. Polymer samples were dissolved in CDCl₃. Measurements were carried out on a Varian Inova 400 MHz spectrometer (Edinburgh, UK). To monitor the extent of reaction during polymerization, 100 µL of the reaction mixture was collected at different time points, diluted with 800 µL of deuterated solvent and then measured by NMR.

Data Sets
In this study, the weight ratios of the components of each polymer (within the molecular weight range required for successful transfection, 2,000–30,000 Da) were used as input features.
For instance, for Polymer-1 in Table S1, at a polymer/DNA weight ratio of 40:1 for transfection, the weight ratios of components P1 to P8 are 2:6:8:8:7:4:3:2; the input vector for this sample is therefore (2, 6, 8, 8, 7, 4, 3, 2). Each component's weight ratio can take any non-negative value, and the sum of the weight ratios of P1 to P8 equals the polymer/DNA weight ratio (e.g., 40 in the case of Polymer-1). The corresponding output is the transfection efficiency, which is measured by GFP expression and is represented in logarithmic form for each sample.
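A sample in this encoding can be sketched as follows. The eight weight ratios are taken from the Polymer-1 example above; the GFP expression value is an invented placeholder, not a measured result.

```python
import math

# Input feature vector: weight ratios of components P1..P8 for Polymer-1
# at a polymer/DNA weight ratio of 40:1 (values from the example above).
weight_ratios = (2, 6, 8, 8, 7, 4, 3, 2)
assert sum(weight_ratios) == 40  # ratios sum to the polymer/DNA weight ratio

# Output: transfection efficiency (GFP expression) in logarithmic form.
# The GFP value below is a hypothetical placeholder.
gfp_expression = 10_000.0
target = math.log10(gfp_expression)

sample = (weight_ratios, target)
print(sample)  # ((2, 6, 8, 8, 7, 4, 3, 2), 4.0)
```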

Machine Learning Models
In this study, we employed six machine learning algorithms to develop and validate predictive models for optimizing the Polymer Component Distribution (PCD) of polymer vectors. The models used were Support Vector Machine (SVM), k-Nearest Neighbors (KNN), Decision Tree (DT), Extreme Tree (ET), Random Forest (RF), and eXtreme Gradient Boosting (XGBoost).

Support Vector Machine (SVM)
is a supervised learning model that can also be used for regression tasks (support vector regression). Instead of maximizing a classification margin, SVM regression fits a function that tolerates a deviation of at most ε between the true value and the predicted value; points inside this ε-tube incur no loss. The decision function is:

f(x) = \sum_{i=1}^{n} (\alpha_i - \alpha_i^{*}) K(x_i, x) + b    (1)

where \alpha_i and \alpha_i^{*} are the dual coefficients associated with the support vectors x_i, K is the kernel function, and b is the bias. In this work, the Radial Basis Function (RBF) was used as the kernel.
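The decision function (1) with an RBF kernel can be sketched directly; the support vectors, dual coefficients and bias below are invented placeholders, not parameters of the paper's trained model.

```python
import math

def rbf_kernel(xi, x, gamma=0.1):
    """RBF kernel: K(x_i, x) = exp(-gamma * ||x_i - x||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(xi, x))
    return math.exp(-gamma * sq_dist)

def svr_predict(x, support_vectors, dual_coefs, b, gamma=0.1):
    """Decision function f(x) = sum_i (alpha_i - alpha_i*) K(x_i, x) + b."""
    return sum(c * rbf_kernel(sv, x, gamma)
               for sv, c in zip(support_vectors, dual_coefs)) + b

# Hypothetical fitted parameters for illustration only:
support_vectors = [(2, 6, 8, 8, 7, 4, 3, 2)]
dual_coefs = [2.0]   # alpha_i - alpha_i*
bias = 0.5

# At a support vector itself the kernel equals 1, so f(x) = 2.0 * 1 + 0.5.
print(svr_predict((2, 6, 8, 8, 7, 4, 3, 2), support_vectors, dual_coefs, bias))  # 2.5
```

In practice such a model would be fitted with an off-the-shelf implementation (e.g., an SVR with an RBF kernel); the sketch only illustrates how the fitted decision function produces a prediction.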

k-Nearest Neighbors (KNN)
is a non-parametric supervised learning algorithm used for regression tasks. It predicts the value of a query point by averaging the target values of its k nearest neighbors. The distance between two samples x = (x_1, …, x_n) and y = (y_1, …, y_n) can be measured with a variety of formulas, such as the Euclidean distance (2), cosine distance (3) and Manhattan distance (4):

d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}    (2)

d(x, y) = 1 - \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2} \sqrt{\sum_{i=1}^{n} y_i^2}}    (3)

d(x, y) = \sum_{i=1}^{n} |x_i - y_i|    (4)

In this work, the number of neighbors is set to 5.
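Distances (2)–(4) and the neighbor-averaging prediction can be sketched as follows; the toy samples in the test are invented for illustration.

```python
import math

def euclidean(x, y):          # Eq. (2)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine_distance(x, y):    # Eq. (3): one minus the cosine similarity
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1 - dot / (norm_x * norm_y)

def manhattan(x, y):          # Eq. (4)
    return sum(abs(a - b) for a, b in zip(x, y))

def knn_predict(query, samples, targets, k=5, dist=euclidean):
    """Predict by averaging the targets of the k nearest neighbours of `query`."""
    nearest = sorted(range(len(samples)), key=lambda i: dist(query, samples[i]))[:k]
    return sum(targets[i] for i in nearest) / len(nearest)

assert euclidean((0, 0), (3, 4)) == 5.0
assert manhattan((0, 0), (3, 4)) == 7
```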
Decision Tree (DT) is an interpretable supervised learning model that constructs a tree-like structure to make predictions based on simple decision rules learned from the data features. A regression tree divides the input feature space into K regions R_1, …, R_K and predicts, for an input sample x:

f(x) = \sum_{k=1}^{K} c_k \, I(x \in R_k)

where I is the indicator function and c_k is a constant, the average prediction value of the training samples falling in region R_k.
In this work, the max depth is set to 8.
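The piecewise-constant prediction of a fitted regression tree can be sketched in one dimension; the regions and region averages below are invented for illustration.

```python
# A fitted regression tree partitions the input space into regions R_k and
# predicts the average target value c_k of the region containing the input.

def tree_predict(x, regions, values):
    """f(x) = sum_k c_k * I(x in R_k)."""
    for (low, high), c_k in zip(regions, values):
        if low <= x < high:
            return c_k
    raise ValueError("x lies outside the partitioned input space")

regions = [(0, 10), (10, 20), (20, 40)]  # R_1, R_2, R_3 (hypothetical)
values = [3.1, 4.0, 2.5]                 # c_1, c_2, c_3 (region averages)

assert tree_predict(12, regions, values) == 4.0
```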
Extreme Tree (ET), also known as Extremely Randomized Trees, is an ensemble learning method that builds a collection of uncorrelated decision trees. It introduces additional randomness during tree construction, reducing overfitting. Like Decision Trees, Extreme Trees use if-else conditions to make decisions, but during tree construction random split thresholds are chosen instead of optimal ones, increasing the diversity among the trees.
In this work, the max depth is set to 8.

Random Forest (RF)
is an ensemble learning method that trains a collection of decision trees on bootstrap samples of the training data and averages their predictions, which reduces variance and overfitting compared with a single tree.

eXtreme Gradient Boosting (XGBoost)
builds an additive ensemble of CART decision trees. The predicted value of a model of K trees can be defined as:

\hat{y}_i = \sum_{k=1}^{K} f_k(x_i)

By learning such a K-tree model and making the error between the final predicted value and the true value as small as possible, the objective function can be defined as:

\mathrm{Obj} = \sum_{i} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k), \qquad \Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^2

The objective function consists of two parts: the loss function l, which measures the gap between the predicted score and the true score, and the regularization term \Omega. The coefficients \gamma and \lambda penalize the structural complexity of each CART decision tree by controlling the number of leaf nodes T and limiting the leaf scores w, preventing the model from overfitting the training data and losing predictive power.
In this work, the max depth is set to 6.
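The additive K-tree prediction described above can be sketched with stand-in trees; the decision stumps below are invented placeholders, not fitted CART trees.

```python
# Each f_k stands in for a fitted CART tree; these stumps are hypothetical.
trees = [
    lambda x: 1.0 if x[0] > 5 else 0.5,
    lambda x: 0.8 if x[1] > 3 else 0.2,
    lambda x: 0.3,
]

def ensemble_predict(x, trees):
    """Additive K-tree prediction: y_hat = sum over k of f_k(x)."""
    return sum(f(x) for f in trees)

# For x = (6, 2): the stumps contribute 1.0 + 0.2 + 0.3.
assert abs(ensemble_predict((6, 2), trees) - 1.5) < 1e-9
```

In gradient boosting the trees f_k are fitted sequentially against the objective above, whereas a random forest would average (rather than sum) independently grown trees; only the prediction arithmetic is shown here.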

Model Training and Evaluation
A ten-fold cross-validation was then conducted to compare the performance of the different models.
Predictive performance of the models was evaluated with several metrics (Figure S9 and Table S3), and the final candidate model was selected on this basis. After that, the data was randomly split into a training set (80%) and a validation set (20%).

A supervised learning model was chosen over alternatives such as Bayesian optimization for two reasons. Firstly, utilizing machine learning based on data from the same batch offers a more viable option in terms of practicality and data integrity. Secondly, even though Bayesian optimization techniques excel at efficiently exploring the parameter space to find an optimal solution, the objective of this work extends beyond solely locating the best ratio: we are also interested in uncovering the intricate relationship between PCD and transfection performance, aiming to gain a deeper understanding of the underlying factors that play a role in the gene delivery process. In this situation, a trade-off between uncovering the impact of PCD and finding the optimal PCD is necessary. By employing a supervised learning model, we prioritized exploiting the available data to construct a predictive model that can provide insights and predictions beyond the singular optimal solution. This approach enabled us to derive valuable insights about the relationship between PCD and transfection performance, contributing to a more comprehensive understanding of the system.
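The ten-fold split underlying the cross-validation can be sketched as follows; the sample count and seed are arbitrary placeholders.

```python
import random

def kfold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and split them into k disjoint folds;
    each fold serves once as the held-out set during cross-validation."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

folds = kfold_indices(100, k=10)
assert len(folds) == 10
# Every sample appears in exactly one fold:
assert sorted(i for fold in folds for i in fold) == list(range(100))
```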