Beyond the Local Volume. I. Surface Densities of Ultracool Dwarfs in Deep HST/WFC3 Parallel Fields

Ultracool dwarf stars and brown dwarfs provide a unique probe of large-scale Galactic structure and evolution; however, until recently spectroscopic samples of sufficient size, depth, and fidelity have been unavailable. Here, we present the identification of 164 M7--T9 ultracool dwarfs in 0.6~deg$^2$ of deep, low-resolution, near-infrared spectroscopic data obtained with the Hubble Space Telescope Wide Field Camera 3 instrument as part of the WFC3 Infrared Spectroscopic Parallel Survey and the 3D-HST survey. We describe the methodology by which we isolate ultracool dwarf candidates from over 200,000 spectra, and show that selection by machine learning classification is superior to spectral index-based methods in terms of completeness and contamination. We use the spectra to accurately determine classifications and spectrophotometric distances, the latter reaching to ~2 kpc for L dwarfs and ~400pc for T dwarfs.


Introduction
The structure and evolution of the Milky Way is largely informed from the spatial and kinematic distributions of its luminous stars, an area of study known as Galactic archeology (Freeman 1987;Ivezić et al. 2012). Large imaging, astrometric, and spectroscopic surveys such as the Sloan Digital Sky Survey (SDSS; York et al. 2000) have been critical for building our model of the Milky Way through six-dimensional position and velocity data, and detailed spectroscopic characterization of large-scale stellar populations. Star-count data spanning decades have revealed that the Milky Way contains multiple stellar populations, including a kinematically young population spatially distributed into two exponential disks (the thin and thick disks); a centrally concentrated and spherically distributed population of metal-rich stars (the bulge), and a widely dispersed, old, and metal-depleted population (the halo; de Vaucouleurs & Pence 1978;Bahcall & Soneira 1981;Jurić et al. 2008). Analysis of the stellar evolutionary states of these populations suggest that the Galactic disk population has been continuously forming stars over the past 8-11 Gyr, while halo stars were largely formed 10-13 Gyr ago (Leggett et al. 1998;Tolstoy et al. 2009;Haywood et al. 2013).
More recently, the Gaia astrometric mission (Gaia Collaboration et al. 2018), combined with large-scale spectroscopic surveys such as SDSS, the Large sky Area Multi-Object fiber Spectroscopic Telescope (Zhao et al. 2012), the Apache Point Observatory Galactic Evolution Experiment (Majewski et al. 2017), and Galactic Archaeology with HERMES (De Silva et al. 2015;Martell et al. 2017), has greatly refined this picture of the Milky Way. Notable insights include the inference of major merger events that likely formed the inner stellar halo and thick disk populations (Gaia-Enceladus/Gaia sausage: Belokurov et al. 2018;Helmi et al. 2018;Myeong et al. 2018;Gallart et al. 2019; and the Sequoia event: Myeong et al. 2018Myeong et al. , 2019; phase-space substructure and mixing among stars in the solar neighborhood, indicative of past perturbations of the Milky Way by satellite systems (Antoja et al. 2018); an ensemble of stellar streams that trace the tidal disruption of, and stellar accretion from, these satellites (Boubert et al. 2018;Malhan et al. 2018;Koppelman et al. 2019); and the detection of dozens of hypervelocity stars ejected through encounters with central supermassive black holes in the Milky Way and Large Magellanic Cloud (Boubert et al. 2018;Erkal et al. 2019). These results show the Milky Way and its environment to be a complex and dynamically evolving system.
All of these studies have focused on the brightest red giants and main-sequence stars, which provide both reach and reliability in the inference of stellar properties. Ultracool dwarfs (UCDs; M  0.1 M ☉ , T eff  3000 K; Kirkpatrick 2005) provide an alternative, and potentially more enriching, population for studying the Milky Way system (Burgasser 2004;Ryan et al. 2017). UCDs constitute ∼50% of stars by number in the immediate solar neighborhood (d < 100 pc), and are abundant in every environment in the Galaxy (Reid et al. 1999;Chabrier & Baraffe 2000;Cruz et al. 2007;Bochanski et al. 2010;Kirkpatrick et al. 2019). Stellar UCDs have lifetimes far in excess of the age of the Galaxy (>10 3 Gyr, Laughlin et al. 1997), while substellar UCDs (brown dwarfs) do not fuse hydrogen and have effectively limitless lifetimes (Kumar 1962(Kumar , 1963Hayashi & Nakano 1963). Brown dwarfs also cool and dim as they age, developing distinct spectra shaped by strong molecular absorption features that are highly sensitive to atmospheric temperature, surface gravity, and metallicity. The thermal and chemical evolution of stellar and substellar UCDs provide age diagnostics that have been exploited in stellar cluster studies (Stauffer et al. 1998;Jeffries & Oliveira 2005;Martín et al. 2018), coeval binary systems (Song et al. 2002;Burgasser & Blake 2009), and searches of new members of young moving groups near the Sun (Lopez-Santiago et al. 2006;Gagné et al. 2015;Mamajek et al. 2015;Faherty et al. 2018).
UCDs have historically been uncovered in wide-field red optical and infrared sky surveys, including SDSS, the Deep Near-Infrared Survey of the Southern Sky (Epchtein et al. 1997), the Two Micron All-Sky Survey (2MASS; Skrutskie et al. 2006), the United Kingdom Infrared Telescope Infrared Deep Sky Survey (UKIDSS; Lawrence et al. 2007), the Canada-France Brown Dwarf Survey (Reylé et al. 2010), the Wide-field Infrared Survey Explorer (WISE; Wright et al. 2010), and the Panoramic Survey Telescope and Rapid Response System (Chambers et al. 2016), among others. The intrinsic faintness of UCDs means that these surveys only reach the immediate solar neighborhood (d 100 pc). While these surveys have enabled study of the local UCD luminosity and mass functions Metchev et al. 2008;Reyle et al. 2010;Bardalez Gagliuffi et al. 2019;Kirkpatrick et al. 2019Kirkpatrick et al. , 2021 they cannot probe Galactic structure or UCD halo populations, of which relatively few examples are currently known Lépine & Scholz 2008;Zhang et al. 2019).
Deep, narrow-field imaging surveys provide one approach to investigating more distant UCD populations. Many of these surveys have exploited the sensitivity and imaging resolution of the Hubble Space Telescope (HST), often in parallel with searches for high-redshift galaxies for which UCDs are a contaminant population (Reid et al. 1996;Ryan et al. 2005Ryan et al. , 2011Holwerda et al. 2014). Very deep, multiband, ground-based surveys have also expanded our reach for UCDs (Kakazu et al. 2010;Carnero Rosell et al. 2019;Sorahana et al. 2019). However, these imaging surveys are subject to contamination and inaccuracies in spectral classification, which inhibits a detailed evaluation of completeness and the analysis of population composition in terms of mass, temperature, metallicity, and other properties. Deep spectral surveys, notably those deploying the slitless grism modes of the HST/Advanced Camera for Surveys (ACS) and HST/Wide Field Camera 3 (WFC3) instruments, have provided more robust and well-characterized samples of distant UCDs, but are shallower and smaller in sample size (50 sources; Pirzkal et al. 2005Pirzkal et al. , 2009Masters et al. 2012).
In this paper, we present a new analysis of HST/WFC3 slitless grism spectra from the WFC3 Infrared Spectroscopic Parallel Survey (WISPS; Atek et al. 2010) and 3D-HST (Momcheva et al. 2016;Brammer et al. 2012;Skelton et al. 2014) survey, expanding both the areal coverage and spectral types evaluated in these deep spectroscopic data sets. In Section 2, we describe the survey data used. In Section 3, we describe our selection process, including a robust analysis of three independent procedures using spectral indices and template fitting: selection by predefined index ranges and by two supervised machine-learning methods. In Section 4, we review the properties of the 164 UCDs identified in this sample, including classifications and spectrophotometric distance estimates. In Section 5, we summarize the main results of our study. Further analysis of this sample is presented in a companion paper (C. A. Aganze et al. 2021, in preparation; hereafter Paper II).

WISPS Survey Data
WISPS is a 1000 orbit, HST pure-parallels survey covering 390 fields (∼1500 arcmin 2 ), obtained concurrent with observations made with the Cosmic Origins Spectrograph (COS) or Space Telescope Imaging Spectrograph (STIS). The goal of WISPS is to conduct a census of star-forming, high-redshift galaxies; hence, WISPS fields are typically at high Galactic latitudes (|b|  20°). Grism data in G102 and G141 settings were obtained with an exposure time ratio of 2.4:1, although the individual exposure times varied. For G141 data, exposures ranged over 600-1400 s and 3000-9000 s for G102 data. Data were reduced using a combination of aXe (Kümmel et al. 2009) and custom software, the latter to remove residual background and to flag bad pixels. Reference images were used to determine source location, and contamination corrections for overlapping spectra were computed from aXe. Direct images were also obtained in the broadband F110W, F140W, and F160W filters, with source catalogs generated using SExtractor (Bertin & Arnouts 1996). We analyzed reduced grism and photometric data provided in WISPS release version 6.2 (Atek et al. 2010).

3D-HST Survey Data
3D-HST is an HST parallels survey of 248 orbits covering ∼600 arcmin 2 . The goal of 3D-HST is to understand the physical processes that shape galaxies in the high-redshift universe. This survey targeted four standard deep extragalactic fields: the All-wavelength Extended Groth strip International Survey (AEGIS; Davis et al. 2007), the Cosmic Evolution Survey (COSMOS; Scoville et al. 2007); the UKIDSS Ultra-Deep Survey (UKIDSS-UDS; Lawrence et al. 2007), and the Great Observatories Origins Deep Survey (GOODS-South and GOODS-North; Giavalisco et al. 2004). These fields are also covered by the Cosmic Assembly Near-infrared Deep Extragalactic Legacy Survey (Grogin et al. 2011;Koekemoer et al. 2011). Spectral data were obtained using both the ACS/G800L grism covering 0.5-0.9 μm with a resolution of λ/Δλ ≈ 100, and the WFC3/G141 grism. We focused on the WFC3 data for this study. WFC3 exposures were obtained over two orbits, with exposure times varying between 2500 and 6600 s per pointing. Data were reduced using custom software, including generation of a contamination model, as described in Momcheva et al. (2016); we used these data for our spectral analysis. 8 Note that the extracted 1D spectra reported in Momcheva et al. (2016) are not corrected for sensitivity; we applied the sensitivity curves provided in the same 3D-HST data products to recover each spectrum's continuum shape.

Spectral Calibration Sample
To anchor our identification of new UCDs in the WISPS and 3D-HST data sets, we first created a spectral calibration sample composed of known UCDs with similar spectral coverage and resolution to the HST spectra. This sample was drawn from nearly 3000 low-resolution (λ/Δλ ≈ 75-120), near-infrared (0.9-2.5 μm) spectra of nearby UCDs observed with the SpeX spectrograph on the NASA InfraRed Telescope Facility (Rayner et al. 2003), all contained in the SpeX Prism Library (Burgasser 2014). We selected all spectra with median signalto-noise ratio (S/N) > 10, and visually inspected these spectra to remove background contaminants, which are retained for our random forest classification as non-UCDs (Section 3.4). We also used 22 Y dwarf spectra from Schneider et al. (2015) and 77 UCD spectra from Manjavacas et al. (2019), all obtained with HST/WFC3. The resulting sample of 2,197 spectra of UCDs with spectral types M7 and later is referred hereafter as the UCD spectral template sample.

Point-source Selection
We combined all grism and photometric data from the WISPS and 3D-HST data sets, an initial sample of 254,264 sources. To refine our selection of UCDs, point sources in the WISPS survey were selected using the SExtractor stellarity index CLASS_STAR ≠ 0. 3D-HST provides a different flag, star_flag, that identifies point sources based on F160W imaging data. However, this flag rejected three visually confirmed UCDs from the 3D-HST sample that were not rejected with the SExtractor stellarity index, so we used a different selection criteria based on the half-light radius parameter FLUX_RADIUS (r). Following Skelton et al. (2014), we required r 1.5 and r < −0.7 × F140W + 20.6 ( Figure 1). Point-source selection reduced the sample down to 104,346 sources, or 41% of the initial spectral sample.

J-band S/N Rejection
To eliminate low S/N spectra that may be too ambiguous to identify or classify as UCDs, we applied an S/N cut on the WFC3 spectral data. UCDs display strong H 2 O and CH 4 molecular absorption features in the J and H bands (1.1-1.6 μm), so an S/N calculation that encompasses the full spectral range will produce different results for different spectral types. Therefore, we defined S/N ratios for the J-band continuum in the range 1.2 μm λ 1.3 μm (hereafter J S/N), the H-band continuum in the range 1.52 μm λ 1.65 μm (hereafter H S/N), and an overall median S/N in the range 1.1 μm λ 1.65 μm (hereafter MEDIAN S/N). We also computed an average of the J-band and H-band S/N (hereafter JH S/N). Figure 2 shows that a J S/N cut at 3 allows us to effectively probe a deeper sample of later-type objects. We rejected the lowest S/N spectra by requiring J S/N 3, which retained 46,561 spectra, or 18% of the initial spectral sample.

Spectral Template and Line Fitting
Visual inspection of the remaining spectral data identify some common contaminants, including emission line sources, featureless spectra with low S/N, spectra with absorption or emission features outside the primary H 2 O and CH 4 bands, and other artifacts. To distinguish featureless spectra from UCDs, we fit each WFC3 spectrum to the full UCD spectral template sample and to a straight line using χ 2 minimization. The χ 2 for each template fit was computed as where Sp(λ) is the WFC3 spectrum, σ(λ) is its uncertainty, T (λ) is the UCD template spectrum, and α is a scale factor that minimizes T where a and b are linear parameters determined through leastsquares minimization.
To distinguish between these fits, we used an F-test statistic to determine if the best-fit template was a significant improvement over the line fit, by requiring F( T L 2 2 c c , DOF T , DOF L ) < 0.02, where DOF is the effective degrees of freedom of each fit, equal to the number of pixels in the wavelength range 1.15-1.65 μm (N = 108), minus 1 for the template fit or minus 2 for the line fit. This constraint imposes the condition that the probability that the UCD template is a worse fit than a line is less than 0.5%, a limit chosen based on the distribution of F statistics for similar fits to the spectral templates. This selection cut retained 98% of the templates and 5946 of the WISPS/3D-HST point sources (2.3% of the initial sample), and provided a first-order estimate of the spectral classification of true UCDs in our sample. Figure 1. Point-source selection cut for the 3D-HST survey. The shaded 2D histogram shows the density of sources in 3D-HST, while our final sample is shown as yellow stars set by the criteria r 1.5 and r < −0.7 × F140W + 20.6.

Defining Spectral Indices
Our final selection criteria was based on the measurement of spectral indices which sample the strong H 2 O and CH 4 molecular features present in the J-and H-band regions. Spectral indices are commonly used to classify UCD spectra, and are typically tailored to the spectral resolution and region sampled (e.g., Tokunaga & Kobayashi 1999;Cushing et al. 2000;Testi et al. 2001;Allers et al. 2007;). We selected five wavelength regions bracketing the primary absorption features (Table 1, Figure 3), measuring the median flux in these regions. Uncertainties were estimated by Monte Carlo sampling of the individual flux measurements, assuming normal distributions scaled by the spectral uncertainties. Ten index ratios were then defined from these band fluxes, each of the form where 〈F〉 denotes the median flux value in the wavelength range λ l < λ < λ h . To select for Y dwarfs, which have extremely low levels of flux in the H 2 O and CH 4 absorption bands, we defined six additional indices that averaged flux measurements in multiple absorption bands, e.g., (H 2 O-1+CH 4 )/J-cont, Figure 4 displays the trends in these spectral indices among dwarf and subdwarf spectral standards. The spectral indices were measured on both the WFC3 data and our UCD spectral template sample. The latter were used to determine selection criteria for UCD subtype groups of M7-M9, L0-L4, L5-L9, T0-T4, T5-T9, Y dwarfs, and late-M/L subdwarfs. For each possible index-index pairing, we evaluated the distribution of index measurements for calibration templates within each of these subtype groupings, and defined parallelogram regions in which these templates clustered. The parallelograms were determined by first measuring a linear trend (I 2 = aI 1 + b) in the index-index data for a given subtype group, then defining a perpendicular range about that line that encompasses five times the standard deviation of template values. This process produces four vertices encompassing each template cluster in each of the possible index-index spaces.

Completeness and Contamination
To quantify the effectiveness of these selection regions to identify UCDs in a given subtype group and exclude contaminants, we defined two statistical metrics measuring completeness (CP) and contamination (CT): In Equation (5), N T (SG) is the total number of templates in subtype group SG, while * N R SG, T ( )is the number of templates within an index-index selection region R; CP = 1 indicates selection of all templates. In Equation (6), N WFC3 is the total number of WFC3 spectra, while * N R WFC3 ( ) is the number of these spectra within an index-index selection region R. As our expected number of UCD discoveries is assumed to be much smaller than N WFC3 , a high selection rate of WFC3 spectra is  consistent with a high level of contamination; CT =1 indicates minimum contamination. To maximize UCD selection (high CP) and minimize contamination (low CT) for each spectral subgroup, we rank ordered by CT those index-index pairs with CP 0.9, and chose the pair with the lowest contamination value; these are listed in Table 2. The best indices for each subtype group generally reflects the strongest molecular features present in those spectra ( Figure 5). The greatest contamination was found in the selection regions for late-type M and early-type L dwarfs, and late-M/L subdwarfs, due to the relative weakness of molecular absorption features for these spectral types and hence greater similarity to background objects. Conversely, the latetype T and Y dwarf groups had the lowest contamination due to their distinct spectral morphologies.
Applying the single best index-index selection criteria to the sources that passed the prior cuts yielded 3400 unique objects, primarily identified as late-M dwarfs and subdwarfs ( Table 2). The subdwarf selection criterion appear to be the most highly contaminated, as it identified 2042 objects, while visual inspection ruled out the detection of any unambiguous ultracool subdwarfs (Section 4.5).

Visual Selection
After the preceding selection criteria were applied, we visually inspected the remaining spectra to confirm their UCD nature and spectral classifications. Keeping only those sources whose spectra were a clear visual match to an UCD spectral standard (including d/sd and sd subdwarf standards), we identified a total of 164 UCDs, listed in Tables 3 and 4.
Retrospectively, we estimated the false-positive rates (FP) for each subtype group as where N T WFC3 is the number of WFC3 spectra visually confirmed as UCDs. With the exception of the late-T and Y Figure 3. Illustration of spectral bands used to define spectral indices in this study (blue bands). The black lines are normalized low-resolution late-M, L, and T dwarf spectral standards defined in Burgasser et al. (2006a) and Kirkpatrick et al. (2010), and the Y0 dwarf WISE J1738+2732 from Cushing et al. (2011). dwarfs, the best index selection criteria have FP values of nearly 100%, indicating significant contamination remains after these selection steps, although in terms of numbers these late-T and Y dwarfs select < 15 objects each. Nevertheless, the cumulative selection process reduces the number of spectra requiring visual inspection by over 99%, and spot checks of the rejected sample indicate that few if any UCDs were missed by this process.  Note. v1-v4 denote the (x, y) positions of the vertices of the selection box for the given index-index pair; CP, CT, and FP are the completeness, contamination, and false-positive rates defined in Equations (5)-(7).

Selection by Random Forest Classifier
Despite a high completeness and a significant reduction in the contaminants with the spectral index selection, this method selects more than 3000 candidates for visual inspection. To further reduce this number, we explored a different selection: a random forest classifier. A random forest (Breiman 2001) is a set of decision trees constructed on randomized samples of a data set. A decision tree is a model that recursively splits the sample into two parts until a predefined stopping criterion is reached. For example, if values of a feature j are {X j }, the decision tree could split this data set into two parts, {X|X j < s} and {X|X j s}, based on a threshold value s.
These steps are repeated on the new subspaces until a stopping criterion (e.g., minimum set size) is reached. For supervised classification, the optimal parameters for this tree (e.g., the stopping criterion, the structure of each tree) are chosen to minimize some predefined error function based on a set of preclassified training data. Individual decision trees can be highly variant, since the final classifications at the end of the tree depends on where splits are made. Random forests seek to reduce this variance by using statistical methods such as bootstrap aggregation (Breiman 1996).
The final classifications are determined by a majority vote of all of the decision trees. Random forests are a popular machinelearning algorithm as they are fast, can be parallelized, and do not require large training data sets to be constructed. They have been used to reliably predict M dwarf subtypes based on photometric colors (Hardegree-Ullman et al. 2019), perform star-galaxy photometric classification in wide-field and transient surveys (Miller et al. 2017;Clarke et al. 2020), and conduct other photometric classification tasks (e.g., Richards et al. 2011;Dubath et al. 2011;Bloom et al. 2012;Brink et al. 2013). Random forests have also been used in a regression form to infer atmospheric parameters from brown dwarf and exoplanet spectra (Márquez-Neila et al. 2018;Oreshenko et al. 2020). We used the RandomForestClassifier implementation of the random forest algorithm provided in the scikitlearn package (Pedregosa et al. 2011). Our preclassified training data (all with J S/N > 3) come from two sources: WISPS and 3D-HST spectra for non-UCD sources and the UCD template sample (Section 3.1). For the latter, we selected the 20 highest S/N spectra for each spectral subtype, smoothed and resampled these spectra to match the WFC3 resolution and wavelength range, and added Gaussian noise to encompass the range of J-S/N values present in the WFC3 data set down to the J S/ N = 3 limit. The training data we organized into five classes: a contaminants class composed of WFC3 spectra of nonpoint sources and point sources visually confirmed to not be UCDs (138,214 spectra total), M7-M9 dwarfs (4,686 spectra), L dwarfs (23,140 spectra), T dwarfs (13,479 spectra), Y dwarfs (12,871 spectra), and late-M/L subdwarfs (14,977 spectra) Choosing an appropriate set of features to classify a data set is a critical design choice of any machine-learning model. We used the five spectral indices and four S/N statistics defined in Section 3.2.2, the best-fit spectral type obtained from the spectral template fitting, and the F-test statistic comparing the template and line fits. In cases where features were not measurable due to small/zero denominators (common for noisy spectra), we assigned an extreme value of −99 to indicate missing data. The initial preclassification data set was shuffled and split into a training set (75%) and a testing set (25%) for constructing, optimizing, and evaluating the random forest model.
The parameters describing the random forest model ( Table 5) were optimized using the randomized search crossvalidation algorithm (Bergstra & Bengio 2012). We used the Kfold cross-validation method in which the data is randomly split into K equally sized subsamples, the model is fit to K − 1 of these subsamples, and the model is then validated on the last subsample. The cross-validation algorithm conducts a random search over distributions of the tree parameters, weighting those with the smallest classification error highest. We assumed an initially uniform distribution of parameters, and utilized the default K = 5 cross validation implemented in RandomizedSearchCV.
Classification error, and hence model convergence, was determined by evaluating four metrics 9 that compare the rates of true positives (TP), true negatives (  Minimum number of objects needed to split on an internal node 53 criterion Measure of the quality of the split (node purity) on each tree Entropy c Notes. a Parameter name in the scikitlearn RandomForestClassifier object. b Bootstrapping was done by splitting the training set into two randomly-assigned 50% partitions, training the random forest on one partition, and using this to classify the second partition (see, for example, Miller et al. 2017). Out-of-the-bag (OOB) samples are the data that were not used when fitting a tree to the bootstrapped samples c The entropy is defined as p p log  These metrics were historically developed for information retrieval tasks in linguistics (Chinchor 1992). Accuracy measures the model's ability to correctly classify all sources, both positive and negative classifications. Precision measures the model's ability to distinguish true positives from false positives. Recall measures the model's ability to correctly select members of a particular class, and the F 1 score combines the precision and recall measures. We computed these metrics for individual classes and averaged across classes. Our model was constructed in the Google Colab environment 10 (Carneiro et al. 2018). The final model was selected as the one with the highest F 1 score in order to minimize the false negative rate and maintain accuracy; its parameters are summarized in Table 5. Figure 6 displays the confusion matrix of our random forest model, the matrix of true and false positives and negatives as determined by the test set. Our best model has a >97% accuracy score across all classes, but varies between classes ( Table 6). The precision, recall, and F 1 scores are generally above 90% with the exception of M7-M9 class precision. In terms of performance, there is little confusion between UCD classes; the percentage of M7-M9, L dwarfs, and T dwarfs classified as another UCD class is below 1%. The primary loss in accuracy and precision stems from early M dwarfs (<M7) being classified as galaxies, likely due to their weak molecular features. Figure 6 shows the distribution of feature importance among the features used for classification. Feature importance captures the sensitivity of classification to a particular feature by measuring the decrease in classification error due to splits on that feature. The most important feature is the classification spectral type from the standard template fits. However, this feature alone cannot distinguish between contaminants and UCD types. For example, low S/N contaminants with strong emission features (active galaxies) are often classified as T dwarfs. Features combining H 2 O and continuum flux ratios therefore have high importance, as these are the highest S/N features and enable segregation of UCDs and contaminants. The least important features tend to be those sampling the noisiest parts of the data; e.g., CH 4 and H 2 O-2 indices.
In traditional random forest models, the final classification of a source is based on the highest probability class across all decision trees in the random forest. To further reduce contamination, particularly in the M7-M9 dwarf class, we examined an alternative classification strategy in which classified UCDs were required to have classification probabilities >80%; all other sources were classified as contaminants. This 80% cutoff is determined from distribution of the class probabilities for UCDs selected by the index method. Table 6 shows that this additional classification criterion generally increases the precision of detecting UCDs, particularly for the M7-M9 class, while significantly reducing the recall. In effect, this criterion produces a trade-off between minimizing contamination and maximizing the recovery of true UCDs. Given the rarity of UCDs in the overall sample, we find this trade-off to be warranted, as it can be modeled out when evaluating the overall sample (Paper II).
Following the optimization of our model parameters, we built and trained a final random forest model using the entire preclassified training set (with the 80% classification threshold), and applied it to the 46,561 WFC3 spectra of point sources with J S/N > 3. In total, the random forest model classified 143 spectra as late-M, L, T, or Y dwarfs, including two objects identified as subdwarfs. Visual inspection confirmed 89 of these spectra as bona fide UCDs. The random forest classifier did not select 72 with spectral types of M7-L8 previously selected by the index selection as they fall below the threshold classification probability of 80%. On the other hand, the random forest method identified one additional M7 UCDs that was missed by the index selection. While finding fewer UCDs overall compared to the visual index selection, this method (at 80% probability cutoff) provided a more than tenfold reduction in contamination.

Selection by a Deep Neural Network
We explored a second machine-learning classification approach using artificial neural networks Although the random forest method provided significant reduction in contamination, feed-forward artificial NNs outperform random forests in large-scale empirical studies (Caruana & Niculescu-Mizil 2006). In the simplest design, a multilayer, feed-forward NN with L hidden layers, each of dimensionality N, is a set of functions that transform an input vector X  of dimensionality M into an output vector f X k  ( ) of dimensionality K. Following the notation of Hastie et al. (2009) Here, σ 0 , σ 1 , σ 2 Kσ L , and g k are scale functions also known as activation functions; while the matrices α of dimension N × M and β of dimension K × N are the NN weights that linearly combine the various nodes between the layers. Training an NN entails optimizing the NN weights in order to minimize a classification convergence criterion R(θ) for a network model parameter set , , , , are design parameters for the network ( Table 7). A convergence criterion commonly used is cross entropy (Hastie et al. 2009), where Y k represents the expected label for input variable X  . Minimizing R(θ) can be achieved numerically using a stochastic gradient descent algorithm (Kiefer & Wolfowitz 1952).
While the mathematics behind NNs have been understood for decades, the recent emergence of large data sets in science and industry, and significant improvements in computational hardware, have contributed to their increased application in general and toward astrophysical problems in particular. Examples include the use of convolutional neural networks (CNNs) to model stellar properties based on light curves (Blancato et al. 2020), to predict galaxy spectra from their broadband images (Wu & Peek 2020), and to discover extrasolar planets (Shallue & Vanderburg 2018); and the use of generative adversarial networks (GANs) to extract features from noisy galaxy images (Schawinski et al. 2017). Baron (2019) provides a review of machine-learning applications, including NNs, in astronomy.
To train our NN, we used the same features and preclassified training set as the random forest classifier. Our NN model is a simple deep neural network (DNN) consisting of a series of fully connected layers (see Figure 7), implemented using the Keras library (Chollet et al. 2015). To converge to optimal parameters, we use adaptive moment estimation (Adam, Kingma & Ba 2014), a type of gradient descent algorithm. We used two forms of activation functions; for the input and hidden layers we used the rectified linear unit (ReLU, Nair & Hinton 2010) given by while for the output layer we used the normalized exponential function (softmax) given by The number of layers and nodes for the input and hidden layers are left as parameters to be determined through optimization.
To find the optimal parameters for the DNN model, we split our preclassified spectral data into a training set, a validation set, and a test set with proportions of 40%, 40%, and 20%. The training and validation sets are used to find and train the best model while the test set is used to evaluate the network performance on new data. We used a random search algorithm to optimize the model parameters listed in Table 7. We varied the number of nodes N in each of the input and hidden layers, Notes. a Predicted class is based on the highest probability class for that particular object. b Predicted class must have a >80% classification probability; sources that do not satisfy this constraint are classified as contaminants.  7. Schematic illustrating the DNN architecture used on this work. We summarize the model properties in Table 7 as well as the learning rate (η), which is a scale factor used to update model parameters from (θ) to (q¢).
The parameter optimization was done on data samples (batches) of 300 spectra for each training epoch. We evaluated the same four performance metrics as used for the random forest (accuracy, precision, recall, and F 1 score), as well as cross entropy, which was used to optimize the performance of the DNN model. As shown in Figure 8, model convergence as traced by the performance metrics occurs within 150 update cycles. Our best-performing model has L = 6 hidden layers of N = 64 nodes each, for a total of 118,085 trainable weights. Analysis of the test set indicates an overall accuracy, precision, recall, and F 1 score of ∼98% (Table 8).
After rebuilding the DNN model from the optimal parameters and training it with the full preclassified training set, we determined class probabilities for each of the six classes -contaminants, M7-M9 dwarfs, L dwarfs, T dwarfs, Y dwarfs, and subdwarfs-for the full WFC3 spectral sample. To reduce the number of false positives, we again imposed a classification probability threshold of >80% on the UCD Figure 8. Top: confusion matrix for the DNN showing the performance on the test set. Predicted labels in (a) are computed from the maximum probability while labels (b) are computed by first making our selection cuts and then computing the maximum probability. Here, objects classified as <M7 and/or not satisfying our selection cuts are labeled as contaminants. The label Sd is used to denote the subdwarf class and the label CTM is used to denote contaminants. Bottom: DNN performance metrics of accuracy (Ac), precision (Pre), recall (Re), cross entropy, and F 1 on the training (Tr) and validation (Val) sets. We achieve high performance after roughly 120 epochs. Beyond 120 epochs, the performances of the validation and training sets start to deviate. classes. Figure 9 displays the probability distribution of point sources in the full WFC3 spectral sample. We identified a total of 537 UCD candidates, including 158 objects classified as subdwarfs, of which 128 were visually confirmed. Like the random forest classifier, the DNN classifier provided a substantial reduction in contamination, but missed 36 UCDs identified by the index selection. On the other hand, the DNN classifier identified three additional UCDs (with spectral types of M7-M8) that were missed by the other two methods.
Combining the outcomes of all three selection methods, we identified 164 unique UCDs in the WISPS and 3D-HST surveys, listed in Tables 3 and 4 and subdivided into halfclass subgroups in Table 9. An example of the field image, 2D spectral image, and 1D spectrum of one of our L dwarf discoveries is shown in Figure 10; see the associated figure set for the complete set of HST/WFC3 data for our sample. While each method selected roughly the same number of UCDs, the index-selection method had over 10 times the number of contaminants compared to the random forest classifier and over five times the number of contaminants compared to the DNN classifier. The contamination rate was largely reduced by the application of a classification probability threshold, which cannot be equivalently computed for the index selection method. In total, we found 113 UCDs in the WISPS fields and 51 UCDs in 3D-HST. We also identified an additional 83 sources that were classified earlier than spectral type M7 based on comparison to spectral standards, but matched to high S/N UCD spectral templates classified M7 or later (Section 3.1). These sources, listed in the Appendix, are not included in the main sample but are listed here as they likely to be mid-to late-M dwarfs. Notes. a Predicted class is based on the highest probability class for that particular object. b Predicted class must have a >80% classification probability; sources that do not satisfy this constraint are classified as contaminants. Figure 9. Log probability distributions of the random forest and NN models on the prediction set for each class. We mark the 50% and 80% cutoff as dashed lines. The majority of sources have low probabilities of being UCDs and high probabilities of being contaminants. The label Sd is used to denote the subdwarf class and the label CTM is used to denote contaminants.

Spectral Classifications
Classifications for our targets are based on comparisons to spectral standards using the method described in Section 3.2.3. We used the best-fit matches to determine the spectral types. Classification uncertainties were estimated using a weighted standard deviation, with weights computed from the F-test cumulative density distribution W F min , (1). The median classification uncertainty is 1.4 subtypes, but varies between 0.5 and 9 subtypes. The largest classification uncertainties arise from the lowest S/N spectra, particularly for M-type UCDs. Figure 12 displays the distribution of spectral types, showing that our sample is heavily weighted toward late-M dwarfs (128 sources), as expected given their relatively bright intrinsic magnitudes and the corresponding larger volumes sampled in this magnitude-limited survey. We identified 26 L dwarfs (primarily L0-L4) and 10 T dwarfs, but no Y dwarfs or subdwarfs.

Color-spectral Type and Absolute Magnitude-spectral Type Relations
All of the sources identified in the WISPS and 3D-HST surveys have apparent F110W, F140W, or F160W magnitudes from imaging data, which can be used in conjunction with source classifications to estimate distances. We determined new absolute magnitude/spectral type relations for M5-Y2 dwarfs in the F110W, F140W, and F160W filters by bootstrapping to the 2MASS J-and H-band relations of Pecaut & Mamajek (2013). For T7-Y2 dwarfs, we shifted to the absolute H-band magnitude/spectral type relation of Kirkpatrick et al. (2021).
We first computed color offsets between 2MASS (Vega magnitudes) and HST/WFC3 photometry (AB magnitudes) by computing spectrophotometric colors from our UCD template sample (Section 3.1). To ensure fidelity, we selected only those M5-T9 dwarfs with precise ( 1 subtype uncertainty) red optical (late-M and L dwarfs) or infrared (late-L and T dwarfs) classifications; small 2MASS J-and H-band magnitude uncertainties (σ < 0.4); and spectra with median S/N > 100 for M dwarfs, S/N > 70 for L dwarfs, and S/N > 10 for T dwarfs. We also included all of the Y dwarfs from the template sample, and removed known binaries.
Magnitudes were computed by convolving the spectra with the appropriate filter profiles (HST 11 ) or response functions (2MASS; Cohen et al. 1992), and uncertainties in spectral fluxes were propagated by random sampling. Figure 13 shows the resulting color offsets as a function of spectral type. Variations in color offsets are largest when the 2MASS and HST filters are more widely separated, reflecting the substantial evolution in spectral morphology across UCD spectral classes. We chose those colors that minimized spectral type variation, and hence bootstrapped the F110W and F140W relations to the J band and the F160W relation to the H band. Adding the relevant correction for each individual source to its absolute 2MASS magnitude from the absolute magnitude relation, we fit the entire spectral sequence with a sixth order polynomial using least-squares fitting, masking sources that were more than 5σ outliers. Fit coefficients and corresponding absolute magnitude uncertainties are listed in Table 10. The scatter in these absolute magnitude relations is approximately 0.4-0.6 mag, comparable to other absolute magnitude/spectral type relations (e.g., Dupuy & Liu 2012;Best et al. 2018).
The 2MASS colors of discovered UCDs compared to colors of UCDs in our calibration samples are shown in Figure 11. These colors are computed using our relations in Table 10 and incorporating the intrinsic scatter in these relations. Our discoveries generally follow the spectral type/color trend from our calibration samples. However, objects with fairly large uncertainties in subtyping are significantly offset.

Distances and Galactic Coordinates
Distance estimates were computed for each source based on the relations in Table 10 and measured HST/WFC3 photometry. We assume for simplicity that all sources are single, although magnitude-limited surveys of UCDs typically have a ∼20% multiplicity rate (Bouy et al. 2003;Burgasser et al. , 2006bFontanive et al. 2018). Uncertainties in spectral classifications, source photometry, and the absolute magnitude relations were propagated by random sampling. In cases where more than one distance estimate was available from the absolute magnitude relations, we computed an uncertaintyweighted average. The distance of our UCD discoveries range over 60-3000 pc for late-M dwarfs, 40-2000 pc for L dwarfs, and 82-800 pc for T dwarfs (Figure 13). The most distant UCD with <50% distance uncertainty is the M7 ± 1 J23450092-4239288 at a distance of 2.1 ± 0.8 kpc. The closest sources are the L1.0 ± 0.7 WISPS J0927+6027 with an estimated distance of 48 ± 11 pc, and the M8 ± 1 WISPS J2333+3925 with an estimated distance of 67 ± 17 pc. Both of these sources are detected in Gaia with parallaxes of 19.497 ± 0.567 mas (51.4 ± 1.5 pc) and 19.498 ± 0.567 mas (50.5 ± 0.5 pc), respectively, such that WISPS J2333+3925 is in fact marginally closer. These are the only two sources in our UCD sample that are found in Gaia, and their measured and estimated distances are in agreement. Two of our extended sample targets (J10002943+0217108 and J14190738+5246567, Appendix) also have tentative Gaia detections, and their Gaia distances also agree with our photometric distance estimates within the uncertainties. Figure 14 displays the distribution of source Galactic angular and spatial coordinates. 12 The angular distribution of the UCDs follows the HST pointings, with more sources where m is the power of the polynomial. Coefficients are written in decimal exponent notation. a Decimal spectral type, with 10 = M0, 20 = L0, and 30 = T0, etc.

Detailed Discussion of Sample UCDs
Late-M dwarfs-We identified 128 sources with estimated spectral types of M7 SpT < L0. Most of these classifications are robust, in that the major H 2 O absorption feature between 1.3-1.45 μm is clearly distinguishable from noise. For many of the WISPS spectra, we were able to confirm classification with G102 grism spectra (0.8-1.15 μm), which encompasses additional H 2 O, FeH, and TiO absorption features characteristic of M dwarf spectra. As the G102 and G141 spectra sample distinct, non-overlapping wavelength ranges, we were unable to converge on a correct scaling to merge these spectra, so we retain the classifications based on the G141 spectra alone. We visually inspected all source images to confirm our UCDs are point sources unblended with other sources in the field, although a handful of sources with misaligned imaging data prevented this verification. 13 One notable late-M dwarf discovery is the M8 ± 1.5 dwarf WISPS J2333+3921, which appears to have stronger H 2 O bands compared to its best-fit standard, which may be a signature of subsolar metallicity (Aganze et al. 2016). The G141 spectrum of WISPS J2333+3921 is better fit to a d/sd M9 standard despite its weaker FeH in the G102 spectrum. We also note that M7 ± 2 dwarf WISPS J1242+3538, located at an estimated distance of just over 3 kpc, is very close to an adjacent bright source and its spectrum may be partly contaminated. The closest M dwarf in our sample is the M8 ± 1 WISPS J2333+3925, as noted above at a Gaia parallax-based distance of 50.5 ± 0.5 pc. This source has a relatively high proper motion, (μ α , μ δ ) = (389.688 ± 0.203, −60.328 ± 0.178) mas yr −1 , corresponding to a tangential velocity of 94 km s −1 . It may be an old thin disk or a thick disk star.
L Dwarfs-We identified 26 L dwarfs, nearly half of which (12 sources) have classifications between L0 and L2; the distribution of sources per subtype throughout L0-L4 is nearly constant. Note that given the classification uncertainties, some of the earliest-type L dwarfs may be late M dwarfs, and similarly several classified late M dwarfs could be early L dwarfs. Most of these spectra are well matched to the L dwarf spectral standards, and several are confirmed by additional G102 grism data. As noted, the relatively nearby L2 ± 1.1 dwarf, WISPS J0927+6027, at a Gaia parallax-based distance of 51.4 ± 1.5 pc, is sufficiently bright (F140W = 16.697 ± 0.001) to make it amenable for further ground-based follow-up. The Gaia proper motion (μ α , μ δ ) = (−6.780 ± 0.550, 28.435 ± 0.544) mas yr −1 corresponds to a modest tangential velocity of 7 km s −1 .
T dwarfs-T dwarfs are readily distinguished from other sources by their strong H 2 O and CH 4 absorption features. We identified 10 T dwarfs which match well with spectral templates. These include the three T dwarfs previously identified in WISPS by Masters et al. (2012): WISPS J0307-7243 (T4 ± 0.5 at ∼320 pc), WISP J1232-0033 (T7 ± 0.9 at ∼150 pc), and WISPS J1305-2538 (T8 ± 0.7 at ∼80 pc); and AEGIS J1418+5242 (T4 at ∼330 pc), first identified in 3D-HST data by Brammer et al. (2012). One of the early T dwarf discoveries, WISP J0326-1643 (T1 ± 1.4 at 740 pc), is also closely aligned with a brighter source, although its spectrum is a good fit to standards.

Ultracool Subdwarfs and Y Dwarfs
Despite creating specific selection criteria, we did not identify any obvious Y-type dwarfs or metal-poor subdwarfs in our sample. The lack of Y dwarfs is likely a consequence of the relatively small volume sampled for these intrinsically faint brown dwarfs. In Paper II, we estimate an expected number of Y dwarfs in this sample to be <1. The lack of subdwarfs may be due to their low abundance and selection biases. Ultracool subdwarfs are known to exist in the solar neighborhood Zhang et al. 2019;Schneider et al. 2020), but are rare compared to solar-metallicity disk dwarfs (∼0.25% in the solar neighborhood; Pirzkal et al. 2009). However, subdwarf members of the halo and thick disk populations should be more common at the larger distances probed by this survey (see, Ryan & Reid 2016). Lodieu et al. (2017) measured a surface density of M5 subdwarfs of 0.04 deg −2 to a depth of J = 18.8. For an average limiting depth of F110W = 22 this corresponds to ≈ 16 deg −2 , or roughly 10 M5 subdwarfs over our total search area of ≈0.6 deg 2 . Nevertheless, no subdwarfs were identified by our index selection, random forest classifier, or DNN classifier. There are several possible reasons that these sources were missing. First, subsolar metallicity is known to suppress the strength of molecular features, notably H 2 O, in the near-infrared spectra of UCDs, largely through the presence of enhanced collisioninduced H 2 absorption (Linsky 1969;Zhang et al. 2017). Relatively featureless late-M subdwarf spectra would have been rejected by our line fitting test, while more structured L subdwarf spectra would have been exceedingly rare in our sample. In Paper II, we model the full selection function of our search criteria applied to our calibration samples. Thus, a measurement of the density and distribution of ultracool subdwarfs likely requires the inclusion of the shorter-wavelength red optical data to capture characteristic features in this region (e.g., red optical slope and TiO, VO, FeH, and CrH bands), but the data remains incomplete for these short wavelengths.

Summary
We summarize our findings as follows: 1. We have identified 164 late-M, L, and T dwarfs in 0.6 deg 2 of HST/WFC3 parallels grism spectra from the Figure 13. Top: offsets between 2MASS J-and H-band magnitudes and HST/WFC3 F110W, F140W, and F160W magnitudes as a function of spectral type, based on spectrophotometry of spectral templates. Bottom: derived absolute magnitude-spectral type relations for HST filters. Orange points show relations bootstrapped to 2MASS J-band photometry, blue points show relations bootstrapped to 2MASS H-band photometry. The solid lines delineate the sixth order polynomials whose coefficients are listed in Table 10. WISPS and 3D-HST samples. These objects were selected from over 250,000 sources using a combination of imaging morphology, spectral indices, template fitting, and visual confirmation. 2. We evaluated three different methods for source identification: traditional spectral index-based selection, random forest classification, and DDN classification. The latter two provided a substantial reduction in source contamination while missing only a small number of late-M dwarfs, and provide a superior approach to the discovery of rare sources in large spectral samples. 3. Spectral classification and empirical relations allow us to map the distances of these sources to up to 3 kpc, including many of the most distant spectroscopically confirmed brown dwarfs currently known. 4. The lack of detection of clear subdwarfs and/or Y dwarfs in the sample is expected due to the rarity of these populations and small area coverage of these surveys, although the limited wavelength range and our selection criteria may have biased the sample against late-M subdwarfs.
This paper provides an expansion of previous deep spectroscopic surveys, limited to only a handful of sources with d  1 kpc. Nevertheless, the number of known distant L and T dwarfs remains small. Fortunately, forthcoming wide-field spectral surveys achieved with the James Webb Space Telescope (JWST; Gardner et al. 2006), SPHEREx (Doré et al. 2014), and Nancy Grace Roman Space Telescope (Spergel et al. 2015), and deep multi-epoch, multi-color photometry and astrometry with the Vera Rubin Observatory (Ivezić et al. 2019) and the Euclid space telescope (Laureijs et al. 2011) will allow us to probe to be greater depths (J ≈ 27) and broader fields of view. Ryan & Reid (2016) have predicted a mean UCD surface density of ≈ 0.3 arcmin −2 (or ≈1000 deg −2 ) in JWST down to a depth of J = 24. The recently approved JWST Parallel Application of Slitless Spectroscopy to Analyze Galaxy Evolution survey (PASSAGE, Cycle 1 GO-1571, PI Malkan) will obtained slitless grism spectra over a total area of 7.5 deg 2 down to F115W and F200W ≈ 27 in a manner similar to WISPS and 3D-HST. We expect this survey to detect of order 10 4 -10 5 thin disk UCDs, thousands of thick disk UCDs, and hundreds of halo UCDs out to distances of 10 kpc, with full 1-2.2 μm spectra, enabling more robust . Gray X symbols indicate HST pointings without any UCDs. Bottom panel: distribution of observed UCDs in cylindrical galactic coordinates (R, Z; left and middle) and scatter plot of (right) these coordinates. Sources are color coded according to spectral type. Later-type UCDs are concentrated around the solar neighborhood while earlier-type M dwarfs are located further into the Galactic disk. segregation of contaminants and characterization of discoveries. In Paper II, we use this sample and a set of Monte Carlo simulations incorporating assumptions about the stellar mass function, star formation history, UCD evolutionary models, and Galactic structure to constrain the Galactic scale-height and population ages for these distant UCDs.
This research is based on observations made with the NASA/ESA Hubble Space Telescope obtained from the Space Telescope Science Institute, which is operated by the Association of Universities for Research in Astronomy, Inc., under NASA contract NAS 5-26555. These observations are associated with program(s) GO-12177 and GO-12328 (3D-HST), and programs GO-11696, GO-12283, GO-12568, GO-12902, GO-13352, GO-13517, and GO-14178 (WISPS Portions of this work were conducted at the University of California, San Diego, which was built on the unceded territory of the Kumeyaay Nation, whose people continue to maintain their political sovereignty and cultural traditions as vital members of the San Diego community. Software

Indices
Notes. Table 11 is available in machine-readable format. a Classification by using spectral standards. b Selection methods are index-index selection (IND), random forest selection (RF), and deep neural network (DNN).
(This table is available in machine-readable form.)