A Data-Driven Approach to Estimating Occupational Inhalation Exposure Using Workplace Compliance Data

A growing list of chemicals is approved for production and use in the United States and elsewhere, and new approaches are needed to rapidly assess the potential exposure and health hazard posed by these substances. Here, we present a high-throughput, data-driven approach that will aid in estimating occupational exposure using a database of over 1.5 million observations of chemical concentrations in U.S. workplace air samples. We fit a Bayesian hierarchical model that uses industry type and the physicochemical properties of a substance to predict the distribution of workplace air concentrations. This model substantially outperforms a null model when predicting whether a substance will be detected in an air sample, and if so at what concentration, with 75.9% classification accuracy and a root-mean-square error (RMSE) of 1.00 log10 mg m–3 when applied to a held-out test set of substances. This modeling framework can be used to predict air concentration distributions for new substances, which we demonstrate by making predictions for 5587 new substance-by-workplace-type pairs reported in the US EPA’s Toxic Substances Control Act (TSCA) Chemical Data Reporting (CDR) industrial use database. It also allows for improved consideration of occupational exposure within the context of high-throughput, risk-based chemical prioritization efforts.


Section S1
To demonstrate how our model could be used to screen new substance and workplace combinations without OSHA monitoring data, we leveraged data from the US EPA's Toxic Substances Control Act (TSCA) Chemical Data Reporting (CDR) database. Data under the US EPA's CDR rule are collected in four-year cycles, and the most recent cycle that was publicly available at the time of our analysis was the 2016 cycle. The comma-separated values (CSV) file of the 2016 CDR was obtained from the US EPA's CDR download page. 1 We used the industrial use file from the 2016 CDR cycle because it was considered most relevant to occupational exposure and was the only reporting file that contained information on industrial sectors.
To make predictions, our model required the NAICS sector and subsector codes (i.e., the first 3 NAICS code digits) and the chemical structure in the form of a SMILES string, which is passed to the OPERA QSAR models to obtain the necessary physicochemical properties. While chemical and industrial use information is required to be reported to the US EPA, this information can be redacted from the public version of the data if it is considered confidential business information (CBI), which leads to many records in the public database being unusable.
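As a minimal sketch of the input described above, the sector and subsector can be read off the leading digits of a NAICS code. The function name is hypothetical, and collapsing the two-digit prefixes 31, 32, and 33 (and likewise 44-45 and 48-49) into a single combined sector is an assumption about how sectors are grouped, not a description of the code used in this work.

```python
from typing import Tuple

# NAICS sectors that span several two-digit prefixes (standard NAICS structure).
# Whether the model groups them this way is an assumption of this sketch.
COMBINED_SECTORS = {
    "31": "31-33", "32": "31-33", "33": "31-33",  # Manufacturing
    "44": "44-45", "45": "44-45",                 # Retail Trade
    "48": "48-49", "49": "48-49",                 # Transportation and Warehousing
}


def naics_sector_subsector(naics_code: str) -> Tuple[str, str]:
    """Return (sector, subsector) from the first three digits of a NAICS code."""
    code = naics_code.strip()
    if len(code) < 3 or not code[:3].isdigit():
        raise ValueError(f"need at least three leading digits, got {naics_code!r}")
    sector = COMBINED_SECTORS.get(code[:2], code[:2])
    subsector = code[:3]
    return sector, subsector


# Example: a six-digit chemical manufacturing code reduces to its subsector.
print(naics_sector_subsector("325199"))
```

Any digits beyond the third are ignored, which mirrors the statement that only the first three NAICS digits are required.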
In other cases, due to the manufacturing pipeline, industry type may not be known or reasonably ascertainable and may be declared as such. Further, when reporting data, reporters must choose one of 48 US EPA industrial sector (IS) codes for each record to convey the industry that uses the substance. The US EPA provides a crosswalk of IS codes to NAICS codes in its discussion of the reasons for switching from requiring manufacturers to report NAICS codes to requiring IS codes. 2 However, the 48th industrial sector code (IS48) is essentially a code that allows a CDR reporter to insert free text into this field. For this reason, matching the text in the IS code field to NAICS codes is not a straightforward process.

The CDR industrial use file for the 2016 cycle contains 64,389 records of reported chemical information. Records whose chemical substance or industrial sector was either redacted as CBI or left blank were removed. For the remaining records, the text for the industrial codes was canonicalized to facilitate matching using the IS-to-NAICS crosswalk. After this, only unique records of Chemical Abstracts Service Registry Number (CAS RN), chemical name, and industrial sector were retained. This left a dataset of 24,757 records with 7,382 unique CAS RNs and 544 unique industrial sectors. An attempt was made to match all chemical names and CAS RNs in this dataset to DTXSIDs via the batch search functionality of the US EPA's CompTox Chemicals Dashboard. 3 We dropped any records for substances that could not be matched to DTXSIDs, or that matched but did not have associated QSAR-ready SMILES strings, as there would be no way to obtain the necessary physicochemical properties from the OPERA QSAR suite. After matching, there were 12,512 records with 3,702 unique CAS RNs and 215 unique industrial sectors. The 215 unique sectors were matched to the IS-to-NAICS code crosswalk, which provided 54 direct matches. The rapidfuzz Python package 4 was then used to perform sorted, tokenized matching of the remaining 161 IS codes to NAICS codes. In some cases, the free text of the IS code field mapped to multiple NAICS codes. When this occurred, the record was split into multiple records so that each record contained only a single substance-by-NAICS-code pair.
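The canonicalize, match, and split steps can be sketched as follows. This is an illustration under stated assumptions rather than the code used here: it substitutes the standard-library `difflib.SequenceMatcher` for rapidfuzz (whose `fuzz.token_sort_ratio` scorer plays the analogous role), and the crosswalk entries, record fields, and the cutoff of 80 are all hypothetical.

```python
import re
from difflib import SequenceMatcher


def canonicalize(text):
    """Lowercase, replace punctuation with spaces, and sort the tokens."""
    tokens = re.sub(r"[^a-z0-9\s]", " ", text.lower()).split()
    return " ".join(sorted(tokens))


def token_sort_score(a, b):
    """Similarity in [0, 100] between two strings after token sorting."""
    return 100.0 * SequenceMatcher(None, canonicalize(a), canonicalize(b)).ratio()


def match_sector(free_text, crosswalk, cutoff=80.0):
    """Return every NAICS code whose crosswalk label scores at or above cutoff."""
    hits = set()
    for label, codes in crosswalk.items():
        if token_sort_score(free_text, label) >= cutoff:
            hits.update(codes)
    return sorted(hits)


def expand_records(records, crosswalk, cutoff=80.0):
    """Split each record so that it holds one substance-by-NAICS-code pair."""
    expanded = []
    for rec in records:
        for code in match_sector(rec["sector_text"], crosswalk, cutoff):
            expanded.append({"casrn": rec["casrn"], "naics": code})
    return expanded


# Hypothetical crosswalk: sector label -> one or more NAICS subsector codes.
crosswalk = {
    "Paint and Coating Manufacturing": ["325"],
    "Wholesale and Retail Trade": ["423", "424", "441"],
}
records = [
    {"casrn": "50-00-0", "sector_text": "Manufacturing - Paint and Coating"},
    {"casrn": "50-00-0", "sector_text": "wholesale and retail trade"},
]
for row in expand_records(records, crosswalk):
    print(row)
```

Sorting tokens before scoring makes the comparison insensitive to word order, which is why a reordered free-text entry can still match its crosswalk label. Records that match nothing above the cutoff are silently dropped in this sketch; a real pipeline would likely log them for manual review.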
After fuzzy text matching and expansion, the dataset contained 3,701 unique substances and 15,152 unique substance-by-NAICS-code pairs. We then eliminated any records with NAICS sectors and subsectors that were outside the domain of our OSHA dataset (Table S2), since our model could not make predictions for these observations. Additionally, 341 of the substances from the CDR data had OPERA-predicted physicochemical properties that were outside the domain of the OSHA data our model was trained on (Table S3), so we removed these substances to avoid extrapolation. Finally, we removed substance-by-NAICS sector/subsector combinations that were already included in the OSHA data analysis. This resulted in a final dataset of 2,875 new substances and 5,583 new substance-by-NAICS sector/subsector pairs. Of the 7 NAICS sectors represented in the final dataset, "Manufacturing" was by far the most common, with ~96% of reports falling into this category. Each unique pair was passed to our pre-trained two-part Bayesian hierarchical model in the form of NAICS sector and subsector and the six required physicochemical properties and their two-way interactions. For each pair, our model predicted a detection probability and an air concentration for detects.
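The three screening filters and the interaction terms can be sketched as below. The sketch assumes that "outside the domain" means outside the per-property minimum-to-maximum range of the training data, and that a two-way interaction is the product of two properties; both are assumptions, as are all function and property names and the numeric bounds.

```python
from itertools import combinations


def in_domain(props, bounds):
    """True if every property lies inside its [low, high] training range."""
    return all(low <= props[name] <= high for name, (low, high) in bounds.items())


def screen_pairs(pairs, props_by_substance, bounds, known_subsectors, seen_pairs):
    """Keep pairs that are predictable, within domain, and not already analyzed."""
    kept = []
    for substance, subsector in pairs:
        if subsector not in known_subsectors:
            continue  # subsector absent from the training data
        if not in_domain(props_by_substance[substance], bounds):
            continue  # would require extrapolation
        if (substance, subsector) in seen_pairs:
            continue  # already covered by the monitoring data
        kept.append((substance, subsector))
    return kept


def with_interactions(props):
    """Add pairwise products of the properties as two-way interaction terms."""
    features = dict(props)
    for a, b in combinations(sorted(props), 2):
        features[f"{a}:{b}"] = props[a] * props[b]
    return features


# Hypothetical property bounds and candidate pairs.
bounds = {"logP": (-2.0, 8.0), "logVP": (-10.0, 3.0)}
props = {"A": {"logP": 1.0, "logVP": 0.0}, "B": {"logP": 9.5, "logVP": 0.0}}
pairs = [("A", "325"), ("A", "999"), ("B", "325"), ("A", "336")]
print(screen_pairs(pairs, props, bounds, {"325", "336"}, {("A", "336")}))
```

With six properties, `with_interactions` yields the six main effects plus 15 pairwise products, for 21 features per substance before the sector and subsector terms are added.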