Smartphone-Based CO2e Emission Estimation Using Transportation Mode Classification

As a first step towards decreasing greenhouse gas emissions originating from transportation, it is critical that we create efficient systems for monitoring individual travel patterns and the associated carbon footprints. To this end, this paper presents a CO2e emission estimator that combines transportation mode classification with mode-specific emissions data. In addition to assessing the accuracy of the final emission estimation, we also categorize error sources and discuss their relative importance. Finally, we provide recommendations for designers of future carbon footprint estimators. Experimental results support the notion that transportation mode classifiers used for carbon footprint estimation should be evaluated based on their ability to identify carbon emitting transportation modes, while giving lower priority to recognition of various stationary activities and low-emission transportation modes. Additionally, it is demonstrated that errors in the estimated traveled distance have a low impact on the overall emissions error compared to errors in the transportation mode classification or in the assumed emissions per traveled distance for a specific mode.


I. INTRODUCTION
Climate change is one of the greatest challenges of our age. The continuous release of greenhouse gases causes rising atmospheric temperatures, and thereby disrupts nature's fragile balance. Between 1990 and 2019, the global warming effect increased by 45% [1]. While there are several contributing factors, including deforestation and livestock, transportation is one of the main contributors. In the US, over a quarter of all greenhouse gas (GHG) emissions in 2020 were due to transportation, with road and rail modalities alone responsible for about 75% of these [2]. In the UK, other sectors within the economy, such as energy, have steadily reduced their emissions by, for example, using alternative energy sources. Meanwhile, emissions from the transportation sector have remained relatively static, and in 2016 transportation became the largest emitting sector in the UK [3]. Globally, the emissions due to transportation have increased by 2% annually over the last decade, with some regions seeing annual increases exceeding 4% [4].
The associate editor coordinating the review of this manuscript and approving it for publication was Parikshit Sahatiya.
By monitoring an individual's travel patterns, it would be possible to both estimate their carbon footprint and provide incentives towards changing their travel behavior. One way to do this is to use smartphone-embedded sensors [5]. Smartphones are increasingly fundamental in modern-day life, with the smartphone penetration rate exceeding 90% in many developed nations [6]. Moreover, they are embedded with a wealth of sensors, including inertial sensors and GPS receivers, and are natural platforms for user feedback. All in all, the high availability and versatility of smartphones make them very attractive devices for transportation mode classification. The first transportation mode classification systems only utilized cellular positioning [7]. However, following the widespread increase in smartphone usage, these systems have mainly relied on measurements from GPS receivers and inertial sensors. Important features include the speed [8], which tends to be very different for motorized and non-motorized modes; the rate of change in the velocity direction [9], with non-motorized modes allowing for more rapid and frequent turns; the position [10], which, for example, can provide information on whether the user is travelling along a highway or a railway line; the acceleration distribution [11], which generally looks different depending on whether a vehicle is moving alongside other traffic (like a car) or uses a separate track with few disruptions (like a train); and the number of stops per driven distance, which generally is higher for public transport than for other modes. The window length used for classification can either be fixed or be allowed to vary based on the output from a separate algorithm used for identifying switches from one transportation mode to another [12]. Moreover, classification algorithms are categorized as either heuristic rule-based approaches or machine-learning approaches [13].
Although transportation mode classification can be used within a variety of applications [14], including urban transportation planning [15], traffic safety [16], traffic management [17], and insurance telematics [18], the focus of the present study is on carbon footprint estimation [19]. Despite the critical nature of climate change and the large number of studies within transportation mode classification, research which utilizes mode classification results for the estimation of carbon emissions is very rare. There are scientific studies on transportation mode classification that are motivated by carbon footprint estimation [20]; however, most of these only implement and evaluate the transportation mode classification, and do not describe how to compute emission estimates [21]. For example, [22] describes a transportation mode classifier based on a decision tree that uses frequency-domain features computed from accelerometer measurements. The choice to not utilize GPS measurements is motivated by the associated increase in energy consumption. Nevertheless, at the end of the article, the authors point out that GPS measurements would be necessary to obtain CO2 emission estimates. Similarly, the study in [23] describes a smartphone-based system for CO2 emission estimation. However, the study does not evaluate the CO2 emission estimation (only the transportation mode classification), and there is no description of the algorithm for estimating the traveled distance.
In this paper, we design a CO2 estimator all the way from sensor measurement to individual carbon footprint estimates. As illustrated in Fig. 1, the estimator combines information from a transportation mode classifier with GPS-based estimates of traveled distance and mode-specific data on emissions per distance to estimate the total carbon equivalent emissions during a specific time window. By adding up multiple such emission estimates over a longer time period, it is possible to obtain, for example, daily or monthly individual transportation carbon footprints. The resulting estimator could be integrated in a smartphone app that provides detailed emission statistics and encourages the use of eco-friendly modes through, for example, gamification and discount awards. While analyzing our emission estimation system and the associated experimental results, we discuss several implementation and performance aspects that have previously been ignored in the context of smartphone-based carbon emission estimation. As an example, we analyze and compare the error sources of our estimation system. In particular, note that the emission estimation system described above can be said to have three error sources: (i) errors in the transportation mode classification, (ii) errors in the estimated traveled distance, and (iii) errors in the assumed emissions per traveled distance for a given mode. So far, there have been no investigations of how these error sources compare or how to reduce their impact on the final emission estimates.
The main contributions of this paper are as follows:
1) An investigation into the relationship between transportation mode classification accuracy and CO2 emission estimation accuracy. As is demonstrated, a high classification accuracy does not necessarily imply a high emission estimation accuracy. Therefore, it is important that the transportation mode classification is specifically designed for emission estimation applications.
2) A comparative analysis of the impact of the different error sources in the emission estimation. This analysis is based on both experimental results and surveys of available datasets.
3) Based on experimental data, we provide insight into the relationship between algorithm parameters, such as the classification window length and the chosen sensors, and performance characteristics.
VOLUME 11, 2023 54783 Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.

II. ALGORITHM DESIGN: TRANSPORTATION MODE CLASSIFICATION
The design of the transportation mode classifier will be constrained by the chosen dataset. We have chosen the commonly used Sussex-Huawei Locomotion (SHL) Dataset [24]. This dataset considers a wide variety of sensor modalities (including accelerometers, gyroscopes, GPS receivers, magnetometers, orientation, gravity, linear accelerometers, pressure, altitude, temperature) and provides manually annotated mode information at a high resolution. Most sensor data and the mode labels have a sampling rate of 100 Hz, while the GPS measurements have a sampling rate of 1 Hz. The mode labels are categorized as null (recordings that cannot be classified with confidence or are not any of the other possible transportation modes), still, walk, run, bike, car, bus, train, and subway. We used the data from user2 since all sensor modalities were available for this user. Given that this is a new area of study, our focus has been on assessing the accuracy that could be achieved when training and testing on a single user. Estimating the accuracy on larger cohorts or on unseen users is left for future studies. Since many individuals have their smartphone in their hand while travelling [25], we decided to use the measurements from the handheld smartphone.
A. DATA CLEANING
Data from smartphone-embedded sensors often contain missing values [26]. The SHL dataset, in particular, includes a sizable number of missing completely at random (MCAR) rows at the head and tail of various files, due to the asynchronous initialization and finalization of various sensors. The presence of MCAR data is due to external factors, and thus cannot be accurately predicted based on observed values. Therefore, in this study, readings where the number of missing values exceeds a user-specified threshold are removed from the dataset to create a reliable and unbiased model. The omission of these samples was deemed safe, given their small contribution to the overall dataset [27]. Another data cleaning approach that was considered was imputation, in which missing data is replaced by analyzing patterns in the dataset. This method should only be applied when the number of missing values is low, as guessing large portions of data will reduce a dataset's natural variation and the reliability of the resulting model [28]. With these caveats in mind, imputation will not be applied to the SHL dataset, which often includes long series of missing data.
Moreover, imputation has proven time-consuming within real-time applications [29], and researchers that impute smartphone-sensor data mostly do so in the context of offline data recovery [30]. Further, this paper will assess the relative performance of sensors. Therefore, imputing values for one whilst implementing real values for another would result in an unfair comparison.
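The threshold-based removal of readings described above can be sketched as follows. This is a minimal illustration only: the function name `drop_sparse_rows`, the row layout, and the threshold value are assumptions, not details taken from the study.

```python
import math


def drop_sparse_rows(rows, max_missing):
    """Keep only rows with at most `max_missing` missing values.

    A value counts as missing if it is None or a float NaN.
    """
    def is_missing(value):
        return value is None or (isinstance(value, float) and math.isnan(value))

    return [
        row for row in rows
        if sum(is_missing(v) for v in row) <= max_missing
    ]


# Hypothetical readings: each row holds one sample across three sensor channels.
rows = [
    [0.1, 9.8, 0.3],           # complete row, kept
    [None, None, 0.2],         # two missing values, dropped
    [0.0, float("nan"), 0.1],  # one missing value, kept for max_missing=1
]
cleaned = drop_sparse_rows(rows, max_missing=1)
```

Rows are dropped rather than imputed, in line with the reasoning in this subsection.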

B. SEGMENTATION AND FEATURE ENGINEERING
The transportation mode classification utilized nonoverlapping sliding windows. There is no standardized window length for transportation mode classification. Papers using only inertial-based sensors use shorter window sizes. For example, [31] uses window sizes of between 2 and 10 seconds with only accelerometer data. However, location-based transportation mode classification studies tend to use longer window sizes that span several minutes [32]. Given that this paper will use a combination of inertial and location-based sensors, it is necessary to find a balance between the varying window sizes. Therefore, in most experiments, the classification features are computed using data from one-minute windows. However, additional window sizes are studied in Section V-A, where we analyze the relationship between the window size, the mode classification performance, and the emission estimation performance. The computed features are the sum, minimum and maximum of each sensor axis over a given window size.
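As a minimal sketch of the per-window features named above (sum, minimum, and maximum of one sensor axis over non-overlapping windows), consider the following. The function name and the choice to discard a trailing partial window are assumptions made for illustration.

```python
def window_features(samples, window_len):
    """Return (sum, min, max) for each full non-overlapping window.

    A trailing window with fewer than `window_len` samples is discarded
    (an assumption; the study does not state how partial windows are handled).
    """
    features = []
    for start in range(0, len(samples) - window_len + 1, window_len):
        window = samples[start:start + window_len]
        features.append((sum(window), min(window), max(window)))
    return features


# Hypothetical readings from a single sensor axis.
axis_x = [1, 2, 3, 4, 5, 6, 7]
features = window_features(axis_x, window_len=3)
# features == [(6, 1, 3), (15, 4, 6)]
```

With sensor data at 100 Hz, a one-minute window would correspond to `window_len=6000` samples per axis.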
To mitigate dimensionality issues, we used the maximum relevance-minimum redundancy (MRMR) feature selection method, in which the number of desired features is input as a parameter K. It is worth noting that feature selection does not always improve classification accuracy. If K is insufficient, the model may not identify strong enough relationships to make adequate predictions. Conversely, excess features will encourage overfitting and reduce the model's performance on unseen smartphone sensor data. For these reasons, this study will consider several values for K and analyze the effect of K on the emission estimation in Section IV-B.

C. CLASSIFICATION
To ensure that the classifier is able to utilize information from all different sensor modalities, all features are scaled to have zero mean and unit variance. This is particularly important given the large number of different sensor modalities. The data is then split into training and testing sets.
In line with previous studies on transportation mode classification, we use 75% of the dataset for training and 25% for testing [33]. The trained models do not consider the temporal dependency of sequential time windows. The SHL dataset has an unequal proportion of transportation modes. Therefore, stratified sampling is implemented to ensure that the relative proportions of different modes are consistent across the original, training, and test datasets, thereby reducing potential sampling bias [34]. Two classifiers will be explored, namely support vector machines (SVM) and random forests (RF). Both of these have been proven effective in previous studies on transportation mode classification [35], [36]. For SVMs, both the linear and the Gaussian kernel will be considered, and for RF, we will study how the performance varies with the number of decision trees.
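The stratified 75/25 split can be illustrated with a short sketch. The function below is a hypothetical stand-in written with the standard library only; it is not the implementation used in the study, and the seed and rounding rule are assumptions.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_frac=0.75, seed=0):
    """Split sample indices so each class keeps roughly the same proportion.

    Returns (train_indices, test_indices), both sorted.
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)  # fixed seed for reproducibility (assumed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        cut = round(train_frac * len(indices))
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical imbalanced label set: 8 Car windows and 4 Run windows.
labels = ["Car"] * 8 + ["Run"] * 4
train_idx, test_idx = stratified_split(labels)
```

In this toy example, both the training and test sets keep the 2:1 ratio of Car to Run windows.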

III. ALGORITHM DESIGN: CO2e EMISSION ESTIMATION
Once the transportation mode classifier has output a transportation mode $m_i$ for each window $i$ in a given time period, this can be used to compute the emission estimate

$$\hat{E} = \sum_{i=1}^{N} \hat{D}_i E_{m_i}.$$

Here, $\hat{E}$ is the estimated total emissions over the time period composed of the windows numbered from 1 to $N$, $\hat{D}_i$ is the estimated traveled distance over window $i$, and $E_{m_i}$ is the emissions per traveled distance for transportation mode $m_i$. Details on how we obtained emissions data $E_{m_i}$ for each considered transportation mode and how we estimated the traveled distance $\hat{D}_i$ are presented in Sections III-A and III-B, respectively.
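As a minimal sketch of this computation, the total emission estimate is the sum, over all windows, of the estimated distance multiplied by the emission factor of the predicted mode. The factor values below are placeholders chosen for easy arithmetic, not BEIS conversion factors.

```python
def estimate_emissions(predicted_modes, distances_km, factors_g_per_km):
    """Sum distance * emission factor over all classification windows.

    predicted_modes:  predicted mode label per window
    distances_km:     estimated traveled distance per window, in km
    factors_g_per_km: emissions per km for each mode, in g CO2e/km
    """
    return sum(
        distance * factors_g_per_km[mode]
        for mode, distance in zip(predicted_modes, distances_km)
    )


# Placeholder factors (hypothetical values, for illustration only).
factors = {"Walk": 0.0, "Car": 170.0, "Train": 40.0}
modes = ["Walk", "Car", "Car", "Train"]
distances = [0.1, 1.0, 0.5, 2.0]

total_g = estimate_emissions(modes, distances, factors)
# 0.1*0 + 1.0*170 + 0.5*170 + 2.0*40 = 335 g CO2e
```

Summing such totals over many windows gives daily or monthly footprints.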

A. EMISSIONS DATA
The greenhouse gas protocol [37] outlines desired characteristics of datasets used to measure GHG emissions for private and public sectors. Specifically, the protocol states that in order to create accurate carbon footprint estimates, the dataset must be relevant (to the project's purpose), current (relative to the data collection period), reliable, and consistent (allowing for fair comparisons over time). To achieve these objectives in our emission estimation, this study uses the UK Department for Business, Energy & Industrial Strategy (BEIS) Conversion Factors 2018 [38]. Note that while the SHL dataset was recorded in 2017, the conversion factors from 2018 were chosen since the conversion factors for 2017 lack crucial information regarding battery electric vehicles (BEVs), which are becoming increasingly popular with a market share increase in the UK from 0.7% (2018) to 11.6% (2021). The conversion factors are specified in g/km of produced emissions. The GHG protocol further highlights general practices that studies should follow when computing carbon footprint estimates. Firstly, the project must outline comprehensible and unambiguous details of the processes and sources utilized. Secondly, there must be detailed reasoning of any assumptions and justifications of omissions.
In this study, we have made the following assumptions when extracting mode-specific emissions data:
• The BEIS provides several different conversion factors for cars, based on e.g., the size of the car and the fuel type. In this study, we used information from these multiple conversion factors to compute one conversion factor for cars. This conversion factor was computed by considering the data for cars of "Average" size, and by using the market share of cars by fuel type in 2018 to weight the conversion factors for cars with varying fuel types [39]. Although the market share data contains the category "Mild Hybrid Electric Cars", this category is not among the BEIS conversion factors and is therefore omitted in the weighting. Given that these cars only make up 0.6% of the total market share, this decision is not expected to have any major impact on the analysis.
• Sampling instances labelled as Bike in the SHL dataset are assumed to refer to regular non-emitting bicycles, rather than motorcycles or e-bikes.
• Sampling instances labelled as Subway in the SHL dataset are assumed to refer to the London Underground as the data was collected in the south-east of the UK, including London, and there are no other known underground passenger systems around this area.
• Sampling instances labelled as Train in the SHL dataset are assumed to refer to National Rail. The emissions for travelling via Eurostar are excluded since no subject traveled on this modality during the data collection.
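The market-share weighting of the car conversion factors can be sketched as below. All numbers are hypothetical placeholders, and renormalizing the shares after dropping a fuel type without a conversion factor is an assumption; the study only states that the category is omitted from the weighting.

```python
def weighted_car_factor(factors_g_per_km, market_share):
    """Market-share-weighted average of per-fuel-type conversion factors.

    Fuel types without a conversion factor are skipped, and the remaining
    shares are renormalized to sum to one (an assumed handling).
    """
    common = [fuel for fuel in market_share if fuel in factors_g_per_km]
    total_share = sum(market_share[fuel] for fuel in common)
    weighted_sum = sum(
        factors_g_per_km[fuel] * market_share[fuel] for fuel in common
    )
    return weighted_sum / total_share


# Hypothetical placeholder values, not BEIS or market data.
factors = {"petrol": 200.0, "diesel": 100.0}
shares = {"petrol": 0.6, "diesel": 0.3, "mild_hybrid": 0.1}

car_factor = weighted_car_factor(factors, shares)
# (200*0.6 + 100*0.3) / 0.9, roughly 166.7 g CO2e/km
```

When the omitted category has a small share, as with the 0.6% noted above, the choice between renormalizing and not renormalizing changes the result only slightly.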

B. ESTIMATING THE TRAVELED DISTANCE
The distance traveled over a classification window was estimated by computing the distance between sequential GPS position measurements and then adding these up. The distance between sequential GPS position measurements was computed using the Haversine formula, which outputs the great-circle distance between latitude and longitude pairs on a sphere [40]. Thus, the input to the Haversine formula is two consecutive GPS position measurements and the output is the distance between them. To illustrate this process, Fig. 2 shows an example of GPS coordinates at varying sampling rates within a one-minute window. Each line represents the shortest distance between two consecutive readings. Evidently, at 0.1 Hz, several of the readings are skipped. Section V discusses the effect of the sampling rate on the computed distances, and the resulting emission estimation error.
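The distance computation can be sketched with the standard Haversine formula. The Earth radius value and the function names are assumptions made for this illustration; the study does not state which radius was used.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed value)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to 1.0 to guard against rounding slightly above the asin domain.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def window_distance_km(points):
    """Sum the distances between consecutive (lat, lon) measurements."""
    return sum(
        haversine_km(lat1, lon1, lat2, lon2)
        for (lat1, lon1), (lat2, lon2) in zip(points, points[1:])
    )


# One degree of longitude along the equator is roughly 111.19 km.
one_degree = haversine_km(0.0, 0.0, 0.0, 1.0)
two_degrees = window_distance_km([(0.0, 0.0), (0.0, 1.0), (0.0, 2.0)])
```

A window with fewer than two GPS readings yields a distance of zero under this sketch.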

C. EVALUATION METRICS
The classification models will be evaluated based on two metrics. The first, called the overall F1-score, is the F1-score when considering all modes. Classification models are commonly evaluated using the F1-score, which is the harmonic mean of precision and recall. The F1-score works well when the dataset is imbalanced, and given that Run windows make up only 2.8% of the dataset, whilst Train windows make up 15.3%, this metric is very suitable. The second considered evaluation metric is the carbon F1-score. This is the F1-score for the binary classification between carbon emitting and non-emitting modes. Note that the binary classification of modes is only used for performance evaluation; all the implemented models performed a transportation mode classification using all available modes. The carbon F1-score was used to evaluate how well a model differentiates between carbon emitting and neutral modes, as this ability is considered more desirable for carbon footprint estimation applications than accurately differentiating between, for example, Walk and Run, or Car and Bus. In the context of the carbon F1-score, false positives will refer to non-emitting modes falsely classified as carbon emitting modes, and false negatives to carbon emitting modes falsely classified as non-emitting modes.
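A minimal sketch of the carbon F1-score follows. It collapses the predicted and true modes into emitting versus non-emitting and computes the F1-score with emitting as the positive class. The grouping of Car, Bus, Train, and Subway as emitting, and the choice of positive class, are assumptions made for this illustration.

```python
# Assumed grouping: modes treated as carbon emitting.
EMITTING_MODES = {"Car", "Bus", "Train", "Subway"}


def carbon_f1(true_modes, predicted_modes, emitting=EMITTING_MODES):
    """F1-score of the binary emitting vs. non-emitting classification."""
    tp = fp = fn = 0
    for true_mode, predicted_mode in zip(true_modes, predicted_modes):
        true_emits = true_mode in emitting
        predicted_emits = predicted_mode in emitting
        if true_emits and predicted_emits:
            tp += 1  # emitting mode recognized as emitting
        elif predicted_emits:
            fp += 1  # non-emitting mode classified as emitting
        elif true_emits:
            fn += 1  # emitting mode classified as non-emitting
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels: Car->Bus still counts as a true positive.
true_modes = ["Car", "Bus", "Walk", "Train", "Run"]
predicted = ["Bus", "Walk", "Walk", "Car", "Car"]
score = carbon_f1(true_modes, predicted)
# tp=2, fp=1, fn=1, so precision = recall = F1 = 2/3
```

Note that confusing Car with Bus does not lower this score, which is exactly the property that distinguishes it from the overall F1-score.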

IV. PARAMETER TUNING AND FEATURE SELECTION
Both the SVM and RF require parameter tuning and feature selection. In addition, studying the impact of the model parameters will help us analyze how errors in the transportation mode classification propagate through to the CO2e emission estimation. When studying parameter tuning in Section IV-A, both classifiers used all available features.
A. PARAMETER TUNING FOR TRANSPORTATION MODE CLASSIFIERS
1) SVM KERNEL
Two common SVM kernels are the linear and Gaussian kernels, both of which have been considered previously in transportation mode classification [41]. Fig. 3 shows the confusion matrix when using a linear kernel. With a linear kernel, the SVM achieved an overall F1-score and carbon F1-score of 79% and 90%, respectively. Three of the nine classes reach nearly 100% accuracy, and only 4% of windows were misclassified as a false negative (FN), and even fewer as a false positive (FP) (3%). In comparison, the Gaussian kernel outperforms the linear kernel with respect to classification performance, with an overall F1-score of 81%. Fig. 4 displays how it more accurately differentiates between Train and Subway compared to the linear kernel, and all classes, bar one, reach a minimum accuracy of 73%. These results align well with previous studies that concluded that the Gaussian kernel produces the best results for transportation mode classification [41]. However, although it achieved the same carbon F1-score as the linear kernel (90%), it does falter in some ways. In fact, 6% of windows were FN, an unwanted increase from the linear kernel, which can be attributed to over half of the Bus windows being classified as carbon neutral modes. Further, note that the Gaussian kernel achieved an emission estimation error of 3.1%, whilst the linear kernel achieved an error of 0.77% (only considering errors from the transportation mode classification). Therefore, in the subsequent analysis, we have used the linear kernel for CO2e emission estimation.

2) RF DECISION TREE CLASSIFIERS
For the RF, we need to set the number of decision tree (DT) classifiers. To evaluate the effect of this parameter on the CO2e emission estimation, the RF classifier was trained using between 1 and 200 decision trees. Fig. 5 shows how the overall F1-score and training time are dependent on the number of decision trees. As can be seen, the F1-score begins to stabilize at around 30 decision trees. This indicates that to optimize performance without wasting computational resources, the number of decision trees should be around 30. It is important to note that while the training time here is rather short, datasets may be much larger than what is considered here. Likewise, there may also be reasons to consider incremental on-device training. As a result, both the training time and the motivations to reduce it will be greater. Fig. 6 displays the emission error and carbon F1-score as dependent on the number of DTs. Here, when the number of DTs is 30, both the carbon F1-score and the emission error reach optimal values of 97% and −0.1%, respectively. Thus, in the subsequent analysis, we used an RF with 30 DTs.

B. FEATURE SELECTION FOR TRANSPORTATION MODE CLASSIFIERS
The number of features in the transportation mode classifiers, K, was optimized based on the resulting CO2e emission estimates. As illustrated in Fig. 7, the RF achieves a higher overall F1-score than the SVM for all values of K. Further, as seen in Fig. 8, the RF is more stable than the SVM in terms of the CO2e emission error. The SVM ranges between emission errors of −19% and 100%, whilst the RF has a much smaller range of −16% to 17%. Interestingly, the SVM fluctuates between −20% and close to 0% seemingly randomly, whilst the RF model stabilizes quickly. As a result, only the RF model will be considered for the rest of the study, as it has proven to perform consistently better than the SVM with respect to overall classification performance and emission estimation. Unlike for the number of DTs, there is no point at which both the F1-score and emission error are at their respective optimal values. As a result, 10-fold cross-validation was performed on the number of features, and the highest balanced accuracy (88%) was achieved when K = 81, therefore making this number suitable for minimizing the estimation error. Based on these findings, the next section will, unless otherwise specified, use an RF trained with 30 DTs and 81 features.

V. RESULTS AND ERROR ANALYSIS
This section will present the results of the emission estimation. In addition, we will analyze and compare the impact of three error sources in the emission estimation system, namely errors in the transportation mode classification, errors in the estimated traveled distance, and errors in the data specifying emissions per traveled distance. When studying one error source, the analysis will be performed as if there are no errors originating from the other error sources.

A. EFFECT OF ERRORS IN THE TRANSPORTATION MODE CLASSIFICATION
Implementing an RF with the parameter settings described in Section IV-B results in an emission error of −0.97%. Fig. 9 shows which misclassifications are responsible for the emission error. The majority of the error can be attributed to mistaking Train for Car, which results in an overestimation of 55 g CO2e per kilometer. The second biggest contributor was mistaking Car for Subway, with around 20 g of emissions lost per kilometer. Generally, the carbon neutral modes (mainly Walk, Run, and Bike) account for a very small share of the emission error. This is because when they are misclassified, the mode is commonly assumed to be another carbon neutral mode (e.g., Still mistaken for Walk). Next, we will investigate how variations in the chosen window length and in the employed sensors impact the transportation mode classification and, in turn, the emission estimates.
To investigate the impact of the window size, we computed the F1-score of the RF model for window sizes ranging from 1 second to 2 minutes. First, it is important to consider how the training time is affected by the window size. Fig. 10 shows that the training time plateaus at around 20 seconds, and that the computational power required is significantly higher when the segments are smaller. This relationship will be considered throughout the following analysis. Fig. 11 shows the overall and carbon F1-score as dependent on the window size. Notably, the performance is highly unstable, and gradually begins to decrease as the window size increases, particularly for the overall F1-score. Fig. 12 demonstrates how the overall F1-score changes with the estimation error. The reference line at 0% on the secondary y-axis is solely used as a visual aid to indicate the optimal emission estimation performance. When the window size is 1 second, the overall F1-score is 96%, and we also achieve a good carbon F1-score of 98%. Further, the emission error is only 0.47%, making this window size the ideal value for carbon footprint emission estimation. However, as demonstrated in Fig. 10, the training time at 1 second is exceptionally high. For window sizes between 5 and 11 seconds, the emission error is relatively stable, not exceeding 0.2%, and reaches a minimum absolute value of 0.06%. After this point, the emission error becomes unpredictable, and the F1-score begins to decline. However, throughout this, the largest over- and underestimations were only 5.05% and −5.04%, respectively. It should be noted that the sensors chosen by MRMR differ as the window size varies. The distance traveled was not considered important until the window size increased to around 3 seconds. Generally, as the window size increased, the contributions from the GPS measurements were gradually seen as more and more essential to the RF model.
To investigate the impact of the sensors chosen for transportation mode classification, the RF model was trained individually on each of the ten sensors. For the first six sensors, we considered a combination of the features obtained from their respective dimensions (i.e., X, Y, Z). Fig. 13 shows the overall F1-score obtained by each sensor. The accelerometer and gravity sensors, the former being one of the most commonly used sensors in transportation mode classification [32], [42], achieved the best performance with respect to overall classification. However, note that the best performing sensor still had an emission estimation error around 5.6%. To gain a better understanding of how the performance of the transportation mode classification relates to the emission estimates, we also studied the confusion matrices to see how well each sensor can differentiate between different transportation modes (for reasons of brevity, we only present the accuracy for each sensor-transportation mode pair in Fig. 14). One thing to note is that linear acceleration is the best sensor at separating Walk and Run, which is intuitive given the close connection between linear acceleration and walking speed. The altitude and pressure sensors are the best at classifying Subway, with accuracies of 78% and 80%, respectively. This can be explained by the fact that the Subway is underground, thus averaging at a lower altitude than other modes, while the air pressure in the London Underground can be significantly higher than the air pressure above ground [43]. Moreover, the temperature sensor is the best at differentiating between Train and Subway, two modes which are commonly mixed up by the other sensors.
The analysis above clearly motivates the need to combine information from multiple sensors. On their own, individual sensors typically struggle to classify individual modes. For example, the orientation sensor mistakes 50% of Run windows for Walk, and 35% of Bus windows for Subway. In fact, Run was misclassified entirely by three sensors, namely orientation, temperature, and GPS. Similarly, accelerometers encounter significant challenges when attempting to differentiate between different forms of locomotive transportation, mistaking nearly half of Bus windows as Car or Subway. Further, the pressure sensor mistakes 50% of Run windows as Bike, and 43% of Train windows as Car. In fact, no isolated sensor could classify every class with at least 50% accuracy. This analysis agrees with findings from [44], which only used iPhone accelerometer data and discovered that their GDA model often confused Biking and Driving, only reaching an accuracy of 45% for Biking. It should also be noted that the misclassifications often transcend the boundary between carbon emitting and non-emitting modes. Temperature performed the worst in this respect, misclassifying 11% of samples as FN. Contrarily, the pressure sensor performed best at 4%, but produced a large estimation error of −19.12%, due to its inability to accurately differentiate between carbon emitting modes. In addition to combining information from multiple sensors, it may also be useful to incorporate additional information such as bus stop locations [45] or dwell times at these stops [46].

B. EFFECT OF ERRORS IN THE ESTIMATED TRAVELED DISTANCE
As discussed in Section III-B, the traveled distance is estimated using GPS measurements sampled at 1 Hz. To analyze the impact of imperfect estimation of traveled distance on the final emission estimation, we looked at how the GPS sampling rate affects the estimated distances, and subsequently, the emission error. Fig. 15 visualizes how the computed distance for each mode changes with the GPS sampling rate. As can be seen, the differences are comparatively small, with only around 3 km lost for Car, corresponding to about 1.5% of the total traveled distance. Fig. 16 shows how these distances subsequently affect the emission estimation error. Here, the largest emission error was only 1.68%. These results indicate that for applications prioritizing energy efficiency, using a sparser GPS sampling rate is a viable option and will in many cases only contribute to minor errors in the carbon footprint estimation. Likewise, the emission estimation system should be able to provide accurate estimates also in situations with occasional GPS outages. However, note that to analyze the total impact of missing GPS samples we must also consider how this affects the transportation mode classification.
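The effect of a sparser GPS sampling rate on the estimated distance can be explored with a sketch like the one below, which keeps every `step`-th reading and recomputes the path length. The track coordinates are made up for illustration, and the Earth radius is an assumed constant.

```python
import math


def haversine_km(p, q, radius_km=6371.0):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(p[0]), math.radians(q[0])
    dphi = phi2 - phi1
    dlambda = math.radians(q[1] - p[1])
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * radius_km * math.asin(min(1.0, math.sqrt(a)))


def path_km(points):
    """Total length of the path through consecutive points."""
    return sum(haversine_km(p, q) for p, q in zip(points, points[1:]))


def downsample(points, step):
    """Keep every `step`-th reading, e.g. step=10 turns 1 Hz into 0.1 Hz."""
    return points[::step]


# Hypothetical zig-zag track; skipping readings cuts the corners.
track = [(51.000, 0.000), (51.001, 0.001), (51.000, 0.002),
         (51.001, 0.003), (51.000, 0.004)]
full_km = path_km(track)
sparse_km = path_km(downsample(track, 2))
lost_fraction = (full_km - sparse_km) / full_km
```

Because straight segments between retained readings can only shorten the path, downsampling tends to underestimate distance, and the loss grows with how much the true path curves between samples.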

C. EFFECT OF ASSUMED EMISSIONS PER TRAVELED DISTANCE
We will analyze the effect of errors in the assumed emissions per traveled distance on the emission estimation in two ways. First, we will simulate errors in the assumed car type (note that car emissions account for a substantial share of the total emissions). Second, we will study the impact of using emissions data from the wrong year.
In Section III-A, we used emissions data for a fictional "average" car. In practice, however, an individual will use either a petrol, diesel, hybrid, plug-in hybrid electric, or battery electric vehicle. For the purpose of this analysis, the "ground truth" vehicle is assumed to be a petrol car, since this was the most common car type in 2018 [39]. Fig. 17 shows how the total emission error changes when the estimation system incorrectly assumes that the car is using an alternative fuel type. As expected, we see much larger emission errors for the more environmentally friendly fuel types. For example, assuming the use of a plug-in hybrid electric car, the estimated emissions drop by about 50%. In conclusion, it is crucial for carbon footprint estimation applications to consider the user's actual fuel type. While it is unlikely that the car fuel type can be inferred from the sensors considered in this study, it is possible to give the user the opportunity to specify it manually (see the discussion in Section VI-B).
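The fuel-type experiment can be sketched as follows. The sketch assumes that total emissions are the sum over modes of traveled distance times a per-kilometer factor, and that the error is measured relative to the ground-truth total. All numerical factors and distances below are hypothetical placeholders, not the BEIS values or SHL distances used in this study.

```python
# Hypothetical placeholder factors in kg CO2e per km (NOT the BEIS values).
CAR_FACTORS = {
    "petrol": 0.18,
    "diesel": 0.17,
    "hybrid": 0.12,
    "plug_in_hybrid": 0.07,
    "battery_electric": 0.05,
}
NON_CAR_FACTORS = {"bus": 0.10, "train": 0.04, "walk": 0.0}

# Hypothetical traveled distance per mode in km.
DISTANCE_KM = {"car": 200.0, "bus": 30.0, "train": 50.0, "walk": 10.0}


def total_emissions_kg(distance_km, factor_kg_per_km):
    """Total emissions as the sum of distance times factor over all modes."""
    return sum(d * factor_kg_per_km[mode] for mode, d in distance_km.items())


def fuel_type_error(distance_km, true_fuel, assumed_fuel):
    """Relative emission error when the car fuel type is assumed incorrectly."""
    truth = total_emissions_kg(
        distance_km, {**NON_CAR_FACTORS, "car": CAR_FACTORS[true_fuel]})
    estimate = total_emissions_kg(
        distance_km, {**NON_CAR_FACTORS, "car": CAR_FACTORS[assumed_fuel]})
    return (estimate - truth) / truth


errors = {fuel: fuel_type_error(DISTANCE_KM, "petrol", fuel)
          for fuel in CAR_FACTORS}
```

With a petrol ground truth, the error is zero when petrol is assumed and becomes increasingly negative as the assumed fuel type becomes cleaner. The size of the effect depends on the share of the total distance traveled by car.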
As with the car fuel type, the year from which the emissions data is taken can also affect the carbon footprint estimation (as discussed in Section III-A, the 2018 BEIS conversion factors were chosen based on the time period in which the SHL Dataset was collected). To analyze this effect, the BEIS conversion factors from 2019 [47] and 2020 [48] were used to recompute the emission estimates. It should be noted that the relative market shares for the different fuel types of cars were also updated accordingly. Fig. 18 shows a decline in emissions between 2018 and 2020. This change can be attributed to two factors. First, petrol and diesel cars have been emitting less carbon in recent years. The emissions of both diesel and petrol cars decreased by 10 g of CO2e per kilometer on average between 2018 and 2020. A similar trend can be seen for public transportation, with National Rail and underground lines decreasing by 7 g/km and 10 g/km, respectively. Second, in terms of market share, there was a simultaneous decrease in petrol and diesel cars and an increase in electric vehicles. Petrol shares dropped from 61.9% to 55%, and plug-in hybrid cars increased from 1.8% to 4%. In total, using BEIS conversion factors and market share data for the year 2020 resulted in an emission error of 21%, which highlights the importance of using up-to-date emissions data when calculating an individual's carbon footprint.
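The two effects described above (lower per-kilometer factors and a shifting market share) both enter the factor of the "average" car. A minimal sketch, assuming that the average factor is a market-share-weighted mean of the fuel-specific factors, is given below. The function names are hypothetical, and the shares and factors are placeholders chosen for illustration rather than the BEIS or market data used in this study.

```python
import math


def average_car_factor(market_share, factor_kg_per_km):
    """Market-share-weighted mean emission factor in kg CO2e per km."""
    if not math.isclose(sum(market_share.values()), 1.0, abs_tol=1e-6):
        raise ValueError("market shares must sum to 1")
    return sum(share * factor_kg_per_km[fuel]
               for fuel, share in market_share.items())


def relative_change(new, old):
    """Relative change from `old` to `new`."""
    return (new - old) / old


# Hypothetical placeholder data for an earlier and a later year.
share_old = {"petrol": 0.60, "diesel": 0.35, "battery_electric": 0.05}
factor_old = {"petrol": 0.20, "diesel": 0.20, "battery_electric": 0.05}
share_new = {"petrol": 0.50, "diesel": 0.35, "battery_electric": 0.15}
factor_new = {"petrol": 0.19, "diesel": 0.19, "battery_electric": 0.05}

avg_old = average_car_factor(share_old, factor_old)
avg_new = average_car_factor(share_new, factor_new)
change = relative_change(avg_new, avg_old)
```

Separating the two inputs in this way makes it possible to update the market shares and the per-fuel factors independently when a new year of data becomes available.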

VI. SUMMARY AND DISCUSSION
In this section, we will summarize and draw conclusions from the results in Sections IV and V. In addition, we will discuss design choices related to manual input, energy consumption, and cloud computing, and how these impact accuracy and usability.

A. ERROR SOURCES AND ERROR CHARACTERISTICS
As described in Section I, the presented emission estimation system has three error sources: (i) errors in the transportation mode classification, (ii) errors in the estimated traveled distance, and (iii) errors in the assumed emissions per traveled distance for a given mode. Examples of how these error sources affect the emission estimates are presented in Section V. The worst-case emission estimation errors found in these examples were about 20%, 2%, and 52% given errors in the transportation mode classification, estimated traveled distance, and conversion factors, respectively. This indicates that up-to-date and accurate conversion factors are of utmost importance for the overall performance of an emission estimation system. Likewise, these findings highlight the importance of being able to differentiate between traditional petrol cars and electric cars, a task which is outside the scope of transportation mode classification. The results also indicate that errors in the estimated traveled distance are of lesser importance. This can partly be explained by the fact that a large share of the emissions derives from traveling by car, a mode for which it is comparatively easy to estimate the traveled distance even with sparse GPS samples.
One important finding from our studies is that the performance of the transportation mode classification is not always a good indication of the performance of the emission estimation system. For example, when the window size was 4 seconds, the transportation mode classifier reached a rather high F1-score of 94%. At the same time, the emission estimation error was −4.08%, which was one of the largest errors obtained during the study of window sizes. For this reason, designers of emission estimation systems need to ensure that their transportation mode classification is specifically tailored to emission estimation. In particular, note that mistaking a non-emitting mode for an emitting mode comes with a much greater penalty than mistaking a non-emitting mode for another non-emitting mode. Thus, when evaluating a transportation mode classifier that is intended for use in an emission estimation system, the evaluation metric should give a higher penalty to those misclassifications that are associated with 1) a greater difference in the emissions per kilometer and 2) a greater traveled distance. To reduce the total number of misclassifications, developers may also consider merging all non-emitting modes into one class (or, more to the point, any two modes with the same emissions per kilometer). However, from a user perspective, there are some benefits associated with using a more fine-grained mode classification. For example, the user could be provided with special offers or value-added services based on their travel habits. Additionally, being able to annotate their travel diary in detail could increase users' confidence in the emission estimation, and thereby increase user retention.
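One possible form of such an emission-aware evaluation metric is sketched below. It assumes that each classified window carries a traveled distance and that the penalty for a misclassification is the absolute difference in emissions per kilometer multiplied by that distance. This is an illustrative choice satisfying criteria 1) and 2) above, not a metric evaluated in this study, and the emission factors are hypothetical placeholders.

```python
# Hypothetical placeholder factors in kg CO2e per km.
FACTOR_KG_PER_KM = {"walk": 0.0, "bike": 0.0, "bus": 0.10, "car": 0.18}


def accuracy(true_modes, pred_modes):
    """Fraction of windows whose predicted mode matches the true mode."""
    hits = sum(t == p for t, p in zip(true_modes, pred_modes))
    return hits / len(true_modes)


def emission_weighted_cost(true_modes, pred_modes, distance_km, factor):
    """Sum over windows of |factor difference| times distance, in kg CO2e."""
    return sum(abs(factor[p] - factor[t]) * d
               for t, p, d in zip(true_modes, pred_modes, distance_km))


# Three windows with their traveled distances in km (hypothetical data).
true_modes = ["walk", "car", "bus"]
distance_km = [0.1, 1.0, 0.5]

# Classifier A confuses two non-emitting modes; B confuses car with walk.
pred_a = ["bike", "car", "bus"]
pred_b = ["walk", "walk", "bus"]

cost_a = emission_weighted_cost(true_modes, pred_a, distance_km, FACTOR_KG_PER_KM)
cost_b = emission_weighted_cost(true_modes, pred_b, distance_km, FACTOR_KG_PER_KM)
```

Both hypothetical classifiers have the same accuracy, yet only the second incurs an emission cost. Modes with identical factors are automatically treated as one class by this metric, which mirrors the suggestion of merging non-emitting modes.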
Additional design factors that were studied in Section V-A include the window size and the sensor set used for the transportation mode classification. Obviously, a smaller window size will increase both the computational complexity and the rate at which the emission estimation system can provide real-time updates. As demonstrated in Section V-A, a smaller window size will generally also result in a lower emission estimation error. Further, it should be noted that the analysis in Section V-A does not consider the loss in granularity resulting from the use of longer window sizes. For example, a one-minute window that consists of 30 seconds of driving a car and 30 seconds of walking will always result in a loss of information. With regard to the sensor set, the results in Section V-A demonstrate clear benefits of using a diverse set of sensors. However, if only one sensor can be utilized, the accelerometer should be used. In addition to being widely available as a smartphone-embedded sensor, it provides both the highest overall classification performance and one of the lowest emission estimation errors.
Future research on reducing emission errors could, for example, consider the relationship between emissions and driving style (aggressive driving, eco-driving, idling, etc.) and study the use of human activity recognition models to improve transportation mode classification. In addition, there will be a great need for solutions based on federated learning.

B. THE INCORPORATION OF MANUAL INPUT
Allowing for manual input or prompting the user for input at selected time points could provide several benefits. For example, in Section V, it was shown that missing information regarding the car fuel type could lead to errors of up to 50% in the emission estimates. However, if the user is able to provide information about what car is being used, errors of this kind can be avoided. Similarly, the model of the car will specify the size of the vehicle, which also impacts the user's emissions; according to the 2018 BEIS conversion factors, a small petrol car emits 0.156 kg of CO2e per kilometer, whilst a large petrol car emits nearly twice as much at 0.284 kg. It would also be beneficial to know whether or not a user has a driving license. If they do have a driving license, it can be assumed that they may complete some car journeys alone, and will thus be solely responsible for the emissions produced during those trips. Conversely, if they do not have a license, every detected car trip must be made together with at least one other person (that is, the driver). Taking this one step further, users could be given the opportunity to input the total number of passengers for every trip that they take part in. Assuming that the responsibility for the total emissions of a car trip should be split equally between all individuals in the car, the number of people present in the car will have a large impact on the total emissions attributed to any single individual. Allowing users to manually input the number of people in the car would result in more accurate emission estimates, and may also encourage users to share cars during commutes to work, etc. That said, designers of emission estimation systems must also consider potential negative consequences of manual input, including the added inconvenience for users and how this may impact user satisfaction [49], and whether this may inadvertently give users incentives to provide false information.
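The equal-split assumption can be written down in a few lines. The sketch below uses the two petrol-car factors quoted above, interpreted in kg of CO2e per kilometer (the unit used in the BEIS tables). The function name, the trip distance, and the occupancy values are hypothetical.

```python
def emissions_per_person_kg(factor_kg_per_km, distance_km, occupants):
    """Trip emissions attributed to one person under an equal split."""
    if occupants < 1:
        raise ValueError("a car trip needs at least one occupant")
    return factor_kg_per_km * distance_km / occupants


SMALL_PETROL = 0.156  # kg CO2e per km, as quoted in the text
LARGE_PETROL = 0.284  # kg CO2e per km, as quoted in the text

# Hypothetical 10 km commute, driven alone or shared with one other person.
alone_large = emissions_per_person_kg(LARGE_PETROL, 10.0, 1)
shared_large = emissions_per_person_kg(LARGE_PETROL, 10.0, 2)
alone_small = emissions_per_person_kg(SMALL_PETROL, 10.0, 1)
```

Under this assumption, sharing a large petrol car between two people attributes slightly less to each person than driving a small petrol car alone, which illustrates why both the car model and the occupancy are valuable manual inputs.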

C. ENERGY CONSUMPTION, COMPUTATIONAL COMPLEXITY, AND CLOUD COMPUTING
Applications aiming to promote more environmentally friendly travel habits should also aim to reduce their own energy consumption as much as possible. Therefore, since GPS receivers are notoriously energy-expensive, several studies on transportation mode classification have intentionally excluded GPS receivers. For example, [50] compared classifiers that used either a combination of GPS and accelerometer measurements or WiFi and Bluetooth data, and found that for five out of six modes, using GPS and accelerometers led to greater classification performance. It was only on train journeys with an unstable GPS connection but stable proximity to other Bluetooth devices that the latter information source resulted in better classification accuracy. Thus, the authors concluded that alternative sensors should mainly be used to enhance the accuracy of the GPS data, or to infer location in the event of GPS signal loss [51]. Similarly, the general agreement across the literature is that WiFi data works better for coarse-grained transportation mode classification (that is, identifying whether or not the user is on motorized transport) than for the fine-grained mode classification that is required for carbon footprint estimation [52]. There is also the option to dynamically change which sensors are being used based on the user's access to power outlets. For example, since GPS receivers are rather power hungry, they should not be used when the user is unlikely to have access to a charger for an extended period of time (for example, when on a train or subway); in this situation, Bluetooth and WiFi signals should be utilized instead.
Another way to increase the energy efficiency of an emission estimation system is to reduce the sampling rate of the GPS receiver. As discussed in Section V, reducing the GPS sampling rate to 0.1 Hz has only a minor effect on the performance of the CO2e emission estimation. One study in this area used a particle filter to create a dynamic sampling rate, thereby avoiding unnecessary sensing and waste of energy [53]. The authors found that, after reducing energy consumption by 15.0%, they still achieved a classification accuracy of 96.3%. To further improve the energy efficiency of the application, developers should consider hosting their product in the cloud. Cloud servers make data hosting and computations more energy-efficient by using techniques such as flexible processing allocation. For example, research by Berkeley Lab and Northwestern University found that a business using cloud computing can reduce its energy consumption by 87% [54]. Further, storing the user's data on their smartphone will require storage space and could slow the device down significantly, problems which may be alleviated by cloud hosting. However, it is important to consider the energy consumed during the transfer of data between the data center and the user. Those promoting cloud computing as a more environmentally friendly alternative often do not take the data transfer into account. Additionally, developers should consider the risks involved in cloud computing itself, such as user privacy, data security, and availability of service [55].