A Deep Learning-Based Approach to Enable Action Recognition for Construction Equipment



Introduction
Human action recognition (HAR) is one of the most popular research areas in computer vision (CV), and its outcomes have been extensively applied in video surveillance and human-machine interaction for a variety of application scenarios such as safety monitoring and control [1]. A basic methodology for recognizing human actions is to detect a human in a video or photograph, segment and map the body attributes, and then infer the action from the result [2]. A number of methods for achieving HAR have been developed in the past few decades. Poppe [3] and Purohit and Chauhan [4] reviewed and classified different recognition algorithms for human actions. The deep learning (DL) method has attracted more attention in the area of CV following the success of AlexNet [5] in using convolutional neural networks (CNN) for image classification, gradually replacing the traditional HAR approaches based on the spatial and temporal structure of body movements. With advances in action recognition and the popularization of high-end hardware such as high-resolution surveillance cameras, HAR is increasingly applied at work (e.g., in video surveillance at airports for security purposes) and at home (e.g., in health monitoring for elderly people).
In the construction industry, the digital twin has become a well-recognized concept for smart construction. Besides the analysis and optimization supported by a digital representation of the facility itself, there is also a need to enhance the performance of workers and equipment, which is facilitated by automated perception and analysis of on-site activities. This requires the computer to not only identify workers and equipment but also recognize their locations and actions [6]. Many potential benefits could be realized if HAR could be applied for the monitoring and control of construction equipment. With a detailed analysis of equipment actions and a continuous optimization of equipment operation, a construction project can be better managed: for example, equipment productivity would be improved by reducing the idle time, the environment would be protected by lowering carbon emissions, and accidents would be avoided by coordinating the movement of workers and equipment [7]. Currently, the analysis of construction machinery activity is mainly performed by human workers, a process that can be time-consuming, costly, and prone to errors. A low-cost, reliable construction equipment action recognition (CEAR) method can enable automated analysis of many scenarios at construction sites and can support smart construction applications.
Researchers have long studied the recognition of construction equipment actions and shown promising results. The CEAR methods can be roughly divided into sensor-based methods and visual-based methods. In the works that use sensors for recognizing the activity of construction equipment, of note are real-time location systems (RTLS) [8,9], audio signals [10,11], inertial measurement units (IMU) [12,13], and so forth. In the works using visual-based methodology, many studies have adopted image processing-based approaches. Zou and Kim [14] used image color space (hue, saturation, and value) as the basis for image segmentation and tracing algorithms to identify the changing centroid coordinates of an excavator in successive images taken at fixed time intervals to achieve recognition of equipment movement. Gong et al. [15] proposed a visual learning approach to classify actions of construction workers and equipment using the Harris 3D interest point detector as the feature detector, local histograms as the feature representation, a bag-of-words model as the feature model, and Bayesian network models as the learning mechanism. Using a similar method, Golparvar-Fard et al. [7] studied single actions of construction equipment used for earthmoving. In this method, a video is initially represented as a collection of spatiotemporal visual features by extracting space-time interest points and describing each feature with a histogram of oriented gradients (HOG). The algorithm automatically learns the distributions of the spatiotemporal features and action categories using a multiclass support vector machine (SVM) classifier.
Many existing CEAR approaches rely on hand-crafted features, which makes it necessary to manually segment the time-series data to extract statistical features. This feature extraction process is limited by human knowledge and can only extract shallow features specified by humans. Thus, these approaches only work well for simple actions that can be easily distinguished. The results are no longer reliable once the viewpoint changes, more background noise is present, or the view becomes blocked [3], all of which are inevitable given the dynamic nature of a construction site. Furthermore, these traditional methods require manual intervention at multiple points to adjust parameters, which limits the application of these methods in cases where end-to-end output is desired. Moreover, nearly all current methods for action recognition are data driven; that is, they require a large amount of data in order to train the recognition model to achieve a better recognition rate. As such, a large-scale dataset is of considerable significance. Golparvar-Fard et al. [7] developed the first comprehensive video dataset for action recognition of excavators and dump trucks. However, it is rare to find similar datasets for other types of construction equipment.
As a part of this research, a video dataset for excavators and dump trucks is first established to expand the size of the existing datasets on construction equipment, along with a corresponding optical flow dataset that can be used for related algorithm studies. Second, a deep learning-based CEAR model is proposed, taking advantage of deep learning, which has good generalization performance, is capable of end-to-end training, and requires no feature engineering [16]. Lastly, the deep learning-based method is compared with existing similar methods and some of the best HAR methods to determine how the proposed method performs and whether advanced HAR algorithms can be directly transferred to CEAR. Related research works in HAR and CEAR are discussed in the next section, along with their limitations. Following that, the dataset development is illustrated in terms of camera arrangement, data acquisition method, and data processing method, and a detailed presentation of the proposed deep learning-based CEAR model is provided. The experimental results and a comparison of the results of the proposed model to those of two HAR algorithms are then discussed. The final section presents the conclusions that can be drawn from the results of this research study.

Related Works.
In recent years, numerous research efforts have been made to apply action recognition technologies in the construction industry. Some of these studies used construction workers as their research subjects and focused on worker health and safety, while other studies used construction equipment as the research subject and aimed to improve productivity and reduce costs for the construction project. According to the data source, the approaches used in these studies can be divided into methods based on visual data and methods based on sensor data. In the current construction industry, cameras are generally installed for surveillance and other purposes. This makes the acquisition of visual data more convenient and cost-effective. Moreover, visual data can usually provide richer information. As such, this research focuses on determining the activities of construction equipment from visual data. Previous works that use vision-based methods, known as computer vision (CV), are discussed in this section. Seo et al. [17] reviewed previous attempts to apply CV for construction safety and health monitoring from both a technical perspective and a practical perspective. They categorized previous studies into three groups (object detection, object tracking, and action recognition) based on the type of information required to evaluate unsafe conditions and actions. However, in the current status of CV application in construction sites, even the most advanced theories are faced with challenges such as a lack of high-quality datasets and the slow development of algorithms. At the same time, the development of deep learning technology, the increased image processing power provided by a graphics processing unit (GPU), and the decrease in price of specialized cameras bring about new opportunities for adopting CV-based applications in the construction industry.
2.1.1. Human Action Recognition. Many research studies have investigated CV-based human action recognition (HAR) systems, including both traditional hand-crafted and learning-based action representation approaches. The difference between these two approaches lies mainly in the method used to extract features from images. A traditional hand-crafted representation-based approach relies on expert-designed feature detectors and descriptors such as Hessian3D, scale-invariant feature transform (SIFT), HOG, enhanced speeded-up robust features (ESURF), and local binary pattern (LBP). In contrast, a learning-based representation approach uses a trainable feature extractor that automatically learns features from the raw data, eliminating the need for manual assignment and enabling end-to-end learning. The traditional hand-crafted action representation approach has been popular in the HAR community and has achieved remarkable results on various well-known public datasets [18]. This approach includes four techniques: the space/time-based method [19], the appearance-based method [20,21], the LBP-based method [22], and the fuzzy logic-based method [23]. Using these methods, the important features from a sequence of image frames are extracted to build the feature vector prior to classification by a trained classifier. For example, dense trajectory (DT) uses trajectories to capture the local motion information of the video, and a dense representation guarantees good coverage for capturing foreground motions as well as the surrounding context [24]. Wang and Schmid [25] proposed an improved DT (iDT) approach to improve the performance of video representation by making corrections that take camera motion into account. Currently, researchers are aiming to increase the quantity and quality of the datasets for human action recognition.
However, most successful hand-crafted representation methods are based on local densely sampled descriptors, which will result in a higher computational cost. The appropriate and efficient representation of data is the key to HAR. Unlike the above-mentioned approaches, where an action is represented by hand-crafted feature detectors and descriptors, learning-based representation approaches have the ability to learn a feature automatically from the raw data, thus introducing the concept of end-to-end learning, which refers to transformation from the pixel level to an action classification and is not limited by human knowledge [18]. Some learning-based approaches are based on genetic programming [26] and dictionary learning [27], while others employ deep learning-based models for action representation.
Deep learning is an important area of machine learning which aims to achieve learning at multiple levels of representation and abstraction in order to make sense of data such as speech, images, and text. The research of Karpathy et al. [28] showed the potential of CNN for large-scale video classification tasks. Simonyan and Zisserman [29] proposed a two-stream convolutional networks (two-stream ConvNets) architecture that incorporates spatial and temporal networks and demonstrated that a ConvNet trained on multiframe dense optical flow is able to achieve very good performance despite the limited amount of available training data. Tran et al. [30] argued that deep three-dimensional (3D) ConvNets trained on a large-scale supervised video dataset are effective for spatiotemporal feature learning. The 3D ConvNets build on two-dimensional (2D) ConvNets but include a time dimension; thus, they address the inability of 2D CNNs to extract temporal features. Carreira and Zisserman [31] introduced a new two-stream inflated 3D (I3D) ConvNet, where filters and pooling kernels of very deep image classification ConvNets are expanded into 3D, making it possible to learn seamless spatiotemporal feature extractors from video while leveraging successful ImageNet architecture designs and even their parameters. Varol et al. [32] learned video representations using neural networks with long-term temporal convolutions (LTC) and demonstrated that LTC-CNN models with increased temporal extents improve the accuracy of action recognition. Ng et al. [33] employed a recurrent neural network that uses long short-term memory (LSTM) cells that are connected to the output of the underlying CNN. This LSTM-CNN approach exhibits significant performance improvement over previously published results on the Sports 1 million dataset and the UCF101 dataset [34]. Donahue et al. [35] developed a novel recurrent convolutional architecture suitable for large-scale visual learning, which is end-to-end trainable, and they demonstrated the value of these models for benchmark video recognition tasks. Sevilla-Lara et al. [36] investigated the impact of different flow algorithms and input transformations to better understand how these would affect a state-of-the-art action recognition method, and they recommended a better way of using optical flow in the future.
Recently, the research community has paid a great deal of attention to deep learning-based approaches, mainly due to their excellent performance as compared to hand-crafted action representation approaches. However, some of the best learning-based methods still rely on hand-crafted features. The main reason is the lack of very large datasets for action recognition, which are required to train the feature extractors. As no dataset comparable in scale to ImageNet in the field of object recognition exists for action recognition, the HAR community is working on the development of useful datasets. HMDB [37] is an action video database with 51 action categories, which in total contain around 7,000 manually annotated clips. UCF101 [34] consists of 101 action classes in a total of 13,320 clips of video data. More recently, the development of a large-scale dataset called ActivityNet [38] provides samples from 203 activity categories with an average of 137 untrimmed videos per class and 1.41 activity instances per video, for a total of 849 hours of video. YouTube-8M is the largest multilabel video classification dataset [39] and it includes about 8 million videos (approximately 500,000 hours of video), annotated with a vocabulary of 4,800 visual entities.

Construction Equipment Action Recognition.
Although many research studies have investigated the use of HAR in the construction industry for health/safety monitoring and for the control of construction workers, the study of construction equipment action recognition (CEAR) is still at an early stage. The first work that can be found in CEAR was conducted by Gong and Caldas [40], who developed an intelligent video computing method to interpret videos of cyclic construction operations and translate the images automatically into productivity information through the recognition of actions of a concrete bucket in the concrete pour process. Akhavian and Behzadan [41] used built-in smartphone sensors as ubiquitous multimodal data collection and transmission nodes in order to detect detailed construction equipment activities, which can ultimately contribute to the process of simulation input modeling. In a case study of front-end loader activity recognition, certain key features were extracted and used to train supervised machine learning classifiers. Cao et al. [42] proposed a classification algorithm based on acoustics processing for four types of excavation equipment. They developed new acoustic statistical features (short frame energy ratio, concentration of spectrum amplitude ratio, truncated energy range, and pulse interval) to characterize acoustic signals; then, based on the probability density distributions of these acoustic features, a novel classifier was proposed. This approach has great potential to be generalized. Han and Golparvar-Fard [43] investigated current strategies for leveraging emerging big visual data in construction performance monitoring from the standpoints of reliability, relevance, and speed, and they structured a road map for research in visual sensing and analytics for construction. Roberts and Golparvar-Fard [44] presented a new benchmark dataset consisting of ten videos that can be used to detect, track, and analyze construction work activities of excavators and dump trucks. They also gave an action recognition framework composed of a detection module, a tracking module, and a recognition module. This method can automatically identify excavators and dump trucks in each frame of the video sequence, track their activities, and finally identify their construction actions. Rashid and Louis [45] proposed a data-augmentation framework for generating synthetic time-series training data for RNN-based deep learning networks. Their research results show that the deep learning framework outperformed the shallow network regarding model accuracy and generalization, and that the data-augmentation methodology has the ability to correctly simulate a real-world dataset. Kim and Chi [46] proposed a vision-based action recognition framework that considers the sequential working patterns of earthmoving excavators. The framework includes three main processes: excavator detection, excavator tracking, and excavator action recognition. Among them, the action recognition process used a CNN-DLSTM model. The framework demonstrates good generalization performance and proves the important positive impact of sequential pattern modeling on recognition performance. Rashid and Louis [13] researched the use of activity-specific equipment motions instead of vibration for action recognition. The study showed that using inertial measurement unit (IMU) data of different articulated elements can significantly improve the activity recognition results.
Previous CV-based research in the construction industry mainly focused on either identifying/tracking workers and equipment or recognizing workers' actions. Although deep learning-based action recognition algorithms have dominated the field of HAR, they are not widely investigated for CEAR. The challenges in doing so are threefold. The most significant challenge is the lack of comprehensive datasets for CEAR. Deep learning-based approaches largely rely on large numbers of high-quality raw video clips to train the feature extractor for a better performance in the later classification task. Although the number of available datasets for HAR is increasing, as mentioned earlier, this is not the case in the field of CEAR. A comprehensive dataset for CEAR needs to include video clips from a variety of working environments, from different viewing angles, and with various amounts of background clutter.
Second, there is currently a shortage of algorithms that can be applied at real construction sites. Most existing CEAR studies in the construction industry use pattern recognition methods, and those algorithms rely on hand-crafted features and are therefore limited by human knowledge. When they are applied to actual cases, their recognition performance is greatly affected by human factors.
Third, there is at present no clear benchmark for CEAR. There are fewer studies in CEAR than in HAR and, due to the shortage of standard video datasets for construction equipment, it is hard to compare different approaches on the same dataset.
Focusing on these three aspects, in this study, a new dataset for excavators and dump trucks was initially developed. This dataset was used to expand the existing dataset for the same types of equipment [7] and to train/verify the feature extractor using the deep learning-based method proposed in this research. The action recognition method is based on deep learning theory, which can automatically extract high-level features from raw data without feature engineering. The problems associated with manually engineered features can thus be avoided. A comparison with an existing similar CEAR method [46] shows that it has comparable performance. By comparing the results after applying two advanced deep learning-based HAR approaches to the same dataset, this research study investigates the possibility of transferring some of the best HAR algorithms to CEAR and broadens the path of researching CEAR methods.

Dataset Collection.
The construction equipment considered in the new dataset includes an excavator and a dump truck. For these two pieces of equipment, there are five activities in all, as listed in Table 1 and shown in Figure 1. To take advantage of the recent improvements to cameras in smartphones, this research used smartphones with a 12-megapixel rear camera to collect raw data. Video footage was collected at a resolution of 720 progressive scan (720p) and at 25 frames per second (fps). The camera placement needs to be selected carefully to prevent possible obstructions and to ensure that the construction equipment occupies a good proportion of the picture. Based on the site survey, it was found that the front views and the side views of construction equipment can provide more effective information about their actions than other views. The front view refers to the projection view from the direction that the driver faces when driving the equipment forward. Therefore, in order to capture the various views of the construction equipment and their actions from different perspectives, four cameras were used to collect videos within a 180-degree range around the construction equipment, as shown in Figure 2. Cameras 1, 2, and 3 were used to capture the front view, the side view, and the rear view, respectively, and Camera 4 was placed at a position in between Camera 1 (front view) and Camera 2 (side view), as can be seen in Figure 2, in order to provide supplemental information. This camera configuration guarantees a sufficient number of action views while using a minimum amount of video capturing resources. Finally, an annotation document was created based on the five action types for construction equipment. Table 2 shows the annotations for the action types, the total number of video clips of each type, and the numbers of video clips for various subsets of data.

Data Processing.
DivX, a video codec developed by DivX, LLC, was used to transcode the original videos to MPEG-4 format with a resolution of 480 pixels by 360 pixels. The transcoded videos were classified according to the four recording angles shown in Figure 2; the videos were then cut into shorter clips to ensure that each video clip, which had a duration ranging from 3 to 20 seconds, would contain only one complete action of a single piece of equipment. The file-naming convention for the shorter video clips is "v_equipment_action_ID#" (e.g., v_evacuator_swing_001.MPEG). Finally, within the category of each recording angle, video clips were classified according to the five action types, and stratified sampling was performed according to a ratio of 6:2:2 to form a training set, a test set, and a verification set.
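The 6:2:2 stratified sampling described above can be sketched as follows. This is a minimal illustration rather than the authors' script: the function name `stratified_split`, the clip names, and the action labels are hypothetical, and the sketch stratifies by action label only, whereas the text performs the split within each recording angle.

```python
import random
from collections import defaultdict


def stratified_split(clips, ratios=(0.6, 0.2, 0.2), seed=0):
    """Split (clip_name, action_label) pairs into training, test, and
    verification subsets so that each action keeps roughly the same ratio."""
    by_action = defaultdict(list)
    for name, action in clips:
        by_action[action].append(name)

    rng = random.Random(seed)
    train, test, verification = [], [], []
    for action in sorted(by_action):
        names = sorted(by_action[action])
        rng.shuffle(names)
        n_train = round(len(names) * ratios[0])
        n_test = round(len(names) * ratios[1])
        train += [(n, action) for n in names[:n_train]]
        test += [(n, action) for n in names[n_train:n_train + n_test]]
        # Whatever remains after the first two subsets forms the verification set.
        verification += [(n, action) for n in names[n_train + n_test:]]
    return train, test, verification


# Hypothetical clip list: 10 clips for each of two made-up action labels.
clips = [(f"v_excavator_swing_{i:03d}", "swing") for i in range(10)]
clips += [(f"v_excavator_dig_{i:03d}", "dig") for i in range(10)]
train, test, verification = stratified_split(clips)
print(len(train), len(test), len(verification))
```

Splitting per action label keeps rare actions represented in all three subsets, which a plain random split over the whole clip list would not guarantee.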
For the processed data, a corresponding optical flow dataset was developed using the Lucas-Kanade algorithm [46]. This dataset will be used to verify the performance of the HAR algorithms used for CEAR and can serve as a complement to the development of CEAR datasets. Examples from the optical flow dataset are shown in Figure 3. The video dataset developed in this research is open source and can be obtained from https://github.com/hnpyn/CEAR_dataset. The authors will update and maintain this project regularly.

Development of a Simplified Temporal Convolutional Network (STCN).
Ng et al. [33] proposed a method for modeling video frames into ordered frame sequences by using a recurrent neural network (RNN), which connects the LSTM units to the output of the CNN. The CNN structure is based on GoogLeNet [47], while the RNN adopts a deep LSTM structure [48] with five LSTM layers. This model performed very well in action recognition as compared to the best approaches available at the time. Similarly, Donahue et al. [35] also employed a method of directly connecting the RNN to the CNN structure and found that when nonlinearity is incorporated into the network status updates, learning of long-term dependencies is possible. As such, temporal dynamics and convolutional perceptual representations can be learned by jointly training the RNN and the CNN. These studies have shown that the joint architecture of the RNN and the CNN is effective and feasible. Inspired by these studies, the authors propose a CEAR process (described in Figure 4) that can automatically extract video data features and perform end-to-end training for action recognition.
The core of this process is the neural network shown in the dashed frame, which combines a CNN for extracting features from video clips and an LSTM for extracting the temporal dynamics. A fully connected layer is employed to connect the CNN and the LSTM, and a softmax layer is used to determine the classification of the equipment action.

Deep Learning.
As an emerging research direction in the field of machine learning [49], deep learning was first proposed by Hinton et al. [50]. The merit of deep learning lies in the additional levels of nonlinear operation that it encompasses [51], and deep learning is able to form more abstract high-level representations or features by combining low-level features, thereby displaying the hierarchical feature representation of the data. Therefore, deep learning is able to automatically learn the hierarchical feature representations [16] that are more conducive to classification tasks. Traditional machine learning and pattern recognition methods require the manual extraction of features. The model itself only classifies or predicts according to the features, and the hand-crafted features, to a large extent, determine the quality of the method. Manual feature extraction requires both professional knowledge and a considerable amount of time. Thus, it is limited by human knowledge, and most of the models can only obtain shallow features. This is the intrinsic motivation of this research for developing an action recognition algorithm based on deep learning.

Convolutional Neural Network.
A convolutional neural network (CNN) refers to a class of feedforward neural networks (FNNs) having a convolutional structure, and this type of network is one of the representative algorithms in deep learning. The CNN uses the idea of sparse connection and weight sharing to solve the parameter explosion in ordinary FNNs, and it adopts convolution and pooling operations to obtain local features and reduce the dimension of the feature space. The CNN generally completes the final classification through softmax by connecting to a fully connected layer. A CNN can take the raw data as its input, which avoids the complex feature extraction process of a traditional machine learning algorithm, and its weight-sharing structure reduces the number of weights, thus decreasing the complexity of the model. At the same time, the feature map is subsampled by using the principle of local correlation in images at the subsampling stage, which effectively reduces the amount of data processing required while retaining useful structural information [49]; because of this, CNNs have been widely used in CV-related tasks in recent years. LeNet-5 [52] was a milestone in early CNN development, and it established the basic structure of a CNN, which contains convolutional layers, pooling layers, and fully connected layers. Later CNNs have basically followed this same structure, with varying degrees of optimization and improvement. AlexNet [5] adopted a structure consisting of five convolutional layers and three fully connected layers, which helped it to succeed in the ImageNet competition. The success of AlexNet indicates that deep learning is a reliable method, and it lays the foundation for using deep learning in image classification and object recognition tasks.

Long Short-Term Memory.
Long short-term memory (LSTM) [53] is a variant of the RNN, a feedback neural network; it not only inherits most features of the RNN model but also addresses the issue of vanishing gradients in regular RNNs [54]. As a nonlinear model, it can be used to build larger and more complex deep neural networks. A common LSTM architecture is composed of a cell (the memory part of the LSTM unit) and three "gates" that regulate the flow of information inside the LSTM unit: an input gate, an output gate, and a forget gate, as shown in Figure 5. These gates are used to either remove or add information to the cell. Cells are circularly connected to each other, replacing the hidden unit in the regular recurrent network. The state unit has a linear self-loop structure whose weight is controlled by the forget gate.
LSTM has two states, the cell state and the hidden state. The cell state changes slowly with time, while the hidden state can vary widely at different times. The gate mechanism can adjust the focus of memory according to the training target and then recode to control the trade-off between the input at one time and the input at a subsequent time. Therefore, LSTM can remember the information that needs to be remembered for a long time, while forgetting any relatively unimportant information. With LSTM, it is easier to learn long-term dependency than in the simple RNN architecture, and it is useful for dealing with problems that are highly related to time series, such as the video sequences used for CEAR.
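For reference, the gate mechanism described above is commonly written in the following standard form. The notation is an assumption introduced here, not taken from the paper: $x_t$ is the input at time step $t$, $h_t$ the hidden state, $c_t$ the cell state, $\sigma$ the logistic sigmoid, $\odot$ the element-wise product, and $W$, $U$, $b$ the learned weights and biases. The exact variant used in the paper is not specified in the text.

```latex
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)} \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate cell state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell state update)} \\
h_t &= o_t \odot \tanh(c_t) && \text{(hidden state)}
\end{aligned}
```

The cell state update shows the linear self-loop weighted by the forget gate: when $f_t$ is close to 1 and $i_t$ close to 0, $c_t$ carries over almost unchanged, which is why the cell state changes slowly while the hidden state can vary widely.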

Simplified Temporal Convolutional Network for CEAR.
Considering that video sequences contain dynamic images that include information on both space and time, that it is difficult to extract temporal features using a simple CNN, and that a regular RNN is not able to extract image features well, a combination of CNN and LSTM is proposed in this research for CEAR [54], where the CNN is used to extract the image features of the video frame sequences and LSTM is used to extract the temporal features. In this process, the probabilities for all frames generated by the softmax layer are averaged, and the label with the highest probability is selected as the final classification result. This proposed method is called a simplified temporal convolutional network (STCN), and the structure of the STCN is shown in Figure 6. Theoretically, a model with more parameters will have a higher complexity, and when the amount of training data is insufficient, a complex model is prone to overfitting [55]. At present, considering the small amount of data in the field of CEAR, a less complex CNN model consisting of five convolutional layers is employed in this research, as shown in Figure 7. In order to obtain a sufficient number of effective receptive fields [56], the CNN model uses 7 × 7 and 5 × 5 convolution kernels in the Conv1 and Conv2 layers, respectively, and 3 × 3 convolution kernels for the other convolutional layers. In order to ensure the stability of data distribution in each layer to improve the training efficiency, batch normalization is used in all convolutional layers [57]. Because the maximum pooling operation can reduce the estimated mean shift caused by parameter errors in the convolutional layers while maintaining translation invariance, thereby minimizing the number of parameters and reducing model complexity [58], a max pooling layer is added at the end of each convolutional layer, and 2 × 2 kernels are used in all pooling layers. There is no pretraining in STCN because, based on the research result of Glorot et al. [59], the performance of a rectified linear unit (ReLU) network is far better than those of other activation function networks even without pretraining. The ReLU activation function introduces sparsity, which allows each neuron to fully play its screening role: those values matching the median value of a certain feature will be amplified, while the outlying values will be abandoned. Since the ReLU activation function only needs to perform the calculation for the maximum value, it is also superior in terms of calculation speed. As a result, ReLU is chosen as the activation function in the proposed model.
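To make the effect of the five convolution and pooling stages concrete, the sketch below traces the spatial size of the feature map through the network for a 224 × 224 input (the crop size reported in the Results section). The kernel sizes come from the text; the strides and paddings are not stated there, so stride-1 convolutions with size-preserving padding and stride-2 pooling are assumptions made for illustration only.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length of a convolution or pooling along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


# Kernel sizes from the text: 7x7 (Conv1), 5x5 (Conv2), 3x3 (Conv3 to Conv5).
conv_kernels = [7, 5, 3, 3, 3]

size = 224  # side length of the randomly cropped input frames
sizes = []
for k in conv_kernels:
    # Assumption: stride 1 with size-preserving padding of k // 2.
    size = conv_out(size, k, stride=1, padding=k // 2)
    # 2 x 2 max pooling after each convolution; stride 2 is an assumption.
    size = conv_out(size, 2, stride=2)
    sizes.append(size)

print(sizes)  # [112, 56, 28, 14, 7]
```

Under these assumptions each stage halves the feature map, so the fully connected layer that feeds the LSTM would receive a 7 × 7 map per channel.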

A fully connected layer is used to transmit the output of the CNN model to the three subsequent layers of the LSTM network. The LSTM units for all continuous image subsequences are connected to each other. The probability of each action category is then generated using the fully connected layers and the softmax function. In order to generate the action prediction label of a given video clip, an average layer is employed to take the average probability of all image frames in the video clip, and the category label with the maximum probability is used as the result for action recognition.
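The averaging step described above can be illustrated with a minimal sketch. It assumes that the network produces one vector of raw scores (logits) per frame; the function names and the label list are hypothetical and are used only for illustration.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities (numerically stable form)."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_clip(frame_logits, labels):
    """Average the per-frame probabilities and pick the most likely label."""
    probs_per_frame = [softmax(logits) for logits in frame_logits]
    num_frames = len(probs_per_frame)
    mean_probs = [
        sum(probs[c] for probs in probs_per_frame) / num_frames
        for c in range(len(labels))
    ]
    best = max(range(len(labels)), key=mean_probs.__getitem__)
    return labels[best], mean_probs
```

Averaging probabilities rather than taking a vote per frame lets confident frames contribute more to the clip-level decision than ambiguous ones.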

Results and Discussion
3.1. Results. Using the dataset introduced in Section 2.2, the authors extracted 16 image frames from each video clip to train the algorithm discussed in Section 2.3. For video clips that are less than 4 seconds, between 5 and 10 seconds, and longer than 10 seconds, image frames were extracted at intervals of 5 frames, 10 frames, and 15 frames, respectively. If fewer than 16 frames could be extracted from a very short video clip based on this rule, the last frame was repeated to make up 16 frames. After that, all images were resized to a resolution of 320 pixels by 240 pixels, and a random crop of 224 pixels by 224 pixels was used as data augmentation. Finally, all these frames were sent into the STCN model in chronological order.
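The sampling rule above can be sketched as follows. Several details are assumptions rather than the authors' implementation: clips between 4 and 5 seconds, which the text does not cover, are placed in the shortest bucket; sampling starts at the first frame; the repeated frame is the last extracted one; and the function names and the `fps` parameter are hypothetical.

```python
import random


def sampling_interval(duration_s):
    """Frame interval for a clip of the given duration in seconds."""
    # Assumption: clips between 4 s and 5 s use the shortest-clip interval.
    if duration_s < 5:
        return 5
    if duration_s <= 10:
        return 10
    return 15


def sample_frame_indices(total_frames, fps, num_samples=16):
    """Pick num_samples frame indices, repeating the last one if needed."""
    if total_frames <= 0:
        raise ValueError("clip has no frames")
    interval = sampling_interval(total_frames / fps)
    indices = list(range(0, total_frames, interval))[:num_samples]
    # Pad short clips by repeating the last extracted frame.
    indices += [indices[-1]] * (num_samples - len(indices))
    return indices


def random_crop_box(width=320, height=240, crop=224, rng=random):
    """Return (left, top, right, bottom) of a random crop inside the frame."""
    left = rng.randint(0, width - crop)
    top = rng.randint(0, height - crop)
    return left, top, left + crop, top + crop
```

For example, a 2-second clip at 30 frames per second yields only 12 sampled frames at an interval of 5, so the last sampled index is repeated four times to reach 16.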
All the work was carried out in PyTorch on a workstation with a 6-core 3.8 GHz Intel processor, 16 GB memory, a GTX1060 graphics card with 6 GB memory, and the Windows 10 operating system. In order to quantify the performance of the action recognition algorithm, three common performance metrics are employed, that is, precision, recall, and F-1 score. Calculating the precision and recall of the model makes it possible to assess the costs associated with misclassification.
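The F-1 score is the harmonic mean of precision and recall, which takes both into account.
Their mathematical equations are shown as follows:
$$\mathrm{precision} = \frac{TP}{TP + FP}, \qquad \mathrm{recall} = \frac{TP}{TP + FN}, \qquad \text{F-1} = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}, \tag{1}$$
where $TP$, $FP$, and $FN$ denote the numbers of true positives, false positives, and false negatives, respectively.
As mentioned above, the STCN model is mainly composed of two parts, that is, the CNN module and the LSTM module. The inputs to the model are the video clip frames. Experiments were conducted on the parameter configuration of the STCN model. In the experiment for weight initialization, this study performed Xavier initialization [60] on the CNN module.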
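To make the Xavier initialization mentioned above concrete, the following sketch draws the weights of one convolutional layer from the uniform variant of the scheme. The text does not state whether the uniform or the normal variant was used, and the channel counts (3 input and 64 output channels) are hypothetical values chosen only for illustration.

```python
import math
import random


def xavier_uniform_bound(in_channels, out_channels, kernel_size, gain=1.0):
    """Bound a of the uniform distribution U(-a, a) for a square conv kernel."""
    receptive = kernel_size * kernel_size
    fan_in = in_channels * receptive    # inputs feeding one output unit
    fan_out = out_channels * receptive  # outputs fed by one input unit
    return gain * math.sqrt(6.0 / (fan_in + fan_out))


def xavier_uniform_weights(in_channels, out_channels, kernel_size, seed=0):
    """Flat list of weights for one conv layer, drawn from U(-a, a)."""
    rng = random.Random(seed)
    bound = xavier_uniform_bound(in_channels, out_channels, kernel_size)
    count = out_channels * in_channels * kernel_size * kernel_size
    return [rng.uniform(-bound, bound) for _ in range(count)]


# Hypothetical first layer: 3 input channels, 64 output channels, 7 x 7 kernel.
conv1_weights = xavier_uniform_weights(3, 64, 7)
```

The bound shrinks as the number of connections grows, which keeps the variance of the activations roughly constant from layer to layer at the start of training.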
The results showed that the model with weight initialization achieved a better F-1 score (2.31% higher) than the model without parameter initialization. An experiment was then conducted on the parameter settings of the CNN module, mainly to study the influence of the convolution kernel size on recognition performance. The experimental results are shown in Table 3. Four sets of convolution kernel configurations were used for control experiments. The results indicated that the model performs best when Conv1 and Conv2 use 7 × 7 and 5 × 5 convolution kernels, respectively, and the other convolutional layers use 3 × 3 convolution kernels.
Next, the experiments on LSTM parameter settings showed that the number of LSTM layers has a greater impact on recognition performance for the dataset developed in this research. The LSTM model architecture used in this study is shown in Figure 5. The experiments show that the three-layer LSTM model has the highest F-1 score, which is at least 2% higher than those of the other LSTM layer settings. The result is shown in Figure 8. In contrast, it was found that the number of hidden units in the LSTM had little effect on the F-1 score. The authors tested the STCN model with 64, 128, and 256 hidden units in the LSTM module, and the difference in F-1 score was within 1%, as shown in Table 4. Therefore, in order to reduce the complexity of the model, save computational cost, and reduce overfitting, this model employs 64 hidden units.
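The effect of the hidden size on model complexity can be estimated with a short parameter count. The sketch below follows the parameter layout of a standard stacked LSTM with two bias vectors per gate, as in `torch.nn.LSTM`; the input feature size of 512 is a hypothetical value, since the size of the CNN output fed to the LSTM is not stated here.

```python
def lstm_parameter_count(input_size, hidden_size, num_layers):
    """Number of trainable parameters in a stacked LSTM."""
    total = 0
    layer_input = input_size
    for _ in range(num_layers):
        # Four gates, each with input weights, recurrent weights,
        # and two bias vectors.
        total += 4 * (hidden_size * (layer_input + hidden_size) + 2 * hidden_size)
        layer_input = hidden_size  # deeper layers read the previous layer's output
    return total


# Hypothetical input size of 512 features per frame, three LSTM layers.
counts = {h: lstm_parameter_count(512, h, 3) for h in (64, 128, 256)}
```

Under this assumption, the LSTM module grows quickly with the hidden size, which is consistent with choosing the smallest tested setting when the F-1 scores differ by less than 1%.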
In the experiments of this research, all models use the Adam optimization algorithm [61] as the optimizer, the learning rate is set at 0.001, and the other parameter settings are β1 = 0.9, β2 = 0.999, and ε = 10e−8. Under the above settings, the STCN model achieves a precision of 77.55%, a recall rate of 75.00%, and an F-1 score of 76.25%, as shown in Table 5.
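The reported F-1 score can be checked against the reported precision and recall, since the F-1 score is their harmonic mean. The following minimal sketch uses hypothetical function names and only the figures quoted above.

```python
def precision(true_pos, false_pos):
    """Share of predicted positives that are correct."""
    return true_pos / (true_pos + false_pos)


def recall(true_pos, false_neg):
    """Share of actual positives that are found."""
    return true_pos / (true_pos + false_neg)


def f1_score(p, r):
    """Harmonic mean of precision p and recall r."""
    if p + r == 0:
        return 0.0
    return 2 * p * r / (p + r)


# Values reported for the STCN model, in percent.
stcn_f1 = f1_score(77.55, 75.00)
```

Rounded to two decimals, `stcn_f1` is 76.25, which agrees with the reported F-1 score.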
In addition, this research also studied the recognition effect of the STCN model for each action category. The experimental results are shown in [...] performance of the CNN-DLSTM model used by Kim and Chi [46] for CEAR, 3D ConvNets proposed by Tran et al. [30], two-stream ConvNets proposed by Simonyan and Zisserman [29], and I3D ConvNets proposed by Carreira and Zisserman [31]. The results, which are summarized in Table 5, indicate that the STCN method has a performance comparable to those of the method proposed by Kim and Chi, the 3D ConvNets method, and the two-stream ConvNets method in precision, recall, and F-1 score. The results of using the CNN module and the LSTM module separately for CEAR are also presented in Table 3. Neither module performs satisfactorily on its own, which confirms that extracting both key visual features and context features has a significant positive impact on CEAR performance.
The CNN-DLSTM model refers to the method used by Kim and Chi [46] in the action recognition process. Its CNN consists of 10 convolutional layers and 5 pooling layers, and the sequential pattern learning module consists of two LSTM models.
The 3D ConvNets model was reproduced as shown in Figure 9. The convolutional and pooling operations of the 3D ConvNets model are performed in the temporal dimension as well as the spatial dimensions. All 3D convolution filters are 3 × 3 × 3 in size with a stride of 1 × 1 × 1. The first 3D pooling layer is 1 × 2 × 2 with a stride of 1 × 2 × 2, and the remaining 3D pooling layers are 2 × 2 × 2 with a stride of 2 × 2 × 2. This design preserves the temporal information in the early stages. The model can capture both appearance information and action information simultaneously, and it produces excellent results in HAR tasks. Using the same dataset, the 3D ConvNets model achieved an F-1 score of 73.55%.
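The effect of the pooling layout described above on the temporal dimension can be traced with a small shape calculation. The sketch assumes that each pooling stride equals its kernel size, as stated in the text, and that no padding is used; the input of 16 frames at 224 × 224 pixels is a hypothetical example, since the input size used for the 3D ConvNets model is not given here.

```python
def pool3d_output_shape(shape, pools):
    """Shape (frames, height, width) after a sequence of 3D pooling layers.

    Each pool is (kt, kh, kw); the stride is assumed equal to the kernel
    and no padding is applied, so every dimension is floor-divided.
    """
    frames, height, width = shape
    for kt, kh, kw in pools:
        frames, height, width = frames // kt, height // kh, width // kw
    return frames, height, width


# First pooling layer keeps all frames; later layers also halve time.
after_first = pool3d_output_shape((16, 224, 224), [(1, 2, 2)])
after_second = pool3d_output_shape((16, 224, 224), [(1, 2, 2), (2, 2, 2)])
```

With these inputs, the first pooling layer leaves all 16 frames in place while halving the spatial size, and the temporal dimension is only reduced from the second pooling layer onward.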
Two-stream ConvNets were also examined using the same dataset. As shown in Figure 10, the two-stream ConvNets model consists of two convolutional network structures: a spatial stream and a temporal stream. The spatial stream convolution network is essentially an image classification network that acquires static appearance features using a single frame image as input. The temporal stream network uses multiframe optical flow images as input, as shown in Figure 11. In order to employ the two-stream ConvNets, a corresponding optical flow dataset was created to extract temporal features. An action classification result was obtained by taking the average of the classification scores. The two-stream ConvNets model is among the leading models for HAR tasks in terms of performance, and it achieved an F-1 score of 76.26% in CEAR in this research.
By comparing the results for the STCN method, the CNN-DLSTM method, the 3D ConvNets method, the two-stream ConvNets method, and the I3D ConvNets method, it is found that the STCN method proposed in this study exhibits a performance in CEAR tasks that is comparable to those of other deep learning-based methods that have been proven to produce good results. Figure 12 shows the training time of these methods. It can be seen that, among methods with similar performance, STCN requires the shortest training time, so it has a certain speed advantage. As mentioned above, the two-stream ConvNets model achieved a slightly better F-1 score than STCN (76.26% versus 76.25%); however, the two-stream ConvNets model and the I3D ConvNets model consumed much more computing time (9 h 28 m and 17 h 22 m versus 7 h 43 m), so the STCN model is still better overall. In addition, the study demonstrates the feasibility of directly transferring some HAR methods to the CEAR field, as the 3D ConvNets model and the two-stream ConvNets model both achieved an acceptable rate of accuracy when using the same dataset as input.
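The training times quoted above can be put on a common scale with a short calculation. The sketch below uses only the three durations given in the text, assuming they are listed in the same order as the models; the variable names are hypothetical.

```python
def to_minutes(hours, minutes):
    """Convert a duration given as hours and minutes to minutes."""
    return hours * 60 + minutes


stcn = to_minutes(7, 43)
two_stream = to_minutes(9, 28)
i3d = to_minutes(17, 22)

# Training time of each baseline relative to STCN.
two_stream_ratio = two_stream / stcn
i3d_ratio = i3d / stcn
```

On these figures, training the two-stream ConvNets model takes about 1.23 times as long as training STCN, and training the I3D ConvNets model takes about 2.25 times as long.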

Conclusions
A digital twin covers not only the facility to be built but also the workers and construction equipment on a construction site. Research in CEAR is no less significant than that in HAR in terms of smart construction, because the perception of a construction activity requires an understanding of the what (object identification), where (object tracking), and how (action recognition) for both construction workers and equipment. Much of the research in the field of HAR can be applied in the construction industry. However, the research in the field of CEAR is very limited to date, due to a lack of high-quality datasets, the complexity of the application environment at construction sites, and the difficulty of benchmarking performance in CEAR.
In this research, the authors developed an open-source video dataset of 2,064 video clips with five action types for excavators and dump trucks, including a corresponding optical flow dataset. This dataset supplements the existing datasets for CEAR research and could potentially be used by other researchers in later studies. One major reason that hinders the research in the area of CEAR is the lack of high-quality datasets, and the authors encourage other researchers to share their datasets. Another contribution of this research in the field of CEAR is a new deep learning-based approach, STCN, which combines CNN and LSTM, where the CNN is used to extract the image features from the video frame sequences and the LSTM is used to extract the temporal features. The STCN proposed in this research achieved an F-1 score of 76.25% for the dataset developed earlier. In order to evaluate the performance of the STCN, a similar CEAR method and three of the best-performing deep learning-based approaches in the field of HAR, namely, the 3D ConvNets method, the two-stream ConvNets method, and the I3D ConvNets method, were examined using the same dataset as input, and they achieved F-1 scores of 75.25%, 73.55%, 76.26%, and 76.54%, respectively. Those four methods either underperformed in comparison with the STCN method or had a similar performance but needed significantly more computing time. The training time of the STCN is the shortest. This comparison indicates not only the advantage of the STCN but also the possibility of directly transferring some HAR methods to the field of CEAR.
There are some limitations in this research. First, the dataset developed in this study is still relatively small compared to datasets available in other application areas using deep learning-based solutions. It is expected that a better accuracy rate would be achieved once a much larger dataset is developed. With a limited dataset, one possible solution is to pretrain the recognition algorithm using datasets for HAR. Second, this research only studied the recognition of actions for two types of construction equipment (excavators and dump trucks) and did not investigate whether different types of equipment (e.g., equipment with apparent joints such as excavators and equipment with no joints such as dump trucks) require different recognition algorithms in order to achieve better action recognition.
There is much important work to be done in the future. There exist hundreds of types of construction equipment in the construction industry, and each type of equipment has multiple actions. It is important to develop more comprehensive video datasets for various types of equipment under different conditions, for example, camera motion or disruptive weather (e.g., heavy winds or rain). In addition, it is important to study the perception of activities where multiple types of construction equipment are moving at the same time. In many cases, a surveillance camera will capture a scene of a construction site containing several pieces of equipment; this becomes a challenge when different types of equipment in the same frame require different action recognition algorithms to achieve better recognition performance.

Data Availability
The dataset used to support the findings of this study may be released upon application to Tianjin University, which can be contacted at jinyuezhang@tju.edu.cn.