Improved action proposals using fine-grained proposal features with recurrent attention models

Recent models for the temporal action proposal task show that local properties can be an alternative to the region proposal network (RPN) for generating good proposal candidates on untrimmed videos. In this study, we devise an RPN model with a new two-stage pipeline and a new joint scoring function for temporal proposals. The evaluation of local properties is integrated into our RPN model to search for the best proposal candidates that can be distinguished mainly in fine details of proposal regions. Our network models proposals in multiple scales using two recurrent neural network layers with attention mechanisms. We observe that joint training of the RPN with local clues and multi-scale modeling of proposals with recurrent attention mechanisms improve the performance of the proposal generation task. Our model yields state-of-the-art results on the THUMOS-14 and comparable results on the ActivityNet-1.3 datasets.


Introduction
Understanding long-term video sequences has attracted increasing attention in video content analysis. These sequences comprise a varying number of actions with foreground and background temporal regions, and the quality of the regions identified as action proposals is a critical success factor for video analysis applications. In particular, identifying each region requires a challenging detection task: the localization of confident segments with high scores and accurate temporal boundaries. Once high-quality action segments are available, they can be used for other tasks, such as action classification [1,2] and automatic video description generation [3,4].
Two important factors of effective localization are (1) generating segment proposals that overlap highly with the ground truth and (2) producing good estimates of boundary locations. However, action intervals can be quite short compared to the video duration, and varying interval lengths can be observed within the same action category (depending on the performer) or among different action categories. To address these limitations, most existing studies rely on region proposal networks (RPN) to accurately localize candidate segments [5]. These models may still not capture the boundaries precisely despite their use of regression models. Therefore, the recent state of the art focuses on local clues to analyze boundaries, and it has been shown that local properties are crucial for accurate localization with improved detection rates [6,7]. However, most of these models comprise multiple stages that combine independently trained boundary discovery and proposal estimation modules. Two studies related to our approach, Boundary Matching Network (BMN) [8] and Multi-granularity Generator (MGG) [9], propose unified models with integrated boundary discovery and proposal estimation modules. They perform boundary discovery at coarse granularity using local estimates along the whole video sequence [7], and their models are designed around convolutional neural networks. In contrast, our model is fully RPN-based and, by integrating coarse-to-fine scale analysis, performs boundary discovery in the temporal neighborhood of proposals. Moreover, our design models temporal proposal regions with recurrent models. One recent work, Dense Boundary Generator (DBG) [10], follows an idea similar to our boundary modeling for estimating boundaries along proposal regions; however, its proposal generation module is based on a three-stage sampling strategy for the start, end and center regions, respectively, with different loss alternatives and a scoring function.
In this paper, we introduce a proposal-based end-to-end trainable network that performs proposal prediction and boundary discovery jointly to generate high-quality action segments. Our proposal-based framework investigates coarse-scale properties (such as the proposal representation) and fine-scale details (such as boundaries) simultaneously in the extracted proposal regions. The network model is based on recurrent attention models that investigate the local temporal behavior of proposal regions. The framework follows the proposal generation strategy of the region proposal network [11] to produce a set of temporal proposals of varying intervals, but our model produces them in multiple temporal scales. Adapting the RPN to the temporal detection task, anchors are generated as 1D temporal segments in multiple scales and aspect ratios over long-term videos. The temporal proposals are then examined in both coarse and fine scales. Fig. 1 illustrates the pipeline.
For the coarse-scale analysis, the proposal itself and its neighborhood are important as holistic features [12]. A vicinity is defined around the target proposal to encode the neighborhood. For the fine-scale analysis, the unit-level regions (snippets) are important as local features. The unit-level regions are defined with three types: starting, ending and actionness regions. The starting and ending regions are recognized as the boundaries of an action, and the actionness defines the likelihood of each snippet containing a generic action [1,7]. Having both coarse-scale proposal labels (foreground and background) and fine-scale unit-level binary labels on proposal snippets (start, end and actionness), our model accomplishes the proposal extraction task jointly.
In short, our model provides proposals highly overlapping with the ground truth via an RPN-based framework, and accurate localization of boundaries via the detection of unit-level features in the proposal neighborhood. We validate our model with experiments on the THUMOS-14 [13] and ActivityNet-1.3 [14] datasets, and the results show that we achieve state-of-the-art performance. The main contributions of our study are that (i) we propose a new RPN model with a two-stage pipeline supporting coarse-to-fine scale modules and introduce a new scoring function, (ii) we propose a new multi-scale segment-level proposal module using a recurrent attention model, (iii) we show that joint training of coarse-to-fine scale action proposals and the recurrent attention model improves proposal quality, and (iv) we verify that our new temporal proposal framework achieves state-of-the-art performance.

Background
Many human action video datasets have emerged as standard benchmarks for action recognition: HMDB-51 [15], UCF101 [16] and Kinetics [17]. Recently, a large-scale benchmark, HiEve [18], has been introduced with complex events for a wide range of tasks including multi-person pose tracking, pose estimation and action recognition. Holistic representations [19] and the structure of motion [20-23] enable effective action classification on these benchmarks with significant performance improvements.
In addition to atomic action recognition, another line of research focuses on spatio-temporal action detection using benchmarks such as UCF101 Sports [24], JHMDB-21 [25], and AVA [26]. Most works for this task follow a common detection-linking pipeline [27-29]: they first require dense detection in video frames and then link the detections into proposals. To leverage sequential information, some works exploit 3D convolutional networks [2], while others employ LSTMs to encode temporal information within action tubes [30]. More recently, more effective coarse-to-fine strategies have emerged [31,32], in which refinement is performed over coarser action tubes. There are also attention-based models in various forms, including second-order pooling [33], soft-attention [34] and self-attention [35]. The Hierarchical Self-Attention Network (HISAN) produces spatio-temporal tubes with a hierarchical bidirectional self-attention mechanism [36]. The Action Transformer model [35] aggregates features from the spatio-temporal context around the person. These methods and benchmarks mainly target recognizing actions and detecting actors in trimmed video sequences [37]. Instead, our research targets proposal generation for action localization in untrimmed video sequences.
Action localization approaches on untrimmed videos can be grouped into fully supervised and weakly supervised methods. Fully supervised models can be further divided into (i) two-stage detectors that categorize actions over generated proposals and (ii) one-stage detectors that jointly generate proposals and categorize actions. Our focus is on a fully supervised two-stage model targeting temporal action proposals that can further be used for action detection. For the two-stage detection process, good action proposals are crucial. To generate high-quality proposals, the confidence score of being a foreground action region should be high and the boundaries should be accurate enough to capture the ground-truth segments. Related approaches mostly rely on the sliding-window principle to create proposals [5] and follow the Faster R-CNN [11] proposal generation strategy to predict proposal locations. Contrary to overlapping sliding windows, the Single-Stream Temporal Action Proposal (SST) [38] model processes the entire video in a single pass using recurrent neural networks. In the one-stage detection process, the proposal and classification stages are jointly trained. The Single Shot Action Detector (SSAD) [39], which directly detects action instances and skips the proposal generation step, is inspired by single-shot detectors such as the Single Shot MultiBox Detector (SSD) [40] and You Only Look Once (YOLO) [41]. On the other hand, another line of research on action localization focuses on weakly supervised localization, in which ground-truth temporal locations are absent. To model the background and foreground activity effectively, attention-based models have been introduced. UntrimmedNets [42] introduces a selection module including hard selection based on multiple instance learning and soft selection based on attention. One recent approach is HAM-NET [43] with a hybrid attention mechanism including temporal soft, semi-soft and hard attention.
Targeting the fully supervised action localization task, the approaches for proposal generation can be categorized into two groups: top-down and bottom-up. While the former relies on the sliding-window principle [5] or the Faster R-CNN proposal generation strategy to classify candidate proposals, the latter is based on temporal boundaries detected through unit-level features for extracting candidate segments [1,7]. The complementary characteristics of these approaches are the main motivation for our work. The Temporal Unit Regression Network (TURN) [44] decomposes videos into short units and uses them for generating proposals; it also employs temporal regression to adjust the boundaries of sliding windows. Temporal Action Grouping (TAG) [1] connects high-scoring regions into proposals by adopting the watershed algorithm. The Boundary Sensitive Network (BSN) [7] detects boundaries locally and evaluates proposal confidence scores within a region globally. Complementary Temporal Action Proposal (CTAP) [6] uses sliding windows and grouping-based methods jointly to generate high-quality proposals; the sliding windows and actionness proposals are processed by a temporal CNN for proposal ranking and boundary adjustment. Another approach, the Snippet Relatedness-based Generator (SRG) [45], represents long-range dependencies among snippets using a score map. One recent study, the Boundary Content Graph Neural Network (BC-GNN) [46], uses a graph neural network to model interactions between the boundaries and the contents of proposals. Another study, the Relaxed Transformer Decoder (RTD-Net) [47], proposes a transformer model inspired by the recent Transformer-based object detection framework DETR [48].
Both proposal-level and unit-level features are important for extracting precise boundaries and high-scoring proposals [6,9]. Some state-of-the-art works are BMN [8] and MGG [9], with joint models integrating video features and boundary embedding features. Following these approaches, we take full advantage of end-to-end trainable models and join the proposal prediction and the boundary discovery in an end-to-end trainable network. However, our joint model is designed over a proposal-based network that analyzes a proposal at the segment and unit levels within two complementary modules. As a result, we propose a novel alternative network for temporal action proposal generation.

Our approach
In this section, we present the details of our model for the joint modeling of action proposals and boundaries in long-term untrimmed video sequences. We assume that analyzing segments both in the coarse scale as a whole and in the fine scale with unit properties (e.g., localizing starting and ending positions) helps to improve confidences and obtain high-quality proposals in the temporal proposal generation task. With this aim, we present a temporal proposal network consisting of two main modules, a segment-level proposal module and a unit-level proposal module, to detect proposals using their holistic and unit-level properties simultaneously in a joint proposal-based framework.
The segment-level proposal module is devised to encode temporal proposals with two complementary recurrent neural network layers that model temporal dynamics over the video snippets (features) of proposals. While the first recurrent layer encodes the proposal region, the second layer encodes the proposal region together with its neighborhood. In parallel, the unit-level proposal module identifies the local properties of the proposal snippets. The unit-level module detects snippets of the {start, end, actionness} types in the neighborhood of a proposal region by either utilizing the second recurrent layer or bypassing it. In particular, each proposal is characterized in terms of three components: the proposal region, its neighborhood and the unit-level properties in this neighborhood. The devised model for proposal generation is illustrated in Fig. 1.

Visual encoding of untrimmed videos
Our network architecture takes as input untrimmed video sequences consisting of video snippets. Given a video with its duration, it is divided into T snippets, where each snippet is encoded using pretrained CNN models. In our study, we use two types of CNN models for visual encoding [49,50], the details of which are given in Section 4. While one model encodes a snippet using a video-level visual feature, f^v_t, the other model encodes the snippet using a frame-level visual feature, f^f_t. After extracting the two types of features for the T snippets, the features are concatenated as f_t = f^v_t ⊕ f^f_t. As a result, each video is converted to a sequence of T unit features, F = {f_t}, t = 1, …, T.

Proposal generation
Following the RPN for object detection [11], we introduce a set of anchors as candidate proposals, but in the temporal domain. Given a video of T snippets (features) as input, a set of temporal anchors is produced as output. These anchors serve as references at various intervals to predict the corresponding segment proposals. We generate anchors using a sliding window along the T snippets, where a new set of anchors is sampled at multiple intervals (scales) at each sliding-window location.
In our architecture, an anchor is generated together with an anchor-vicinity, the complementary peer region covering the anchor and its immediate surroundings. For an anchor a within the time interval [s_a, e_a], its anchor-vicinity is defined within [s_a − m_a, e_a + m_a], where s_a and e_a are the starting and ending positions of the anchor and the margin m_a is computed as m_a = (e_a − s_a)/2. An anchor and its vicinity are encoded with ℓ2-normalized sequences of snippet features of lengths K and L, F_a = {f_{a,k}}, k = 1, …, K, and F̂_a = {f̂_{a,l}}, l = 1, …, L, respectively; the details are given in Section 4.
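As an illustration, the following minimal sketch expands an anchor interval by a half-length margin on each side; the clipping to the video extent and the variable names are assumptions for illustration only.

```python
def anchor_vicinity(start, end, video_len):
    """Expand a temporal anchor [start, end] into its anchor-vicinity.

    The margin is half of the anchor length, added on both sides; the result
    is clipped to the valid video range [0, video_len] (an assumption).
    """
    margin = (end - start) / 2.0
    v_start = max(0.0, start - margin)
    v_end = min(float(video_len), end + margin)
    return v_start, v_end

# Example: a 10-snippet anchor starting at snippet 40 in a 100-snippet video
print(anchor_vicinity(40, 50, 100))  # -> (35.0, 55.0)
```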
During training, positive anchors are those that have at least 0.7 temporal Intersection-over-Union (tIoU) overlap with the closest ground-truth segment, or those that have the highest tIoU value with a ground-truth instance; negative anchors are hard negative instances that overlap with a ground-truth instance within a tIoU range of [0.1, 0.3]. For each ground-truth segment, we generate a maximum of 10 positive and 10 negative samples to keep the positive-to-negative ratio close to one.
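The anchor labeling rule can be sketched as follows; the helper names are illustrative, and the cap of at most 10 positive and 10 negative samples per ground-truth segment is omitted for brevity.

```python
import numpy as np

def t_iou(seg_a, seg_b):
    """Temporal IoU between two segments given as (start, end)."""
    inter = max(0.0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]))
    union = (seg_a[1] - seg_a[0]) + (seg_b[1] - seg_b[0]) - inter
    return inter / union if union > 0 else 0.0

def label_anchors(anchors, gt_segments, pos_thr=0.7, neg_range=(0.1, 0.3)):
    """Assign +1 (positive), -1 (hard negative) or 0 (ignore) to each anchor."""
    ious = np.array([[t_iou(a, g) for g in gt_segments] for a in anchors])
    best = ious.max(axis=1)
    labels = np.zeros(len(anchors), dtype=int)
    labels[best >= pos_thr] = 1
    # the anchor with the highest overlap for each ground truth is also positive
    labels[ious.argmax(axis=0)] = 1
    neg = (best >= neg_range[0]) & (best <= neg_range[1]) & (labels != 1)
    labels[neg] = -1
    return labels

print(label_anchors([(10, 20), (12, 22), (60, 80)], [(11, 21)]))
```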

Backbone module
Our network architecture first takes the encoded anchors and vicinities as input and then passes them through a backbone network. The backbone is a small inception module whose main purpose is to fetch local clues at various scales (with various kernel sizes). It consists of three convolution layers with 1 × 1, 1 × 3 and 1 × 5 kernel sizes. The outputs of the layers are combined with a max-pooling layer, see Fig. 2. The backbone output is fed as input into the main modules of the proposed architecture.
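A minimal PyTorch sketch of such a temporal inception block is given below; the channel sizes and the exact way the three branches and the max-pooling layer are combined are assumptions, since only the kernel sizes are specified above.

```python
import torch
import torch.nn as nn

class TemporalInception(nn.Module):
    """Small inception block over a snippet sequence of shape (batch, channels, time)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.b1 = nn.Conv1d(in_ch, out_ch, kernel_size=1)
        self.b3 = nn.Conv1d(in_ch, out_ch, kernel_size=3, padding=1)
        self.b5 = nn.Conv1d(in_ch, out_ch, kernel_size=5, padding=2)
        self.pool = nn.MaxPool1d(kernel_size=3, stride=1, padding=1)

    def forward(self, x):
        # concatenate the three convolution branches, then max-pool over time
        y = torch.cat([self.b1(x), self.b3(x), self.b5(x)], dim=1)
        return self.pool(y)

feats = torch.randn(8, 10240, 42)      # e.g., 8 vicinities of L = 42 snippets
backbone = TemporalInception(10240, 256)
print(backbone(feats).shape)           # torch.Size([8, 768, 42])
```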
The backbone module can either be designed as a network shared between the anchor and anchor-vicinity batches, or it can be unshared, with two separate backbone components. We experiment with both shared and unshared backbone alternatives. With shared weights, the discriminative clues of anchors and vicinities are encoded together, but with a slight drop in performance (see Section 4).

Segment-level proposal module, M_S
The segment-level proposal module aims to score the holistic actionness of temporal proposals. The module refines the generated anchors (explained in Section 3.2) and predicts the probability of being a foreground anchor.
Two streams of neural network models are employed to perform proposal scoring: one model evaluates the proposal (anchor) batches and the other evaluates the proposal (anchor) vicinity batches. Each stream consists of a bidirectional LSTM recurrent neural network [51] and an attention module. We choose the bidirectional LSTM since it models the temporal dependencies of the context using memory cells and propagates frame information in both the forward and backward directions. In our framework, the LSTM layers capture temporal dynamics within the proposals and vicinities.
On top of each bidirectional LSTM, we attach an attention module to encode context information over an anchor. We focus on two different attention mechanisms: (i) soft-attention and (ii) multi-head-attention, see Fig. 3. For the i-th anchor (and similarly for its vicinity), the outputs of the bidirectional network are aggregated with one of these attention modules.

Soft-attention.
The soft-attention module [52,53] is formulated as

a_i = Σ_{k=1..K} α_{i,k} h_{i,k},    α_{i,k} = exp(e_{i,k}) / Σ_{k'} exp(e_{i,k'}),

where H_i = {h_{i,k}}, k = 1, …, K, contains the outputs of the bidirectional LSTM, K is the length of the input, and e_{i,k} is a learned score of h_{i,k}. Each h_{i,k} contains information about the whole anchor with a focus on the k-th feature. The context vector a_i is computed as a weighted sum of these features, and the weight α_{i,k} is computed using a softmax function. In our case, the first attention module operates on the K-length sequence H_i output by the anchor LSTM and generates a_i for an anchor, while the second attention module operates on the L-length sequence Ĥ_i and generates â_i for the vicinity.
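A possible PyTorch realization of this soft-attention pooling is sketched below; the single-hidden-layer tanh scoring network used to produce the unnormalized scores e_{i,k} is a common choice and an assumption here, not a detail given above.

```python
import torch
import torch.nn as nn

class SoftAttentionPool(nn.Module):
    """Pool a sequence of BiLSTM outputs into a single context vector."""
    def __init__(self, hidden_dim, attn_dim=64):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(hidden_dim, attn_dim),
            nn.Tanh(),
            nn.Linear(attn_dim, 1),
        )

    def forward(self, h):                # h: (batch, seq_len, hidden_dim)
        e = self.score(h)                # (batch, seq_len, 1) unnormalized scores
        alpha = torch.softmax(e, dim=1)  # attention weights over the sequence
        return (alpha * h).sum(dim=1)    # weighted sum -> (batch, hidden_dim)

h = torch.randn(4, 14, 192)              # e.g., 14 anchor snippets, 2x96 BiLSTM units
print(SoftAttentionPool(192)(h).shape)   # torch.Size([4, 192])
```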
Multi-head-attention. The multi-head-attention module [54] is formulated as

Attention(Q, K, V) = softmax(Q K^T / √d_k) V,

applied with Q = K = V = H̃_i, where H̃_i is the concatenation of the outputs of the bidirectional LSTM, H_i, with an attention token. Similar to BERT's [CLS] token [55], the attention token is a learnable embedding attached to the sequence of LSTM outputs. The multi-head-attention unit is designed with 8 heads, and the same matrix H̃_i is fed into the attention module as the Q, K and V inputs. d_k is the dimension of the queries (d_k = 96). We employ a residual connection followed by layer normalization [54]. The state of the attention token at the output of the multi-head-attention unit, h̃_{i,0}, is used as the context embedding, a_i.
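The sketch below shows one way to realize this unit with torch.nn.MultiheadAttention; the embedding dimension and the exact placement of the residual connection and layer normalization are assumptions.

```python
import torch
import torch.nn as nn

class TokenAttention(nn.Module):
    """Multi-head self-attention over LSTM outputs prefixed with a learnable token."""
    def __init__(self, dim=96, heads=8):
        super().__init__()
        self.token = nn.Parameter(torch.zeros(1, 1, dim))   # learnable attention token
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, h):                                    # h: (batch, seq_len, dim)
        tok = self.token.expand(h.size(0), -1, -1)
        x = torch.cat([tok, h], dim=1)                       # prepend the token
        y, _ = self.attn(x, x, x)                            # same matrix as Q, K and V
        y = self.norm(x + y)                                 # residual + layer norm
        return y[:, 0]                                       # token state = context embedding

print(TokenAttention()(torch.randn(4, 42, 96)).shape)        # torch.Size([4, 96])
```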
Finally, the outputs of the first bidirectional LSTM layer (for the anchor) and the outputs of the attention modules of the two streams are concatenated as r_i = h_{i,1} ⊕ h_{i,K} ⊕ a_i ⊕ â_i, where h_{i,1} and h_{i,K} are the summary vectors (the first and the last outputs) of the bidirectional LSTM layer in the forward and backward directions for the anchor i, and a_i and â_i are the attention vectors for the anchor and the anchor-vicinity, respectively. The concatenated outputs are fed into three consecutive linear layers to predict the i-th anchor score, denoted p(s_i, e_i), where s_i and e_i are the anchor's starting and ending positions.
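Putting the pieces together, a simplified sketch of one segment-level stream pair is shown below; the hidden sizes, the three-layer scoring head and the use of a soft-attention pooling (rather than the multi-head alternative) are assumptions based on the description above.

```python
import torch
import torch.nn as nn

class SoftAttentionPool(nn.Module):            # same pooling idea as in the previous sketch
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)
    def forward(self, h):
        alpha = torch.softmax(self.score(h), dim=1)
        return (alpha * h).sum(dim=1)

class SegmentScorer(nn.Module):
    """BiLSTM + attention streams for the anchor and its vicinity, fused by three linear layers."""
    def __init__(self, feat_dim, hidden=96):
        super().__init__()
        self.lstm_a = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.lstm_v = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.att_a = SoftAttentionPool(2 * hidden)
        self.att_v = SoftAttentionPool(2 * hidden)
        self.head = nn.Sequential(
            nn.Linear(8 * hidden, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 1),                             # foreground confidence logit
        )

    def forward(self, anchor_feats, vicinity_feats):       # (batch, K, feat) / (batch, L, feat)
        ha, _ = self.lstm_a(anchor_feats)
        hv, _ = self.lstm_v(vicinity_feats)
        summary = torch.cat([ha[:, 0], ha[:, -1]], dim=1)  # first and last BiLSTM outputs
        r = torch.cat([summary, self.att_a(ha), self.att_v(hv)], dim=1)
        return self.head(r)

scorer = SegmentScorer(feat_dim=10240)
print(scorer(torch.randn(2, 14, 10240), torch.randn(2, 42, 10240)).shape)  # torch.Size([2, 1])
```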

Unit-level proposal module, M_U
As opposed to recent state-of-the-art models [7-9] that localize unit-level properties over the whole video sequence (a sequence of T snippets), our unit-level module detects these properties in the anchor-vicinity, which covers the anchor neighborhood in a wider extent (see anchor generation in Section 3.2). With this aim, unit-level snippets are categorized into three types: {start, end, actionness} units. In particular, the anchor-vicinity with a fixed number of snippets (L features) is fed into this network to rank each snippet (feature) for all three categories using predicted confidence scores. Please note that the unit-level proposal module, M_U, is evaluated on positive anchor-vicinity instances.
For positive vicinities, we generate three binary label vectors for the starting, ending and actionness categories, respectively. The unit-level module is designed as an inception module placed either on (i) the vicinity bidirectional LSTM outputs Ĥ_i = {ĥ_{i,l}}, l = 1, …, L, or on (ii) the backbone module outputs, bypassing the LSTM (see Section 4). The inception module consists of two parallel convolution branches, one with a 1 × 1 kernel and the other with a 1 × 5 receptive field (a 1 × 1 convolution followed by a 1 × 5 convolution). The branch outputs are concatenated to predict scores for the three types of unit properties (three score values for each snippet in the anchor-vicinity), see Fig. 4.
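A minimal sketch of such a unit-level head is given below; the channel widths, the activation functions and the sigmoid output are assumptions.

```python
import torch
import torch.nn as nn

class UnitLevelHead(nn.Module):
    """Per-snippet start / end / actionness scores over an anchor-vicinity."""
    def __init__(self, in_ch, mid_ch=128):
        super().__init__()
        self.branch1 = nn.Conv1d(in_ch, mid_ch, kernel_size=1)
        self.branch2 = nn.Sequential(                        # 1x1 followed by 1x5
            nn.Conv1d(in_ch, mid_ch, kernel_size=1),
            nn.ReLU(),
            nn.Conv1d(mid_ch, mid_ch, kernel_size=5, padding=2),
        )
        self.out = nn.Conv1d(2 * mid_ch, 3, kernel_size=1)   # start, end, actionness

    def forward(self, x):                                    # x: (batch, channels, L)
        y = torch.cat([self.branch1(x), self.branch2(x)], dim=1)
        return torch.sigmoid(self.out(y))                    # (batch, 3, L) scores in [0, 1]

scores = UnitLevelHead(192)(torch.randn(2, 192, 42))
print(scores.shape)                                          # torch.Size([2, 3, 42])
```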

Training and objective function
There are five components in our loss function. Two of them are defined for the segment-level proposal module, while the other three belong to the unit-level proposal module. As a part of the segment-level proposal module, the classification loss L_cls is evaluated to identify foreground and background regions, and the regression loss L_reg is evaluated for regressing segment intervals. As a part of the unit-level proposal module, the unit-level losses L_start, L_end and L_action are evaluated for predicting {start, end, actionness} scores on the snippets of positive anchor samples. The final loss function of our proposed model is defined as

L = w_1 L_cls + w_2 L_reg + w_3 L_start + w_4 L_end + w_5 L_action.    (3)

Since our model includes multiple tasks, the system performance is affected by the relative weighting of the task losses. In this work, we first report results using a naive approach where the losses are combined as a weighted linear sum with all weights w_i equal to one. Then we combine the task losses with a multi-task learning approach (MTL) [56] in which the model learns the weights of our five tasks jointly within the network. In the following sections, we give the details of each loss component.
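As an illustration of the MTL weighting, the sketch below follows the homoscedastic-uncertainty style of learned task weights in the spirit of [56]; using a single learned log-variance per task and the same functional form for all five losses is a simplifying assumption.

```python
import torch
import torch.nn as nn

class MultiTaskWeighting(nn.Module):
    """Learned task weights via per-task log-variances (homoscedastic uncertainty)."""
    def __init__(self, num_tasks=5):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(num_tasks))

    def forward(self, losses):
        # losses: iterable of scalar task losses [L_cls, L_reg, L_start, L_end, L_action]
        total = 0.0
        for loss, log_var in zip(losses, self.log_vars):
            total = total + torch.exp(-log_var) * loss + log_var
        return total

weighting = MultiTaskWeighting()
task_losses = [torch.rand(1).squeeze() for _ in range(5)]
print(weighting(task_losses))     # scalar combined loss with learned weights
```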

Classification loss
The classification loss L_cls is evaluated for identifying foreground anchors and is defined as

L_cls = (1 / N_cls) Σ_i ℓ_cls(p_i, p_i*),

where p_i is the predicted score of an anchor i being foreground and the ground-truth label p_i* is 1 if the anchor is positive and 0 if the anchor is negative (see Section 3.2). The term N_cls is the number of anchors. In this study, ℓ_cls is the focal loss [57]. In the focal loss, a modulating factor added to the cross-entropy loss helps with the class imbalance problem, FL(p_t) = −(1 − p_t)^γ log(p_t), with a focusing parameter γ (please note that we do not use the α-balanced variant of the focal loss) [57].
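The focal loss term can be sketched as follows for the binary foreground/background case; the epsilon for numerical stability and the mean reduction are assumptions.

```python
import torch

def focal_loss(p, target, gamma=0.25, eps=1e-8):
    """Binary focal loss without alpha-balancing.

    p:      predicted foreground probabilities in (0, 1)
    target: ground-truth labels (1 = positive anchor, 0 = negative anchor)
    """
    p_t = torch.where(target == 1, p, 1.0 - p)        # probability of the true class
    loss = -((1.0 - p_t) ** gamma) * torch.log(p_t + eps)
    return loss.mean()

p = torch.tensor([0.9, 0.2, 0.6])
y = torch.tensor([1.0, 0.0, 1.0])
print(focal_loss(p, y))
```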

Regression loss
The regression loss L_reg is evaluated for regressing segment intervals and is defined on positive anchor instances as

L_reg = (1 / N_reg) Σ_i p_i* ℓ_reg(t_i, t_i*),

where t_i = (s_i, e_i) is the predicted boundary coordinate of a positive anchor i and t_i* is the coordinate of the ground-truth segment associated with it. The term N_reg is the number of positive anchors. The ground-truth label p_i* is 1 if the anchor is positive and 0 if the anchor is negative, so only the positive anchor instances contribute to the loss L_reg.
In this study, ℓ_reg is either the standard smooth L1 loss [11] or the KL loss [58], and we introduce a localization head as a part of our network. The smooth L1 loss [11] estimates the boundary (start-end) offsets of segments, and the localization head includes three linear layers to predict these offsets. The KL loss [58], on the other hand, returns not only the estimated boundary offsets Δs and Δe, relative to the ground-truth offsets Δs* and Δe*, but also the ambiguities of these values as confidences, namely the estimated variances σ_s and σ_e of the starting and ending offsets. To avoid exploding gradients, the network predicts α = log(σ^2) instead of σ. Given a proposal feature r_i, the concatenated output of the recurrent models, we feed it into a localization head with three linear layers to predict the start and end offsets together with their standard deviations. In this, we follow [58], but add one extra linear layer for the location and variance branches. Following [58], we initialize the weights of the linear layers for the α prediction using random Gaussian initialization.
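A sketch of such a variance-aware regression term is given below; it follows the Gaussian (non-smooth) case of the KL loss in [58], and treating the start and end offsets independently with a mean reduction is an assumption.

```python
import torch

def kl_regression_loss(offset_pred, alpha_pred, offset_gt):
    """Variance-aware offset regression in the spirit of the KL loss [58].

    offset_pred: predicted offsets (batch, 2) for start and end
    alpha_pred:  predicted log-variances alpha = log(sigma^2), shape (batch, 2)
    offset_gt:   ground-truth offsets (batch, 2)
    """
    err2 = (offset_gt - offset_pred) ** 2
    loss = 0.5 * torch.exp(-alpha_pred) * err2 + 0.5 * alpha_pred
    return loss.mean()

pred = torch.tensor([[0.1, -0.2]])
alpha = torch.tensor([[0.0, 0.5]])
gt = torch.tensor([[0.15, -0.1]])
print(kl_regression_loss(pred, alpha, gt))
```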

Unit-level losses
As a part of the unit-level proposal module, the unit-level loss L_U includes three components, L_start, L_end and L_action, for predicting {start, end, actionness} scores on the snippets of positive anchor samples. L_start, L_end and L_action are evaluated using weighted binary logistic regression as defined in [7]. The final unit-level loss function is defined as

L_U = (1 / N_U) Σ_{l=1..L} [ ℓ_wb(p^s_l, g^s_l) + ℓ_wb(p^e_l, g^e_l) + ℓ_wb(p^a_l, g^a_l) ],    (7)

where ℓ_wb is the weighted binary logistic regression loss, p^s_l, p^e_l and p^a_l are the predicted starting, ending and actionness unit scores of the snippets, the ground-truth labels g^s_l, g^e_l and g^a_l are 1 if a snippet is positive for the corresponding unit-level property and 0 otherwise, and N_U is the total number of snippets. As before, the ground-truth anchor label p_i* is 1 if the anchor is positive and 0 if the anchor is negative, which means that only the positive anchor instances are used in the computation of the loss L_U.
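The weighted binary logistic regression of [7] re-balances positive and negative snippets; the sketch below follows that idea, with the balancing weights computed per batch as inverse class frequencies, which is an assumption.

```python
import torch

def weighted_binary_logistic_loss(p, labels, eps=1e-8):
    """Weighted binary logistic regression in the spirit of [7].

    p:      per-snippet predicted scores in (0, 1), shape (N,)
    labels: binary ground-truth labels, shape (N,)
    Positive and negative terms are re-weighted by their inverse frequencies.
    """
    n = labels.numel()
    n_pos = labels.sum().clamp(min=1.0)
    w_pos = n / n_pos
    w_neg = n / (n - n_pos).clamp(min=1.0)
    loss = -(w_pos * labels * torch.log(p + eps)
             + w_neg * (1.0 - labels) * torch.log(1.0 - p + eps))
    return loss.mean()

p = torch.tensor([0.8, 0.1, 0.3, 0.9])
y = torch.tensor([1.0, 0.0, 0.0, 1.0])
print(weighted_binary_logistic_loss(p, y))
```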

Proposal and location scoring
In traditional proposal generation models, both the proposal scores and the location estimates embody uncertainties. Having jointly trained the coarse- and fine-scale network components, we combine predictions of proposal confidences and locations from the coarse-scale network M_S with predictions of unit-level snippet confidences from the fine-scale network M_U to improve the quality of the proposals.

Proposal confidence scoring
In an anchor-vicinity v with features f̂_{v,l} ∈ F̂_v at locations l ∈ {1, 2, …, L}, the starting, ending and actionness unit scores are denoted by p^s_{v,l}, p^e_{v,l} and p^a_{v,l}, respectively. A location is taken as a starting (resp. ending) location when the corresponding score reaches the threshold of 0.5, and candidate intervals [s̃, ẽ] are formed from the detected starting and ending locations. The proposal score of a candidate combines four terms: the intersection-over-union function tIoU(s̃, ẽ, ŝ, ê), which computes the tIoU between the candidate interval [s̃, ẽ] and [ŝ, ê], the transformed positions of the anchor a relative to the anchor-vicinity v; the foreground function FG(s̃, ẽ), the average of the actionness scores p^a_{v,l} of the snippets within [s̃, ẽ]; the background function BG(s̃, ẽ), the average of the non-actionness scores of the snippets outside the interval [s̃, ẽ]; and G(s̃, ẽ), a score computed using the location and variance estimates of the starting and ending positions obtained with the KL loss, where N(s̃; μ_s, σ_s) and N(ẽ; μ_e, σ_e) are Gaussian probability density functions and μ_s and μ_e equal the estimated start and end positions s_i and e_i of the i-th proposal. With this scoring function (Eq. (8)), the actionness scores of the snippets generate multiple proposal candidates over a proposal, and we return the single candidate with the maximum score, p′(s_i, e_i), as the unit-level proposal candidate.
Our two modules, the segment-level proposal module M_S and the unit-level proposal module M_U, output two scores, p(s_i, e_i) and p′(s_i, e_i). The actionness score from M_S and the proposal prediction score from M_U are combined to compute the final confidence score for the proposal as

p*(s_i, e_i) = λ p(s_i, e_i) + (1 − λ) p′(s_i, e_i),    (9)

where λ is a weighting factor in the range [0, 1]. Simple multiplication can also be used, but we empirically observe that weighted averaging produces better results.

Location scoring
Similarly to the evaluation of the proposal confidences, M_S and M_U generate two coordinate estimates, (s_S, e_S) and (s_U, e_U). To compute (s_S, e_S), the intervals of the predicted proposals are updated with a location regressor. As proposed in [11] for object bounding boxes, we regress the boundaries of the detected proposals, which in our case are the starting and ending positions relative to the video. The offsets between an anchor and its ground-truth segment are parameterized as Δs = (s − s*)/(e − s) and Δe = (e − e*)/(e − s), where s and s* (resp. e and e*) are the starting (resp. ending) positions of an anchor a and the ground-truth segment, respectively, and the refined locations are obtained by applying the predicted offsets back to the anchor boundaries.
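The offset parameterization and its inverse can be sketched as follows; applying the inverse transform to the predicted offsets to obtain refined boundaries is the standard usage and an assumption here.

```python
def encode_offsets(anchor, gt):
    """Normalized start/end offsets of a ground-truth segment w.r.t. an anchor."""
    s, e = anchor
    s_gt, e_gt = gt
    length = e - s
    return (s - s_gt) / length, (e - e_gt) / length

def decode_offsets(anchor, offsets):
    """Apply predicted offsets back to the anchor to obtain refined boundaries."""
    s, e = anchor
    d_s, d_e = offsets
    length = e - s
    return s - d_s * length, e - d_e * length

offs = encode_offsets((40.0, 50.0), (38.0, 52.0))
print(offs)                                   # (0.2, -0.2)
print(decode_offsets((40.0, 50.0), offs))     # (38.0, 52.0)
```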
To compute (s_U, e_U), we retrieve the locations with the maximum proposal score from Eq. (8), and these locations, given with respect to the anchor-vicinity v, are transferred to the video coordinates. The converted locations correspond to an interval [s_U, e_U] in the video coordinates.
Soft-NMS. After generating proposal candidates, we prune redundant proposals to achieve higher recall rates. The non-maximum suppression (NMS) algorithm is a standard technique, but it has been shown that soft-NMS [7,59] achieves better recall. In our work, we apply soft-NMS to refine the proposals.
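A minimal soft-NMS sketch for 1D temporal proposals is given below, using a Gaussian decay of overlapping scores; the decay function, its sigma and the stopping threshold are assumptions, since only the use of soft-NMS is specified.

```python
import numpy as np

def temporal_soft_nms(proposals, scores, sigma=0.5, score_thr=0.001):
    """Soft-NMS for 1D proposals given as (start, end) pairs; returns kept indices."""
    proposals = np.asarray(proposals, dtype=float)
    scores = np.asarray(scores, dtype=float).copy()
    order, keep = list(range(len(scores))), []
    while order:
        i = max(order, key=lambda k: scores[k])
        order.remove(i)
        if scores[i] < score_thr:
            break
        keep.append(i)
        for j in order:
            inter = max(0.0, min(proposals[i][1], proposals[j][1])
                        - max(proposals[i][0], proposals[j][0]))
            union = (proposals[i][1] - proposals[i][0]) + (proposals[j][1] - proposals[j][0]) - inter
            iou = inter / union if union > 0 else 0.0
            scores[j] *= np.exp(-(iou ** 2) / sigma)   # Gaussian decay of overlapping scores
    return keep

print(temporal_soft_nms([(10, 30), (12, 32), (50, 70)], [0.9, 0.8, 0.7]))  # -> [0, 2, 1]
```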

Experiments and results
We demonstrate the performance of our method for accurate action proposal generation on two datasets: ActivityNet-1.3 [14] and THUMOS-14 [13].

Dataset details
ActivityNet-1.3 [14]. The dataset consists of 19,994 long-term untrimmed video sequences in 200 action categories. The dataset is split into training, validation and testing subsets with 10,024, 4926 and 5044 video samples, respectively. Each video sequence contains one or more actions with annotated segment intervals. We train our model on the training set and evaluate it on the validation set.
THUMOS-14 [13]. The dataset includes 1010 and 1574 videos of 20 action categories in the validation and test splits, respectively. Among these videos, 200 validation videos and 212 test videos have temporal annotations of actions. Following previous studies [6,7], we conduct our training on the validation set and evaluate on the test set.

Experimental setups
Visual Encoding. We use publicly available pretrained models for the visual encoding of the dataset videos. Videos of ActivityNet-1.3 are encoded using frame-level 2048-dimensional ResNet152 features [49] pretrained on the ImageNet dataset [60] and video-level 8192-dimensional ResNeXt101 features [50] pretrained on the Kinetics dataset [17]. We scale the feature length of all videos to T and set T = 100 in our experiments. For ResNet152, frame-level features are extracted from the video frames. For the ResNeXt101 features, a temporal window of 16 frames is used and features are extracted every 16 frames. The features are extended to T snippets with zero-padding if the number of sampled features is less than T; otherwise, the features are downsampled using max-pooling to fit the video length T. After this preprocessing, each untrimmed video sequence is represented as a sequence of T concatenated ResNet and ResNeXt features.
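The rescaling of a variable-length feature sequence to T = 100 snippets by zero-padding or max-pooling can be sketched as follows; the evenly spaced pooling bins are an assumption.

```python
import numpy as np

def rescale_features(feats, target_len=100):
    """Pad or max-pool a (num_steps, dim) feature sequence to target_len snippets."""
    num_steps, dim = feats.shape
    if num_steps <= target_len:                       # zero-pad short sequences
        out = np.zeros((target_len, dim), dtype=feats.dtype)
        out[:num_steps] = feats
        return out
    # max-pool longer sequences over evenly spaced bins
    edges = np.linspace(0, num_steps, target_len + 1).astype(int)
    return np.stack([feats[edges[i]:max(edges[i] + 1, edges[i + 1])].max(axis=0)
                     for i in range(target_len)])

print(rescale_features(np.random.rand(37, 10240)).shape)    # (100, 10240)
print(rescale_features(np.random.rand(412, 10240)).shape)   # (100, 10240)
```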
Videos of THUMOS-14 are encoded using two-stream TSN-based features [61] pretrained on Kinetics [17]. Since THUMOS-14 videos are longer than ActivityNet-1.3 videos, we apply a sliding-window strategy, setting the observation window length T to 256 with a stride of 128.
Training Settings and Evaluation Metrics. We train and evaluate the proposed architecture for temporal proposal generation. On ActivityNet, a learning rate of 1e-3 is used with a weight decay of 1e-9 (a learning rate of 1e-4 is used when the model is trained with the KL loss); on THUMOS-14, a learning rate of 1e-4 is used with a weight decay of 1e-5. We use the Adam optimizer during training. The goal of the temporal proposal generation task is to extract high-quality proposals with high recall and high temporal overlap [38]. Following previous works [6-8], Average Recall (AR) is used as the first evaluation metric, and recall values are evaluated under tIoU thresholds in the range [0.5, 0.95] for ActivityNet and [0.5, 1.0] for THUMOS-14, with a step of 0.05. As the second metric, the Area Under the AR-AN Curve (AUC) is calculated from AR values at various Average Numbers of Proposals (AN), denoted AR@AN, where AN varies from 0 to 100 (we also evaluate from 0 to 1000 on THUMOS-14). We evaluate the combined proposal scores according to Eq. (9) over λ values in the range [0, 1] with a step size of 0.1 and report the best result for each model.

Experimental results
We first introduce several model variants of our system. Then, in the ablation study, we investigate these models and discuss the results.
LSTM-variant. This model trains the unit-level proposal module on top of the proposal-vicinity bidirectional LSTM (see Fig. 1, where this model corresponds to the path through the vicinity LSTM).
Backbone-variant. This model trains the unit-level proposal module directly on top of the backbone architecture (see Fig. 1, where this model corresponds to the path bypassing the vicinity LSTM).
We experiment with the LSTM-variant and the backbone-variant in both the shared and unshared backbone settings (see Section 3.3). For the unshared case, we also introduce the LSTM-variant-2 and the backbone-variant-2, which are similar to the LSTM-variant and the backbone-variant, respectively, except for the first backbone module. In particular, we modify the backbone just before the proposal bidirectional LSTM and simplify it to a single convolutional layer instead of an inception module.

Experimental results on ActivityNet dataset
Experiments on the ActivityNet dataset are first conducted under different settings of the anchor dimension K, the vicinity dimension L and the focusing parameter γ of the focal loss, with shared and unshared backbones. In these experiments, the models are trained using the smooth L1 loss [11] for regressing the segment intervals; we later report results with the KL loss. In Table 1, performances are reported in three parts: unit-level, segment-level and combined scores (see Section 3.7.1).
At the unit level, proposal scores are computed using p′(s_i, e_i) (λ = 0) in Eq. (9). At the segment level, proposal scores are evaluated using p(s_i, e_i) (λ = 1) in the same equation. For all experiments reported in Table 1, the combined scores are higher than both the unit-level and the segment-level scores. This shows that the proposal framework with the unit-level pipeline improves the AR@100 and AUC rates for the proposal task.
Investigating the network parameters of the shared models, we see that the backbone-variant outperforms the LSTM-variant. The main difference between these two models is in the implementation of the unit-level features: in the backbone-variant, the unit-level features are trained on the backbone outputs rather than the LSTM outputs. Although the segment-level performances are close to each other, there is a significant improvement in the unit-level performance of the backbone-variant over the LSTM-variant.
Using a shared backbone, where the same backbone is used for both the proposal and the proposal-vicinity paths, we conduct experiments with various values for the dimensions K and L. The results show that increasing the scale of the proposal-vicinity improves performance. Increasing L by a factor of 3 yields higher AR@100 scores than increasing it by a factor of 2, since the larger scale provides finer features over the vicinity. For instance, a model with K = 14 and L = 42 (14 × 3) performs better in the combined scores, over various γ values, than the corresponding model with K = 14 and L = 28 (14 × 2). We also test various γ values and observe that the focal loss only slightly affects performance in our implementation.
When the backbone-variant (with K = 14 and L = 14 × 3) is trained with an unshared backbone instead of a shared one, we achieve further improvements. While the highest unit-level score of the shared model is 72.44 AR@100, the unshared model reaches 73.26 AR@100. In the combined proposal scores, the best shared model obtains 73.58 AR@100 and the best unshared model 73.95 AR@100. In the unshared structure, two separate backbones feed the two bidirectional LSTMs. If we additionally simplify the backbone of the first bidirectional LSTM (designed for the proposal region), as in the backbone-variant-2, we obtain a further improvement at γ = 0.25, where the combined proposal score is 74.06 AR@100 with 65.21 AUC. Since γ = 0.25 gives the highest performance, we pick 0.25 as the focusing parameter.
We also conduct an experiment with a different visual encoding based on I3D [22] and ResNet152 features. With the same experimental setting, the score of the backbone-variant-2 with this encoding, 73.02 AR@100, is lower than that of the backbone-variant-2 with the ResNeXt101 + ResNet152 encoding.
Overall, the backbone-variant, which bypasses the LSTM and uses the unshared backbone, outperforms the LSTM-variant. Simplifying the backbone also affects the performance: the backbone-variant-2 returns slightly higher results than the backbone-variant when the focusing parameter is 0.25.
We then train the backbone-variant-2 in its best setting (K = 14, L = 14 × 3 and γ = 0.25) using (i) the multi-task learning approach (MTL) [56] and (ii) the KL regression loss as the regression component together with MTL (see Section 3.6). The results are reported in the bottom rows of Table 1. The model with MTL training yields 73.85 AR@100, and the model with KL and MTL yields 73.19 AR@100 (here the scoring value is computed with the KL component, i.e., the G term, in Eq. (8)). These results are comparable to each other, but we do not observe any improvement over the smooth L1 loss with the naive combination of task weights on the ActivityNet dataset. Performing the same experiment with a single multi-head-attention unit instead of soft-attention slightly outperforms the soft-attention based models, with a best AUC of 65.33. However, stacking multiple attention blocks brings no benefit: with a stack of two multi-head-attention units the result drops to 64.70 AUC. Fig. 5 shows the average recall vs. average number of proposals plot for various tIoU values for the best model.
Table 2 compares our model with state-of-the-art temporal proposal models. We achieve better results than TCN [12], MSRA [62], Prop-SSAD [63], TURN [44], CTAP [6] and RTD-Net [47]. Our results are comparable with BSN [7] (−0.10 AR@100), MGG [9] (−0.48 AR@100), BMN [8] (−0.95 AR@100), DBG [10] (−2.59 AR@100) and BC-GNN [46] (−2.67 AR@100), with a slight drop in AR@100. In Fig. 6, we provide a visual illustration of sample results on the ActivityNet dataset. As Table 1 clearly shows, the combination of the segment-level and unit-level modules improves our overall performance. However, the contribution of the unit-level module to the combined score is small, with a weight of 0.2 for the best-performing model (given in Table 1). On the other hand, the unit-level module contributes more than the segment-level module, with a weight of 0.8, for the best model on the THUMOS-14 dataset (given in Table 3). This shows that the proposal scoring function of the unit-level module has a low impact on the ActivityNet dataset. One reason could be that all videos of the ActivityNet dataset are scaled to T = 100 (there is no observation window sliding over the video), which may cause our model to miss some boundaries on ActivityNet with a drop in performance. Another reason could be the sparsity of the annotations on the ActivityNet dataset. We observe that almost 75% of the videos include one segment, only 12% of them include more than two segments (THUMOS-14 includes denser annotations with more segments), and almost 25% of all samples include a segment with more than 90% overlap with the whole video. This may cause more snippets with less discriminative features to be labeled as positive boundary samples, reducing boundary detection performance.
The main limitation of our model is its inference speed, which is expected for anchor-based models. With our current implementation, the network inference time for a sample 3-minute video is 0.488 s using the backbone-variant-2 with K = 14 and L = 42 (14 × 3); the inference speed is measured on an Nvidia 2060S graphics card. Profiling the implementation, we observe that the backbone module of the anchor stream takes 0.054 s while the backbone module of the anchor-vicinity stream takes 0.325 s (the backbone modules are unshared). This shows that our model suffers from the backbone computation, which includes a set of convolution layers, and from the dimension of the anchor-vicinities, L = 42 (14 × 3). To accelerate the inference of our network, the anchor-vicinity backbone can be simplified to a single convolution layer (note that we already simplified the anchor backbone); we measure that simplifying the anchor-vicinity backbone reduces the inference time by 0.160 s. Additionally, the anchor-vicinity backbone can be trained with a reduced L dimension at the cost of a slight drop in AUC (our model's performance drops slightly when K = 14 and L = 28 (14 × 2), as given in Table 1). Another solution, left as future work, is to remove the backbone module with its convolution layers from the two-stream RPN module: the convolutional backbone would first be applied to the whole video sequence, and the RPN with the recurrent attention modules would then operate on the proposals.

Fig. 6. Sample results of our framework on the ActivityNet-1.3 dataset. We show ground-truth proposals and our proposal candidates generated using the unit-level, segment-level and combined approaches, respectively. The combined approach tends either to choose, among the proposals of the unit- and segment-level modules, the one with the higher overlap with the ground truth, or to return a better proposal than either.

Experimental results on THUMOS-14 dataset
We conduct a similar set of ablation experiments on the THUMOS-14 dataset, reported in Table 3. We use the unshared backbone with various values of the focusing parameter and fixed anchor and vicinity dimensions chosen according to their performance in the ActivityNet experiments (K = 14 and L = 14 × 3). We use two-stream features [64] for the encoding of the videos.
Among the models using the smooth L1 loss [11] for regressing the segment intervals, the models whose unit-level modules are built on top of the LSTM, namely the LSTM-variant and the LSTM-variant-2, achieve significantly lower unit-level AR@100 scores despite having better segment-level AR@100 scores than the backbone-variant and the backbone-variant-2. Overall, the backbone-variant-2 is the best model, with slightly better AR@100 scores than the backbone-variant. We also test various values of the focusing parameter γ; its effect can be observed for the backbone-variant-2 in Table 3. γ = 0.0 corresponds to the pure cross-entropy loss, and the AR@100 results show that the focal loss slightly improves performance over the cross-entropy loss.

Table 3
Proposal generation results on the THUMOS-14 dataset. Results are computed over the top 100 proposals after soft-NMS pruning. For the KL + MTL model, the score of 47.49 AR@100, and for its † counterpart, the score of 50.00 AR@100, are achieved at λ = 0.8 with a soft-NMS lower threshold of 0.55. The best score of 50.34 AR@100 is achieved by the KL + MTL † ** model at λ = 0.9 with a soft-NMS lower threshold of 0.65. The symbol † denotes a model trained with observation window T = 128 (for all other experiments T = 256); * denotes a model with a multi-head-attention unit and ** a model with a stack of two multi-head-attention units. (Best scores are in bold and second-best results are underlined.)

We also conduct experiments training the backbone-variant-2 using (i) the multi-task learning approach (MTL) and (ii) the KL regression loss together with MTL. We achieve 47.01 AR@100 with MTL, which is higher than the best result of 46.82 AR@100 with naive training. Replacing the smooth L1 loss with the KL loss first decreases the performance: the model with KL and MTL achieves 46.89 AR@100 when the scoring value is computed without the KL component in Eq. (8). However, if we use the variance predictions of the KL loss during evaluation, the same model achieves 47.49 AR@100 (the scoring value is then computed with the KL component in Eq. (8)). In all these experiments, the observation window length T is set to 256 with a stride of 128 (see visual encoding in Section 4.2). Following other works [8,10], we also train our model with T = 128 and a stride of 64; this model achieves 50.00 AR@100, showing that the value of T has a significant impact on THUMOS-14 performance, as our implementation is proposal-based with a fixed anchor set. Performing the same experiment with multi-head-attention units, the models with a single multi-head-attention unit and with a stack of two multi-head-attention units outperform the soft-attention based models, with best scores of 50.16 and 50.34 AR@100, respectively.

In addition to the reported results, we conduct a set of additional experiments in which we measure the effects of the RNN and the attention on our backbone-variant-2 RPN model without the unit-level component. Removing the unit-level proposal module M_U, we compute the proposal scores using p(s_i, e_i) and the locations using (s_S, e_S) in the traditional RPN setting. In Table 4, we first report the proposal score when the segment-level proposal module is evaluated using r_i = a_i ⊕ â_i (without the RNN), then using r_i = h_{i,1} ⊕ h_{i,K} (without attention), and finally using the full representation r_i = h_{i,1} ⊕ h_{i,K} ⊕ a_i ⊕ â_i (see Section 3.4).
The standalone attention mechanism has the lowest performance, with 32.98 AR@100, but attention over the outputs of the RNN significantly improves the AR@100, up to 43.05. We further apply the MTL approach to the segment-level proposal module and observe no improvement. However, the score improves up to 44.96 if the observation window T is set to 128, and up to 45.56 if the multi-head-attention module is used instead of the soft-attention module. Overall, our model trained with the unit-level proposal module achieves better segment-level and combined performances. This shows that our model, with its joint coarse-to-fine scale modules and the new scoring and localization functions used to fuse their results, outperforms the traditional proposal setting. Table 5 presents a comparison with a few state-of-the-art models on the THUMOS-14 dataset.

Table 5
Comparison of our results on THUMOS-14 with the state of the art. Evaluations are performed over 1000 proposals, following the other methods. Note that over the top 100 proposals we obtain 50.00 AR@100 with the KL + MTL † model and our best performance of 50.34 AR@100 with the KL + MTL † ** model. The symbol † denotes a model trained with observation window T = 128 (for all other experiments T = 256); * denotes a model with a multi-head-attention unit and ** a model with a stack of two multi-head-attention units. (Best scores are in bold.)

Method
AR@50 AR@100 AR@200 AR@500 AR@1000
To be consistent with the state-of-the-art evaluation, we evaluate the best models over 1000 proposals (the previous experiments used 100 proposals, similar to the ActivityNet evaluations). Over the top 1000 proposals, the backbone-variant-2 achieves 47.43 AR@100, its MTL-trained counterpart 47.61 AR@100, and the KL + MTL model 48.12 AR@100. Outperforming the other methods, the KL + MTL † and KL + MTL † ** models achieve 50.49 and 51.23 AR@100 over the top 1000 proposals, respectively. Note that further evaluations with various experimental settings could yield better results. In Fig. 7, we provide a visual illustration of sample results on the THUMOS-14 dataset.

Conclusions
The quality of proposals in a proposal generation task is determined by two factors: the overlap ratio with the ground truth and the precision of the boundary localization. In this paper, our proposal-based action localization framework investigates the proposal region at a coarser scale and the unit-level snippets at a finer scale. With a new proposal scoring function and a technique for computing temporal boundary estimates, we combine the two components of our proposal-based model. The results of our extensive experiments show that investigating the proposal regions at multiple scales is important and that the combination improves localization in the proposal generation task. Moreover, the attention modules applied on top of the recurrent structures perform well as features of proposal regions for identifying foreground regions.
As a future direction, we aim to improve the inference time of the model, since its speed suffers from the two-stream architecture and the number of anchors. We believe anchor-free architectures can help in this direction.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Fig. 1 .
Fig. 1. Our network architecture for generating high-quality temporal action proposals. The network input is a sequence of visually encoded video snippets from a long-term video. The network outputs temporal proposals, where each proposal is scored using confidences of (i) the proposal region and its proposal-vicinity (neighborhood) and (ii) unit-level regions within the proposal-vicinity. The proposal region and its associated vicinity pass through two recurrent models, allowing coarse-to-fine scale actionness evaluation.

Fig. 2 .
Fig. 2. The details of the Backbone Module, designed as a small inception module. There can be two backbone modules, one for the proposal and the other for the proposal-vicinity, or a single backbone can be shared between the two.
Using the starting and ending positions of a ground-truth segment g and the positions of a positive vicinity v, we first transfer the positions (s*, e*) of the ground-truth segment, which are relative to the video, into vicinity-localized positions (ŝ*, ê*) (that is, the ground-truth positions are converted using the starting position of the vicinity v and the vicinity length L, if the ground-truth segment g and the vicinity v overlap). The starting and ending intervals are defined as [ŝ* − 1, ŝ* + 1] and [ê* − 1, ê* + 1], respectively, while the actionness interval is defined as [ŝ*, ê*]. These intervals are used to set the ground-truth labels of the unit-level features of positive anchors.

Fig. 3 .
Fig. 3. The details of the Attention Module encoding the temporal context of anchors based on (a) soft-attention or (b) multi-head-attention.

Fig. 4 .
Fig. 4. The details of the Inception Module designed as part of our Unit-Level Module.

Fig. 5 .
Fig. 5. Average recall vs. average number of proposals for the best model (the backbone-variant-2 with MTL and a multi-head-attention unit), with an AUC value of 65.33.

Fig. 7 .
Fig. 7. Sample results of our framework on the THUMOS-14 dataset. We show ground-truth proposals and our proposal candidates generated using the unit-level, segment-level and combined approaches, respectively. The combined approach tends either to choose, among the proposals of the unit- and segment-level modules, the one with the higher overlap with the ground truth, or to return a better proposal than either.

Table 2
Comparison of our result on ActivityNet-1.3 with the state-of-the-art results (AR@100 and AUC; best scores are in bold).

Method            AR@100    AUC
TCN [12]          -         63.12
Prop-SSAD [63]    73.01     64.40
CTAP [6]          73.17     65.72
BSN [7]           74.16     66.17
MGG [9]           74.54     66.43
BMN [8]           75.01     67.10
DBG [10]          76.65     68.23
BC-GNN [46]       76.73     68.05
RTD-Net [47]      73.21     65.78
Ours              74.06     65.33

Table 4
Comparison of baseline models with our combined proposal frameworks (anchor dimension K = 14, vicinity dimension L = 14 × 3, focal parameter γ = 0.25). Evaluations are performed over 100 proposals, similar to the ActivityNet evaluations. The symbol † denotes a model trained with observation window T = 128 (for all other experiments T = 256); * denotes a model trained with a multi-head-attention unit. (Best scores are in bold and second-best results are underlined.)