From Smart Camera to SmartHub: Embracing Cloud for Video Surveillance

Smart cameras were conceived to provide scalable solutions to automatic video analysis applications, such as surveillance and monitoring. Since then, many algorithms and system architectures have been proposed, which use smart cameras to distribute functionality and save bandwidth. Still, smart cameras are rarely used in commercial systems and real installations. In this paper, we investigate the reason behind the scarce commercial usage of smart cameras. We found that, in order to achieve scalability, smart cameras put additional constraints on the quality of input data to the vision algorithms, making them an unfavourable choice for future multicamera systems. We recognized that these constraints can be relaxed by following a cloud based hub architecture and propose a cloud entity, SmartHub, which provides a scalable solution with reduced constraints on the quality. A framework is proposed for designing a SmartHub system for a given camera placement. Experiments show the efficacy of SmartHub based systems in multicamera scenarios.


Introduction
The basic purpose of smart cameras is to cope with the ever increasing resource demands of processing and managing gigantic video data. Researchers argue that efficient and scalable solutions can be achieved by pushing the processing to the edge of the system so that most of the processing takes place at the sensor itself [1][2][3][4][5][6]. As a result, we witness a number of workshops and conferences on distributed smart cameras [7,8]. Smart cameras perform video analysis tasks at the sensor itself and send only an abstract description of the scene for further processing and viewing. It has been articulated that smart cameras are the key elements of the ongoing paradigm shift from central to distributed surveillance systems [9][10][11].
Despite great progress in terms of research, smart cameras have not seen enough success in commercial systems and real installations. In this paper, we investigate the reasons behind the restricted use of smart cameras through a comparative assessment. It is found that smart cameras are effective only for sparse camera networks where multisensor information fusion is minimal, such as highway traffic monitoring systems, but they are not suitable for applications requiring the assimilation of data from multiple cameras.
With the decreasing cost of video sensors, however, more applications are using densely placed cameras with overlapping views. For instance, in the surveillance context, multiple cameras with overlapping views are used to seamlessly track targets in the presence of occlusions. Typically, redundant sensors are used to achieve higher accuracy and robustness of detection tasks [11,12], and information from multiple cameras is fused for the same purpose [13][14][15]. This type of information fusion is not possible in smart camera systems as the videos are processed in isolation and only abstract data is available at the fusion node. In this way, smart cameras are not the best choice for synergistic integration of current and future research in interdisciplinary areas of multicamera applications. There are multiple limitations of smart cameras that hinder their general usage, such as the following.
(i) Multisensor coordination and information fusion is usually inefficient as video from each camera is processed in isolation.
(ii) Only metadata and compressed video data are available for multicamera coordination. Therefore, detection and recognition tasks involving multiple cameras perform poorly.
(iii) The cost of smart cameras is too high in comparison to basic IP cameras, without equivalent benefit in performance.
(iv) Algorithms designed for smart cameras are custom designed and hence are very difficult to upgrade.

International Journal of Distributed Sensor Networks

Table 1: Video analysis applications.
| Application | Goal | Tasks |
| --- | --- | --- |
| Surveillance and monitoring [11] | Anomaly detection, alarm generation | Object/face detection, recognition, and tracking |
| Ambient intelligence [44] | Supporting persons in daily life, living assistance | Posture recognition, face detection, and tracking |
| Smarthome [45] | Healthcare, eldercare | Activity detection, gait recognition |
| Teleconferencing [46] | Enhanced communication experience | Face detection and tracking |
| Traffic surveillance [47] | Monitoring and analysis of traffic flow | Segmentation, motion detection, and object classification |
| Crowd management [48] | Avoid congestion in narrow areas | Tracking, flow analysis |
| Pervasive computing [49] | Living assistance, healthcare | Activity and behaviour detection |

In this work we utilize cloud computing on a local area network (private cloud) as an alternative solution to scalability. We propose SmartHub as a logical entity which processes data from cameras that likely require information fusion. A number of SmartHub instances run on the cloud to process video from the cameras. SmartHub not only overcomes the limitations of smart cameras but also provides a scalable, distributed solution. Because the video streams from the cameras needing information fusion are processed at one node, SmartHub enables efficient multicamera information fusion. We study the trade-off between the scalability (in terms of the degree of processing distribution) and coordination (in terms of communication overhead) and propose SmartHub as an alternative to smart cameras.
With the increasing number of cameras, sending high quality video to the cloud may cause a bandwidth bottleneck. We propose subnet dependent geographical distribution of processing nodes and SmartHubs to avoid the bandwidth bottleneck. With the proposed distribution of processing nodes, high bandwidth data will remain within the subnet and only abstract data will flow across subnets.
The main contributions of this work are as follows.
(i) We provide a comparative assessment of smart cameras with the conclusion that smart cameras are an inefficient choice for growing multicamera applications.
(ii) We propose the cloud entity SmartHub that overcomes the limitations of smart cameras and propose a framework to make design decisions.
The rest of the paper is organized as follows. In Section 2 we review video analysis based applications and smart cameras. We discuss the limitations of smart camera based systems in Section 3. Section 4 describes SmartHub based system design and Section 5 describes the framework to make design decisions. We provide our conclusions in Section 6.

Context Description and Definitions
In this section we first describe potential applications where smart cameras can be used and derive a representative system architecture for these systems. Then we discuss existing work on smart cameras and how they are employed in video analysis systems.

Video Analysis Applications.
A camera captures a snapshot of the scene in its view, in the same way that a human eye observes a scene. The video captured by the camera is analysed to understand the semantics of the scene. Automatic video analysis is used to assist/automate decision-making in a number of application scenarios, a few of which are listed in Table 1.
The majority of video analysis applications are related to surveillance and monitoring. Another set of applications is concerned with healthcare and elderly care. Examining all these applications, we make the following observations.
(i) The most common tasks are foreground detection, object/face detection, tracking, and activity/behaviour analysis.
(ii) The tasks do not always follow a pipelined structure; that is, we generally need original video even for high level tasks such as tracking and activity detection.
(iii) There is always a central unit that consolidates the analysis results and derives higher level semantics.
Based on these observations, a functional view of a typical video analysis system is drawn in Figure 1. In some cases, tracking is directly performed on the video. Object detection is still required to initialize the trackers. While intermediate functions (first three blocks of Figure 1) can be delegated to various distributed computing devices, the final aggregation and presentation generally take place at one (or more than one) central unit.

Smart Cameras.
Earlier cameras used dedicated coaxial cables to transmit recorded video. Today, such cameras are almost completely replaced by IP cameras [16]. A basic IP camera captures images, compresses them, and streams the compressed video on the network [17]. In this paper, we will refer to these cameras as normal cameras.
A smart camera, on the other hand, integrates resource intensive advanced image and video processing techniques with compression and streaming. As shown in Figure 2, a smart camera consists of three main blocks: sensing, processing, and communication [4]. One of the initial works on smart cameras was by Moorhead and Binnie [18], who integrated edge detection in the camera. In [3], a smart camera extracts the high-level semantics of the scene it is capturing and sends them to the central unit. In this way, the smart cameras are mainly used to delegate detection and recognition tasks to the embedded platforms. Table 2 provides a list of smart camera works and the processing task implemented on the camera. To reduce the communication overhead, smart cameras analyse the image data and only transmit the abstract information [4,19].
Processing video locally at the source camera reduces the communication load by avoiding the transmission of high quality images; only concise descriptions of extracted features are communicated for multicamera collaboration [9][20][21][22][23]. Hence, smart cameras stream low-bandwidth processed information to save bandwidth [24][25][26]. Yet, for optimal performance, computer vision algorithms require heavy computing resources. There have been attempts to develop lightweight methods for smart cameras [27][28][29][30] to reduce resource needs. However, these ad hoc methods compromise the overall quality and do not extend easily. If the original algorithm is improved, the customized version may or may not be able to incorporate the improvement.
Based on the discussion above, smart cameras provide processing and bandwidth scalability only when (i) the processing tasks are not repeated; that is, if the foreground detection is done at the camera, it should not be repeated at the central unit; (ii) the data communicated from a smart camera is much less than the original data captured.
In the following section we show the effects of these constraints on the research in other interdisciplinary areas of video analysis systems. Subsequently, we propose a cloud entity, SmartHub, and demonstrate how it provides both scalability and synergistic integration with other research areas.

Limitations of Smart Cameras
The most important limitation of smart cameras is the limited opportunity for information fusion. In the process of video analysis, information fusion can take place at the following three levels [31].
(i) Data Level. In this type of fusion, pixel values are directly compared to come to a conclusion; therefore it requires image data for fusion.
(ii) Feature Level. In a more popular approach, features are extracted from the image and compared for detection. If features are not heavily compressed, they require significant bandwidth for transmission.
(iii) Decision Level. This type of fusion is the most economical in terms of bandwidth overhead. The detection task is performed for individual cameras, and only final decisions from each video are fused together.
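To make the bandwidth difference between the three fusion levels concrete, the following minimal sketch estimates how many bytes one camera would share per frame at each level and shows decision-level fusion as a simple majority vote. The function names, the frame size, the feature counts, and the bytes per decision are all hypothetical values chosen for illustration; none of them come from the paper.

```python
# Illustrative sketch: every size below is an assumed value, not a figure from the paper.

def data_level_bytes(width, height, channels=3):
    """Bytes of raw pixel data one camera shares per frame for data-level fusion."""
    return width * height * channels


def feature_level_bytes(num_features, dims, bytes_per_value=4):
    """Bytes of uncompressed feature descriptors shared per frame for feature-level fusion."""
    return num_features * dims * bytes_per_value


def decision_level_bytes(num_detections, bytes_per_decision=16):
    """Bytes of final decisions (e.g., a label and a bounding box) shared per frame."""
    return num_detections * bytes_per_decision


def fuse_decisions(votes):
    """Decision-level fusion as a strict majority vote over per-camera boolean decisions."""
    return 2 * sum(votes) > len(votes)


if __name__ == "__main__":
    print("data level:    ", data_level_bytes(640, 480), "bytes")
    print("feature level: ", feature_level_bytes(500, 128), "bytes")
    print("decision level:", decision_level_bytes(5), "bytes")
    print("fused decision:", fuse_decisions([True, True, False, True]))
```

Under these assumed sizes the three levels differ by orders of magnitude, which is why a system that can only afford decision-level traffic gives up the information that data and feature level fusion rely on.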
The smart camera systems only allow fusion at the decision level. In current multicamera systems, however, the cameras are generally densely placed with overlapping views, which requires data and feature level fusion [32][33][34][35][36][37]. To enable feature level fusion in smart camera systems, feature compression techniques have been proposed [25,26]. However, feature compression is an ad hoc process and compromises the overall accuracy of the analysis task. We argue that the features are already compressed and further compression is undesirable for future analysis techniques.
To assess the smart camera systems with overlapping views, we consider a scenario in which 4 cameras with overlapping views are tracking a person at a subway station. If the cameras do not communicate with each other to save bandwidth, one object is being tracked by 4 cameras, which is a redundancy rate of 75%. There have been research works in which only one master camera with the best view tracks the object [19,21,38]. In this case we have 4 hardware units capable of tracking but only one unit is being used at a time. The other 3 units are underutilized, which increases the overall cost per object of the system.
In Figure 3, we show the processing times of the steps of a typical video analysis system. Four videos with overlapping views from a multicamera video dataset [39] are used for this evaluation. While the foreground detection only depends on the frame resolution, the processing times of detection and tracking are proportional to the computational load of each step. It is evident from the figure that tracking is a computationally intensive task. It would require expensive hardware to track objects using state-of-the-art tracking methods, such as particle filters [40].

In order to decide the best view, smart cameras need to share foreground information with each other. Figure 4 shows the fraction of image area that belongs to the foreground for real surveillance footage of 24 hours. We can see that the foreground area may vary from nothing to 63%. Sharing such a large amount of foreground information would use a great deal of bandwidth. Furthermore, with the overlapping views, the best view can change between consecutive frames. This would require frequent changes in the role of the master camera. Frequently changing the master camera would cause additional bandwidth and processing overhead.

Table 2: Brief summary of smart camera works.
| Work | Tasks |
| --- | --- |
| Moorhead and Binnie [18] | Edge detection |
| Wolf et al. [3] | Human gesture recognition, region extraction, contour detection, and template matching |
| Lin et al. [23] | Gesture recognition |
| Muehlmann et al. [50] | Real-time tracking |
| Heyrman et al. [51] | Motion detection |
| Bramberger et al. [9] | Traffic surveillance, multicamera object tracking |
| Chen and Aghajan [19] | Gesture recognition using smart camera network |
| Quaritsch et al. [21] | Multicamera tracking, camShift |
| Rinner and Wolf [4] | Scene abstraction |
| Aghajan et al. [20] | Human pose estimation |
| Sankaranarayanan et al. [24] | Object detection, recognition, and tracking |
| Tessens et al. [30] | Foreground detection, subsampling |
| Wang et al. [22] | Tracking, event detection, and foreground detection |
| Casares and Velipasalar [28] | Foreground detection, tracking feedback |
| Sidla et al. [52] | Traffic monitoring |
| Pletzer et al. [53] | Traffic monitoring, vehicle speed, and vehicle count |
| Wang et al. [29] | Foreground detection, contour tracking |
| Cuevas and Garcia [54] | Single camera tracking, background modelling |

SmartHub System
We propose to use normal IP cameras to capture video and delegate all video analysis tasks to the private cloud. In our discussion, the private cloud is defined as the distributed computing nodes on the same Local Area Network (LAN) to which the cameras are connected. The block diagram of the proposed SmartHub system is shown in Figure 5. The cameras only perform the basic tasks of video compression and streaming. Video analysis tasks of detection, recognition, and tracking are performed by the SmartHub cloud entity. Note that a normal IP camera with basic encoding and streaming capabilities is approximately 10 times cheaper than a smart camera capable of detecting activities.
SmartHub fuses visuals from multiple cameras and provides a set of services to the central unit such as object detection, face detection, and tracking. The central unit can query SmartHub to receive continuous information (video streams) or event information in terms of detected objects. Because the information fusion takes place at SmartHub, it does not need to send video from all cameras to the central unit but only the most informative view. Furthermore, SmartHub can create a synthetic view (e.g., a 3D model) of the scene and send that information. In this system, the cameras that are likely to coordinate are connected to one processing node (SmartHub), which creates an abstract understanding of the coverage area and shares the coordinated and synchronized information with the other processing nodes and central unit.
To avoid the bandwidth bottleneck, we exploit the organization of LAN infrastructure. A LAN consists of multiple switches, arranged in a hierarchical fashion. The cameras are connected to the lowest level of switches. The data going out of the switch has a bandwidth limitation depending on the number of other switches in the network and data flow. For communication within a switch, however, almost full Ethernet bandwidth is available.
We propose that the cloud processing nodes should be connected to the switch to which the corresponding cameras are connected. With that setup, we would be able to send high quality video to SmartHub for information fusion, and abstract information can be forwarded to the central unit through higher level switches. The proposed scheme is shown in Figure 6. A SmartHub based system has various merits over both centralized and smart camera based systems. These merits and salient features of a SmartHub based system are discussed below. The topics for the discussion have been chosen in consideration of the current state of research focusing on design and quality issues.

Storage Scalability.
A SmartHub with storage capabilities can provide an excellent distributed data storage architecture. Storage at each smart camera is costly, whereas unified storage at the central location is not scalable. Hence, SmartHub can provide a midway solution for storage.
Storing video at smart cameras is costly because smart cameras generally use flash memory. In contrast, SmartHubs can employ disk memory on the cloud. Table 4 compares the price of hard disk and flash memory. We see that the cost per GB for hard disk memory is from 6.25 to 8.9 cents, whereas flash memory prices range from 183 to 210 cents per GB. Furthermore, the price of compact memory used in smart cameras increases more rapidly with capacity.
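As a quick check on what these per-GB prices imply, the sketch below computes the cost of a given capacity on each medium and the range of the flash-to-disk price ratio. The per-GB figures are the ones quoted above; the 1000 GB capacity and all function names are assumptions made for illustration.

```python
# Price ranges in cents per GB, as quoted in the text (low, high).
HDD_CENTS_PER_GB = (6.25, 8.9)
FLASH_CENTS_PER_GB = (183.0, 210.0)


def cost_range_dollars(capacity_gb, cents_per_gb):
    """Return the (low, high) cost in dollars of storing capacity_gb on a medium."""
    low, high = cents_per_gb
    return (capacity_gb * low / 100.0, capacity_gb * high / 100.0)


def flash_to_hdd_ratio_range():
    """Return the (smallest, largest) flash/HDD price ratio implied by the two ranges."""
    smallest = FLASH_CENTS_PER_GB[0] / HDD_CENTS_PER_GB[1]
    largest = FLASH_CENTS_PER_GB[1] / HDD_CENTS_PER_GB[0]
    return (smallest, largest)


if __name__ == "__main__":
    capacity_gb = 1000  # assumed capacity, for illustration only
    print("HDD cost ($):  ", cost_range_dollars(capacity_gb, HDD_CENTS_PER_GB))
    print("Flash cost ($):", cost_range_dollars(capacity_gb, FLASH_CENTS_PER_GB))
    print("Flash/HDD price ratio range:", flash_to_hdd_ratio_range())
```

With the quoted prices, flash storage comes out roughly 20 to 34 times more expensive per GB than hard disk storage.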

Reduced Processing Repetition.
Video analysis involves low level processing (background-foreground classification) and high level processing (blob detection and tracking). In smart camera systems, the low level steps of background and foreground detection are repeated both at the camera and at the processing node that fuses data from multiple cameras. In SmartHub, since object detection and fusion are performed at the same place (cloud), there is no repetition of processing. We see in Figure 3 that SmartHub needs 30% less processing for the same task, as foreground detection is only done once.

Lower per Sensor Cost.
The per sensor cost of the overall system is very high in smart camera systems due to the enhanced capabilities of smart cameras. On the other hand, SmartHub offers reduced cost as there is only one hardware unit for a group of sensors. This topic has been included to emphasize that the smart cameras add to the cost of the system without providing equivalent benefits. SmartHub provides better performance with reduced cost. A normal IP camera costs around $100, while a smart camera costs approximately 10 times more than a normal camera. If we consider a 4-camera system, building the system with smart cameras would cost at least $4000. Alternatively, a cloud processor with processing power equivalent to 4 cameras would cost less than $1000. Hence, a SmartHub system with 4 cameras would cost at most $1400, that is, 65% less than the smart camera system. Furthermore, the processing power over cloud is available for other applications when the video processing workload is minimal [41].
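The arithmetic behind this comparison can be written out as a short sketch. The prices ($100 per IP camera, a factor of 10 for smart cameras, and $1000 as the upper bound for the cloud processor serving the group) are the approximate figures from the text; the function and parameter names are hypothetical.

```python
def smart_camera_system_cost(num_cameras, ip_camera_cost=100.0, smart_factor=10.0):
    """Cost of a system built only from smart cameras (each ~10x a normal IP camera)."""
    return num_cameras * ip_camera_cost * smart_factor


def smarthub_system_cost(num_cameras, ip_camera_cost=100.0, hub_cost=1000.0):
    """Cost of normal IP cameras plus one cloud processor for the whole group."""
    return num_cameras * ip_camera_cost + hub_cost


def saving_fraction(baseline_cost, alternative_cost):
    """Fraction of the baseline cost saved by choosing the alternative."""
    return (baseline_cost - alternative_cost) / baseline_cost


if __name__ == "__main__":
    smart = smart_camera_system_cost(4)
    hub = smarthub_system_cost(4)
    print(f"smart camera system: ${smart:.0f}")
    print(f"SmartHub system:     ${hub:.0f}")
    print(f"saving:              {saving_fraction(smart, hub):.0%}")
```

Because the hub cost is an upper bound, the 65% saving for 4 cameras is a conservative estimate under these prices.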

Others.
Sensor coordination and synchronization is also difficult in centralized and smart camera based systems due to random network delays at intermediate nodes. For a fixed bandwidth, SmartHub will provide the best tracking performance as high quality video from overlapping view cameras is available at one node without causing additional bandwidth overhead. Similarly, to achieve the same level of tracking accuracy, a centralized system will need high quality video from multiple cameras, causing large bandwidth overhead. A summary of the above discussion is provided in Table 3.

Design and Analysis
The main question in designing the system is the number of cameras to be connected to a SmartHub. To determine a suitable number, we conduct a task based analysis. Consider a video analysis task of human detection and matching. We chose this task because it is a very common task for video based applications [42,43]. In this task, we detect the humans in each camera and match them across cameras to obtain the best view of a person.
The task can be accomplished in both centralized and distributed architectures. However, each architecture will have different overheads in accomplishing the task. The overheads are abstracted in two categories: communication overhead and processing overhead.

Communication Overhead.
For human detection and matching, high quality image regions need to be transmitted to other processing nodes over the network. For cameras with overlapping views, it is very useful to share the facial data to match humans and track across obstacles. For nonoverlapping cameras, the human data is only required when tracking a person over a larger territorial region or when there is a specific threat generated at one camera and the person needs to be detected at all possible places. Therefore, we assume that information from each pair of cameras needs to be fused with a nonzero probability. This implies that in a purely distributed smart camera network every pair of cameras needs to communicate with each other probabilistically.
We have modelled the communication overhead in terms of camera overlap and data sharing requirements. Let $C = \{c_1, c_2, \ldots, c_n\}$ be the set of cameras, where $n$ is the number of cameras. Let $A = \{a_1, a_2, \ldots, a_n\}$ be the set of areas covered by the corresponding cameras. Let $w$ and $h$ be the average width and height of the human region in number of pixels. The total communication overhead for a smart camera network is calculated from $w$, $h$, and the pairwise overlaps $o_{ij}$, where $o_{ij}$ is the normalized intersection of the areas covered by the $i$th and $j$th cameras.
In a SmartHub based architecture, the communication overhead is mainly due to the communication among SmartHubs. Let $S_k$ be the set of cameras connected to the $k$th SmartHub. The total amount of data flowing out of the $k$th SmartHub is the overhead $\Omega_k$ due to that SmartHub, and the total overhead is the sum of the overheads due to all SmartHubs; that is,
$$\Omega = \sum_{k=1}^{n_s} \Omega_k,$$
where $\Omega_k$ is the overhead due to the $k$th SmartHub, $\Omega$ is the overhead in the SmartHub based architecture, and $n_s$ is the number of SmartHubs. If $m$ is the number of cameras connected to a SmartHub, the total number of SmartHubs is determined by $n$ and $m$. Note that the communication overhead depends mainly on the communication requirements between the processing nodes and on the camera placement.
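A minimal sketch of one overhead model consistent with the description above follows. It is an assumed form, not the paper's exact equations: each camera is taken to share a human region of width `w` and height `h` with every other camera, weighted by their normalized overlap; in the SmartHub case only pairs of cameras on different SmartHubs contribute, and consecutive cameras are grouped `m` per hub. All function names, the grouping rule, and the toy overlap values are hypothetical.

```python
def smart_camera_overhead(n, overlap, w=1.0, h=1.0):
    """Assumed form: sum of w*h*overlap(i, j) over all ordered camera pairs with i != j."""
    return sum(w * h * overlap(i, j) for i in range(n) for j in range(n) if i != j)


def group_cameras(n, m):
    """Assign consecutive cameras to SmartHubs, m per hub (the last hub may be smaller)."""
    return [list(range(start, min(start + m, n))) for start in range(0, n, m)]


def smarthub_overhead(n, m, overlap, w=1.0, h=1.0):
    """Assumed form: for each hub, data shared with cameras outside that hub, summed over hubs."""
    total = 0.0
    for hub in group_cameras(n, m):
        members = set(hub)
        for i in hub:
            for j in range(n):
                if j not in members:
                    total += w * h * overlap(i, j)
    return total


def toy_overlap(i, j):
    """Toy placement (assumed values): neighbours overlap by 0.5, all others by 0."""
    return 0.5 if abs(i - j) == 1 else 0.0


if __name__ == "__main__":
    n = 4
    print("smart cameras:", smart_camera_overhead(n, toy_overlap))
    for m in (1, 2, 4):
        print(f"SmartHub, m={m}:", smarthub_overhead(n, m, toy_overlap))
```

With one camera per hub this model reduces to the smart camera case, and with all cameras on one hub the cross-hub overhead vanishes, which matches the qualitative trend the text describes.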

Processing Distribution.
In a smart camera network, all the processing is pushed to the edge. The processing tasks are completely distributed among processing units. With the introduction of SmartHubs, we bring the processing one level higher. In a completely centralized system, all the processing is done at a single node. This introduces a processing bottleneck and a single point of failure. Therefore, distributed processing is a desired characteristic of an architecture and it is measured as processing distribution ($D$).
If $p$ is the amount of processing required to complete the task, the processing load on a single node in a smart camera network is $p$. In a SmartHub based architecture, the processing load ($\Psi$) on a SmartHub depends on the number of cameras connected to it. Consequently, the processing distribution is calculated from $\Psi$ and a normalizing coefficient, which is chosen to limit the maximum value of processing distribution to 1 for the smart camera case. The distribution reaches its minimum value when all cameras are connected to a single SmartHub. To measure the most adequate number of cameras for a SmartHub, we define an optimization function that combines the communication overhead and the processing distribution, to emphasize that we wish to reduce the communication overhead while still being able to distribute processing.
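The sketch below illustrates one way to express the processing load, the processing distribution, and a combined objective. Every formula in it is an assumption, chosen only to satisfy the properties stated in the text (the distribution equals 1 in the smart camera case, decreases as more cameras share a SmartHub, and is smallest when one node does everything); it should not be read as the paper's own equations. The names `processing_load`, `processing_distribution`, `objective`, and `toy_overhead` are hypothetical.

```python
def processing_load(m, p=1.0):
    """Assumed load on one SmartHub serving m cameras, each needing p units of processing."""
    return m * p


def processing_distribution(m, n, p=1.0):
    """Assumed linear form: 1 when m == 1 (smart camera case), 1/n when m == n."""
    kappa = 1.0 / (n * p)  # assumed normalizing coefficient
    return kappa * (n * p - processing_load(m, p) + p)


def objective(m, n, overhead, p=1.0):
    """Hypothetical score (higher is better): distribution minus normalized overhead.

    `overhead` is any callable mapping cameras-per-hub to communication overhead;
    it is normalized by overhead(1), which is assumed to be positive.
    """
    return processing_distribution(m, n, p) - overhead(m) / overhead(1)


def toy_overhead(m):
    """Toy overhead curve (assumed): falls off as more cameras share one hub."""
    return 10.0 / m


if __name__ == "__main__":
    n = 100
    for m in (1, 4, 10, 50, 100):
        print(m, round(processing_distribution(m, n), 3), round(objective(m, n, toy_overhead), 3))
```

The objective is written as a difference rather than a ratio so that it stays defined when the overhead drops to zero; other combinations would serve equally well for illustrating the trade-off.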

Experimental Results.
In this section we obtain the number of cameras to be connected to a SmartHub in a given scenario. While the framework can be applied for any task and any given scenario, we consider a camera placement scenario as given in Figure 7 for experimental purposes.
For the experiments, we considered 100 cameras ($n = 100$) and assumed that the amount of processing required for the detection and recognition tasks is unity; that is, $p = 1$. The height and width of the human blob are also assumed to be unity ($w = 1$, $h = 1$). For the scenario given in Figure 7, consecutive cameras have a fractional overlap of 0.16, otherwise no overlap; that is,
$$o_{ij} = \begin{cases} 0.16, & |i - j| = 1, \\ 0.01, & \text{otherwise.} \end{cases}$$
The resulting communication overhead for the given placement is shown in Figure 8. We see that the overhead initially reduces rapidly until $m \approx 4$ and then becomes horizontal. Thus, beyond a point, adding more cameras to a SmartHub does not provide any significant advantage in reducing the bandwidth requirement.
The processing distribution decreases linearly with the number of cameras connected to a SmartHub as shown in Figure 9. The combined optimization function is plotted in Figure 10. With the help of this figure, we conclude that 4 to 15 cameras could be connected to a SmartHub for adequate trade-off between communication overhead and scalability.

Limitations.
While there are multiple advantages of SmartHub in multicamera systems, these are tightly coupled with network topology. The proposed SmartHub architecture assumes that the network follows a tree topology. Experiments also reveal that the benefits of SmartHub are significant only when the number of cameras is large and the task at hand requires fusion of video from multiple cameras.

Conclusions
Smart cameras are inefficient and costly in scenarios with multiple overlapping cameras. Such scenarios are common in setups using cheap video sensors. The scalability achieved by smart cameras puts additional constraints on the system which compromise the performance of vision algorithms. Similar scalability is achieved by processing data over cloud with SmartHub based architecture. The given framework can be used to calculate an adequate number of cameras to be connected to a SmartHub for a given camera placement. For a general placement with consecutive overlapping cameras, it is adequate to have 4 to 15 cameras per SmartHub. In the future, we intend to deploy SmartHub on dedicated hardware units and explore more design decisions.

Figure 1: Functional view of typical video analysis systems.

Figure 5: Cameras capture video and send it to the SmartHub nodes in the cloud. SmartHubs process the data as a service to the central security unit. Based on the situation, the central unit can enable or disable a particular service.

Figure 8: Communication overhead versus number of cameras per SmartHub.

Table 1: Video analysis applications.

Table 2: Brief summary of smart camera works.

Table 3: The comparison of SmartHub, smart camera, and centralized systems.

Table 4: Approximate current prices of flash and hard disk memory.