A fuzzing-based test-creation approach for evaluating digital TV receivers via transport streams

Digital TV (DTV) receivers are usually submitted to testing systems for conformity and robustness assessment, and their approval implies correct operation under a given DTV specification protocol. However, many broadcasters inadvertently misconfigure their devices and transmit the wrong information concerning data structures and protocol format. Since most receivers were not designed to operate under such conditions, malfunction and incorrect behaviour may be noticed, often recognized as field problems, thus compromising a given system ’ s operation. Moreover, the way those problems are usually introduced in DTV signals presents some randomness, but with known restrictions given by the underlying transport protocols used in DTV systems, which resembles fuzzing techniques. Indeed, everything may happen since any deviation can incur problems, depending on each specific implementation. This error scenario is addressed here, and a novel receiver robustness evaluation methodology based on non-compliance tests using grammar-based guided fuzzing is proposed. In particular, devices are submitted to unforeseen conditions and incorrect configuration. They are created with guided fuzzing based on real problems, protocol structure, and system architecture to provide resources for handling them, thus ensuring correct operation. Experiments using such a fuzzing scheme have shown its efficacy and provided opportunities to improve robustness regarding commercial DTV platforms.

Over the last decades, many countries migrated from analogue TV systems to digital TV (DTV) standards [1].In Europe, Séquentiel couleur à mémoire (SECAM) and phase alternating line (PAL) [2] were replaced by digital video broadcasting terrestrial (DVB-T) [3] and, more recently, DVB-T second-generation (DVB-T2) [4,5].North America moved from national television system committee (NTSC) [2] to advanced television system committee (ATSC) [6], now being updated to ATSC 3.0 [7].Japan adopted the standard named integrated services digital broadcasting -terrestrial (ISDB-T) [8].In addition, most of South America chose its variant known as ISDB-T version B (ISDB-T B ) or Brazilian digital television system (SBTVD) [9], as a substitute to PAL, SECAM and NTSC.Consequently, receivers in those regions should conform to specific DTV standards, usually accomplished with commercial testing systems or proprietary approaches.experimentation and providing verification results.The specific research challenge involved here may be stated as developing a methodology to evaluate DTV receivers under the wrong configuration and unforeseen scenarios while providing means for device enhancement.Additionally, the proposed methodology defines a complete evaluation approach, which includes a testing environment, audio and video verification algorithms, and a strategy for test creation.Regarding the latter, previous field problems, user interfaces of commercial equipment, and sensitive data are used as starting points for guiding and devising tests.However, all that can grow and evolve with time in order to provide a mature testing framework.Consequently, new verification cases are expected as the understanding of the associated environment advances so that the resulting verifier can be up to date and also continuously improved with checks that distantly resemble the initial ones.

| A brief discussion on motivation
As hinted in the Introduction, there is a gap regarding DTV receiver testing, which traditionally does not address misconfiguration in head-end equipment.Indeed, current testing resources only consider the correct formation of DTV signals; however, those systems have an attractive condition that can lead to incorrect configuration data: due to the intersection of different subsystems, such as audio and video source coding, multiplexing, and SI/PSI, the same equipment can be even used for different DTV transmission systems, which human operators then configure.In addition to the enormous configuration options available, this scenario can lead to DTV signals with wrong data, which is not usually expected by commercial receivers, focusing on conformance to what should be transmitted.Consequently, it is not unlikely that wrong configuration is incorrectly handled in receivers, which may lead to non-operating devices or compromised output interfaces (i.e.video and audio) as perceived by end users.
Note that such occurrences then give rise to a chain of events whose peak constitutes the replacement or even the return of a product, bringing manufacturers financial loss.Although one may claim that it is only a matter of 'correcting' the faulty information, such an action is not as easy as initially assumed.Sometimes, small broadcasters can not even be contacted; if they are, they may not know what to do.This way, the onus of such occurrences is often left to manufacturers.
However, by analyzing real field problems and considering how they arise, why not tackle them and anticipate what will probably happen in the field?Moreover, can we create a methodology complementary to usual verification procedures, for example, bounded model checking [43,44] and conformance testing [45,46], so one does not have to wait until something happens on user premises?In addition, given the way field problems arise, that seems to be the case and undoubtedly involves fuzzing techniques to mimic the randomness identified in their constitution.
In summary, the motivation of this work became evident and led to a novel verification framework that addresses a previously ignored condition: creating a scenario where signal configuration is purposefully corrupted to reveal receiver fragility, which is identified using the audio and video outputs available to end users.

| Contributions
This paper makes four main contributions.First, it presents frequent field problems and their root causes, which have never been revealed to the best of the authors' knowledge.Second, a methodology for DTV platform evaluation, providing a complete scenario with tools and infrastructure, is introduced and implemented, which can be applied to DTV systems worldwide.Besides, as extensively known, simple fuzzing usually presents low efficiency.However, based on how field problems usually manifest, a guided-fuzzing approach for test creation was devised, potentially revealing likely problems and dangerous scenarios.Finally, real experiments were performed for seven different receivers from five manufacturers, clarifying their significant weaknesses.

| Paper organization
The remainder of this text is organized as follows.Section 2 reviews testing technologies for MPEG TSs.Then, Section 3 discusses real field problems, while Section 4 presents fuzz testing, along with a brief analysis of the proposed approach regarding the available fuzzing strategies, classifications, and techniques.In Section 5, a methodology for receiver robustness evaluation is proposed, as well as a strategy, based on guided-fuzzing, for error creation.Experiments for validating this methodology and providing statistics are shown in Section 6, some related work is discussed in Section 7, and finally, conclusions are described in Section 8.

| TESTING RESOURCES FOR MPEG TRANSPORT STREAMS
Settings for PAL, SECAM and NTSC signals are restricted, with one video and, usually, two audio signals [2].By contrast, DTV systems provide multiple video and audio streams, electronic program guide (EPG), and interactive applications [47] that, in the case of SBTVD [22], are developed in nested context language (NCL), with scripts in Lua, or Java [22,24,27,48].Moreover, SBTVD and ISDB-T [8,49] allow different simultaneous modulation types.This way, a broad configuration is required, partly set by technicians at DTV head-ends.
Verifying receiver compliance and if MPEG TSs present the correct configuration are two necessary tasks that led to many studies.Regarding the former, southeast Europe has created a conformity assessment specification for DVB [50] to guide member countries through its evaluation.Concerning the latter, specialized analysers verify TS compliance [51].DVB has developed test specifications regarding bit rate, synchronization, and transmission parameters for MPEG TSs [52].It has also devised a measurement and signalling channel [53] that allows the insertion of test data into a specific packet identifier (PID).In Finland, rules for SI tables [54] have been published, which help broadcasters comply with NorDig receivers [55].Finally, a Ginga test suite has been developed [14], and operation guides have been released for DVB-T [56,57], ISDB-T [58] and SBTVD [59][60][61].It is worth mentioning that the Ginga compliance suite could be manually executed or even integrated into automated test equipment (ATE) [21].
In summary, there is no unified TS certification system because its composition may also change on the fly.Furthermore, although there is equipment for shared resources, such as essential TS formation [52], specificinformation devices are rare.Consequently, broadcasters are responsible for verifying TS conformity using automated tools, monitoring equipment, and manual checking.Besides, there is also a lack of devices for performing specific checks for SBTVD, which then forces the adoption of generic [47] or limited tools [62] and further complicates TS certification processes.
Moreover, although field problems are significant and lead to high costs for after-sales departments of manufacturers, they have always been handled in a non-systematical way.Receiver manufacturers try to contact broadcasters, and the latter, when affected (e.g.loss of audience due to non-operating receivers), try to correct its configuration; otherwise, manufacturers create fallbacks on their own to prevent further costs.
Due to that, robust devices seem more convenient for dealing with erroneous information.In this context, a further investigation performed by the same Brazilian TV manufacturer mentioned before led to gathering and analyzing TSs recorded during failures to make its devices more robust.Therefore, our investigation revealed conditions that had not been anticipated, thus inspiring robustness tests that also evaluate non-compliant scenarios.In addition, the present work is a direct consequence of such effort.It seems to be the first to address the mentioned problem and creates a methodology for DTV receiver robustness under configuration errors.Finally, it favoured the creation of verification equipment capable of being applied to real DTV signals.

| FIELD-PROBLEMS ANALYSIS
Formally, a field problem is an issue that happens outside a laboratory and after the development phases of a specific device, that is, during its after-sale period.It means field problems are first noticed by final users or retailers when trying to operate a given product.It happens mainly because such occurrences usually compromise audio and video outputs and even menu graphics and auxiliary information (e.g.EPG), thus directly affecting the resulting user experiences.
Field problems can be classified into system-, audio-, video-, application-and data-related occurrences.Further, these same categories can be divided into incapacitating and non-incapacitating problems.The first classification level is straightforward and identifies the specific part most affected by a given problem.In contrast, the second level refers to its severity, that is, if the associated device can still be operated.
It is worth noticing that DTV field problems are usually caused by implementation errors or wrong configuration sent by broadcasters, the latter being the focus of the present work.In the following sections, some problems that happened in Brazil are presented, which already include their classification as explained above.Moreover, not all categories are represented as the problems explained here were chosen only for being relatively important, that is, they are frequent or ultimately compromise receivers, and they are typically identified in roughly 60% of field problems routinely found, as reported by the R&D department of the previously mentioned TV manufacturer.In addition, to clarify some aspects specific to digital televisions, a brief description is first provided in Section 3.1.

| Basic aspects of the digital televisions systems
A digital television signal consists of programs multiplexed in a TS, as illustrated in Figure 1.Each program includes component streams, which can be audio, video, data (e.g.subtitles, specific tables and stream events), applications and synchronization information that is known as program clock reference (PCR).Audio and video are sent in compressed format [63], due to bandwidth restrictions.Additionally, data are packed depending on their use [64], and interactive applications are transmitted with objects or data carousels [65].Lastly, PCR information provides synchronization with a given transmitter, ensuring a suitable decoding rate.The latter is of paramount importance, given that it allows general and inter-media synchronization, with a 27 MHz clock, and even generates frequencies for output interfaces [64].
Usually, in SBTVD, two programs are sent: a high-definition one, also known as primary, identified with xx.1 [24], and targeted on fixed receivers; and a low-definition one, identified with xx.31 [24] and targeted on mobile receivers [24].Depending on specific cases, each program may contain more than one audio or video stream.Usually, only one video stream is transmitted.However, there may be one, two, or even more additional audio streams, which can bear, for instance, an original program's audio, in its original language, and audio description, for visually impaired people, which is composed of its original sound and scene description [66].Nonetheless, to notify receivers about primary streams, the field component_tag, in the stream identifier descriptor [67], is used.For instance, in SBTVD, a primary audio stream presents component_tag equal to 0x10, while the others follow the sequence 0x11, 0x12 and so forth [61].Finally, other types of streams may be present, such as subtitles, object carousels, and even Internet protocol (IP) packets [64,68].

| Video-related and incapacitating problems
This section presents field problems that primarily compromise the video output.This means the video subsystem processes the erroneous elements, leading to a condition where the affected device becomes non-operating.

| Wrong length of supplemental enhanced information
In TV services, the essential data are video, audio, and subtitles/closed caption (CC).On the one hand, the first two are usually sent as independent packetized elementary streams (PESs) [23].On the other hand, subtitles/CC may also be borne in MPEG version 2 (MPEG-2) video [6] and H.264 [69,70] streams, as happens in ATSC.This information is vital because many DTV platforms are employed worldwide, with differences only regarding air interfaces, which leads to the use of the same development platforms.
In one specific field problem, receivers malfunctioned when tuned to a broadcaster.After a first decoding attempt, it involved a lack of response to remote control, graphical user interface (GUI) slowness, and video absence.
A deep analysis of the related TS revealed that supplemental enhancement information (SEI) packets of type user_-data_registered_itu_t_t35 [69] presented wrong size in headers of H.264 network abstraction layer (NAL) units [69].In SBTVD, such messages are used for active format description (AFD) [25], but the faulty stream bore CC content [70,71], with a divergence of 3 bytes between message length and CC-data size.Consequently, parsing procedures were unable to identify subsequent NAL units.This problem occurred due to faulty professional equipment, which resulted in a fix request to the respective manufacturer and CC-feature disabling.Moreover, it could have been avoided if the actual data size (cc_count) [71] had also been taken into account.
One may notice that this problem could have happened in other DTV systems, such as ATSC, DVB-T and ISDB-T, because they also adopt H.264 and use commercial platforms.Moreover, even if transmitted in ATSC networks, such length information would still be wrong.Finally, the DTV markets in the United States and Europe also employ the platform that presented this behaviour.

| System-related and non-incapacitating problems
This section presents field problems that primarily compromise the whole DTV system in receivers, affecting all subsystems.However, the associated devices can still be operated.

| Clock-reference values smaller than what was expected
This problem is very aggressive and can affect many DTV subsystems in a commercial receiver, thus deeply compromising its operation.
A broadcaster sent a primary service that caused halting and slow video, making it impossible for end-users to watch the respective channel.A brief analysis of the referred TS revealed that some PCR values [23,64] were odd.In particular, there were two PCR lines: one adequate regarding the available presentation time stamps (PTSs) [23,64], and another one presenting values lower than expected.Besides, there was no PCR discontinuity [23].
From a receiver's point of view, PCR values must increase and, ideally, match a local 27 MHz clock.However, when sudden smaller values appear, traditional algorithms identify that the current clock is too fast.The associated phase-locked loop (PLL) [64] then tries to get synchronized, reducing its current frequency.Consequently, a generalized malfunction occurs, affecting video and audio reproduction, stream synchronization, and output interfaces [2,64].
Further investigation revealed that a new multiplexer was being tested, including PCR information into the same packets used by the original equipment.In addition, in another instance of the same problem, the same multiplexer generated two different time bases and added them to packets with the same PID.It can be seen in Figure 2, where one may notice two distinct PCR lines: upper and bottom parts of that figure, with red squares (other colors mean data related to PTS and decoding time stamp -DTS).The y axis shows PCR values, in milliseconds, with the two mentioned distinct lines, and the x one shows PCR packets along time, also in milliseconds.If the upper line were the only one, the internal PLL in receivers would lock on it with a frequency that matched PCR values from left to right (e.g. from 100 to 200 ms and so on).However, the lower line provides sudden changes in its frequency (e.g. from 100 to 50 ms and then back to 200 ms), which tampers with the underlying synchronization algorithm.
The problem explained above is tough because PCR is used for synchronizing local 27 MHz clocks [23]; however, wrong values could be ignored with filtering [72].Finally, PCR is present in any standard based on MPEG TSs, including ATSC, DVB-T and ISDB-T.

| Time gaps larger than what is recommended between two table sections
Usually, DTV standards define transmission rates for PSI/SI tables [24,60,61], such as program association table (PAT) and network information table (NIT), in order to make related information available according to its relevance.PAT sections, for instance, are usually advised to be transmitted in intervals of 100ms, while time offset table (TOT) ones may be delivered every 5 s.
During a particular glitch, television sets suddenly suffered a channel change when tuned to a specific broadcaster.Indeed, if its physical and virtual channels [24] were YY and X X , respectively, then DTV devices presented changes from X X :1 to Y Y :1 and back.
When reproduced in the laboratory with the faulty TS, an issue related to a non-conformity regarding NIT was discovered.Although virtual channel information in TS descriptors was sent as indicated by standard ABNT NBR 15603 [24], NIT sections were sometimes transmitted according to a time interval slightly higher than 10 seconds, which is the limit imposed by the same standard.Such a configuration caused filter timeouts in receivers, and, in the absence of that table, the respective physical channel was used as a fallback procedure.So, if a timeout happened, the physical channel was used; otherwise, the virtual channel was displayed in an alternating pattern.Given that such a handler incurred a channel modification, end-users assumed that a real channel change was in course.
One may notice that changing a channel was a specific feature of the identified DTV receiver model; however, ABNT NBR 15603 [24] indicated a mandatory transmitting period no longer than 10 s.Besides, NBR 15608 later suggested the transmission of NIT sections every second, which is usually done by several broadcasters [59,61].Finally, this is an example where developers strictly follow what is indicated by a standard, without error margins, a more elegant approach, or a more sophisticated fallback procedure.
In summary, this problem was a combination of wrong configuration and fragile software implementation, which is very common in DTV environments.Finally, the table transmission rate is a concern in every DTV system, and platforms must provide a way of handling wrong setups.

| Data-related and incapacitating problems
This section presents field problems that primarily compromise the auxiliary data sent in a TS.The erroneous elements are processed by the SI/PSI subsystem and lead to a condition where the affected device becomes non-operating.

| Conditional access information transmitted in a free-to-air DTV channel
Conditional access information is usually employed in cable and satellite TV systems to restrict access to entitled subscribers.In summary, such a piece of information is split into a specific table, called conditional access table (CAT), related descriptors, entitlement management messages (EMMs), and entitlement control messages (ECMs).However, due to their characteristics, those are not usually employed in free-to-air and terrestrial television systems.
In this particular problem, from time to time, TV sets did not display the primary service of a broadcaster.When that happened, they crashed, and a complete system reset was required to recover regular operation.An investigation conducted on a TS presenting this problem revealed conditional access information in CAT sections, conditional access descriptors carried by program association table (PMT) sections, and fields conditional_access_mode in service description table (SDT) sections [24], even though TS packets with the PID associated to EMMs available in CAT sections were not present.Consequently, DTV receivers tried to initialize their decryption circuits without real encrypted data, which eventually caused the mentioned behaviour.
Indeed, information related to conditional access should be ignored in free-to-air terrestrial transmissions, at least in Brazil and many other countries.However, DTV platforms are used in many conditions and markets (horizontal and vertical ones), leading to code for that scenario being run in non-compliant conditions, which may also happen in other DTV standards.

| Non-existent services specified in NIT
DTV systems deeply rely on PSI/SI tables; however, not all tables must be transmitted, but a subset of them usually includes PAT, PMT, NIT, SDT and event information table (EIT) sections.In particular, NIT sections bear essential information regarding services and their classification, which usually drives structure creation in a receiver's memory.
During a specific problem, which repeats itself once in a while, receivers usually have their databases corrupted and consequently stop working when they are tuned to a given channel or do not display services.Following the acquisition of the respective TS, it was noticed that NIT sections carried a service list descriptor with several 'ghost' services, which were not present in SDT and PAT sections [24].
After contacting technicians, the latter informed us that many tests with different DTV services had been performed, and the related information was never removed.Although SBTVD and other DTV standards (e.g.DVB-T and ISDB-T) report that PSI/SI tables must be consistent, receivers may perform cross-checking to use only data that seems correct.
As a final remark regarding DTV systems, much information is often shared among different tables, which must be consistent and checked before transmission.Moreover, information that presents many configuration possibilities is highly prone to error and should be carefully cross-checked.

| Data-related and non-incapacitating problems
This section presents field problems that primarily compromise auxiliary data sent in a TS.The SI/PSI subsystem processes the erroneous elements, but the affected device can still be operated.

| Inconsistent encoding of audio and video streams
In DTV, there are two layers of media information: multiplex, in PSI/SI tables and usually employed for decoder configuration, and media, in elementary streams (ESs) and accessed only by decoders.A disagreement between those is a classic problem that may lead to audio and video situations.
Many transmitters used in Brazil have been configured with information used in Japan, where the audio data transport stream (ADTS) container [73] was initially adopted.Consequently, low-overhead MPEG-4 audio transport multiplex (LATM) [74] streams are sometimes transmitted with ADTS specified in PMT sections (field stream_type), which may be wrongly performed by technicians [75].Then, decoders may be erroneously configured, or audio may not even be decoded due to a lack of support to ADTS or algorithm allocation inconsistent with LATM.
Indeed, as technicians can provide information regarding elementary-stream format, other DTV systems may also suffer from that.In contrast, decoding is possible if an implementation is used as the required information is available on ESs.Finally, a receiver could perform decoding attempts with all available formats as a last resource.In the past, a similar problem occurred in DVB-T receivers: MPEG version 4 (MPEG-4) part 10 video being tagged as MPEG-2.

| Audio PID specified in PMT sections but not present in a given TS
In summary, TSs are collections of packets with different PIDs, which are multiplexed by time division [64].Nonetheless, no central element exists that provides information regarding all of them.However, instead, that is split across PSI/SI tables, while other present fixed PIDs, such as PAT, with PID equal to 0x00, and NIT, with PID equal to 0x10 [23,24].
In this field problem, which is very common and happens almost every year, DTV receivers were not presenting audio when tuned to a specific broadcaster, even with four audio streams being transmitted.Indeed, the audio stream with component_tag equal to 16, which should be the primary one [20], was not present, that is, TS packets tagged with its respective PID were not transmitted.Besides, crash events may even be perceived, depending on the assigned PID and its absence or different content.
PIDs announced in PMT sections can be configured in different ways, which must be consistent with what is currently provided by a multiplexer.If PIDs in PMT sections change for some reason, that information will not be referenced in a TS, that is, those packets will be present, but receivers will be unable to recognize them.Again, that is a common problem in almost all DTV systems worldwide, which are based on MPEG-2 TSs.
When an audio or even a video stream can not be decoded or nothing comes out of its PID filter, one possible action could be to try a stream with the next component_tag or, if that information is not sent in the stream identifier descriptor [67], the next stream specified in a given PMT section, in a subsequent table loop.

| Closed caption stream with wrong component descriptor
PSI/SI tables present fixed and dynamic information, with descriptors and table fields, respectively [23,67].The former is used for special or temporary cases without fixed information and tools for quick and transparent standard updates.
In Brazil, often, TV sets do not present CC when tuned to some channels.A brief analysis showed that component descriptors [24] were sent in PMT sections; however, the fields stream_content and stream_type were set with values reserved for future use, according to the DVB's standards [67].
There is an intersection between DTV systems, as most use MPEG TSs and PSI/SI tables; however, each has specific definitions.Consequently, many devices usually come with default configurations for a different system.Besides, many technicians do not analyse the SBTVD's standards and only use default settings.Moreover, fields with many options may easily lead to the wrong attribution, even with automated tools, due to implementation errors or misinterpretation.
Reserved values should be selectively used or rejected in robust DTV receivers to avoid functionality impairment.Finally, this same problem could also happen in ISDB-T networks.

| Service description table with incorrect section number
PSI/SI tables can be very long and are divided into sections of at most 1 or 4KB [22,67].Two table fields report their organization: section_number, with the current section, and last_section_number, with the last set section.
When tuning to digital channels, DTV sets store either the high-definition or the low-definition [20] program, whose choice depends on the moment of a scan.As a result, several televisions end up tuning to the mobile signal and displaying low-resolution video.Moreover, most DTV sets block mobile signals due to their low resolution.This issue leads receivers to ignore the mobile-signal blocking, which exemplifies the extent of a given field problem.
A deep analysis revealed that the associated SDT [24] specified two services in two different sections (section_number¼ 0/1), but informed only one section (last_section_number ¼ 0).Some devices would then show content from the first received section and ignore the other.Indeed, this problem happens due to a combination of receiver fragility and faulty human operation.The latter is because one decided to split service descriptions into different sections, which may happen to other tables.
One way to increase receiver robustness would be to avoid using the field last_section_number [24] directly and increase the priority of section_number: the final table would always be the union of all sections, no matter the last one.Finally, such a problem can affect any DTV system based on MPEG TSs, due to the extensive use of SDT.

| Different frequencies using the same virtual channel
Virtual channels may cause problems as they access DTV services, and SBTVD standards do not clearly explain this part.DTV broadcasters from the very high frequency (VHF) bands should send virtual channels in their ultra-high frequency (UHF) transmissions, with the same VHF number [20].In contrast, new stations should send virtual channels set with their physical ones.Different operators sometimes set the same virtual channel numbers, which makes them absent while browsing DTV programming in receivers.Indeed, the preferred index key for navigation is a physical channel number, which is unique.However, that is implementation-related and should be encoded in software.
Finally, this error is fascinating because it goes beyond a single DTV operator and involves an entire network.Although transmission monitoring is performed in Brazil, it is an arduous, highly open, and dynamic task.

| Application-related and non-incapacitating problems
This section presents field problems that primarily compromise the application subsystem.Consequently, a middleware module processes the erroneous elements, and the affected device can still be operated.

| Compressed Ginga application with incorrect descriptors
One essential aspect of DTV systems regards distributing and running interactive applications [9,10], usually sent in object or data carousels [10], compressed or not.The latter is also informed in PSI/SI tables and an object carousel.
Sometimes, DTV sets identify a Ginga application [27] in a TS, but its loading procedure never completes.As a result, it is never executed, although some televisions keep announcing its presence.
After analyzing object carousels, it was found that applications were in compressed format, even though that was not shown in the related carousel identifier [65] and compression type [76] descriptors.As a result, some receivers attempted to run them in compressed format.
Although compression was not informed, some devices did run those applications.When its entry point was not found, they tried to decompress it, as a fallback procedure.Further analysis revealed that there was a compressed module descriptor [65], which is specific to DVB, but not the compression type one [76], as used for ISDB-T and SBTVD.Again, it could also affect other DTV systems, such as ISDB-T.

| Discussion
All field problems presented in this paper are real and were identified and characterized during terrestrial transmissions in Brazil.Simultaneously, many of them were also noticed in networks and receivers conforming to other standards.Besides, all analysis results reported here were confirmed and led to changes in transmission-equipment configuration and receiver software.Moreover, some also resulted in bug fixing related to head-end equipment's source code.
Those non-compliance events indicate errors caused by malpractice in configuration, impairing fragile DTV receivers.Nonetheless, there were also problems caused by faulty equipment, which is not usually expected, and incorrectly-or carelessly-set table fields, which present many options and versions.
Many DTV subsystems may present wrong information, mainly regarding ESs; synchronization and timestamps; PSI/SI tables, including descriptors, the relation among streams, and multiplex configuration; interactive applications; and interrelation among fields.In addition, their emergence presents some randomness, which arises from unforeseen interactions among configurations available for manual and even automatic setting, receiver-code development, transmission equipment development, standard restrictions, and configuration-data structuring.It became apparent with the analysis performed due to the field problems shown above, which also shed some light regarding ways of predicting them and providing handling algorithms.Finally, such problems usually lead to noticeable symptoms: the absence of audio or video; video freezing, where images do not change; video flickering, where images are suddenly black or white; frame skipping, where frames are lost, artefacts and audio discontinuities.Consequently, device robustness improves after-sales cost reduction if fragility is probed, monitored, and then handled.
Concerning DTV-system robustness, one may argue that anything may happen, following their characteristic randomness.However, it is possible to take real issues and create peripheral or similar non-compliance random events, as related data are also susceptible to error in a guided fashion.Besides, elements in GUI and highlyflexible data are essential and prone to problem emergence.Therefore, to evaluate and improve DTV receivers' robustness, it is necessary to create non-compliant TSs based on the mentioned ones and already incorporate the randomness characteristic of this kind of field problem, which directly leads to fuzzing techniques (see the next section).
Another point regards end-users: do they care about the problems presented here?The answer is: yes, they do.In addition, the explanation is simple.In most problems presented here, an end user could not watch TV due to the problematic video or audio presentation caused by them, which led to massive access to TPV/Envision's after-sales department and product returns both by retailers and end-users.Moreover, when a DTV receiver is not compromised, the impossibility of using affected features leads to a low-quality perception.It was also noticed during the data collection performed in the context of this study.
The previous statements inspired the methodology presented here, which will be introduced and developed.Moreover, a given standard restricts conformity assessment; however, non-conformity takes a vast range of possibilities, even if bounded as suggested: previous problems, user interface elements, and sensitive data, which leads to some automation.Besides, it should be device-independent due to lower costs and higher applicability.
Finally, most problems reported here are not restricted to SBTVD and may happen in other systems, such as ISDB-T, ATSC, and DVB-T.Consequently, a methodology able to assess receiver fragility would present broad applicability.APPROACH Fuzzing is a popular testing technique, developed in the 1990s, that relies on the random generation of data for evaluating computer systems [30].Its simplest version requires little knowledge of target systems, even without the related source code, and can be easily scaled to large arrangements.It can be executed in four steps: test case generation, that is, random input data generation; test case execution; state monitoring; and exception analysis.However, such an original simplicity also presents an important disadvantage: low effectiveness regarding bug identification.Consequently, it has also evolved a lot since its creation, with techniques that include guided testing, mutation, program analysis, machine learning, slicing, and symbolic execution [31].
Test quality, in fuzzing, is of paramount importance, including input format and granularity, which should, with high probability, lead to a program failure.Fuzzing may rely, as inputs, on files, communication data, or even executable binaries; however, in the case of DTV systems, DTV signals, according to a given protocol, take place, which usually converge on the TS level [23].Consequently, fuzzing in DTV networks inevitably involves the generation of broken TSs so that failures in specific subsystems are accessed.Besides, when a test case is run, fuzzers need to monitor program states or outputs to identify exceptions and crashes, where even specific tools for program monitoring can be used [31].Nevertheless, as the present works focus on DTV receivers, every symptom converges to their standard outputs: audio and video.
Fuzzers can also be classified in different ways, according to the aspect in focus.Regarding input creation, Fuzzers can be seen as either generation-or mutant-based elements [36].The former involves transformations over a reference input, while the latter employs system specification.Although simpler, mutant-based fuzzers usually lack coverage and are often less effective, with inputs that largely deviate from what is expected.Nonetheless, as DTV systems already provide deep specifications and reference software, a generation-based fuzzer has everything expected to create test cases and seems feasible.When code dependence comes into play, a fuzzer can be classified as white-, grey-, or black-box.Whitebox Fuzzers have access to source code, while black-box ones do not.In addition, grey-box fuzzers usually employ program analysers to obtain knowledge of a program.That being said, when evaluating DTV receivers, their source code will probably not be available.However, there are open standards, and the way receivers are implemented and their resulting processing chain are well known.Consequently, any analysis can be classified at least as a grey box.Next, program exploration can also be taken into account.Direct fuzzers result in test cases that aim at specific code and paths, that is, known vulnerable parts, while coverage-based ones try to increase code coverage.In the present case, we envision a test targeting the DTV processing chain and, more specifically, fragile parts involved in field problems.Consequently, we look for a coverage-based method [37].Finally, there are dumb and smart fuzzers.The former usually only rely on data range and blindly generate test cases.At the same time, the latter holds a more profound understanding regarding what is being handled, which allows going further in code paths and processing chains.Indeed, one should notice that we can start from knowledge regarding standards and field problems, the latter tending to repeat themselves or occur in similar areas, and the most fragile parts (see Section 3), which can be encoded into our fuzzer, then making it a smart one.
In addition, higher code coverage is also desirable, at least over time and based on new knowledge.Consequently, there must also be a manner of increasing tested parts and modules based on some fragility criteria.This way, we both improve efficacy and coverage.Besides, by taking into account what was presented above and having in mind our target application, that is, DTV receivers, we can summarize the main questions involved in fuzzing and their respective answers in the light of the present problem: • inputs are obtained based on the knowledge of DTV standards, known field problems, and fragile spots, which directly leads to improved efficiency when compared with purely random strategies; • test cases are generated based on the DTV processing chain and known implementation strategies, targeting potentially-vulnerable code parts; and • DTV receivers can be tested by sending signals (TSs) and monitoring their audio and video outputs, with test time and parameters adapted to specific DTV standards.
Moreover, a smart fuzzer designed for DTV purposes can be depicted as shown in Figure 3.A generator, fed with information for smart operation, creates input TSs and presents them to a device under test.Then, an output monitor, focused on audio and video outputs, judges if a violation is present, resulting in other tests or error registration (bug).Besides, the smart information in Figure 3 also includes data for somehow creating malformed structures, in terms of organization, or even hit corner or cryptic cases in DTV specifications, which leads to a grammar-based approach [34,38,77].The latter both focus on specific structures and provide apparently-valid data for the DTV processing chain.For example, the structure of PMT sections is shown in Table 1 [23], which is, undoubtedly, a possible target for the proposed grammar-based guided-fuzzing approach.Furthermore, as can be inferred from the problem explained in Section 3.5.1, the field stream_type, which is 8-bits wide, should undoubtedly be fuzzed, as a myriad of problems may be introduced, such as disagreement between content and signalled encoding and packet absence.Finally, many other essential fields can also be fuzzed, such as descriptor and last_section_number/section_number.
It is worth noticing that a DTV standard is indeed a sort of network protocol, as the global system for mobile communications (GSM) [77].From the receivers' point of view, it is summarized on the structures and information available in TSs, which are then used for the entire DTV processing chain.However, this chain does not include protocol responses on an associated channel as it is a simplex system (there are no responses) [22].Indeed, results are only visible to internal receivers' structures or via their outputs.So, incorrect data can lead to lasting and deep effects that may even echo across different modules, with the potential for compromising the resulting user experience.However, ultimately, those errors can be captured via audio and video interfaces.
As examples of protocol fuzzers, we can cite AutoFuzz [42] and AFLNet [40].The former can learn a protocol and modify communication sessions, while the latter also fuzzes real messages with a learning algorithm.Indeed, protocol fuzzers often present a target area [39,41], but they are not usually focused on specific standards.In addition, they rely on protocol conversations (message exchange) between a client and a server, which is not the case with DTV systems.In contrast, the present approach focuses on the latter.It provides a structure compensating for its simplex construction with real problem scenarios, fragile processing parts regarding the associated data, and head-end configuration aspects.When response messages [40] are unavailable, the associated communication structure and data relation can be better tackled with a tuned approach.
Although the course of action taken here may initially seem a restriction, there is a myriad of different DTV systems around the world that can benefit from its use, which is a significant result: ISDB-T, ATSC, DVB-T, digital terrestrial multimedia broadcast (DTMB) [78], digital multimedia broadcasting (DMB) [79], and SBTVD, at least, as many building blocks across them are the same.In addition, the proposed methodology may also be applied to satellite and cable networks, such as digital video broadcasting satellite (DVB-S) and digital video broadcasting cable (DVB-C) [3].It can also be applied to their second-generation counterparts (DVB-S2 and DVB-C2) [80,81], as they also employ TSs as the transport layer and depend on configuration sent by head-end equipment.Moreover, other systems with similar aspects may also benefit from such a methodology and then have a verification approach with the same essence and basic structures adapted to their contexts.Examples of that are digital radio (DR) systems, such as digital audio broadcasting (DAB) and digital radio mondiale (DRM) [82].This way, as long as signal broadcasting with transmission configuration performed at head-end equipment (e.g. by technicians) that is decoded by receivers in user premises takes place, regardless of the underlying transport layer, the basic methodology employed here can be used.
The following section will introduce the proposed method, based on what was presented, which includes an overall scheme that goes far beyond fuzzing-based test-case generation.It will also indicate the part that includes fuzzing to explain the link with the characteristics raised here correctly.

| THE PROPOSED METHODOLOGY FOR NON-COMPLIANCE TEST EXECUTION BASED ON FUZZING
Section 3 describes the wrong configuration problem, which is often disregarded and may cause completely unexpected behaviours in DTV receivers.This problem happens since there is an understanding that broadcasters ensure correct DTV signals.However, mistakes do occur and may cause substantial losses.Moreover, the resulting considerations provided the basis for a robustness testing methodology to create non-compliant TSs for revealing fragile code and handling strategies in DTV receivers, as follows: • receivers often operate under non-compliant conditions, which appear in a somewhat random fashion, leading to fuzzing-based tests that submit them to those scenarios; • there must be restrictions regarding issues and affected systems to provide a feasible implementation, which leads to a guided approach; • an automated scheme is preferred, given the huge amount of test cases that may be created; • a methodology should not be restricted to proprietary interfaces in order to be widely used and improved; • a procedure should monitor what a user has access to, which leads to audio and video interfaces, in order to avoid errors that are naturally concealed or irrelevant; • the proposed approach should focus on audio and video absence, video freezing, video flickering, artefacts, frame skipping and audio discontinuity; and • the mentioned issues can be detected and are also the main results of almost all problems identified thus far (cf.Section 3).

| General structure targeting DTV receiver testing
This paper proposes a novel robustness evaluation methodology for DTV receivers, with a test-generation core that performs grammar-based guided fuzzing to increase device robustness without compromising user experience.Moreover, in order to provide evaluation for such elements, it also relies on a structured test environment, which is shown in Figure 4.According to a given DTV physical layer, a personal computer (PC), equipped with a controlling-software module, stores and loads MPEG TSs for transmission to devices under test.Besides, its first instance could be the SBTVD's one (i.e.ISDB-T) [8,22].Indeed, it is essential to use a proper interface, so its reception chain is used and problems due to interrelation among different modules are then stimulated and identified.Meanwhile, the same PC software configures a DTV receiver under test through an infrared interface, which then tunes to a service that provides configuration.This way, problems in a subsystem not strictly related to decoding, reception, or configuration but also caused by them may be tackled, such as channel navigation, EPG construction, and CC activation, due to an execution that follows what happens in real operation.
In order to provide stable TSs capable of being correctly transmitted, received, and analysed, they must have a minimum duration to accommodate tuning, pre-rolling, and signal stabilization, which is based on empirical evidence.Indeed, some transitory period is noticed after tuning, which may even take 2-3 s.While signals are decoded, video and audio are captured by a camera and a microphone.Besides, this camera may not be high-quality because the main goal is to identify easily-noticeable problems and general malfunction.The most crucial aspects are correct time resolution for data acquisition and relevant reference data to favour problem identification.
It is essential to mention that one can still argue that audio and video could be directly captured.In this regard, we have many situations, but we will discuss just some of them.First, many TVs do not present video outputs as they are terminal points and are mainly used for viewing.Although some present outputs for monitoring purposes, that can not be taken for granted.Moreover, signals output from those interfaces suffers additional encoding, for example, composite broadcast video signal (CVBS) and YC B C R [2].Audio outputs, in turn, are frequently present due to a simple need: audio enhancement via external sound systems.Even so, that is not always the case for low-end devices.In addition, if some direct interface with processing boards is used, such an approach would compromise device assessment, mainly when that kind of access is unavailable for non-vendors.Consequently, it is interesting to capture signals from the interfaces surely available to users, that is, screen and speakers, which makes it possible to apply the developed tool to every existing device.
Next, captured media is sent to the same PC software (see Figure 4), which uses image and audio processing algorithms to detect incorrect behaviour.In this regard, such algorithms should be continuously executed during an entire testing procedure, given that errors may happen at any moment.Finally, when a problem is identified (e.g.video freezing or audio absence), its presence is indicated, stored, visually informed, and then added to a log file, which can later be evaluated and also included in a test report.
Audio and video assessment is a challenging task that is highly dependent on content, which may be favoured by predetermined signals where differences and deviations are promptly noticed.The chosen video sequence should favour identifying the mentioned problems, including video freezing, video flickering, artefacts and frame skipping.Moreover, a suitable audio sequence could also quickly reveal absence and discontinuity.
In order to define suitable audio and video sequences, some analysis was carried out.SBTVD defines, for fixed reception, that video must be encoded in H.264, restricted to profile high and level 4.0 [25,69] and with maximum resolution of 1920 Â 1080, at 29.97 Hz, or 1280 Â 720, at 59.94 Hz [25].Besides, an empirical evaluation revealed that video freezing might involve as few as two consecutive frames with the same image, while flickering may be restricted to only one incorrect black or white frame between two correct ones.Artefacts may affect only a small part of a frame or Diagram of the test environment employed by the proposed evaluation methodology include many consecutive ones, and frame skipping may involve the loss of many frames, usually less than 1 s apart, which depends on a root problem.
The chosen video sequence should provide a noticeable difference between two consecutive frames, defined as a normalized cross-correlation coefficient [72,83] lower than 0.9, which can be recognized by regular image comparison techniques [83], in frames 1/59.94s apart in time.Besides, such a difference should ideally not present a correlation lower than 0.6, which could compromise the identification of frame skipping or problems with long occurrence periods.As a result, a notification should be triggered when the correlation between two adjacent frames is higher than 0.9.The specific computation for the correlation metric is given by where the i-th cross-correlation coefficient R i is computed through a current captured frame Im i and a previous one Im iÀ1 , both with dimensions M Â N.
The predefined video should have a dynamic structure that provides the expected correlation, with a repetition period higher than 2 s.High correlation must also be avoided, which could mask a problem.In fact, such findings (repetition period and correlation) arose empirically, based on video freezing and flickering, artefacts and frame skipping events noticed in the problems in Section 3.After extensive testing, a compound moving structure with a red square and black circles rotating around a screen's centre was chosen, as illustrated in Figure 5.The red square and black circles rotate around a small central circle, with a period of 2.1 s.If no video exists, correlation is high (>0.95), while freezing also results in a high correlation among frames.Artefacts and flickering cause a noticeable difference between 0.7 and 0.5, and finally, frame skipping usually results in a very low correlation (<0.5).Moreover, a camera capturing the output video should also be fast enough to allow problem recognition.The adopted audio strategy, in turn, is based on transmitting stereo audio with a different frequency tone for each channel: 1 kHz for the right and 5 kHz for the left channel, which provides selective identification of audio absence or discontinuity.Indeed, audio symptoms in the analysed field problems (see Section 3) were restricted to those two possibilities.
It is worth noticing that the chosen audio and video sequences will undoubtedly result in fast algorithms.Otherwise, for instance, if a video verification procedure took more than 1/59.94s, a test sequence might finish while the analysis was still running, which could also be true for audio procedures.Besides, extended analysis periods would also result in more memory requirements.At least another memory chunk of the same size would be necessary due to analysis-time overlap: one for the current and another for the next analysis procedure.That being so, memory needs would be twice the amount initially predicted.A simplified representation of the proposed methodology is shown in Algorithm 1, where eval_video_and_log_errorðÞ includes evaluation procedures for video freezing and flickering, artefacts and frame skipping, while eval_audio_and_log_errorðÞ includes evaluation procedures for audio absence and discontinuity.

| Test generation based on fuzzing
The overall structure of the proposed verification methodology was shown in Section 5.1, but there is a vital aspect still left aside: test creation.That being said, if a non-conformity is anything apart from what is specified and randomly appears, how can one systematically create tests?Indeed, this inquiry already gives a clue of how that can be done: fuzzing, which generates random inputs based on grammars for reaching specific code.In summary, we have employed grammar-based fuzzing, with a vast set of rules for specific parts of interest of a transport stream.However, it is worth noticing that there is no grammar for entirely constructing a transport stream but stead for fields of interest, which are fed to our fuzzing engine.For example, let us analyse the grammar for the field program_number in PMT sections (see Table 2).It is included in the fuzzing model developed for our study and can be described, in extended Backus-Naur form [84] Here, we aim to create a descriptor with a composition that includes values reserved for future use related to the same problem described in Section 3.5.3.In addition, as mentioned, there should be some restrictions for making the whole approach feasible and practical, as SBTVD presents a myriad of information elements with different uses and purely random data over a given structure, without considering the way it is processed, its natural limitations, and an entire TS and structure interrelation can be unacceptably longer and inefficient.For instance, the advanced audio coding (AAC) descriptor, in PMT sections, could have its parameter profile_and_level set with the wrong values [24,67].However, most DTV receivers usually ignore them, as seen in commercial platforms, and use the stream_type information in PMT sections to load the correct decoding library.Moreover, random values beyond the informed limits may also lead to no new condition, as data in TSs occupy a predefined bitwidth, and truncation often takes place.Consequently, no incapacitating problem would probably be revealed, and the resulting tests would be of no or little use.Finally, this understanding leads to another clue: some guidance or restriction during the related fuzzing strategy to tackle what is likely to reveal an existing problem.
Nonetheless, information related to previously identified problems, as presented in Section 3, parameters configured through GUIs of commercial equipment, and critical data with great adjustment flexibility (e.g.table descriptors) represent three crucial groups and recurrent causes of non-compliance.Those three aspects drive the intended grammarbased guided fuzzing approach and provide at least the initial coverage for creating tests.This way, those groups should be used as a basis for test creation since they have the potential to reveal fragility in receivers, as described in Section 3.
Figure 6 illustrates the proposed test-generation methodology with the three mentioned groups: field problems, parameters configured in commercial equipment, and critical data with high flexibility.It summarizes our approach, where problems and grammar are sent to a fuzzing machine and then used to generate robustness tests in the three mentioned groups.The latter covers part of the universe defined by the target standards, which can be increased with new data fed to the mentioned fuzzing machine.Firstly, suppose there exist other data related to a field problem.In that case, it is likely that errors in it also result in similar problems that can be identified in audio and video interfaces, with the help of the techniques exposed in Section 5.1, which was already realized during the analysis of the real field problems in Section 3.This way, data related to a previous problem may also cause malfunction due to shared structures and incorrect or fragile code, which could be considered an 'error region' around an original problem.The more one expands that area, the deeper possibly fragile structures are accessed, with a significant probability of incorrect operation (see the left part of Figure 6.Besides, technicians' parameters configured in GUIs are also sources of errors since DTV standards are extensive, and no double-check is usually performed in receivers.Once again, another 'error region' can be created around a single configuration available in a given GUI, which is even more critical when there is no value restriction.Then, sensitive and highly-flexible data are also prone to errors, given that the slightest deviation may lead receivers' routines to unexpected results (see Section 3.5.3).Indeed, DTV standards are not clear on descriptors and sometimes refer to documents created by other systems.Consequently, implementations do not usually consider all possible setups, and fragile areas are often identified.Finally, as illustrated in Figure 6, error creation, which leads to areas around field problems, sensitive data, and parameters configured in GUIs, is performed with fuzzing.New values are randomly generated with knowledge regarding those three aspects and applicable format specifications, as mentioned here, from scratch.Therefore, 'error regions' centred on errors classified in one of the three mentioned groups could be continuously expanded if the related random process continues.In addition, while new field problems are identified, new GUIs of commercial equipment are developed, and additional or previously non-used descriptors begin to be considered, new 'error regions' could then be fuzzed out.
The field problem in Section 3.2.1 is an example of the first group: SEI messages that carry AFD, closed-caption, and bar data [70,71].Hence, a fuzzer could insert errors in structure identifiers (user_identifier and user_structure), number of CC elements (cc_count), location of bars in their specific structure (bar_data), and specific AFD information (e.g.afd_data) [71], thus creating an 'error region'.Moreover, other fields in NAL units could also be fuzzed, for example, seq_parameter_set_rbsp() and pic_parameter_set_rbsp() [69].Indeed, this approach has already been confirmed: TPV/Envision Brazil analysed an occurrence in the same SEI messages, but now related to its marker bits [70,71].Besides, by that time, such a test had already been devised.

F I G U R E 6 The proposed strategy for test creation
The second non-compliance group can be represented by the information sent in the field stream_type of PMT sections [23,24], as in Section 3.5.1.It is usually configured in GUIs of head-end equipment, as illustrated in Figure 7, which was reproduced from real equipment: Video Stream Type and Audio Stream Type can be freely set.The configured PID is recorded in the correct fields.However, no check regarding formats is performed.Other examples include application-profile information, such as full-seg profile A -FSA [85], and component_tag for media and object-carousel streams [20].
Descriptors [24,67], which are dynamic information in PSI/SI tables [23,24], are natural members of the third group: sensitive data with extensive configuration.For instance, services can be informed in three distinct locations: PAT, SDT, and NIT, the latter providing classification (e.g.partial reception) and transmission parameters.Thus, since such data are used to store and classify services, fuzzing them may compromise device operation.Besides, services reported in those three tables should not present different numbers.
Following this methodology, the regions illustrated in Figure 6 may grow around an initial problem, depending on evaluation needs.However, tests created through the mentioned steps are expected to cover many non-compliance types.Besides, such an evaluation system can be enhanced over time while new field problems are reported and sensitive data are tackled.Indeed, the proposed methodology gives rise to a system that is not born in its final form but evolves with time and can result in a comprehensive set of test cases.The following principles should also be considered: • non-conformity tests in subsystem-oriented groups, such as PSI/SI; audio and video; and Ginga; • only one non-compliance per TS so that the cause of malfunctions is isolated and identified; • RF transmission is employed to use the reception chain.
Finally, an important notice: any evaluation should start later and end earlier, compared to a TS's life cycle, to avoid instability.

| Tools for video analysis
Regarding video analysis, it is interesting to clarify some points.Video freezing, video flickering, artefacts and frame skipping should be continuously checked with image processing algorithms during a given test.The necessary techniques for such tasks are simple and can be implemented through known techniques, such as histograms, correlation [72,83] and structural similarity index (SSIM) evaluation [86,87].Indeed, the proposed image analysis algorithm employs, to some extent, freezing, flickering, artefacts frame skipping identification, normalized cross-correlation, and SSIM.In summary, each algorithm's results are suitably combined to produce a robust merit figure.Finally, a camera must be in front of a TV display, which should be recognized to select only its screen, where video symptoms happen.
Consequently, a simple screen detection was devised based on known image processing algorithms and illustrated in Figure 8.Initially, input images are re-scaled [83], given that the only goal is to detect a screen without specific object segmentation or more sophisticated evaluation.Then, some pre-processing is carried out, aiming at noise reduction [83].Lastly, the developed algorithm tries to find a contour delimiting a screen, with the expected aspect ratio and shape, through edge detection [88] and corner identification, that is, the latter being implemented through simple line intersection procedures.If that is the case for the current contour, it defines the screen area to be analysed and, finally, tangential distortion is corrected [89], if present; otherwise, other contours are verified in search of four corners of a suitable rectangle, that is, the one with correct aspect ratio and overall shape.Moreover, a camera should stay fixed in front of a TV display during a test sequence.Thus, once a TV screen is recognized, which happens only at the beginning of a test sequence, the same screen region is used for each non-conformity test until the last.

| Tools for audio analysis
Once again, simple techniques can be used to provide detection of audio absence and discontinuities.In the present case, finite impulse response (FIR) filters [72] tuned to the chosen frequencies (i.e. 1 and 5 kHz) can be used, whose outputs are continuously monitored through analysis of epochs of 20ms, which are sampled at 44.1 kHz.This way, no filter output during an entire test means absence, while periods of no output mean discontinuity.Besides, a more straightforward approach could be adopted based on audio amplitude, which undoubtedly requires noise floor calibration.Finally, in a more sophisticated analysis, tones distributed across the available frequency band could also be employed, as already mentioned.

| The complete evaluation procedure
The complete evaluation procedure includes receiver setup, signal transmission, channel acquisition, content decoding, video and audio evaluation and report provision, as follows: 1. a camera and a microphone are correctly placed; The algorithm developed for detecting a TV's screen 2. all devices are fixed and then the screen detection algorithm is run to define the respective screen's position; 3. a test sequence is loaded, along with the expected evaluation for each one, that is, video and/or audio; 4. the device is configured with infrared commands; 5. a dummy transmission is sent to allow a DTV receiver to scan and store the intended channel; 6. a specific test, with fuzzed data, is transmitted and the requested evaluation procedures are performed during the entire transmission; 7. the corresponding results are presented and stored; 8. steps 6 and 7 are repeated, until the last test; and 9. after the last test, a complete report is issued, informing failure or success, along with results for each evaluation.
Finally, it is worth noticing that, in our approach, we first perform fuzzing and then store the resulting (broken) TSs for transmission.Although on-the-fly TS creation is possible, online multiplexing may primarily increase testing periods with no readily identifiable gain.In addition, TSs should also be available during correction and analysis phases as resources employed for error identification and handling.

| EXPERIMENTAL EVALUATION
In order to evaluate the proposed methodology, a complete implementation was carried out, including testing environment, generation of non-conformant TSs (fuzzing), and experiment execution.In addition, to the best of the authors' knowledge, this is the first tool based on fuzzing that was developed for verifying non-conformance in DTV receivers.Moreover, it aims at non-compliance and configuration errors, which were never tackled.Consequently, there are no candidates we can directly compare with, in terms of performance, and no recent studies provide a clear research direction.We could even provide a simple comparison regarding fuzzing machines, using a general-purpose tool.However, with the present work, we hope to motivate more studies focused on robustness in DTV platforms and related networks.
In this regard, we are mainly interested in the following research questions: RQ1 (application feasibility) Can the proposed evaluation methodology be used in practical evaluation processes and scenarios?RQ2 (commercial platform evaluation) Can the proposed evaluation methodology provide a profile regarding the current installed-receivers base?
Regarding RQ1, we are interested in determining if the proposed methodology can be applied to real devices, which can be unfolded into two aspects: feasibility related to the evaluation of commercial platforms and its effectiveness.Besides, RQ2 is related to its result towards broadcasters, that is, if the condition of the available commercial platforms can be accessed and considered during head-end configuration.

| Description of the non-compliance test tool
The proposed methodology was implemented in Python and C++, running on a PC with Ubuntu 16.04.The resulting software can be divided into two parts: TS generation, according to Section 5.2, and automated test tool, which implements the control software illustrated in Figure 4.
The multiplexing of non-compliant TSs was implemented with OpenCaster [90], where PSI/SI tables and descriptors were coded (and fuzzed) in Python.As a result, their fields can be modified according to our fuzzing-based test-creation strategy.Besides, structures in audio and video ESs were fuzzed, including wrong profiles, levels, and descriptions regarding several entities, such as LATM audio element headers [74] and H.264 NAL unit headers and slice types [69].Moreover, PCR and PTS information and PES headers were also fuzzed.Next, audio, video, synchronization information, interactive application, and general data were multiplexed.One may notice that tests involve fuzzed Ginga applications and related structures, including digital storage media command and control (DSM-CC) carousels and associated signalling.Three categories were created: PSI/SI tables, audio and video, and Ginga, with tests according to the three groups mentioned in Section 5.
The complete evaluation tool is composed of modules that implement the proposed methodology.The image processing module is responsible for detecting and isolating a TV screen captured by a webcam, which records video at approximately 60 frames per second (FPS) and analyses freezing, flickering, artefacts and frame skipping.Freezing is detected using SSIM [86] to recognize frame sequences with a correlation larger than 95%.Flickering detection uses a combination of SSIM and normalized cross-correlation, while artefacts and frame skipping expect correlation in the range 50%-70% and lower than 50%, that is, 0:5 ≤ R i ≤ 0:7 and R i < 0:5, respectively.Finally, all evaluation algorithms were implemented with OpenCV [91] and some proprietary approaches specifically developed for them.
The transmission module was implemented with two devices DekTec DTU-215 [92] to emulate channel change, channel-associated information, and adjacent-channel reception.The remote control module uses an infrared transmitter managed by the Linux infrared remote control (LIRC) [93].
The audio analysis module uses FIR filters tuned to 1 and 5 kHz and amplitude-based algorithms.Audio is captured with the advanced sound Linux architecture (ALSA) and then used for checking discontinuity and absence.The duration of each TS was empirically standardized in 30 s, from which the first and the last 3 s are discarded (cf.Section 5).Test cases are described in extensible markup language (XML) files [9] and fed to the developed tool.
The evaluation tool includes a configuration, control, and test-sequence monitoring GUI.There are screens for test configuration, test-sequence monitoring, and final result presentation.As an operator may only be interested in F I G U R E 9 Screen for selecting specific tests to be executed The non-nompliance test tool executing tests executing a subset of the available tests, a configuration screen was designed, where each test case can be enabled or disabled, as illustrated in Figure 9. Indeed, tests or parts of them (e.g.audio and video analysis) can be selectively activated or not, using the same GUI.
Finally, Figures 10 and 11 show snapshots of two other screens.The first identifies the tests being run and the status of a test sequence.For instance, if some crash or major problem occurs, an in-depth verification of the corresponding log file is required to isolate a group of suspect TSs.If, after a specific test, all subsequent results show aberrant behaviour, it should indicate the boundary of a test sequence for reproducing an incapacitating event.The second shows all data stored in a log file, which can further analyse a faulty execution.
The proposed evaluation methodology was implemented to assess and apply to real DTV receivers.Moreover, TPV/Envision Brazil will start using this tool to verify its receivers and provide better and more robust devices to the Brazilian market.

| Objectives
The performed evaluation used the implementation described in Section 6.1, with the following experimental goals: EG1 (application feasibility) Show that the proposed methodology can be applied to commercial platforms, without massive effort for test execution and analysis.EG2 (commercial platform evaluation) Provide an overview of commercial platforms available in the Brazilian market through the assessment of popular models from major multinational and national manufacturers.EG3 (methodology assessment) Show that the proposed methodology can submit receivers to the mentioned scenarios and consequently reveal fragile implementations.
It is worth noticing that such experimental goals are closely related to the research questions raised at the beginning of Section 6.Specifically, EG1 and EG3 intend to answer RQ1, while EG2 is directly derived from RQ2.
The manufacturers responsible for such platforms are leaders in Brazil and represent 80% of its TV market.The F I G U R E 1 1 Log screen shown at the end of a test-sequence environment illustrated in Figure 4 was assembled, and all steps described in Section 5.5 were performed.Test results are expressed in Table 2, where three categories can be identified: PSI and SI tables (PSI/SI); audio and video (A/V); and Ginga applications (Ginga).The configuration aspects tackled by each category are informed in Table 3.The first four and the last two platforms are off-the-shelf ones; however, Platf orm 5 was still under development.Such a compound intends to provide a comparison among mature and in-development platforms.
Regarding A/V, Platf orm 2 presented a fail rate of 100%, while the others provided between 10% and 17.45%, which includes errors in audio and video ESs.For instance, there are H.264 streams with profile high and level 4.0, in their headers, although being encoded with high 10@5.0[69].Besides, there are many other tests regarding SEI, wrong structures, incorrect sequence, and picture headers [69,70].Audio streams also contain errors, such as the wrong number of channels, frequencies, and even compression tools (e.g.incorrect indication of spectral band replication) [74].Such a result for Platf orm 2, produced by a high-quality multinational manufacturer, shows its fragility, which is particularly important when faulty commercial equipment is used.One may also argue that some early tests may have caused this, but that was not the case.Platf orm 7 is also produced by a multinational manufacturer, but it is still different from what was provided by the others, that is, an average of 11.28%.Moreover, manufacturers of Platf orms 2 and 7 represent around 7% of the entire Brazilian TV market.
Platf orm 2 expects flawless ESs, which are not very common in real environments.The other platforms' failure rates are somewhat expected because some inherent robustness in DTV device development usually exists.Nonetheless, there still exists room for improvement, mainly related to the most severe tests.For instance, platforms often anticipate audio issues, but problems associated with H.264 are usually left aside.Finally, even Platf orm 5, still under development, presented a failure rate for A/V of 12.08%, which can be explained with code reuse.Another significant result regards PSI/SI tests, including table parsing, descriptor decoding, and data classification and use.Indeed, source code related to PSI/SI tables is deeply revised due to its frequent use.Besides, this module is usually the primary target when a platform is customized for a DTV system.In that context, Platf orms 1 and 4 presented low failure rates, which should be expected as most field problems are related to PSI/SI tables and usually lead to constant revision.Platf orm 5 presented a higher rate, which is also expected, as it is a new model still under construction and did not use mature code.Moreover, this model presents enhancements and uses new libraries, which must still be tested and revised.Platf orms 2, 3, 6 and 7 surprisingly presented fragile code, even being manufactured by companies with a long history in DTV.Platf orm 2 is a model from 2016, while Platf orms 6 and 7 were released in 2017 and will probably be used for at least six more years.Platf orm 3 was released in 2013, will probably be used for four more years, and is no longer sold.Thus, their impact seems severe, mainly because Platf orm 3's fragility may be present in other models, and its manufacturer holds a market share of around 36.6%.Besides, Platf orm 7, which presented the highest failure rate, is from a manufacturer with around 3% of market share.In summary, such tests reveal that if a DTV receiver project is not carefully executed and inherently robust, it may suffer from wrong data in head-end equipment and compromise DTV networks due to changes in transmission and new operators with sloppily configured equipment.
Finally, the lowest average failure rates regard interactive applications, which indicates a lot of development effort.Indeed, that is expected because Ginga-related standards are developed by the F orum SBTVD, a committee dedicated to DTV standardization in Brazil.Moreover, this part of the SBTVD's standards was enhanced over the last decade, which usually triggers constant review and assessment.The highest failure rates were obtained with Platf orms 5, 6, and 7, the former also going through Ginga porting.That means Ginga is being interfaced with that platform's application programming interface, libraries, hypertext markup language (HTML) version 5.0 resident engine, and zapper module (the receiver-controlling application), which usually causes temporary instability.Consequently, it is expected that the same robustness level of Platf orm 4, which is produced by the same manufacturer, is ultimately achieved.Nonetheless, the other two are commercial platforms, which may seem surprising.A possible explanation is a considerable change in the system and software stack, as is the case with Platf orm 6, which may have caused undetected non-conformity.Besides, Ginga standards had just gone a review at that time, leaving many features unclear.
Although more sophisticated algorithms were developed and employed, amplitude detection was enough for audio malfunctions.Besides, all video artefacts and frame skipping events were also detected as flickering, which means this analysis would cover all the identified problems together with the freezing one.Consequently, audio amplitude and video freezing, and flickering reports may be enough.
As one can notice from Table 2, a total of 902 tests were performed, which is not feasible with a manual approach.Indeed, on average, those 902 tests took 7.57 h: 3.35 for Ginga, 0.9 for A/V, and 3.32 for PSI/SI, with the proposed methodology.Moreover, no human operator was required, which resulted in time for development and code enhancement while waiting for new evaluation results and fulfils EG1.
In addition, the need for software testing professionals is reduced, as one only needs to prepare the test environment and run a test sequence.Moreover, professional test suites usually provide thousands of tests, which indicates that the 902 tests presented here are just an initial set.
The chosen platforms are popular devices produced by market leaders, where software reuse is standard practice.Thus, a conclusion related to EG2 is that receivers are affected mainly by non-conforming data related to PSI/SI tables and media ESs, which should be taken into account by broadcasters during equipment configuration, thus guiding review and consistency checking.Therefore, any test round followed by 'fix' phases should involve those groups, with fallback procedures and handling code also targeted on them.
In order to provide a complete picture, we have chosen two tests for further analysis.In the first, which was part of the PSI/SI category, a service of type reserved [24], that is, 0x4A when fuzzed, was transmitted, which was identified with video freezing.In this case, receivers restricted processing and storage to services correctly classified in the service descriptor [24] in SDT sections, even though the field program_number in PAT and PMT was correctly configured.This problem affected almost all test platforms, which is of paramount importance.
The second one consisted of a Ginga application in a compressed carousel with a broken (fuzzed) compression type descriptor [76], which is very similar to the one in Section 3.6.1.As a result, this application was not loaded in most test platforms since the NCL entry point was not found.Only Platf orm 1 was able to deal with it, which is very concerning.
Regarding the overall test results, As already mentioned, there is no similar tool we can use in comparison with our approach.However, in an attempt to fill this gap, we decided to compare only fuzzing machines.Indeed, even that is difficult to perform as most fuzzing tools freely available are focused on specific contexts, such as network, operating systems, or drivers [39].In that sense, general-purpose ones seem viable.Among them, we were able to identify Peach [39], which is well known, effective, and flexible.However, it is also time-consuming to build its input format file using the indicated syntax.As one may notice, entire tools can even be built over Peach as long as the necessary definitions are provided.In our case, we have created the syntax of PMT sections, as shown in Table 1, and then manually built ninety-five associated PMT sections with fuzzed data, which were later manually integrated into TSs.In contrast, we have also generated ninety-five test TSs with the tool developed here, based on the proposed methodology, and fed both sets to Platf orm 2 and Platf orm 3, resulting in Table 4.The latter presents columns with meanings similar to what is found in Table 2.Those platforms were chosen based three reasons: they are popular models from multinational manufacturers, presented many problems, and are provided by market leaders.As one can notice, from Table 4, TSs generated with the proposed methodology helped uncover more fragile parts than Peach.Indeed, that is expected due to the test-creation strategy adopted here, which is focused on the most likely ways problems occur.Again, this comparison considers only the underlying fuzzing machines, with favourable results for the proposed methodology.
It is worth mentioning that the results provided here can be promptly used for improving the evaluated platforms.Indeed, when a product is being developed, it is submitted to laboratory-created and few real DTV signals (TSs) recorded in the field (e.g.user premises and public spaces), whose content may reveal a problem during device operation.Consequently, software fixes involve two primary artefacts: problem description and associated TS, which are the same as the proposed methodology, and problem description comes from the identified behaviour in the audio or video output.However, such TSs recorded in the field are not numerous and usually contain only what is expected and dictated by standards, with no purposefully incorrect data.Consequently, as the latter is the very focus of this work, a new aspect regarding a receiver's processing chain is then tackled, and that has the potential to reveal completely unexpected behaviours.
It is also important to clarify that the proposed methodology aims to reveal fragility.That being said, we know in advance that the created TSs bear non-conforming data and are wrong from the viewpoint of what is dictated by the associated standards.This way, the target is to verify if a given receiver is ready to handle such problems due to a previous fallback procedure, unintended consequences, restricted state spaces, or will present abnormal behaviour.Moreover, all results presented here were manually confirmed, which, during our initial tests, revealed some false negatives from video evaluation but no false positives at all.Later, a brief analysis showed the need for adjustments in some correlation thresholds and the standard test duration, avoiding those occurrences.Although that is true for the current composition of the developed tool, new checks or evaluation groups may cause additional occurrences as other symptoms that do not taper down towards the known ones may happen.
Another fair question regards cost: is a new testing tool worthwhile?To answer that, some relative figures are required.In TPV/Envision, we have isolated the average yearly cost relative to the kind of field problem presented in Section 3, as provided by the after-sales department, which is around 20% of its total amount, the latter in the order of a few million dollars.Moreover, this value, around one million dollars, is higher than what was invested in developing the proposed tool.Consequently, the return on investment can come already in the first year of its use.
One question remains of how broad the evaluation performed here is, which is closely related to EG2.Indeed, the Brazilian market provides much more than the seven devices employed here.Nonetheless, although only those platforms were employed, the chosen manufacturers represent five out of the first seven market leaders and amount to 80% of the Brazilian market.Moreover, software reuse is usual, even throughout different models and brands from the same manufacturer, given that a platform maybe even employed for low-, mid-, and high-end products.Consequently, the rough estimates are reasonable, although some discrepancies may be noticed.

| RELATED WORK
As mentioned in Section 6, the present work seems to be the first study to tackle robustness assessment regarding DTV receivers, with fuzzed TSs that aim to discover fragile parts in their processing chains.Indeed, the approach presented here, that is, recognizing that configuration errors take place and that they should be anticipated so that nonconformance is purposefully inserted to evaluate and handle their consequences, is novel and has never been reported in the available literature.Consequently, it is not feasible to compare it with other studies, to define its performance relative to what currently exists.The majority of the current studies on DTV evaluation focus on conformance testing, which is usually spread across different DTV subsystems.Even so, we will provide a review of TV testing in general while trying to identify some similarities with our work.
Tekcan et al. [94] proposed a black-box testing framework with automatic test-case generation, focusing on user interaction and features accessed via GUI.Unlike the scheme proposed here, non-conformance and configuration error test cases are not addressed.Rau [95] tackled only image quality, while Belém et al. [96] restricted their method to evaluating general problems of finished elements in production lines.In turn, Souza Júnior et al. [32] developed a framework for automatic field-testing, focusing on receiver execution and user interaction when handling real signals.Again, although there are slight similarities when interactive applications are involved, their focus is entirely different.Park et al. [97], in turn, tackled different aspects: radiofrequency (RF), RF channels, and signal level.
The studies developed by Pinheiro et al. [15] and Souza Júnior et al. [21], together with the HbbTV test suite [98], for instance, tackle interactive applications and indeed present some similarity with the work proposed here.However, their focus is on conformity, as is the case with the work developed by Flores-Guridi et al. [99], which aims to provide an approval protocol for DTV receivers.Once again, no mention of transport layer, processing chain, robustness, or non-conformance was identified.Even so, Souza Júnior et al. [21] provided some results we can mention.They showed error-profile progression across scrum sprints, with an expected reduction in the corresponding number of bugs.Consequently, during development, it is clear that conformance assessments can easily reveal failure rates as high as 60%, which tends to be reduced over time towards 0, until a final software release is provided.In the case of the present paper, if only the test-group Ginga is considered, the highest found figure was 29.44%, which is surprising, given that Platf orm 6 is a commercial one, and shows the importance of such an evaluation procedure integrated within development cycles.
In summary, there are studies focused on the three main DTV layers: transmission/physical, transport, and media, the latter including audio, video, multimedia, and interactive applications.However, no identifiable work tackles robustness and non-conformance with a focus on DTV networks or related systems with similar protocols (e.g.DR systems), as noticed from the associated transport layer.Furthermore, the underlying fuzzing machine is classified as a generation one and performs better than simple fuzzing.Consequently, the present work configures as a milestone regarding DTV receiver robustness evaluation and tackles an aspect still disregarded: configuration errors originating from broadcasters.

| CONCLUSIONS
This paper proposes a novel robustness testing methodology for evaluating DTV receivers based on non-compliance tests via fuzzing to identify opportunities for receiver improvement.In particular, it presents a collection of real field problems identified in DTV networks and outlines a scheme for non-compliance insertion that performs grammarbased guided fuzzing.The resources tackled here are available in most DTV standards worldwide, and some problems were identified in different DTV systems.Consequently, the findings presented here are novel to system and software test practitioners of DTV receivers and are of general and broad applicability.
The proposed methodology neither indicates test areas nor informs the number of evaluation procedures.Such decisions may result in different implementations for detecting specific problems and tackling completely different scenarios, which also depends on the fuzzing approach itself.Thus, its implementation consists of a system that must be continuously extended.Furthermore, the specific cause of a problem is not identified, as it detects only what is noticed by users and through the available video and audio interfaces.Thus, when a problem is informed, the next step is to analyse the respective TS and the associated receiver to develop a handling procedure.
The proposed methodology was implemented, and experiments allowed us to identify robustness problems related to wrong information in several DTV subsystems, such as media ESs and service classification.The most severe problems seem to be related to media ESs, which caused a failure rate of 100% in one test platform and PSI/SI data.
A significant result resides in the snapshot of the platforms provided by multinational manufacturers.Furthermore, given the software reuse practice, the chosen receivers were fragile and could be extrapolated for many devices and markets.
Another point regards attackers' intentional manipulation of transmitted configuration.Currently, that seems even more possible as modern systems may be primarily online, which is expected, for instance, in the context of pay-TV operators.In contrast, our work is entirely based on mistakes made during equipment configuration or development errors, and the raised matter seems somewhat out of scope.However, intentional errors can also be handled by the proposed methodology as long as they occur in the context of configuration data.
Although the present work was initially focused only on DTV networks, which may seem a restriction, there is a myriad of different broadcasting systems that can benefit from it, including DTV and DR ones and also satellite and cable counterparts, such as DVB-S and DVB-C.However, DVB-S and DVB-C are mainly used in vertical markets; that is, the related operators control both ends of DTV networks: they transmit signals and also provide receivers, the latter usually using similar underlying software systems (e.g.middleware and system software).Moreover, their middleware modules are usually similar or even the same across different devices, with a more homogeneous behaviour.Consequently, they are responsible for their entire networks, a condition that ultimately speeds up correction procedures as they are highly committed to keeping a certain quality of service.DVB-T and terrestrial systems, in turn, are intended for horizontal markets, that is, different broadcasters and a myriad of independent platform manufacturers, the latter having no relation to the former, making everything more complex and delaying or even impeding any action regarding configuration correction.This way, when a problem happens in terrestrial networks, a solution will certainly come much later than in satellite or cable ones, creating a more demanding environment regarding robustness against configuration errors.This understanding thus led the present work to evaluate terrestrial networks with an implementation of the proposed methodology.However, that does not exempt satellite and cable networks from benefiting from robust devices, and the framework developed here may also be of interest to them.Moreover, regarding configuration, robustness under real operation is not often tackled in many communication systems.In contrast, the present work shows a methodology whose essence can be applied to awaken interest in such a context.In addition, it is worth noticing that practical fuzzers usually target specific areas, such as network protocols, which, together with the simplex communication of DTV systems, led to some characteristics of the proposed methodology.
As future work, other problem groups will be developed, such as generic PES structure and CC.In addition, a generic non-conformance testing tool based on the proposed methodology will be made available to enhance the robustness of DTV networks.Moreover, enhancements to the underlying fuzzer will be provided, such as gradients and machine learning algorithms that provide adaptability towards known fragile parts and subsystems.Finally, given that TPV/Envision is a member of the SBTVD's technical committee, the provision of the tool developed here for the general public is being evaluated.

F I G U R E 3
Block diagram for a DTV-oriented smart fuzzer T A B L E 1 Structure of PMT sections

F I G U R E 5
Reference video for automated analysis , as below.In this context, 'original_network_id 0 is obtained from other tables and is a terminal symbol.At the same time, service_type and service_number are non-terminals constructed with information that is not adequate for TV services or exactly ordered as specified by the underlying DTV standards.Another example is the component descriptor, mentioned in the problem field reported in Section 3.5.3and has the grammar shown below.
component_type ¼ 8 * binary_digit; binary_digit ¼ '0'j'1': T A B L E 2 Test results for an implementation of the proposed methodology, regarding seven different DTV platforms [39]e 2 shows that Platf orms 1, 2, 3, 4, 5, 6 and 7 presented failure rates of 2.22, 35.70, 19.52, 2.33, 9.76, 17.51 and 31.71,respectively.It indicates that Platf orms 1 and 4 presented the best overall T A B L E 4 Tests comparing problems uncovered with fuzzed data created with the proposed methodology and Peach[39]and should keep stable during DTV reception in Brazil, but they still need improvements.Enhancements are now being performed in TPV/Envision Brazil's DTV receivers to reduce field-problem occurrence, which fulfils EG3. results