Efficient programming paradigm for video streaming processing on TILE64 platform

The Journal of Supercomputing

Abstract

Rapid advances in computer hardware and networking technologies have made many-core computing affordable and readily available within just a few years. Nonetheless, building scalable parallel software remains a challenge for programmers. Optimizing parallel programs for a many-core platform is a multifaceted problem in which both system and architectural factors must be taken into account. In this paper, we tackle this problem by implementing parallel programs with the different available programming paradigms and evaluating application behavior on the TILE64 many-core platform. Specifically, we investigate a hybrid producer-write plus consumer-read shared-memory programming paradigm for implementing a master–worker video decoder and encoder on this platform. Experimental results show that the proposed implementation achieves competitive speedup, scales well with the number of available cores, and delivers up to a fourfold performance improvement over other implementations when decoding a sample 1080p video.
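As a rough illustration of the master–worker pattern named in the abstract, the following minimal sketch uses Python threads in place of TILE64 tiles. It is not the authors' implementation. It assumes one possible reading of the hybrid scheme: the master writes each work item into a worker-owned input buffer (producer-write), and the master later reads the results out of each worker's own output buffer (consumer-read). The names `Worker`, `run_master`, and `decode_frame` are hypothetical, and `decode_frame` is only a placeholder for real decoding work.

```python
import queue
import threading


def decode_frame(frame):
    # Hypothetical stand-in for real decoding work (assumption, not from the paper).
    return frame * 2


class Worker(threading.Thread):
    def __init__(self, wid):
        super().__init__()
        self.wid = wid
        # Worker-owned input buffer: the master (producer) writes into it.
        self.inbox = queue.Queue()
        # Worker-owned output buffer: the master (consumer) reads from it.
        self.outbox = {}
        self.lock = threading.Lock()

    def run(self):
        while True:
            item = self.inbox.get()
            if item is None:  # sentinel: no more work
                break
            idx, frame = item
            result = decode_frame(frame)
            with self.lock:
                self.outbox[idx] = result


def run_master(frames, num_workers=4):
    workers = [Worker(i) for i in range(num_workers)]
    for w in workers:
        w.start()
    # Producer-write: push each frame into a worker's local input buffer.
    for idx, frame in enumerate(frames):
        workers[idx % num_workers].inbox.put((idx, frame))
    for w in workers:
        w.inbox.put(None)
    for w in workers:
        w.join()
    # Consumer-read: pull results out of each worker's local output buffer.
    decoded = [None] * len(frames)
    for w in workers:
        with w.lock:
            for idx, result in w.outbox.items():
                decoded[idx] = result
    return decoded
```

Round-robin assignment is used here only for brevity; a real decoder would likely balance load dynamically and place buffers with the platform's memory layout in mind.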


Corresponding author

Correspondence to Yeh-Ching Chung.

About this article

Cite this article

Lin, XY., Lai, KC., Li, KC. et al. Efficient programming paradigm for video streaming processing on TILE64 platform. J Supercomput 65, 823–847 (2013). https://doi.org/10.1007/s11227-012-0867-6

  • DOI: https://doi.org/10.1007/s11227-012-0867-6
