Other Advocacy Entities

This section provides a short survey of industry advocacy and activities in support of 3DTV.

3D@Home Consortium

Recently (in 2008) the 3D@Home Consortium was formed with the mission to speed the commercialization of 3D into homes worldwide and provide the best possible viewing experience by facilitating the development of standards, roadmaps, and education for the entire 3D industry—from content, hardware, and software providers to consumers.

3D Consortium (3DC)

The 3D Consortium (3DC) aims at developing 3D stereoscopic display devices and increasing their take-up, promoting the expansion of 3D content, improving distribution, and contributing to the expansion and development of the 3D market. It was established in Japan in 2003 by five founding companies and 65 other companies, including hardware manufacturers, software vendors, content vendors, content providers, systems integrators, image producers, broadcasting agencies, and academic organizations.

European Information Society Technologies (IST) Project ‘‘Advanced Three-Dimensional Television System Technologies’’ (ATTEST)

This is a project in which industries, research centers, and universities have joined forces to design a novel, backwards-compatible, flexible, and modular broadcast 3DTV system. In contrast to former proposals, which often relied on the basic concept of “stereoscopic” video, that is, the capturing, transmission, and display of two separate video streams (one for the left eye and one for the right eye), this activity focuses on a data-in-conjunction-with-metadata approach. At the very heart of the new concept is the generation and distribution of a novel data representation format that consists of monoscopic color video and associated per-pixel depth information. From these data, one or more “virtual” views of a real-world scene can be synthesized in real time at the receiver side (i.e., in a 3DTV STB) by means of DIBR techniques. The modular architecture of the proposed system provides important features, such as backwards-compatibility with today’s 2D DTV, scalability in terms of receiver complexity, and adaptability to a wide range of different 2D and 3D displays.

3D Content Creation. For the generation of future 3D content, novel three-dimensional material is created by simultaneously capturing video and associated per-pixel depth information with an active range camera such as the so-called ZCam™ developed by 3DV Systems. Such devices usually integrate a high-speed pulsed infrared light source into a conventional broadcast TV camera, and they relate the time of flight of the emitted and reflected light waves to direct measurements of the depth of the scene. However, it seems clear that the need for sufficient high-quality, three-dimensional content can only partially be satisfied with new recordings. It will therefore be necessary (especially in the introductory phase of the new broadcast technology) to also convert already existing 2D video material into 3D using so-called “structure from motion” algorithms. In principle, such (offline or online) methods process one or more monoscopic color video sequences to (i) establish a dense set of image point correspondences from which information about the recording camera, as well as the 3D structure of the scene, can be derived, or (ii) infer approximate depth information from the relative movements of automatically tracked image segments. Whatever 3D content generation approach is used in the end, the outcome in all cases consists of regular 2D color video in European DTV format (720 × 576 luminance pels, 25 Hz, interlaced) and an accompanying depth-image sequence with the same spatiotemporal resolution. Each of these depth-images stores depth information as 8-bit gray values, with the gray level 0 specifying the furthest value and the gray level 255 defining the closest value. To translate this data representation format into real, metric depth values (required for the “virtual” view generation) and to be flexible with respect to 3D scenes with different depth characteristics, the gray values are normalized to two main depth clipping planes.
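As an illustration of this normalization, the sketch below converts between the 8-bit gray values and metric depth for given near and far clipping planes. A linear mapping is assumed here; the function names and the example clipping planes are illustrative only.

def gray_to_depth(gray: int, z_near: float, z_far: float) -> float:
    """Map an 8-bit depth-image value to metric depth.

    Convention described above: gray 255 = closest point (z_near),
    gray 0 = furthest point (z_far). A linear mapping between the
    two clipping planes is assumed; nonlinear quantization is also
    conceivable.
    """
    if not 0 <= gray <= 255:
        raise ValueError("depth values are 8-bit gray levels")
    return z_far + (gray / 255.0) * (z_near - z_far)

def depth_to_gray(z: float, z_near: float, z_far: float) -> int:
    """Inverse mapping: quantize metric depth to an 8-bit gray value."""
    g = round(255.0 * (z_far - z) / (z_far - z_near))
    return max(0, min(255, g))

# Example: clipping planes at 1 m (near) and 10 m (far)
assert gray_to_depth(255, 1.0, 10.0) == 1.0   # closest plane
assert gray_to_depth(0, 1.0, 10.0) == 10.0    # furthest plane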

3DV Coding. To provide the future 3DTV viewers with three-dimensional content, the monoscopic color video and the associated per-pixel depth information have to be compressed and transmitted over the conventional 2D DTV broadcast infrastructure. To ensure the required backwards-compatibility with existing 2D-TV STBs, the basic 2D color video has to be encoded using the standard MPEG-2, MPEG-4 Visual, or AVC tools currently required by the DVB Project in Europe.

Transmission. The DVB Project, a consortium of industries and academia responsible for the definition of today’s 2D DTV broadcast infrastructure in Europe, requires the use of the MPEG-2 systems layer specifications for the distribution of audiovisual data via cable (DVB-C), satellite (DVB-S), or terrestrial (DVB-T) transmitters.

‘‘Virtual’’ View Generation and 3D Display. At the receiver side of the proposed ATTEST system, the transmitted data is decoded in a 3DTV STB to retrieve the decompressed color video and depth-image sequences (as well as the additional metadata). From this data representation format, a DIBR algorithm generates “virtual” left- and right-eye views for the three-dimensional reproduction of a real-world scene on a stereoscopic or autostereoscopic, single- or multiple-user 3DTV display. The backwards-compatible design of the system ensures that viewers who do not want to invest in a full 3DTV set are still able to watch the two-dimensional color video without any degradation in quality using their existing digital 2DTV STBs and displays.

3D4YOU

3D4YOU is funded under the ICT Work Programme 2007–2008, a thematic priority for research and development under the specific program “Cooperation” of the Seventh Framework Programme (2007–2013). The objectives of the project are

  1. to deliver an end-to-end system for 3D high-quality media;
  2. to develop practical multi-view and depth capture techniques;
  3. to convert captured 3D content into a 3D broadcasting format;
  4. to demonstrate the viability of the format in production and over broadcast chains;
  5. to show reception of 3D content on 3D displays via the delivery chains;
  6. to assess the project results in terms of human factors via perception tests;
  7. to produce guidelines for 3D capturing to aid in the generation of 3D media production rules;
  8. to propose exploitation plans for different 3D applications.

The 3D4YOU project aims at developing the key elements of a practical 3D television system, particularly the definition of a 3D delivery format and guidelines for a 3D content creation process.

The 3D4YOU project will develop 3D capture techniques, convert captured content for broadcasting, and develop 3D coding suitable for broadcast delivery. 3D broadcasting is seen as the next major step in home entertainment. The cinema and computer games industries have already shown that there is considerable public demand for 3D content, but the special glasses that are needed limit their appeal. 3D4YOU will address the consumer market that coexists with digital cinema and computer games. The 3D4YOU project aims to pave the way for the introduction of a 3D TV system. The project will build on previous European research on 3D, such as the FP5 project ATTEST, which has enabled European organizations to become leaders in this field.

3D4YOU endeavors to establish practical 3DTV. The key success factor is 3D content. The project seeks to define a 3D delivery format and a content creation process. Establishing practical 3DTV will then be demonstrated by embedding this content creation process into a 3DTV production and delivery chain, including capture, image processing, delivery, and then display in the home. The project will adapt and improve on these elements of the chain so that every part integrates into a coherent, interoperable delivery system. A key objective of the project is to provide a 3D content format that is independent of display technology and backward compatible with 2D broadcasting. 3D images will be commonplace in mass communication in the near future. Also, several major consumer electronics companies have made demonstrations of 3DTV displays that could be in the market within two years. The public’s potential interest in 3DTV can be seen in the success of 3D movies in recent years. 3D imaging is already present in many graphics applications (architecture, mechanical design, games, cartoons, and special effects for TV and movie production).

In recent years, multi-view display technologies have appeared that improve the immersive experience of 3D imaging, leading to the vision that 3DTV or similar services might become a reality in the near future. In the United States, the number of 3D-enabled digital cinemas is rapidly growing. By 2010, about 4300 theaters are expected to be equipped with 3D digital projectors, with the number increasing every month. Also in Europe, the number of 3D theaters is growing. Several digital 3D films will surface in the months and years to come, and several prominent filmmakers have committed to making their next productions in stereo 3D. The movie industry creates a platform for 3D movies, but there is no established solution to bring these movies to the domestic market. Therefore, the next challenge is to bring these 3D productions to the living room. 2D-to-3D conversion and a flexible 3D format are important strategic areas. It has been recognized that multi-view video is a key technology that serves a wide variety of applications, including free viewpoint and 3DV applications for the home entertainment and surveillance business fields. Multi-view video coding and transmission systems are most likely to form the basis for next-generation TV broadcasting applications and facilities. Multi-view video coding will greatly improve efficiency compared with current video coding solutions that simulcast independent views. This project builds on the wealth of experience of the major players in European 3DTV and intends to bring the date of the start of 3D broadcasting a step closer by combining their expertise to define a 3D delivery format and a content creation process.

The key technical problems that currently hamper the introduction of 3DTV to the mass market are as follows:

  1. It is difficult to capture 3DV directly using current camera technology. At least two cameras need to operate simultaneously with an adjustable but known geometry; the offset of stereo cameras needs to be adjustable to capture depth both close by and far away.
  2. Stereo video (acquired with two cameras) is currently not sufficient input for glasses-free, multi-view autostereoscopic displays. The required processing, such as disparity estimation, is noise-sensitive, resulting in low 3D picture quality.
  3. 3D postproduction methods and 3DV standards are largely absent or immature.

The 3D4YOU project will tackle these three problems. For instance, a creative combination of two or three high-resolution video cameras with one or two low-resolution depth range sensors may make it possible to create 3DV of good quality without the need for an excessive investment in equipment. This is in contrast to installing, say, 100 cameras for acquisition, where the expense may hamper the introduction of such a system.

Developing tools for conversion of 3D formats will stimulate content creation companies to produce 3DV content at acceptable cost. The cost at which 3DV should be produced for commercial operation is not yet known. However, 3DV production currently requires almost per-frame user interaction, which is certainly unacceptable. This immediately indicates the issue that needs to be solved: currently, fully automated generation of high-quality 3DV is difficult; in the future it needs to be fully automatic, or at least semi-automatic with an acceptable minimum of manual supervision during postproduction. 3D4YOU will research how to convert 3D content into a 3D broadcasting format and prove the viability of the format in production and over broadcast chains.

Once 3DV production becomes commercially attractive because acquisition techniques and standards have matured, this will impact the activities of content producers, broadcasters, and telecom companies. As a result, these companies may adopt new techniques for video production simply because the output needs to be in 3D. Also, new companies could be founded that focus on acquiring 3DV and preparing it for postproduction. Here, there is room for differentiation since, for instance, the acquisition of a sports event will require large baselines between cameras and real-time transmission, whereas the shooting of narrative stories will require both small and large baselines and allows some manual postproduction for achieving optimal quality. These activities will require new equipment (or a creative combination of existing equipment) and new expertise.

3D4YOU will develop practical multi-view and depth capture techniques. Currently, the stereo video format is the de facto 3D standard used by the cinemas. Stereo acquisition may, for this reason, become widespread as an acquisition technique. Cinemas operate with glasses-based systems and can therefore use a theater-specific stereo format. This is not the case for the glasses-free autostereoscopic 3DTV that 3D4YOU foresees for the home. To allow glasses-free viewing by multiple people at home, a wide baseline is needed to cover the total range of viewing angles. The current stereo video that is intended for the cinema will need considerable postproduction to be suitable for viewing on a multi-view autostereoscopic display. Producing visual content will therefore become more complex and may provide new opportunities for companies currently active in (3D) movie postproduction. According to the Networked and Electronic Media (NEM) Strategic Research Agenda, multi-view coding will form the basis for next-generation TV broadcast applications. Multi-view video has the advantage that it can serve different purposes. On the one hand, the multi-view input can be used for 3DTV. On the other hand, it can be shown on a normal TV where the viewer can select his or her preferred viewpoint of the action. Of course, a combination is possible where the viewer selects his or her preferred viewpoint on a 3DTV. However, multi-view acquisition with, for example, 30 views will require 30 cameras to operate simultaneously. This initially requires a large investment. 3D4YOU therefore sees a gradual transition from stereo capture to systems with many views. 3D4YOU will investigate a mixture of 3DV acquisition techniques to produce an extended center view plus depth format (possibly with one or two extra views) that is, in principle, easier to produce, edit, and distribute. The success of such a simpler format relies on the ease (read cost!) at which it can be produced. One can conclude that the introduction of 3DTV to the mass market is hampered by (i) the lack of high-quality 3DV content; (ii) the lack of suitable 3D formats; and (iii) the lack of appropriate format conversion techniques. The variety of new distribution media further complicates this.

Hence, one can identify the following major challenges that are expected to be overcome by the project:

  1. Video Acquisition for 3D Content: Here, the practicalities of multi-view and depth capture techniques are of primary importance; the challenge is to find the trade-offs, such as the number of views to be recorded, and how to optimally integrate depth capture with multi-view capture. A further challenge is to define which shooting styles are most appropriate.
  2. Conversion of Captured Multi-View Video to a 3D Broadcasting Format: The captured format needs new postproduction tools (like enhancement and regularization of depth maps or editing, mixing, fading, and compositing of V+D representations from different sources) and a conversion step generating a suitable transmission format that is compatible with used postproduction formats before the 3D content can be broadcast and displayed.
  3. Coding Schemes for Compression and Transmission: A last challenge is to provide suitable coding schemes for compression and transmission that are based on the 3D broadcasting format under study and to demonstrate their feasibility in field trials under real distribution conditions.

By addressing these three challenges from an end-to-end systems point of view, the 3D4YOU project aims to pave the way to the definition of a 3D TV system suitable for a series of applications. Different requirements could be set depending on the application, but the basic underlying technologies (capture, format, and encoding) should maintain as much commonality as possible so as to favor the emergence of an industry based on those technologies.

3DPHONE

The 3DPHONE project aims to develop technologies and core applications enabling a new level of user experience by developing an end-to-end, all-3D imaging mobile phone. Its aim is to have all fundamental functions of the phone—media display, User Interface (UI), and personal information management (PIM) applications—realized in 3D. The project will develop techniques for an all-3D phone experience: mobile stereoscopic video, 3D UIs, 3D capture/content creation, compression, rendering, and 3D display. It will also carry out research and development of algorithms for 3D audiovisual applications, including personal communication, 3D visualization, and content management.

The 3DPhone Project started on February 11, 2008. The duration of the project is 3 years, and there are six participants from Turkey, Germany, Hungary, Spain, and Finland. The partners are Bilkent University, Fraunhofer, Holografika, TAT, Telefonica, and University of Helsinki. 3DPhone is funded by the European Community’s ICT programme under the Seventh Framework Programme.

The goal is to enable users to

  • capture memories in 3D and communicate with others in 3D virtual spaces;
  • interact with their device and applications in 3D;
  • manage their personal media content in 3D.

The expected outcome will be simpler use and a more personalized look and feel. The project will bring state-of-the-art advances in mobile 3D technologies with the following activities:

  • A mobile hardware and software platform will be implemented with both 3D image capture and 3D display capability, featuring both 3D displays and multiple cameras. The project will evaluate different 3D display and capture solutions and will implement the most suitable solution for hardware–software integration.
  • UIs and applications that will capitalize on the 3D autostereoscopic illusion in the mobile handheld environment will be developed. The project will design and implement 3D and zoomable UI metaphors suitable for autostereoscopic displays.
  • End-to-end 3DV algorithms and 3D data representation formats, targeted at 3D recording, 3D playback, and real-time 3DV communication, will be investigated and implemented.
  • Ergonomics and experience testing to measure any possible negative symptoms, such as eye strain created by stereoscopic content, will be performed. The project will research ergonomic conditions specific to the mobile handheld usage: in particular, the small screen, one hand holding the device, absence of complete keyboard, and limited input modalities.

In summary, the general requirements on 3DV algorithms on mobile phones are as follows:

  • low power consumption,
  • low complexity of algorithms,
  • limited memory/storage for both RAM and mass storage,
  • low memory bandwidth,
  • low video resolution,
  • limited data transmission rates and limited bitrates for 3DV signal.

These strong restrictions, derived from terminal capabilities and from transmission bandwidth limitations, usually result in relatively simple video processing algorithms running on mobile phone devices. Typically, video coding standards take care of this by specific profiles and levels that only use a restricted and simple set of video coding algorithms and low-resolution video. The H.264/AVC Baseline Profile, for instance, uses only a simple subset of the rich video coding algorithms that the standard provides in general. For 3DV, the equivalent of such a low-complexity baseline profile for mobile phone devices still needs to be defined and developed. Obvious requirements of video processing and coding apply for 3DV on mobile phones as well, such as

  • high coding efficiency (taking bitrate and quality into account);
  • requirements specific for 3DV that apply for 3DV algorithms on mobile phones including
    • flexibility with regard to different 3D display types,
    • flexibility for individual adjustment of 3D impression.

Moving Picture Experts Group (MPEG)

Overview

MPEG is a working group of ISO/IEC in charge of the development of standards for coded representation of digital audio and video and related data. Established in 1988, the group produces standards that help the industry offer end users an ever more enjoyable digital media experience. In its 21 years of activity, MPEG has developed a substantive portfolio of technologies that have created an industry worth several hundred billion USD. MPEG is currently interested in 3DV in general and 3DTV in particular. Any broad success of 3DTV/3DV will likely depend on the development and industrial acceptance of MPEG standards; MPEG is the premier organization worldwide for video encoding, and the list of standards that have been produced in recent years is as follows:

MPEG-1 The standard on which such products as video CD and MP3 are based

MPEG-2 The standard on which such products as digital television set-top boxes and DVDs are based

MPEG-4 The standard for multimedia for the fixed and mobile web

MPEG-7 The standard for description and search of audio and visual content

MPEG-21 The multimedia framework

MPEG-A The standard providing application-specific formats by integrating multiple MPEG technologies

MPEG-B A collection of systems-specific standards

MPEG-C A collection of video-specific standards

MPEG-D A collection of audio-specific standards

MPEG-E A standard (M3W) providing support to download and execute multimedia applications

MPEG-M A standard (MXM) for packaging and reusability of MPEG technologies

MPEG-U A standard for rich media user interface

MPEG-V A standard for interchange with virtual worlds

The table ‘‘Activities of MPEG Groups in the Area of Video’’ provides a more detailed listing of the activities of MPEG groups in the area of video.

Completed Work

As we have seen in other parts of this text, there are currently a number of different 3DV formats (either already available and/or under investigation), typically related to specific types of displays (e.g., classical two-view stereo video, multi-view video with more than two views, V+D, MV+D, and layered depth video). Efficient compression is crucial for 3DV applications, and a plethora of compression and coding algorithms are either already available and/or under investigation for the different 3DV formats (some of these are standardized, e.g., by MPEG; others are proprietary). A generic, flexible, and efficient 3DV format that can serve a range of different 3DV systems (including mobile phones) is currently being investigated by MPEG.

As we noted earlier in this text, MPEG standards already support 3DV based on V+D. In 2007, MPEG specified a container format, ‘‘ISO/IEC 23002-3 Representation of Auxiliary Video and Supplemental Information’’ (also known as MPEG-C Part 3), that can be utilized for V+D data. Transport of these data is defined in a separate MPEG systems specification, ‘‘ISO/IEC 13818-1:2003 Carriage of Auxiliary Data.’’

In 2008, ISO approved a new 3DV project under ISO/IEC JTC1/SC29/WG11 (ISO/IEC JTC1/SC29/WG11, MPEG2008/N9784). The JVT of ITU-T and MPEG has devoted its recent efforts to extending the widely deployed H.264/AVC standard with MVC to support MV+D (and also V+D). MVC allows the construction of bitstreams that represent multiple views. The MPEG standard that emerged, MVC, provides good robustness and compression performance for delivering 3DV by taking into account the inter-view dependencies of the different visual channels. In addition, its backwards-compatibility with H.264/AVC codecs makes it widely interoperable in environments having both 2D- and 3D-capable devices. MVC supports an MV+D (and also V+D) encoded representation inside the MPEG-2 transport stream. The MVC standard was developed by the JVT of ISO/IEC MPEG
and the ITU-T Video Coding Experts Group (VCEG; ISO/IEC JTC1/SC29/WG11 and ITU-T SG16 Q.6). MVC was originally an addition to the H.264/MPEG-4 AVC video compression standard that enables efficient encoding of sequences captured simultaneously from multiple cameras using a single video stream.

At press time, MVC was the most efficient way for stereo and multi-view video coding; for two views, the performance achieved by the H.264/AVC Stereo SEI message and by MVC is similar. MVC is also expected to become a new MPEG video coding standard for the realization of future video applications such as 3DTV and FTV. The MVC group in the JVT has chosen the H.264/AVC-based MVC method as the MVC reference model, since this method showed better coding efficiency than H.264/AVC simulcast coding and the other methods that were submitted in response to the call for proposals made by MPEG.

New Initiatives

ISO MPEG has already developed a suite of international standards to support 3D services and devices, and in 2009 it initiated a new phase of standardization to be completed by 2011:

  • One objective is to enable stereo devices to cope with varying display types and sizes, and different viewing preferences. This includes the ability to vary the baseline distance for stereo video to adjust the depth perception that could help to avoid fatigue and other viewing discomforts.
  • MPEG also envisions that high-quality autostereoscopic displays will enter the consumer market in the next few years. Since it is difficult to directly provide all the necessary views due to production and transmission constraints, a new format is needed to enable the generation of many high-quality views from a limited amount of input data such as stereo and depth.

ISO’s vision is now a new 3DV format that goes beyond the capabilities of existing standards to enable both advanced stereoscopic display processing and improved support for autostereoscopic N-view displays, while enabling interoperable 3D services. The new 3DV standard aims to improve the rendering capability of the 2D+Depth format while reducing bitrate requirements relative to existing standards, as noted earlier in this section.

3DV supports new types of audiovisual systems that allow users to view videos of the real 3D space from different user viewpoints. In an advanced application of 3DV, denoted as FTV, a user can set the viewpoint to an almost arbitrary location and direction that can be static, change abruptly, or vary continuously, within the limits that are given by the available camera setup. Similarly, the audio listening point is changed accordingly. The first phase of 3DV development is expected to support advanced 3D displays, where M dense views must be generated from a sparse set of K transmitted views (typically K ≤ 3) with associated depth data. The allowable range of view synthesis will be relatively narrow (20◦ view angle from leftmost to rightmost view).

Figure 6.1 Example of an FTV system and data format.

The MPEG initiative notes that 3DV is a standard that targets serving a variety of 3D displays. It is the first phase of FTV, which is a new framework that includes a coded representation for multi-view video and depth information to support the generation of high-quality intermediate views at the receiver. This enables free viewpoint functionality and view generation for automultiscopic displays [7].
Figure 6.1 shows an example of an FTV system that transmits multi-view video with depth information. The content may be produced in a number of ways; for example, with multicamera setup, depth cameras or 2D/3D conversion processes. At the receiver, DIBR could be performed to project the signal to various types of displays.

The first focus (phase) of ISO/MPEG standardization for FTV is 3DV [8]. This means video for 3D displays. Such displays present N views (e.g., N = 9) simultaneously to the user (Fig. 6.2). For efficiency reasons, only a lower number K of views (K = 1, 2, 3) shall be transmitted. For those K views, additional depth data shall be provided. At the receiver side, the N views to be displayed are generated from the K transmitted views with depth by DIBR. This is illustrated in Fig. 6.2.

This application scenario imposes specific constraints, such as narrow-angle acquisition (<20◦). Also, there should be no need (for cost reasons) for geometric rectification at the receiver side, meaning that if any rectification is needed at all, it should be performed on the input views at the encoder side.

Figure 6.2 Example of generating nine output views (N = 9) out of three input views with depth (K = 3).

Some multi-view displays are, for example, based on LCD screens with a sheet of transparent lenses in front. This sheet sends different views to each eye, so a person sees two different views; this gives the person a stereoscopic viewing experience. The stereoscopic capabilities of these multi-view displays are limited by the resolution of the LCD screen (currently 1920 × 1080). For example, for a nine-view system where the cone of nine views is 10◦ (Cone Angle—CA), objects are limited to ±10% (Object Range—OR) of the screen width to appear in front of or behind the screen. Both OR and CA will improve with time (determined by economics) as the number of pixels of the LCD screen goes up.

In addition, other types of stereo displays are now appearing on the market in large numbers. The ability to generate output views at arbitrary positions at the receiver is attractive even in the case of N = 2 (i.e., a simple stereo display). If, for example, the material has been produced for a large cinema theater, direct usage of that stereo signal (two fixed views) with relatively small home-sized 3D displays will yield a very different stereoscopic viewing experience (e.g., a strongly reduced depth effect). With a 3DV signal as illustrated in Fig. 6.3, a new stereo pair can be generated that is optimized for the given 3D display.

Example of lenticular autostereoscopic display requiring nine views (N = 9).

With a different initiative, ISO previously looked at auxiliary video data representations. The purpose of ISO/IEC 23002-3 Auxiliary Video Data Representations is to support all those applications where additional data needs to be efficiently attached to the individual pixels of a regular video. ISO/IEC 23002-3 describes how this can be achieved in a generic way by making use of existing (and even future) video codecs available within MPEG. A good example of an application that requires additional information associated with the individual pixels of a regular (2D) video stream is stereoscopic video presented on an autostereoscopic single- or multiple-user display. At the MPEG meeting in Nice, France (October 2005), the arrival of such displays on the market had been stressed, and several of them were even shown and demonstrated. Because different display realizations vary largely in (i) the number of views that are represented and (ii) the maximum parallax that can be supported, an input format is required that is flexible enough to drive all possible variants. This can be achieved by supplying depth or parallax values with each pixel of a regular video stream, and by generating the required stereoscopic views at the receiver side. The standardization of a common depth/parallax format within ISO/IEC 23002-3 Auxiliary Video Data Representations will thus enable interoperability between content providers, broadcasters, and display manufacturers. ISO/IEC 23002-3 is flexible enough to easily add other types of auxiliary video data in the future. One example could be the annotation of regular video coming from a regular camera with temperature maps coming from an infrared camera.

The Auxiliary Video Data format defined in ISO/IEC 23002-3 consists of an array of N-bit values that are associated with the individual pixels of a regular video stream. These data can be compressed like conventional luminance signals using already existing (and even future) MPEG video codecs. The format allows for optional subsampling of the auxiliary data in both the spatial and temporal domains. This can be beneficial depending on the particular application and its requirements, allowing for very low bitrates for the auxiliary data. The specification is very flexible in the sense that it defines a new 8-bit code word aux_video_type that specifies the type of the associated data; for example, currently a value of 0x10 signals a depth map and a value of 0x11 signals a parallax map. New values for additional data representations can easily be added to fulfill future demands.
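As a small illustration, the sketch below shows how a receiver might dispatch on the aux_video_type code word. The two code-word values are those cited above; the function and the surrounding data layout are hypothetical, since the standard defines the signaling, not a particular API.

# Hypothetical dispatcher for ISO/IEC 23002-3 auxiliary video data.
AUX_DEPTH_MAP = 0x10     # per-pixel depth values (per the text above)
AUX_PARALLAX_MAP = 0x11  # per-pixel parallax values

def interpret_aux_plane(aux_video_type: int, plane):
    """Decide how an auxiliary plane (an array of N-bit values coded
    like a luminance signal) should be interpreted by the application."""
    if aux_video_type == AUX_DEPTH_MAP:
        return ("depth", plane)
    if aux_video_type == AUX_PARALLAX_MAP:
        return ("parallax", plane)
    # Other values are reserved for future data representations;
    # a receiver should skip rather than reject them.
    return ("unknown", None)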

The transport of auxiliary video data within an MPEG-2 transport or program stream is defined in an amendment to the MPEG-2 systems standard. It specifies new stream_id_extension and stream_type values that are used to signal an auxiliary video data stream. An additional auxiliary_video_data_descriptor is utilized in order to convey in more detail how the data should be interpreted by the application that uses them. Metadata associated with the auxiliary data are carried at the system level, allowing the use of unmodified video codecs (no need to modify silicon).

In conclusion, ISO/IEC 23002-3 Auxiliary Video Data Representations provides a reasonably efficient approach for attaching additional information, such as depth values and parallax values, to the individual pixels of a regular video stream, and for signaling how these associated data should be interpreted by the application that uses them.

More Advanced Methods

Other methods have been discussed in the industry, known generally as 2D in conjunction with metadata (2D + M). The basic concept here is to transmit 2D images and to capture the stereoscopic data from the “other eye” image in the form of an additional package, the metadata; the metadata is transmitted as part of the video stream (Fig. 3.12). This approach is consistent with MPEG multiplexing; therefore, to a degree, it is compatible with embedded systems. The requirement to transmit the metadata increases the bandwidth needed in the channel: the added bandwidth ranges from 60% to 80%, depending on quality goals and techniques used. As implied, a set-top box employed in a traditional 2D environment would be able to use the 2D content, ignoring the metadata, and properly display the 2D image; in a 3D environment, the set-top box would be able to render the 3D signal.
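As a rough worked example of this overhead (the base rate and the specific figures are illustrative, not from the source):

def channel_rate_2d_plus_m(base_rate_mbps: float, overhead: float) -> float:
    """Bandwidth needed for 2D + M given the quoted 60-80% metadata
    overhead (0.6 <= overhead <= 0.8). Illustrative arithmetic only."""
    return base_rate_mbps * (1.0 + overhead)

# An 8 Mbps 2D service would need roughly 12.8-14.4 Mbps with metadata:
print(channel_rate_2d_plus_m(8.0, 0.6), channel_rate_2d_plus_m(8.0, 0.8))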

Some variations of this scheme have already appeared. One approach is to capture a delta file that represents the difference between the left and right images.

2D in conjunction with metadata.

A delta file is usually smaller than the raw file because of intrinsic redundancies. The delta file is then transmitted as metadata. Companies such as Panasonic and TDVision use this approach. This approach can also be used for stored media. For example, Panasonic has advanced (and the Blu-ray Disc Association is studying) the use of metadata to achieve a full-resolution 3D Blu-ray Disc standard. A 1920 × 1080p, 24 fps resolution per eye is achievable. This standard would make Blu-ray Disc a high-quality 3D content (storage) system. The goal was to agree on the standard by early 2010 and have 3D Blu-ray Disc players emerge by the end-of-year shopping season 2010. Another approach entails transmitting the 2D image in conjunction with a depth map of each scene.
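A minimal sketch of the delta-file idea follows, assuming a plain pixel-wise difference; the commercial schemes named above will differ in detail (and would entropy-code the delta rather than store it raw).

import numpy as np

def make_delta(left: np.ndarray, right: np.ndarray) -> np.ndarray:
    """Encoder side: the delta represents right - left. A wider integer
    type is used so that negative differences survive."""
    return right.astype(np.int16) - left.astype(np.int16)

def reconstruct_right(left: np.ndarray, delta: np.ndarray) -> np.ndarray:
    """Decoder side: a 2D receiver uses `left` as-is and ignores the
    metadata; a 3D receiver recovers the second eye from it."""
    return np.clip(left.astype(np.int16) + delta, 0, 255).astype(np.uint8)

left = np.random.randint(0, 256, (4, 4, 3), dtype=np.uint8)
right = np.roll(left, 1, axis=1)  # toy stand-in for the other-eye image
assert (reconstruct_right(left, make_delta(left, right)) == right).all()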

Video Plus Depth (V + D)

As noted above, many 3DTV proposals often rely on the basic concept of “stereoscopic” video, that is, the capture, transmission, and display of two separate video streams (one for the left eye and one for the right eye). More recently, specific proposals have been made for a flexible joint transmission of monoscopic color video and associated per-pixel depth information [24, 25]. The concept of V + D representation is the next notch up in complexity.

From this data representation, one or more “virtual” views of the 3D scene can then be generated in real-time at the receiver side, by means of Depth-Image-Based Rendering (DIBR) techniques [26]. A system such as this provides important features, including backwards compatibility with today’s 2D digital TV; scalability in terms of receiver complexity; and easy adaptability to a wide range of different 2D and 3D displays. DIBR is the process of synthesizing “virtual” views of a scene from still or moving color images and associated per-pixel depth information. Conceptually, this novel view generation can be understood as the following two-step process: at first, the original image points are re-projected into the 3D world, utilizing the respective depth data; thereafter, these 3D space points are projected into the image plane of a “virtual” camera that is located at the required viewing position. The concatenation of re-projection (2D to 3D) and subsequent projection (3D to 2D) is usually called 3D image warping in the Computer Graphics (CG) literature. The signal processing and data transmission chain of this kind of 3DTV concept is illustrated in Fig. 3.13; it consists of four different functional building blocks: (i) 3D content creation, (ii) 3D video coding, (iii) transmission, and (iv) “virtual” view generation and 3D display.
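For a single pixel, this two-step warp can be sketched as follows, assuming an ideal pinhole camera model; the matrix names and the per-pixel formulation are illustrative, and a real renderer would vectorize the operation over whole images.

import numpy as np

def warp_pixel(u, v, z, K, K_virt, R, t):
    """Two-step DIBR warp for one pixel, per the description above:
    (1) re-project the original image point (u, v) into 3D space using
    its depth z; (2) project the 3D point into the image plane of the
    'virtual' camera. K and K_virt are 3x3 intrinsic matrices; R (3x3)
    and t (3,) give the virtual camera's pose relative to the original."""
    # Step 1: 2D -> 3D re-projection using the depth value
    X = z * (np.linalg.inv(K) @ np.array([u, v, 1.0]))
    # Step 2: 3D -> 2D projection into the virtual view
    x = K_virt @ (R @ X + t)
    return x[0] / x[2], x[1] / x[2]  # homogeneous normalization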

Depth-image-based rendering (DIBR) system.

Video plus depth (V + D) representation for 3D video.

As can be seen in Fig. 3.14, a video signal and a per-pixel depth map are captured and eventually transmitted to the viewer. The per-pixel depth data can be considered a monochromatic luminance signal with a restricted range spanning the interval [Znear, Zfar], representing, respectively, the minimum and maximum distance of the corresponding 3D point from the camera. The depth range is quantized with 8 bits, with the closest point having the value 255 and the most distant point having the value 0. Effectively, the depth map is specified as a grayscale image; these values can be supplied to the luminance channel of a video signal, and the chrominance can be set to a constant value. In summary, this representation uses a regular video stream enriched with so-called depth maps providing a Z-value for each pixel. Note that V + D enjoys backward compatibility because a 2D receiver will display only the V portion of the V + D signal. Studies by the European ATTEST (Advanced Three-Dimensional Television System Technologies) project indicate that depth data can be compressed very efficiently and still be of good quality; namely, it needs only around 20% of the bitrate that would otherwise be needed to encode the color video (the qualitative results were confirmed by means of subjective testing). This approach can be placed in the category of Depth-Enhanced Stereo (DES).

Regeneration of stereo video from V + D signals.

A stereo pair can be rendered from the V + D information by 3D warping at the decoder. A general warping algorithm takes a layer and deforms it in many ways: for example, it twists the layer along any axis, bends it around itself, or adds an arbitrary dimension with a displacement map. The generation of the stereo pair from a V + D signal at the decoder is illustrated in Fig. 3.15. This reconstruction affords extended functionality compared to CSV (conventional stereo video) because the stereo image can be adjusted and customized after transmission. Note that, in principle, more than two views can be generated at the decoder, thus enabling support of multi-view displays (and head-motion parallax viewing, within reason).
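The toy sketch below renders one eye view from a V + D frame by purely horizontal warping, assuming a rectified, parallel camera setup in which disparity is f · b/Z; occlusion handling is reduced to a simple z-buffer and holes are left unfilled. Varying the baseline parameter is precisely what allows the stereo impression to be adjusted after transmission.

import numpy as np

def render_view(color, depth_m, f_px, baseline_m, direction):
    """Synthesize one eye view from video-plus-depth by horizontal
    pixel shifts. direction = +0.5 / -0.5 places the virtual camera
    half a baseline to either side of the center view (the sign
    convention is illustrative). A sketch only."""
    h, w, _ = color.shape
    out = np.zeros_like(color)
    zbuf = np.full((h, w), np.inf)
    for y in range(h):
        for x in range(w):
            d = f_px * baseline_m / depth_m[y, x]  # disparity in pixels
            xt = int(round(x + direction * d))     # shifted column
            if 0 <= xt < w and depth_m[y, x] < zbuf[y, xt]:
                zbuf[y, xt] = depth_m[y, x]        # nearer pixel wins
                out[y, xt] = color[y, x]
    return out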

V + D enjoys backwards compatibility, compression efficiency, extended functionality, and the ability to use existing coding algorithms. It is only necessary to specify high-level syntax that allows a decoder to interpret two incoming video streams correctly as color and depth. The specifications ‘‘ISO/IEC 23002-3 Representation of Auxiliary Video and Supplemental Information’’ and ‘‘ISO/IEC 13818-1:2003 Carriage of Auxiliary Data’’ enable 3D video based on V + D to be deployed in a standardized fashion by broadcasters interested in adopting this method.

It should be noted, however, that the advantages of V + D over CSV entail increased complexity for both sender and receiver. At the receiver side, view synthesis has to be performed after decoding to generate the second view of the stereo pair. At the sender (capture) side, the depth data have to be generated before encoding can take place. This is usually done by depth/disparity estimation from a captured stereo pair; these algorithms are complex and still error-prone. Thus, in the near future, V + D might be more suitable for applications with playback functionality, where depth estimation can be performed offline on powerful machines, for example in a production studio or home 3D editing suite, enabling viewing of downloaded 3D video clips and 3DTV broadcasting [16].

Multi-View Video Plus Depth (MV + D)

There are some advanced 3D video applications that are not properly supported by any existing standards and where work by the ITU-R or ISO/MPEG is needed. Two such applications are given below:

  • wide-range multi-view autostereoscopic displays (say, nine or more views);
  • FVV (an environment where the user can choose his/her own viewpoint).

These 3D video applications require a 3D video format that allows rendering a continuum and/or a large number of output views at the decoder. There really are no available alternatives: MVC, discussed above, does not support a continuum and becomes inefficient for a large number of views; and, as we noted, V + D could in principle generate more than two views at the decoder, but in practice it supports only a limited continuum around the original view (artifacts increase significantly with the distance of the virtual viewpoint). In response, MPEG started an activity to develop a new 3D video standard that would support these requirements.

The MV + D concept is illustrated in Fig. 3.16. MV + D involves a number of complex processing steps where (i) depth has to be estimated for the N views at the capture point, and then (ii) N color and N depth video streams have to be encoded and transmitted. At the receiver, the data have to be decoded and the virtual views have to be rendered (reconstructed).

Multi-view video plus depth (MV + D) concept.

As was implied just above, MV + D can be used to support multi-view autostereoscopic displays in a relatively efficient manner. Consider a display that supports nine views (V1–V9) simultaneously (e.g., a lenticular display manufactured by Philips; Fig. 3.17).

Multi-view autostereoscopic displays based on MV + D.

From a specific position, a viewer can see only a stereo pair of views, depending on the viewer’s position. Transmitting nine display views directly (e.g., by using MVC) would be taxing from a bandwidth perspective; in this illustrative example, only three original views (views V1, V5, and V9) along with corresponding depth maps D1, D5, and D9 are in the decoded stream—the remaining views can be synthesized from these decoded data by using DIBR techniques.
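The view-selection logic of this nine-view example can be sketched as follows; the DIBR warp itself is elided, and the names are illustrative.

# Only views 1, 5, and 9 (with depth maps D1, D5, D9) are decoded;
# every other display view is synthesized from its nearest
# transmitted neighbors.
TRANSMITTED = (1, 5, 9)

def source_views(target: int):
    """Return the decoded view(s) from which display view `target`
    (1..9) would be rendered by DIBR."""
    if target in TRANSMITTED:
        return (target,)                  # available directly
    left = max(v for v in TRANSMITTED if v < target)
    right = min(v for v in TRANSMITTED if v > target)
    return (left, right)                  # warp and blend both neighbors

assert source_views(5) == (5,)
assert source_views(3) == (1, 5)
assert source_views(7) == (5, 9)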

Layered Depth Video (LDV)

LDV is a derivative of, and also an alternative to, MV + D. LDV is believed to be more efficient than MV + D because less information has to be transmitted; however, additional error-prone vision processing tasks are required that operate on partially unreliable depth data. These efficiency assessments remained to be fully validated as of press time.

LDV uses (i) one color video with an associated depth map and (ii) a background layer with an associated depth map; the background layer includes image content that is covered by foreground objects in the main layer. This is illustrated in Figs 3.18 and 3.19. The occlusion information is constructed by warping two or more neighboring V + D views from the MV + D representation onto a defined center view. The LDV stream or substreams can then be encoded by a suitable LDV coding profile.

Layered depth video (LDV) concept.

Layered depth video (LDV) example.

Note that LDV can be generated from MV + D by warping the main layer image onto the other contributing input images (e.g., an additional left and right view). By subtraction, it is then determined which parts of the other contributing input images are covered in the main layer image; these are then assigned as residual images and transmitted, while the rest is omitted [16].
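The residual construction might be sketched as follows, assuming the warp marks pixels it cannot reach (disocclusions) with a sentinel value; the names and the hole convention are illustrative only.

import numpy as np

def ldv_residual(side_view, main_warped, hole_value=-1.0):
    """Keep, as residual data, only those pixels of a contributing
    side view that the warped main layer could not cover; everything
    the main layer already represents is omitted from transmission."""
    holes = (main_warped == hole_value).all(axis=-1)  # uncovered pixels
    residual = np.zeros_like(side_view)
    residual[holes] = side_view[holes]                # occluded content
    return residual, holes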

Figure 3.18 is based on a recent presentation at the 3D Media Workshop, Heinrich Hertz Institut (HHI), Berlin, October 15–16, 2009 [27, 28]. LDV provides a single view with depth and occlusion information. The goal is to achieve automatic acquisition of 3DTV content, especially to obtain depth and occlusion information from video and to extrapolate a new view without error.

Table 3.2, composed from technical details in Ref. [29], provides a summary of the issues associated with the various representation methods.

Summary of Formats