Blu-ray Disc Association (BDA)

The BDA announced the finalization and release of the Blu-ray 3D specification at the end of 2009. The specification for 3D-enhanced Blu-ray video is titled “Blu-ray 3D.” The specification, embodying the work of the leading Hollywood studios and consumer electronics and computer manufacturers, will enable the home entertainment industry to bring the stereoscopic 3D experience into consumers’ living rooms on BDs, but will require consumers to acquire new players, HDTVs, and shutter glasses. The specification allows every Blu-ray 3D player and movie to deliver full HD 1080p resolution (1920 × 1080, progressive scan) to each eye, thereby maintaining the industry’s leading image quality, which further distances Blu-ray from high-definition options provided by Internet-based services. The release of a final specification based on H.264 should allow professional video editing tools such as Avid, Final Cut Studio, and Premiere to author 3DV in a routine fashion. Note: Although announced at the end of 2009, the specification will actually be finalized in 2010.

The Blu-ray 3D specification is display-agnostic, meaning that Blu-ray 3D products will deliver the 3D image to any compatible 3D display, regardless of whether that display uses LCD, OLED, plasma, or other technology, and regardless of what 3D technology the display uses to deliver the image to the viewer’s eyes. The compulsory aspect for stereoscopic 3D is that those screens should support a refresh rate of 120 Hz or higher. The specification supports playback of 2D discs in forthcoming 3D players and can enable 2D playback of Blu-ray 3D discs on the large installed base of BD players currently in homes around the world. The Blu-ray 3D specification will encode 3DV using the MVC codec, an extension to the ITU-T H.264 AVC codec currently supported by all BD players. MPEG-4 MVC compresses both left- and right-eye views with a typical 50% overhead compared to equivalent 2D content, according to BDA, and can provide full 1080p-resolution backward compatibility with current 2D BD players. The specification also incorporates enhanced graphic features for 3D. These features provide a new experience for users, enabling navigation using 3D graphic menus and displaying 3D subtitles positioned in 3DV.

By press time, observers were expecting to see demos of 3DTV sets using content from stereo 3D-enabled Blu-ray players utilizing prototype implementations of the Blu-ray 3D specification. However, most of the players and many of the TVs will not be available until sometime later, when new chips for the specification are available.

Society of Motion Picture and Television Engineers (SMPTE) 3D Home Entertainment Task Force

There is a need for a single mastering standard for viewing stereo 3D content on TVs, PCs, and mobile phones, where the content could originate from optical discs, broadcast networks, or the Internet. To that end, SMPTE formed a 3D Home Entertainment Task Force in 2008 to work on the issue, and a standards effort was launched in 2009 via an SMPTE 3D Standards Working Group to define a content format for stereo 3D. The SMPTE 3D Standards Working Group had about 200 participants at press time; the Home Master standard was expected to become available in mid-2010. The group is in favor of a mastering standard for the Home Master specification based on 1920 × 1080 pixel resolution at 60 fps/eye. The specification is expected to support an option for falling back to a 2D image. The standard is also expected to support hybrid products, such as BDs that can support either 2D or stereo 3D displays.

SMPTE’s 3D Home Master defines high-level image formatting requirements that impact 3DTV designs, but the bulk of the 3DTV standards for hardware are expected to come from other organizations, such as CEA. Studios or game publishers would deliver the master as source material for uses ranging from DVD and BD players to terrestrial and satellite broadcasts and Internet downloadable or streaming files.

As we have seen throughout this text, 3DTV systems must support multiple delivery channels, multiple coding techniques, and multiple display technologies. Digital cinema, for example, is addressed with a relatively simple left–right sequence approach; residential TV displays involve a greater variety of technologies, necessitating more complex encoding. Content transmission and delivery is also supported by a variety of physical media such as BDs as well as broadcasting, satellite, and cable delivery. The SMPTE 3D Group has been considering what kind of compression should be supported. One of the key goals of the standardization process is defining and/or identifying schemes that minimize the total bandwidth required to support the service; the MVC extension to MPEG-4/H.264 discussed earlier is being considered by the group. Preliminary studies have shown, however, that relatively little bandwidth may be saved when compared to simulcast because high-quality images require 75–100% overhead and images of medium quality require 65–98% overhead. In addition to defining the representation and encoding standards (which clearly drive the amount of channel bandwidth for the additional image stream), 3DTV service entails other requirements; for example, there is the issue of graphics overlay, captions and subtitles, and metadata. 3D programming guides have to be rethought, according to industry observers; the goal is to avoid floating the guide in front of the action and instead to push the guide behind the screen and let the action play over it, because practical research shows that people find it jarring when the programming guide is brought to the forefront of 3DV images [13].
The SMPTE Group is also looking at format wrappers, such as Material eXchange Format (MXF; a container format for professional digital video and audio media defined by a set of SMPTE standards), whether an electrical interface should be specified, and if depth representation is needed for an early version of the 3DTV service, among other factors [14]. As we have noted earlier in the text, 3DTV has the added consideration of physiological effects because disjoint stereoscopic images can adversely impact the viewer.


MPEG Industry Forum (MPEGIF)

Moving Pictures Expert Group Industry Forum (MPEGIF) is an advocacy group for standards-based DTV technologies. The group is an independent and platform-neutral not-for-profit organization representing more than 20 international companies and organizations with the goal to facilitate and further the widespread adoption and deployment of MPEG and related standards in next-generation digital media services. MPEGIF is among the consortiums focused on standardizing technology and methods for delivering 3DV/3DTV.

MPEGIF announced in December 2009 the formation of the 3DTV Working Group and launch of the “3D over MPEG” campaign. The new working group and campaign continue MPEGIF’s work in furthering the widespread adoption and deployment of MPEG-related standards including MPEG-4 AVC/H.264. The chair of the newly formed 3DTV Working Group stated that “3DTV is of keen interest to everyone in the video creation and delivery industries. The challenge we all face is that of sorting through the myriad technical options. Our common goal is to create a 3DTV ecosystem that delivers great new experiences to consumers. The 3DTV Working Group and the ‘3D over MPEG’ campaign are designed to provide focus and clear information to decision makers. 3DTV can be distributed today using MPEG-related standards. Existing broadband and broadcast services and infrastructures are 3D-ready, and ongoing works by standards bodies provide a compelling path for the future evolution of 3DTV . . . 3D video is showing distinct commercial promise in theatrical releases and could thus transition to the advanced living room to follow High-Definition and Surround Sound. As a result there is a growing array of competing technologies and work from various standards bodies. It has therefore become a major theme of the next MPEG Industry Forum Master Class being held at CES 2010 in Las Vegas in January 2010.” About 30 industry participants joined the 3D Working Group at the launch.

The 3DTV Working Group aims to provide a forum for the free exchange of information related to this emerging technology, and an industry voice for advocating the adoption of standards and for consolidating the overall direction of the 3DTV industry. Its focus and constituency will be derived from video service providers, consumer electronics manufacturers, content owners, equipment manufacturers, system integrators, software providers, as well as industry advocacy groups, industry analysts, financial institutions, and academic institutes.

MPEG-4 MVC is being given consideration. As we have seen, MPEG-4 MVC can be used, among other more sophisticated applications, to handle simple transmission of independent, left-eye/right-eye views, which is considered to be the viable early commercial approach, at least in the United States. An arrangement called by some “frame-packing arrangement and SEI message” enables the encoder to signal the decoder how to extract two distinct views from a single decoded frame; this could be in the form of side-by-side, or over–under images.
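To make the frame-packing idea concrete, the following minimal sketch shows how a receiver might split one decoded frame into two views once it knows the arrangement. It is an illustration only: frames are modeled as plain lists of pixel rows, the arrangement names are hypothetical labels rather than actual SEI syntax values, and the assumption that the left-eye view occupies the left (or upper) half is made purely for the example.

```python
# Illustrative only: frames are lists of pixel rows; the arrangement names are
# hypothetical labels, not values of the actual H.264 SEI syntax element.

def unpack_frame(frame, arrangement):
    """Split one decoded frame into (left_view, right_view)."""
    height = len(frame)
    width = len(frame[0])
    if arrangement == "side_by_side":
        # Assumption: the left-eye view occupies the left half of each row.
        half = width // 2
        left = [row[:half] for row in frame]
        right = [row[half:] for row in frame]
    elif arrangement == "over_under":
        # Assumption: the left-eye view occupies the upper half of the frame.
        half = height // 2
        left = [row[:] for row in frame[:half]]
        right = [row[:] for row in frame[half:]]
    else:
        raise ValueError(f"unsupported arrangement: {arrangement}")
    return left, right


# A 2 x 4 test frame: 'L' pixels in the left half, 'R' pixels in the right half.
packed = [["L", "L", "R", "R"],
          ["L", "L", "R", "R"]]
left_view, right_view = unpack_frame(packed, "side_by_side")
```

Note that each extracted view has half the resolution of the packed frame along the packing direction, which is the price paid for carrying two views in a single conventional frame.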

Moving Picture Experts Group (MPEG)


MPEG is a working group of ISO/IEC in charge of the development of standards for coded representation of digital audio and video and related data. Established in 1988, the group produces standards that help the industry offer end users an ever more enjoyable digital media experience. In its 21 years of activity, MPEG has developed a substantive portfolio of technologies that have created an industry worth several hundred billion USD. MPEG is currently interested in 3DV in general and 3DTV in particular. Any broad success of 3DTV/3DV will likely depend on the development and industrial acceptance of MPEG standards; MPEG is the premier organization worldwide for video encoding, and the list of standards that have been produced in recent years is as follows:

MPEG-1 The standard on which such products as video CD and MP3 are based

MPEG-2 The standard on which such products as digital television set-top boxes and DVDs are based

MPEG-4 The standard for multimedia for the fixed and mobile web

MPEG-7 The standard for description and search of audio and visual content

MPEG-21 The multimedia framework

MPEG-A The standard providing application-specific formats by integrating multiple MPEG technologies

MPEG-B A collection of systems-specific standards

MPEG-C A collection of video-specific standards

MPEG-D A collection of audio-specific standards

MPEG-E A standard (M3W) providing support to download and execute multimedia applications

MPEG-M A standard (MXM) for packaging and reusability of MPEG technologies

MPEG-U A standard for rich media user interface

MPEG-V A standard for interchange with virtual worlds

The table “Activities of MPEG Groups in the Area of Video” provides a more detailed listing of the activities of MPEG groups in this area.

Completed Work

As we have seen in other parts of this text, currently there are a number of different 3DV formats (either already available and/or under investigation), typically related to specific types of displays (e.g., classical two-view stereo video, multiview video with more than two views, V+D, MV+D, and layered depth video). Efficient compression is crucial for 3DV applications and a plethora of compression and coding algorithms are either already available and/or under investigation for the different 3DV formats (some of these are standardized e.g., by MPEG, others are proprietary). A generic, flexible, and efficient 3DV format that can serve a range of different 3DV systems (including mobile phones) is currently being investigated by MPEG.

As we noted earlier in this text, MPEG standards already support 3DV based on V+D. In 2007 MPEG specified a container format “ISO/IEC 23002-3 Representation of Auxiliary Video and Supplemental Information” (also known as MPEG-C Part 3) that can be utilized for V+D data. Transport of this data is defined in a separate MPEG systems specification “ISO/IEC 13818-1:2003 Carriage of Auxiliary Data.”

In 2008 ISO approved a new 3DV project under ISO/IEC JTC1/SC29/WG11 (ISO/IEC JTC1/SC29/WG11, MPEG2008/N9784). The JVT of ITU-T and MPEG has devoted its recent efforts to extending the widely deployed H.264/AVC standard for MVC to support MV+D (and also V+D). MVC allows the construction of bitstreams that represent multiple views. The MPEG standard that emerged, MVC, provides good robustness and compression performance for delivering 3DV by taking into account the inter-view dependencies of the different visual channels. In addition, its backward compatibility with H.264/AVC codecs makes it widely interoperable in environments having both 2D- and 3D-capable devices. MVC supports an MV+D (and also V+D) encoded representation inside the MPEG-2 transport stream. The MVC standard was developed by the JVT of ISO/IEC MPEG and ITU-T Video Coding Experts Group (VCEG; ISO/IEC JTC1/SC29/WG11 and ITU-T SG16 Q.6). MVC was originally an addition to the H.264/MPEG-4 AVC video compression standard that enables efficient encoding of sequences captured simultaneously from multiple cameras using a single video stream.

Activities of MPEG Groups in the Area of Video

At press time, MVC was the most efficient method for stereo and multi-view video coding; for two views, the performance achieved by the H.264/AVC Stereo SEI message and by MVC is similar. MVC is also expected to become a new MPEG video coding standard for the realization of future video applications such as 3DTV and FTV. The MVC group in the JVT has chosen the H.264/AVC-based MVC method as the MVC reference model, since this method showed better coding efficiency than H.264/AVC simulcast coding and the other methods that were submitted in response to the call for proposals made by MPEG.

New Initiatives

ISO MPEG has already developed a suite of international standards to support 3D services and devices, and in 2009 initiated a new phase of standardization to be completed by 2011:

  • One objective is to enable stereo devices to cope with varying display types and sizes, and different viewing preferences. This includes the ability to vary the baseline distance for stereo video to adjust the depth perception, which could help to avoid fatigue and other viewing discomforts.
  • MPEG also envisions that high-quality autostereoscopic displays will enter the consumer market in the next few years. Since it is difficult to directly provide all the necessary views due to production and transmission constraints, a new format is needed to enable the generation of many high-quality views from a limited amount of input data such as stereo and depth.

ISO’s vision is now a new 3DV format that goes beyond the capabilities of existing standards to enable both advanced stereoscopic display processing and improved support for autostereoscopic N-view displays, while enabling interoperable 3D services. The new 3DV standard aims to improve the rendering capability of the 2D+Depth format while reducing bitrate requirements relative to existing standards, as noted earlier in this section.

3DV supports new types of audiovisual systems that allow users to view videos of the real 3D space from different user viewpoints. In an advanced application of 3DV, denoted as FTV, a user can set the viewpoint to an almost arbitrary location and direction that can be static, change abruptly, or vary continuously, within the limits that are given by the available camera setup. Similarly, the audio listening point is changed accordingly. The first phase of 3DV development is expected to support advanced 3D displays, where M dense views must be generated from a sparse set of K transmitted views (typically K ≤ 3) with associated depth data. The allowable range of view synthesis will be relatively narrow (20° view angle from leftmost to rightmost view).

Example of an FTV system and data format.

The MPEG initiative notes that 3DV is a standard that targets serving a variety of 3D displays. It is the first phase of FTV, a new framework that includes a coded representation for multi-view video and depth information to support the generation of high-quality intermediate views at the receiver. This enables free viewpoint functionality and view generation for automultiscopic displays [7].
Figure 6.1 shows an example of an FTV system that transmits multi-view video with depth information. The content may be produced in a number of ways; for example, with a multicamera setup, depth cameras, or 2D/3D conversion processes. At the receiver, DIBR could be performed to project the signal to various types of displays.

The first focus (phase) of ISO/MPEG standardization for FTV is 3DV [8], that is, video for 3D displays. Such displays present N views (e.g., N = 9) simultaneously to the user (Fig. 6.2). For efficiency reasons, only a lower number K of views (K = 1, 2, 3) shall be transmitted. For those K views, additional depth data shall be provided. At the receiver side, the N views to be displayed are generated from the K transmitted views with depth by DIBR. This is illustrated in Fig. 6.2.

This application scenario imposes specific constraints such as narrow angle acquisition (<20°). Also, for cost reasons, there should be no need for geometric rectification at the receiver side, meaning that if any rectification is needed at all, it should be performed on the input views already at the encoder side.
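The receiver-side view generation described above can be sketched for a single scanline as follows. This is a toy forward-warping routine and not the DIBR method of any particular standard or product: it assumes rectified, horizontally aligned cameras (consistent with rectification being done at the encoder side), a per-pixel disparity already derived from the depth data, and an arbitrary sign convention for the shift direction. The function and variable names are hypothetical.

```python
def warp_row(colors, disparities, alpha):
    """Forward-warp one scanline of a transmitted view to a virtual camera.

    colors      : pixel values of the transmitted view
    disparities : non-negative per-pixel disparity (in pixels) for a unit
                  baseline shift; larger values mean nearer objects
    alpha       : virtual camera position in baseline units (0 = original)
    Positions that no source pixel maps to stay None; these are the
    disocclusion holes that a real renderer would have to fill.
    """
    width = len(colors)
    out = [None] * width
    stored = [-1.0] * width  # disparity of the pixel currently kept at each position
    for x in range(width):
        target = x + round(alpha * disparities[x])
        # Nearer pixels (larger disparity) win when two pixels collide.
        if 0 <= target < width and disparities[x] > stored[target]:
            out[target] = colors[x]
            stored[target] = disparities[x]
    return out


# One transmitted scanline with depth: pixels "c" and "d" are foreground.
colors = ["a", "b", "c", "d", "e", "f"]
disparities = [0, 0, 2, 2, 0, 0]

# Generate N = 9 output views at evenly spaced virtual camera positions.
alphas = [i / 4 - 1 for i in range(9)]
views = [warp_row(colors, disparities, a) for a in alphas]
shifted = warp_row(colors, disparities, 1)
```

In the view shifted by one full baseline unit, the foreground pixels move by two positions and cover the background, while the positions they vacate become holes. This is the occlusion issue that limits the depth range a single view with depth can render.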

Example of generating nine output views (N = 9) out of three input views with depth (K = 3).

Some multi-view displays are, for example, based on LCD screens with a sheet of transparent lenses in front. This sheet sends different views to each eye, and so a person sees two different views; this gives the person a stereoscopic viewing experience. The stereoscopic capabilities of these multi-view displays are limited by the resolution of the LCD screen (currently 1920 × 1080). For example, for a nine-view system where the cone of nine views is 10° (Cone Angle, CA), objects are limited to ±10% (Object Range, OR) of the screen width to appear in front or behind the screen. Both OR and CA will improve with time (determined by economics) as the number of pixels of the LCD screen goes up.

In addition, other types of stereo displays are now appearing in the market in large numbers. The ability to generate output views at arbitrary positions at the receiver is attractive even in the case of N = 2 (i.e., a simple stereo display). If, for example, the material has been produced for a large cinema theater, direct usage of that stereo signal (two fixed views) with relatively small home-sized 3D displays will yield a very different stereoscopic viewing experience (e.g., strongly reduced depth effect). With a 3DV signal as illustrated in Fig. 6.3, a new stereo pair can be generated that is optimized for the given 3D display.

Example of lenticular autostereoscopic display requiring nine views (N = 9).

In a different initiative, ISO previously looked at auxiliary video data representations. The purpose of ISO/IEC 23002-3 Auxiliary Video Data Representations is to support all those applications where additional data needs to be efficiently attached to the individual pixels of a regular video. ISO/IEC 23002-3 describes how this can be achieved in a generic way by making use of existing (and even future) video codecs available within MPEG. A good example of an application that requires additional information associated with the individual pixels of a regular (2D) video stream is stereoscopic video presented on an autostereoscopic single- or multiple-user display. At the MPEG meeting in Nice, France (October 2005), the arrival of such displays on the market had been stressed, and several of them were even shown and demonstrated. Because different display realizations vary largely in (i) the number of views that are represented and (ii) the maximum parallax that can be supported, an input format is required that is flexible enough to drive all possible variants. This can be achieved by supplying a depth or parallax value with each pixel of a regular video stream, and by generating the required stereoscopic views at the receiver side. The standardization of a common depth or parallax format within ISO/IEC 23002-3 Auxiliary Video Data Representations will thus enable interoperability between content providers, broadcasters, and display manufacturers. ISO/IEC 23002-3 is flexible enough to easily add other types of auxiliary video data in the future. One example could be the annotation of temperature maps coming from an infrared camera to regular video coming from a regular camera.

The Auxiliary Video Data format defined in ISO/IEC 23002-3 consists of an array of N-bit values that are associated with the individual pixels of a regular video stream. These data can be compressed like conventional luminance signals using already existing (and even future) MPEG video codecs. The format allows for optional subsampling of the auxiliary data in both the spatial and temporal domains. This can be beneficial depending on the particular application and its requirements, and allows for very low bitrates for the auxiliary data. The specification is very flexible in the sense that it defines a new 8-bit code word aux_video_type that specifies the type of the associated data; for example, currently a value of 0x10 signals a depth map and a value of 0x11 signals a parallax map. New values for additional data representations can be easily added to fulfill future demands.
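As a minimal sketch of how an application might act on the aux_video_type code word, the following lookup maps the two values quoted above to a description and treats every other 8-bit value as unknown. Only the two code values come from the text; the function name, the table name, and the handling of unlisted values are assumptions made for illustration.

```python
# Values quoted in the text; all other names here are hypothetical.
AUX_VIDEO_TYPES = {
    0x10: "depth map",
    0x11: "parallax map",
}


def describe_aux_video_type(code):
    """Return a description of the auxiliary data signaled by an 8-bit code."""
    if not 0 <= code <= 0xFF:
        raise ValueError("aux_video_type is an 8-bit code word")
    # Assumption: values not listed above are simply reported as unknown.
    return AUX_VIDEO_TYPES.get(code, "unknown or reserved")


kind = describe_aux_video_type(0x10)
```

Supporting a new auxiliary data type in this sketch only requires one more table entry, which mirrors the extensibility argument made for the code word.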

The transport of auxiliary video data within an MPEG-2 transport or program stream is defined in an amendment to the MPEG-2 systems standard. It specifies new stream_id_extension and stream_type values that are used to signal an auxiliary video data stream. An additional auxiliary_video_data_descriptor is utilized in order to convey in more detail how the data should be interpreted by the application that uses them. Metadata associated with the auxiliary data is carried at the system level, allowing the use of unmodified video codecs (no need to modify silicon).

In conclusion, ISO/IEC 23002-3 Auxiliary Video Data Representations provides a reasonably efficient approach for attaching additional information such as depth values and parallax values to the individual pixels of a regular video stream and to signal how these associated data should be interpreted by the application that uses them.

Additional Details on Video Encoding Standards

Efficient video encoding is required for 3DTV/3DV and for FVT/FVV. 3DTV/3DV support 3D depth impression of the observed scenery, while FVT/FVV additionally allow for an interactive selection of viewpoint and direction within a certain operating range. Hence, a common feature of 3DV and FVV systems is the use of multiple views of the same scene that are transmitted to the user. Multi-view 3D video can be encoded implicitly in the V + D representation or, as is more often the case, explicitly.

In implicit coding one seeks to use (implicit) shape coding in combination with MPEG-2/MPEG-4. Implicit shape coding could mean that the shape can be easily extracted at the decoder, without explicit shape information present in the bitstream. These types of image compression schemes do not rely on the usual additive decomposition of an input image into a set of predefined spanning functions. These schemes only encode implicit properties of the image and reconstruct an estimate of the scene at the decoding end. This has particular advantages when one seeks very low bitrate perceptually oriented image compression [32]. The literature on this topic is relatively scanty. Chroma Key might be useful in this context: Chroma Key, or green screen, allows one to put a subject anywhere in a scene or environment using the Chroma Key as the background. One can then import the image into the digital editing software, extract the Chroma Key, and replace it with another image or video. Chroma Key shape coding for implicit shape coding (for medium quality shape extraction) has been proposed and also demonstrated in the recent past.

On the other hand, there are a number of strategies for explicit coding of multiview video: (i) simulcast coding, (ii) scalable simulcast coding, (iii) multi-view coding, and (iv) Scalable Multi-View Coding (SMVC).

Simulcast coding is the separate encoding (and transmission) of the two video scenes in the CSV format; clearly the bitrate will typically be in the range of double that of 2DTV. V + D is more bandwidth efficient not only in the abstract, but also in practice. At the practical level, in a V + D environment the quality of the compressed depth map is not a significant factor in the final quality of the rendered stereoscopic 3D video. This follows from the fact that the depth map is not directly viewed, but is employed to warp the 2D color image to two stereoscopic views. Studies show that the depth map can typically be compressed to 10%–20% of the bitrate of the color information.
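The bandwidth argument can be made concrete with a small calculation. The sketch below compares simulcast with V + D for a stereo service, using the doubling and the 10%–20% depth figures given above; the 10 Mbit/s rate for the 2D color video is a hypothetical value chosen only to make the arithmetic visible, and the function names are likewise illustrative.

```python
def simulcast_rate(rate_2d_kbps):
    """Two independently coded views: roughly double the 2D rate."""
    return 2 * rate_2d_kbps


def video_plus_depth_rate(rate_2d_kbps, depth_percent):
    """One color view plus a depth map coded at a percentage of the color rate."""
    return rate_2d_kbps * (100 + depth_percent) / 100


RATE_2D_KBPS = 10_000  # hypothetical 2D color video rate (10 Mbit/s)

simulcast = simulcast_rate(RATE_2D_KBPS)
vd_low = video_plus_depth_rate(RATE_2D_KBPS, 10)
vd_high = video_plus_depth_rate(RATE_2D_KBPS, 20)
```

Under these assumptions, simulcast needs 20 Mbit/s while V + D needs between 11 and 12 Mbit/s, a saving of 40%–45% relative to simulcast.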

V + D (also called 2D plus depth, or 2D + depth, or color plus depth) has been standardized in MPEG as an extension for 3D filed under ISO/IEC FDIS 23002-3:2007(E). In 2007, MPEG specified a container format “ISO/IEC 23002-3 Representation of Auxiliary Video and Supplemental Information” (also known as MPEG-C Part 3) that can be utilized for V + D data. 2D + depth, as specified by ISO/IEC 23002-3 supports the inclusion of depth for generation of an increased number of views. While it has the advantage of being backward compatible with legacy devices and is agnostic of coding formats, it is capable of rendering only a limited depth range since it does not directly handle occlusions [33]. Transport of this data is defined in a separate MPEG systems specification “ISO/IEC 13818-1:2003 Carriage of Auxiliary Data.”

There is also major interest in MV + D. Applicable coding schemes of interest here include the following:

  • Multiple-view video coding (MVC)
  • Scalable Video Coding (SVC)
  • Scalable multi-view video coding (SMVC)

From a test/test-bed implementation perspective, for the first two options, each view can be independently coded using the public-domain H.264 and SVC codecs respectively. Test implementations for MVC and for preliminary implementations of an SMVC codec have been documented recently in the literature.

Multiple-View Video Coding (MVC)

It has been recognized that MVC is a key technology for a wide variety of future applications including FVV/FTV, 3DTV, immersive teleconference and surveillance, and other applications. An MPEG standard, “Multi-View Video Coding (MVC),” to support MV + D (and also V + D) encoded representation inside the MPEG-2 transport stream has been developed by the JVT of ISO/IEC MPEG and ITU-T VCEG (ISO/IEC JTC1/SC29/WG11 and ITU-T SG16 Q.6). MVC allows the construction of bitstreams that represent multiple views [34]; MVC supports efficient encoding of video sequences captured simultaneously from multiple cameras using a single video stream. MVC can be used for encoding stereoscopic (two-view) and multi-view 3DTV, and for FVV/FVT.

MVC (ISO/IEC 14496-10:2008 Amendment 1 and ITU-T Recommendation H.264) is an extension of the AVC standard that provides efficient coding of multi-view video. The encoder receives N temporally synchronized video streams and generates one bitstream. The decoder receives the bitstream, decodes and outputs the N video signals. Multi-view video contains a large amount of inter-view statistical dependencies, since all cameras capture the same scene from different viewpoints. Therefore, combined temporal and inter-view prediction is the key for efficient MVC. Also, pictures of neighboring cameras can be used for efficient prediction [35]. MVC supports the direct coding of multiple views and exploits inter-camera redundancy to reduce the bitrate. Although MVC is more efficient than simulcast, the rate of MVC encoded video is proportional to the number of views.
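Why inter-view prediction pays off can be shown with a toy example. The sketch below predicts a block of one view from a horizontally shifted block of a neighboring view and measures the residual with a sum of absolute differences (SAD). It is a simplified block-matching illustration under the assumption of purely horizontal disparity, not the MVC encoding algorithm or its bitstream syntax; the sample values and function names are invented for the example.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(a - b) for a, b in zip(block_a, block_b))


def predict_block(target, reference, start, size, max_shift):
    """Find the horizontal shift of `reference` that best predicts a block.

    Returns (best_shift, residual_sad) for target[start:start + size].
    """
    block = target[start:start + size]
    best_shift, best_cost = 0, None
    for shift in range(max_shift + 1):
        candidate = reference[start + shift:start + shift + size]
        if len(candidate) < size:
            break  # the shifted block would fall outside the reference row
        cost = sad(block, candidate)
        if best_cost is None or cost < best_cost:
            best_shift, best_cost = shift, cost
    return best_shift, best_cost


# Invented sample rows: the right view sees the same content shifted by 2 pixels.
left_row = [10, 20, 30, 40, 50, 60, 70, 80]
right_row = [30, 40, 50, 60, 70, 80, 80, 80]

shift, residual = predict_block(right_row, left_row, start=0, size=4, max_shift=3)
unpredicted = sum(abs(v) for v in right_row[0:4])  # cost of coding the block as is
```

Here the best match is found at a shift of two pixels and leaves a zero residual, whereas the unpredicted block carries all of its sample energy. Real content never matches this perfectly, which is one reason the rate still grows with each added view.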

The MVC group in the JVT has chosen the H.264/MPEG-4 AVC-based multi-view video method as its MVC video reference model, since this method supports better coding efficiency than H.264/AVC simulcast coding. H.264/MPEG-4 AVC was developed jointly by ITU-T and ISO through the JVT in the early 2000s (the ITU-T H.264 standard and the ISO/IEC MPEG-4 AVC, ISO/IEC 14496-10-MPEG-4 Part 10 are jointly maintained to retain identical technical content). H.264 is used with Blu-ray Disc and videos from the iTunes Store. The standardization of H.264/AVC was completed in 2003, but additional extensions have taken place since then; for example, SVC, as specified in Annex G of H.264/AVC, was added in 2007.

Owing to the increased data volume of multi-view video, highly efficient compression is needed. In addition to the redundancy exploited in 2D video for compression, the common idea for MVC is to further exploit the redundancy between adjacent views. This is because multi-view video is captured by multiple cameras at different positions and significant correlations exist between neighboring views [36]. As hinted elsewhere, there is interest in being able to synthesize novel views from the virtual cameras in multi-view camera configurations; however, the occlusion problem can significantly affect the quality of virtual view rendering [37]. Also, for FVV, the depth map quality is important because it is used to render virtual views that are further apart than in the stereoscopic case: when the views are further apart, the distortion in the depth map has a greater effect on the final rendered quality; this implies that the data rate of the depth map has to be higher than in the CSV case.

Note: Most existing MVC techniques are based on the traditional hybrid DCT-based video coding schemes. These neither fully exploit the redundancy among different views nor provide an easy way of implementing scalability. In addition, all the existing MVC schemes mentioned above use DCT-based coding. A fundamental problem for DCT-based block coding is that it is not convenient to achieve scalability, which has become an increasingly important feature for video coding and communications. As a research topic, wavelet-based image and video coding has proved to be a good way to achieve both good coding performance and full scalability, including spatial, temporal, and Signal-To-Noise Ratio (SNR) scalability. In the past, MVC has been included in several video coding standards such as MPEG-2 MVP and MPEG-4 MAC (Multiple Auxiliary Component). More recently, an H.264-based MVC scheme has been developed that utilizes the multiple reference structure in H.264. Although this method does exploit the correlations between adjacent views through inter-view prediction, it has some constraints for practical applications compared to a method that uses, say, wavelets [36].

As just noted, MPEG has developed a suite of international standards to support 3D services and devices. In 2009 MPEG initiated a new phase of standardization to be completed by 2011. MPEG’s vision is a new 3DV format that goes beyond the capabilities of existing standards to enable both advanced stereoscopic display processing and improved support for autostereoscopic N-view displays, while enabling interoperable 3D services. 3DV aims to improve the rendering capability of the 2D + depth format while reducing bitrate requirements relative to simulcast and MVC. Figure B3.1 illustrates ISO MPEG’s target 3DV format, showing limited camera inputs and constrained-rate transmission according to a distribution environment. The 3DV data format aims to be capable of rendering a large number of output views for autostereoscopic N-view displays and to support advanced stereoscopic processing. Owing to limitations in the production environment, the 3DV data format is assumed to be based on limited camera inputs; stereo content is most likely, but more views might also be available. In order to support a wide range of autostereoscopic displays, it should be possible for a large number of views to be generated from this data format. Additionally, the rate required for transmitting the 3DV format should be fixed to the distribution constraints; that is, there should not be an increase in the rate simply because the display requires a higher number of views to cover a larger viewing angle. In this way, the transmission rate and the number of output views are decoupled. Advanced stereoscopic processing that requires view generation at the display would also be supported by this format [33].

Target of 3D video format for ongoing MPEG standardization initiatives.

Compared to the existing coding formats, the 3DV format has several advantages in terms of bit rate and 3D rendering capabilities; this is also illustrated in Fig. B3.2 [33].

  • 2D + depth, as specified by ISO/IEC 23002-3, is only capable of rendering a limited depth range since it does not directly handle occlusions. The 3DV format is expected to enhance the 3D rendering capabilities beyond this format.
  • MVC is more efficient than simulcast but the rate of MVC encoded video is proportional to the number of views. The 3DV format is expected to significantly reduce the bitrate needed to generate the required views at the receiver.

Illustration of 3D rendering capability versus bit rate for different formats.

Scalable Video Coding (SVC)

The concept of the SVC scheme is to enable the encoding of a video stream that contains one or several subset bitstreams of a lower spatial or temporal resolution (that is, a lower-quality video signal), each separately or in combination, compared to the bitstream from which they are derived (a subset bitstream is typically derived by dropping packets from the larger bitstream). Each subset bitstream can itself be decoded with a complexity and reconstruction quality comparable to that achieved by existing coders (e.g., H.264/MPEG-4 AVC) using the same quantity of data as in the subset bitstream. The SVC project was undertaken under the auspices of the JVT of the ISO/IEC MPEG and the ITU-T VCEG. In January 2005, MPEG and VCEG agreed to develop a standard for SVC as an amendment to the H.264/MPEG-4 AVC standard; the work was completed in 2008. SVC is now an extension, Annex G, of the H.264/MPEG-4 AVC video compression standard.

A subset bitstream may encompass a lower temporal or spatial resolution (or possibly a lower-quality video signal) as compared to the bitstream it is derived from. The corresponding scalability modes are as follows:

  • Temporal (Frame Rate) Scalability: the motion compensation dependencies are structured so that complete pictures (specifically packets associated with these pictures) can be dropped from the bitstream. (Temporal scalability is already available in H.264/MPEG-4 AVC but SVC provides supplemental information to improve its usage.)
  • Spatial (Picture Size) Scalability: video is coded at multiple spatial resolutions. The data and decoded samples of lower resolutions can be used to predict data or samples of higher resolutions in order to reduce the bitrate to code the higher resolutions.
  • Quality Scalability: video is coded at a single spatial resolution but at different qualities. In this case the data and samples of lower qualities can be utilized to predict data or samples of higher qualities—this is done in order to reduce the bitrate required to code the higher qualities.
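The temporal scalability bullet above can be illustrated with a minimal sketch. It assumes a dyadic hierarchical prediction structure in which every frame belongs to a temporal layer and a lower-frame-rate substream is obtained by dropping all frames above a chosen layer. The function names (`temporal_id`, `extract_substream`) and the group size of 8 are hypothetical choices made for illustration; they are not taken from the standard.

```python
def temporal_id(frame_index: int, gop_size: int = 8) -> int:
    """Temporal layer of a frame in an assumed dyadic hierarchy.

    Layer 0 holds frames at multiples of gop_size; each additional
    layer doubles the frame rate.
    """
    if gop_size < 1 or gop_size & (gop_size - 1):
        raise ValueError("gop_size must be a power of two")
    pos = frame_index % gop_size
    if pos == 0:
        return 0
    layers = gop_size.bit_length() - 1              # log2(gop_size)
    trailing_zeros = (pos & -pos).bit_length() - 1  # largest power of two dividing pos
    return layers - trailing_zeros


def extract_substream(frame_indices, max_tid, gop_size=8):
    """Drop every frame whose temporal layer is above max_tid."""
    return [i for i in frame_indices if temporal_id(i, gop_size) <= max_tid]


# Sixteen frames: keeping layers 0..1 leaves a quarter of the frame rate.
quarter_rate = extract_substream(range(16), max_tid=1)
```

Because frames in a given layer are, in this assumed structure, predicted only from frames in the same or lower layers, the retained frames remain decodable after the higher layers are dropped.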

Products supporting the standard (e.g., for video conferencing) started to appear in 2008.

Scalable Multi-View Video Coding (SMVC)

Although there are many approaches published on SVC and MVC, there is no current work reported on scalable multi-view video coding (SMVC). SMVC can be used for transport of multi-view video over IP for interactive 3DTV by dynamic adaptive combination of temporal, spatial, and SNR scalability according to network conditions [38].


Table B3.1, based on Ref. [39], indicates how the “better-known” compression algorithms can be applied and what some of the trade-offs in quality are (this study was done in the context of mobile delivery of 3DTV, but the concepts are similar in general). In this study, four methods for transmission and compression/coding of stereo video content were analyzed. Subjective ratings show that the mixed resolution approach and the video plus depth approach do not impair video quality at high bitrates; at low bitrates simulcast transmission is outperformed by the other methods. Objective quality metrics, utilizing the blurred or rendered view from uncompressed data as reference, can be used for optimization of single methods (they cannot be used for comparison of methods since they have a positive or negative bias). Further research on individual methods will include combinations such as inter-view prediction for mixed resolution coding and depth representation at reduced resolution.

In conclusion, the V + D format is considered by researchers to be a good candidate to represent stereoscopic video that is suitable for most of the 3D displays currently available; MV + D (and the MVC standard) can be used for holographic displays and for FVV, where the user, as noted, can interactively select his or her viewpoint and where the view is then synthesized from the closest spatially located captured views [40]. However, for the initial deployment one will likely see (in order of likelihood):

  • spatial compression in conjunction with MPEG-4/AVC;
  • H.264/AVC stereo SEI message;
  • MVC, which is an H.264/MPEG-4 AVC extension.

Application of Compression Algorithms


Short-term Approach for Signal Representation and Compression

In summary, stereoscopic 3D will be used in the short term. Broadcasters appear to be rallying around top/bottom spatial compression; however, trials are still ongoing. Other approaches involve some form of compression including checkerboard (quincunx filters), side-by-side, or interleaved rows or columns [30]. Spatial compression can operate on the same channel capacity as an existing TV channel but with a compromise in resolution. Stereoscopic 3D is the de facto standard from 3D cinema; note that this approach is directly usable for glasses-based displays, but it does not allow for scaling of depth. It is also not usable for non-glasses-based displays [29]. (Preferably, a 3D representation format must be generic for all display types, both stereoscopic displays and multi-view displays; the long-term approaches we listed above will support that goal.)

For compression, one of the following four may find use in the short term: (i) ITU-T Rec. H.262/ISO/IEC 13818-2 MPEG-2 Video (MVP); (ii) H.264/AVC with SEI; (iii) H.264/AVC applied to each view independently; or (iv) the MVC extension of H.264/AVC (Amendment 4).

More Advanced Methods

Other methods have been discussed in the industry, known generally as 2D in conjunction with metadata (2D + M). The basic concept here is to transmit 2D images and to capture the stereoscopic data from the “other eye” image in the form of an additional package, the metadata; the metadata is transmitted as part of the video stream (Fig. 3.12). This approach is consistent with MPEG multiplexing; therefore, to a degree, it is compatible with embedded systems. The requirement to transmit the metadata increases the bandwidth needed in the channel: the added bandwidth ranges from 60% to 80% depending on quality goals and techniques used. As implied, a set-top box employed in a traditional 2D environment would be able to use the 2D content, ignoring the metadata, and properly display the 2D image; in a 3D environment the set-top box would be able to render the 3D signal.

Some variations of this scheme have already appeared. One approach is to capture a delta file that represents the difference between the left and right images.

2D in conjunction with metadata.

A delta file is usually smaller than the raw file because of intrinsic redundancies. The delta file is then transmitted as metadata. Companies such as Panasonic and TDVision use this approach. This approach can also be used for stored media. For example, Panasonic has advanced (and the Blu-ray Disc Association is studying) the use of metadata to achieve a full-resolution 3D Blu-ray Disc standard. A resolution of 1920 × 1080p at 24 fps per eye is achievable. This standard would make Blu-ray Disc a high-quality 3D content (storage) system. The goal was to agree to the standard by early 2010 and have 3D Blu-ray Disc players emerge by the end-of-year shopping season 2010. Another approach entails transmitting the 2D image in conjunction with a depth map of each scene.
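The delta-file idea can be sketched in a few lines. The sketch assumes tiny grayscale images held as nested lists and a plain per-pixel difference; the text does not describe how any vendor actually computes or codes its delta, so the function names and the difference operator are illustrative assumptions only.

```python
def make_delta(left, right):
    """Per-pixel signed difference (right minus left), sent as metadata."""
    return [[r - l for l, r in zip(lrow, rrow)]
            for lrow, rrow in zip(left, right)]


def apply_delta(left, delta):
    """Rebuild the right-eye image from the 2D (left) image plus the delta."""
    return [[l + d for l, d in zip(lrow, drow)]
            for lrow, drow in zip(left, delta)]


# Two rows of a scene in which an object edge sits one pixel further left
# in the right-eye view.
left = [[10, 10, 200, 200],
        [10, 10, 200, 200]]
right = [[10, 200, 200, 200],
         [10, 200, 200, 200]]

delta = make_delta(left, right)          # mostly zeros, hence compressible
rebuilt_right = apply_delta(left, delta)
```

A 2D-only receiver would display `left` and ignore `delta`; a 3D receiver would apply the delta to recover the second view. The many zero entries in the delta are what make it smaller than a second full image once it is entropy coded.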

Video Plus Depth (V + D)

As noted above, many 3DTV proposals rely on the basic concept of “stereoscopic” video, that is, the capture, transmission, and display of two separate video streams (one for the left eye and one for the right eye). More recently, specific proposals have been made for a flexible joint transmission of monoscopic color video and associated per-pixel depth information [24, 25]. The concept of V + D representation is the next notch up in complexity.

From this data representation, one or more “virtual” views of the 3D scene can then be generated in real-time at the receiver side, by means of Depth-Image-Based Rendering (DIBR) techniques [26]. A system such as this provides important features, including backwards compatibility with today’s 2D digital TV; scalability in terms of receiver complexity; and easy adaptability to a wide range of different 2D and 3D displays. DIBR is the process of synthesizing “virtual” views of a scene from still or moving color images and associated per-pixel depth information. Conceptually, this novel view generation can be understood as the following two-step process: at first, the original image points are re-projected into the 3D world, utilizing the respective depth data; thereafter, these 3D space points are projected into the image plane of a “virtual” camera that is located at the required viewing position. The concatenation of re-projection (2D to 3D) and subsequent projection (3D to 2D) is usually called 3D image warping in the Computer Graphics (CG) literature and will be derived mathematically in the following paragraph. The signal processing and data transmission chain of this kind of 3DTV concept is illustrated in Fig. 3.13; it consists of four different functional building blocks: (i) 3D content creation, (ii) 3D video coding, (iii) transmission, and (iv) “virtual” view generation and 3D display.
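The two-step warping described above can be written compactly under a standard pinhole-camera model. The notation is an assumption introduced here, not taken from the text: K and K' are the intrinsic matrices of the original and virtual cameras, R and t are the rotation and translation that map original-camera coordinates to virtual-camera coordinates, m = (u, v, 1)^T is a pixel in homogeneous coordinates, Z is its depth, and f is the focal length in pixels.

```latex
% Step 1: re-project the pixel m with depth Z into 3D space
\mathbf{M} = Z\,\mathbf{K}^{-1}\mathbf{m}

% Step 2: project the 3D point into the virtual camera
Z'\,\mathbf{m}' = \mathbf{K}'\left(\mathbf{R}\,\mathbf{M} + \mathbf{t}\right)
               = Z\,\mathbf{K}'\,\mathbf{R}\,\mathbf{K}^{-1}\mathbf{m} + \mathbf{K}'\,\mathbf{t}

% Special case: parallel cameras, R = I, K' = K, t = (t_x, 0, 0)^T
Z' = Z, \qquad u' = u + \frac{f\,t_x}{Z}, \qquad v' = v
```

In the parallel-camera case the warp reduces to a horizontal shift that is inversely proportional to depth, which is why near objects move more than distant ones between the two views. The sign of the shift depends on the direction in which the virtual camera is displaced.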

As can be seen in Fig. 3.14, a video signal and a per-pixel depth map are captured and eventually transmitted to the viewer. The per-pixel depth data can be considered a monochromatic luminance signal with a restricted range spanning the interval [Znear, Zfar], representing, respectively, the minimum and maximum distance of the corresponding 3D point from the camera. The depth range is quantized with 8 bits, with the closest point having the value 255 and the most distant point having the value 0. Effectively, the depth map is specified as a grayscale image; these values can be supplied into the luminance channel of a video signal and the chrominance can be set to a constant value. In summary, this representation uses a regular video stream enriched with so-called depth maps providing a Z-value for each pixel. Note that V + D enjoys backward compatibility because a 2D receiver will display only the V portion of the V + D signal. Studies by the European ATTEST (Advanced Three Dimensional Television System Technologies) project indicate that depth data can be compressed very efficiently and still be of good quality; namely, that it needs only around 20% of the bitrate that would otherwise be needed to encode the color video (the qualitative results were confirmed by means of subjective testing). This approach can be placed in the category of Depth-Enhanced Stereo (DES).

Depth-image-based rendering (DIBR) system.

Video plus depth (V + D) representation for 3D video.

Regeneration of stereo video from V + D signals.

A stereo pair can be rendered from the V + D information by 3D warping at the decoder. A general warping algorithm takes a layer and deforms it in many ways: for example, it twists the layer along any axis, bends it around itself, or adds arbitrary dimension with a displacement map. The generation of the stereo pair from a V + D signal at the decoder is illustrated in Fig. 3.15. This reconstruction affords extended functionality compared to CSV because the stereo image can be adjusted and customized after transmission. Note that in principle, more than two views can be generated at the decoder, thus enabling support of multi-view displays (and head motion parallax viewing within reason).
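As a minimal sketch of how a decoder might render the second view from a V + D signal, the following code warps a single scanline. It assumes parallel cameras, so the warp is a purely horizontal shift, and it uses the 8-bit convention stated in the text (255 is nearest, 0 is farthest). The linear mapping from code value to metric depth is an assumption of this sketch (inverse-depth mappings are also in use), and `focal_px` and `baseline` are hypothetical parameters.

```python
def depth_from_code(code, z_near, z_far):
    """Map an 8-bit depth code to depth (255 = z_near, 0 = z_far).

    The linear mapping is an assumption made for this sketch.
    """
    return z_far - (code / 255.0) * (z_far - z_near)


def render_scanline(color, depth_codes, z_near, z_far, focal_px, baseline):
    """Warp one scanline to a virtual camera displaced to the right by `baseline`.

    Positions that no source pixel maps to stay None: these are the
    disocclusion holes that a real renderer would have to fill.
    """
    width = len(color)
    out = [None] * width
    out_z = [float("inf")] * width
    for u, (pixel, code) in enumerate(zip(color, depth_codes)):
        z = depth_from_code(code, z_near, z_far)
        disparity = focal_px * baseline / z          # in pixels
        u_new = int(round(u - disparity))
        if 0 <= u_new < width and z < out_z[u_new]:  # the nearer pixel wins
            out[u_new] = pixel
            out_z[u_new] = z
    return out


# A near object ('c', 'd') in front of a far background.
virtual_view = render_scanline(
    color=list("abcdef"),
    depth_codes=[0, 0, 255, 255, 0, 0],
    z_near=1.0, z_far=10.0, focal_px=10.0, baseline=0.2,
)
```

In this example the near object shifts by two pixels and covers part of the background, while two positions behind its original location are left empty. Those holes grow as the virtual camera moves further from the original one, which is why artifacts increase with the distance of the virtual viewpoint.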

V + D enjoys backwards compatibility, compression efficiency, extended functionality, and the ability to use existing coding algorithms. It is only necessary to specify high-level syntax that allows a decoder to interpret two incoming video streams correctly as color and depth. The specifications “ISO/IEC 23002-3 Representation of Auxiliary Video and Supplemental Information” and “ISO/IEC 13818-1:2003 Carriage of Auxiliary Data” enable 3D video-based V + D to be deployed in a standardized fashion by broadcasters interested in adopting this method.

It should be noted, however, that the advantages of V + D over CSV entail increased complexity for both sender and receiver. At the receiver side, view synthesis has to be performed after decoding to generate the second view of the stereo pair. At the sender (capture) side, the depth data have to be generated before encoding can take place. This is usually done by depth/disparity estimation from a captured stereo pair; these algorithms are complex and still error prone. Thus in the near future, V + D might be more suitable for applications with playback functionality, where depth estimation can be performed offline on powerful machines, for example in a production studio or home 3D editing suite, enabling viewing of downloaded 3D video clips and 3DTV broadcasting [16].

Multi-View Video Plus Depth (MV + D)

There are some advanced 3D video applications that are not properly supported by any existing standards and where work by the ITU-R or ISO/MPEG is needed. Two such applications are given below:

  • wide range multi-view autostereoscopic displays (say, nine or more views);
  • FVV (an environment where the user can choose his or her own viewpoint).

These 3D video applications require a 3D video format that allows rendering a continuum and/or large number of output views at the decoder. There really are no available alternatives: MVC, discussed above, does not support a continuum and becomes inefficient for a large number of views; and we noted that V + D could in principle generate more than two views at the decoder, but in practice it supports only a limited continuum around the original view (artifacts increase significantly with the distance of the virtual viewpoint). In response, MPEG started an activity to develop a new 3D video standard that would support these requirements.

The MV + D concept is illustrated in Fig. 3.16. MV + D involves a number of complex processing steps where (i) depth has to be estimated for the N views at the capture point, and then (ii) N color with N depth video streams have to be encoded and transmitted. At the receiver, the data have to be decoded and the virtual views have to be rendered (reconstructed).

Multi-view video plus depth (MV + D) concept.

As was implied just above, MV + D can be used to support multi-view autostereoscopic displays in a relatively efficient manner. Consider a display that supports nine views (V1–V9) simultaneously (e.g., with a lenticular display manufactured by Philips; Fig. 3.17). From a specific position a viewer can see only a stereo pair of views, depending on the viewer’s position. Transmitting nine display views directly (e.g., by using MVC) would be taxing from a bandwidth perspective; in this illustrative example only three original views (views V1, V5, and V9) along with corresponding depth maps D1, D5, and D9 are in the decoded stream; the remaining views can be synthesized from these decoded data by using DIBR techniques.

Multi-view autostereoscopic displays based on MV + D.
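A rough back-of-the-envelope sketch shows why transmitting three views plus depth is attractive in the nine-view example. It assumes that every color view costs the same bitrate, takes the roughly 20% depth-to-color ratio reported for V + D earlier in this chapter, and compares against nine independently coded views. That baseline is an assumption: MVC inter-view prediction would lower it, so the ratio below is indicative only.

```python
def relative_rate_mvd(num_transmitted_views, depth_fraction=0.20):
    """Rate of an MV + D stream, in units of one color-view bitrate.

    depth_fraction is the assumed cost of a depth map relative to
    its color view.
    """
    return num_transmitted_views * (1.0 + depth_fraction)


display_views = 9
independent_rate = float(display_views)  # nine color views, no inter-view prediction
mvd_rate = relative_rate_mvd(3)          # V1, V5, V9 plus D1, D5, D9
ratio = mvd_rate / independent_rate      # about 0.4 under these assumptions
```

The point of the sketch is that the MV + D rate depends on the number of transmitted views, not on the number of views the display renders.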

Layered Depth Video (LDV)

LDV is a derivative of, and also an alternative to, MV + D. LDV is believed to be more efficient than MV + D because less information has to be transmitted; however, additional error-prone vision processing tasks are required that operate on partially unreliable depth data. These efficiency assessments remain to be fully validated as of press time.

LDV uses (i) one color video with an associated depth map and (ii) a background layer with an associated depth map; the background layer includes image content that is covered by foreground objects in the main layer. This is illustrated in Figs. 3.18 and 3.19. The occlusion information is constructed by warping two or more neighboring V + D views from the MV + D representation onto a defined center view. The LDV stream or substreams can then be encoded by a suitable LDV coding profile.

Layered depth video (LDV) concept.

Layered depth video (LDV) example.

Note that LDV can be generated from MV + D by warping the main layer image onto other contributing input images (e.g., an additional left and right view). By subtraction, it is then determined which parts of the other contributing input images are covered in the main layer image; these are then assigned as residual images and transmitted while the rest is omitted [16].

Figure 3.18 is based on a recent presentation at the 3D Media Workshop, Heinrich Hertz Institut (HHI) Berlin, October 15–16, 2009 [27, 28]. LDV provides a single view with depth and occlusion information. The goal is to achieve automatic acquisition of 3DTV content, especially to obtain depth and occlusion information from video and to extrapolate a new view without error.

Table 3.2, composed from technical details in Ref. [29], provides a summary of the issues associated with the various representation methods.

Summary of Formats

3D Mastering Methods

For the purpose of this discussion we define a mastering method as the mechanism used for representing a 3D scene in the video stream that will be compressed, stored, and/or transmitted. Mastering standards are typically used in this process.

As alluded to earlier, a 3D mastering standard called “3D Master” is being defined by SMPTE. The high-resolution 3D master file is one that is used to generate other files appropriate for various channels; for example, theater release, media (DVD, Blu-ray Disc) release, and broadcast (e.g., satellite, terrestrial broadcast, cable TV, IPTV, and/or Internet distribution). The 3D Master is comprised of two uncompressed files (left- and right-eye files), each of which has the same file size as a 2D video stream. Formatting and encoding procedures have been developed to be used in conjunction with already-established techniques, to deliver 3D programming to the home over a number of distribution channels.

In addition to normal video encoding, 3D mastering/transmission requires additional encoding/compression, particularly when attempting to use legacy delivery channels. Additional encoding schemes for CSV include the following [6]: (i) spatial compression and (ii) temporal multiplexing.

Frame Mastering for Conventional Stereo Video (CSV)

CSV is the most well-developed and the simplest 3D video representation. This approach deals only with (color) pixels of the video frames captured by the two cameras. The video signals are intended to be directly displayed using a 3D display system. Figure 3.5 shows an example of a stereo image pair: the same scene is visible from slightly different viewpoints. The 3D display system ensures that a viewer sees only the left view with the left eye and the right view with the right eye to create a 3D depth impression. Compared to the other 3D video formats, the algorithms associated with CSV are the least complex.

A straightforward way to utilize existing video codecs (and infrastructure) for stereo video transmission is to apply one of the interleaving approaches illustrated in Fig. 3.6. A practical challenge is that there is no de facto industry standard available (so that any downstream decoder knows what kind of interleaving was used by the encoder). However, there is an industry movement toward using an over/under approach (also called top/bottom spatial compression).

A stereo image pair. (Note: Difference in left-eye/right-eye views is greatly exaggerated in this and pictures that follow for pedagogical purposes.)

Stereo interleaving formats: (a) time multiplexed frames; (b) spatial multiplexed as side-by-side; and (c) spatial multiplexed as over/under.

Spatial Compression. When an operator seeks to deliver 3D content over a standard video distribution infrastructure, spatial compression is a common solution. Spatial compression allows the operator to deliver a stereo 3D signal (now called frame-compatible) over a 2D HD video signal making use of the same amount of channel bandwidth. Clearly, this entails a loss of resolution (for both the left and the right eye). The approach is to pack two images into a single frame of video; the receiving device (e.g., set-top box) will, in turn, display the content in such a manner that a 3D effect is perceived (these images cannot be viewed in a standard 2D TV monitor). There are a number of ways of combining two frames; the two most common are the side-by-side combination and the over/under combination. As can be seen there, the two images are reformatted at the compression/mastering point to fit into that standard frame. The combined frame is then compressed by standard methods and delivered to a 3D-compatible TV, where it is reformatted/rendered for 3D viewing.

The question is how to take two frames, a left frame and a right frame, and reformat them to fit side-by-side or over/under in a single standard HD frame. Sampling is involved but, as noted, with some loss of resolution (50% to be exact). One approach is to take alternate columns of pixels from each image and then pack the retained columns in the side-by-side format. Another approach is to take alternate rows of pixels from each image and then pack the retained rows in the over/under format (Fig. 3.7).
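The two packing approaches just described can be sketched as follows. The sketch assumes images held as nested lists and plain decimation that keeps the even-indexed columns or rows; a real encoder would normally low-pass filter before discarding samples, and the choice of which columns or rows to keep is arbitrary here.

```python
def pack_side_by_side(left, right):
    """Keep alternate columns of each view and place them side by side."""
    return [lrow[0::2] + rrow[0::2] for lrow, rrow in zip(left, right)]


def pack_over_under(left, right):
    """Keep alternate rows of each view and stack the left view over the right."""
    return left[0::2] + right[0::2]


left = [[1, 2, 3, 4],
        [5, 6, 7, 8]]
right = [[11, 12, 13, 14],
         [15, 16, 17, 18]]

sbs = pack_side_by_side(left, right)   # each view loses half its columns
ou = pack_over_under(left, right)      # each view loses half its rows
```

In both cases the packed frame has the same dimensions as one original frame, which is what allows it to travel through an unmodified 2D HD distribution chain.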

Studies have shown that the eye is less sensitive to loss of resolution along a diagonal direction in an image than in the horizontal or vertical direction. This allows the development of encoders that optimize subjective quality by sampling each image in a diagonal direction. Other encoding schemes are also being developed to attempt to retain as much of the perceived/real resolution as possible. One approach that has been studied for 3D is quincunx filtering. A quincunx is a geometric pattern comprised of five coplanar points, four of them forming a square (or rectangle) and a fifth point at its center, like a checkerboard. Quincunx filter banks are 2D two-channel nonseparable filter banks that have been shown to be an effective tool for image coding applications. In such applications, it is desirable for the filter banks to have perfect reconstruction, linear phase, high coding gain, good frequency selectivity, and certain vanishing-moment properties [7–12]. Almost all hardware devices for digital image acquisition and output use square pixel grids. For this reason and for ease of computation, all current image compression algorithms (with the exception of mosaic image compression for single-sensor cameras) operate on square pixel grids. The optimal sampling scheme in the two-dimensional image space is claimed to be the hexagonal lattice; unfortunately, a hexagonal lattice is not straightforward in terms of hardware and software implementations. A compromise, therefore, is to use the quincunx lattice; this is a sublattice of the square lattice, as illustrated in Fig. 3.7. The quincunx lattice has a diamond tessellation that is closer to the optimal hexagonal tessellation than the square lattice is, and it can be easily generated by down-sampling conventional digital images without any hardware change. Because of this, the quincunx lattice is widely adopted by single-sensor digital cameras to sample the green channel; also, quincunx partition of an image was recently studied as a means of multiple-description coding [13]. When using quincunx filtering, the higher-quality sampled images are encoded and packaged in a standard video frame (either with the side-by-side or over/under arrangement). The encoded and reformatted images are compressed and distributed to the home using traditional means (cable, satellite, terrestrial broadcast, and so on).

Selection of pixels in (a) side-by-side, (b) over/under, and (c) quincunx approaches. (Note: Either black or white dots can comprise the lattice.)
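Selecting pixels on a quincunx lattice amounts to keeping the checkerboard positions of the square grid. The following minimal sketch assumes an image held as nested lists and omits the filtering that a quincunx filter bank would apply before sampling; the `phase` parameter, which chooses between the two complementary lattices, is a name introduced here for illustration.

```python
def quincunx_sample(image, phase=0):
    """Keep the pixels where (row + column) % 2 == phase.

    Each row retains half of its samples, so the result can be packed
    into half of a standard frame.
    """
    return [[pixel for col, pixel in enumerate(row) if (r + col) % 2 == phase]
            for r, row in enumerate(image)]


image = [[0, 1, 2, 3],
         [4, 5, 6, 7],
         [8, 9, 10, 11],
         [12, 13, 14, 15]]

black_dots = quincunx_sample(image, phase=0)
white_dots = quincunx_sample(image, phase=1)
```

The two phases correspond to the note in the figure caption: either set of dots forms a valid quincunx lattice, and together they cover the full square grid.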

Temporal Multiplexing. Temporal (time) multiplexing doubles the frame rate to 120 Hz to allow the sequential repetitive presentation of the left eye and right eye images in the normal 60-Hz time frame. This approach retains full resolution for each eye, but requires a doubling of the bandwidth and storage capacity. In some cases spatial compression is combined with time multiplexing; however, this is more typical of an in-home format and not a transmit/broadcast format. For example, Mitsubishi’s 3D DLP TV uses quincunx sampled (spatially compressed) images that are clocked at 120 Hz as input.
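Frame-sequential multiplexing can be sketched as a simple interleave of the two per-eye sequences. The left-first ordering and the eye labels are assumptions of this sketch; the text states only that the output runs at twice the per-eye rate.

```python
def time_multiplex(left_frames, right_frames, eye_rate_hz=60):
    """Interleave left and right frames and return (sequence, output rate in Hz)."""
    sequence = []
    for left, right in zip(left_frames, right_frames):
        sequence.append(("L", left))
        sequence.append(("R", right))
    return sequence, 2 * eye_rate_hz


sequence, output_rate = time_multiplex(["l0", "l1"], ["r0", "r1"])
```

Each eye still receives full-resolution frames at 60 Hz, but the combined stream carries twice as many frames, which is the source of the doubled bandwidth and storage requirement.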

Compression for Conventional Stereo Video (CSV)

Typically, compression algorithms encode and decode the multiple video signals separately, as shown in Fig. 3.8a. This is also called simulcast. The drawback is that the amount of data is increased compared to 2D video; however, reduction of resolution can be used as needed to mitigate this requirement. Table 3.1 summarizes the available methods.

The MPEG-2 standard includes an MPEG-2 Multi-View Profile (MVP) coding that allows efficiency to be increased by combining temporal/inter-view prediction, as illustrated in Fig. 3.8b. H.264/AVC was enhanced a few years ago with a stereo Supplemental Enhancement Information (SEI) message that can also be used to implement a prediction as illustrated in Fig. 3.8b. Although not designed for stereo-view video coding, the H.264 coding tools can be arranged to take advantage of the correlations between the pair of views of a stereo-view video, and provide very reliable and efficient compression performance as well as stereo/mono-view scalability [14].

For more than two views, the approach can be extended to Multi-view Video Coding (MVC) as illustrated in Fig. 3.9 [15]; MVC uses inter-view prediction by referring to the pictures obtained from the neighboring views. MVC has been standardized in the Joint Video Team (JVT) of the ITU-T Video Coding Experts Group (VCEG) and ISO/IEC MPEG. MVC enables efficient encoding of sequences captured simultaneously from multiple cameras using a single video stream. MVC is currently the most efficient approach for stereo and multi-view video coding; for two views, the performance achieved by the H.264/AVC stereo SEI message and by MVC is similar [16]. MVC is also expected to become a new MPEG video coding standard for the realization of future video applications such as 3D Video (3DV) and Free Viewpoint Video (FVV). The MVC group in the JVT has chosen the H.264/AVC-based MVC method as the MVC reference model, since this method showed better coding efficiency than H.264/AVC simulcast coding and the other methods that were submitted in response to the call for proposals made by the MPEG [15, 17–20].

Stereo video coding with combined temporal/inter-view prediction. (a) Traditional MPEG-2/MPEG-4 applied to 3DTV; (b) MPEG-2 multi-view profile and H.264/AVC SEI message.

Compression Methods

Some new approaches are also emerging and have been proposed to improve efficiency, especially for bandwidth-limited environments. One new approach uses binocular suppression theory and employs disparate image quality in the left- and right-eye views. Viewer tests have shown that (within reason), if one of the images of a stereo pair is degraded, the perceived overall quality of the stereo video will be dominated by the higher-quality image [16, 21, 22]. This concept is illustrated in Fig. 3.10. Applying this concept, one could code the right-eye image with less than the full resolution of the left eye; for example, downsampling it to half or quarter resolution (Fig. 3.11). Some call this asymmetrical quality. Studies have shown that asymmetrical coding with cross-switching at scene cuts (namely, alternating the eye that gets the more blurry image) is a viable method for bandwidth savings [23]. In principle this should provide comparable overall subjective stereo video quality while reducing the bitrate: if one were to adopt this approach, the 3D video functionality could be added with an overhead of, say, 25%–30% relative to the 2D video for coding the right view at quarter resolution.

Multi-view video coding with combined temporal/inter-view prediction.

Use of binocular suppression theory for more efficient coding.

Installing the Mobile Client Application

This section describes how Tailspin arranges for users to install the mobile client application on their Windows Phone 7 devices. Users can only install applications on their devices from the Windows Marketplace, so Tailspin must first make sure that the application is available there.

Overview of the Solution

To make it easy for users to find, download, and install the mobile client application, Tailspin wanted to provide a link to the mobile client installer from the public Tailspin website with which users may already be familiar. Tailspin provides a Windows Phone 7-friendly page at the same address as the public Tailspin website. Accessing the site with Microsoft® Internet Explorer® from a desktop device shows a list of available surveys; alternatively, accessing the site with Internet Explorer from a Windows Phone 7 device shows a link to the installer for the mobile client application.

The developers at Tailspin used a Model-View-Controller (MVC) view engine to display a different page based on the type of device making the request.

For more information about MVC, see “ASP.NET MVC 2” on MSDN® (

Inside the Implementation

Now is a good time to walk through the code that implements the Windows Phone 7 web page in more detail. As you go through this section, you may want to download the Microsoft® Visual Studio® development system solution for the Tailspin Surveys application from CodePlex (

To render different pages at the same address based on the type of device, the Tailspin web application uses the WebFormViewEngine class in the MVC namespace. The application creates a new view engine of type MobileCapableWebFormViewEngine in the Global.asax.cs file. The following code example shows the MobileCapableWebFormViewEngine class in the TailSpin.Web.Survey.Public project.


namespace TailSpin.Web.Survey.Public.Extensions
{
    using System;
    using System.Web.Mvc;

    public class MobileCapableWebFormViewEngine : WebFormViewEngine
    {
        public override ViewEngineResult FindView(
            ControllerContext controllerContext, string viewName,
            string masterName, bool useCache)
        {
            ViewEngineResult result = null;
            // Serve the mobile page to Internet Explorer Mobile 7.
            if (this.UserAgentIs(controllerContext, "IEMobile/7"))
            {
                result = new ViewEngineResult(new WebFormView(
                    "~/Views/MobileIE7/Index.aspx", string.Empty), this);
            }

            // Otherwise fall back to the default view lookup.
            if (result == null || result.View == null)
            {
                result = base.FindView(controllerContext, viewName,
                    masterName, useCache);
            }
            return result;
        }

        public bool UserAgentIs(ControllerContext controllerContext,
            string userAgentToTest)
        {
            return controllerContext.HttpContext.Request
                .UserAgent.IndexOf(userAgentToTest,
                    StringComparison.OrdinalIgnoreCase) > 0;
        }
    }
}

The FindView method checks the user agent to determine the browser type, and then it returns an appropriate ViewEngineResult instance.

For more information about creating websites for mobile devices, see the post, “Mix: Mobile Web Sites with ASP.NET MVC and the Mobile Browser Definition File,” on the blog, Scott Hanselman’s (