Preparing Video

A codec is software used to encode and decode a digital video signal. Engineers try various solutions to maintain video quality while reducing the amount of data, using state-of-the-art compression algorithm design.

A large portion of your work will comprise preparing and testing various configurations.


At the time of this writing, AIR for Android supports codecs for On2 VP6, H.263 (Sorenson Spark), and H.264.

H.264, also called MPEG-4 Part 10 or AVC for Advanced Video Coding, delivers high-quality video at lower bit rates than H.263 and On2. It is more complicated to decode, however, and requires native GPU playback or a fast processor to ensure smooth playback.

H.264 supports the following profiles: Baseline, Extended, Main, and various flavors of High. Test the profiles, as not all of them work with hardware-accelerated media decoding. At the time of this writing, it appears that only Baseline uses hardware-accelerated decoding.

AAC (Advanced Audio Coding) is the audio codec generally paired with H.264. Nellymoser and Speex are supported, but do not utilize hardware decoding.

MPEG-4 (Moving Picture Experts Group) H.264 is an industry-standard video compression format. It refers to the container format, which can contain several tracks. The file synchronizes and interleaves the data. In addition to video and audio, the container includes metadata that can store information such as subtitles. It is possible to contain more than one video track, but AIR only recognizes one.


You can use Adobe Media Encoder CS5 or a third-party tool such as Sorenson Squeeze or On2 Flix to encode your video.

It is difficult to encode video for every device capacity and display size. Adobe recommends grouping devices into low-end, medium-end, and high-end groups.

If your video is embedded or attached to your application, prepare and provide only one file and use a medium-quality solution to serve all your users. If your video is served over a network, prepare multiple streams.

Gather as much information as possible from the user before selecting the video to play. The criteria are the speed of the network connection and the performance of the device.
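As a rough illustration of this selection step, the following Python sketch picks one of several prepared streams from a measured connection speed and a device flag. The stream table reuses the starting bit rates and resolutions suggested elsewhere in this section; the names `STREAMS` and `pick_stream`, the 80% bandwidth share, and the rule that low-end devices skip the highest stream are assumptions made for the example, not part of AIR.

```python
# Hypothetical stream table, ordered from most to least demanding.
STREAMS = [
    {"name": "high", "video_kbps": 800, "audio_kbps": 160, "size": (480, 320)},
    {"name": "medium", "video_kbps": 400, "audio_kbps": 128, "size": (480, 320)},
    {"name": "low", "video_kbps": 100, "audio_kbps": 32, "size": (320, 240)},
]


def pick_stream(bandwidth_kbps, low_end_device=False, usable_percent=80):
    """Return the most demanding stream that fits the usable bandwidth.

    low_end_device is an assumed flag: when set, the highest stream is
    skipped regardless of bandwidth.
    """
    usable_kbps = bandwidth_kbps * usable_percent // 100
    for stream in STREAMS:
        if low_end_device and stream["name"] == "high":
            continue
        if stream["video_kbps"] + stream["audio_kbps"] <= usable_kbps:
            return stream
    # Nothing fits: fall back to the least demanding stream.
    return STREAMS[-1]


print(pick_stream(2000)["name"])
print(pick_stream(700)["name"])
```

In a real application the bandwidth figure would come from a measurement made before playback starts, and the device flag from whatever hardware information the platform exposes.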


Containers are wrappers around video and audio tracks holding metadata. MP4 is a common wrapper for the MPEG-4 format and is widely compatible. F4V is Adobe’s own format, which builds on the open MPEG-4 standard media file format and supports H.264/AAC-based content. FLV, Adobe’s original video container file format, supports codecs such as Sorenson Spark and On2 VP6, and can include an alpha channel and additional metadata such as cue points.

Video decoding is a multithreaded operation. H.264 and AAC are decoded using hardware acceleration on mobile devices to improve frame rate and reduce battery consumption. Rendering is still done in the CPU.

Bit Rate

Bit rate is the number of bits dedicated to the video in one second (measured in kilobits per second or kbps). During the encoding process, the encoder varies the number of bits given in various portions of the video based on how complicated they are, while keeping the average as close to the bit rate you set as possible.

Because the average is calculated on the fly and is not always accurate, it is best to select the two-pass mode even though it takes longer. The first pass analyzes the video and records a statistics log; the second pass encodes the video using the log to stay as close to the desired average bit rate as possible.

Use the network connection speed as a guide for your encoding. The recommendation is to use 80% to 90% of the available bandwidth for video/audio combined, and keep the rest for network fluctuations. Try the following H.264/AAC rates as a starting point:

  • WiFi: 500 to 1,000 kbps, audio up to 160 kbps
  • 3G: 350 to 450 kbps, audio up to 128 kbps
  • 2.5G: 100 kbps, audio up to 32 kbps
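The 80% to 90% rule can be turned into a small calculation. The sketch below, in Python, splits a measured connection speed into a video budget, an audio budget, and headroom. The function name `bitrate_budget` and the use of whole percentages are choices made for this example.

```python
def bitrate_budget(bandwidth_kbps, audio_kbps, usable_percent=80):
    """Split a connection speed into video, audio, and headroom (all in kbps).

    usable_percent is the share of bandwidth spent on video and audio
    combined (80 to 90 in the text); the rest absorbs network fluctuations.
    """
    if not 0 < usable_percent <= 100:
        raise ValueError("usable_percent must be between 1 and 100")
    media_kbps = bandwidth_kbps * usable_percent // 100
    video_kbps = media_kbps - audio_kbps
    if video_kbps <= 0:
        raise ValueError("audio alone exceeds the usable bandwidth")
    return {
        "video_kbps": video_kbps,
        "audio_kbps": audio_kbps,
        "headroom_kbps": bandwidth_kbps - media_kbps,
    }


# A 600 kbps connection with 128 kbps audio leaves 352 kbps for video.
print(bitrate_budget(600, 128))
```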

Frame Rate

Reduce high frame rates whenever possible. Downsampling by an even factor guarantees a better result. For instance, a film at 30 fps can be downsampled to 15 fps; a film at 24 fps can be downsampled to 12 or 18 fps.
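One way to see why the factor matters is to list which source frames survive the reduction. The Python sketch below assumes the simplest possible approach, dropping frames without any blending; real encoders may resample differently, and `kept_frames` is a name invented for this example.

```python
def kept_frames(src_fps, dst_fps, count):
    """Return the indices of the first `count` source frames that survive
    when footage at src_fps is reduced to dst_fps by dropping frames."""
    if not 0 < dst_fps <= src_fps:
        raise ValueError("dst_fps must be positive and no higher than src_fps")
    kept = []
    for i in range(count):
        # Keep frame i when it is the first source frame to land in a new
        # output slot; slot number = floor(i * dst_fps / src_fps).
        if (i * dst_fps) // src_fps != ((i - 1) * dst_fps) // src_fps:
            kept.append(i)
    return kept


print(kept_frames(30, 15, 8))  # [0, 2, 4, 6]: every second frame
print(kept_frames(24, 18, 8))  # [0, 2, 3, 4, 6, 7]: uneven gaps
```

With a whole-number factor the surviving frames are evenly spaced. With a fractional ratio the gaps alternate, which can show up as uneven motion unless the encoder compensates.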

Do not use content encoded at a high frame rate and assume that a lower frame rate in AIR will adjust it. It will not.

If your footage was captured at a frame rate greater than 24 fps and you want to keep the existing frame rate, look at reducing other settings such as the bit rate.

If your video is the only moving content in your application, you can use a frame rate as low as 12 fps because the video plays at its native frame rate regardless of the application’s frame rate. A low frame rate reduces drain on the battery.


The pixel resolution is simply the width and height of your video. Never use a video that is larger than the intended display size. Prepare the video at the dimension you need.

High resolution has a greater impact on mobile video playback performance than bit rate. A conservative resolution of 480×360 plays very well; 640×480 is still good. A higher resolution will be challenging on most devices and will result in a poor viewing experience on devices that are not using the GPU for decoding or on devices with a 500 MHz CPU. Resolution recommendations are:

  • WiFi or 3G: 480×320
  • 2.5G: 320×240

In fact, you can often encode smaller and scale up without a noticeable decrease in picture quality. The high PPI on most devices will still display a high-quality video.

Decrease your video size by even divisors of 16. MPEG video encoders work by dividing the video frames into blocks of 16 by 16, called macroblocks. If the dimension does not divide into 16 or close to it, the encoder must do extra work and this may impact the overall encoding target. As an alternate solution, resort to multiples of eight, not four. It is an important practice to achieve maximum compression efficiency.
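The macroblock rule is easy to check before encoding. This Python sketch rounds each dimension down to the nearest multiple of the block size; the helper names are invented for the example. Each dimension is rounded independently, so the aspect ratio may shift slightly and may need a small crop to compensate.

```python
def snap(value, block=16):
    """Round value down to a multiple of block, but never below one block."""
    return max(block, (value // block) * block)


def snap_dimensions(width, height, block=16):
    """Return (width, height) aligned to the block size."""
    return snap(width, block), snap(height, block)


print(snap_dimensions(480, 360))           # (480, 352): 360 is not a multiple of 16
print(snap_dimensions(480, 360, block=8))  # (480, 360): already a multiple of 8
print(snap_dimensions(320, 240))           # (320, 240): already aligned
```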

As for all mobile content, get rid of superfluous content. If necessary, crop the video to a smaller dimension or edit its content, such as trimming a long introduction.

For more information on mobile encoding guidelines, read Adobe’s white paper at


Hardware is improving quickly, but each device’s architecture is a little different. If you want to target the high end of the market, you can add such comments when submitting your applications to the Android Market.

In addition to your encoding settings, there are some best practices to obey for optimal video playback. They are all simple to apply:

  • Do not put anything on top of or behind the video, even if it is transparent. This would need to be calculated in the rendering process and would negatively affect video playback.
  • Make sure your video window is placed on a full pixel (no half-pixel boundaries).
  • Do not use bitmap caching on the video or any other objects on the stage. Do not use filters such as drop shadows or pixel benders. Do not skew or rotate the video. Do not use color transformation or objects with alpha.
  • Do not show more than one video at the same time.
  • Stop all other processes unless they are absolutely necessary. If you use a progress bar, only call for progress update using a timer every second, not on the enter frame event.

Additional Details on Video Encoding Standards

Efficient video encoding is required for 3DTV/3DV and for FVT/FVV. 3DTV/3DV support 3D depth impression of the observed scenery, while FVT/FVV additionally allow for an interactive selection of viewpoint and direction within a certain operating range. Hence, a common feature of 3DV and FVV systems is the use of multiple views of the same scene that are transmitted to the user. Multi-view 3D video can be encoded implicitly in the V + D representation or, as is more often the case, explicitly.

In implicit coding one seeks to use (implicit) shape coding in combination with MPEG-2/MPEG-4. Implicit shape coding could mean that the shape can be easily extracted at the decoder, without explicit shape information present in the bitstream. These types of image compression schemes do not rely on the usual additive decomposition of an input image into a set of predefined spanning functions. These schemes only encode implicit properties of the image and reconstruct an estimate of the scene at the decoding end. This has particular advantages when one seeks very low bitrate perceptually oriented image compression [32]. The literature on this topic is relatively scanty. Chroma Key might be useful in this context: Chroma Key, or green screen, allows one to put a subject anywhere in a scene or environment using the Chroma Key as the background. One can then import the image into the digital editing software, extract the Chroma Key, and replace it with another image or video. Chroma Key shape coding for implicit shape coding (for medium quality shape extraction) has been proposed and also demonstrated in the recent past.

On the other hand, there are a number of strategies for explicit coding of multiview video: (i) simulcast coding, (ii) scalable simulcast coding, (iii) multi-view coding, and (iv) Scalable Multi-View Coding (SMVC).

Simulcast coding is the separate encoding (and transmission) of the two video scenes in the CSV format; clearly the bitrate will typically be in the range of double that of 2DTV. V + D is more bandwidth efficient not only in the abstract, but also in practice. At the practical level, in a V + D environment the quality of the compressed depth map is not a significant factor in the final quality of the rendered stereoscopic 3D video. This follows from the fact that the depth map is not directly viewed, but is employed to warp the 2D color image to two stereoscopic views. Studies show that the depth map can typically be compressed to 10%–20% of the color information.
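The bandwidth comparison can be made concrete with a short calculation. The Python sketch below assumes a single-view color stream at a given rate, simulcast at exactly twice that rate, and a depth map costing 10% to 20% of the color rate, as stated above. The 4,000 kbps figure and the function names are illustrative only.

```python
def simulcast_kbps(view_kbps):
    """Two independently coded views: roughly double the single-view rate."""
    return 2 * view_kbps


def video_plus_depth_kbps(color_kbps, depth_percent):
    """One color view plus a depth map coded at depth_percent of the color rate."""
    return color_kbps + color_kbps * depth_percent // 100


color = 4000  # assumed single-view rate in kbps
print(simulcast_kbps(color))             # 8000
print(video_plus_depth_kbps(color, 10))  # 4400
print(video_plus_depth_kbps(color, 20))  # 4800
```

Under these assumptions V + D needs between 55% and 60% of the simulcast rate.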

V + D (also called 2D plus depth, or 2D + depth, or color plus depth) has been standardized in MPEG as an extension for 3D filed under ISO/IEC FDIS 23002-3:2007(E). In 2007, MPEG specified a container format “ISO/IEC 23002-3 Representation of Auxiliary Video and Supplemental Information” (also known as MPEG-C Part 3) that can be utilized for V + D data. 2D + depth, as specified by ISO/IEC 23002-3 supports the inclusion of depth for generation of an increased number of views. While it has the advantage of being backward compatible with legacy devices and is agnostic of coding formats, it is capable of rendering only a limited depth range since it does not directly handle occlusions [33]. Transport of this data is defined in a separate MPEG systems specification “ISO/IEC 13818-1:2003 Carriage of Auxiliary Data.”

There is also major interest in MV + D. Applicable coding schemes of interest here include the following:

  • Multiple-view video coding (MVC)
  • Scalable Video Coding (SVC)
  • Scalable multi-view video coding (SMVC)

From a test/test-bed implementation perspective, for the first two options, each view can be independently coded using the public-domain H.264 and SVC codecs respectively. Test implementations for MVC and for preliminary implementations of an SMVC codec have been documented recently in the literature.

Multiple-View Video Coding (MVC)

It has been recognized that MVC is a key technology for a wide variety of future applications including FVV/FTV, 3DTV, immersive teleconference and surveillance, and other applications. An MPEG standard, “Multi-View Video Coding (MVC),” to support MV + D (and also V + D) encoded representation inside the MPEG-2 transport stream has been developed by the JVT of ISO/IEC MPEG and ITU-T VCEG (ISO/IEC JTC1/SC29/WG11 and ITU-T SG16 Q.6). MVC allows the construction of bitstreams that represent multiple views [34]; MVC supports efficient encoding of video sequences captured simultaneously from multiple cameras using a single video stream. MVC can be used for encoding stereoscopic (two-view) and multi-view 3DTV, and for FVV/FVT.

MVC (ISO/IEC 14496-10:2008 Amendment 1 and ITU-T Recommendation H.264) is an extension of the AVC standard that provides efficient coding of multi-view video. The encoder receives N temporally synchronized video streams and generates one bitstream. The decoder receives the bitstream, decodes and outputs the N video signals. Multi-view video contains a large amount of inter-view statistical dependencies, since all cameras capture the same scene from different viewpoints. Therefore, combined temporal and inter-view prediction is the key for efficient MVC. Also, pictures of neighboring cameras can be used for efficient prediction [35]. MVC supports the direct coding of multiple views and exploits inter-camera redundancy to reduce the bitrate. Although MVC is more efficient than simulcast, the rate of MVC encoded video is proportional to the number of views.

The MVC group in the JVT has chosen the H.264/MPEG-4 AVC-based multi-view video method as its MVC video reference model, since this method supports better coding efficiency than H.264/AVC simulcast coding. H.264/MPEG-4 AVC was developed jointly by ITU-T and ISO through the JVT in the early 2000s (the ITU-T H.264 standard and the ISO/IEC MPEG-4 AVC, ISO/IEC 14496-10-MPEG-4 Part 10 are jointly maintained to retain identical technical content). H.264 is used with Blu-ray Disc and videos from the iTunes Store. The standardization of H.264/AVC was completed in 2003, but additional extensions have taken place since then; for example, SVC as specified in Annex G of H.264/AVC added in 2007.

Owing to the increased data volume of multi-view video, highly efficient compression is needed. In addition to the redundancy exploited in 2D video for compression, the common idea for MVC is to further exploit the redundancy between adjacent views. This is because multi-view video is captured by multiple cameras at different positions and significant correlations exist between neighbor views [36]. As hinted elsewhere, there is interest in being able to synthesize novel views from the virtual cameras in multi-view camera configurations; however, the occlusion problem can significantly affect the quality of virtual view rendering [37]. Also, for FVV, the depth map quality is important because it is used to render virtual views that are further apart than with the stereoscopic case: when the views are further apart, the distortion in the depth map has a greater effect on the final rendered quality. This implies that the data rate of the depth map has to be higher than in the CSV case.

Note: Most existing MVC techniques are based on the traditional hybrid DCT-based video coding schemes. These neither fully exploit the redundancy among different views nor provide an easy way of implementation for scalabilities. In addition, all the existing MVC schemes mentioned above use DCT-based coding. A fundamental problem for DCT-based block coding is that it is not convenient to achieve scalability, which has become a more and more important feature for video coding and communications. As a research topic, wavelet-based image and video coding has been proved to be a good way to achieve both good coding performance and full scalabilities including spatial, temporal, and Signal-To-Noise Ratio (SNR) scalabilities. In the past, MVC has been included in several video coding standards such as MPEG-2 MVP, and MPEG-4 MAC (Multiple Auxiliary Component). More recently, an H.264-based MVC scheme has been developed that utilizes the multiple reference structure in H.264. Although this method does exploit the correlations between adjacent views through inter-view prediction, it has some constraints for practical applications compared to a method that uses, say, wavelets [36].

As just noted, MPEG has developed a suite of international standards to support 3D services and devices. In 2009 MPEG initiated a new phase of standardization to be completed by 2011. MPEG’s vision is a new 3DV format that goes beyond the capabilities of existing standards to enable both advanced stereoscopic display processing and improved support for autostereoscopic N-view displays, while enabling interoperable 3D services. 3DV aims to improve the rendering capability of the 2D + depth format while reducing bitrate requirements relative to simulcast and MVC. Figure B3.1 illustrates ISO MPEG’s target for the 3DV format, showing limited camera inputs and constrained rate transmission according to a distribution environment. The 3DV data format aims to be capable of rendering a large number of output views for autostereoscopic N-view displays and support advanced stereoscopic processing. Owing to limitations in the production environment, the 3DV data format is assumed to be based on limited camera inputs; stereo content is most likely, but more views might also be available. In order to support a wide range of autostereoscopic displays, it should be possible for a large number of views to be generated from this data format. Additionally, the rate required for transmitting the 3DV format should be fixed to the distribution constraints; that is, there should not be an increase in the rate simply because the display requires a higher number of views to cover a larger viewing angle. In this way, the transmission rate and the number of output views are decoupled. Advanced stereoscopic processing that requires view generation at the display would also be supported by this format [33].

Target of 3D video format for ongoing MPEG standardization initiatives.

Compared to the existing coding formats, the 3DV format has several advantages in terms of bit rate and 3D rendering capabilities; this is also illustrated in Fig. B3.2 [33].

  • 2D + depth, as specified by ISO/IEC 23002-3, is only capable of rendering a limited depth range since it does not directly handle occlusions. The 3DV format is expected to enhance the 3D rendering capabilities beyond this format.
  • MVC is more efficient than simulcast but the rate of MVC encoded video is proportional to the number of views. The 3DV format is expected to significantly reduce the bitrate needed to generate the required views at the receiver.

Illustration of 3D rendering capability versus bit rate for different formats.

Scalable Video Coding (SVC)

The concept of the SVC scheme is to enable the encoding of a video stream that contains one or several subset bitstreams of a lower spatial or temporal resolution (that is, a lower quality video signal), each separately or in combination, compared to the bitstream it is derived from. A subset bitstream is typically derived by dropping packets from the larger bitstream, and it can itself be decoded with a complexity and reconstruction quality comparable to that achieved by existing coders (e.g., H.264/MPEG-4 AVC) using the same quantity of data as in the subset bitstream. A standard for SVC was developed by the ISO MPEG Group and was completed in 2008. The SVC project was undertaken under the auspices of the JVT of the ISO/IEC MPEG and the ITU-T VCEG. In January 2005, MPEG and VCEG agreed to develop a standard for SVC, to become an amendment of the H.264/MPEG-4 AVC standard. It is now an extension, Annex G, of the H.264/MPEG-4 AVC video compression standard.

A subset bitstream may encompass a lower temporal or spatial resolution (or possibly a lower quality video signal, say with a camera of lower quality) as compared to the bitstream it is derived from.

  • Temporal (Frame Rate) Scalability: the motion compensation dependencies are structured so that complete pictures (specifically packets associated with these pictures) can be dropped from the bitstream. (Temporal scalability is already available in H.264/MPEG-4 AVC but SVC provides supplemental information to ameliorate its usage.)
  • Spatial (Picture Size) Scalability: video is coded at multiple spatial resolutions. The data and decoded samples of lower resolutions can be used to predict data or samples of higher resolutions in order to reduce the bitrate to code the higher resolutions.
  • Quality Scalability: video is coded at a single spatial resolution but at different qualities. In this case the data and samples of lower qualities can be utilized to predict data or samples of higher qualities—this is done in order to reduce the bitrate required to code the higher qualities.
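Temporal scalability is the easiest of the three to illustrate. The Python sketch below assumes a dyadic prediction hierarchy with a group of eight pictures, which is one common arrangement but not the only one the standard permits. It assigns each picture a temporal layer and extracts a lower frame rate substream by dropping the higher layers. The function names are invented for the example, and frame indices stand in for the packets a real extractor would filter.

```python
def temporal_layer(frame_index, gop_size=8):
    """Temporal layer of a picture in a dyadic hierarchy (0 = base layer).

    gop_size must be a power of two. Pictures are indexed in display order.
    """
    if gop_size < 1 or gop_size & (gop_size - 1):
        raise ValueError("gop_size must be a power of two")
    levels = gop_size.bit_length() - 1          # 8 -> 3 enhancement levels
    position = frame_index % gop_size
    if position == 0:
        return 0                                 # key picture
    trailing_zeros = (position & -position).bit_length() - 1
    return levels - trailing_zeros


def extract_substream(frame_indices, max_layer, gop_size=8):
    """Keep only the pictures whose temporal layer does not exceed max_layer."""
    return [i for i in frame_indices if temporal_layer(i, gop_size) <= max_layer]


print([temporal_layer(i) for i in range(8)])  # [0, 3, 2, 3, 1, 3, 2, 3]
print(extract_substream(range(16), 2))        # half the frame rate
print(extract_substream(range(16), 1))        # a quarter of the frame rate
```

Dropping the top layer halves the frame rate each time, and the remaining pictures still form a decodable sequence because, in this arrangement, no picture depends on a picture in a higher layer.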

Products supporting the standard (e.g., for video conferencing) started to appear in 2008.

Scalable Multi-View Video Coding (SMVC)

Although there are many approaches published on SVC and MVC, there is no current work reported on scalable multi-view video coding (SMVC). SMVC can be used for transport of multi-view video over IP for interactive 3DTV by dynamic adaptive combination of temporal, spatial, and SNR scalability according to network conditions [38].


Table B3.1 based on Ref. [39] indicates how the “better-known” compression algorithms can be applied, and what some of the trade-offs in quality are (this study was done in the context of mobile delivery of 3DTV, but the concepts are similar in general). In this study, four methods for transmission and compression/coding of stereo video content were analyzed. Subjective ratings show that the mixed resolution approach and the video plus depth approach do not impair video quality at high bitrates; at low bitrates simulcast transmission is outperformed by the other methods. Objective quality metrics, utilizing the blurred or rendered view from uncompressed data as reference, can be used for optimization of single methods (they cannot be used for comparison of methods since they have a positive or negative bias). Further research of individual methods will include combinations like inter-view prediction for mixed resolution coding and depth representation at reduced resolution.

In conclusion, the V + D format is considered by researchers to be a good candidate to represent stereoscopic video that is suitable for most of the 3D displays currently available; MV + D (and the MVC standard) can be used for holographic displays and for FVV, where the user, as noted, can interactively select his or her viewpoint and where the view is then synthesized from the closest spatially located captured views [40]. However, for the initial deployment one will likely see (in order of likelihood):

  • spatial compression in conjunction with MPEG-4/AVC;
  • H.264/AVC stereo SEI message;
  • MVC, which is an H.264/MPEG-4 AVC extension.

Application of Compression Algorithms


3DV/3DTV Stereoscopic Principles

We start this section with a few additional definitions. Stereo means “having depth, or being three-dimensional” and it describes an environment where two inputs combine to create one unified perception of three-dimensional space. Stereoscopic vision is the process where two eye views combine in the brain to create the visual perception of one 3D image; it is a by-product of good binocular vision. Stereoscopy can be defined as any technique that creates the illusion of depth of three-dimensionality in an image. Stereoscopic (literally: “solid looking”) is the term to describe a visual experience having visible depth as well as height and width. The term may refer to any experience or device that is associated with binocular depth perception. Stereoscopic 3D refers to two photographs taken from slightly different angles that appear three-dimensional when viewed together. Autostereoscopic describes 3D displays that do not require glasses to see the stereoscopic image. Stereogram is a general term for any arrangement of left-eye and right-eye views that produces a three-dimensional result that may consist of (i) a side-by-side or over-and-under pair of images; (ii) superimposed images projected onto a screen; (iii) a color-coded composite (anaglyph); (iv) lenticular images; or (v) alternate projected left-eye and right-eye images that fuse by means of the persistence of vision [10]. Stereoplexing (stereoscopic multiplexing) is a mechanism to incorporate information for the left and right perspective views into a single information channel without expansion of the bandwidth.

Basic stereoscopic camera configurations: (a) “toed-in” approach, and (b) “parallel” setup.
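A side-by-side arrangement is one simple way to carry both views in a single channel without expanding the bandwidth. The Python sketch below represents a frame as a list of rows of pixel values, keeps every second column of each view, and places the two halves next to each other, so the packed frame has the same width as one original view. This is a minimal illustration: practical systems filter before discarding columns, and the function name is invented for the example.

```python
def pack_side_by_side(left, right):
    """Pack two equally sized views into one frame of the original width.

    left, right: frames given as lists of rows, each row a list of pixels.
    Every second column of each view is kept (no filtering is applied).
    """
    if len(left) != len(right) or any(
        len(l_row) != len(r_row) for l_row, r_row in zip(left, right)
    ):
        raise ValueError("both views must have the same dimensions")
    return [l_row[::2] + r_row[::2] for l_row, r_row in zip(left, right)]


left_view = [[1, 2, 3, 4], [5, 6, 7, 8]]
right_view = [[10, 20, 30, 40], [50, 60, 70, 80]]
print(pack_side_by_side(left_view, right_view))
# [[1, 3, 10, 30], [5, 7, 50, 70]]
```

The cost of this arrangement is that each view loses half of its horizontal resolution.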

On the basis of the principles discussed above, a number of techniques for re-creating depth for the viewer of photographic or video content have been developed. A considerable amount of research has taken place during the past 30 or more years on 3D graphics and imaging; most of the research has focused on photographic techniques, computer graphics, 3D movies, and holography. (The field of imaging, including 3D imaging, relates more to the static or quasi-static capture, representation, encoding, compression, transmission, display, and storage of content such as photographs, medical images, and CAD/CAM drawings, especially for high-resolution applications; this topic is not covered here.)

Fundamentally, the technique known as “stereoscopy” has been advanced, where two pictures or scenes are shot, one for each eye, and each eye is presented with its proper picture or scene, in one fashion or another (Fig. 2.6). Stereoscopic 3D video is based on the binocular nature of human perception; to generate quality 3D content, the creator needs to control the depth and parallax of the scene, among other parameters. Depth perception is the ability to see in 3D to allow the viewer to judge the relative distances of objects; depth range is a term that applies to stereoscopic images created with cameras. As noted above, parallax is the apparent change in the position of an object when viewed from different points; namely, the visual differences in a scene when viewed from different points. A 3D display (screen) needs to generate some sort of parallax, which, in turn, creates a stereoscopic sense (Fig. 2.7). Nearby objects have a larger parallax than more distant objects when observed from different positions; because of this feature, parallax can be used to determine distances. Because the eyes of a person are in different positions on the head, they present different views simultaneously. This is the basis of stereopsis, the process by which the brain exploits the parallax due to the different views from the eye to gain depth perception and estimate distances to objects. 3D depth perception can be supported by 3D display systems that allow the viewer to receive a specific and different view for each eye; such a stereo pair of views must correspond to the human eye positions, thus enabling the brain to compute the 3D depth perception. In recent years, the main means of stereoscopic display has moved from anaglyph to polarization and shutter glasses.

Stereoscopic capture of scene to achieve 3D when scene is seen with appropriate display system. In this figure the separation between the two images is exaggerated for pedagogical reasons (in actual stereo photos the differences are very minute).

Generation of horizontal parallax for stereoscopic displays.
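The relation between parallax and distance can be sketched numerically. The Python example below assumes an idealized pair of parallel pinhole cameras, for which the horizontal disparity of a point equals the focal length (in pixels) times the camera separation, divided by the distance to the point. The model, the function names, and the sample values (a 1,000-pixel focal length and a 65 mm separation, roughly the spacing of human eyes) are assumptions made for illustration.

```python
def disparity_px(focal_length_px, baseline_m, depth_m):
    """Horizontal disparity, in pixels, of a point at depth_m meters."""
    if depth_m <= 0:
        raise ValueError("depth must be positive")
    return focal_length_px * baseline_m / depth_m


def depth_from_disparity(focal_length_px, baseline_m, disparity):
    """Invert the relation: recover distance from a measured disparity."""
    if disparity <= 0:
        raise ValueError("disparity must be positive")
    return focal_length_px * baseline_m / disparity


near = disparity_px(1000, 0.065, 1.0)   # object 1 m away
far = disparity_px(1000, 0.065, 10.0)   # object 10 m away
print(near, far)                        # the nearby object shifts far more
print(depth_from_disparity(1000, 0.065, near))
```

The nearby object produces about ten times the disparity of the distant one, which is why parallax can be used to estimate distance.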

Some basic terms and concepts related to camera management for stereoscopic filming are as follows: interaxial distance is the distance between the left- and right-eye lenses in a stereoscopic camera. Camera convergence is the term used to denote the process of adjusting the ZPS in a stereoscopic camera. ZPS defines the point(s) in 3D space that have zero parallax in the plano-stereoscopic image created; for example, with a stereoscopic camera. These points will be stereoscopically reproduced on the surface of the display screen.

Two simultaneous conventional 2D video streams are produced by a pair of cameras mimicking the two human eyes that see the environment from two slightly different angles. Simple planar 3D films are made by recording separate images for the left eye and the right eye from two cameras that are spaced a certain distance apart. The spacing chosen affects the disparity between the left-eye and the right-eye pictures, and thereby the viewer’s sense of depth. While this technique achieves depth perception, it often results in eye fatigue after watching such programming for a certain amount of time: within minutes after the onset of viewing, stereoscopy frequently causes eye fatigue and, in some, feelings similar to those experienced during motion sickness [11]. Nevertheless, the technique is widely used for (stereoscopic) photography and moviemaking, and it has been tested many times for television [12].

At the display level, one of these streams is shown to the left eye, and the other one to the right eye. Common means of separating the right-eye and left-eye views include glasses with colored transparencies, polarization filters, and shutter glasses. Polarization of light is the arrangement of beams of light into separate planes or vectors by means of polarizing filters; when two vectors are crossed at right angles, vision or light rays are obscured. In the filter-based approach, complementary filters are placed jointly over two overlapping projectors (when projectors are used; refer back to Table 1.3) and over the two corresponding eyes (i.e., anaglyph, linear or circular polarization, or the narrow-pass filtering of Infitec) [13]. Although the technology is relatively simple, the necessity of wearing glasses while viewing has often been considered a major obstacle to the wide acceptance of 3DTV. Also, there are some limitations to the approach, such as the need to retain a head orientation that works properly with the polarized light (e.g., do not bend the head 45 degrees side to side), and the need to be within a certain viewing angle. There are a number of other mechanisms to deliver binocular stereo, including barrier filters over LCDs (vertical bars act as a fence, channeling data in specific directions for the eyes).

It should be noted as we wrap up this brief overview of the HVS that individuals vary along a continuum in their ability to process stereoscopic depth information. Studies have shown that a relatively large percentage of the population experience stereodeficiencies in depth discrimination/perception if the display duration is very short, and that a certain percentage of the adult population (about 6%) has persistent deficiencies. Figure 2.8 depicts the results of a study that quantifies these observations [14]. These results indicate that certain fast-cut methods in scenes may not work for all in 3D. Object motion can also create visual problems in stereoscopic 3DTV. Figure 2.9 depicts visual discomfort that has been observed in studies [14]. At the practical level, in the context of cinematography, while new digital 3D technology has made the experience more comfortable for many, for some people with eye problems, a prolonged 3D session may result in an aching head according to ophthalmologists. Some people have very minor eye problems (e.g., a minor muscle imbalance), which the brain deals with naturally under normal circumstances; but in a 3D movie, these people are confronted with an entirely new sensory experience that translates into greater mental effort, making it easier to get a headache. Some people who do not have normal depth perception cannot see in 3D at all. People with eye muscle problems, in which the eyes are not pointed at the same object, have trouble processing 3D images.

Stereo deficiencies in some populations [14].

Visual discomfort caused by motion in a scene [14].

Headaches and nausea are cited as the main reasons 3D technology never took off. However, newer digital technology addresses many of the problems that typically caused 3D moviegoers discomfort. Some of the problems were related to the fact that the projectors were not properly aligned; systems that use a single digital projector help overcome some of the old problems [15]. However, deeper-rooted issues about stereoscopic display may continue to affect a number of viewers (these problems will be solved by future autostereoscopic systems).

The two video views required for 3DTV can be compressed using standard video compression techniques. MPEG-2 encoding is widely used in digital TV applications today, and H.264/MPEG-4 AVC is expected to be the leading video technology standard for digital video in the near future. Extensions to H.264/MPEG-4 AVC and other related standards have been developed recently to support 3DTV; other standardization work is underway. The compression gains and quality of 3DTV will vary depending on the video coding standard used. Inter-view prediction will likely improve compression efficiency compared to simulcasting (transmitting the two views end-to-end, and so requiring a doubling of the channel bandwidth). Even so, new approaches, such as asymmetric view coding, video-plus-depth, and layered video, among others, are necessary to reduce bandwidth requirements for 3DTV [16].
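The trade-off between simulcasting and inter-view prediction comes down to simple arithmetic. The following Python sketch illustrates it. The function names and the `dependent_fraction` parameter are hypothetical, introduced only for illustration, and no particular coding gain is implied.

```python
def simulcast_bitrate(view_bitrate_mbps: float) -> float:
    """Two independently coded views: the channel carries both in full."""
    return 2 * view_bitrate_mbps


def interview_bitrate(view_bitrate_mbps: float, dependent_fraction: float) -> float:
    """Base view plus a second view predicted from it.

    dependent_fraction is an ASSUMED cost of the second view, expressed
    as a fraction of the base view; 1.0 reproduces the simulcast case.
    """
    if not 0.0 <= dependent_fraction <= 1.0:
        raise ValueError("dependent_fraction must be in [0, 1]")
    return view_bitrate_mbps * (1.0 + dependent_fraction)
```

For example, with a placeholder rate of 8 Mbps per view, `simulcast_bitrate(8.0)` gives 16.0, whereas `interview_bitrate(8.0, 0.5)` gives 12.0 if the second view is assumed to cost half as much as the base view. The actual fraction depends on the content and on the coding standard used.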

There are a number of ways to create 3D content, including: (i) Computer-Generated Imagery (CGI); (ii) stereocameras; and (iii) 2D to 3D conversions. CGI techniques are currently the most technically advanced, with well-developed methodologies (and tools) to create movies, games, and other graphical applications; the majority of cinematic 3D content consists of animated movies created with CGI. Camera-based 3D is more challenging. A 2-camera approach is typical at this time; another approach is to use a 2D camera in conjunction with a depth-mapping system. With the 2-camera approach, the two cameras are assembled with a spatial separation that mimics that of the eyes viewing a scene. The technical issues relate to focus/focal length: these have to be matched precisely to avoid differences in vertical and horizontal alignment and/or rotational differences (lens calibration and motion control must be added to the camera lenses). 2D to 3D conversion techniques include the following:

  • object segmentation and horizontal shifting;
  • depth mapping (bandwidth-efficient multiple images and viewpoints);
  • creation of depth maps using information from 2D source images;
  • making use of human visual perception for 2D to 3D conversion;
  • creation of a surrogate depth map (e.g., gray-level intensities of a color component).
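Two of the techniques listed above, the surrogate depth map and horizontal shifting, can be combined in a few lines. The Python sketch below is a minimal illustration, not a production converter. It assumes an image stored as a list of rows of (r, g, b) tuples, and it assumes that brighter values in the chosen color component mean "nearer"; the function names, the `max_disparity` parameter, and the simple hole-filling rule are all hypothetical choices made for this example.

```python
def surrogate_depth(image, channel=0):
    """Use one color component's gray level (0-255) as a stand-in depth map.

    Returns values in [0, 1]; 1.0 is treated as nearest (an assumption).
    """
    return [[pixel[channel] / 255.0 for pixel in row] for row in image]


def shift_view(image, depth, max_disparity, direction):
    """Synthesize one eye's view by shifting pixels horizontally.

    direction is +1 or -1 (one per eye); nearer pixels shift further.
    """
    width = len(image[0])
    out = []
    for y, row in enumerate(image):
        new_row = [None] * width
        # Draw far pixels first so that nearer pixels overwrite them.
        for x in sorted(range(width), key=lambda i: depth[y][i]):
            target = x + direction * round(depth[y][x] * max_disparity)
            if 0 <= target < width:
                new_row[target] = row[x]
        # Fill uncovered positions from the nearest filled pixel on the left.
        last = None
        for x in range(width):
            if new_row[x] is None:
                new_row[x] = last
            else:
                last = new_row[x]
        # Any holes at the start of the row are filled from the right.
        nxt = None
        for x in range(width - 1, -1, -1):
            if new_row[x] is None:
                new_row[x] = nxt
            else:
                nxt = new_row[x]
        out.append(new_row)
    return out
```

Calling `shift_view` once with `direction=+1` and once with `direction=-1` yields a left/right pair from a single 2D image. The crude hole filling hints at why converted material tends to be of lower quality than natively captured 3D: regions that become visible after the shift have no source data and must be guessed.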

Conversion of 2D material is the least desirable approach, but it is perhaps the one that could generate the largest amount of content in the short term. Some note that it is “easy to create 3D content, but it is hard to create good 3D content” [17].

A practical problem relates to “insertion”. At least early on, 2D content will be inserted into a 3D channel, much the way standard-definition commercials still show up in HD content. A set-top could be programmed to automatically detect an incoming format and handle various frame packing arrangements to support 2D/3D switching for advertisements [18].
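The switching logic such a set-top would need can be sketched as follows. This Python example is illustrative only. It assumes the incoming format has already been detected and is passed in as a label, and it models a frame as a list of pixel rows with even dimensions. The function name and the three format labels are hypothetical and do not reflect the signaling of any particular standard.

```python
def unpack_frame(frame, packing):
    """Split a received frame into (left_view, right_view).

    packing is a hypothetical label for the detected format:
      "2d"           - plain 2D content, shown identically to both eyes
      "side_by_side" - left view in the left half, right view in the right half
      "top_bottom"   - left view in the top half, right view in the bottom half
    """
    if packing == "2d":
        # Same picture for both eyes, so no disparity and no depth effect.
        return frame, frame
    if packing == "side_by_side":
        half = len(frame[0]) // 2
        return [row[:half] for row in frame], [row[half:] for row in frame]
    if packing == "top_bottom":
        half = len(frame) // 2
        return frame[:half], frame[half:]
    raise ValueError(f"unknown packing: {packing!r}")
```

In the packed cases each view comes out at half resolution along one axis and would have to be rescaled before display; that step is omitted here. Treating 2D content as two identical views lets an inserted 2D advertisement pass through the same display path as the surrounding 3D program.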

In summary, and as we transition the discussion to autostereoscopic approaches (and in preparation for that discussion), we list below the highlights of the various approaches, as provided in Ref. [19] (refer back to Table 1.1 for definition of terms).

Stereoscopy is the simplest and oldest technique:

  • does not create physical duplicates of 3D light;
  • quality of resultant 3D effect is inferior;
  • lacks parallax;
  • focus and convergence mismatch;
  • misalignment is seen;
  • a “motion sickness” type of feeling (eye fatigue) is produced;
  • is the main reason for the commercial failure of 3D techniques.

Multi-view video provides some horizontal parallax:

  • still limited to a small angle (∼20–45 degrees);
  • jumping effect observed;
  • viewing discomfort similar to stereoscopy;
  • requires high-resolution display device;
  • leakage of neighboring images occurs.

Integral Imaging adds vertical parallax:

  • gets closer to an ideal light-field renderer as the number of lenses (elemental images) increases: true 3D;
  • alignment is a problem;
  • requires very high resolution devices;
  • leakage of neighboring images occurs.

Holography is superior in terms of replicating physical light distribution:

  • recording holograms is difficult;
  • very high resolution recordings are needed;
  • display techniques are quite different;
  • network transmission is anticipated to be extremely taxing.