Packetized Elementary Stream (PES) Packets and Transport Stream (TS) Unit(s)


International Standard ISO/IEC 13818-1 was prepared by Joint Technical Committee ISO/IEC JTC 1, Information technology, Subcommittee SC 29, “Coding of audio, picture, multimedia and hypermedia information,” in collaboration with ITU-T. The identical text is published as ITU-T Rec. H.222.0. ISO/IEC 13818 consists of the following parts, under the general title “Information technology—Generic coding of moving pictures and associated audio information”:

  • Part 1: Systems
  • Part 2: Video
  • Part 3: Audio
  • Part 4: Conformance testing
  • Part 5: Software simulation
  • Part 6: Extensions for DSM-CC
  • Part 7: Advanced Audio Coding (AAC)
  • Part 9: Extension for real time interface for systems decoders
  • Part 10: Conformance extensions for Digital Storage Media Command and Control (DSM-CC)

The MPEG-2 standard, like MPEG-4, defines three layers: systems, video, and audio [24–26]. The systems layer supports synchronization and interleaving of multiple compressed streams, buffer initialization and management, and time identification. For video and audio, the information is organized into access units, each representing a fundamental unit of encoding; for example, in video, an access unit will usually be a complete encoded video frame. The audio and the video layers define the syntax and semantics of the corresponding Elementary Streams (ESs). An ES is the output of an MPEG encoder and typically contains compressed digital video, compressed digital audio, digital data, or digital control data. For video, the compression is based on the Discrete Cosine Transform (DCT). Each ES is in turn an input to an MPEG-2 processor that accumulates the data into a stream of PES packets. A PES packet typically contains an integral number of access units. Figure A4.1 shows both the multiplex structure and the Protocol Data Unit (PDU) format. A PES packet may be a fixed- or variable-sized block of up to 65,536 octets, and it includes a 6-byte protocol header.
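The 6-byte PES header mentioned above can be illustrated with a minimal sketch. It assumes the layout defined in ISO/IEC 13818-1: a 3-byte start code prefix (0x000001), a 1-byte stream identifier, and a 2-byte big-endian length field. The function names are hypothetical, and the optional header fields that follow these 6 bytes (time stamps, for example) are not handled.

```python
import struct

# Assumed layout of the fixed PES header (6 bytes):
#   3 bytes  start code prefix 0x000001
#   1 byte   stream_id (e.g., 0xE0 for a video stream)
#   2 bytes  PES_packet_length, big-endian
PES_START_CODE_PREFIX = b"\x00\x00\x01"


def build_pes_header(stream_id: int, pes_packet_length: int) -> bytes:
    """Build the fixed 6-byte PES header (hypothetical helper)."""
    if not 0 <= stream_id <= 0xFF:
        raise ValueError("stream_id must fit in one byte")
    if not 0 <= pes_packet_length <= 0xFFFF:
        raise ValueError("PES_packet_length is a 16-bit field")
    return (PES_START_CODE_PREFIX
            + bytes([stream_id])
            + struct.pack(">H", pes_packet_length))


def parse_pes_header(data: bytes) -> dict:
    """Parse the fixed 6-byte header at the start of a PES packet."""
    if len(data) < 6:
        raise ValueError("a PES header needs at least 6 bytes")
    if data[:3] != PES_START_CODE_PREFIX:
        raise ValueError("missing PES start code prefix 0x000001")
    (pes_packet_length,) = struct.unpack(">H", data[4:6])
    return {"stream_id": data[3], "pes_packet_length": pes_packet_length}
```

The 16-bit length field is what bounds the PES packet size quoted in the text.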

Combining of Packetized Elementary Streams (PES) into a TS.

PES and TS multiplexing.

As seen in the figure, and more directly in Figure A4.2, PES packets are then mapped to Transport Stream (TS) packets. Each MPEG-2 TS packet carries 184 octets of payload data prefixed by a 4-octet (32-bit) header; the resulting 188-byte packet size was originally chosen for compatibility with Asynchronous Transfer Mode (ATM) systems. These packets are the basic unit of data in a TS. They consist of a sync byte (0x47), followed by flags and a 13-bit Packet Identifier (PID). This is followed by other (some optional) transport fields; the rest of the packet consists of the payload. Figure A4.3 connects the PES and TS concepts together.
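As an illustration of the 4-octet header just described, the following sketch decodes the sync byte, the flags, and the 13-bit PID from a 188-byte TS packet. The bit positions follow the transport packet syntax of ISO/IEC 13818-1; the function name and dictionary keys are hypothetical, and the optional adaptation field is not parsed.

```python
TS_PACKET_SIZE = 188   # 4-octet header + 184 octets of payload
SYNC_BYTE = 0x47


def parse_ts_header(packet: bytes) -> dict:
    """Decode the 4-octet header of a single TS packet."""
    if len(packet) != TS_PACKET_SIZE:
        raise ValueError("a TS packet is exactly 188 bytes")
    if packet[0] != SYNC_BYTE:
        raise ValueError("sync byte 0x47 not found")
    b1, b2, b3 = packet[1], packet[2], packet[3]
    return {
        "transport_error": bool(b1 & 0x80),
        "payload_unit_start": bool(b1 & 0x40),
        "transport_priority": bool(b1 & 0x20),
        # 13-bit PID: low 5 bits of byte 1, then all 8 bits of byte 2
        "pid": ((b1 & 0x1F) << 8) | b2,
        "scrambling_control": (b3 >> 6) & 0x03,
        "adaptation_field_control": (b3 >> 4) & 0x03,
        "continuity_counter": b3 & 0x0F,
    }
```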

A sequence of PESs leads to a sequence of uniform TS packets.

The PID is a 13-bit field that identifies the stream to which the packet belongs (e.g., the PES packets corresponding to an ES), as generated by the multiplexer. Each MPEG-2 TS channel is uniquely identified by the PID value carried in the header of the fixed-length MPEG-2 TS packets. The PID allows the receiver to identify the stream to which each received packet belongs; effectively, it allows the receiver to accept or reject PES packets at a high level without burdening the receiver with extensive processing. Often one sends only one PES packet (or a part of a single PES packet) in a TS packet; in some cases, however, a given PES packet may span several TS packets, so that the majority of TS packets contain continuation data in their payloads. Each PID stream carries specific video, audio, or data information. Programs are groups of one or more PID streams that are related to each other. For example, a TS used in IPTV could contain five programs, to represent five video channels. Assume that each channel consists of one video stream, one or two audio streams, and metadata. A receiver wishing to tune to a particular “channel” has to decode the payload of the PIDs associated with its program. It can discard the contents of all other PIDs. The number of TS logical channels is limited to 8192 (the number of values a 13-bit field can take), some of which are reserved; unreserved TS logical channels may be used to carry audio, video, IP datagrams, or other data. Examples of systems using MPEG-2 include the DVB and Advanced Television Systems Committee (ATSC) standards for digital television.
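The accept-or-reject behavior of a receiver tuning to one program can be sketched as a PID filter. This is a simplified illustration: the set of wanted PIDs is assumed to be known already (in practice it would be learned from the Program Specific Information), the PID values used are arbitrary, and `make_packet` is a hypothetical helper that builds dummy packets for demonstration only.

```python
from typing import Iterator, Set

TS_PACKET_SIZE = 188
SYNC_BYTE = 0x47
MAX_PIDS = 1 << 13  # a 13-bit PID allows 8192 logical channels


def packet_pid(packet: bytes) -> int:
    """Extract the 13-bit PID from bytes 1 and 2 of a TS packet."""
    return ((packet[1] & 0x1F) << 8) | packet[2]


def filter_by_pid(stream: bytes, wanted: Set[int]) -> Iterator[bytes]:
    """Yield only the TS packets whose PID belongs to the wanted set."""
    for offset in range(0, len(stream) - TS_PACKET_SIZE + 1, TS_PACKET_SIZE):
        packet = stream[offset:offset + TS_PACKET_SIZE]
        if packet[0] != SYNC_BYTE:
            continue  # a real demultiplexer would resynchronize here
        if packet_pid(packet) in wanted:
            yield packet


def make_packet(pid: int) -> bytes:
    """Build a dummy payload-only TS packet for the given PID (illustrative)."""
    if not 0 <= pid < MAX_PIDS:
        raise ValueError("PID must fit in 13 bits")
    header = bytes([SYNC_BYTE, (pid >> 8) & 0x1F, pid & 0xFF, 0x10])
    return header + bytes(184)
```

For a program made up of a video PID 0x100 and an audio PID 0x101, `filter_by_pid(stream, {0x100, 0x101})` keeps those packets and discards everything else without inspecting any payload.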

Note 1: Ultimately an IPTV stream consists of packets of fixed size. MPEG (specifically MPEG-4) packets are aggregated into an IP packet, and the IP packet is then transmitted using IP Multicast methods. MPEG TS packets are typically encapsulated in UDP and then in IP. In turn, and (only) for interworking with existing MPEG-2 systems already deployed (e.g., satellite systems and associated ground equipment supporting DTH), this IP packet needs further encapsulation, as discussed later. Note that traditional MPEG-2 approaches make use of the PID to identify content, whereas in IPTV applications the IP Multicast address is used to identify the content; also, the latest IPTV systems make use of MPEG-4-coded PESs.
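The aggregation described in Note 1 can be sketched as follows. The text does not say how many TS packets go into one IP packet; the sketch assumes seven per UDP datagram (7 × 188 = 1316 bytes), a common choice because it fits within a 1500-byte Ethernet MTU. The multicast group, port, and function names are placeholders, not values from the text.

```python
import socket
from typing import List

TS_PACKET_SIZE = 188
TS_PACKETS_PER_DATAGRAM = 7  # assumption: 7 * 188 = 1316 bytes per datagram


def group_ts_packets(stream: bytes,
                     per_datagram: int = TS_PACKETS_PER_DATAGRAM) -> List[bytes]:
    """Split a TS byte stream into UDP payloads of whole TS packets."""
    if len(stream) % TS_PACKET_SIZE != 0:
        raise ValueError("stream must contain whole 188-byte TS packets")
    chunk = per_datagram * TS_PACKET_SIZE
    return [stream[i:i + chunk] for i in range(0, len(stream), chunk)]


def send_multicast(payloads: List[bytes],
                   group: str = "239.1.1.1",  # placeholder multicast group
                   port: int = 5004,          # placeholder UDP port
                   ttl: int = 1) -> None:
    """Send each payload as one UDP datagram to an IP multicast group."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, ttl)
        for payload in payloads:
            sock.sendto(payload, (group, port))
```

A receiver joined to that multicast group obtains the content directly, which is why the multicast address, rather than the PID, identifies the content in this setting.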

Note 2: The MPEG-2 standard defines two ways for multiplexing different elementary stream types: (i) Program Stream (PS) and (ii) Transport Stream (TS).

  • An MPEG-2 PS is principally intended for storage and retrieval from storage media. It supports grouping of video, audio, and data ESs that have a common time base. Each PS consists of only one content (TV) program. The PS is used in error-free environments; for example, DVDs use the MPEG-2 PS. A PS is a group of tightly coupled PES packets referenced to the same time base.
  • An MPEG-2 TS multiplexes multiple PESs (which may or may not have a common time base) into a single stream, along with information for synchronizing between them. At the same time, the TS segments each PES into smaller fixed-size TS packets. An entire video frame may be mapped into one PES packet. PES headers distinguish PES packets of various streams and also contain time stamp information. PESs are generated by the packetization process; the payload consists of the data bytes taken sequentially from the original ES. A TS may correspond to a single TV program; this type of TS is normally called a Single Program Transport Stream (SPTS). In most cases, one or more SPTSs are combined to form a Multiple Program Transport Stream (MPTS). This larger aggregate also contains the control information (Program Specific Information or PSI) [27].
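The segmentation of a PES packet into fixed-size TS packets can be sketched as below. This is a simplified illustration, not a complete multiplexer: it handles a single PID, omits time stamps and PSI, and pads the final packet with adaptation-field stuffing as permitted by ISO/IEC 13818-1. The function name and the example PID are hypothetical.

```python
from typing import List

TS_PAYLOAD_SIZE = 184
SYNC_BYTE = 0x47


def packetize_pes(pes: bytes, pid: int, first_cc: int = 0) -> List[bytes]:
    """Split one PES packet into 188-byte TS packets on a single PID."""
    if not 0 <= pid < (1 << 13):
        raise ValueError("PID must fit in 13 bits")
    packets = []
    cc = first_cc & 0x0F  # 4-bit continuity counter
    for offset in range(0, len(pes), TS_PAYLOAD_SIZE):
        chunk = pes[offset:offset + TS_PAYLOAD_SIZE]
        # payload_unit_start_indicator marks the packet where the PES begins
        pusi = 0x40 if offset == 0 else 0x00
        if len(chunk) == TS_PAYLOAD_SIZE:
            afc = 0x10  # payload only
            body = chunk
        else:
            afc = 0x30  # adaptation field followed by payload
            af_len = TS_PAYLOAD_SIZE - 1 - len(chunk)
            adaptation = bytes([af_len])
            if af_len > 0:
                # one flags byte (all zero), then 0xFF stuffing bytes
                adaptation += b"\x00" + b"\xff" * (af_len - 1)
            body = adaptation + chunk
        header = bytes([SYNC_BYTE,
                        pusi | ((pid >> 8) & 0x1F),
                        pid & 0xFF,
                        afc | cc])
        packets.append(header + body)
        cc = (cc + 1) & 0x0F
    return packets
```

A 512-byte PES packet, for instance, yields three TS packets: two carrying 184 payload bytes each and a third carrying the remaining 144 bytes after 40 bytes of adaptation field.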