QuickTime Generic RTP Payload Format
This dispatch is a complete description of the payload format that QuickTime uses to stream media data when a custom payload profile is undefined. The dispatch is presented in the form of an IETF RFC, although it is not, in fact, an IETF document. The format presented here is used by QuickTime 4.
RTP Payload Format for QuickTime Media Streams
Abstract
This document specifies the payload format for encapsulating QuickTime media streams
in the Realtime Transport Protocol (RTP). This specification is intended for QuickTime
media/codec types that are not already handled by other RTP payload specifications. Each
QuickTime media track within a movie is sent over a separate RTP session and
synchronized using standard RTP techniques. A dynamic payload type should be used. A
QuickTime header within the RTP payload is defined to carry the media type and other
media specific information. A packetization scheme is defined for the media data. This
specification is intended for streaming stored QuickTime movies as well as live
This document specifies the payload format for encapsulating QuickTime media streams in the Realtime Transport Protocol (RTP) [1]. RTP is a generic protocol designed to carry realtime media data along with synchronization information over a datagram protocol (mostly UDP over IP). The protocol itself does not address the encapsulation of specific media types, but instead leaves it to various profile specifications. An accompanying RTP profile document [2] contains various payload specifications to carry audio and video over RTP for conferencing applications and specifies the static payload types for various audio/video compression schemes. Other documents specify the encapsulation format used to carry specific compression schemes such as JPEG, MPEG and H.261 [3,4,5].
The QuickTime file format and architecture support an extensible set of media types and compression schemes. Many of these are not covered by the profile specifications available today. Hence, it is desirable to have an RTP encapsulation scheme that will handle all QuickTime media/codec types that are not covered by specific RTP payload types.
This specification proposes a scheme to carry QuickTime media/codec types over RTP. The
scheme specified here handles all loss-tolerant media and a few loss-intolerant media such as text.
Support for other loss-intolerant media such as MIDI and 3D will be added in the future. This
specification is intended for streaming stored QuickTime movies as well as live QuickTime
QuickTime consists of a software architecture for multimedia authoring/playback and a movie file format to store multimedia presentations. These two aspects of QuickTime are independent of each other but are often combined when referring to QuickTime. It is possible to play back or author movies in other file formats such as AVI, AIFF, etc. using QuickTime software. Similarly, it is possible to use QuickTime files independent of the software, for example, when streaming movies over the Internet. The QuickTime movie file format is specified in [6]. More information on the QuickTime software architecture can be obtained from [7,8,9].
For the purpose of this document we will mostly be concerned with streaming QuickTime content using RTP. "QuickTime content" refers to content as specified in the QuickTime movie file format specification [6]. This does not preclude live QuickTime content. We merely use the file format specification as a way to specify the format of the content.
QuickTime movie files contain the media data and synchronization information for the movie. A
movie consists of multiple tracks, each of which contains a specific media type such as video,
sound, MIDI, text, etc. Not all media types are loss-tolerant. The loss-tolerant media can be carried
over RTP/UDP in classic RTP style. This will not, however, work for loss-intolerant data. RTP
over TCP or the Realtime Streaming Protocol (RTSP) [10] are some of the options for
loss-intolerant media data. Another option is to achieve semi-reliability through redundant
transmission. This specification uses this latter option to handle QuickTime "text" media over
QuickTime has a concept of timescales. A timescale defines the number of units of time that pass
in every second of real time. Any time value has to be specified with respect to a timescale. A
QuickTime movie has a timescale associated with it. Each of the tracks (media) also has a timescale
associated with it. All of these timescales could be different. The RTP timestamp will be based
on the timescale of the track associated with the RTP session.
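As an illustration of the timescale arithmetic described above, the following Python sketch converts a time value from one timescale to another and derives a 32-bit RTP timestamp from a track time. The function names and the round-to-nearest policy are assumptions made for this example; they are not taken from the QuickTime API.

```python
from fractions import Fraction


def convert_time(value: int, from_timescale: int, to_timescale: int) -> int:
    """Re-express a time value in another timescale (units per second).

    Rounding to the nearest unit is an assumption of this sketch.
    """
    return round(Fraction(value * to_timescale, from_timescale))


def rtp_timestamp(track_time: int, initial_offset: int) -> int:
    """Map a track time (in the track's timescale) to a 32-bit RTP timestamp.

    initial_offset stands for the random starting value chosen by the sender;
    the result wraps modulo 2**32 like any RTP timestamp.
    """
    return (initial_offset + track_time) & 0xFFFFFFFF


# Half a second in a movie timescale of 600 is 500 units in a timescale of 1000.
half_second = convert_time(300, 600, 1000)
```

Using exact fractions avoids accumulating floating-point error when many sample times are converted in sequence.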
Every QuickTime media type has a sample description format associated with it. The sample
description specifies how the sample is interpreted. For example, the video media sample
description specifies the compression scheme, quality, bit depth and other such information. The
sample description may change during the life of a track.
Every QuickTime track has a number of parameters associated with it such as height, width,
transformation matrix, etc. In many cases, these are as important to the presentation as the sample
The encapsulation scheme described here requires that each QuickTime media track within a single movie be sent over a separate RTP session and be synchronized using standard RTP techniques.
The QuickTime information is carried as payload data within the RTP protocol. There is a variable length QuickTime header immediately following the RTP header. The media data is packetized and placed in the RTP packet following the QuickTime header.
The RTP packet is formatted as follows:
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
.                                                               .
.                          RTP Header                           .
.                                                               .
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
.                      QuickTime Header...                      .
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
.                    QuickTime Media Data...                    .
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
3.1 RTP Header
The format and general usage of the RTP header fields are described in [1].
The following fields of the RTP header will be used as specified below:
- The payload type should be one of the dynamic payload types, and should be agreed upon through some non-RTP means. If using SDP to negotiate the dynamic payload type, the dynamic payload name should be x-quicktime or x-qt. E.g.
    m=video 1234 99
    a=rtpmap:99 x-qt
- The RTP timestamp is based on the timescale specified in the QuickTime header. The timestamp encodes the sampling instant of the first media sample contained in the RTP data packet. Multiple samples may be contained in one RTP packet or a single sample may require multiple RTP packets. The packetization rules are specified in a subsequent section. If a media sample occupies more than one packet, the timestamp will be the same on all of those packets. Packets containing different samples must have different timestamps so that samples may be distinguished by the timestamp. The initial value of the timestamp is random (unpredictable) to make known-plaintext attacks on encryption more difficult; see RTP [1].
- The marker bit (M-bit) of the RTP header is set to one in the last packet of a sample and
otherwise, must be zero. If one or more samples are fully contained within an RTP packet the
M-bit must be set to one. Thus, it is possible to easily detect that a complete sample has been
received and can be decoded and presented.
The QuickTime Header is defined as follows:
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  VER  |PCK|S|Q|L|     RES     |D|    QuickTime Payload ID     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
.               QuickTime Payload Description ...               .
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
.                Sample Specific Information ...                .
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
The fields in the QuickTime Header have the following meanings:
VER: 4 bits
PCK: 2 bits
S bit: 1 bit
Q bit: 1 bit
L bit: 1 bit
RES: 7 bits
D bit: 1 bit
QuickTime Payload ID: 15 bits
QuickTime Payload Description: variable length
Sample Specific Information: variable length
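To make the bit layout of the first 32-bit word concrete, here is a minimal Python sketch that splits it into the fields listed above. It assumes network byte order with bit 0 as the most significant bit, as is conventional for diagrams of this style; the names QTHeaderWord and parse_qt_header_word are invented for this example, and the sketch does not interpret the flag bits.

```python
import struct
from typing import NamedTuple


class QTHeaderWord(NamedTuple):
    ver: int         # 4 bits
    pck: int         # 2 bits
    s: int           # 1 bit
    q: int           # 1 bit
    l: int           # 1 bit
    res: int         # 7 bits
    d: int           # 1 bit
    payload_id: int  # 15 bits


def parse_qt_header_word(payload: bytes) -> QTHeaderWord:
    """Split the first 32 bits of the RTP payload into the QuickTime header fields."""
    if len(payload) < 4:
        raise ValueError("QuickTime header needs at least 4 bytes")
    (word,) = struct.unpack("!I", payload[:4])  # big-endian is an assumption
    return QTHeaderWord(
        ver=(word >> 28) & 0xF,
        pck=(word >> 26) & 0x3,
        s=(word >> 25) & 0x1,
        q=(word >> 24) & 0x1,
        l=(word >> 23) & 0x1,
        res=(word >> 16) & 0x7F,
        d=(word >> 15) & 0x1,
        payload_id=word & 0x7FFF,
    )


# Example word: PCK = 3, Q = 1, D = 1, payload ID = 0x1234.
example = parse_qt_header_word(
    struct.pack("!I", (3 << 26) | (1 << 24) | (1 << 15) | 0x1234)
)
```

The shifts follow directly from the field widths: 4 + 2 + 1 + 1 + 1 + 7 + 1 + 15 = 32 bits.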
The QuickTime Payload Description is defined as follows:
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|K|F|A|Z|          RES          | QuickTime Payload Desc Length |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
.                QuickTime Payload Desc Data ...                .
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
The fields in the QuickTime Payload Description have the following meanings:
K bit: 1 bit
F bit: 1 bit
A bit: 1 bit
Z bit: 1 bit
RES: 12 bits
QuickTime Payload Description Length: 16 bits
QuickTime Payload Description Data: varies
The QuickTime Payload Description Data is defined as follows:
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                     QuickTime Media Type                      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                           Timescale                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
.                      QuickTime TLVs ...                       .
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
The fields in the QuickTime Payload Description Data have the following meanings:
QuickTime Media Type: 32 bits
Timescale: 32 bits
QuickTime TLVs: variable length
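The payload description layout can be read with a few lines of Python, sketched here under stated assumptions. The text above does not say whether the QuickTime Payload Description Length counts the leading 32-bit flags-and-length word, so the sketch exposes that choice as the hypothetical parameter length_includes_header. Big-endian byte order is assumed, and the media type is kept as four raw bytes.

```python
import struct
from typing import NamedTuple


class PayloadDescription(NamedTuple):
    k: int
    f: int
    a: int
    z: int
    length: int        # value of the 16-bit length field, as received
    media_type: bytes  # 32-bit QuickTime media type, kept as raw bytes
    timescale: int
    tlv_bytes: bytes   # undecoded QuickTime TLVs


def parse_payload_description(
    buf: bytes, length_includes_header: bool = True
) -> PayloadDescription:
    """Parse a payload description starting at its K/F/A/Z flags word."""
    if len(buf) < 12:
        raise ValueError("payload description shorter than its fixed fields")
    word, media_type, timescale = struct.unpack("!I4sI", buf[:12])
    length = word & 0xFFFF
    # Assumption: the length either covers the whole description (default)
    # or only the bytes after the first 32-bit word.
    end = length if length_includes_header else length + 4
    if end < 12 or end > len(buf):
        raise ValueError("payload description length out of range")
    return PayloadDescription(
        k=(word >> 31) & 1,
        f=(word >> 30) & 1,
        a=(word >> 29) & 1,
        z=(word >> 28) & 1,
        length=length,
        media_type=media_type,
        timescale=timescale,
        tlv_bytes=buf[12:end],
    )


# A description with the K bit set, media type 'vide', timescale 600, no TLVs.
desc = parse_payload_description(struct.pack("!I4sI", (1 << 31) | 12, b"vide", 600))
```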
The sample specific information is defined as follows:
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|              RES              |  Sample-Specific Info Length  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
.                      QuickTime TLVs ...                       .
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Reserved: 16 bits
Sample Specific Information Length: 16 bits
QuickTime TLVs: variable length
A QuickTime TLV is formatted as follows:
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|     QuickTime TLV Length      |      QuickTime TLV Type       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
.                    QuickTime TLV Value ...                    .
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
The fields in a QuickTime TLV have the following meanings:
QuickTime TLV Length: 16 bits
QuickTime TLV Type: 16 bits
QuickTime TLV Value: variable length
Note: Some TLVs are mandatory and must be present if the QuickTime Payload Description is being sent. Other TLVs will assume their default values if they are not sent. Any TLV not recognized by a receiver must be ignored and skipped over.
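The rule that unrecognized TLVs must be skipped can be sketched as a simple walk over the TLV bytes. This Python example is illustrative only. It assumes big-endian fields, no padding between TLVs, and, by default, that the TLV Length counts the 4-byte length/type header as well as the value; the lengths listed for some TLVs in this document suggest that, but it is not stated, so the hypothetical parameter length_includes_header lets the caller choose.

```python
import struct
from typing import Iterator, Tuple

TLV_HEADER = struct.Struct("!H2s")  # 16-bit length, 16-bit type (two characters)


def iter_tlvs(
    buf: bytes, length_includes_header: bool = True
) -> Iterator[Tuple[bytes, bytes]]:
    """Yield (type, value) pairs from a run of QuickTime TLVs."""
    offset = 0
    while offset + TLV_HEADER.size <= len(buf):
        length, tlv_type = TLV_HEADER.unpack_from(buf, offset)
        total = length if length_includes_header else length + TLV_HEADER.size
        if total < TLV_HEADER.size or offset + total > len(buf):
            raise ValueError("malformed TLV at offset %d" % offset)
        yield tlv_type, buf[offset + TLV_HEADER.size : offset + total]
        offset += total


def recognized_tlvs(buf: bytes, known_types: frozenset) -> dict:
    """Keep only TLVs whose type the receiver knows; the rest are skipped."""
    return {t: v for t, v in iter_tlvs(buf) if t in known_types}


# Three TLVs: a track ID, an unknown type 'zz', and a layer.
sample_tlvs = (
    struct.pack("!H2sI", 8, b"ti", 7)
    + struct.pack("!H2sH", 6, b"zz", 1)
    + struct.pack("!H2sh", 6, b"ly", -1)
)
parsed = recognized_tlvs(sample_tlvs, frozenset({b"ti", b"ly"}))
```

Because every TLV carries its own length, a receiver can step over a type it does not understand without knowing anything about its value.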
The currently defined TLVs are described below:
Sample Description (mandatory)
    Type: 'sd'
    Length: variable length
    Default: none
    Media-specific QuickTime sample description. The format for this TLV for each of the currently defined media types can be found in [6] (starting pg. 59).
QuickTime Atom
    Type: 'qt'
    Length: variable
    Default: not applicable
    This TLV is used to transparently send a QuickTime Atom as defined in [6] (pg. 3). For example, this can be used to send User Data Atoms, Track Reference Atoms, Track Input Map Atoms, etc. The QuickTime atoms sent depend on the media type associated with the QuickTime payload description.
Track ID
    Type: 'ti'
    Length: 8
    Default: 0
    Track ID as defined in [6] (pg. 18).
Layer
    Type: 'ly'
    Length: 6
    Default: 0
    Layer as defined in [6] (pg. 18).
Volume
    Type: 'vo'
    Length: 6
    Default: 255
    Volume as defined in [6] (pg. 18).
Matrix
    Type: 'mx'
    Length: 40
    Default: identity matrix
    Matrix as defined in [6] (pg. 18 and 77).
Translation Matrix
    Type: 'tr'
    Length: 8
    Default: identity matrix
    v, h -- two 16-bit signed numbers indicating translation values (in pixels). This TLV is sent instead of the Matrix TLV when only translation is required. Note that the order is v, then h.
Track Width
    Type: 'tw'
    Length: 8
    Default: 0
    Track Width as defined in [6] (pg. 19).
Track Height
    Type: 'th'
    Length: 8
    Default: 0
    Track Height as defined in [6] (pg. 19).
Language
    Type: 'la'
    Length: 6
    Default: 0
    Language as defined in [6] (pg. 32 and 75).
Rate
    Type: 'rt'
    Length: 8 (Fixed)
    Default: 1.0
    Rate of the media.
Graphics Mode
    Type: 'gm'
    Length: 4
    Default: 0x0040 (copy mode)
    The graphics mode of the stream. [must add where these are defined]
Op Color
    Type: 'oc'
    Length: 12 (RGBColor)
    Default: 0x8000 for red, green, and blue
    The op color to be used in conjunction with the graphics mode.
Clip Region
    Type: 'cr'
    Length: variable (RegionHandle)
    Default: no clip region
    The clip region to be applied to the visual media.
Duration (sample specific only)
    Type: 'du'
    Length: 4
    Default: unknown duration, or the natural duration of the data
    Specifies the play duration of the sample(s). For certain media types, e.g. video, this specifies the length of time the sample is to be displayed or rendered. For other media types, e.g. MIDI, this could specify an edit into the sample. See the discussion under Play Offset.
Play Offset (sample specific only)
    Type: 'po'
    Length: 4
    Default: 0
    Specifies the play offset of the sample(s), in the RTP timescale. This, combined with the duration, specifies which portion of the data should be played. For example, suppose MIDI is being streamed with a timescale of 1000. If this particular sample has a timestamp of 5000 and contains 6 seconds of data, then normally the MIDI data in that sample will be played from time 5 seconds to time 11 seconds. If this packet contains a Duration TLV of 3 seconds and no Play Offset TLV, then the data is played from time 5 seconds to time 8 seconds and the last 3 seconds of data in the sample are not played. If this packet contains a Duration TLV of 3 seconds and a Play Offset of 2 seconds, then at time 5 seconds the third second of data in the sample will start playing. The Play Offset indicates that the first two seconds of data are not played. At time 8 seconds, the sample will stop playing and the last second of data in the sample is never played.
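The interaction between Duration and Play Offset in the MIDI example can be checked with a small calculation. The Python sketch below is an illustration, not part of the format: play_window is a hypothetical helper, all inputs are in units of the track timescale, the Duration TLV is assumed to use the same timescale as the Play Offset, and the clamping of the played span to the data actually available is an assumption.

```python
from typing import Optional, Tuple


def play_window(
    timestamp: int,
    data_duration: int,
    timescale: int,
    duration: Optional[int] = None,
    play_offset: int = 0,
) -> Tuple[float, float, float, float]:
    """Return (start, stop, data_from, data_to), all in seconds.

    start/stop are presentation times; data_from/data_to delimit the
    portion of the sample's own data that is actually played.
    """
    available = max(0, data_duration - play_offset)
    played = available if duration is None else min(duration, available)
    start = timestamp / timescale
    return (
        start,
        start + played / timescale,
        play_offset / timescale,
        (play_offset + played) / timescale,
    )


# Timescale 1000, timestamp 5000, six seconds of data in the sample.
no_tlvs = play_window(5000, 6000, 1000)
duration_only = play_window(5000, 6000, 1000, duration=3000)
duration_and_offset = play_window(5000, 6000, 1000, duration=3000, play_offset=2000)
```

With no TLVs the sample plays from 5 s to 11 s; with a 3-second Duration it stops at 8 s having played data seconds 0 to 3; adding a 2-second Play Offset keeps the same presentation interval but plays data seconds 2 to 5.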
3.4 Media Data Packetization
The RTP packetization for QuickTime is designed to take into account the needs of a varied set of media types and compression schemes. Hence, 3 different packetization schemes are defined.
The following pieces of information are required at the transmission end to make packetization decisions:
- Maximum QuickTime Media Data size (MQD) that can be accommodated in a single RTP packet.
- Whether all samples for this media type are of constant size (CQS).
- Whether all samples for this media type are of constant duration (CQD).
- Sample size of all samples (when they are constant) (CSS).
- Sample size of a specific sample (SS).
Based on the above pieces of information, one of the following packetization schemes is adopted:
Scheme 1 : (CQS=true) AND (CQD=true) AND (CSS <= 0.5*MQD)
Multiple samples are packed into one RTP packet. The RTP header M-bit is set to one on all packets. The QuickTime header PCK field is set to 1.
Scheme 2: ( (CQS=false) OR (CQD=false) ) AND (SS <= 0.5*MQD)
Multiple samples are packed into the QuickTime Media Data portion of an RTP packet. The RTP header M-bit is set to one in this packet. The QuickTime header PCK field is set to 2.
The samples are packed using the format illustrated below:
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|S|          Reserved           |         Sample Length         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                       Sample Timestamp                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
.                        Sample Data ...                        .
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|S|          Reserved           |         Sample Length         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                       Sample Timestamp                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
.                        Sample Data ...                        .
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
.                            ......                             .
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
The fields in the QuickTime Media Data have the following meanings:
S bit: 1 bit
Reserved: 15 bits
Sample Length: 16 bits
Sample Timestamp: 32 bits
Sample Data: variable length
All receivers are required to handle this scheme. A transmitter may choose to not implement this scheme in which case it will default to scheme 3.
Note: This scheme leads to more efficient packing than scheme 3 for certain media/codec types. However, there is a trade-off between efficiency and losing multiple samples when a packet is lost.
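A receiver-side reading of the scheme 2 layout might look like the following Python sketch. It assumes big-endian fields, that Sample Length counts only the bytes of Sample Data, and that entries follow one another with no padding; none of these details is stated above, and the names PackedSample, pack_scheme2 and unpack_scheme2 are invented for the example.

```python
import struct
from typing import List, NamedTuple


class PackedSample(NamedTuple):
    s_bit: int
    timestamp: int
    data: bytes


# S bit plus 15 reserved bits, 16-bit Sample Length, 32-bit Sample Timestamp.
ENTRY = struct.Struct("!HHI")


def pack_scheme2(samples: List[PackedSample]) -> bytes:
    """Concatenate samples into the QuickTime Media Data of one packet."""
    return b"".join(
        ENTRY.pack(s.s_bit << 15, len(s.data), s.timestamp) + s.data
        for s in samples
    )


def unpack_scheme2(media_data: bytes) -> List[PackedSample]:
    """Split scheme 2 media data back into its samples."""
    samples = []
    offset = 0
    while offset < len(media_data):
        if offset + ENTRY.size > len(media_data):
            raise ValueError("truncated sample header")
        flags, length, timestamp = ENTRY.unpack_from(media_data, offset)
        start = offset + ENTRY.size
        if start + length > len(media_data):
            raise ValueError("truncated sample data")
        samples.append(
            PackedSample(flags >> 15, timestamp, media_data[start : start + length])
        )
        offset = start + length
    return samples


original = [PackedSample(1, 5000, b"abc"), PackedSample(0, 5040, b"defgh")]
round_trip = unpack_scheme2(pack_scheme2(original))
```

Each entry carries its own timestamp, which is what allows samples of varying duration to share a packet.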
Scheme 3: Cases not covered by schemes 1 and 2
A single sample is placed in one or more RTP packets. The RTP header M-bit is set to one in the
last packet and is otherwise set to zero. The QuickTime header PCK field is set to 3.
The packetization boundaries may be chosen intelligently to respect the
compression/decompression algorithm requirements. However, this is not a requirement. When
intelligent boundaries are not chosen, a single packet loss will lead to the entire sample being lost
in the case of multi-packet samples.
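The three conditions can be summarized as a small decision function, followed by a naive scheme 3 fragmenter. This Python sketch is only an illustration under stated assumptions: choose_scheme and fragment_sample are hypothetical names, sample_size stands for CSS when sizes are constant and SS otherwise, and the fragmenter cuts at fixed MQD boundaries rather than at codec-aware ones.

```python
from typing import List, Tuple


def choose_scheme(
    mqd: int, cqs: bool, cqd: bool, sample_size: int, scheme2_supported: bool = True
) -> int:
    """Pick packetization scheme 1, 2 or 3 from the quantities defined above."""
    fits_half = 2 * sample_size <= mqd  # same test as size <= 0.5 * MQD
    if cqs and cqd and fits_half:
        return 1
    if (not cqs or not cqd) and fits_half and scheme2_supported:
        return 2
    return 3  # everything else, including transmitters without scheme 2


def fragment_sample(sample: bytes, mqd: int) -> List[Tuple[bytes, bool]]:
    """Split one sample for scheme 3; the bool is the RTP M-bit for each packet."""
    chunks = [sample[i : i + mqd] for i in range(0, len(sample), mqd)] or [b""]
    return [(chunk, i == len(chunks) - 1) for i, chunk in enumerate(chunks)]


fragments = fragment_sample(b"x" * 3000, 1400)
```

All fragments of one sample would carry the same RTP timestamp, and only the last one has the M-bit set, so a receiver knows when the sample is complete.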
The QuickTime payload ID identifies the format of the QuickTime media data carried in an RTP session. It associates the QuickTime payload description (that is transmitted periodically) with the QuickTime media data. This identifier is an arbitrary 15-bit number that is changed every time the payload format changes. When streaming QuickTime movie tracks, the payload format changes usually when the sample description changes during the life of the track.
The following restrictions apply when picking payload IDs:
- The payload ID must be unique among all QuickTime RTP sessions originating from a given source canonical name. This is to ensure efficient mapping of payload IDs to payload descriptions using a single receiver-side table per canonical name.
- A payload ID must not be reused for a different payload description during the lifetime of the session. This allows receivers to cache the payload descriptions for the duration of the session.
An exception to the above restrictions is made when the D-bit is set to 1 in the QuickTime header. This indicates that the payload IDs might in fact be reused at some time in the future, and allows live broadcasts of arbitrarily changing QuickTime data for an indefinite amount of time. Senders must be careful to reuse the ID only when they are reasonably sure that the receiver has received a different ID since it was last used. When the D-bit is set, receivers must not cache the data associated with a QuickTime payload ID once they receive a packet with a different QuickTime payload ID.
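One way a receiver might honor these caching rules is sketched below in Python. This is an assumption-laden illustration based only on the restrictions stated in this section: the class DescriptionCache and its method on_packet are hypothetical, and payload descriptions are treated as opaque bytes.

```python
from typing import Dict, Optional, Tuple


class DescriptionCache:
    """Receiver-side table of payload descriptions, keyed by (CNAME, payload ID)."""

    def __init__(self) -> None:
        self._table: Dict[Tuple[str, int], bytes] = {}

    def on_packet(
        self,
        cname: str,
        payload_id: int,
        d_bit: bool,
        description: Optional[bytes],
    ) -> Optional[bytes]:
        """Record any description in the packet and return the one to decode with.

        Returns None when no description for this payload ID is known yet.
        """
        if d_bit:
            # IDs may be reused: forget every other ID from this source as
            # soon as a packet with a different ID arrives.
            stale = [k for k in self._table if k[0] == cname and k[1] != payload_id]
            for key in stale:
                del self._table[key]
        if description is not None:
            self._table[(cname, payload_id)] = description
        return self._table.get((cname, payload_id))


cache = DescriptionCache()
first = cache.on_packet("user@host", 1, False, b"desc-1")
later = cache.on_packet("user@host", 1, False, None)     # served from the cache
switched = cache.on_packet("user@host", 2, True, b"desc-2")
reused = cache.on_packet("user@host", 1, True, None)     # ID 1 was evicted
```

Keying the table by canonical name matches the uniqueness rule for payload IDs, so one table per source covers all of its RTP sessions.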
The basic algorithm for senders is:
The basic algorithm for receivers is:
The QuickTime payload descriptions are transmitted as part of the QuickTime header. The payload
descriptions specify the format of the QuickTime media data. The information for the specific
fields in a payload description can be found in [6]. These fields do not include all of the
information associated with a QuickTime track. For example, information on transformation
matrices, layers, etc. is not included. This information needs to be communicated through non-
The payload description must be transmitted in the first RTP packet which contains media samples that require the payload description. After the first packet, the payload description must be retransmitted at a periodic interval until the format of the media samples changes. The maximum retransmission interval should be 1 second, unless packets are being transmitted at less than 1 packet/second in which case the payload description must be transmitted with each packet.
The retransmission interval may be negotiated to an arbitrary value through non-RTP means. Note: This includes the case in which the payload descriptions are never sent over RTP, i.e. a retransmission interval of infinity. In this case the payload descriptions are communicated through some non-RTP means.
A transmitter may send an RTP packet that contains only a payload description and no QuickTime
media data. This payload description must be cached by the receiver and used to interpret data that
may arrive in the future.
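The transmission rules for payload descriptions can be condensed into a single sender-side check. The Python sketch below is a hedged illustration: must_send_description is a hypothetical helper, times are in seconds, and max_interval stands for the retransmission interval (1 second unless another value has been negotiated by non-RTP means).

```python
from typing import Optional


def must_send_description(
    now: float,
    last_sent: Optional[float],
    format_changed: bool,
    packet_interval: float,
    max_interval: float = 1.0,
) -> bool:
    """Decide whether the next RTP packet should carry the payload description."""
    if last_sent is None or format_changed:
        return True  # first packet that needs this description
    if packet_interval > max_interval:
        return True  # fewer than one packet per interval: send with every packet
    return now - last_sent >= max_interval  # periodic retransmission


decisions = [
    must_send_description(0.0, None, False, 0.02),   # first packet
    must_send_description(0.5, 0.0, False, 0.02),    # sent recently
    must_send_description(1.0, 0.0, False, 0.02),    # interval elapsed
    must_send_description(0.5, 0.0, True, 0.02),     # format changed
    must_send_description(0.5, 0.0, False, 2.0),     # slow stream
]
```

A sender that has agreed never to send descriptions over RTP would simply not call such a check at all.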
Loss-intolerant media types cannot be easily handled within the standard RTP framework. Hence, we may need to use some non-RTP techniques to transmit these media types. However, some of the media types, notably Text and Tween media, can be sent over RTP by the use of redundant transmissions. (Tween media is used to alter the characteristics of other media streams. For example, Tween samples may contain a series of values that change the volume of an audio stream.) The use of this technique is experimental.
The redundant transmission technique is one in which the RTP packet is retransmitted multiple times within the duration of the sample. The RTP packet is resent as a whole with the same RTP sequence number, timestamp and other information, i.e. it is an identical packet when seen on the wire. This technique is not bandwidth friendly when used with high bandwidth media types. Hence it will be used only with the low bandwidth media types such as "text" and "tween" media.
The rationale for using the same RTP sequence numbers in the retransmitted packets is as follows: If the sequence numbers were incremented for each of the retransmitted packets we would require an additional field to identify the duplicate samples. In the proposed scheme, the receiver can discard duplicates by simply keeping track of the sequence numbers of the packets received.
The interval between retransmissions depends on the media type and the current congestion
situation in the network. This interval can be a simple fixed interval, say 4 retransmissions equally
spaced within the duration of the sample, or it could be more complex, say exponentially
increasing intervals within the duration of the sample. This specification does not currently
recommend a preferred scheme to use for determining the retransmission interval.
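To illustrate the two kinds of spacing mentioned above and the receiver-side duplicate check, here is a short Python sketch. It is not a recommendation: the function names, the doubling factor for the exponential case, and the bounded window of remembered sequence numbers are all assumptions made for the example.

```python
from collections import deque
from typing import List


def fixed_schedule(duration: float, copies: int = 4) -> List[float]:
    """Send offsets (from the sample start) equally spaced within the duration."""
    step = duration / copies
    return [i * step for i in range(copies)]


def exponential_schedule(duration: float, copies: int = 4) -> List[float]:
    """Send offsets whose gaps double each time and end within the duration."""
    unit = duration / (2 ** copies - 1)
    return [(2 ** i - 1) * unit for i in range(copies)]


class DuplicateFilter:
    """Drop retransmitted packets by remembering recent RTP sequence numbers."""

    def __init__(self, window: int = 64) -> None:
        self._order: deque = deque()
        self._seen: set = set()
        self._window = window

    def is_new(self, seq: int) -> bool:
        if seq in self._seen:
            return False
        self._seen.add(seq)
        self._order.append(seq)
        if len(self._order) > self._window:
            self._seen.discard(self._order.popleft())
        return True


dup_filter = DuplicateFilter()
accepted = [dup_filter.is_new(seq) for seq in (10, 10, 11, 10)]
```

The window keeps the memory bounded and lets a sequence number be accepted again after the 16-bit counter wraps, provided the window is much smaller than 65536.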
The following open issues need to be resolved:
- How to handle loss-intolerant media with "key" and "update" samples? Loss-intolerant media samples can be retransmitted multiple times with fixed or variable intervals between transmissions. The samples can be classified as key samples and update samples and handled appropriately. Update samples need not be periodically retransmitted. For example, in sprite media, key samples will contain the sprite image and update samples will contain the motion vectors. In text media, by contrast, all samples will be key samples.
- What is the appropriate interval between redundant transmissions for "text" and "tween" media samples?
The authors would like to thank Joe Pallas and all the members of the QuickTime Streaming team, Jay Geagan, Andy Grignon, Sylvain Rouze and Kevin Gong for their valuable input in writing this proposal.
[1] H. Schulzrinne, et al., "RTP: A Transport Protocol for Real-Time Applications", IETF RFC 1889, January 1996.
[2] H. Schulzrinne, et al., "RTP Profile for Audio and Video Conference with Minimal Control", IETF RFC 1890, January 1996.
[3] L. Berc, et al., "RTP Payload Format for JPEG-compressed Video", IETF RFC 2035, October 1996.
[4] D. Hoffman, et al., "RTP Payload Format for MPEG1/MPEG2 Video", IETF RFC 2038, October 1996.
[5] T. Turletti, C. Huitema, "RTP Payload Format for H.261 Video Streams", IETF RFC 2032, October 1996.
[6] Apple Computer, Inc., "QuickTime File Format Specification", May 1996.
[7] Apple Computer, Inc., "Inside Macintosh: QuickTime", Addison Wesley Press.
[8] Apple Computer, Inc., "Inside Macintosh: QuickTime Components", Addison Wesley Press.
[9] Apple Computer, Inc., "QuickTime 2.5 Developer Guide", Developer Press.
[10] H. Schulzrinne, et al., "Real Time Streaming Protocol", IETF Draft ietf-mmusic-rtsp-02.txt, March 24 1994, Expires: August 20 1997.
2/23/00 - aj - First published