
Good afternoon

In summary: the monitoring system is a hardware/software complex connected non-intrusively to a number of 10-gigabit Ethernet links. It continuously "watches" the transmission of all RTP video streams present in the traffic, takes measurements at a fixed interval and stores them in a database. Reports for all cameras are then generated regularly from that database.

And what's so difficult?

While searching for a solution, several problems became apparent right away:

  • Non-intrusive connection. The monitoring system attaches to channels that are already in operation, where most of the connections (over RTSP) have long been established: the server and the client already know which ports the exchange takes place on, but we do not know this in advance. Only the RTSP protocol has a well-known port, while the UDP streams can use arbitrary ports (and, as it turned out, they often violate the SHOULD requirement of even/odd port pairing, see RFC 3550). So how do we determine that a given packet from some IP address belongs to a video stream? The BitTorrent protocol, for example, behaves similarly: the client and server agree on ports when the connection is established, and after that all the UDP traffic looks like "just a bit stream".
  • The connected links may carry more than just video streams. There can be HTTP, BitTorrent, SSH and any other protocol in use today. So the system must correctly identify the video streams and separate them from the rest of the traffic. How do we do that in real time on eight ten-gigabit links? They are, of course, usually not filled to 100%, so the total traffic is not 80 Gbit/s but about 50-60, which is still far from little.
  • Scalability. Where there are already many video streams, there will be even more, since video surveillance has long proven itself an effective tool. This means the system needs a performance margin and spare capacity for additional links.

Looking for a suitable solution...

Naturally, we tried to make the most of our own experience. By the time the decision had to be made, we already had an implementation of Ethernet packet processing on the FPGA-powered device Bercut-MX (or simply MX). With Bercut-MX we could extract the fields we needed for analysis from Ethernet packet headers. Unfortunately, we had no experience of processing such a volume of traffic with "regular" servers, so we looked at that kind of solution with some apprehension...

It would seem that we only had to apply this method to RTP packets and the problem would be solved, but MX can only process traffic; it has no means of recording and storing statistics. There is not enough memory in the FPGA to store the discovered connections (the IP-IP-port-port combinations), because a 2x10-gigabit link at the input can carry about 15 thousand video streams, and for each of them we need to "remember" the number of received packets, the number of lost packets, and so on. Moreover, searching through that amount of data at that speed, with no packets allowed to be dropped, becomes a non-trivial task.

To find a solution, we had to "dig deeper" and work out which algorithms we would use to measure quality and identify video streams.

What can be measured by the fields of an RTP packet?

From the packet description it follows that, for quality measurements, we are interested in the following RTP fields:

  • sequence number - a 16-bit counter incremented with each packet sent;
  • timestamp - for H.264 one timestamp unit is 1/90000 s (i.e. it corresponds to a 90 kHz clock);
  • marker bit (M-bit) - RFC 3550 describes it in general terms as marking "significant" events; in practice, cameras most often use it to mark the beginning of a video frame and the special packets carrying SPS/PPS information.

The sequence number quite obviously lets us determine the following stream parameters:

  • packet loss (frame loss);
  • resending the packet (duplicate);
  • changing the order of arrival (reordering);
  • camera restart (a large "gap" in the sequence).

Timestamp allows you to measure:

  • delay variation (also called jitter); for this, a 90 kHz counter has to run on the receiving side;
  • in principle, the one-way packet delay. But this requires tying the camera's clock to the timestamp, which is only possible if the camera sends sender reports (RTCP SR) - and in real life many cameras simply ignore RTCP SR (about half of the cameras we have had to work with).

The M-bit, in turn, allows us to measure the frame rate. True, the H.264 SPS/PPS packets introduce an error, since they are not video frames, but it can be eliminated by using information from the NAL unit header, which always follows the RTP header.
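To make the layout of these fields concrete, here is a minimal parsing sketch in Python (not the production FPGA or server code); it assumes an RTP packet with no header extension, and the 96..127 check for "dynamic payload type, probably H.264" is only an illustration.

import struct

def parse_rtp(packet: bytes):
    """Extract the RTP fields used for the quality measurements (RFC 3550 fixed header)."""
    if len(packet) < 12:
        return None
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", packet[:12])
    version = b0 >> 6                   # must be 2 for RTP
    csrc_count = b0 & 0x0F              # 32-bit CSRC entries following the fixed header
    marker = (b1 >> 7) & 0x01           # M-bit: cameras usually mark the start of a video frame
    payload_type = b1 & 0x7F            # PT: dynamic types typically fall into 96..127
    offset = 12 + 4 * csrc_count        # start of the payload (no extension header assumed)
    nal_type = None
    if 96 <= payload_type <= 127 and len(packet) > offset:
        nal_type = packet[offset] & 0x1F    # H.264 NAL unit type: 7 = SPS, 8 = PPS
    return {"version": version, "marker": marker, "pt": payload_type,
            "seq": seq, "timestamp": ts, "ssrc": ssrc, "nal_type": nal_type}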

Detailed measurement algorithms are beyond the scope of this article, so I will not dig into them; if you are interested, RFC 3550 contains example code for loss calculation and a formula for estimating jitter. The main conclusion is that a handful of fields from the RTP packets and NAL units is enough to measure the basic characteristics of a transport stream. Everything else plays no part in the measurements and can, and should, be discarded!
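For reference, the interarrival jitter estimator from RFC 3550 boils down to a one-line exponential filter. A sketch, assuming the arrival time has already been converted to the same 90 kHz units as the RTP timestamp:

def update_jitter(jitter: float, prev_transit: int, rtp_ts: int, arrival_ts: int):
    """One step of the RFC 3550 interarrival jitter estimate.

    rtp_ts and arrival_ts must be in the same units (90 kHz ticks for H.264 video)."""
    transit = arrival_ts - rtp_ts       # relative transit time of this packet
    d = abs(transit - prev_transit)     # D(i-1, i) from RFC 3550
    jitter += (d - jitter) / 16.0       # J(i) = J(i-1) + (|D(i-1,i)| - J(i-1)) / 16
    return jitter, transit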

How to identify RTP streams?

To keep statistics, the information obtained from the RTP header must be “attached” to a certain camera (video stream) identifier. The camera can be uniquely identified by the following parameters:

  • Source and Destination IP Addresses
  • Source and Destination Ports
  • SSRC. It is of particular importance when several streams are broadcast from one IP, i.e. in the case of a multiport encoder.

Interestingly, at first we identified cameras only by source IP and SSRC, relying on the SSRC being random; in practice it turned out that many cameras set the SSRC to a fixed value (say, 256), apparently to save resources. As a result, we had to add the ports to the camera ID, and that solved the uniqueness problem completely.

How to separate RTP packets from other traffic?

The question remains: how will Bercut-MX, having received a packet, understand that it is RTP? The RTP header has no explicit identification the way IP does, it has no checksum, and it can be carried over UDP on port numbers chosen dynamically when the connection is established. And in our case most of the connections were established long ago, so we could be waiting a very long time for them to be re-established.

To solve this problem, RFC 3550 (Appendix A.1) recommends checking the RTP version field (two bits) and the payload type (PT) field (seven bits), which for dynamic payload types falls into a small range. We found in practice that, for the set of cameras we work with, the PT fits into the range 96 to 100.

There is one more criterion - port parity - but, as practice showed, it is not always observed, so we had to abandon it.
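Put together, the pre-filter applied to every UDP payload is tiny. The same check sketched in Python (the 96..100 range is what we observed for our cameras, so treat it as a tunable parameter rather than a constant):

PT_RANGE = range(96, 101)   # payload types observed for our cameras; adjust for yours

def looks_like_rtp(udp_payload: bytes) -> bool:
    """Cheap RTP pre-filter in the spirit of RFC 3550, Appendix A.1."""
    if len(udp_payload) < 12:            # shorter than the fixed RTP header
        return False
    version = udp_payload[0] >> 6
    payload_type = udp_payload[1] & 0x7F
    return version == 2 and payload_type in PT_RANGE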

Thus, the behavior of Bercut-MX is as follows:

  1. we receive a packet and parse it into fields;
  2. if the version is 2 and the payload type is within the given limits, then we send the headers to the server.

Obviously, this approach produces false positives, since packets other than RTP can also match such simple criteria. But what matters to us is that we will definitely not miss an RTP packet, while the server will filter out the "wrong" ones.

To filter out false positives, the server registers a video traffic source only after several consecutively numbered packets have arrived (the packet carries a sequence number, after all!). If several packets arrive with consecutive numbers, that is no coincidence, and we start working with the stream. The algorithm proved very reliable.
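A minimal sketch of that registration logic; the threshold of five consecutive packets is an illustrative choice, not the value used in the real system:

CONSECUTIVE_NEEDED = 5      # hypothetical threshold before a source is trusted

candidates = {}             # stream_key -> (last_seq, consecutive_count)
confirmed = set()           # stream keys promoted to "real" video streams

def register(stream_key, seq):
    """Promote a source to a confirmed stream only after several consecutive sequence numbers."""
    if stream_key in confirmed:
        return True
    last_seq, count = candidates.get(stream_key, (None, 0))
    if last_seq is not None and seq == (last_seq + 1) & 0xFFFF:   # 16-bit wrap-around
        count += 1
    else:
        count = 1
    candidates[stream_key] = (seq, count)
    if count >= CONSECUTIVE_NEEDED:
        confirmed.add(stream_key)
        del candidates[stream_key]
        return True
    return False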

Moving on…

Having realized that not all of the information arriving in the packets is needed to measure quality and identify streams, we decided to offload all the high-load, time-critical work of receiving packets and extracting the RTP fields to Bercut-MX, that is, to the FPGA. It "finds" the video stream, parses the packet, keeps only the required fields and sends them to an ordinary server in a UDP tunnel. The server takes the measurements for each camera and saves the results to the database.

As a result, the server handles not 50-60 Gbit/s but at most 5% of it (roughly the ratio of the forwarded fields to the average packet size). That is, with 55 Gbit/s at the input of the whole system, no more than 3 Gbit/s reaches the server!

As a result, we got the following architecture:

And we got the first result in this configuration two weeks after the initial terms of reference were drawn up!

What is the end result of the server?

So what does the server do in our architecture? Its tasks are to:

  • listen on a UDP socket and read the packed header fields from it;
  • parse the incoming packets and extract the RTP header fields together with the camera identifiers;
  • correlate the received fields with those received earlier and work out whether packets were lost, whether packets were retransmitted, whether the arrival order changed, and what the packet delay variation (jitter) was, etc. (a simplified sketch of this bookkeeping follows the list);
  • record the measured data in the database with a time reference;
  • analyze the database, generate reports and send traps about critical events (high packet loss, complete loss of packets from some camera, etc.).
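The per-camera bookkeeping from the third item, reduced to a sketch; the real server also handles 16-bit sequence number wrap-around, sliding windows and jitter, all omitted here for brevity:

class CameraStats:
    """Per-camera counters driven by the RTP sequence numbers."""

    def __init__(self, first_seq):
        self.highest_seq = first_seq
        self.duplicates = 0
        self.reordered = 0
        self.seen = {first_seq}          # a bounded window would be used in practice

    def update(self, seq):
        if seq in self.seen:
            self.duplicates += 1         # retransmitted packet
        elif seq < self.highest_seq:
            self.reordered += 1          # arrived after a higher-numbered packet
        else:
            self.highest_seq = seq
        self.seen.add(seq)

    def lost(self):
        expected = self.highest_seq - min(self.seen) + 1   # span of sequence numbers
        return max(0, expected - len(self.seen))           # expected minus unique received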

Even though the total traffic at the server input is about 3 Gbit/s, the server copes without any DPDK at all: we simply work through a regular Linux socket (after increasing its receive buffer, of course). Moreover, it will be possible to connect new links and MXs, because there is still a performance margin.

This is what top looks like on the server (this is the top of just one LXC container; reports are generated in another):

It shows that the entire load of calculating quality parameters and keeping statistics is spread evenly across four processes. We achieved this distribution by hashing in the FPGA: a hash function is computed over the IP address, and the low bits of the resulting hash determine the UDP port the statistics are sent to. Accordingly, each process listening on its own port receives roughly the same amount of traffic.
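Conceptually the load balancing looks like the sketch below; the actual hash lives in the FPGA and is not documented here, so CRC32 over the source address stands in as an illustration, and the base port and worker count are made-up values.

import zlib

BASE_PORT = 5000     # hypothetical UDP port of the first statistics process
NUM_WORKERS = 4      # worker processes, each listening on its own port (power of two)

def stats_port(src_ip: bytes) -> int:
    """Pick the worker UDP port from the low bits of a hash over the source IP."""
    h = zlib.crc32(src_ip)                        # stand-in for the FPGA hash function
    return BASE_PORT + (h & (NUM_WORKERS - 1))    # low bits select one of the workers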

Pros and cons

It's time to brag and admit the shortcomings of the solution.

I'll start with the pros:

  • no loss at the 10G link interface. Since the FPGA takes the whole "blow", we can be sure that every packet will be analyzed;
  • to monitor 55,000 cameras (or more), only one server with a single 10G card is required. We currently use servers with two 4-core 2.4 GHz Xeons - enough, with headroom: reports are generated in parallel with data collection;
  • monitoring of eight "tens" (10G links) fits into just 2-3 rack units: a monitoring system does not always get a lot of space and power in the rack;
  • when the links from the MXs are connected through a switch, new links can be added without stopping monitoring, since no boards need to be inserted into the server and it does not need to be powered off;
  • the server is not overloaded with data; it receives only what it needs;
  • the headers from the MX arrive in jumbo Ethernet frames, so the processor is not swamped with interrupts (and we do not forget about interrupt coalescing either).

In fairness, here are the disadvantages:

  • because of heavy optimization for a specific task, adding support for new fields or protocols requires changes to the FPGA code, which takes more time than doing the same on a CPU - in development, in testing and during deployment;
  • the video content itself is not analyzed at all. The camera may be filming an icicle hanging in front of it, or be pointed the wrong way, and this will go unnoticed. Of course, we provide the ability to record video from a selected camera, but an operator cannot go through all 55,000 cameras!
  • a server and FPGA-powered devices are more expensive than just one or two servers;)

Summary

In the end, we got a hardware and software complex in which we control both the part that parses packets on the interfaces and the part that keeps the statistics. Full control over all the nodes of the system literally saved us when cameras started switching to RTSP/TCP interleaved mode: in that case the RTP header no longer sits at a fixed offset in the packet - it can be anywhere, even split across the boundary of two packets (the first half in one, the second in the other). Accordingly, the algorithm for extracting the RTP header and its fields changed fundamentally: we had to do TCP reassembly on the server for all 50,000 connections, hence the rather high load in top.

We had never worked on high-load applications before, but our FPGA skills let us solve the problem, and it turned out pretty well. There is even headroom left: another 20-30 thousand streams can be added to a system already handling 55,000 cameras.

I left tuning of the Linux subsystems (distributing queues across interrupts, increasing receive buffers, pinning processes to specific cores, etc.) outside the scope of the article, because that topic is already covered very well elsewhere.

I have described far from everything - plenty of rakes were stepped on along the way - so feel free to ask questions :)

Many thanks to everyone who read to the end!

The most pressing problem is increasingly the lack of address space, which requires changing the address format.

Another problem is the insufficient scalability of the routing procedure - the basis of IP networks. The rapid growth of the network causes an overload of routers, which today are forced to maintain routing tables with tens and hundreds of thousands of entries, as well as solve the problems of packet fragmentation. The work of routers can be facilitated, in particular, by upgrading the IP protocol.

Along with the introduction of new functions directly into the IP protocol, it is advisable to ensure its closer interaction with new protocols by introducing new fields into the packet header.

As a result, it was decided to subject the IP protocol to modernization, pursuing the following main goals:

  • creation of a new extended addressing scheme;
  • improving network scalability by reducing the functions of backbone routers;
  • ensuring data protection.

Address space expansion. The new version of the IP protocol solves the potential address shortage by extending the address length to 128 bits. However, such a significant increase in address length was made not so much to eliminate the shortage of addresses as to increase the efficiency of networks based on this protocol: the main goal was a structural change of the addressing system that extends its functionality.

Instead of the existing two levels of the address hierarchy (network number and host number), IPv6 proposes to use four levels, which implies a three-level network identification and one level for node identification.

The address is now written in hexadecimal, with each group of four digits separated by a colon, for example:

FEDC:0A96:0:0:0:0:7733:567A.

For networks that support both protocol versions, IPv4 and IPv6, the lower 4 bytes may be written in the traditional dotted decimal notation, with hexadecimal used for the higher ones:

0:0:0:0:0:FFFF:194.135.75.104.
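Both notations are understood by common tooling. A quick check with Python's standard ipaddress module, using the addresses from the examples above:

import ipaddress

a = ipaddress.IPv6Address("FEDC:0A96:0:0:0:0:7733:567A")
print(a.compressed)      # fedc:a96::7733:567a   (the run of zero groups collapses to '::')

b = ipaddress.IPv6Address("::FFFF:194.135.75.104")
print(b.ipv4_mapped)     # 194.135.75.104        (an IPv4-mapped IPv6 address)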

The IPv6 addressing system also sets aside address space for local use, that is, for networks not connected to the Internet. There are two kinds of local addresses: for "flat" networks without subnets (Link-Local) and for networks divided into subnets (Site-Local); they differ by prefix value.

Changing the packet header format. This is implemented with a new "nested headers" scheme, which splits the header into a main header carrying the necessary minimum of information and additional headers that may be absent. This approach opens up rich possibilities for extending the protocol by defining new optional headers, which makes the protocol open.

The main IPv6 datagram header is 40 bytes long and has the following format (Figure 2.4).

The Traffic Class field is equivalent in purpose to the IPv4 Type Of Service field, and the Hop Limit field to the IPv4 Time To Live field.

The Flow Label field makes it possible to single out individual data flows and process them in a special way without analyzing the contents of the packets, which is very important for reducing the load on routers.

The Next Header field is analogous to the IPv4 Protocol field and determines the type of the header that follows the main one. Each subsequent additional header also contains a Next Header field.
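As a quick illustration of the fixed 40-byte header layout described above, here is a minimal parsing sketch:

import ipaddress
import struct

def parse_ipv6_header(data: bytes):
    """Split the 40-byte fixed IPv6 header into its fields."""
    first_word, payload_len, next_header, hop_limit, src, dst = struct.unpack(
        "!IHBB16s16s", data[:40])
    return {
        "version": first_word >> 28,                  # 4 bits, always 6
        "traffic_class": (first_word >> 20) & 0xFF,   # analogous to IPv4 Type Of Service
        "flow_label": first_word & 0xFFFFF,           # 20 bits
        "payload_length": payload_len,
        "next_header": next_header,                   # analogous to the IPv4 Protocol field
        "hop_limit": hop_limit,                       # analogous to IPv4 Time To Live
        "src": ipaddress.IPv6Address(src),
        "dst": ipaddress.IPv6Address(dst),
    }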

2.3.3. TCP protocol

The Transmission Control Protocol (TCP) was developed to support interactive communication between computers. TCP ensures reliable and valid data exchange between processes on computers that are part of a common network.

Unfortunately, TCP is not suitable for transmitting multimedia information. The main reason is its delivery control: acknowledgement and retransmission take too long for delay-sensitive data. In addition, TCP provides rate control mechanisms to avoid network congestion, whereas audio and video data require strictly defined bit rates that cannot be changed arbitrarily.

On the one hand, the TCP protocol interacts with the application protocol of the user application, and on the other hand, with the protocol that provides the "low-level" functions of routing and addressing packets, which, as a rule, are performed by IP.

The logical structure of the network software that implements the protocols of the TCP / IP family in each node of the Internet is shown in fig. 2.5.

The rectangles represent the modules that process the data, and the lines connecting the rectangles represent the data transfer paths. The horizontal line at the bottom of the figure indicates an Ethernet network, used as an example of a physical medium.


Fig. 2.5.

To establish a connection between two processes on different computers in the network, you need to know not only the computers' Internet addresses, but also the numbers of the TCP ports (sockets) that the processes use on those computers. Any TCP connection on the Internet is uniquely identified by a pair of IP addresses and a pair of TCP port numbers.

The TCP protocol can handle damaged, lost, duplicated, or out-of-order packets. This is achieved through a mechanism for assigning a sequence number to each transmitted packet and a mechanism for checking the receipt of packets.

When TCP transmits a segment of data, a copy of that data is placed in the retransmission queue and an acknowledgment timer is started.

2.3.4. UDP protocol

The User Datagram Protocol (UDP) is intended for exchanging datagrams between processes on computers that belong to an integrated system of computer networks.

UDP runs on top of the IP protocol and provides application processes with transport services that differ only slightly from those of IP. UDP provides non-guaranteed data delivery, i.e. it does not require confirmation of receipt; furthermore, it does not require a connection to be established between the source and the receiver of the information, i.e. between the UDP modules.

2.3.5. RTP and RTCP protocols

Basic concepts

The RTP real-time transport protocol provides end-to-end real-time transmission of multimedia data such as interactive audio and video. This protocol implements traffic type recognition, packet sequencing, timestamping, and transmission control.

The action of the RTP protocol is reduced to assigning each outgoing packet a timestamp. On the receiving side, packet timestamps indicate in what sequence and with what delays they need to be played back. Support for RTP and RTCP allows the receiving host to arrange the received packets in the proper order, reduce the effect of packet delay jitter on the network on signal quality, and restore synchronization between audio and video so that incoming information can be correctly heard and viewed by users.

Note that RTP itself has no mechanism to guarantee timely delivery or quality of service; it relies on underlying services for this. It does not prevent out-of-order delivery, nor does it assume that the underlying network is absolutely reliable and delivers packets in the correct sequence. The sequence numbers included in RTP allow the receiver to restore the sender's packet order.

The RTP protocol supports both point-to-point communication and data transfer to a group of destinations, provided multicast is supported by the underlying network. RTP is designed to supply the information required by individual applications and is in most cases integrated into the application itself.

Although RTP is considered a transport layer protocol, it usually functions on top of another transport layer protocol, UDP (User Datagram Protocol). Both protocols contribute to the functionality of the transport layer. It should be noted that RTP and RTCP are independent of the underlying transport and network layers, so the RTP/RTCP protocols can be used with other suitable transport protocols.

RTP/RTCP protocol data units are called packets. Packets generated according to the RTP protocol and used to carry multimedia data are called information packets or data packets, while packets generated according to the RTCP protocol and used to carry the service information needed for reliable teleconference operation are called control packets or service packets. An RTP packet consists of a fixed header, an optional variable-length header extension, and a data field. An RTCP packet starts with a fixed part (similar to the fixed part of RTP data packets) followed by variable-length structural elements.

To make the RTP protocol more flexible and applicable to a variety of applications, some of its parameters are deliberately left undefined; instead, it introduces the concept of a profile. A profile is a set of RTP and RTCP parameters for a specific class of applications that determines how they operate. A profile defines the use of individual packet header fields, the traffic types, header padding and header extensions, packet types, security services and algorithms, considerations for the use of the underlying protocol, and so on. Each application typically works with only one profile, and the profile is set by selecting the appropriate application; there is no explicit indication of the profile by port number, protocol identifier or the like.

Thus, a complete RTP specification for a particular application must include additional documents, which include a profile description, as well as a traffic format description that defines how a particular type of traffic, such as audio or video, will be processed in RTP.

Group audio conferencing

Group audio conferencing requires a multicast group address and two ports: one port for the exchange of audio data, the other for RTCP control packets. The group address and port information is passed to the prospective teleconference participants. If secrecy is required, the data and control packets may be encrypted, in which case an encryption key must also be generated and distributed.

The audio conferencing application used by each participant in a conference sends audio data in small bursts, such as 20 ms. Each piece of audio data is preceded by an RTP header; the RTP header and data are in turn formed (encapsulated) into a UDP packet. The RTP header indicates which type of audio coding (eg, PCM, ADPCM, or LPC) was used to form the data in the packet. This makes it possible to change the coding type during the conference, for example, when a new participant arrives who uses a low bandwidth connection, or during network congestion.

On the Internet, as in other packet-switched networks, packets are sometimes lost, reordered and delayed by varying amounts of time. To counteract this, the RTP header contains a timestamp and a sequence number that let receivers reconstruct the timing, so that, for example, chunks of the audio signal are played out by the speaker continuously, every 20 ms. This timing reconstruction is performed separately and independently for each source of RTP packets in the teleconference. The sequence number can also be used by the receiver to estimate how many packets were lost.

Since participants can join and leave a teleconference while it is in progress, it is useful to know who is participating at any given moment and how well the participants are receiving the audio data. For this purpose, each instance of the audio application periodically sends reception reports, tagged with its user name, on the control (RTCP) port to the applications of all the other participants. A reception report indicates how well the current speaker is being heard and can be used to control adaptive encoders. In addition to the user name, other identification information may also be included, subject to bandwidth control. When leaving the conference, a site sends an RTCP BYE packet.

Video conferencing

If a teleconference uses both audio and video, they are transmitted separately. To carry each type of traffic independently of the other, the protocol specification introduces the concept of an RTP session. A session is defined by a particular pair of destination transport addresses (one network address plus a pair of ports for RTP and RTCP). Packets for each type of traffic are transmitted using different pairs of UDP ports and/or multicast addresses. There is no direct RTP-level coupling between the audio and video sessions, except that a user participating in both must use the same canonical name in the RTCP packets of both sessions so that the sessions can be associated.

One reason for this separation is that some conference participants need to be allowed to receive only one type of traffic if they wish to. Despite the separation, synchronous playback of source media data (audio and video) can be achieved using the timing information that is carried in the RTCP packets for both sessions.

The concept of mixers and translators

Not all sites can always receive multimedia data in the same format. Consider the case where participants in one locality are connected through a low-speed link to the majority of the other conference participants, who enjoy broadband network access. Instead of forcing everyone onto a narrower bandwidth and lower-quality audio coding, an RTP-level relay called a mixer can be placed in the low-bandwidth area. The mixer resynchronizes the incoming audio packets to restore the original 20 ms intervals, mixes the restored audio streams into a single stream, encodes the audio for the lower bandwidth, and forwards the packet stream over the low-speed link. The packets may be addressed to a single recipient or to a group of recipients with different addresses. So that the receiving endpoints can still identify the source of a message correctly, the RTP header includes a means for mixers to identify the sources that contributed to a mixed packet.

Some participants in an audio conference may be connected by broadband links yet not be reachable via the conference's IP multicast (IPM) group - for example, if they are behind an application-level firewall that does not pass IP packets. For such cases, what is needed is not a mixer but another kind of RTP-level relay, called a translator. Of the two translators, one is installed outside the firewall and forwards all the multicast packets it receives over a secure connection to the other translator installed behind the firewall. The translator behind the firewall sends them out again as multicast packets to a multicast group restricted to the site's internal network.

Mixers and translators can be designed for a number of purposes. Example: A video mixer that scales video images of individuals in independent video streams and composites them into a single video stream, simulating a group scene.

RTCP control protocol

All fields of RTP/RTCP packets are transmitted over the network in bytes (octets); the most significant byte is transmitted first. All header field data is aligned according to its length. Octets marked as optional have a value of zero.

The RTCP control protocol (Real-Time Control Protocol) is based on the periodic transmission of control packets to all participants in a session, using the same distribution mechanism as the RTP data packets. The underlying protocol must provide multiplexing of the data and control packets, for example by means of different UDP port numbers. The RTCP protocol performs four main functions.

  1. The main function is to provide feedback for assessing the quality of data distribution. This is an integral part of RTCP's role as a transport protocol and is related to the flow and congestion control functions of other transport protocols. Feedback can be directly useful for adaptive coding control, but experiments with IP multicast have shown that feedback from receivers is also important for diagnosing faults in the distribution of data. Sending reception reports to all participants makes it possible, when problems are observed, to judge whether they are local or global. With an IPM distribution mechanism, entities such as network service providers can also receive the feedback and act as third-party monitors when diagnosing network problems. This feedback function is provided by RTCP sender and receiver reports.
  2. RTCP carries a persistent transport-level identifier for an RTP source, called the canonical name (CNAME). Since the SSRC identifier can change if a conflict is detected or the program is restarted, receivers need the CNAME to keep track of each participant. Receivers also need the CNAME to associate multiple streams from a given participant across a set of related RTP sessions, for example when synchronizing audio and video.
  3. The first two functions require that all participants send RTCP packets, so the rate must be controlled for RTP to scale to a large number of participants. Since each participant sends its control packets to all the others, each of them can independently estimate the total number of participants.
  4. A fourth, optional, feature of RTCP is to provide session control information (eg, participant identification) to be reflected in the user interface. This is most likely to be useful in "loosely managed" sessions, where participants join and leave a group without membership control or parameter negotiation.

Functions one through three are mandatory when RTP is used in an IP multicast environment and are recommended in all other cases. RTP application developers are encouraged to avoid mechanisms that only work point to point and do not scale to larger numbers of users.

RTCP Packet Rate

The RTP protocol is designed to let an application scale a session automatically from a few participants to several thousand. For example, in an audio conference the data traffic is essentially self-limiting, because only one or two people talk at a time, and with multicast distribution the data rate on any given link remains relatively constant regardless of the number of participants. Control traffic, however, is not self-limiting: if each participant sent its reception reports at a constant rate, the control traffic would grow linearly with the number of participants. Therefore a special mechanism must be provided to reduce the frequency of control packet transmission.

For each session, the data traffic is assumed to meet an aggregated limit, called session bandwidth, which is shared by all participants. This bandwidth can be reserved and its limit is set by the network. The session bandwidth is independent of the media encoding type, but the choice of encoding type may be limited by the session bandwidth. The session bandwidth setting is expected to be provided by the session management application when it invokes the media application, but media applications may also set a default value based on the single sender data bandwidth for the encoding type selected for a given session.

Bandwidth calculations for control and data traffic include the underlying transport and network layer protocols (e.g. UDP and IP). Data link layer headers are not taken into account, because a packet may be encapsulated with different link-layer headers as it travels.

Control traffic should be limited to a small and known fraction of the session bandwidth: small enough that the main function of the transport protocol, data transmission, is not impaired, and known so that the control traffic can be included in the bandwidth specification given to a resource reservation protocol and so that each participant can compute its share independently. The fraction of the session bandwidth allocated to RTCP is suggested to be fixed at 5%. All session participants must use the same RTCP bandwidth value so that they compute the same control packet transmission interval; these constants should therefore be fixed for each profile.

The algorithm for calculating the interval between RTCP compound packets, which shares the bandwidth allocated to control traffic among the participants, has the following main characteristics (a simplified sketch follows the list):

  • senders are collectively given at least 1/4 of the control traffic bandwidth, so that in sessions with many receivers but few senders newly joining participants quickly receive the CNAMEs of the sending sites;
  • the calculated interval between RTCP packets is required to be no less than 5 seconds, to avoid bursts of RTCP packets exceeding the allowed bandwidth when the number of participants is small and the traffic is not smoothed out by the law of large numbers;
  • the interval between RTCP packets is varied randomly between one half and one and a half times the calculated interval, to avoid unintentional synchronization of all the participants; the first RTCP packet sent after joining a session is also delayed randomly (by up to half the minimum RTCP interval), in case the application is started at several sites simultaneously, for example when a session start is announced;
  • to adapt automatically to changes in the amount of control information carried, a dynamic estimate of the average RTCP compound packet size is computed over all received and sent packets;
  • the algorithm may be used for sessions in which all participants are allowed to send. In that case, the session bandwidth parameter is the product of an individual sender's bandwidth and the number of participants. RTP relies on the underlying protocol to provide a length indication, so the maximum length of RTP packets is limited only by the lower layer protocols.

    Multiple RTP protocol packets can be carried in a single underlying protocol data unit, such as a UDP packet. This reduces header redundancy and simplifies synchronization between different streams.
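A much-simplified sketch of the interval calculation mentioned above; it follows the spirit of RFC 3550, section 6.3, but omits the timer reconsideration and compensation details, so the constants should be treated as illustrative:

import random

RTCP_FRACTION = 0.05    # 5% of the session bandwidth goes to RTCP
SENDER_SHARE = 0.25     # senders collectively get at least 1/4 of that
MIN_INTERVAL = 5.0      # seconds

def rtcp_interval(session_bw, members, senders, avg_rtcp_size, we_sent, initial=False):
    """Simplified RTCP transmission interval (bandwidth in bytes per second)."""
    rtcp_bw = session_bw * RTCP_FRACTION
    if senders <= members * SENDER_SHARE:
        # split the RTCP bandwidth between the sender and receiver groups
        if we_sent:
            rtcp_bw *= SENDER_SHARE
            n = senders
        else:
            rtcp_bw *= (1 - SENDER_SHARE)
            n = members - senders
    else:
        n = members
    t_min = MIN_INTERVAL / 2 if initial else MIN_INTERVAL
    t = max(t_min, n * avg_rtcp_size / rtcp_bw)
    return t * random.uniform(0.5, 1.5)    # randomize to avoid synchronization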

If one day you have to quickly figure out what VoIP (voice over IP) is and what all these wild abbreviations mean, I hope this manual will help. I will immediately note that the issues of configuring additional types of telephony services (such as call transfer, voice mail, conference calls, etc.) are not considered here.

So, what we will deal with under the cut:

  1. Basic concepts of telephony: types of devices, connection schemes
  2. Bundle of SIP/SDP/RTP protocols: how it works
  3. How information about pressed buttons is transmitted
  4. How does voice and fax transmission work?
  5. Digital signal processing and audio quality assurance in IP telephony

1. Basic concepts of telephony

In general, the scheme for connecting a local subscriber to a telephone provider via a regular telephone line is as follows:



On the side of the provider (PBX), a telephone module with an FXS (Foreign eXchange Subscriber) port is installed. A telephone or fax machine with an FXO (Foreign eXchange Office) port and a dialer module is installed at home or in the office.

In appearance, the FXS and FXO ports do not differ in any way, these are ordinary 6-pin RJ11 connectors. But using a voltmeter, it is very easy to distinguish them - there will always be some voltage on the FXS port: 48/60 V when the handset is on-hook, or 6-15 V during a call. On the FXO, if it is not connected to the line, the voltage is always 0.

To transfer data over a telephone line, additional logic is needed on the provider's side, which can be implemented on the SLIC (subscriber line interface circuit) module, and on the subscriber's side - using the DAA (Direct Access Arrangement) module.

Wireless DECT (Digital European Cordless Telecommunications) phones are quite popular now. Internally they are similar to ordinary telephones: they also have an FXO port and a dialer module, but in addition they contain a wireless communication module linking the base station and the handsets at a frequency of 1.9 GHz.

Subscribers connect to the PSTN (Public Switched Telephone Network). A PSTN can be built with different technologies: ISDN, optical links, POTS, Ethernet. A special case of the PSTN, where a regular analog copper line is used, is POTS (Plain Old Telephone Service).

With the development of the Internet, telephone communications moved to a new level. Landline telephones are used less and less, mostly for official needs. DECT phones are a little more convenient, but limited to the perimeter of the house. GSM phones are more convenient still, but limited by the borders of the country (roaming is expensive). IP phones, also known as softphones, have no such restrictions apart from needing Internet access.

Skype is the most famous example of a softphone. It can do a lot, but it has two important drawbacks: a closed architecture, and wiretapping by well-known authorities. Because of the first, you cannot build your own small telephone network on it. And because of the second, it is not very pleasant to be eavesdropped on, especially in personal and business conversations.

Fortunately, there are open protocols for building your own communication networks with all the goodies - SIP and H.323. There are somewhat more softphones for the SIP protocol than for H.323, which can be explained by its relative simplicity and flexibility. Sometimes, though, that flexibility can throw a spanner in the works. Both SIP and H.323 use the RTP protocol to transfer media data.

Consider the basic principles of the SIP protocol to understand how two subscribers connect.

2. Description of the bundle of SIP/SDP/RTP protocols

SIP (Session Initiation Protocol) is a protocol for establishing a session (not necessarily a telephone one); it is a text protocol running over UDP. SIP over TCP is also possible, but such cases are rare.

SDP (Session Description Protocol) is a protocol for negotiating the type of transmitted data (for sound and video these are codecs and their formats, for faxes - transmission speed and error correction) and their destination addresses (IP and port). It is also a text protocol. SDP parameters are sent in the body of SIP packets.

RTP (Real-time Transport Protocol) is an audio/video data transfer protocol. It is a binary protocol over UDP.

General structure of SIP packets:

  • Start-Line: a field containing the SIP method (command) in a request, or the result of executing a SIP method in a response.
  • Headers: additional information supplementing the Start-Line, formatted as lines containing ATTRIBUTE: VALUE pairs.
  • Body: binary or text data. Typically used to send SDP parameters or messages.

Here is an example of two SIP packets for one common call setup procedure:

On the left is the content of the SIP INVITE packet, on the right is the response to it - SIP 200 OK.

The main fields are framed:

  • Method/Request-URI contains the SIP method and URI. In this example a session is being established - the INVITE method - and subscriber 555 is being called.
  • Status-Code - response code for the previous SIP command. In this example, the command was completed successfully - code 200, i.e. Subscriber 555 picked up the phone.
  • Via - address where subscriber 777 is waiting for an answer. For the 200 OK message, this field is copied from the INVITE message.
  • From/To - display name and address of the sender and recipient of the message. For the 200 OK message, this field is copied from the INVITE message.
  • CSeq contains the sequence number of the command and the name of the method to which the given message refers. For the 200 OK message, this field is copied from the INVITE message.
  • Content-Type - the type of data carried in the Body block, in this case SDP data.
  • Connection Information - IP address to which the second subscriber needs to send RTP packets (or UDPTL packets in case of fax transmission via T.38).
  • Media Description - the port to which the second subscriber must transmit the specified data. In this case, these are audio (audio RTP/AVP) and a list of supported data types - PCMU, PCMA, GSM codecs and DTMF signals.

An SDP message consists of lines containing FIELD=VALUE pairs. The main fields include (a combined SIP/SDP construction sketch follows the list):

  • o - Origin, the session originator's name and the session ID.
  • c - Connection Information, the field described earlier.
  • m - Media Description, the field described earlier.
  • a - media attributes, which specify the format of the transmitted data. For example, they indicate the sound direction - receive and/or send (sendrecv) - and, for codecs, the sampling rate and the payload type binding (rtpmap).
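To tie the SIP and SDP fields together, the sketch below assembles a minimal INVITE of the kind described above. Every address, tag and Call-ID in it is made up purely for illustration; a real stack would also handle Max-Forwards, authentication, retransmissions and so on.

def build_invite(caller="777", callee="555", local_ip="192.168.0.10", rtp_port=16384):
    """Assemble a minimal SIP INVITE with an SDP body (illustrative values only)."""
    sdp = "\r\n".join([
        "v=0",
        f"o={caller} 1 1 IN IP4 {local_ip}",       # Origin: who created the session
        "s=call",
        f"c=IN IP4 {local_ip}",                    # Connection Information: where to send RTP
        "t=0 0",
        f"m=audio {rtp_port} RTP/AVP 0 8 101",     # Media Description: port and payload types
        "a=rtpmap:0 PCMU/8000",
        "a=rtpmap:8 PCMA/8000",
        "a=rtpmap:101 telephone-event/8000",       # RFC 2833 DTMF events
        "a=sendrecv",
    ]) + "\r\n"
    headers = "\r\n".join([
        f"INVITE sip:{callee}@{local_ip} SIP/2.0",   # Start-Line: method and Request-URI
        f"Via: SIP/2.0/UDP {local_ip}:5060;branch=z9hG4bKexample",
        f"From: <sip:{caller}@{local_ip}>;tag=12345",
        f"To: <sip:{callee}@{local_ip}>",
        f"Call-ID: example-call-id@{local_ip}",
        "CSeq: 1 INVITE",
        "Content-Type: application/sdp",
        f"Content-Length: {len(sdp)}",
    ])
    return headers + "\r\n\r\n" + sdp

print(build_invite())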

RTP packets carry audio/video data encoded in a specific format, which is indicated by the PT (payload type) field. A table mapping the values of this field to formats can be found at https://en.wikipedia.org/wiki/RTP_audio_video_profile.

RTP packets also contain a unique SSRC identifier (identifying the source of the RTP stream) and a timestamp (used to play the audio or video back evenly).

An example of interaction between two SIP subscribers through a SIP server (Asterisk):

As soon as a SIP phone starts, the first thing it does is register with a remote server (SIP registrar) by sending it a SIP REGISTER message.


When calling a subscriber, a SIP INVITE message is sent, the body of which contains an SDP message containing the audio/video transmission parameters (which codecs are supported, which IP and port to send audio to, etc.).


When the remote subscriber picks up the phone, we receive a SIP 200 OK message, also with SDP parameters - only this time they are the remote subscriber's. Using the sent and received SDP parameters, an RTP audio/video session or a T.38 fax session can be set up.

If the received SDP parameters do not suit us, or the intermediate SIP server decides not to pass the RTP traffic through itself, an SDP re-negotiation procedure, the so-called REINVITE, is performed. Incidentally, it is precisely because of this procedure that free SIP proxy servers have one drawback: if both subscribers are on the same local network while the proxy server is behind NAT, then after the RTP traffic is redirected neither subscriber will hear the other.


After the end of the conversation, the subscriber who hung up sends a SIP BYE message.

3. Transferring information about pressed buttons

Sometimes, after the session is established, during a call, access to additional services (VAS) is required - call hold, transfer, voice mail, etc. - which react to certain combinations of pressed buttons.

So, in a regular telephone line, there are two ways to dial a number:

  • Pulse - historically the first, used mainly in phones with a rotary dial. Dialing happens by successively breaking and closing the telephone line according to the dialed digit.
  • Tone - dialing with DTMF (Dual-Tone Multi-Frequency) codes: each phone button has its own combination of two sinusoidal signals (tones). Using the Goertzel algorithm it is quite easy to determine the pressed button (a sketch of such a detector follows this list).
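A compact sketch of such a detector, assuming a frame of 8 kHz mono samples; a production detector would also check absolute energy, twist and tone duration before declaring a key press:

import math

ROW_FREQS = (697, 770, 852, 941)        # DTMF low-group frequencies, Hz
COL_FREQS = (1209, 1336, 1477, 1633)    # DTMF high-group frequencies, Hz
KEYPAD = ("123A", "456B", "789C", "*0#D")

def goertzel_power(samples, sample_rate, freq):
    """Signal power at a single frequency, computed with the Goertzel algorithm."""
    coeff = 2.0 * math.cos(2.0 * math.pi * freq / sample_rate)
    s_prev, s_prev2 = 0.0, 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev * s_prev + s_prev2 * s_prev2 - coeff * s_prev * s_prev2

def detect_dtmf(samples, sample_rate=8000):
    """Return the most likely pressed key for one frame of audio."""
    row = max(range(4), key=lambda i: goertzel_power(samples, sample_rate, ROW_FREQS[i]))
    col = max(range(4), key=lambda i: goertzel_power(samples, sample_rate, COL_FREQS[i]))
    return KEYPAD[row][col]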

During a conversation, the pulse method is inconvenient for transmitting the pressed button. So, it takes approximately 1 second to transmit "0" (10 pulses of 100 ms each: 60 ms - line break, 40 ms - line close) plus 200 ms for a pause between digits. In addition, characteristic clicks will often be heard during pulse dialing. Therefore, in conventional telephony, only the tone mode of access to VAS is used.

In VoIP telephony, information about pressed buttons can be transmitted in three ways:

  1. DTMF Inband - generating an audio tone and transmitting it inside the audio data (current RTP channel) is a normal tone dial.
  2. RFC2833 - a special telephone-event RTP packet is generated, which contains information about the pressed key, volume and duration. The number of the RTP format in which RFC2833 DTMF packets will be transmitted is specified in the body of the SDP message. For example: a=rtpmap:98 telephone-event/8000.
  3. SIP INFO - a SIP INFO packet is formed with information about the pressed key, volume and duration.

In-band DTMF transmission has several disadvantages: the overhead of generating/embedding the tones and of detecting them, the limitations of some codecs that can distort the DTMF tones, and poor reliability (if some packets are lost, the same key press may be detected twice).

The main difference between RFC 2833 DTMF and SIP INFO: if the SIP proxy server is able to pass RTP directly between the subscribers, bypassing the server itself (for example, canreinvite=yes in Asterisk), then the server will not see the RFC 2833 packets, and the VAS services become unavailable. SIP packets, on the other hand, always pass through the SIP proxy servers, so VAS will always work.

4. Voice and fax transmission

As already mentioned, the RTP protocol is used to transfer media data. RTP packets always specify the format of the transmitted data (codec).

There are many voice codecs with different bitrate/quality/complexity trade-offs, both open and proprietary. Any softphone must support the G.711 alaw/ulaw codecs: they are very simple to implement and the sound quality is decent, but they require 64 kbit/s of bandwidth. The G.729 codec, for example, needs only 8 kbit/s, but it is very CPU-intensive, and it is not free.

For fax transmission, either the G.711 codec or the T.38 protocol is usually used. Sending a fax with the G.711 codec amounts to sending it with the T.30 protocol, as if the fax went over a regular telephone line, except that the analog signal from the line is digitized according to the alaw/ulaw law. This is also called in-band T.30 faxing.

Faxes using the T.30 protocol negotiate their parameters: transmission speed, datagram size, type of error correction. The T.38 protocol is based on the T.30 protocol, but unlike the Inband transmission, the generated and received T.30 commands are analyzed. Thus, not raw data is transmitted, but recognized fax control commands.

The T.38 command is transmitted using the UDPTL protocol, which is a UDP-based protocol and is only used for T.38. TCP and RTP protocols can also be used to transmit T.38 commands, but they are used much less frequently.

The main advantages of T.38 are reduced network load and greater reliability compared to Inband fax transmission.

The procedure for sending a fax in T.38 mode is as follows:

  1. A normal voice connection is established using any codec.
  2. When paper is loaded in the sending fax machine, it periodically sends a T.30 CNG (Calling Tone) signal to indicate that it is ready to send a fax.
  3. On the receiving side, a T.30 signal CED (Called Terminal Identification) is generated - this is the readiness to receive a fax. This signal is sent either after pressing the "Receive Fax" button or the fax does it automatically.
  4. The CED signal is detected on the sending side and the SIP REINVITE procedure occurs, and the T.38 type is indicated in the SDP message: m=image 39164 udptl t38.

Over the Internet, faxes are preferably sent in T.38 mode. If a fax needs to be transmitted within an office, or between sites with a stable connection, then in-band T.30 fax transmission can be used. In that case, echo cancellation must be switched off before sending the fax, so as not to introduce additional distortion.

Very detailed information about faxing is written in the book "Fax, Modem, and Text for IP Telephony" by David Hanes and Gonzalo Salgueiro.

5. Digital signal processing (DSP). Ensuring sound quality in IP telephony, test examples

We have dealt with the protocols for establishing a call session (SIP/SDP) and with the way audio is carried over an RTP channel. One important question remains - sound quality. On the one hand, sound quality is determined by the chosen codec; on the other hand, additional DSP (digital signal processing) procedures are still needed. These procedures account for the peculiarities of VoIP telephony: a high-quality headset is not always used, packets get dropped on the Internet, packets sometimes arrive unevenly, and network bandwidth is not unlimited either.

Basic procedures that improve sound quality:

VAD (Voice Activity Detection) - a procedure for classifying frames as containing voice (active voice frames) or silence (inactive voice frames). This separation can significantly reduce the network load, since conveying silence requires far less data (it is enough to transmit the noise level, or nothing at all).


Some codecs already contain VAD procedures (GSM, G.729), while for others (G.711, G.722, G.726) it has to be implemented separately.

If the VAD is configured to transmit information about the noise level, then special SID packets (Silence Insertion Descriptor) are transmitted in the 13th CN (Comfort Noise) RTP format.

It is worth noting that SID packets can be dropped by SIP proxy servers, so for verification it is advisable to configure the transmission of RTP traffic past SIP servers.

CNG (Comfort Noise Generation) - a procedure for generating comfort noise based on the information from SID packets. VAD and CNG thus work in tandem, although CNG is far less in demand, since its effect is not always noticeable, especially at low volume.

PLC (Packet Loss Concealment) - a procedure for reconstructing the audio stream when packets are lost. Even at 50% packet loss a good PLC algorithm can achieve acceptable speech quality: there will be distortion, of course, but the words can still be made out.

The easiest way to emulate packet loss (on Linux) is to use the tc utility from the iproute package with the netem module. It only performs shaping of outgoing traffic.

An example of running network emulation with 50% packet loss:

tc qdisc change dev eth1 root netem loss 50%

Disable emulation:

tc qdisc del dev eth1 root

Jitter buffer - a procedure for countering the jitter effect, where the interval between received packets varies widely and, in the worst case, packets arrive out of order; this also leads to gaps in the speech. To eliminate it, a packet buffer is implemented on the receiving side, large enough to restore the original order and pacing of the sent packets.
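A toy fixed-depth jitter buffer to illustrate the idea; real implementations are adaptive and also deal with sequence number wrap-around, late packets and clock drift:

class JitterBuffer:
    """Reorders RTP packets by sequence number and releases them once a fixed depth is reached."""

    def __init__(self, depth=4):
        self.depth = depth       # how many packets to hold before playout starts
        self.packets = {}        # seq -> payload
        self.next_seq = None     # next sequence number expected by playout

    def push(self, seq, payload):
        if self.next_seq is None:
            self.next_seq = seq
        self.packets[seq] = payload

    def pop(self):
        """Return the next payload in order, or None if nothing is due yet or it was lost."""
        if self.next_seq is None or len(self.packets) < self.depth:
            return None                                   # keep filling the buffer
        payload = self.packets.pop(self.next_seq, None)   # None here means a lost packet (PLC's job)
        self.next_seq = (self.next_seq + 1) & 0xFFFF      # 16-bit sequence number wrap-around
        return payload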

You can also emulate the jitter effect using the tc utility (the interval between the expected moment of packet arrival and the actual moment can be up to 500 ms):


tc qdisc add dev eth1 root netem delay 500ms reorder 99%

LEC (Line Echo Cancellation) - a procedure for eliminating local echo, when the remote subscriber starts to hear their own voice. Its essence is to subtract the received signal, taken with a certain coefficient, from the transmitted signal.

Echoes can occur for several reasons:

  • acoustic echo due to poor-quality audio path (sound from the speaker enters the microphone);
  • electrical echo due to impedance mismatch between telephone and SLIC module. In most cases, this occurs in circuits that convert a 4-wire telephone line to 2-wire.

Finding out the reason (acoustic or electrical echo) is not difficult: the subscriber on whose side the echo is created must turn off the microphone. If the echo still occurs, then it is electrical.


For more information on VoIP and DSP procedures, see VoIP Voice and Fax Signal Processing. A preview is available on Google Books.

This concludes our brief theoretical overview of VoIP. If there is interest, a practical implementation of a mini-PBX on a real hardware platform can be covered in the next article.

[!?] Questions and comments are welcome. They will be answered by the author of the article Dmitry Valento, a software engineer at the Promwad electronics design center.


RTP and RSVP protocols

http://www.isuct.ru/~ivt/books/NETWORKING/NET10/269/pa.html

Modern applications cannot tolerate their packets arriving late. Two protocols (RTP and RSVP) help ensure timely delivery with quality of service.

The continued growth of the Internet and of private networks places new demands on bandwidth. Client-server applications far exceed Telnet in the amount of data transferred, and the World Wide Web has led to a huge increase in graphic traffic. Today, in addition, voice and video applications impose their own specific requirements on already overloaded networks.

To satisfy all these demands, simply increasing network capacity is not enough. What is really needed are smart, efficient methods of traffic management and load control.

Historically, IP-based networks provided all applications with only the simplest possible data delivery service. However, needs have changed over time. Organizations that have spent millions of dollars installing IP-based networks to transfer data between local networks are now finding that such configurations cannot efficiently support the new multicast real-time multimedia applications.

ATM is the only network technology that was originally designed to support ordinary TCP and UDP traffic alongside real-time traffic. However, moving to ATM means either building a new network infrastructure for real-time traffic or replacing the existing IP-based configuration, and both options are very expensive.

Therefore, the need to support multiple types of traffic with different quality of service requirements within the TCP/IP architecture is very urgent. Two key tools are designed to solve this problem: the Real-Time Transport Protocol (RTP) and the Resource Reservation Protocol (RSVP).

RTP ensures delivery of data to one or more destinations with a delay within specified limits, which means the data can be played back in real time. RSVP allows end systems to reserve network resources in order to obtain the required quality of service, in particular for real-time traffic carried over the RTP protocol.

The most widely used transport layer protocol is TCP. Although TCP can support a wide variety of distributed applications, it is not suitable for real-time applications.

In real-time applications, the sender generates a data stream at a constant rate, and the receiver(s) must provide that data to the application at the same rate. Such applications include audio and video conferencing, live video distribution (for immediate playback), shared workspaces, medical remote diagnostics, computer telephony, distributed interactive simulation, games, and real-time monitoring.

Using TCP as the transport protocol for these applications is not possible for several reasons. First, TCP only allows a connection between two endpoints and is therefore not suitable for multicast. Second, it provides for the retransmission of lost segments, which arrive at a time when the real-time application is no longer waiting for them. In addition, TCP has no convenient mechanism for associating timing information with segments, which is also a requirement for real-time applications.

Another widely used transport layer protocol, UDP, does not have the first two restrictions (point-to-point connection and retransmission of lost segments), but it does not provide critical timing information either. So UDP by itself offers no general-purpose tools for real-time applications.

While each real-time application may have its own mechanisms to support real-time transmission, they share many common features that make defining a single protocol highly desirable. The standard protocol of this kind is RTP, defined in RFC 1889.

In a typical real-time environment, the sender generates packets at a constant rate. The packets are sent at regular intervals, travel through the network, and are received by the receiver, which plays back the data in real time as it arrives.

However, due to the variation in latency as packets travel across the network, they arrive at irregular intervals. To compensate for this effect, incoming packets are buffered, held for a while, and then delivered at a constant rate to the software that generates the output. To make this scheme work, each packet carries a timestamp so that the receiver can replay the incoming data at the same rate as the sender.
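
As a minimal sketch of this buffering idea (a playout buffer), the class below holds packets keyed by their RTP timestamp and releases them only once a fixed playout delay has elapsed. The class name, the delay parameter and the use of a heap are illustrative assumptions, not part of any standard.

```python
import heapq
import itertools

class PlayoutBuffer:
    """Hold out-of-order packets and release them at their scheduled playout time."""

    def __init__(self, playout_delay):
        self.playout_delay = playout_delay   # extra delay budget, in timestamp units
        self._heap = []                      # min-heap ordered by RTP timestamp
        self._order = itertools.count()      # tie-breaker for equal timestamps

    def on_packet(self, rtp_timestamp, payload):
        # Packets may arrive late and out of order; just buffer them.
        heapq.heappush(self._heap, (rtp_timestamp, next(self._order), payload))

    def due(self, now):
        """Yield payloads whose playout time (timestamp + delay) has been reached."""
        while self._heap and self._heap[0][0] + self.playout_delay <= now:
            yield heapq.heappop(self._heap)[2]
```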

RTP supports real-time data transfer between multiple participants in a session. (A session is a logical relationship between two or more RTP users that is maintained for the duration of the data transfer. The process of opening a session is outside the scope of RTP.)

While RTP can also be used for real-time unicast, its strength lies in its multicast support. To do this, each RTP data block contains a sender identifier indicating which participant is generating the data. The RTP data blocks also contain a timestamp so that the data can be played back at the correct intervals by the receiving end.

In addition, RTP defines the payload format of the transmitted data. Directly related to this is the concept of synchronization, which is partly the responsibility of the mixer - the RTP translation mechanism. Upon receiving streams of RTP packets from one or more sources, it combines them and sends a new stream of RTP packets to one or more recipients. The mixer can simply combine the data and also change its format.

An example of a mixer application is combining multiple audio sources. Suppose, for example, that some of the systems in a given audio session each generate their own RTP stream. Most of the time only one source is active, although sometimes several sources are "talking" at the same time.

If a new system wants to participate in a session but its network link does not have enough capacity to carry all the RTP streams, the mixer receives all of these streams, merges them into one, and passes the combined stream to the new session member. When several streams are received at once, the mixer adds the PCM values. The RTP header generated by the mixer includes the identifier(s) of the sender(s) whose data is present in the packet.
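
As a toy illustration of that mixing step, the function below sums PCM samples from several sources into one stream. The 16-bit sample format and the clipping bounds are assumptions made purely for the example.

```python
def mix_pcm(streams):
    """Sum equal-length lists of signed 16-bit PCM samples into one mixed stream."""
    mixed = []
    for samples in zip(*streams):                     # one sample from each active source
        total = sum(samples)                          # "the mixer adds the PCM values"
        mixed.append(max(-32768, min(32767, total)))  # clip to the int16 range
    return mixed

# Two sources talking at once:
# mix_pcm([[100, -200, 300], [50, 50, 50]]) -> [150, -150, 350]
```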

A simpler device creates one outgoing RTP packet for each incoming RTP packet. This mechanism, called a translator, can change the format of the data in the packet or use a different set of low-level protocols to transfer data from one domain to another. For example, a potential recipient may not be able to process the high-speed video signal used by other participants in the session. The translator then converts the video to a lower quality format that requires a lower bit rate.

Each RTP packet has a basic header and possibly additional application-specific fields. Fig. 4 illustrates the structure of the main header. The first 12 octets consist of the following fields (a small parsing sketch follows the list):

  • version field (2 bits): the current version is 2;
  • padding field (1 bit): this field signals the presence of padding octets at the end of the payload. (Padding is used when the application requires the payload size to be a multiple of, for example, 32 bits.) In that case, the last octet indicates the number of padding octets;
  • header extension field (1 bit): when this bit is set, the main header is followed by an extension header, used in experimental RTP extensions;
  • sender count field (4 bits): this field contains the number of sender identifiers whose data is in the packet, the identifiers themselves following the main header;
  • marker field (1 bit): the meaning of the marker bit depends on the payload type. It is typically used to mark boundaries in the data stream: for video it marks the end of a frame, for voice it marks the start of speech after a period of silence;
  • payload type field (7 bits): this field identifies the payload type and data format, including compression and encryption. In steady state, the sender uses only one payload type per session, but it can change it in response to changing conditions if signaled by the Real-Time Transport Control Protocol;
  • sequence number field (16 bits): each source starts numbering packets from an arbitrary value and then increments it by one with each RTP data packet sent. This makes it possible to detect packet loss and to restore the order of packets with the same timestamp. Several consecutive packets may carry the same timestamp if they are logically generated at the same instant (e.g. packets belonging to the same video frame);
  • timestamp field (32 bits): records the point in time when the first octet of the payload was generated. The units in which time is expressed depend on the payload type; the value is taken from the sender's local clock;
  • synchronization source (SSRC) identifier field (32 bits): a randomly generated number that uniquely identifies the source within a session.
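
As an illustration of the layout just listed, here is a minimal sketch of how the fixed 12-byte header could be unpacked in Python. It deliberately ignores the optional CSRC list and extension header, and the function name and returned field names are arbitrary choices for the example.

```python
import struct

def parse_rtp_header(packet: bytes) -> dict:
    """Unpack the fixed 12-byte RTP header (no CSRC list, no extension)."""
    if len(packet) < 12:
        raise ValueError("too short to contain an RTP header")
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", packet[:12])
    return {
        "version":      b0 >> 6,          # 2 bits, currently 2
        "padding":      bool(b0 & 0x20),  # P bit: padding octets present
        "extension":    bool(b0 & 0x10),  # X bit: extension header follows
        "csrc_count":   b0 & 0x0F,        # number of contributing-source ids
        "marker":       bool(b1 & 0x80),  # M bit: e.g. end of a video frame
        "payload_type": b1 & 0x7F,        # 7-bit payload type code
        "sequence":     seq,              # 16-bit sequence number
        "timestamp":    ts,               # 32-bit media timestamp
        "ssrc":         ssrc,             # synchronization source identifier
    }
```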

The main header may be followed by one or more sender identifier fields whose data is present in the payload. These identifiers are inserted by the mixer.

The RTP protocol is used only to transfer user data - usually multicast - to all participants in the session. A separate Real-Time Transport Control Protocol (RTCP) works with multiple destinations to provide feedback to RTP data senders and other session participants.

RTCP uses the same basic transport protocol as RTP (usually UDP), but a different port number. Each session participant periodically sends an RTCP packet to all other session participants. RFC 1889 describes three functions performed by RTCP.

The first function is to provide quality-of-service and congestion feedback. Since RTCP packets are multicast, all participants in the session can evaluate how well the other participants are transmitting and receiving. The sender's messages allow recipients to estimate the data rate and transmission quality. The recipients' messages contain information about the problems they experience, including packet loss and excessive jitter. For example, the bit rate of an audio/video application may be reduced if the link does not provide the desired quality of service at the given bit rate.

Recipient feedback is also important for diagnosing propagation errors.

By analyzing messages from all participants in a session, a network administrator can determine whether a given problem concerns one participant or is of a general nature.

The second main function of RTCP is sender identification. RTCP packets contain a standard textual description of the sender. They provide more information about the sender of the data packets than a randomly chosen synchronization source identifier. In addition, they help the user to identify streams belonging to different sessions; for example, they allow the user to determine that separate audio and video sessions are open at the same time.

The third function is session sizing and scaling. To ensure quality of service and feedback to control congestion, as well as to identify the sender, all participants periodically send RTCP packets. The frequency of transmission of these packets decreases as the number of participants increases.

With a small number of participants, each participant sends an RTCP packet no more often than once every five seconds. RFC 1889 describes an algorithm in which participants limit the rate of RTCP packets based on the total number of participants. The goal is to keep RTCP traffic below 5% of the total session traffic.
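
A rough sketch of this scaling rule follows: 5% of the session bandwidth is set aside for RTCP, divided among the participants, and the five-second minimum is applied. The formula is deliberately simplified compared with the full RFC 1889 algorithm, and the assumed average RTCP packet size is only an illustrative value.

```python
def rtcp_report_interval(session_bandwidth_bps, participants, avg_rtcp_packet_bits=1600):
    """Estimate the interval (in seconds) between RTCP reports for one participant."""
    rtcp_bandwidth = 0.05 * session_bandwidth_bps       # keep RTCP below 5% of the session
    per_participant = rtcp_bandwidth / max(participants, 1)
    interval = avg_rtcp_packet_bits / per_participant   # seconds between this member's reports
    return max(5.0, interval)                           # never more often than every 5 s

# A 1 Mbit/s session with 20 participants:
# rtcp_report_interval(1_000_000, 20) -> 5.0 (the five-second minimum applies)
```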

The purpose of any network is to deliver data to the recipient with a guaranteed quality of service, including throughput, delay, and the allowable delay variation limit. As the number of users and applications grows, it becomes more and more difficult to ensure the quality of services.

Just responding to overload is no longer enough. A tool is needed to avoid congestion altogether, that is, to make it possible for applications to reserve network resources in accordance with the required quality of service.

Preventive measures are useful for both unicast and multicast. In unicast, two applications agree on a specific quality of service level for a given session. If the network is heavily loaded, it may not be able to provide the required quality of service. In this situation, applications will have to postpone the session until better times or try to reduce the quality of service requirements, if possible.

The solution in this case is for unicast applications to reserve resources to provide the required level of service. Then the routers on the intended path allocate resources (for example, a place in the queue and part of the capacity of the outgoing line). If the router is unable to allocate resources due to previous commitments, then it notifies the application. In this case, the application may try to initiate another session with lower quality of service requirements or reschedule it to a later date.

Multicasting poses much more complex resource reservation problems. It can generate huge amounts of network traffic, for example with high-bandwidth applications such as video, or when the group of recipients is large and widely dispersed. Nevertheless, the traffic from a multicast source can in principle be reduced significantly.

There are two reasons for this. First, some members of a group may not need data from a particular source during a particular period of time. For example, a group may be receiving information simultaneously over two channels (from two sources), while a given recipient is interested in only one of them.

Second, some members of the group may be able to process only part of the information transmitted by the sender. For example, a video stream may consist of two components: one with low picture quality and the other with high picture quality. A number of video compression algorithms produce exactly this format: they generate a base component with a low-quality picture and an enhancement component with a higher resolution.

Some recipients may not have enough processing power to process components with high resolution or be connected to the network through a subnet or link that does not have enough capacity to carry the full signal.

Resource reservation allows routers to determine in advance whether they can deliver multicast traffic to all recipients.

In previous attempts to implement resource reservations and in the approaches adopted in frame relay and ATM, the necessary resources are requested by the source of the data flow. This method is sufficient in the case of unicast transmission, because the transmitting application transmits data at a certain rate, and the required level of quality of service is inherent in the transmission scheme.

However, this approach cannot be used for multicasting. Different group members may have different resource requirements. If the original stream can be divided into substreams, then some members of the group may well want to receive only one of them. In particular, some receivers will only be able to process the low resolution video component. Or if several senders broadcast to the same group, then the recipient can choose only one sender or some subset of them. Finally, the quality of service requirements of different recipients may vary depending on the output equipment, processor power, and channel speed.

For this reason, resource reservation by the recipient is seen as preferable. Senders can provide routers with the general characteristics of the traffic (e.g. data rate and variability), but it is up to the recipients to determine the required quality-of-service level. Routers can then aggregate resource allocation requests on the shared parts of the distribution tree.

RSVP is based on three concepts relating to data flows: the session, the flow specification, and the filter specification. A session is a data stream identified by its destination. Note that this concept differs from that of an RTP session, although RSVP and RTP sessions may have a one-to-one correspondence. Once a router has reserved resources for a particular destination, it treats this as the start of a session and allocates resources for the duration of that session.

A reservation request from the destination end system, called a flow descriptor, consists of a flow specification and a filter specification. The flow specification defines the required quality of service and is used by the node to set the parameters of its packet scheduler; the router forwards packets with a given set of preferences based on the current flow specification.

The filter specification defines the set of packets for which resources are requested. Together with the session, it defines the set of packets (or the flow) for which the required quality of service is to be provided. Any other packets addressed to that destination are handled only insofar as the network has spare capacity.

RSVP does not define the contents of the flow specification; it simply passes the request along. A flow specification typically includes a service class, an Rspec (R stands for reserve), and a Tspec (T stands for traffic). The latter two parameters are sets of numeric values: the Rspec defines the required quality of service, and the Tspec describes the data flow. The contents of Rspec and Tspec are opaque to RSVP.

In principle, a filter specification describes an arbitrary subset of the packets of a single session (that is, of the packets whose destination is determined by that session). For example, a filter specification might select only specific senders, or select packets whose protocol or protocol header fields match specified values.
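
To keep the three concepts apart, here is a purely conceptual sketch of session, flow specification and filter specification as data structures. The field names are illustrative only and have nothing to do with RSVP's actual wire format.

```python
from dataclasses import dataclass

@dataclass
class Session:
    """A session is identified by its destination."""
    dest_address: str
    protocol: int
    dest_port: int

@dataclass
class FlowSpec:
    """Required quality of service: a service class plus Rspec and Tspec values."""
    service_class: str
    rspec_rate_bps: int   # requested (reserved) rate
    tspec_peak_bps: int   # declared peak rate of the traffic

@dataclass
class FilterSpec:
    """Which packets of the session the reservation applies to."""
    sender_address: str
    sender_port: int

@dataclass
class FlowDescriptor:
    """A reservation request: flow specification + filter specification."""
    flowspec: FlowSpec
    filterspec: FilterSpec
```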

Fig. 3 illustrates the relationship between session, flow specification, and filter specification. Each incoming packet belongs to at most one session and is handled according to the logical flow of that session. If a packet does not belong to any session, it is delivered only insofar as there are free resources.

The main difficulty with RSVP relates to multicasting. An example of a multicast configuration is shown in Fig. 6. It consists of four routers; the link between any two routers, shown as a line, can be either a direct link or a subnet. Three hosts, G1, G2 and G3, belong to the same group and receive datagrams with the corresponding multicast address. Data for this address is transmitted by two hosts, S1 and S2. The red line corresponds to the routing tree for S1 and this group, the blue line to the tree for S2 and the same group. The arrows indicate the direction of packets from S1 (red) and from S2 (blue).

The figure shows that all four routers must be aware of each recipient's resource reservation. Thus, resource allocation requests propagate backward through the routing tree.

RSVP uses two main message types: Resv and Path. Resv messages are generated by the recipients and propagate up the tree, with each node along the way merging and aggregating requests from different recipients whenever possible. These messages cause the router to enter a resource reservation state for that session (multicast address). Eventually the merged Resv messages reach the sending hosts, which use the received information to set the appropriate traffic control parameters for the first hop.

Fig. 7 shows the Resv message flow. Note that the messages are merged, so only one message is sent up any branch of the combined delivery tree. However, these messages must be resent periodically to refresh the resource reservation.
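
As a toy illustration of that merging, the function below combines the Resv requests arriving on one branch into a single upstream request. Taking the maximum requested rate is just one possible merging policy, chosen here only for the example.

```python
def merge_resv_requests(requests):
    """Merge per-receiver Resv requests (receiver id -> requested rate in bit/s).

    The router forwards one upstream reservation large enough to satisfy the
    most demanding downstream receiver on this branch (illustrative policy).
    """
    if not requests:
        return None
    return max(requests.values())

# Three receivers behind the same interface ask for different rates,
# but only one merged request travels up the tree:
# merge_resv_requests({"G1": 1_500_000, "G2": 500_000, "G3": 3_000_000}) -> 3_000_000
```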

The Path message is used to propagate reverse-route information. All modern multicast routing protocols maintain only the forward route, in the form of a distribution tree rooted at the sender, while Resv messages must travel back through all intermediate routers to the sending hosts.

Since the routing protocol does not provide reverse-route information, RSVP carries it in Path messages. Any host that wants to act as a sender sends a Path message to all members of the group. Along the way, each router and each destination host records path state, indicating that packets addressed to this sender should be forwarded to the hop from which the Path message was received. Fig. 5 shows that Path packets travel over the same routes as the data packets.

Consider the operation of the RSVP protocol. From the host's point of view, the operation of the protocol consists of the following steps (the first two steps in this sequence are sometimes reversed).

  1. The recipient joins the multicast group by sending an IGMP message to the neighbor router.
  2. The potential sender sends a Path message to the group address.
  3. The recipient receives a Path message identifying the sender.
  4. Now that the receiver has the reverse-route information, it can send Resv messages with flow descriptors.
  5. Resv messages are sent over the network to the sender.
  6. The sender starts transmitting data.
  7. The receiver starts receiving data packets.

Yesterday's methods of handling large volumes of traffic are completely unsuitable for modern networks. Without new tools, it is impossible to meet the growing demands on data transmission caused by rising volumes, the spread of real-time applications and multicast distribution. RTP and RSVP provide a solid foundation for the networks of the next generation.

A real-world application of these protocols is the VoIP (Voice over IP) model described in the H.323 standard, which provides for the transmission of audio, video and data over an IP network. In this model the real-time protocol RTP carries the media traffic, and the RSVP protocol is used to reserve network resources.

RTP protocol

The main transport protocol for multimedia applications has become the Real-Time Transport Protocol (RTP), designed to carry packets containing coded speech over an IP network. RTP packets are transmitted over the UDP protocol, which in turn runs over IP (Fig. 1.5).

Fig. 1.5.

In fact, the layer to which RTP belongs is not defined as unambiguously as Fig. 1.5 and the usual textbook descriptions suggest. On the one hand, the protocol really does run on top of UDP, is implemented in application programs and, by all indications, is an application-layer protocol. At the same time, as noted at the beginning of this section, RTP provides transport services independently of the particular multimedia application and is, from that point of view, a transport protocol. The best definition is probably this: RTP is a transport protocol implemented at the application layer.

To transmit voice (multimedia) traffic, RTP uses packets, the structure of which is shown in Fig. 1.6.

An RTP packet consists of at least 12 bytes. The first two bits of the RTP header (version bit field, V) indicate the version of the RTP protocol (currently version 2).

Clearly, with this header structure at most one more RTP version is possible. The next field contains two bits: the P bit, which indicates whether padding octets have been added to the end of the payload field (they are usually added when the transport protocol or the encoding algorithm requires fixed-size blocks), and the X bit, which indicates whether an extension header is used.


Fig. 1.6.

If an extension header is used, its first word contains the total length of the extension. Next, the four CC bits give the number of CSRC fields at the end of the RTP header, i.e. the number of sources contributing to the stream. The marker bit M allows the sender to mark what the standard defines as significant events, for example the beginning of a video frame or the start of speech in an audio channel. It is followed by the payload type field PT (7 bits), which carries the payload type code defining the contents of the payload field (application data), for example uncompressed 8-bit audio, MP3, and so on; from this code the application knows how to decode the data.

The rest of the fixed-length header consists of the Sequence Number field, the Timestamp field, which records when the first octet of the packet's payload was created, and the SSRC (synchronization source) field, which identifies the source. The source may be a single device with a single network address, several sources representing different media (audio, video, etc.), or different streams of the same medium. Since the sources may be different devices, the SSRC identifier is chosen randomly so that the chance of two sources picking the same value within one RTP session is minimal; a mechanism for resolving such collisions, should they arise, is also defined. The fixed part of the RTP header may be followed by up to 15 separate 32-bit CSRC fields identifying the contributing sources.

RTP is supported by the Real-Time Transport Control Protocol (RTCP), which generates additional reports containing information about RTP sessions. Recall that neither UDP nor RTP deals with providing QoS (Quality of Service). RTCP gives senders feedback from the stream receivers: information about QoS, about the packets (loss, delay, jitter) and about the users (application, stream). For flow control there are two types of reports: those generated by senders and those generated by receivers. For example, the percentage of lost packets and the absolute number of losses allow the sender, on receiving a report, to detect that channel congestion may be preventing receivers from getting the packet streams they expected. In this case the sender can lower the coding rate to reduce the congestion and improve reception. The sender report contains information about when the last RTP packet was generated (it includes both the internal timestamp and real time). This information allows the receiver to coordinate and synchronize multiple streams, such as video and audio. If the stream is directed to several recipients, each of them generates its own stream of RTCP packets; measures are therefore taken to limit the RTCP bandwidth, with the rate at which each participant sends reports made inversely proportional to the number of recipients.
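
Two of the quantities carried in RTCP receiver reports, sketched in code. The smoothing gain of 1/16 in the jitter estimate follows RFC 1889; the function names themselves are arbitrary.

```python
def update_jitter(jitter, transit_prev, transit_now):
    """One step of the interarrival jitter estimate from RFC 1889.

    `transit` is (arrival time - RTP timestamp) for a packet, expressed in
    timestamp units; the change between consecutive packets is smoothed
    with a gain of 1/16.
    """
    d = abs(transit_now - transit_prev)
    return jitter + (d - jitter) / 16.0

def fraction_lost(packets_expected, packets_received):
    """Fraction of packets lost since the last report (0.0 .. 1.0)."""
    lost = max(packets_expected - packets_received, 0)
    return lost / packets_expected if packets_expected else 0.0
```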

It should be noted that, quite apart from RTCP, the RTP/UDP/IP chain itself introduces significant overhead in the form of headers. The G.729 codec generates packets of 10 bytes (80 bits every 10 ms). A single 12-byte RTP header is already larger than this entire payload. On top of it, an 8-byte UDP header and a 20-byte IP header (in IPv4) must be added, so the headers are four times the size of the transmitted data.
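
The overhead figure quoted above, worked through as a quick check:

```python
# Per-packet overhead of the RTP/UDP/IPv4 chain for a G.729 frame.
payload_bytes = 10                   # G.729: 80 bits of speech every 10 ms
rtp, udp, ipv4 = 12, 8, 20           # header sizes in bytes
headers = rtp + udp + ipv4           # 40 bytes of headers per packet
print(headers / payload_bytes)       # 4.0 -> headers are four times the payload
```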
