How to build large-scale end-to-end encrypted group video calls

Signal released end-to-end encrypted group calls a year ago, and since then we’ve scaled from support for 5 participants all the way to 40. There is no off-the-shelf software that would allow us to support calls of that size while ensuring that all communication is end-to-end encrypted, so we built our own open source Signal Calling Service to do the job. This post will describe how it works in more detail.

Selective Forwarding Units (SFUs)

In a group call, each party needs to get their audio and video to every other participant in the call. There are 3 potential general architectures for doing so:

  • Full mesh: Each call participant sends its media (audio and video) directly to each other call participant. This works for very small calls, but does not scale to many participants. Most people just don’t have an Internet connection fast enough to send 40 copies of their video at the same time.
  • Server mixing: Each call participant sends its media to a server. The server “mixes” the media together and sends it to each participant. This works with many participants, but is not compatible with end-to-end encryption because it requires that the server be able to view and alter the media.
  • Selective Forwarding: Each participant sends its media to a server. The server “forwards” the media to other participants without viewing or altering it. This works with many participants, and is compatible with end-to-end-encryption.
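To make the full-mesh bandwidth problem concrete, here is a small sketch of the upload arithmetic. The function names and the 500 kbps per-stream bitrate are assumptions chosen for illustration, not figures from Signal.

```rust
/// Upload bandwidth needed by one participant in a full mesh:
/// one copy of its media for every other participant.
fn full_mesh_upload_kbps(participants: u32, kbps_per_stream: u32) -> u32 {
    participants.saturating_sub(1) * kbps_per_stream
}

/// Upload bandwidth needed when media goes through a server
/// (mixing or forwarding): a single copy, regardless of call size.
fn server_upload_kbps(kbps_per_stream: u32) -> u32 {
    kbps_per_stream
}
```

Under these assumptions, a 40-person full mesh needs 39 × 500 = 19,500 kbps of upload per participant, while either server-based design needs 500 kbps.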

Because Signal must have end-to-end encryption and scale to many participants, we use selective forwarding. A server that does selective forwarding is usually called a Selective Forwarding Unit or SFU.

If we focus on the flow of media from a single sending participant through an SFU to multiple receiving participants, it looks like this:

[Figure: SFU diagram]

A simplified version of the main loop in the code of an SFU looks like this:

// `?` propagates I/O errors; this assumes an enclosing function returning std::io::Result.
let socket = std::net::UdpSocket::bind(config.server_addr)?;
let mut clients = ...;  // changes over time as clients join and leave
loop {
  let mut incoming_buffer = [0u8; 1500];
  let (incoming_size, sender_addr) = socket.recv_from(&mut incoming_buffer)?;
  let incoming_packet = &incoming_buffer[..incoming_size];

  for receiver in &clients {
     // Don't send to yourself
     if sender_addr != receiver.addr {
       // Rewriting the packet is needed for reasons we'll describe later.
       let outgoing_packet = rewrite_packet(incoming_packet, receiver);
       socket.send_to(&outgoing_packet, receiver.addr)?;
     }
  }
}

Signal’s Open Source SFU

When building support for group calls, we evaluated many open source SFUs, but only two had adequate congestion control (which, as we’ll see shortly, is critical). We launched group calls using a modified version of one of them, but soon found that even with heavy modifications, we couldn’t reliably scale past 8 participants due to high server CPU usage. To scale to more participants, we wrote a new SFU from scratch in Rust. It has now been serving all Signal group calls for 9 months, scales to 40 participants with ease (perhaps more in the future), and is readable enough to serve as a reference implementation for an SFU based on the WebRTC protocols (ICE, SRTP, transport-cc, and googcc).

Let’s now take a deeper dive into the hardest part of an SFU. As you might have guessed, it’s more complex than the simplified loop above.

The Hardest Part of an SFU

The hardest part of an SFU is forwarding the right video resolutions to each call participant while network conditions are constantly changing. This difficulty is a combination of the following fundamental problems:

  1. The capacity of each participant’s Internet connection is constantly changing and hard to know. If the SFU sends too much, it will cause additional latency. If the SFU sends too little, the quality will be low. So the SFU must constantly and carefully adjust how much it sends each participant to be “just right”.
  2. The SFU cannot modify the media it forwards; to adjust how much it sends, it must select from media sent to it. If the “menu” to select from were limited to sending either the highest resolution available or nothing at all, it would be difficult to adjust to a wide variety of network conditions. So each participant must send the SFU multiple resolutions of video and the SFU must constantly and carefully switch between them.

The solution is to combine several techniques which we will discuss individually:

  • Simulcast and Packet Rewriting allow switching between different video resolutions.
  • Congestion Control determines the right amount to send.
  • Rate Allocation determines what to send within that budget.

Simulcast and Packet Rewriting

In order for the SFU to be able to switch between different resolutions, each participant must send to the SFU many layers (resolutions) simultaneously. This is called simulcast. If we focus on just one sender’s media being forwarded to two receivers, it looks like this, where each receiver switches between small and medium layers but at different times:

[Figure: Simulcast diagram]

But what does the receiving participant see as the SFU switches between different layers? Does it see one layer switching resolutions or does it see multiple layers switching on and off? This may seem like a minor distinction, but it has major implications for the role the SFU must play. Some video codecs, such as VP9 or AV1, make this easy: switching layers is built into the video codec in a way called SVC. Because we’re still using VP8 to support a wide range of devices, and since VP8 doesn’t support SVC, the SFU must do something to transform 3 layers into 1.

This is similar to how video streaming apps stream different quality video to you depending on how fast your Internet connection is. You view a single video stream switching between different resolutions, and in the background, you are receiving different encodings of the same video stored on the server. Like a video streaming server, the SFU sends you different resolutions of the same video. But unlike a video streaming server, there is nothing stored and it must do this completely on the fly. It does so via a process called packet rewriting.

Packet rewriting is the process of altering the timestamps, sequence numbers, and similar IDs contained in a media packet that indicate where on a media timeline the packet belongs. It transforms packets from many independent media timelines (one for each layer) into one unified media timeline (one layer). The IDs that must be rewritten when using RTP and VP8 are the following:

  • RTP SSRC: Identifies a stream of consecutive RTP packets. Each simulcast layer is identified by a unique SSRC. To convert from many layers (for example, 1, 2, and 3) to one layer, we must change (rewrite) this value to the same value (say, 1).
  • RTP sequence number: Orders the RTP packets that share an SSRC. Because each layer has a different number of packets, it is not possible to forward packets from multiple layers without changing (rewriting) the sequence numbers. For example, if we want to forward sequence numbers [7, 8, 9] from one layer followed by [8, 9, 10, 11] from another layer, we can’t send them as [7, 8, 9, 8, 9, 10, 11]. Instead we’d have to rewrite them as something like [7, 8, 9, 10, 11, 12, 13].
  • RTP timestamp: Indicates when a video should be rendered relative to a base time. Because the WebRTC library we use chooses a different base time for each layer, the timestamps are not compatible between layers, and we must change (rewrite) the timestamps of one layer to match that of another.
  • VP8 Picture ID and TL0PICIDX: Identifies a group of packets which make up a video frame, and the dependencies between video frames. The receiving participant needs this information to decode the video frame before rendering. Similar to RTP timestamps, the WebRTC library we use chooses different sets of PictureIDs for each layer, and we must rewrite them when combining layers.

It would be theoretically possible to rewrite only the RTP SSRCs and sequence numbers if we altered the WebRTC library to use consistent timestamps and VP8 PictureIDs across layers. However, we already have many clients in use generating inconsistent IDs, so we need to rewrite all of those IDs to remain backwards compatible. And since the code to rewrite the various IDs is nearly identical to rewriting RTP sequence numbers, it’s not difficult to do so.

To transform multiple incoming layers into a single outgoing layer for a given video stream, the SFU rewrites packets according to the following rules:

  1. The outgoing SSRC is always the incoming SSRC of the smallest layer.
  2. If the incoming packet has an SSRC other than the one currently selected, don’t forward it.
  3. If the incoming packet is the first after a switch between layers, alter the IDs to represent the latest position on the outgoing timeline (one position after the maximum position forwarded so far).
  4. If the incoming packet is a continuation of packets after a switch (it hasn’t just switched), alter the IDs to represent the same relative position on the timeline based on when the switch occurred in the previous rule.

For example, if we had two input layers with SSRCs A and B and a switch occurred after two packets, packet rewriting may look something like this:

[Figure: Packet rewriting diagram]

A simplified version of the code looks something like this:

let mut selected_ssrc = ...;  // Changes over time as bitrate allocation happens
let mut previously_forwarded_incoming_ssrc = None;
// (RTP seqnum, RTP timestamp, VP8 Picture ID, VP8 TL0PICIDX)
let mut max_outgoing_ids = (0, 0, 0, 0);
let mut first_incoming_ids = (0, 0, 0, 0);
let mut first_outgoing_ids = (0, 0, 0, 0);
for incoming in incoming_packets {
  if selected_ssrc == incoming.ssrc {
    let just_switched = Some(incoming.ssrc) != previously_forwarded_incoming_ssrc;
    let outgoing_ids = if just_switched {
      // There is a gap of 1 seqnum to signify to the decoder that the
      // previous frame was (probably) incomplete.
      // That's why there's a 2 for the seqnum.
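      // (Tuple arithmetic here is element-wise shorthand.)
      let outgoing_ids = max_outgoing_ids + (2, 1, 1, 1);
      first_incoming_ids = incoming.ids;
      first_outgoing_ids = outgoing_ids;
      outgoing_ids
    } else {
      // Keep the same offset from the first packet forwarded after the switch.
      first_outgoing_ids + (incoming.ids - first_incoming_ids)
    };

    yield outgoing_ids;

    previously_forwarded_incoming_ssrc = Some(incoming.ssrc);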
    max_outgoing_ids = std::cmp::max(max_outgoing_ids, outgoing_ids);
  }
}
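The rewriting rules can be made runnable for a single ID. Below is a minimal sketch that handles only the 16-bit RTP sequence number and uses wrapping arithmetic so the counter can roll over. The `SeqnumRewriter` type and its method are hypothetical names for illustration, and the starting value of the outgoing timeline is arbitrary. The other IDs would follow the same pattern with their own widths and a step of 1 at each switch.

```rust
/// Maps RTP sequence numbers from several incoming layers (SSRCs)
/// onto one outgoing timeline. Hypothetical helper for illustration.
#[derive(Default)]
struct SeqnumRewriter {
    previously_forwarded_ssrc: Option<u32>,
    max_outgoing: u16,
    first_incoming: u16,
    first_outgoing: u16,
}

impl SeqnumRewriter {
    /// Returns the outgoing sequence number, or `None` if the packet
    /// belongs to a layer that is not currently selected.
    fn rewrite(&mut self, selected_ssrc: u32, ssrc: u32, seqnum: u16) -> Option<u16> {
        if ssrc != selected_ssrc {
            return None;
        }
        let just_switched = self.previously_forwarded_ssrc != Some(ssrc);
        let outgoing = if just_switched {
            // Skip one sequence number so the decoder treats the
            // previous frame as (probably) incomplete.
            let outgoing = self.max_outgoing.wrapping_add(2);
            self.first_incoming = seqnum;
            self.first_outgoing = outgoing;
            outgoing
        } else {
            // Same relative position as on the incoming timeline.
            self.first_outgoing
                .wrapping_add(seqnum.wrapping_sub(self.first_incoming))
        };
        self.previously_forwarded_ssrc = Some(ssrc);
        // Wraparound-aware "is newer" check for 16-bit counters.
        if outgoing.wrapping_sub(self.max_outgoing) < 0x8000 {
            self.max_outgoing = outgoing;
        }
        Some(outgoing)
    }
}
```

Feeding this sequence numbers 7, 8, 9 from one layer and then 8, 9, 10, 11 from another produces one continuous outgoing run with a single skipped number at the switch.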

Packet rewriting is compatible with end-to-end encryption because the rewritten IDs and timestamps are added to the packet by the sending participant after the end-to-end encryption is applied to the media (more on that below). It is similar to how TCP sequence numbers and timestamps are added to packets after encryption when using TLS. This means the SFU can view these timestamps and IDs, but these values are no more interesting than TCP sequence numbers and timestamps. In other words, the SFU doesn’t learn anything from these values except that the participant is still sending media.

Congestion Control

Congestion control is a mechanism to determine how much to send over a network: not too much and not too little. It has a long history, mostly in the form of TCP’s congestion control. Unfortunately, TCP’s congestion control algorithms generally don’t work well for video calls because they tend to cause increases in latency that lead to a poor call experience (sometimes called “lag”). To provide good congestion control for video calls, the WebRTC team created googcc, a congestion control algorithm which can determine the right amount to send without causing large increases in latency.

Congestion control mechanisms generally depend on some kind of feedback mechanism sent from the packet receiver to the packet sender. googcc is designed to work with transport-cc, a protocol in which the receiver sends periodic messages back to the sender saying, for example, “I received packet X1 at time Z1; packet X2 at time Z2, …”. The sender then combines this information with its own timestamps to know, for example, “I sent packet X1 at time Y1 and it was received at Z1; I sent packet X2 at time Y2 and it was received at Z2…”.

In the Signal Calling Service, we have implemented googcc and transport-cc in the form of stream processing. The inputs into the stream pipeline are the aforementioned data about when packets were sent and received, which we call acks. The outputs of the pipeline are changes in how much should be sent over the network, which we call the target send rates.

The first few steps of the stream plot the acks on a graph of delay vs. time and then calculate a slope to determine if the delay is increasing, decreasing, or steady. The last step decides what to do based on the current slope. A simplified version of the code looks like this:

let mut target_send_rate = config.initial_target_send_rate;
for direction in delay_directions {
  match direction {
    DelayDirection::Decreasing => {
      // While the delay is decreasing, hold the target rate to let the queues drain.
    }
    DelayDirection::Steady => {
      // While delay is steady, increase the target rate.
      let increase = ...;
      target_send_rate += increase;
      yield target_send_rate;
    }
    DelayDirection::Increasing => {
      // If the delay is increasing, decrease the rate.
      let decrease = ...;
      target_send_rate -= decrease;
      yield target_send_rate;
    }
  }
}

This is the crux of googcc: If latency is increasing, stop sending so much. If latency is decreasing, let it continue. If latency is steady, try sending more. The result is a send rate which closely approximates the actual network capacity while adjusting to changes and keeping latency low.

Of course, the “…” in the code above about how much to increase or decrease is complicated. Congestion control is hard. But now you can see how it generally works for video calls:

  1. The sender picks an initial rate and starts sending packets.
  2. The receiver sends back feedback about when it received the packets.
  3. The sender uses that feedback to adjust the send rate with the rules described above.
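The step that turns acks into a delay direction can be sketched with an ordinary least-squares fit of delay against send time. This is an illustration only: the `Ack` struct, the function names, and the fixed threshold are assumptions, and the real googcc pipeline applies more filtering than a plain linear fit. Because the sender and receiver clocks are not synchronized, the absolute delay includes an unknown constant offset, but a constant offset does not change the slope.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum DelayDirection {
    Decreasing,
    Steady,
    Increasing,
}

/// One ack: when the sender sent a packet (its own clock) and when
/// the receiver reported receiving it (the receiver's clock).
struct Ack {
    sent_ms: f64,
    received_ms: f64,
}

/// Least-squares slope of delay vs. send time.
/// Returns `None` when there are too few points to fit a line.
fn delay_slope(acks: &[Ack]) -> Option<f64> {
    if acks.len() < 2 {
        return None;
    }
    let n = acks.len() as f64;
    let mean_sent = acks.iter().map(|a| a.sent_ms).sum::<f64>() / n;
    let mean_delay = acks.iter().map(|a| a.received_ms - a.sent_ms).sum::<f64>() / n;
    let (mut numerator, mut denominator) = (0.0, 0.0);
    for ack in acks {
        let dt = ack.sent_ms - mean_sent;
        numerator += dt * ((ack.received_ms - ack.sent_ms) - mean_delay);
        denominator += dt * dt;
    }
    if denominator == 0.0 {
        None
    } else {
        Some(numerator / denominator)
    }
}

/// Classifies a slope using a fixed threshold (an assumption made
/// here for simplicity).
fn classify(slope: f64, threshold: f64) -> DelayDirection {
    if slope > threshold {
        DelayDirection::Increasing
    } else if slope < -threshold {
        DelayDirection::Decreasing
    } else {
        DelayDirection::Steady
    }
}
```

For example, if packets sent 10 ms apart each arrive 1 ms later than the previous one did, the delay grows by 0.1 ms per ms of send time, which this sketch classifies as increasing for any threshold below 0.1.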

Rate Allocation

Once the SFU knows how much to send, it now must determine what to send (which layers to forward). This process, which we call rate allocation, is like the SFU choosing from a menu of layers constrained by a send rate budget. For example, if each participant is sending 2 layers and there are 3 other participants, there would be 6 total layers on the menu.

If the budget is big enough, we can send everything we want (up to the largest layer for each participant). But if not, we must prioritize. To help in prioritization, each participant tells the server what resolutions it needs by requesting a maximum resolution. Using that information, we use the following rules for rate allocation:

  1. Layers larger than the requested maximum are excluded. For example, there is no need to send you high resolutions of every video if you’re only viewing a grid of small videos.
  2. Smaller layers are prioritized over larger layers. For example, it is better to view everyone in low resolution rather than some in high resolution and others not at all.
  3. Larger requested resolutions are prioritized before smaller requested resolutions. For example, once you can see everyone, then the video that appears largest to you will fill in with higher quality before the others.

A simplified version of the code looks like the following.

// The input: a menu of video options.
// Each has a set of layers to choose from and a requested maximum resolution.
let mut videos = ...;  // mutable because it is sorted below

// The output: for each video above, which layer to forward, if any
let mut allocated_by_id = HashMap::new();
let mut allocated_rate = 0;

// Biggest first
videos.sort_by_key(|video| Reverse(video.requested_height));

// Lowest layers for each before the higher layer for any
for layer_index in 0..=2 {
  for video in &videos {
    if video.requested_height > 0 {
      // The first layer which is "big enough", or the biggest layer if none are.
      let requested_layer_index = video.layers.iter().position(
         |layer| layer.height >= video.requested_height).unwrap_or(video.layers.len() - 1);
      if layer_index <= requested_layer_index {
        let layer = &video.layers[layer_index];
        // `get` returns a reference, so copy the tuple out before defaulting to (0, 0).
        let (_, allocated_layer_rate) = allocated_by_id.get(&video.id).copied().unwrap_or_default();
        let increased_rate = allocated_rate + layer.rate - allocated_layer_rate;
        if increased_rate < target_send_rate {
          allocated_by_id.insert(video.id, (layer_index, layer.rate));
          allocated_rate = increased_rate;
        }
      }
    }
  }
}
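For readers who want to experiment with the allocation rules, here is a self-contained sketch of the same loop. The `Layer` and `Video` structs, the `allocate` function, and the kbps units are assumptions for illustration. Layers are assumed to be ordered from smallest to largest.

```rust
use std::cmp::Reverse;
use std::collections::HashMap;

struct Layer {
    height: u32,
    rate_kbps: u32,
}

struct Video {
    id: u32,
    requested_height: u32, // 0 means "not requested"
    layers: Vec<Layer>,    // ordered smallest to largest
}

/// Returns, for each forwarded video id, (layer index, layer rate).
fn allocate(videos: &mut [Video], target_send_rate_kbps: u32) -> HashMap<u32, (usize, u32)> {
    let mut allocated_by_id: HashMap<u32, (usize, u32)> = HashMap::new();
    let mut allocated_rate = 0u32;

    // Biggest requested resolution first.
    videos.sort_by_key(|video| Reverse(video.requested_height));

    // Lowest layers for each video before a higher layer for any.
    let max_layers = videos.iter().map(|v| v.layers.len()).max().unwrap_or(0);
    for layer_index in 0..max_layers {
        for video in videos.iter() {
            if video.requested_height == 0 || video.layers.is_empty() {
                continue;
            }
            // The first layer that is "big enough", or the biggest if none are.
            let requested_layer_index = video
                .layers
                .iter()
                .position(|layer| layer.height >= video.requested_height)
                .unwrap_or(video.layers.len() - 1);
            if layer_index > requested_layer_index {
                continue;
            }
            let layer = &video.layers[layer_index];
            let (_, current_rate) = allocated_by_id.get(&video.id).copied().unwrap_or_default();
            // Upgrading replaces the previously allocated layer's rate.
            let increased_rate = allocated_rate + layer.rate_kbps - current_rate;
            if increased_rate < target_send_rate_kbps {
                allocated_by_id.insert(video.id, (layer_index, layer.rate_kbps));
                allocated_rate = increased_rate;
            }
        }
    }
    allocated_by_id
}
```

As a worked case, take three videos that each offer layers of 180p at 100 kbps, 360p at 400 kbps, and 720p at 1500 kbps, with requested heights of 720, 360, and 180 and a budget of 1000 kbps. Every video first gets its 180p layer (300 kbps total), then the two larger requests are upgraded to 360p (900 kbps total), and the 720p layer does not fit.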

Putting it all together

By combining these three techniques, we have a full solution:

  1. The SFU uses googcc and transport-cc to determine how much it should send to each participant.
  2. The SFU uses that budget to select video resolutions (layers) to forward.
  3. The SFU rewrites packets from many layers into one layer for each video stream.

The result is that each participant can view all the others in the best way possible given the current network conditions, and in a way compatible with end-to-end encryption.

End-to-end encryption

Speaking of end-to-end encryption, it’s worth briefly describing how it works. Because it’s completely opaque to the server, the code for it is not in the server, but rather in the client. In particular, our implementation exists in RingRTC, an open source video calling library written in Rust.

The contents of each frame are encrypted before being divided into packets, similar to SFrame. The interesting part is really the key distribution and rotation mechanism, which must be robust against the following scenarios:

  1. Someone who has not joined the call must not be able to decrypt media from before they joined. If this were not the case, someone who could obtain the encrypted media (such as by compromising the SFU) would be able to know what happened in the call before they joined, or worse, without ever having joined.
  2. Someone who has left the call must not be able to decrypt media from after they left. If this were not the case, someone who could obtain the encrypted media would be able to know what happened in the call after they left.

In order to guarantee these properties, we use the following rules:

  1. When a client joins the call, it generates a key and sends it to all other clients of the call over Signal messages (which are themselves end-to-end encrypted) and uses that key to encrypt media before sending it to the SFU.
  2. Whenever any user joins or leaves the call, each client in the call generates a new key and sends it to all clients in the call. It then begins using that key 3 seconds later (allowing for some time for clients to receive the new key).
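The sender side of these rotation rules might be sketched as a small state machine. This is not RingRTC’s actual code: the `SenderKeys` type, its methods, and the 32-byte key size are hypothetical, time is passed in as a `Duration` since some fixed start so the logic is easy to test, and generating key material from a cryptographically secure random source and delivering it over Signal messages are left to the caller.

```rust
use std::time::Duration;

/// Delay between distributing a new key and starting to use it.
const ROTATION_DELAY: Duration = Duration::from_secs(3);

/// Key size is an assumption for illustration.
type Key = [u8; 32];

struct SenderKeys {
    current: Key,
    /// A newly generated key and the time at which to start using it.
    pending: Option<(Key, Duration)>,
}

impl SenderKeys {
    /// Called when this client joins the call with its first key.
    fn new(initial: Key) -> Self {
        SenderKeys { current: initial, pending: None }
    }

    /// Called whenever any user joins or leaves the call.
    /// `new_key` must be freshly generated by the caller.
    /// Returns the key that should be sent to all clients in the call.
    fn on_membership_change(&mut self, new_key: Key, now: Duration) -> Key {
        self.pending = Some((new_key, now + ROTATION_DELAY));
        new_key
    }

    /// The key to encrypt outgoing media with at time `now`.
    fn key_for_sending(&mut self, now: Duration) -> Key {
        if let Some((key, start)) = self.pending {
            if now >= start {
                self.current = key;
                self.pending = None;
            }
        }
        self.current
    }
}
```

In this sketch, a second membership change within the 3 second window simply replaces the pending key, so the sender always ends up on a key generated after the most recent join or leave.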

Using these rules, each client is in control of its own key distribution and rotation and rotates keys depending on who is in the call, not who is invited to the call. This means that each client can verify that the above security properties are guaranteed.

We hope people enjoy support for large group video calls, and that our open source Signal Calling Service is useful for anyone that wants to build and deploy their own end-to-end encrypted group video calls. And we’re hiring if you’re interested in working on problems like these at Signal!
