You’ve answered yourself. SSTP uses TCP as its transport, and by the nature of TCP, the receiving TCP stack cannot release newer data to the upper layer until the older data have arrived. So if a transport packet is lost, it takes some time until the sender notices and sends it again. During all that time the newer packets keep arriving at the destination, but they cannot be unpacked until the lost one arrives too. So there is a gap in the delivery of the payload UDP packets, and once the missing transport packet arrives, all the payload UDP packets that were waiting for it are delivered together (in the correct order, but with no time between them).
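To make the effect concrete, here is a rough sketch in Python, not a model of any real TCP stack. All numbers (20 ms packet interval, 30 ms one-way delay, retransmission 100 ms after the original send) and the function name `delivery_times` are made up for illustration, and one payload packet is assumed to travel in one transport packet.

```python
def delivery_times(n_packets, interval_ms, one_way_ms, lost_index,
                   retransmit_after_ms, over_tcp):
    """Time (ms) at which each payload packet reaches the application.

    Simplified: exactly one transport packet is lost. Over TCP it is
    retransmitted retransmit_after_ms after the original send; over a
    datagram transport it is simply gone (None).
    """
    arrivals = []
    for i in range(n_packets):
        sent = i * interval_ms
        if i != lost_index:
            arrivals.append(sent + one_way_ms)
        elif over_tcp:
            arrivals.append(sent + retransmit_after_ms + one_way_ms)
        else:
            arrivals.append(None)  # lost for good, nothing waits for it

    if not over_tcp:
        return arrivals

    # In-order delivery: a packet is released only once everything
    # before it has arrived, so it inherits the latest arrival so far.
    delivered, latest = [], 0
    for arrival in arrivals:
        latest = max(latest, arrival)
        delivered.append(latest)
    return delivered


tcp = delivery_times(8, 20, 30, 2, 100, over_tcp=True)
udp = delivery_times(8, 20, 30, 2, 100, over_tcp=False)
print(tcp)  # [30, 50, 170, 170, 170, 170, 170, 170]
print(udp)  # [30, 50, None, 90, 110, 130, 150, 170]
```

With the TCP transport, everything from the lost packet onwards comes out at the same instant; with the datagram transport, one packet is missing and the rest keep their 20 ms spacing.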
BTW, even transporting TCP over TCP is a bad idea, google for “tcp meltdown”. So all in all, VPNs using a TCP transport only work well where there is practically no packet loss.
So if you cannot switch over to IPsec, it will always be like this.
BTW, with IPsec on the same connection you wouldn’t get so much jitter, merely a short dropout in the audio, because the lost packet would not be retransmitted. As long as the loss rate is not too high, the result is much better because only one packet is lost. When 10 packets wait in a row and then arrive at the same moment, the dropout in the audio is longer unless the receiving side has a de-jitter buffer (DJB) deep enough for 10 packets, which in most cases means 200 ms. With such a buffer at both ends the round trip is at least 400 ms, which causes a lot of discomfort to the call parties and even double talk. So the DJB is usually shorter, and more than one packet is effectively lost if 10 packets have to wait for the 11th one.
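The arithmetic can be sketched like this. It is only an illustration: the 20 ms per packet follows from "10 packets means 200 ms", while the function names, the 60 ms buffer, and the two delivery lists (one lost packet, retransmitted about 100 ms later in the TCP case) are hypothetical.

```python
def round_trip_from_buffers_ms(buffer_packets, packet_ms=20):
    """Delay the de-jitter buffers alone contribute to the round trip,
    assuming the same buffer depth at both ends of the call."""
    return 2 * buffer_packets * packet_ms


def unusable_packets(delivery_ms, packet_ms, buffer_ms):
    """Count packets that are missing or arrive after their playout time.

    Playout of packet 0 starts buffer_ms after it is delivered, and each
    following packet is due packet_ms later. Assumes packet 0 arrived.
    """
    playout_start = delivery_ms[0] + buffer_ms
    count = 0
    for i, delivered in enumerate(delivery_ms):
        deadline = playout_start + i * packet_ms
        if delivered is None or delivered > deadline:
            count += 1
    return count


# Hypothetical delivery times in ms, third packet lost in transit.
over_tcp = [30, 50, 170, 170, 170, 170, 170, 170]      # burst after retransmit
over_datagram = [30, 50, None, 90, 110, 130, 150, 170]  # lost packet stays lost

print(round_trip_from_buffers_ms(10))            # 400
print(unusable_packets(over_tcp, 20, 60))        # 2
print(unusable_packets(over_datagram, 20, 60))   # 1
print(unusable_packets(over_tcp, 20, 200))       # 0
```

So with a short buffer the TCP transport costs more audio than the single lost packet, and the buffer that would hide the burst completely is too deep for a comfortable conversation.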