The packets you need to prioritize in the outgoing direction are the IPsec transport packets sent by the router itself. Depending on whether there is an external NAT between the two routers, these transport packets are either ESP packets (no NAT) or UDP packets to port 4500 at the L2TP/IPsec client and from port 4500 at the server (with NAT).
In the download direction, you cannot affect what the ISP sends to you except by slowing down the delivery of traffic which uses some kind of feedback (all TCP and some application protocols on top of UDP, such as QUIC). So the only way to guarantee enough bandwidth for real-time traffic, which cannot be moderated, is to throttle this kind of "moderatable" traffic so low that enough download bandwidth remains for the real-time flows. You can, however, also throttle the "moderatable" traffic coming via the L2TP.
So if you have 20 Mbit/s of download bandwidth from your ISP and you reserve, say, 120 kbit/s per voice call coming through the VPN, then with five simultaneous calls expected at peak time you must cap the other traffic at 19.4 Mbit/s (20 Mbit/s minus 5 × 120 kbit/s). That's just an example; in fact there is more "non-moderatable" traffic than just IP telephony, so you'll have to see how low you must keep the "moderatable" traffic to get a clear sound.
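The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name and its parameters are made up for this post, and the 120 kbit/s per call figure is the example value used above, not a measured one.

```python
def moderatable_cap_kbps(link_kbps, calls, per_call_kbps=120):
    """Return the cap (in kbit/s) for traffic that can be throttled.

    link_kbps     -- download bandwidth from the ISP, in kbit/s
    calls         -- peak number of simultaneous voice calls
    per_call_kbps -- bandwidth reserved per call (example value, an assumption)
    """
    reserved = calls * per_call_kbps
    if reserved >= link_kbps:
        raise ValueError("reservation exceeds the link bandwidth")
    return link_kbps - reserved


# 20 Mbit/s link, five calls at 120 kbit/s each
print(moderatable_cap_kbps(20_000, 5))  # 19400 kbit/s, i.e. 19.4 Mbit/s
```

Once you know what other "non-moderatable" traffic you have, add its share to the reservation the same way and lower the cap accordingly.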
And within the "moderatable" traffic, you can prioritize: what comes via the L2TP may have a higher priority than what comes directly via WAN.
Instead of writing novels, post /export hide-sensitive. Use find&replace in your favourite text editor to systematically replace all occurrences of each public IP address potentially identifying you by a distinctive pattern such as my.public.ip.1.
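If you would rather script the find&replace than do it by hand, something like the following Python sketch should do. The function name is hypothetical, the sample addresses come from the ranges reserved for documentation, and you have to supply the list of your own public addresses yourself; it does not try to detect them.

```python
import re


def mask_addresses(text, addresses):
    """Replace each listed IPv4 address with my.public.ip.N (N = 1, 2, ...)."""
    for n, addr in enumerate(addresses, start=1):
        # The lookarounds keep 203.0.113.7 from matching inside 203.0.113.70
        # or 1203.0.113.7; a trailing /24 or end of line is still accepted.
        pattern = r"(?<![\d.])" + re.escape(addr) + r"(?!\.?\d)"
        text = re.sub(pattern, f"my.public.ip.{n}", text)
    return text


# Made-up fragment of an export, for illustration only
sample = (
    "add address=203.0.113.7/24 interface=ether1\n"
    "add dst-address=203.0.113.70 gateway=198.51.100.1\n"
)
print(mask_addresses(sample, ["203.0.113.7", "198.51.100.1"]))
```

The same address always maps to the same placeholder, so the relationships between the rules remain readable in the posted export.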