Problem
We have been seeing dropped packets, missing data points, and comms errors while testing our sensor board product (RIOMs), which costs us time and forces us to repeat tests. The test setup does work sometimes, but not reliably enough. I suspect the router config. Since the RIOMs primarily use UDP, maybe there are collisions, or some problem with buckets filling up?
The more tests we run per day, the more failure warnings and dropped/missing data points we get in each batch.
I am taking on this project from someone who left the company a long time ago, and support is limited for this.
I have completely ruled out any physical/cable connection issues.
Due to the nature of our work and our contracts with customers, neither the product nor the GUI used for testing can be changed in any way. The IP addresses used are also unchangeable.
The test fixture
10 RIOMs, each one behind its own router (10 routers in total).
Routers are RB750UPr2 hEX PoE lite (mipsbe), RouterOS v6.47.9
(config script below)
Diagram of test setup (below)
The process
Automated via an interactive desktop GUI on Windows 10.
Computer NIC is set to 192.168.129.250/24 and 10.10.10.250/24. This allows the GUI to send commands to the ACU for move-to, and also allows the GUI to send and receive sensor data from the individual RIOMs.
The GUI first sends a payload program to each RIOM for basic functionality.
After a power-cycle, the GUI puts the RIOMs into a "calibration" mode, and commands the ACU to move to various positions, sending "known" position data to each RIOM. At each position, the RIOMs set their own internal offsets/calibration.
After calibration, the GUI commands the ACU to move the fixture into various known test point orientations.
At each test position, raw data is received by the GUI from all of the sensors/RIOMs.
The GUI calculates pass/fail based on the recorded data and predetermined tolerances at the end of the procedure.
The RIOMs under test all have the same fixed IP address, which is set in hardware: 192.168.0.63.
The public (WAN-side) IP of each RIOM, as seen by the GUI, must be 10.10.10.x: 10.10.10.14 is RIOM#1, 10.10.10.24 is RIOM#2, 10.10.10.34 is RIOM#3, and so on up to 10.10.10.104 for RIOM#10.
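To make the addressing concrete, here is a small Python sketch of how I understand the mapping. The formula (last octet = 10 * n + 4) is my own inference from the four examples above, not something Engineering gave me, and the function names are just mine for illustration.

```python
import ipaddress

# Fixed in hardware and identical on every RIOM (from the post).
RIOM_LAN_IP = ipaddress.ip_address("192.168.0.63")
# Network of the test PC's 10.10.10.250/24 address.
GUI_NET = ipaddress.ip_network("10.10.10.0/24")


def riom_wan_ip(n: int) -> ipaddress.IPv4Address:
    """WAN-side address for RIOM #n, ASSUMING last octet = 10 * n + 4."""
    if not 1 <= n <= 10:
        raise ValueError("this fixture only has RIOM#1 to RIOM#10")
    return ipaddress.ip_address(f"10.10.10.{10 * n + 4}")


def translation_table() -> dict:
    """Map RIOM number -> (router WAN IP, RIOM LAN IP)."""
    return {n: (riom_wan_ip(n), RIOM_LAN_IP) for n in range(1, 11)}


if __name__ == "__main__":
    for n, (wan, lan) in translation_table().items():
        # Every WAN address should sit on the GUI's 10.10.10.0/24 network.
        assert wan in GUI_NET
        print(f"RIOM#{n}: {wan} <-> {lan}")
```

Each router would then only need to translate its one WAN address to 192.168.0.63 and back.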
I was able to obtain the following info from Engineering regarding the protocols:
"The RIOMs use a custom UDP protocol on port 21001 for normal communications. However the GUI will also use a TCP connection to start the payload on port 9760. We add our own pseudo-TCP checksum and structure on the UDP packets. If you're wanting to put some rules on the routers you'll need to allow ports 21001 and 9760.
The GUI only knows about the router's reported IP address. So if a router has 10.10.10.14 on its WAN port that will be what the GUI is looking for. From the GUI's perspective all the IPs in its table are reachable RIOMs, it's up to the router to translate. The routers really just translate the WAN (GUI network) to LAN (individual RIOM network). Each router has a unique IP and will convert requests/responses back and forth."
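Based on that, one check I could run from the test PC is whether TCP port 9760 is reachable through each router. Below is a minimal Python sketch of that idea. The helper names and the 1 second timeout are my own choices, and the address list in the usage comment assumes the 10 * n + 4 pattern. It does not exercise UDP 21001, because I don't know the packet format of the custom protocol.

```python
import socket

PAYLOAD_TCP_PORT = 9760  # TCP port the GUI uses to start the payload (per Engineering)


def tcp_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_all(hosts, port: int = PAYLOAD_TCP_PORT, timeout: float = 1.0) -> dict:
    """Probe each host once and return {host: reachable}."""
    return {host: tcp_port_open(host, port, timeout) for host in hosts}


# Example usage (assumed address pattern, run from the test PC):
#   hosts = [f"10.10.10.{10 * n + 4}" for n in range(1, 11)]
#   for host, ok in check_all(hosts).items():
#       print(host, "reachable" if ok else "NOT reachable")
```

I'm not sure whether simply opening a connection on 9760 has any side effect on a RIOM, so I would only try this on a unit that is not in the middle of a test.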
I have attached a simple illustration of my test setup (keep in mind I have 10 routers, but this only shows 5).
Thank you very much for any advice. I have been having a hard time with this, as networking is not my area of expertise, and RouterBOARDs have so many options that my head spins.