MNWis Fusion Technical Net
Monday Night 7:30 PM Central 0030 UTC
WiRES-X: MNWis #21,493
YSF: “US MNWis RDNT”, #21,493

Repeater Lock-Ups


Many people have reported Fusion repeater lock-ups that required the repeater to be shut down (power removed) and rebooted. I’ve had repeater lock happen as well with an HRI-200 directly connected to the DR2X. In this particular case I believe the event was triggered by sending a number of commands from the radio (such as the GM mode) which resulted in a race condition within the firmware.

I have previously observed that invalid data (I.e., with a lot of bit errors or badly formed packets from a hotspot) can also cause the lock-up. My assumption is that this is a problem of not validating inputs to a function prior to executing the function. The invalid data is not handled correctly and causes one of the lock-up faults I mention below to occur.

I have observed that data is more likely to be invalid if there is a lot of noise on the signal. Also a major source of problems are badly behaving hot spots that inject all sorts of crap into the data stream. There’s no control over these devices and it just gets worse as there is more work being done converting from one thing to another thing, etc.

Communications software is EXTREMELY HARD to write and much of what I see out there is poorly documented (which leads to bad code) and poorly tested.

Reflectors such as YSF or FCS are pretty dumb when they relay signals. They look for an incoming packet from one of the connected hotspots and just send that same packet out to all the other hotspots (including bridges to WiRES-X). Garbage in, garbage out. To address this I have modified the YSF reflector software to be very aggressive at dropping packets with CRC errors, invalid data in fields, and most importantly invalid FICH headers. The FICH header is extremely important for everything to work right. (You can look up FICH headers in the System Fusion documentation published by Yaesu.) This modified software is running on MNWis, Princeton Kentucky, MVARC (Idaho) and America Link. It has dramatically reduced the number of problems with WiRES-X problems.

So what is a “lock-up”. A lock-up occurs when the processor looses track of what it was doing and either halts or ends up executing invalid code. In either case it is no longer processing the code it was intended to run. Events that cause the processor to, using a technical term, go out to out to lunch can result of errors in the software that cause it to happen, corrupted memory, invalid states in a state machine, incorrectly handled race conditions, or a really nasty thing called priority inversion (where a high priority task which is pre-empting use of the CPU is waiting for a low priority task to complete, which can’t complete because, well, it’s not high priority.)

How to solve the problem?
First of all, the repeater firmware should never crash. Since it is very hard (but not impossible) to write code with no bugs, most continuously operating real-time systems will incorporate a “watch dog timer”. It is the purpose of this timer to detect when the “wheels have fallen off the processor” and perform a reset.

Since we can’t modify the repeater’s software, there are some actions we can take.

First would be to prevent bad data from getting into the system. A careful examination of the logs possibly combined with audio recordings could be helpful.

A quick fix might be to install a timer on the power supply such that the repeater’s power is removed for a short time once or several times a day.

I have a technical concept for a generic lock-up detector that would reboot the repeater in the event of a lock-up without the need to make any modifications. Not sure if I’ll work on this because I don’t know how much interest there would be. This solution would also work with non-Yaesu repeaters.

One further note. I made an observation that this problem was more likely to occur on 2 meters than on 440. Why? My theory is that there is a lot more interaction with the environment on 2 meters. Computer networks, small power supplies, computers, almost everything digital generally creates a lot more noise on 2 meters than it does on 440 MHz. I observed a situation where mixing products were occurring on a 2 meter repeater in a noisy environment. This repeater would lock up regularly. My theory was that the mixing products contained a lot of noise along with the C4FM signal causing a high number of bit errors and thus crashing the repeater. This was observed on a DR1X.

If you are struggling with this problem, I suggest you contact Juan at Yaesu Customer Service with your details. Provide Juan with as much information as you can.

Hits: 1