Yoz Linden commented on SVC-7031:
Here's a little more in-depth information on what we've done to improve the Group Chat situation:
Both Group and Conference Chat use a grid-wide delivery system which has been in place for the past four years [That's when group chat broke for TSMGO]. It's a highly concurrent distributed system which uses many modern techniques (such as asynchronous I/O) to achieve high performance. Unfortunately, it had a number of bugs which appeared only under high load and caused high lag and out-of-order delivery.
We've made brief attempts to fix this in the past, but debugging intermittent problems in a concurrent distributed system is not easy, as any software engineer will tell you. We had to dedicate a team to this diagnosis, which required them to greatly improve our facilities to simulate high load and monitor the systems under stress. (I wasn't a member of that team. I just stood on the sidelines and marvelled.)
As you should now be able to see, the strategy paid off. Several specific pain points were identified, fixes were made and tested, and the changes were rolled to the production grid. All our monitoring indicates that the capacity of the system has, at minimum, doubled. The vast majority of messages are being delivered in well under a second. The system as a whole is now much more scalable.
All that said, there are some important caveats to note:
* These fixes only affect Group and Conference Chat. Nearby Chat (also known as Local Chat) is an entirely different system, but we're also looking at potential fixes there.
* Despite huge improvements to Group Chat, we're not calling it "fixed" yet. Some people still see message delivery lag under some circumstances, and there are still plenty of potential improvements to make. However, in addition to the unclogging of the system, the team also added a great deal of extra monitoring and instrumentation. This should allow future fixes to be identified and implemented much more quickly.
posted by - 1:27 PM