The key considerations in any server architecture are maximizing performance, security, stability, and scalability while minimizing implementation complexity. For any server application this is a delicate balancing act, but for a massively multiplayer game server it is a gargantuan undertaking. Creating a stable, high-performance environment that can grow to support thousands of simultaneous players interacting in virtual real-time, with anything approaching a reasonable level of development effort, is no mean feat. In fact, more than a few experienced, professional software design firms have failed miserably in the attempt.
So the challenge is great. Great, but not insurmountable. I've spent a great deal of time over the last several years pondering this problem, forming and discarding more concepts than I care to count. Do I know the absolute, final answer? I'm certain I do not. In fact, I have no doubt that years from now I will still be contemplating this challenge. But I have come up with a few ideas I think are quite good, ideas which offer strong answers to all of the key concerns. And of all these ideas, one stands alone: the 5-tier queue-driven architecture.
Curious? Then read on! Oh, and if you aren't into architectural stuff and you just want to get to the meat of the "why" question - you'll probably just want to skip straight over to the Conclusions section... Other than that, have a good read and I look forward to your comments.
Casey Dement
NOTE - This server concept is partially adapted from and greatly inspired by a message distribution system I designed in considerable collaboration with Luke Kolin, one of the finer programming minds I've had the privilege to work with. None of the concepts which drive this model could ever have been refined to the level of practicality they have achieved without his endless willingness to listen to my wild ideas and then puncture them with more holes than Swiss cheese.


The gateway tier can be thought of as the "public face" of the system. It can consist of one or more physical servers, load balanced through whatever scheme desired, which are the connection points for client sessions. It does not matter which gateway a client is connected to, and the death of a gateway server has no effect beyond disconnecting the set of client sessions which were attached to that box (which can be easily reconnected by the client).
The gateway servers also encapsulate all details related to the actual communication protocol between the client and the server, greatly simplifying the coding and readability of the rest of the server processes. This is accomplished through a translation layer between the physical packets and "system events" (as defined by the queue).
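To make the translation layer concrete, here is a minimal sketch of a gateway decoding a raw packet into a protocol-independent system event. The wire format, the event type codes, and the names `SystemEvent`, `decode_packet`, and `encode_packet` are all assumptions invented for illustration; the article does not specify a protocol.

```python
import struct
from dataclasses import dataclass

# Hypothetical wire format (an assumption, not from the article):
# 1-byte event type, 4-byte session id, 2-byte payload length, then the payload.
# "!" selects network byte order with no padding, so the header is 7 bytes.
HEADER = struct.Struct("!BIH")

# Hypothetical event type codes.
EVENT_NAMES = {1: "move", 2: "chat", 3: "attack"}
EVENT_CODES = {name: code for code, name in EVENT_NAMES.items()}


@dataclass(frozen=True)
class SystemEvent:
    """What the rest of the server sees: no trace of the wire protocol."""
    event_type: str
    session_id: int
    payload: bytes


def decode_packet(packet: bytes) -> SystemEvent:
    """Translate a raw client packet into a system event for the queue."""
    if len(packet) < HEADER.size:
        raise ValueError("packet shorter than header")
    type_code, session_id, length = HEADER.unpack_from(packet)
    payload = packet[HEADER.size:HEADER.size + length]
    if len(payload) != length:
        raise ValueError("truncated payload")
    if type_code not in EVENT_NAMES:
        raise ValueError(f"unknown event type {type_code}")
    return SystemEvent(EVENT_NAMES[type_code], session_id, payload)


def encode_packet(event: SystemEvent) -> bytes:
    """Translate an outbound system event back into a client packet."""
    header = HEADER.pack(EVENT_CODES[event.event_type],
                         event.session_id, len(event.payload))
    return header + event.payload
```

Because only these two functions know the byte layout, a protocol change would touch the gateway alone; the process servers keep working with `SystemEvent` values.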
So what does this buy you?

The queue tier is both the simplest and the most important piece of the entire architecture. It is also, to be completely honest, the key single point of failure for the system. The queue, in a nutshell, is a prioritized collection of all events that currently need to be handled by any process (including the gateway tier) in the system. As processes are able, they poll the queue for the next task to be done. Once a process instance has retrieved a task, that task is withheld from every other process instance so that none of them will duplicate the effort (and result). If, for any reason, the assigned process instance fails to complete the task in a reasonable time frame, that task is made available again for the next available process instance. When a task is completed, it is permanently removed from the queue.
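The claim, timeout, and complete life cycle described above can be sketched as a small in-memory class. This is an illustrative sketch only, not the purpose-built queue the article has in mind: the name `LeaseQueue`, the lease length, and the convention that a lower number means higher priority are all assumptions.

```python
import heapq
import itertools
import threading
import time


class LeaseQueue:
    """Prioritized event queue; a claimed task comes back if not completed in time."""

    def __init__(self, lease_seconds=5.0, clock=time.monotonic):
        self._lock = threading.Lock()   # one lock keeps all callers consistent
        self._heap = []                 # (priority, task_id) for unclaimed tasks
        self._tasks = {}                # task_id -> event, for every unfinished task
        self._leases = {}               # task_id -> (deadline, priority) for claimed tasks
        self._ids = itertools.count()
        self._lease_seconds = lease_seconds
        self._clock = clock

    def put(self, event, priority=0):
        """Add an event; lower priority numbers are served first (assumed convention)."""
        with self._lock:
            task_id = next(self._ids)
            self._tasks[task_id] = event
            heapq.heappush(self._heap, (priority, task_id))
            return task_id

    def poll(self):
        """Claim the next task as (task_id, event), or return None if nothing is available."""
        with self._lock:
            self._requeue_expired()
            if not self._heap:
                return None
            priority, task_id = heapq.heappop(self._heap)
            self._leases[task_id] = (self._clock() + self._lease_seconds, priority)
            return task_id, self._tasks[task_id]

    def complete(self, task_id):
        """Permanently remove a task. Returns False if it is not currently claimed."""
        with self._lock:
            if task_id not in self._leases:
                return False
            del self._leases[task_id]
            del self._tasks[task_id]
            return True

    def _requeue_expired(self):
        # Return every task whose lease has run out to the pool of unclaimed tasks.
        now = self._clock()
        for task_id, (deadline, priority) in list(self._leases.items()):
            if deadline <= now:
                del self._leases[task_id]
                heapq.heappush(self._heap, (priority, task_id))
```

One simplification to note: a slow worker whose lease expired could still call `complete` after another worker has reclaimed the same task. A sturdier version would probably hand out a per-claim token and check it on completion.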
The most critical (and of course difficult) aspect of the queue functionality is maintaining synchronization between the vast number of process instances that are accessing the queue at any given time. As long as this consistency is maintained all is well with the server. So of course, it is VITAL that this code be as simple and stable as possible.
Also, it is important to note that there need not be just one queue. Since there are many types of events, many queues can be implemented on as granular an eventtype-by-eventtype basis as performance and hardware limitations demand. Just remember that unlike the other tiers, scaling in this way improves performance but does not improve redundancy. Each queue remains a single point of failure.
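Splitting the queue by event type can be as simple as a lookup from type to queue. The sketch below uses the standard library's `queue.PriorityQueue`; the `QueueRouter` name and its methods are hypothetical, and in a real deployment each queue would presumably live on its own hardware rather than in one process.

```python
import itertools
import queue
from collections import defaultdict


class QueueRouter:
    """Keeps one priority queue per event type (a sketch, not hardened for
    concurrent creation of new queues)."""

    def __init__(self):
        self._queues = defaultdict(queue.PriorityQueue)
        self._seq = itertools.count()   # tie-breaker so payloads are never compared

    def publish(self, event_type, payload, priority=0):
        """Place an event on the queue dedicated to its type (lower number first)."""
        self._queues[event_type].put((priority, next(self._seq), payload))

    def next_event(self, event_type):
        """Take the next event of one type, or None if that queue is empty."""
        try:
            _priority, _seq, payload = self._queues[event_type].get_nowait()
        except queue.Empty:
            return None
        return payload
```

A burst of chat traffic now sits in its own queue and cannot delay combat events, but losing the machine that holds the combat queue still stops combat, which is the redundancy caveat above.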
There are, thankfully, a great many very stable and effective ways to accomplish the needs of the queue. In past applications, I have even implemented the queue ENTIRELY at the database level, using connection-instance row-level locking to maintain synchronization and automatic failure recovery. Such a simple approach could theoretically be appropriate for this application as well, but for several reasons I suspect a purpose-built solution is in order.
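For a feel of what a database-level queue looks like, here is a rough sketch using Python's built-in `sqlite3` module. It is not the implementation described above: SQLite has no row-level locks, so this version swaps in a different technique, a `claimed_until` lease column updated inside a `BEGIN IMMEDIATE` transaction. The table layout and function names are assumptions made for the example.

```python
import sqlite3
import time

SCHEMA = """
CREATE TABLE IF NOT EXISTS events (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    priority INTEGER NOT NULL,
    payload TEXT NOT NULL,
    claimed_until REAL NOT NULL DEFAULT 0
)
"""


def open_queue(path=":memory:"):
    # isolation_level=None means autocommit; transactions are started explicitly.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute(SCHEMA)
    return conn


def enqueue(conn, payload, priority=0):
    cur = conn.execute(
        "INSERT INTO events (priority, payload) VALUES (?, ?)", (priority, payload))
    return cur.lastrowid


def claim(conn, lease_seconds=5.0, now=None):
    """Claim the best unclaimed (or expired) event as (id, payload), or None."""
    now = time.time() if now is None else now
    # BEGIN IMMEDIATE takes the write lock up front, so two claimers
    # cannot select the same row before either has marked it.
    conn.execute("BEGIN IMMEDIATE")
    try:
        row = conn.execute(
            "SELECT id, payload FROM events WHERE claimed_until <= ? "
            "ORDER BY priority, id LIMIT 1", (now,)).fetchone()
        if row is not None:
            conn.execute("UPDATE events SET claimed_until = ? WHERE id = ?",
                         (now + lease_seconds, row[0]))
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return row


def complete(conn, event_id):
    """Permanently remove a finished event."""
    conn.execute("DELETE FROM events WHERE id = ?", (event_id,))
```

Failure recovery here comes from the lease expiring rather than from a lock vanishing with a dead connection. On a database server with true row locks, a locking read such as PostgreSQL's `SELECT ... FOR UPDATE SKIP LOCKED` would likely be closer in spirit to the connection-bound locking mentioned above.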
And of course, before we move on let's see how this aligns with our key considerations:

The process tier is a collection of separate process servers that consume and produce the various types of events that occur in the simulation (how's that for a mouthful of not much help?). At a logical level each process can be thought of as an individual server which performs some task, and any task that must be performed is technically a separate process (movement tracking, chat, combat, etc). On a physical level, however, each process is actually a set of one or more independent process servers competing to execute the next task of its type in the queue. Multiple processes (performance allowing) can also be combined on process servers as appropriate.
Fundamental to the process tier is the rule that all process servers must execute independently, with no regard for or knowledge of their peers. In this way, individual process servers can be added or removed from the cluster in real time with no interruption to the operation of the simulation. When an existing server goes away, any tasks it was executing are automatically returned to the event queue for consumption by another server. When a new server comes online, it begins consuming events. As long as at least one process server for each event type continues to function, the system need not go offline (though if sufficient infrastructure is lost obviously performance will be degraded).
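The independence rule can be illustrated with a few threads standing in for process servers. This is a toy sketch built on assumptions: real process servers would be separate machines talking to a shared queue tier, whereas here a standard-library `queue.Queue` plays the queue, and a failing server hands its task back explicitly instead of relying on a timeout. The names `run_process_server` and `demo` are hypothetical.

```python
import queue
import threading


def run_process_server(events, handler, stop, results):
    """One independent process server: claim an event, handle it, repeat.

    It knows nothing about its peers; the queue is its only point of contact.
    """
    while not stop.is_set():
        try:
            event = events.get(timeout=0.05)
        except queue.Empty:
            continue
        try:
            results.append(handler(event))
        except Exception:
            events.put(event)    # hand the task back for another server
            return               # this server goes offline
        finally:
            events.task_done()   # settle the accounting for this get()


def demo():
    events = queue.Queue()
    for n in range(20):
        events.put(n)
    stop = threading.Event()
    results = []                 # list.append is thread-safe in CPython

    def healthy(event):
        return event * 2

    def faulty(event):
        raise RuntimeError("simulated server crash")

    # Two healthy servers and one that dies on its first task.
    servers = [
        threading.Thread(target=run_process_server,
                         args=(events, handler, stop, results))
        for handler in (healthy, healthy, faulty)
    ]
    for server in servers:
        server.start()
    events.join()                # returns once every event has been handled
    stop.set()
    for server in servers:
        server.join()
    return sorted(results)
```

Calling `demo()` processes all twenty events even though one server crashes, because the task it dropped is picked up by a surviving peer.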
And here is where the real payoff of this architecture becomes apparent:
In all the details here I know it's easy to miss the big picture. So let me spell it out clearly - here's what we get from doing the server this way:
In a nutshell, I'm trading being able to do one thing really fast for being able to do lots of things fast enough - and doing so in a way that fundamentally enables almost everything to be done in a fully parallel redundant fashion. I don't know how to lay it out any more plainly than that.
So that's it - my rant is complete. Think I'm nuts? Tell me! Got a better idea? Let's hear it!