· Michele Mazzucco  · 10 min read

Virtual queues from scratch, Part 1: concepts and architecture

Kick‑start your journey into building lightning‑fast virtual queues: master the key queuing concepts and set up a lean, low‑latency architecture.

In our previous articles, How to choose the right virtual waiting room and its recently published follow-up, we explored how virtual waiting rooms help businesses handle traffic spikes gracefully, ensuring service stability and fairness for users. After those publications, many of our readers asked an interesting question:

Can we build a basic virtual waiting room ourselves, without relying on commercial vendors?

The answer is yes—and in this article series, we are offering a practical, step-by-step guide to building a simple, effective virtual waiting room using common tools like Docker, Redis and Python. We will call it the “Poor Man’s Guide” to highlight its simplicity and affordability.

We will walk through:

  • Understanding essential concepts (queue management, arrival rates, and service rates).
  • Choosing a core architecture by comparing the Push vs. Pull models.
  • Identifying common implementation pitfalls to avoid.
  • Selecting a foundational technology stack and justifying those choices.

By the end, you will have a working proof-of-concept (POC) of a virtual waiting room, demonstrating key concepts and functionality clearly.

While suitable for educational purposes and experimentation, please note that this implementation is not production-ready. Important considerations such as security, reliability, scalability, and user-friendly web interfaces have been intentionally simplified or omitted. However, this POC provides a solid foundation and clear insights into building a virtual waiting room without vendor lock-in or significant financial investment.

📌 Why Python for this proof-of-concept?

We chose Python because it is the most familiar back-end language for a huge slice of engineers—lowering the barrier for readers who want to tinker. To claw back performance we lean on:

  • msgspec for zero-copy JSON/Binary parsing
  • numba to JIT-compile the tiny bits of math that matter
  • httptools (Python bindings to the battle-tested C HTTP parser used by Node.js, maintained by the uvloop authors) to make request parsing faster
  • uvloop, an ultra-fast event loop for asyncio

These micro-optimizations let us hit very low latencies in a language that prioritizes readability. In a 24/7 production setting we would likely migrate the hot path to Java or Go—or to a compiled language without garbage collection like C++ or Rust—but for a learning-friendly POC, Python’s ubiquity wins.
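
As an illustration of what these micro-optimizations look like in practice, here is a minimal, hedged sketch showing a msgspec struct and uvloop wired together (the Ticket fields are purely illustrative, not the POC's actual data model):

```python
import msgspec
import uvloop


class Ticket(msgspec.Struct):
    """Typed, slotted struct: msgspec decodes straight into it,
    skipping intermediate dicts (fields here are illustrative)."""
    ticket_id: str
    position: int


# Reusable, pre-built decoder/encoder pair for the hot path.
decoder = msgspec.json.Decoder(Ticket)
encoder = msgspec.json.Encoder()


def roundtrip(raw: bytes) -> bytes:
    ticket = decoder.decode(raw)   # bytes -> Ticket
    return encoder.encode(ticket)  # Ticket -> bytes


# Swap asyncio's default event loop for uvloop before the server starts.
uvloop.install()
print(roundtrip(b'{"ticket_id": "abc", "position": 7}'))
```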

Let’s get started!


Part A – The “Why”: conceptual foundations

Queues are not a necessary evil; they are a design decision.

At scale, every popular digital service eventually needs to decide how it will make people wait. This article is Part A of our hands-on series, covering the essential “Why” of virtual queues. We will establish the core concepts and make the foundational architectural decisions before diving into code in the articles that follow.

1. Why virtual queues?

“Just auto‑scale!” is the usual reaction when capacity is short.

In reality, most back‑ends have hard saturation points—database connection limits, third‑party rate caps, or CPU‑bound algorithms.

Virtual queues (also known as virtual lobbies or virtual waiting rooms) create a deliberate buffer between unbounded demand and finite supply, smoothing spikes while keeping the user experience predictable.

Virtual queue
Figure 1 - Simple virtual queue: requests are parked in a buffer, preventing the system from becoming overloaded. At the same time, the virtual queue system keeps users informed, e.g., by showing their position in the queue or by providing waiting-time estimates.

Key benefits include:

| Benefit | Impact |
| --- | --- |
| ☂️ Protect back‑end | Prevents meltdown under traffic bursts |
| 🤝 Fairness | First‑come‑first‑served or priority tiers |
| 👓 UX transparency | Show accurate wait times instead of cryptic errors |
| 📈 Analytics | Precise metrics on demand vs capacity |

2. Essential queuing concepts

Before diving into the code, let’s briefly cover a few key concepts:

  • Queue management: At its core, a virtual waiting room is simply a queue where requests wait until it’s their turn. Efficient queue management ensures fairness (first-come, first-served) and stability under high load.
  • Arrival rates (λ): The average rate at which users enter the system, typically measured in requests per second. Understanding arrival rates helps you scale your system to handle varying traffic levels.
  • Service rates (μ): How quickly the system can handle requests, typically measured as the average number of requests processed per second by each worker or server.
  • Utilization (ρ = λ / (c × μ), where c is the number of parallel workers or servers): The ratio of the arrival rate to the total service capacity. It determines how busy the system is and directly influences waiting times and queue lengths; a queue is only stable when ρ < 1.
  • Squared coefficient of variation (SCV): Measures the burstiness of arrivals (and of service times). The higher the SCV, the longer the queues for a given utilization.
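
To make these definitions concrete, here is a tiny worked example (all traffic figures are invented purely for illustration):

```python
# Hypothetical traffic: 8 workers, each completing 25 requests per second.
arrival_rate = 180.0   # λ: requests arriving per second
workers = 8            # c: number of parallel workers
service_rate = 25.0    # μ: requests completed per second, per worker

utilization = arrival_rate / (workers * service_rate)  # ρ = λ / (c × μ)
print(f"ρ = {utilization:.2f}")  # 0.90: busy but still stable, since ρ < 1
```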

A key distinction for the rest of this series is that, in practice, you will see two kinds of waiting time:

  1. User-facing ETA: For users already in the queue, we estimate how long until it’s their turn—based on their position and how fast we are moving. This keeps users informed and reduces frustration.
  2. Admission control wait: For new arrivals, we estimate how long they would have to wait if we let them in right now—using a simple, robust formula (current queue backlog divided by throughput). We use this estimate to decide whether to admit or temporarily reject new requests, keeping the system healthy.

This separation is crucial: user transparency and system protection require different trade-offs.

💡 Why not fancy queue formulas?

  • You might see more complex queueing formulas in textbooks. Here, we use practical, worst-case approximations that hold up when the system is busy or overloaded.
  • When the queue is long (heavy or overload conditions), the total wait time is dominated by the backlog—it hardly matters if traffic is bursty or smooth in the moment. That’s why, for admission decisions, we use a simple linear estimate based just on queue length and throughput (sketched in code below), and don’t worry about the “burstiness” factor. This keeps the logic robust under stress.
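
Putting the two estimates into code, here is a minimal sketch, assuming we already track the queue backlog and a recent throughput figure (the function names and the 120-second admission threshold are illustrative choices, not fixed parts of the design):

```python
def user_facing_eta(position: int, departures_per_second: float) -> float:
    """ETA shown to a user already in the queue: their current position
    divided by how fast the queue is observed to be moving."""
    if departures_per_second <= 0:
        return float("inf")
    return position / departures_per_second


def admission_wait(backlog: int, throughput: float) -> float:
    """Wait a brand-new arrival would face right now: the current backlog
    divided by the current throughput (the simple linear estimate above)."""
    if throughput <= 0:
        return float("inf")
    return backlog / throughput


# Illustrative admission policy: admit only while the estimated wait is tolerable.
MAX_TOLERATED_WAIT_S = 120.0


def should_admit(backlog: int, throughput: float) -> bool:
    return admission_wait(backlog, throughput) <= MAX_TOLERATED_WAIT_S
```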

3. The core architectural choice: Push vs. Pull model

Before diving into the implementation, it is essential to understand the fundamental architectural choice of this POC: a pull-based dispatch model. This choice dictates how the system adapts to load and is a prerequisite for advanced abandonment handling.

There are two primary models for moving users from a waiting room queue to a backend application:

  1. Push (rate-based): A central dispatcher sends a pre-determined number of users (X per minute) to the backend.
  2. Pull (capacity-based): Backend servers actively “pull” the next user from the queue only when they have free capacity.

This POC implements the Pull model for its superior resilience and efficiency in modern cloud environments.

3.1. The Push model (rate-based)

In a Push model, the lobby service acts as a gate, releasing users at a pre-configured, constant rate. This approach is valued for its predictability and simplicity.

push model
Figure 2 - Push model: the virtual queue acts as a dispatcher, forwarding requests to the backend at a fixed rate.

Pros:

  • Simplicity: The logic is straightforward to implement and understand.
  • Predictable egress: Provides a constant, easily communicated rate (e.g., “500 users will be admitted per minute”).

Cons:

  • Less adaptive to dynamic capacity: This model’s primary challenge is its fixed nature in a dynamic environment.
    • If backend capacity degrades (due to faults or high resource usage), the constant push rate can overload the system, causing errors.
    • If backend capacity increases (due to auto-scaling), the fixed rate can lead to under-utilization of available resources, prolonging the queue unnecessarily.
  • Lacks direct feedback loop: The dispatcher operates independently of the real-time state of the backend workers. This makes it hard to adjust the ingress rate based on factors such as how many active slots are being consumed by abandoned sessions, or by requests that take an unusually long (or short) time to serve.
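
For contrast with the pull model described next, here is a minimal sketch of what a fixed-rate (push) dispatcher might look like on top of a Redis list (the key name, rate, and forward_to_backend hand-off are hypothetical):

```python
import time

import redis

r = redis.Redis(decode_responses=True)

ADMISSIONS_PER_MINUTE = 500   # pre-configured egress rate, fixed regardless of backend health
BATCH_INTERVAL_S = 60


def forward_to_backend(ticket_id: str) -> None:
    # Placeholder: notify/redirect the user that they may proceed to the application.
    ...


def dispatch_loop() -> None:
    """Every minute, release the next batch of tickets at a constant rate,
    with no feedback from the workers actually serving them."""
    while True:
        batch = r.lpop("lobby:queue", ADMISSIONS_PER_MINUTE) or []
        for ticket_id in batch:
            forward_to_backend(ticket_id)
        time.sleep(BATCH_INTERVAL_S)
```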

3.2. The Pull model (capacity-based)

In a Pull model, the paradigm is inverted: the backend servers drive the flow. This is a form of CONWIP (Constant Work-In-Process) system. As soon as a worker finishes a request, it signals its availability and “pulls” the next user from the queue. This adaptive approach is a foundational principle for many large-scale, modern systems.

pull model
Figure 3 - Pull model: backends drive the flow.

Pros:

  • Resilient and adaptive: The system naturally adjusts to the backend’s real-time capacity. If a server fails, it simply stops pulling requests. If new servers are added via auto-scaling, they immediately start pulling, increasing throughput. It is inherently difficult to overload the backend with this model.
  • Maximizes efficiency: Ensures that backend servers are consistently utilized, draining the queue as fast as the current capacity allows.
  • Enables smart abandonment handling: Because the backend worker “owns” the request from start to finish, it can be instrumented with the cancellation logic discussed later. A pull model is the foundation for effectively handling in-flight abandonment.

Cons:

  • Slightly higher complexity: Requires a communication channel for the backend to signal its availability and request new users from the queue.
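
Here is the same flow inverted into a pull loop, again as a hedged sketch over a Redis list (the key name and handle_user placeholder are illustrative; the real implementation arrives in Part 2):

```python
import redis

r = redis.Redis(decode_responses=True)


def handle_user(ticket_id: str) -> None:
    # Placeholder for the actual backend work done for the admitted user.
    ...


def worker_loop() -> None:
    """Each backend worker claims the next ticket only when it is free, so
    the admission rate automatically tracks real capacity (CONWIP-style)."""
    while True:
        # Blocking pop: waits until a ticket is available, then claims it atomically.
        item = r.blpop("lobby:queue", timeout=5)
        if item is None:
            continue                 # queue empty; poll again
        _key, ticket_id = item
        handle_user(ticket_id)       # worker is busy; it pulls again only when done
```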

3.3. Rationale for this POC

For modern applications with dynamic, cloud-native infrastructure, the Pull model offers compelling advantages. Its ability to gracefully handle fluctuations in server capacity and to serve as a foundation for intelligent abandonment handling makes it a highly resilient and efficient architectural choice.


4. Common pitfalls

Designing virtual queues looks trivial—until real traffic shows up. Five failure modes appear in almost every first‑generation implementation:

  1. Tail‑latency blind spots – Engineering to hit a 100‑ms average but ignoring the 2‑second p99 that melts user trust under heavy load (ρ approaching 1).
  2. Queue overflow and back‑pressure collapse – Letting the queue grow unbounded eventually swaps or crashes servers. Always cap the maximum queue size and enforce a TTL on entries (a sketch follows below).
  3. Abandoned tickets – Tabs close, mobiles lose signal; stale entries skew occupancy unless they are actively pruned or auto‑expire.
  4. Uneven capacity signals – GC pauses, JIT warm‑up, and network blips make μ bursty under heavy load; strategies such as token‑bucket or leaky‑bucket smoothing protect the queue.
  5. State duplication – Storing tickets in multiple stores introduces race conditions; keep a single source of truth.

We will tackle these head‑on throughout the series.
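
As a preview of the kind of guardrail pitfalls 2 and 3 call for, here is a minimal sketch of a capped enqueue with auto-expiring ticket state (key names, cap, and TTL are illustrative; note that the check-then-push below is not atomic, a gap we close with a Lua script later):

```python
import time

import redis

r = redis.Redis(decode_responses=True)

MAX_QUEUE_SIZE = 50_000   # hard cap: reject rather than grow unbounded
TICKET_TTL_S = 15 * 60    # stale tickets expire on their own


def try_enqueue(ticket_id: str) -> bool:
    """Admit a ticket into the waiting line only if the cap allows it."""
    # Not atomic: two concurrent calls could both pass the check. The real
    # service wraps this check-and-push in a single Redis Lua script.
    if r.llen("lobby:queue") >= MAX_QUEUE_SIZE:
        return False   # back-pressure: tell the client to retry later
    r.rpush("lobby:queue", ticket_id)
    # Per-ticket state with an expiry, so abandoned entries prune themselves.
    r.set(f"lobby:ticket:{ticket_id}", int(time.time()), ex=TICKET_TTL_S)
    return True
```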


5. Technology choices at a glance

| Layer | Tooling | Rationale |
| --- | --- | --- |
| Fail safe | OpenResty | Fail‑safe rejection scheme at the edge via a Lua script for exceptional scenarios—the digital equivalent of locking the front door when the building is full |
| Lobby API | Custom ultra‑lean ASGI server (Python 3.12, uvloop) | Minimal allocations, higher throughput, reduced latency |
| Serialization | msgspec.Struct | 5–8 % CPU drop vs Pydantic on the same workload, lower latency |
| Queue state | Redis 8 | Atomic operations via Lua scripts plus expiry make pruning trivial |
| Metrics | VictoriaMetrics + Grafana | Standard stack for time series |
| Load testing | Rust‑based generator (tokio runtime) | No GIL, microsecond timers, 100k req/sec on a laptop |
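
To illustrate the “atomic operations via Lua scripts + expiry” point, here is a hedged sketch of how the capped enqueue sketched in the pitfalls section could be made atomic with redis-py’s register_script (key names and arguments are illustrative):

```python
import time

import redis

r = redis.Redis(decode_responses=True)

# Cap check, push, and TTL'd ticket key in one atomic server-side step.
ENQUEUE_LUA = """
if redis.call('LLEN', KEYS[1]) >= tonumber(ARGV[2]) then
  return 0
end
redis.call('RPUSH', KEYS[1], ARGV[1])
redis.call('SET', KEYS[2], ARGV[3], 'EX', tonumber(ARGV[4]))
return 1
"""
enqueue_script = r.register_script(ENQUEUE_LUA)


def try_enqueue_atomic(ticket_id: str, max_size: int = 50_000, ttl_s: int = 900) -> bool:
    result = enqueue_script(
        keys=["lobby:queue", f"lobby:ticket:{ticket_id}"],
        args=[ticket_id, max_size, int(time.time()), ttl_s],
    )
    return result == 1
```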

🔎 Why not FastAPI / Flask / Starlette?

Full‑featured frameworks bundle routers, dependency‑injection, middleware pipelines, and rich exception hierarchies—perfect for large CRUD micro‑services, but every extra layer shows up when your entire latency budget is small. We expose only three hot‑path endpoints, so the extra flexibility becomes pure overhead.

In profiling, we identified the following culprits:

  • Router and path-param parsing.
  • “Empty” middleware chain (CORS, logging, error handlers).
  • Python exceptions: building tracebacks knocks the request onto a slow path.

Our replacement is a 50‑line hand‑rolled router (dict lookup + compiled regex) and typed namedtuple result objects (slightly more efficient than dataclass objects) to signal abnormal conditions—no exceptions, no middleware. As shown below, combining our router with msgspec structs trimmed p95 by approximately 50% at 95 requests/second.

FastAPI vs custom ASGI router latency
Figure 4 - ρ = 0.95: measured mean and p95 of waiting times for FastAPI (left) and custom ASGI server (right). The difference between the two charts is due to the FastAPI overhead.
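
To give a flavour of what a hand-rolled router of this kind can look like, here is a condensed, hedged sketch (routes, handler shapes, and naming are illustrative rather than the POC’s actual code, which is covered in Part B):

```python
import re
from typing import Awaitable, Callable

Handler = Callable[[dict, Callable, Callable], Awaitable[None]]

# Exact-match routes resolved with a plain dict lookup...
STATIC_ROUTES: dict[tuple[str, str], Handler] = {}
# ...plus a short list of pre-compiled regexes for paths with parameters.
DYNAMIC_ROUTES: list[tuple[str, re.Pattern, Handler]] = []


def route(method: str, path: str):
    """Register a handler; paths containing '{param}' get a compiled regex."""
    def wrap(handler: Handler) -> Handler:
        if "{" in path:
            pattern = re.compile("^" + re.sub(r"\{(\w+)\}", r"(?P<\1>[^/]+)", path) + "$")
            DYNAMIC_ROUTES.append((method, pattern, handler))
        else:
            STATIC_ROUTES[(method, path)] = handler
        return handler
    return wrap


async def app(scope, receive, send):
    """Bare ASGI callable: dict lookup first, regex fallback, no middleware,
    and a 404 path that never raises a Python exception."""
    if scope["type"] != "http":
        return
    handler = STATIC_ROUTES.get((scope["method"], scope["path"]))
    if handler is not None:
        await handler(scope, receive, send)
        return
    for method, pattern, handler in DYNAMIC_ROUTES:
        if method == scope["method"] and pattern.match(scope["path"]):
            await handler(scope, receive, send)
            return
    await send({"type": "http.response.start", "status": 404, "headers": []})
    await send({"type": "http.response.body", "body": b"not found"})
```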

Coming up in Part B: Building the Core Service

In the next article in this series, we will move from design to code and begin the implementation. We will focus on bootstrapping the lean Lobby API service from scratch, covering how to:

  • Set up the custom, high-performance ASGI server.
  • Define the core API endpoints (/join, /position).
  • Establish the initial connection with Redis to manage basic state.

Once this foundation is in place, subsequent articles will build upon it to add the liveness system, backend protection, and performance calibration.

📬 Get weekly queue insights

Not ready to talk? Stay in the loop with our weekly newsletter — The Queue Report, covering intelligent queue management, digital flow, and operational optimization.

Not sure where to start?

This guide provides a solid foundation for your project. When you are ready to tackle the unique challenges of a large-scale production environment, our team can provide expert architectural guidance.
