A colleague once asked me, “Why would anyone build a custom async runtime when Tokio exists?” Fair question. Tokio is battle-tested, well-maintained, and fast. For 95% of use cases, it’s the right answer. But I’ve now been in three situations where it wasn’t — and each one taught me something about what a runtime actually does.
The first was a latency-sensitive trading system where work-stealing’s cache invalidation was unacceptable. The second was an embedded system with no allocator. The third was a specialized database engine where we needed precise control over I/O scheduling. Each time, understanding how to build a runtime from the ground up saved the project.
Let’s design one. Not a toy — a real runtime with I/O, timers, and task scheduling.
When to Build Your Own
Before writing a single line of code, let me be blunt about when this makes sense:
Thread-per-core architecture. You want tasks pinned to specific cores with no migration. Tokio’s work-stealing scheduler moves tasks between threads, which invalidates caches. For latency-sensitive workloads (trading, game servers, real-time audio), this is a problem.
Custom I/O backends. You need io_uring exclusively, or you’re targeting an embedded platform with a non-standard I/O mechanism. Tokio’s I/O driver is built on mio, which means epoll/kqueue.
Minimal footprint. You’re on a microcontroller or WebAssembly target where Tokio’s dependencies are too heavy.
Specialized scheduling. You need priority queues, deadline scheduling, or fair-share scheduling that Tokio doesn’t offer.
If none of these apply, use Tokio. Seriously.
The Architecture
Our runtime will have four components:
┌────────────────────────────────────────────────────────┐
│                        Runtime                         │
│                                                        │
│  ┌───────────┐  ┌────────────┐  ┌──────────┐  ┌──────┐ │
│  │ Executor  │  │    I/O     │  │  Timer   │  │Spawn │ │
│  │ (task     │  │  Reactor   │  │  Wheel   │  │ API  │ │
│  │ polling)  │  │  (epoll)   │  │          │  │      │ │
│  └───────────┘  └────────────┘  └──────────┘  └──────┘ │
└────────────────────────────────────────────────────────┘
Let’s build each piece.
Step 1: Task Representation
A task is a heap-allocated future with metadata for scheduling:
use std::future::Future;
use std::pin::Pin;
use std::sync::{Arc, Mutex};
use std::task::{Context, Poll, Wake, Waker};
struct Task {
/// The future being driven
future: Mutex<Pin<Box<dyn Future<Output = ()> + Send>>>,
/// Handle to the runtime for re-scheduling (RuntimeInner here stands
/// in for the runtime's shared state: ready queue, reactor, timers)
runtime: Arc<RuntimeInner>,
/// Task ID for debugging
id: u64,
}
impl Wake for Task {
fn wake(self: Arc<Self>) {
// When woken, push ourselves back onto the ready queue
self.runtime.schedule(self.clone());
}
fn wake_by_ref(self: &Arc<Self>) {
self.runtime.schedule(self.clone());
}
}
One design decision here: we’re using Mutex<Pin<Box<dyn Future>>>. The Mutex is needed because wakers can fire from any thread, while we only poll the future from the executor thread. In a thread-per-core design you could use RefCell instead; the task then can’t be Send, so it can never move to another thread, but in exchange the futures it runs don’t need to be Send either. That’s the tradeoff.
For a thread-per-core runtime:
use std::cell::RefCell;
use std::rc::Rc;
struct LocalTask {
future: RefCell<Pin<Box<dyn Future<Output = ()>>>>,
// No Send bound needed — task never leaves this thread.
// Scheduling state lives in the waker, so the task itself only
// carries the future and an ID.
id: u64,
}
Step 2: The Task Queue
For a thread-per-core runtime, we use a simple VecDeque — no synchronization needed:
use std::collections::VecDeque;
use std::cell::RefCell;
use std::rc::Rc;
struct LocalQueue {
ready: RefCell<VecDeque<Rc<LocalTask>>>,
}
impl LocalQueue {
fn new() -> Self {
LocalQueue {
ready: RefCell::new(VecDeque::new()),
}
}
fn push(&self, task: Rc<LocalTask>) {
self.ready.borrow_mut().push_back(task);
}
fn pop(&self) -> Option<Rc<LocalTask>> {
self.ready.borrow_mut().pop_front()
}
fn is_empty(&self) -> bool {
self.ready.borrow().is_empty()
}
}
For a multi-threaded runtime with work-stealing, you’d use crossbeam-deque as we discussed in lesson 6.
Step 3: The I/O Reactor
The reactor translates OS I/O events into waker notifications. Here’s a production-grade version using mio:
use mio::{Events, Interest, Poll, Token, Registry};
use std::collections::HashMap;
use std::io;
use std::sync::{Arc, Mutex};
use std::task::Waker;
struct IoSource {
read_waker: Option<Waker>,
write_waker: Option<Waker>,
is_ready_read: bool,
is_ready_write: bool,
}
struct Reactor {
poll: Poll,
sources: HashMap<Token, IoSource>,
next_token: usize,
}
impl Reactor {
fn new() -> io::Result<Self> {
Ok(Reactor {
poll: Poll::new()?,
sources: HashMap::new(),
next_token: 0,
})
}
/// Register a new I/O source and return its token
fn register(
&mut self,
source: &mut impl mio::event::Source,
interest: Interest,
) -> io::Result<Token> {
let token = Token(self.next_token);
self.next_token += 1;
self.poll
.registry()
.register(source, token, interest)?;
self.sources.insert(
token,
IoSource {
read_waker: None,
write_waker: None,
is_ready_read: false,
is_ready_write: false,
},
);
Ok(token)
}
/// Register a waker for read readiness
fn poll_readable(&mut self, token: Token, waker: &Waker) -> Poll<()> {
let source = self.sources.get_mut(&token).unwrap();
if source.is_ready_read {
source.is_ready_read = false;
Poll::Ready(())
} else {
source.read_waker = Some(waker.clone());
Poll::Pending
}
}
/// Register a waker for write readiness
fn poll_writable(&mut self, token: Token, waker: &Waker) -> Poll<()> {
let source = self.sources.get_mut(&token).unwrap();
if source.is_ready_write {
source.is_ready_write = false;
Poll::Ready(())
} else {
source.write_waker = Some(waker.clone());
Poll::Pending
}
}
/// Process I/O events — called from the main loop
fn turn(&mut self, timeout: Option<std::time::Duration>) -> io::Result<()> {
let mut events = Events::with_capacity(256);
self.poll.poll(&mut events, timeout)?;
for event in events.iter() {
if let Some(source) = self.sources.get_mut(&event.token()) {
if event.is_readable() {
source.is_ready_read = true;
if let Some(waker) = source.read_waker.take() {
waker.wake();
}
}
if event.is_writable() {
source.is_ready_write = true;
if let Some(waker) = source.write_waker.take() {
waker.wake();
}
}
}
}
Ok(())
}
}
Step 4: The Timer Wheel
For our custom runtime, a simple binary heap of deadlines works well enough for moderate numbers of timers (we’ll keep the TimerWheel name anyway). For millions of timers, you’d want a hierarchical wheel like Tokio’s.
use std::collections::BinaryHeap;
use std::cmp::Reverse;
use std::task::Waker;
use std::time::Instant;
struct TimerEntry {
deadline: Instant,
waker: Waker,
}
impl PartialEq for TimerEntry {
fn eq(&self, other: &Self) -> bool {
self.deadline == other.deadline
}
}
impl Eq for TimerEntry {}
impl PartialOrd for TimerEntry {
fn partial_cmp(&self, other: &Self) -> Option<std::cmp::Ordering> {
Some(self.cmp(other))
}
}
impl Ord for TimerEntry {
fn cmp(&self, other: &Self) -> std::cmp::Ordering {
// Reverse so the earliest deadline is at the top of the heap
other.deadline.cmp(&self.deadline)
}
}
struct TimerWheel {
timers: BinaryHeap<TimerEntry>,
}
impl TimerWheel {
fn new() -> Self {
TimerWheel {
timers: BinaryHeap::new(),
}
}
fn add(&mut self, deadline: Instant, waker: Waker) {
self.timers.push(TimerEntry { deadline, waker });
}
/// Fire expired timers, return duration until next expiration
fn process(&mut self) -> Option<std::time::Duration> {
let now = Instant::now();
while let Some(entry) = self.timers.peek() {
if entry.deadline <= now {
let entry = self.timers.pop().unwrap();
entry.waker.wake();
} else {
return Some(entry.deadline - now);
}
}
None
}
}
Step 5: Tying It All Together
Now the main runtime struct that orchestrates everything:
use std::cell::RefCell;
use std::rc::Rc;
use std::time::Duration;
struct Runtime {
queue: Rc<LocalQueue>,
reactor: RefCell<Reactor>,
timers: RefCell<TimerWheel>,
task_count: RefCell<usize>,
next_task_id: RefCell<u64>,
}
impl Runtime {
fn new() -> io::Result<Rc<Self>> {
Ok(Rc::new(Runtime {
queue: Rc::new(LocalQueue::new()),
reactor: RefCell::new(Reactor::new()?),
timers: RefCell::new(TimerWheel::new()),
task_count: RefCell::new(0),
next_task_id: RefCell::new(0),
}))
}
fn spawn(&self, future: impl Future<Output = ()> + 'static) {
let mut id = self.next_task_id.borrow_mut();
let task_id = *id;
*id += 1;
*self.task_count.borrow_mut() += 1;
// For a thread-per-core runtime, tasks don't need Send
// This is one of the big advantages
let task = Rc::new(LocalTask {
future: RefCell::new(Box::pin(future)),
id: task_id,
});
self.queue.push(task);
}
fn run(&self) {
loop {
// 1. Poll all ready tasks
let mut made_progress = true;
while made_progress {
made_progress = false;
while let Some(task) = self.queue.pop() {
made_progress = true;
// Create a waker that re-enqueues this task
let waker = task_waker(task.clone(), self.queue.clone());
let mut cx = Context::from_waker(&waker);
let mut future = task.future.borrow_mut();
match future.as_mut().poll(&mut cx) {
Poll::Ready(()) => {
*self.task_count.borrow_mut() -= 1;
}
Poll::Pending => {
// Will be re-queued by waker
}
}
}
}
// 2. All tasks complete?
if *self.task_count.borrow() == 0 {
break;
}
// 3. Process timers
let next_timer = self.timers.borrow_mut().process();
// 4. Wait for I/O events, using the next timer as the timeout.
// If a waker already re-queued a task, don't block at all.
let timeout = if self.queue.is_empty() {
next_timer.unwrap_or(Duration::from_millis(100))
} else {
Duration::ZERO
};
self.reactor.borrow_mut().turn(Some(timeout)).unwrap();
}
}
}
// Helper to create a waker for a local task
fn task_waker(task: Rc<LocalTask>, queue: Rc<LocalQueue>) -> Waker {
// The std `Wake` trait won't work here: it requires `Arc` plus a
// `Send + Sync` payload, which an `Rc`-based task can't provide.
// Instead we build the waker by hand with `RawWaker`.
create_local_waker(task, queue)
}
Let me show the proper RawWaker implementation for non-Send tasks — this is the key enabler for thread-per-core runtimes:
use std::task::{RawWaker, RawWakerVTable, Waker};
fn create_local_waker(
task: Rc<LocalTask>,
queue: Rc<LocalQueue>,
) -> Waker {
struct WakerData {
task: Rc<LocalTask>,
queue: Rc<LocalQueue>,
}
let data = Box::new(WakerData {
task,
queue,
});
let ptr = Box::into_raw(data) as *const ();
static VTABLE: RawWakerVTable = RawWakerVTable::new(
// clone
|ptr| {
let data = unsafe { &*(ptr as *const WakerData) };
let cloned = Box::new(WakerData {
task: data.task.clone(),
queue: data.queue.clone(),
});
RawWaker::new(Box::into_raw(cloned) as *const (), &VTABLE)
},
// wake (takes ownership)
|ptr| {
let data = unsafe { Box::from_raw(ptr as *mut WakerData) };
data.queue.push(data.task.clone());
},
// wake_by_ref
|ptr| {
let data = unsafe { &*(ptr as *const WakerData) };
data.queue.push(data.task.clone());
},
// drop
|ptr| {
unsafe { drop(Box::from_raw(ptr as *mut WakerData)) };
},
);
// Safety: WakerData is managed correctly through the vtable
unsafe { Waker::from_raw(RawWaker::new(ptr, &VTABLE)) }
}
Adding Async I/O Types
A runtime is useless without async I/O types. Here’s how to build an async TCP listener:
use mio::net::TcpListener as MioTcpListener;
use std::io;
use std::net::SocketAddr;
struct AsyncTcpListener {
inner: MioTcpListener,
token: Token,
}
impl AsyncTcpListener {
fn bind(addr: SocketAddr, runtime: &Runtime) -> io::Result<Self> {
let mut listener = MioTcpListener::bind(addr)?;
let token = runtime
.reactor
.borrow_mut()
.register(&mut listener, Interest::READABLE)?;
Ok(AsyncTcpListener {
inner: listener,
token,
})
}
async fn accept(&self, runtime: &Runtime) -> io::Result<(mio::net::TcpStream, SocketAddr)> {
loop {
// Try non-blocking accept
match self.inner.accept() {
Ok(result) => return Ok(result),
Err(e) if e.kind() == io::ErrorKind::WouldBlock => {
// Not ready — wait for the reactor
WaitReadable {
token: self.token,
runtime,
}
.await;
}
Err(e) => return Err(e),
}
}
}
}
struct WaitReadable<'a> {
token: Token,
runtime: &'a Runtime,
}
impl<'a> Future for WaitReadable<'a> {
type Output = ();
fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
self.runtime
.reactor
.borrow_mut()
.poll_readable(self.token, cx.waker())
}
}
Adding Sleep Support
struct Sleep {
deadline: Instant,
registered: bool,
}
impl Sleep {
fn new(duration: Duration) -> Self {
Sleep {
deadline: Instant::now() + duration,
registered: false,
}
}
}
impl Future for Sleep {
type Output = ();
fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
if Instant::now() >= self.deadline {
return Poll::Ready(());
}
if !self.registered {
// Register with the runtime's timer wheel via a thread-local handle.
// We register only once: the stored waker re-enqueues this exact task,
// which stays valid because tasks never migrate off this thread.
CURRENT_RUNTIME.with(|rt| {
rt.borrow()
.as_ref()
.unwrap()
.timers
.borrow_mut()
.add(self.deadline, cx.waker().clone());
});
self.registered = true;
}
Poll::Pending
}
}
thread_local! {
static CURRENT_RUNTIME: RefCell<Option<Rc<Runtime>>> = RefCell::new(None);
}
fn sleep(duration: Duration) -> Sleep {
Sleep::new(duration)
}
Testing the Runtime
Let’s put it all together with a real test:
fn main() -> io::Result<()> {
let rt = Runtime::new()?;
// Set the thread-local runtime
CURRENT_RUNTIME.with(|current| {
*current.borrow_mut() = Some(rt.clone());
});
rt.spawn(async {
println!("[task 1] starting");
sleep(Duration::from_millis(100)).await;
println!("[task 1] woke up after 100ms");
});
rt.spawn(async {
println!("[task 2] starting");
sleep(Duration::from_millis(50)).await;
println!("[task 2] woke up after 50ms");
sleep(Duration::from_millis(100)).await;
println!("[task 2] woke up after another 100ms");
});
rt.spawn(async {
println!("[task 3] I complete immediately");
});
rt.run();
println!("All tasks completed");
Ok(())
}
Expected output:
[task 1] starting
[task 2] starting
[task 3] I complete immediately
[task 2] woke up after 50ms
[task 1] woke up after 100ms
[task 2] woke up after another 100ms
All tasks completed
Design Tradeoffs
Building your own runtime forces you to make explicit tradeoffs that Tokio has already made for you:
Send vs !Send tasks. Tokio requires Send because tasks can migrate between threads. A thread-per-core runtime can accept !Send tasks, which means you can use Rc, Cell, and thread-local state freely. This is a genuine advantage for performance-sensitive code.
Static vs dynamic dispatch. We used Box<dyn Future> for type erasure. You could use generics instead (struct Task<F: Future>) to avoid the vtable indirection, but then you can’t have a heterogeneous task queue.
Allocation strategy. We allocate each task on the heap with Box::pin. A slab allocator (like Tokio’s) reuses memory for tasks, reducing allocation pressure.
I/O driver choice. We used mio (epoll/kqueue). For Linux-only deployments, io_uring gives you better performance at the cost of portability.
The right choices depend entirely on your use case. That’s why custom runtimes exist — not because Tokio is wrong, but because different problems have different optimal points in the design space. Now that you’ve built one, you understand what those tradeoffs are.