Client transfers between servers

Accelerating Geo-distributed Learning with Client Transfers

Nomad is the first dynamic client transfer framework for multi-server FL, reallocating clients based on network conditions and data alignment to reduce latency and improve learning. Unlike static assignments, Nomad enables flexible migration during training. Experiments show accuracy improvements of up to 31.8 points in join-only settings and 18.8 points under churn, consistently surpassing strong baselines and scaling well across geographic deployments.
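The reallocation idea can be sketched as a one-shot cost minimization per client; the function name, the per-pair cost tables, and the trade-off weight `lam` are illustrative assumptions, not the paper's actual objective or migration protocol:

```python
def reassign_clients(latency, divergence, lam=1.0):
    """Assign each client to the server minimizing a combined cost of
    network latency and data misalignment.

    latency[c][s] and divergence[c][s] are illustrative per-pair costs;
    lam is a hypothetical trade-off knob. Nomad's real migration logic
    is dynamic and richer than this one-shot assignment.
    """
    return {
        c: min(latency[c], key=lambda s: latency[c][s] + lam * divergence[c][s])
        for c in latency
    }
```

For example, a client with low latency to a nearby server but badly misaligned data may still be moved to a farther server once the divergence term dominates.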

Crash-recovery during decentralized training of an LLM

Go With The Flow: Churn-Tolerant Decentralized Training of Large Language Models

GWTF is the first practical, crash-tolerant decentralized framework for collaboratively training LLMs on heterogeneous volunteer clients. It handles node churn and unstable networks through a novel decentralized flow algorithm that optimizes microbatch routing. Evaluations on GPT- and LLaMa-like models show that GWTF reduces training time by up to 45% in challenging, geographically distributed settings.
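The routing intuition can be sketched as capacity-proportional load balancing over the nodes that are currently alive; the capacities and the greedy water-filling rule are assumptions for illustration, since the paper formulates this as a decentralized flow problem:

```python
import heapq

def route_microbatches(n_batches, capacity):
    """Toy sketch of churn-tolerant microbatch routing: spread batches
    across alive nodes in proportion to capacity, so a crashed node's
    work is naturally absorbed by the rest.

    capacity: {node: microbatches-per-step it can process} -- purely
    illustrative; GWTF's decentralized flow algorithm is more general.
    Returns {node: number of batches assigned}.
    """
    # Min-heap on relative load: always give the next batch to the
    # least-loaded node relative to its capacity.
    heap = [(0.0, node) for node in capacity]
    heapq.heapify(heap)
    load = {node: 0 for node in capacity}
    for _ in range(n_batches):
        _, node = heapq.heappop(heap)
        load[node] += 1
        heapq.heappush(heap, (load[node] / capacity[node], node))
    return load
```

Handling churn then amounts to rebuilding the heap without the crashed node and re-running the assignment for the remaining batches.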

Asynchronous Byzantine Federated Learning

We propose an asynchronous, Byzantine-resilient FL algorithm that avoids straggler delays and requires no server dataset. By updating after a safe number of client contributions, it outperforms state-of-the-art methods, achieving faster training and higher accuracy under multiple attack types.
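The buffering idea can be sketched in a few lines; the threshold `k_safe` and the coordinate-wise median (one standard Byzantine-robust rule) are illustrative stand-ins for the paper's exact aggregation rule:

```python
from statistics import median

def buffered_async_aggregate(global_model, buffer, k_safe=3):
    """Apply a global update once a 'safe' number of client
    contributions has arrived, rather than waiting for every client.

    global_model: list of floats (flattened parameters)
    buffer: list of client update vectors received so far
    k_safe: hypothetical safety threshold; the paper's criterion for
            what counts as 'safe' may differ.
    """
    if len(buffer) < k_safe:
        return global_model, buffer  # not safe yet: keep buffering
    # Coordinate-wise median masks outlier (Byzantine) contributions.
    robust = [median(col) for col in zip(*buffer)]
    new_model = [w + u for w, u in zip(global_model, robust)]
    return new_model, []
```

Because the server acts as soon as the buffer is full, a straggler or a withheld Byzantine update cannot stall the round.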

Topology Update

Dynamic Topology Optimization for Non-IID Data in Decentralized Learning

Morph is a decentralized learning topology optimizer that adapts peer selection based on model dissimilarity to overcome non-IID data and static communication limits. By reshaping the graph through gossip-based discovery, it boosts robustness and performance. Experiments on CIFAR-10 and FEMNIST show Morph outperforming static and epidemic baselines, achieving higher accuracy, faster convergence, and more stable learning with fewer communication rounds.
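The peer-selection step can be sketched as follows; the Euclidean distance between flattened parameters is an illustrative dissimilarity metric, not necessarily the one Morph uses:

```python
def select_peers(my_model, candidate_models, k=2):
    """Pick the k candidates whose models differ most from ours.

    Gossiping with dissimilar peers mixes non-IID knowledge faster
    than talking to near-identical neighbors. The metric here
    (Euclidean distance over flattened parameters) and the fixed k
    are assumptions for illustration.
    """
    def dist(model):
        return sum((a - b) ** 2 for a, b in zip(my_model, model)) ** 0.5
    return sorted(candidate_models,
                  key=lambda p: dist(candidate_models[p]),
                  reverse=True)[:k]
```

Repeating this selection each round, over candidates discovered by gossip, is what lets the topology reshape itself as local models drift apart.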

Example of a network in which not forwarding signatures after delivering a message based on dissemination paths would prevent some nodes from authenticating it.

Reliable Communication in Hybrid Authentication and Trust Models

This work extends two classical reliable communication protocols to combine authenticated links and processes, introducing DualRC. It leverages trusted nodes (e.g., gateways) and components (e.g., Intel SGX) to improve communication reliability, and provides methods to validate network implementations.

OPODIS · January 2025 · Rowdy Chotkan,  Bart Cox,  Vincent Rahli,  Jérémie Decouchant
Flat Multi-Server

Asynchronous Multi-Server Federated Learning for Geo-Distributed Clients

Spyker is the first fully asynchronous multi-server FL system, eliminating server idle time and single-server bottlenecks. Clients communicate only with their nearest server, while servers also update each other asynchronously. This continuously active design improves scalability and performance across MNIST, CIFAR-10, and WikiText-2.
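The two-tier asynchrony can be sketched with a minimal server object; the learning rate and convex mixing weight are illustrative knobs, not Spyker's actual protocol parameters:

```python
class Server:
    """Minimal sketch of one Spyker-style server: it applies client
    updates the moment they arrive and merges peer servers' models
    asynchronously, so it is never idle waiting on a round."""

    def __init__(self, model):
        self.model = list(model)

    def on_client_update(self, update, lr=1.0):
        # Apply immediately; no synchronization barrier across clients.
        self.model = [w + lr * u for w, u in zip(self.model, update)]

    def on_peer_model(self, peer_model, alpha=0.5):
        # Asynchronous server-to-server merge as a simple convex mix.
        self.model = [(1 - alpha) * w + alpha * p
                      for w, p in zip(self.model, peer_model)]
```

Since both handlers are independent, a slow client only delays its own contribution, and a slow peer server only delays one merge.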

Diffusion Process

Training Diffusion Models with Federated Learning

We introduce a federated diffusion framework that allows independent, privacy-preserving training of DDPMs without exposing local data. By adapting FedAvg and leveraging the UNet backbone efficiently, our method cuts parameter exchange by up to 74% compared to naive FedAvg, while preserving image quality close to centralized training, as measured by FID.
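The savings come from averaging only a subset of the parameters; which UNet layers to share (and hence the reported 74% reduction) is specific to the paper, so in this sketch the shared subset is simply an argument:

```python
def partial_fedavg(client_states, shared_keys):
    """FedAvg restricted to a chosen subset of parameter tensors.

    client_states: list of {layer_name: list-of-floats} dicts.
    shared_keys: the only layers that are exchanged and averaged;
    everything else never leaves the client, which is both the
    privacy and the bandwidth win.
    """
    n = len(client_states)
    avg = {k: [sum(vals) / n for vals in zip(*(s[k] for s in client_states))]
           for k in shared_keys}
    for state in client_states:
        for k in shared_keys:
            state[k] = list(avg[k])  # broadcast the average back
    return client_states
```

The traffic saved is then the total size of the non-shared layers times two (upload and download), per round.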

Reduced catastrophic forgetting effect

Parameterizing federated continual learning for reproducible research

We present Freddie, the first fully configurable framework for Federated Continual Learning (FCL), designed to reproduce complex, evolving learning scenarios. It supports large-scale deployments via containerization and Kubernetes, enabling precise experimentation. Demonstrations on CIFAR-100 and heterogeneous task sequences show Freddie's effectiveness and uncover persistent performance challenges in real FCL settings.

Model training phases during a local training

Aergia: leveraging heterogeneity in federated learning systems

To speed up Federated Learning, Aergia offloads training tasks from overloaded clients to other clients. A model-similarity metric and a resource-aware scheduler decide where each task runs, accelerating the overall training process.

Layer size distribution

Memory-aware and context-aware multi-DNN inference on the edge

Masa is a memory-aware multi-DNN scheduling framework for edge devices that ensures low response times without modifying models. It leverages inter/intra-network dependencies and context to cut latency by up to 90% on low-memory devices.

Pervasive and Mobile Computing · July 2022 · Bart Cox,  Robert Birke,  Lydia Y Chen
Relaxed loading policy ordering

MemA: Fast Inference of Multiple Deep Models

The paper introduces EdgeCaffe, a framework for exploring scheduling policies in multi-inference DNN jobs on resource-constrained edge devices. It proposes MemA, a memory-aware policy that improves execution time by up to 5x without additional resources, based on layer-specific memory demands.
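A memory-aware policy of this flavor can be sketched as a greedy interleaver over per-layer memory demands; the one-layer-resident model and the largest-fits-first rule are simplifying assumptions, as MemA's actual policy in EdgeCaffe is more elaborate:

```python
def schedule_layers(models, mem_budget):
    """Greedy sketch of a memory-aware loading policy: run one layer
    at a time, picking the largest pending layer that fits in the
    budget, and evict it as soon as it finishes.

    models: {model_name: ordered list of per-layer memory demands}
            (illustrative units).
    Returns the interleaved execution order as (model, layer_index).
    """
    order = []
    cursors = {m: 0 for m in models}
    while any(cursors[m] < len(layers) for m, layers in models.items()):
        pending = [(layers[cursors[m]], m) for m, layers in models.items()
                   if cursors[m] < len(layers)]
        # Prefer the biggest layer that fits, so large allocations
        # happen while the budget is free.
        for demand, m in sorted(pending, reverse=True):
            if demand <= mem_budget:
                order.append((m, cursors[m]))
                cursors[m] += 1
                break
        else:
            raise RuntimeError("a layer exceeds the memory budget")
    return order
```

The interleaving is what buys the speedup: small layers of one model run while another model's large layers would not fit, instead of each job waiting for the other to finish entirely.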

IEEE PerCom · May 2021 · Jeroen Galjaard,  Bart Cox,  Amirmasoud Ghiassi,  Lydia Y. Chen,  Robert Birke
Architecture of Masa

Masa: Responsive multi-dnn inference on the edge

Masa is a responsive, memory-aware multi-DNN execution framework: an on-device middleware that models inter- and intra-network dependencies and leverages the complementary memory usage of each layer.

IEEE PerCom · April 2021 · Bart Cox