Paper title
Efficient Direct-Connect Topologies for Collective Communications
Paper authors
Abstract
We consider the problem of distilling efficient network topologies for collective communications. We provide an algorithmic framework for constructing direct-connect topologies optimized for the latency vs. bandwidth trade-off associated with the workload. Our approach synthesizes many different topologies and schedules for a given cluster size and degree and then identifies the appropriate topology and schedule for a given workload. Our algorithms start from small, optimal base topologies and associated communication schedules and use techniques that can be iteratively applied to derive much larger topologies and schedules. Additionally, we incorporate well-studied large-scale graph topologies into our algorithmic framework by producing efficient collective schedules for them using a novel polynomial-time algorithm. Our evaluation uses multiple testbeds and large-scale simulations to demonstrate significant performance benefits from our derived topologies and schedules.
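The selection step mentioned in the abstract (choosing a topology and schedule for a given workload) can be illustrated with a minimal sketch. It assumes a simple alpha-beta cost model, in which each communication step pays a fixed latency and each schedule transmits some multiple of the message size over a link; the abstract does not state the actual cost model. The candidate names, step counts, bandwidth factors, and the constants `ALPHA` and `BANDWIDTH` are hypothetical values chosen only to show the trade-off, not results from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A hypothetical (topology, schedule) pair summarized by two numbers."""
    name: str
    steps: int        # number of communication steps (drives the latency term)
    bw_factor: float  # multiples of message_size / bandwidth sent per link


def cost(c: Candidate, alpha: float, msg_bytes: int, bandwidth: float) -> float:
    """Estimated collective time under an assumed alpha-beta model."""
    return c.steps * alpha + c.bw_factor * msg_bytes / bandwidth


def pick(cands, alpha: float, msg_bytes: int, bandwidth: float) -> Candidate:
    """Return the candidate with the lowest estimated time for this workload."""
    return min(cands, key=lambda c: cost(c, alpha, msg_bytes, bandwidth))


# Illustrative values only (assumptions, not measurements).
ALPHA = 10e-6        # per-step latency in seconds
BANDWIDTH = 12.5e9   # link bandwidth in bytes per second
CANDIDATES = [
    Candidate("low_latency", steps=3, bw_factor=2.0),
    Candidate("bw_efficient", steps=14, bw_factor=0.9),
]

if __name__ == "__main__":
    for size in (1024, 10**9):
        best = pick(CANDIDATES, ALPHA, size, BANDWIDTH)
        print(f"{size} bytes -> {best.name}")
```

Under these assumed numbers, the small message favors the candidate with fewer steps and the large message favors the candidate that sends less data per link, which is the latency vs. bandwidth trade-off the abstract refers to.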
