Sunday 02 February 2025
Slow data transfer has long been a thorn in the side of data processing systems. As the amount of data generated and stored continues to explode, efficient processing and analysis have become essential for extracting valuable insights. One major bottleneck in this process is the time it takes to transport large datasets between machines.
Traditionally, data has been moved with protocols such as TCP/IP over Ethernet, but these transports serialize data into a single contiguous buffer before handing it off to the network card, which can be costly and inefficient for columnar data batches. Columnar data formats like Apache Arrow are widely used in distributed data processing systems, but their batches still have to be serialized when they are transferred between hosts this way.
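To make the copying cost concrete, here is a minimal Python sketch that packs two column buffers into one contiguous message, as a serializing transport would, and contrasts that with referring to the buffers in place. It uses only the standard library. The columns and the helpers pack_contiguous and view_in_place are hypothetical stand-ins for a real Arrow record batch and its serializer, not part of any library.

```python
from array import array

def pack_contiguous(buffers):
    """Copy every column buffer into one new contiguous bytes object."""
    return b"".join(bytes(buf) for buf in buffers)

def view_in_place(buffers):
    """Return zero-copy views over the existing column buffers."""
    return [memoryview(buf) for buf in buffers]

# Two hypothetical columns of a batch, each stored in its own buffer.
ids = array("q", range(4))                  # 64-bit integers
prices = array("d", [1.5, 2.5, 3.5, 4.5])   # 64-bit floats
columns = [ids, prices]

packed = pack_contiguous(columns)   # every byte is copied once more
views = view_in_place(columns)      # no payload bytes are copied

# A view aliases the original memory, so a later write shows through it,
# while the packed copy keeps the old value.
ids[0] = 99
```

The packed message is a second full copy of the data, which is the step a zero-copy transport tries to avoid; the views only record where the existing buffers live.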
Enter RDMA (Remote Direct Memory Access), a technology that allows machines to access each other’s memory directly without involving the remote machine's CPU or operating system; the network adapter carries out the transfer itself. By leveraging RDMA, data transfer can be done in a zero-copy manner, eliminating the need for serialization and reducing latency.
Researchers have been exploring the potential of RDMA for accelerating data transfer, but most solutions have been proprietary or limited to research prototypes. Now, a team of scientists has developed an open-source protocol called Thallus that uses RDMA to transport Apache Arrow data over Infiniband.
Thallus follows a client-server model, where queries are executed on the server and results are sent back to the client using RDMA. The protocol uses RPC (Remote Procedure Call) for control operations and RDMA for data operations. By eliminating serialization overhead, Thallus achieves significant performance gains compared to traditional TCP/IP-based solutions.
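The split between control and data operations can be illustrated with a toy, in-process simulation. This is only a sketch under loud assumptions: ToyServer, ToyClient, RegionDescriptor and the fake query engine are all hypothetical, ordinary method calls stand in for both the RPC and the RDMA transfer, and nothing here reproduces Thallus's actual message flow or which side initiates the transfer.

```python
from dataclasses import dataclass

@dataclass
class RegionDescriptor:
    """Control-plane message: describes a buffer but carries no payload."""
    region_id: int
    length: int

class ToyServer:
    def __init__(self):
        # region_id -> bytearray; stands in for memory registered for RDMA.
        self._regions = {}

    def execute(self, query):
        """Control path (stand-in for an RPC): run the query, return descriptors."""
        # Hypothetical query engine: two made-up result columns.
        columns = [bytearray(query.encode() * 2), bytearray(b"\x01\x02\x03")]
        descriptors = []
        for buf in columns:
            region_id = len(self._regions)
            self._regions[region_id] = buf
            descriptors.append(RegionDescriptor(region_id, len(buf)))
        return descriptors

    def read_region(self, desc, dest):
        """Data path (stand-in for an RDMA transfer): raw bytes land in dest."""
        dest[:desc.length] = self._regions[desc.region_id]

class ToyClient:
    def fetch(self, server, query):
        descriptors = server.execute(query)      # small control message
        result = []
        for desc in descriptors:
            dest = bytearray(desc.length)        # pre-allocated landing buffer
            server.read_region(desc, memoryview(dest))  # bulk bytes, no serializer
            result.append(bytes(dest))
        return result

result = ToyClient().fetch(ToyServer(), "ab")
```

The point of the design is that the control messages stay tiny, since they only name buffers and sizes, while the bulk bytes move on a separate path that never passes through a serializer.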
In benchmarking experiments, Thallus was found to be up to 5.5 times faster than a pure RPC-based implementation in data transport duration, and up to 2.5 times faster in end-to-end query execution duration. The relative gain shrinks as the result set gets smaller, because the fixed per-transfer overheads of RDMA then make up a larger share of the total time.
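Why a constant overhead erodes the gain for small results can be seen in a simple cost model, sketched below in Python. The model and every constant in it are illustrative assumptions chosen for readability; they are not measurements from the Thallus benchmarks and are not fitted to the figures above.

```python
def transfer_seconds(size_bytes, fixed_overhead_s, seconds_per_byte):
    """Total time = constant setup cost + cost proportional to the payload."""
    return fixed_overhead_s + size_bytes * seconds_per_byte

# All four constants are illustrative assumptions, not measured values.
WIRE_S_PER_BYTE = 1e-10        # time on the wire per byte, same for both paths
SERIALIZE_S_PER_BYTE = 3e-10   # extra per-byte copying cost on the RPC path only
RPC_FIXED_S = 20e-6            # assumed fixed cost of one RPC round trip
RDMA_FIXED_S = 100e-6          # assumed fixed cost of setting up one RDMA transfer

def speedup(size_bytes):
    """Modelled ratio of RPC transfer time to RDMA transfer time."""
    rpc = transfer_seconds(size_bytes, RPC_FIXED_S,
                           WIRE_S_PER_BYTE + SERIALIZE_S_PER_BYTE)
    rdma = transfer_seconds(size_bytes, RDMA_FIXED_S, WIRE_S_PER_BYTE)
    return rpc / rdma

for size in (10**3, 10**6, 10**9):
    print(f"{size:>12} bytes: modelled speedup {speedup(size):.2f}x")
```

In this model the ratio climbs toward (wire + serialize) / wire as the payload grows, because the fixed costs become negligible, and it falls off for small payloads where the fixed RDMA cost dominates. Only that qualitative trend is meaningful; the specific numbers are not.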
The implications of Thallus are far-reaching, as it has the potential to revolutionize data processing systems. By accelerating data transfer, Thallus can enable faster query execution times, improved scalability, and reduced latency. As the world continues to generate vast amounts of data, innovations like Thallus will play a crucial role in unlocking insights from this information deluge.
Cite this article: “Accelerating Data Transfer with Thallus: An Open-Source Protocol for Efficient Data Processing”, The Science Archive, 2025.
RDMA, Thallus, Apache Arrow, Infiniband, Data Transfer, Remote Direct Memory Access, Open-Source, Protocol, RPC, Serialization







