Clusterless

Last updated on Jun 20, 2023

Clusterless is an alternative infrastructure to the standard ‘big data’ technology stacks like Apache Hadoop, Apache Spark, or Apache Airflow.

The Clusterless architecture is decentralized and serverless, unlike these existing systems.

That is, developers can build and deploy applications without first provisioning any runtimes or servers to host them. More importantly, they don’t have to manage any schedulers, task managers, or data nodes.

A given Clusterless application processes data as it arrives, and when done, it notifies any relevant listeners downstream to begin processing. If no data arrives at any given moment, no CPU time is used.
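
The arrival-driven flow described above can be sketched with a small in-memory model. This is only an illustration of the pattern, not the Clusterless API: EventBus, make_workload, and the dataset names "raw", "clean", and "upper" are all hypothetical, and a real deployment would rely on cloud storage and event services rather than a Python object.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical in-memory stand-in for cloud storage plus an event service.
# None of these names come from Clusterless itself.
Listener = Callable[[str, List[str]], None]


class EventBus:
    """Routes 'data arrived' notifications to subscribed listeners."""

    def __init__(self) -> None:
        self._listeners: Dict[str, List[Listener]] = {}

    def subscribe(self, dataset: str, listener: Listener) -> None:
        self._listeners.setdefault(dataset, []).append(listener)

    def publish(self, dataset: str, records: List[str]) -> None:
        # Work happens only inside this call; if nothing arrives, nothing runs.
        for listener in self._listeners.get(dataset, []):
            listener(dataset, records)


def make_workload(bus: EventBus, output: str,
                  transform: Callable[[str], str]) -> Listener:
    """Build a listener that transforms arriving records, then notifies downstream."""

    def on_arrival(dataset: str, records: List[str]) -> None:
        result = [transform(record) for record in records]
        bus.publish(output, result)  # tell downstream listeners new data is ready

    return on_arrival


bus = EventBus()
received: List[Tuple[str, List[str]]] = []

# Two chained workloads and a final listener that just records what it sees.
bus.subscribe("raw", make_workload(bus, "clean", str.strip))
bus.subscribe("clean", make_workload(bus, "upper", str.upper))
bus.subscribe("upper", lambda dataset, records: received.append((dataset, records)))

bus.publish("raw", ["  a ", "b  "])
print(received)  # [('upper', ['A', 'B'])]
```

Each workload knows only the dataset it listens to and the dataset it writes, so there is no central scheduler in the sketch: publishing new data is what triggers the next step.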

Data engineers have many roles and responsibilities, but managing a cluster and its associated runtimes should not be one of them. More importantly, data engineers should not be sorting through gigabytes of stack traces while debugging complex distributed applications.

The Clusterless architecture allows data engineers to focus on these core activities:

parsing and reformatting data to support end-user tools
re-partitioning data to make different access patterns performant
feature extraction to improve model accuracy
training and validation

Clusterless views each of these activities as a simple workload that it manages. Storage and scheduling are simply cloud services that Clusterless and a given workload use. These services are highly secure, performant, and reliable. More to the point, each workload is isolated from other workloads and from these services. A jar file in the workload can’t break your storage or another workload. Nor can a hung process on a node stop your data pipeline or incur wasted CPU charges.

All the metadata managed by the Clusterless infrastructure is transparent and accessible as simple JSON files. This simplifies the development of integration interfaces or status-reporting tools in any programming language.
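
As a rough illustration of why plain JSON metadata is easy to build on, the sketch below counts metadata files by state using only the Python standard library. The directory layout and the "workload", "interval", and "state" fields are assumptions made up for this example; the actual Clusterless metadata schema is not described here and may differ.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def summarize_states(metadata_dir: Path) -> Counter:
    """Count JSON metadata files by their (hypothetical) 'state' field."""
    counts: Counter = Counter()
    for path in sorted(metadata_dir.glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            record = json.load(handle)
        counts[record.get("state", "unknown")] += 1
    return counts


# Demo against made-up files; in practice the directory would be a local
# copy of, or a mount over, wherever the metadata is stored.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    samples = [
        {"workload": "parse-logs", "interval": "0001", "state": "complete"},
        {"workload": "parse-logs", "interval": "0002", "state": "running"},
        {"workload": "repartition", "interval": "0001", "state": "complete"},
    ]
    for index, sample in enumerate(samples):
        (root / f"{index}.json").write_text(json.dumps(sample), encoding="utf-8")
    summary = summarize_states(root)

print(dict(summary))  # {'complete': 2, 'running': 1}
```

Because the input is just files, the same report could be written in any language with a JSON parser, without a client library or a running service to query.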

Chris K Wensel

Data and Analytics Architect