As a data engineer, I have written one application over and over: a tool that parses complex text and stores it in an optimized format.
These read and write functions sit at opposite ends of a single data processing application, with a large number of computing functions coordinated in the middle. This turns out to be exactly what Cascading is: a framework for reading and writing data while performing various processing and computations in between.
But there are a couple challenges with this sort of application.
First, it is a monolith. A single failure stops the whole process, and unless the failure is merely transient, a retry after resolving the issue typically forces all upstream work to be recomputed.
Second, the application may start out as a simple pipeline, but it quickly becomes a DAG of individual flows; Cascading has a Cascade topological scheduler for just this purpose. Often a large, complex flow is broken down into smaller flows to reduce the monolith effects.
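To make the scheduling idea concrete, here is a minimal sketch of running flows in dependency order, analogous to what Cascading's Cascade scheduler does. The flow names and dependencies are hypothetical, and Python's standard-library `graphlib` stands in for the real scheduler:

```python
# Hypothetical flows in a data processing DAG; each flow maps to
# the set of flows it depends on (its predecessors).
from graphlib import TopologicalSorter

flows = {
    "parse": [],
    "cleanse": ["parse"],
    "join": ["cleanse"],
    "report": ["join"],
    "export": ["join"],
}

# static_order() yields flows so every dependency runs before
# the flows that consume its output.
order = list(TopologicalSorter(flows).static_order())
print(order)
```

A real scheduler would also run independent flows (here, "report" and "export") concurrently, but the ordering guarantee is the same.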
This is not just refactoring, but an attempt to accommodate the needs of various stakeholders: the application grew in complexity because of increasing requirements and scope. This single application, running a static DAG of flows, then becomes a blocker for both developers and stakeholders.
Tessellate is a (local mode) Cascading application that primarily reads and writes data. Most commonly, it reads text-based CSV or JSON data and writes out binary Apache Parquet files.
It turns out many failures happen at the read and write stages, and these are easily tested before deployment, or are transient in nature and simply resolved with a retry.
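A transient read failure is the cheapest kind to recover from once it is isolated at the edge of the pipeline. Here is a minimal sketch of that retry pattern; `flaky_read` is a hypothetical stand-in for a real source that fails intermittently:

```python
import time

def with_retries(fn, attempts=3, delay=0.0):
    """Call fn, retrying on IOError up to `attempts` times."""
    for i in range(attempts):
        try:
            return fn()
        except IOError:
            if i == attempts - 1:
                raise  # give up after the final attempt
            time.sleep(delay)

calls = {"n": 0}

def flaky_read():
    # Fails twice, then succeeds -- simulating a transient outage.
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient read failure")
    return ["row1", "row2"]

rows = with_retries(flaky_read)
```

Because the read stage is its own unit of work, only this stage is retried; none of the downstream computation has to be recomputed.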
So the idea behind Tessellate is to get data into a clean, schematized format and layout that is optimal for downstream stakeholders, and to isolate this part of the pipeline from the business part, because there isn't just one business part, and there isn't just one stakeholder for a dataset. Let them solve their problems independently.
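"Schematized" here means that raw text fields get coerced into declared types before downstream consumers ever see them. A minimal sketch of that step, with an illustrative schema and standard-library CSV parsing standing in for Tessellate's real taps:

```python
import csv
import io

# Hypothetical schema: field name -> type coercion.
schema = {"id": int, "price": float, "sku": str}

# Raw text input, as it might arrive from an upstream system.
raw = io.StringIO("id,price,sku\n1,9.99,A-100\n2,4.50,B-200\n")

# Apply the schema to every row, turning strings into typed values.
records = [
    {name: cast(row[name]) for name, cast in schema.items()}
    for row in csv.DictReader(raw)
]
```

In practice the typed records would then be written to a columnar format like Parquet, which carries the schema with the data so every downstream consumer reads the same types.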
Tessellate was designed to work with the Clusterless framework.