Chris K Wensel

Data and Analytics Architect

Experienced Data & Systems Architect focusing on the intersection of very large data, cloud infrastructure, and machine learning.

Professional Services

I’m actively taking on:

  • Developer Contracting on an hourly/daily rate
  • Architecture Consulting on a monthly retainer

If you are:

  • Considering large-scale data infrastructure and cloud initiatives — needing experienced planning, design, and development to reduce project risk and costs
  • Planning new or have existing Data Engineering and Feature Engineering pipelines — needing insights on managing latency, scale, and costs
  • Relying on Cascading — looking to refresh an existing Cascading application, lower cloud costs, undergo related upgrades, or even migrate off

Reach out to discuss how I can be of help.

If your organization is still actively in production with Cascading (this includes Scalding and other derivative frameworks), consider sponsoring the project.

Cloud Data and Feature Engineering

Over the last five years I’ve gained additional key insights from handling massive data in modern cloud environments; applied, especially at smaller scale, they increase developer agility and lower costs.

To simplify and share these insights, I have started new open-source projects, aimed at developers and data scientists, that intersect data engineering and cloud infrastructure.

You can follow along as I post about it by following the #clusterless tag on Mastodon.

Key Technologies and Accomplishments

  • Cloud Computing
    • Years of experience deploying systems via AWS CDK and CloudFormation
    • Reliably ingressed terabytes a day with virtually no data loss and high system utilization against elastic capacity
    • Engineered petabytes of data to improve query performance and cost for AWS Athena and PrestoSQL/Trino
    • One of the first to deploy Apache Hadoop in production on AWS EC2 (prior to AWS EMR)
    • One of the first users of AWS EMR (in 2009)
  • Observability and Visualization
    • Engineered petabytes of data for use in KPI dashboards, built with tools like Tableau and Apache Superset, to measure performance changes and provide attribution across releases
    • Created the first real-time monitoring platform for Apache Hadoop and Apache Spark, acquired in 2016
    • Former W3C Distributed Tracing Working Group member
  • Machine Learning / Data Science
    • Engineered petabytes of data for use in customer-facing predictive performance models
    • Modeling, simulation, and feature engineering using large and complex distributed-tracing and telemetry data
    • ML Ops — automation of feature extraction, training, and validation
  • “Big Data”
    • Created and instructed the first commercial Apache Hadoop training program, taught at Sun, Apple, HP, and Orange
    • Authored last chapter of Hadoop: The Definitive Guide
  • Open Source
    • Created Cascading, used by thousands of developers for creating distributed applications, in production at Apple, Twitter, Stripe, Chase, and Salesforce
    • Supported multiple data-oriented languages created on top of the Cascading planner, like Scalding, Cascalog, and Lingual
    • Created and participated in dozens of related data-oriented projects, like Apache Calcite
    • Apache Software Foundation (ASF) Member