Chris K Wensel

Data and Analytics Architect

Experienced Data & Systems Architect focusing on the intersection of very large data, cloud infrastructure, and machine learning.

Professional Services

I’m actively taking on:

  • Developer Contracting on an hourly/daily rate
  • Architecture Consulting on a monthly retainer

If you are:

  • Considering large-scale data infrastructure and cloud initiatives — needing experienced planning, design, and development to reduce project risk and costs
  • Planning new or have existing Data Engineering and Feature Engineering pipelines — needing insights on managing latency, scale, and costs
  • Relying on Cascading — looking to refresh an existing Cascading application, lower cloud costs, undergo related upgrades, or even migrate off

Reach out to discuss how I can be of help.

If your organization is still actively in production with Cascading (this includes Scalding and other derivative frameworks), consider sponsoring the project.

Cloud Data and Feature Engineering

Over the last five years I’ve gained additional key insights from handling massive data in modern cloud environments; applied, especially at smaller scale, they increase developer agility and lower costs.

To simplify and share these insights, I have started new open-source projects, aimed at developers and data scientists, that intersect data engineering and cloud infrastructure.

You can follow along as I post about it by following the #clusterless tag on Mastodon.

Key Technologies and Accomplishments

  • Cloud Computing
    • Years of experience deploying systems via AWS CDK and CloudFormation
    • Reliably ingressed terabytes a day with virtually no data loss and high system utilization against elastic capacity
    • Engineered petabytes of data to improve query performance and cost for AWS Athena and PrestoSQL/Trino
    • One of the first to deploy Apache Hadoop in production on AWS EC2 (prior to AWS EMR)
    • One of the first users of AWS EMR (in 2009)
  • Observability and Visualization
    • Engineered petabytes of data for use in KPI dashboards, built with tools like Tableau and Apache Superset, to measure performance changes and provide attribution across releases
    • Created the first real-time monitoring platform for Apache Hadoop and Apache Spark, acquired in 2016
    • Former W3C Distributed Tracing Working Group member
  • Machine Learning / Data Science
    • Engineered petabytes of data for use in customer-facing predictive performance models
    • Modeling, simulation, and feature engineering using large and complex distributed-tracing and telemetry data
    • ML Ops — automation of feature extraction, training, and validation
  • “Big Data”
    • Created and instructed the first commercial Apache Hadoop training program, taught at Sun, Apple, HP, and Orange
    • Authored last chapter of Hadoop: The Definitive Guide
  • Open Source
    • Created Cascading, used by thousands of developers for creating distributed applications, in production at Apple, Twitter, Stripe, Chase, and Salesforce
    • Supported multiple data-oriented languages created on top of the Cascading planner, like Scalding, Cascalog, and Lingual
    • Created and participated in dozens of related data-oriented projects, like Apache Calcite
    • Apache Software Foundation (ASF) Member