Experienced Data & Systems Architect focusing on the intersection of very large data, cloud infrastructure, and machine learning.
Professional Services
I’m actively taking on:
- Developer Contracting on an hourly/daily rate
- Architecture Consulting on a monthly retainer
If you are:
- Learning how to adopt AI/LLM technologies, wanting repeatable LLM-augmented development workflows and help navigating the AI space
- Considering large-scale data infrastructure and cloud initiatives, needing experienced planning, design, and development to reduce project risk and cost
- Planning new, or running existing, data engineering and feature engineering pipelines, needing insight into managing latency, scale, and cost
- Relying on Cascading, looking to refresh an existing Cascading application, lower cloud costs, take on related upgrades, or migrate off entirely
Reach out to discuss how I can help: chris@wensel.net
If your organization is still actively in production with Cascading (this includes Scalding and other derivative frameworks), consider sponsoring the project: https://github.com/sponsors/cwensel
Repeatable AI/LLM Engineering Processes
AI/LLM-augmented development is still new and emerging. When tools like GitHub Copilot and Claude first appeared, I started a loose agile process that helps engineers across the full experience continuum keep collaborating while still producing trustworthy code when the bulk of code creation is delegated to an LLM.
Recommendation Decisioning Records (RDRs) are a way to reduce context drift when engineering a new feature set or a whole application. It isn’t a rigid process, in the same way agile processes aren’t rigid; it’s a starter framework that lets teams adapt how they build software while ensuring a higher level of quality and repeatability.
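For illustration, here is a minimal sketch of the shape a single RDR might take, expressed as a Python dataclass. The field names are my assumptions, loosely modeled on Architecture Decision Records; the RDR repo linked below defines the actual format.

```python
# Hypothetical RDR structure; these fields are assumptions, not the
# canonical format from the RDR repo.
from dataclasses import dataclass, field

@dataclass
class RecommendationDecisioningRecord:
    title: str                  # short name for the feature or change
    context: str                # the problem and constraints handed to the LLM
    recommendation: str         # the approach the LLM (or engineer) proposes
    decision: str               # what the team accepted, and why
    status: str = "proposed"    # e.g. proposed | accepted | superseded
    consequences: list[str] = field(default_factory=list)  # risks, follow-on work
```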
RDRs can be applied incrementally, or you can go all in. As an experiment, I created Arcaneum, a RAG embedding pipeline and CLI that provides full-text and semantic search across source code, PDFs, and Markdown documents.
To identify and offset the pitfalls of pure LLM development, I have never looked at a line of code in Arcaneum.
I don’t recommend this for production code, but I do recommend every engineer attempt to build a new tool while blindfolded in order to understand how it feels to be fully connected to the features of an application yet fully disconnected from its source. It’s a strange experience, but an important one.
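To make “RAG embedding pipeline” concrete, here is a minimal, illustrative sketch of the embed-then-search loop a tool like Arcaneum performs. This is not Arcaneum’s code; the library and model choices are my assumptions (it assumes `pip install sentence-transformers numpy`).

```python
# Illustrative semantic search: embed a corpus once, then rank documents
# by cosine similarity against an embedded query.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # model choice is an assumption

documents = [
    "def connect(host, port): ...",
    "Cascading plans data flows as DAGs over Hadoop.",
    "Invoices are stored as PDFs in the archive bucket.",
]

# Normalized embeddings make the dot product equal to cosine similarity.
doc_vectors = model.encode(documents, normalize_embeddings=True)

def semantic_search(query: str, k: int = 2) -> list[str]:
    query_vector = model.encode([query], normalize_embeddings=True)[0]
    scores = doc_vectors @ query_vector
    return [documents[i] for i in np.argsort(scores)[::-1][:k]]

print(semantic_search("how do I open a network connection?"))
```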
Clone the RDR repo to get started.
Cloud Data and Feature Engineering
Over time I’ve picked up additional key insights into handling massive data in modern cloud environments, insights that, especially when applied to scaled-down systems, increase developer agility and lower costs.
To simplify and share these insights, I have started new open-source projects, aimed at developers and data scientists, at the intersection of data engineering and cloud infrastructure:
The community can be found here:
You can follow along as I post about it via the #clusterless tag on Mastodon.
Key Technologies and Accomplishments
- Verifiable Credentials
- Working with government agencies, I have experience creating cryptographically secure, ePassport-derived credentials (e.g. a photo ID) for identity verification, and adjacent credentials for asserting occupational licensing status or membership in loyalty programs
- ePassport - ICAO 9303
- ICAO Digital Travel Credential (DTC)
- mDL (Mobile Driver’s License) - ISO/IEC 18013-5
- OpenID4VCI (OpenID for Verifiable Credential Issuance)
- OpenID4VP (OpenID for Verifiable Presentations)
- SIOPv2 (Self-Issued OpenID Provider v2) - an emerging cloud wallet presentation layer
- Cloud Architecture
- Years of experience deploying systems via AWS CDK and CloudFormation
- Parameterized, declarative deployment frameworks for provisioning scaled-down developer, UAT, and production systems (a minimal sketch appears at the end of this list)
- Enabling local-first integration testing via topologically equivalent desktop deployments
- Cloud Computing
- Reliably ingested terabytes a day with virtually no data loss and high system utilization against elastic capacity
- Engineered petabytes of data to improve query performance and cost on AWS Athena and PrestoSQL/Trino (see the partition-layout sketch at the end of this list)
- One of the first to deploy Apache Hadoop in production on AWS EC2 (prior to AWS EMR)
- One of the first users of AWS EMR, dating to 2009 - https://aws.amazon.com/solutions/case-studies/razorfish/
- Observability and Visualization
- Engineered petabytes of data for use in KPI dashboards, built with tools like Tableau and Apache Superset, to measure performance changes and provide attribution across releases
- Created the first real-time monitoring platform for Apache Hadoop and Apache Spark, acquired in 2016
- Former W3C Distributed Tracing Working Group member
- Machine Learning / Data Science
- Engineered petabytes of data for use in customer facing predictive performance models
- Modeling, simulation, and feature engineering using large and complex distributed-tracing and telemetry data
- ML Ops — automation of feature extraction, training, and validation
- “Big Data”
- Created and delivered the first commercial Apache Hadoop training program, taught at Sun, Apple, HP, and Orange
- Authored the final chapter of Hadoop: The Definitive Guide
- Open Source
- Created Cascading, used by thousands of developers to build distributed applications, in production at Apple, Twitter, Stripe, Chase, and Salesforce
- Supported multiple data-oriented languages built on top of the Cascading planner, such as Scalding, Cascalog, and Lingual
- Created, or contributed to, dozens of related data-oriented projects, such as Apache Calcite
- Apache Software Foundation (ASF) Member
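As referenced under Cloud Architecture above, here is a minimal sketch of a parameterized, declarative deployment using AWS CDK in Python. The stack contents and sizing table are illustrative assumptions, not a specific production system; the point is that every environment shares one topology and only the scale changes, which is what keeps a desktop or developer deployment topologically equivalent to production.

```python
# A hypothetical environment-parameterized CDK app; sizing knobs and
# resources are illustrative assumptions.
import aws_cdk as cdk
from aws_cdk import aws_ec2 as ec2
from constructs import Construct

SIZES = {
    "dev":  {"max_azs": 1, "nat_gateways": 0},  # scaled down for agility and cost
    "uat":  {"max_azs": 2, "nat_gateways": 1},
    "prod": {"max_azs": 3, "nat_gateways": 3},
}

class DataPlatformStack(cdk.Stack):
    def __init__(self, scope: Construct, construct_id: str, *, env_name: str, **kwargs):
        super().__init__(scope, construct_id, **kwargs)
        size = SIZES[env_name]
        # Same topology in every environment; only the scale differs.
        ec2.Vpc(self, "Vpc",
                max_azs=size["max_azs"],
                nat_gateways=size["nat_gateways"])

app = cdk.App()
env_name = app.node.try_get_context("env") or "dev"
DataPlatformStack(app, f"data-platform-{env_name}", env_name=env_name)
app.synth()
```

Deploying a scaled-down copy is then just `cdk deploy -c env=dev`.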
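Similarly, as referenced under Cloud Computing, here is a minimal sketch of the partitioned, columnar layout that lets engines like Athena and Trino prune partitions and read only the columns a query touches. Column names and the output path are illustrative assumptions.

```python
# Writing a partitioned Parquet dataset with pyarrow; the schema and
# output path are illustrative assumptions.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "dt":         ["2024-01-01", "2024-01-01", "2024-01-02"],
    "user_id":    [1, 2, 3],
    "latency_ms": [12.5, 8.1, 30.2],
})

# One directory per `dt` value lets the engine skip whole partitions for
# predicates like WHERE dt = '2024-01-02', and Parquet's columnar format
# means only referenced columns are scanned and paid for.
pq.write_to_dataset(table, root_path="events", partition_cols=["dt"])
```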