Streaming pipelines on GCP: high-volume event-driven data, validated and governed

I built and operated ETL/ELT and streaming pipelines on GCP for high-volume event-driven data, with backend services in Python and Java, and contributed validation frameworks and data governance standards in a multi-stakeholder environment.

This is a real case, with the client anonymized for confidentiality. The business problem, method and decisions are described; no client code or sensitive data is published.

Client. A technology company with a product that generates high-volume events, where data is not a byproduct: it is the raw material of the business.

Approach

This case covers two links of the method, software engineering and data engineering, in a domain where volume does not forgive carelessness: at streaming scale, badly validated data cannot be fixed by hand; it multiplies. The thesis that organizes the work is direct: without reliable data, no analytics or AI is worth anything; you only automate the errors faster. Validation, therefore, is not a final step but a property of the pipeline.

The problem detected

Processing high-volume events in real time means resolving two tensions at once: performance (not losing events or accumulating lag) and trust (what enters the data lake means what it claims to mean). In environments like this, what usually fails is not throughput but silent quality: anomalies and drift that do not break the pipeline but poison the decisions downstream.
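
To make "silent quality" concrete, here is a minimal illustrative sketch, not code from the engagement. It assumes a simple approach: compute one quality metric per window (here, the null rate of a field) and flag any window that strays too far from a trailing baseline. The names DriftMonitor and null_rate, the z-score rule and every threshold are hypothetical choices made for the example.

```python
from collections import deque
from statistics import mean, pstdev


class DriftMonitor:
    """Flags windows whose metric strays from a trailing baseline (hypothetical helper)."""

    def __init__(self, history=20, threshold=3.0, min_history=5, min_sigma=0.01):
        self.baseline = deque(maxlen=history)  # recent healthy values only
        self.threshold = threshold             # allowed distance, in sigmas
        self.min_history = min_history         # warm-up before judging anything
        self.min_sigma = min_sigma             # floor so a flat baseline is not hypersensitive

    def observe(self, value):
        """Return True if `value` is anomalous relative to the baseline."""
        anomalous = False
        if len(self.baseline) >= self.min_history:
            mu = mean(self.baseline)
            sigma = max(pstdev(self.baseline), self.min_sigma)
            anomalous = abs(value - mu) / sigma > self.threshold
        # Only healthy windows extend the baseline, so a spike
        # does not quietly become the new normal.
        if not anomalous:
            self.baseline.append(value)
        return anomalous


def null_rate(events, field):
    """Fraction of events in a window where `field` is missing or None."""
    if not events:
        return 0.0
    return sum(1 for e in events if e.get(field) is None) / len(events)
```

The point of the sketch is that nothing here raises an exception: a window with a 30% null rate still flows through the pipeline, so a check like this has to exist on purpose for anyone to notice it.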

Functional assessment

The work took place in a multi-stakeholder environment, where different teams consumed the same data with different expectations. Part of the assessment was precisely that: understanding which decision each dataset serves, so as not to optimize a pipeline against a metric that no one cares about.

Building the technical solution

ETL/ELT and streaming pipelines on GCP (Dataflow, Pub/Sub) for high-volume event-driven data, designed to scale without sacrificing traceability.
Backend services in Python (FastAPI) and Java, deployed via Docker, as the pieces that surround and feed the pipelines.
Data validation and anomaly detection frameworks, integrated into the flow, so that quality is a guarantee of the system and not a later inspection.
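
As an illustration of validation "integrated into the flow", the following is a minimal sketch in plain Python, not the client's framework. It assumes a hypothetical event schema (event_id, event_type, ts) and a hypothetical rule set; the idea it shows is that every event is either accepted or routed to a rejected stream together with the reasons, so nothing is silently dropped.

```python
from dataclasses import dataclass
from datetime import datetime

# Hypothetical schema for the example: field name -> required Python type.
REQUIRED_FIELDS = {"event_id": str, "event_type": str, "ts": str}


@dataclass
class Rejected:
    """An event that failed validation, kept with the reasons for traceability."""
    event: dict
    reasons: list


def validate(event):
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in event:
            problems.append(f"missing field: {name}")
        elif not isinstance(event[name], expected):
            problems.append(f"wrong type for {name}")
    ts = event.get("ts")
    if isinstance(ts, str):
        try:
            datetime.fromisoformat(ts)
        except ValueError:
            problems.append("ts is not ISO 8601")
    return problems


def route(events):
    """Split a batch into (valid, rejected) without dropping any event."""
    valid, rejected = [], []
    for event in events:
        problems = validate(event)
        if problems:
            rejected.append(Rejected(event, problems))
        else:
            valid.append(event)
    return valid, rejected
```

In an Apache Beam pipeline running on Dataflow, the same split would typically be expressed as a DoFn that emits rejected events to a tagged additional output, which then feeds a dead-letter sink; the plain-Python version above only shows the shape of the decision.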

Information and data layer

Beyond the code, I contributed to data governance standards in an environment with multiple teams: shared rules about what a piece of data means, how it is validated and who answers for it. In an event-driven product, that governance is what prevents each team from building its own version of the truth.
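
One way to picture such shared rules is as a small data contract per dataset. The sketch below is illustrative only and does not reproduce the standards agreed with the client; DataContract, ContractRegistry and the field names are hypothetical. It assumes three things are recorded for every dataset: what it means, which fields it must carry, and which team answers for it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataContract:
    """Hypothetical shared agreement about one dataset."""
    dataset: str
    owner: str                   # team accountable for the data
    description: str             # the agreed meaning of the dataset
    required_fields: tuple = ()  # fields every record must carry


class ContractRegistry:
    """Single place where teams look up what a dataset means and who owns it."""

    def __init__(self):
        self._contracts = {}

    def register(self, contract):
        # A dataset without an accountable owner is not accepted.
        if not contract.owner.strip():
            raise ValueError(f"{contract.dataset}: a contract needs an owner")
        # One dataset, one contract: no parallel versions of the truth.
        if contract.dataset in self._contracts:
            raise ValueError(f"{contract.dataset}: contract already registered")
        self._contracts[contract.dataset] = contract

    def owner_of(self, dataset):
        return self._contracts[dataset].owner

    def missing_fields(self, dataset, record):
        """List the required fields that `record` does not contain."""
        contract = self._contracts[dataset]
        return [f for f in contract.required_fields if f not in record]
```

Rejecting a second registration for the same dataset is the design choice that matters here: it forces two teams with different expectations to negotiate one definition instead of each keeping their own.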

How the work was conducted

This work predates the current agentic tools: the discipline came from the craft, not the instrumentation. The method was that of the serious data engineering of its time: automated validation, anomaly detection and governance standards agreed between teams. These are the same practices I now bring to an agent harness, but back then they were sustained by hand. I record it this way out of honesty: there was no AI in the construction or in the product.

What this case proves

Data engineering at real scale: high-volume streaming on GCP, not a toy pipeline.
Stack versatility: delivery in Python and in Java, not a monoculture.
Early governance: quality and data standards in multi-stakeholder environments, a constant throughout my career.