Fleet data platform: a canonical model that joins operations, costs and risk per vehicle

I designed the canonical data model that unifies six fleet sources (operations, risk telematics, SAP costs, vehicle fleet, commercial and drivers) under a single key per vehicle and period, and built two custom MCP servers that expose the systems' APIs and the domain documentation to code agents.

This is a real case; the client is anonymized for confidentiality. The business problem, method and decisions are described; no project code or sensitive data is published.

Client. An operator transporting personnel to oil and mining sites in Vaca Muerta and Mendoza: hundreds of vehicles, contracts by site, and a fuel and risk cost measured per kilometer driven on demanding routes.

Approach

This case spans three links of the method (sociology of organizations, software engineering and data engineering) with no handoffs between them: the same reading that established what a trip, a cost or a risk event means in this company later shaped the data model and the tools. The fourth link, AI over the data, is outside the delivered scope: here AI is applied as the engineer's working instrument, not as a product. The distinction is deliberate and is part of the brand’s commitment to honesty.

The governing thesis is contextual data quality: a technically correct fleet data point can still mislead if it is read outside the human process that produced it. A license plate is not a stable key; a driver does not have a unique identifier across systems; a “trip” in operations is not the same unit as a “trip” in costs. Modeling the meaning first, and only then the table, is what separates a data platform from a faster warehouse of errors.

The problem detected

The company governed its fleet with scattered spreadsheets: operations lived in one system, risk and telematics in another, costs and revenue in SAP, and the master of vehicles and contracts in a third place. No one could answer, with a single reliable query, the question that matters to the business: does this vehicle, in this period and under this contract, make or lose money, and at what level of risk does it operate?

The stated problem was “we want a dashboard.” The real problem came before that: there was no common key for joining the sources without degrading the meaning at each join. Without that backbone, any dashboard would have averaged data that do not speak the same language.

Functional analysis

I mapped six source layers and, for each one, what it really says, as opposed to what the manual says:

Source	What it contributes	Unit of analysis
Operations	Runs, kilometers, fuel, punctuality	trip / vehicle
Risk telematics	Events, ranking, driving score, ADAS, positions	event / vehicle
SAP (S/4HANA + BTP + CPI)	Costs, revenue, cost centers	accounting document / cost center
Vehicle fleet	Master: plate ↔ internal ID ↔ cost center ↔ contract	vehicle
Commercial	Client and contract	contract
People / drivers	Multi-source reconciliation of drivers	person

The analysis included an explicit gap analysis: for example, the telematics platform’s positions API did not expose the ranking or the ADAS data, which forced an escalation to the provider instead of an assumption that the data was available. Documenting what cannot be obtained is as much part of the analysis as documenting what can.
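The gap check described above can be sketched in a few lines of Python. This is an illustration, not the project's code: the field names, the `Gap` record and the `find_gaps` helper are hypothetical, and the real comparison would run against the provider's actual spec.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gap:
    source: str   # which source layer the gap belongs to
    field: str    # field the canonical model needs but the API lacks
    action: str   # agreed next step


def find_gaps(source, required, exposed, action="escalate to provider"):
    """Return one Gap per required field that the API does not expose."""
    missing = sorted(set(required) - set(exposed))
    return [Gap(source, name, action) for name in missing]


# Hypothetical field names, for illustration only.
required_by_model = {"position", "ranking", "adas_events"}
exposed_by_positions_api = {"vehicle_id", "timestamp", "position"}

gaps = find_gaps("risk telematics", required_by_model, exposed_by_positions_api)
for gap in gaps:
    print(f"{gap.source}: missing '{gap.field}' -> {gap.action}")
```

Keeping each gap as a record with an explicit action is what lets the register of open questions be reviewed later, rather than living in someone's notes.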

Building the technical solution

The central piece is not a pipeline: it is a canonical integration key, patente_normalizada + periodo + asignación vigente (normalized plate, period and active assignment), that allows joining the six sources without losing the meaning at the joins. On that key I designed a dimensional model:

Dimensions: dim_cliente, dim_contrato, dim_centro_costo.
A fact table, fact_asignacion_vehiculo, versioned over time, because a vehicle changes contract and cost center, and crossing costs against operations must respect which assignment was active in each period.
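A minimal Python sketch of how such a key could be resolved, under stated assumptions: plate normalization is taken to mean uppercasing and dropping separators, the period is a calendar month, and each assignment row carries a validity interval. The names `Assignment`, `normalize_plate`, `active_assignment` and `canonical_key` are hypothetical, and the sample plate and contract values are invented. It is not the project's code.

```python
import re
from dataclasses import dataclass
from datetime import date
from typing import Optional


def normalize_plate(raw: str) -> str:
    """Uppercase and drop everything that is not an ASCII letter or digit."""
    return re.sub(r"[^A-Z0-9]", "", raw.upper())


@dataclass(frozen=True)
class Assignment:
    plate: str                 # normalized plate
    contract_id: str
    cost_center: str
    valid_from: date           # inclusive
    valid_to: Optional[date]   # exclusive; None means still active


def active_assignment(assignments, raw_plate, on_date):
    """Return the assignment in force for this plate on this date, or None."""
    plate = normalize_plate(raw_plate)
    for a in assignments:
        if a.plate != plate:
            continue
        if a.valid_from <= on_date and (a.valid_to is None or on_date < a.valid_to):
            return a
    return None


def canonical_key(assignments, raw_plate, on_date):
    """Build (plate, period, contract, cost center), or None if unassigned."""
    a = active_assignment(assignments, raw_plate, on_date)
    if a is None:
        return None
    return (a.plate, on_date.strftime("%Y-%m"), a.contract_id, a.cost_center)


# Invented sample data: one vehicle that changes contract on 2024-03-01.
assignments = [
    Assignment("AB123CD", "C-1", "CC-10", date(2024, 1, 1), date(2024, 3, 1)),
    Assignment("AB123CD", "C-2", "CC-20", date(2024, 3, 1), None),
]

print(canonical_key(assignments, "ab 123-cd", date(2024, 2, 15)))
print(canonical_key(assignments, "AB123CD", date(2024, 3, 1)))
```

Resolving the assignment by event date, and only then deriving the period, keeps a mid-period contract change from attributing costs to the wrong contract.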

The model was documented as a source of truth (conceptual model, canonical data model, business glossary, roadmap by source and a living register of open questions), so that each design decision stays auditable and not in one person’s head.

Faced with the MVP the client was building in parallel on a lakehouse, I positioned the target model explicitly: not as competition, but as the meaning layer that any analytics tool needs underneath in order not to produce dashboards that lie.

Information and data layer

Data governance was part of the design, not an appendix. Three fronts:

Quality and traceability. Each business metric can be traced back to its source and its definition; the glossary fixes what each term means before it is computed.
Identity and compliance. I identified a driver identity risk in the multi-source reconciliation of people, with direct impact on auditing, framed within Law 25.326 on personal data protection. A badly resolved join here is not a bug: it is a regulatory liability.
Executive communication. The state of the model was communicated through progress reports and meeting minutes for the stakeholders, so that the data decisions were also legible to those who do not read SQL.
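The driver identity risk on the second front can be made concrete with a conservative reconciliation sketch in Python. Everything here is an assumption for illustration: the presence of a national ID field, the record layout, the `reconcile_drivers` helper and the fictitious sample people. The rule it shows is simply that records merge only on a strong identifier with agreeing names, and everything else goes to manual review instead of being guessed.

```python
from collections import defaultdict


def reconcile_drivers(records):
    """Split records into (matched, review).

    records: iterable of dicts with 'source', 'national_id' and 'name'.
    matched: national ID -> records that agree on the normalized name.
    review:  records with no usable ID, or whose names conflict.
    """
    by_id = defaultdict(list)
    review = []
    for r in records:
        digits = "".join(ch for ch in (r.get("national_id") or "") if ch.isdigit())
        if not digits:
            review.append(r)          # no strong identifier: never guess
        else:
            by_id[digits].append(r)

    matched = {}
    for national_id, group in by_id.items():
        names = {" ".join(g["name"].lower().split()) for g in group}
        if len(names) == 1:
            matched[national_id] = group
        else:
            review.extend(group)      # same ID, different names: a human decides
    return matched, review


# Fictitious records, for illustration only.
records = [
    {"source": "operations", "national_id": "30.111.222", "name": "Ana Perez"},
    {"source": "telematics", "national_id": "30111222", "name": "ana  perez"},
    {"source": "sap", "national_id": "", "name": "A. Perez"},
    {"source": "operations", "national_id": "28000111", "name": "Juan Gomez"},
    {"source": "telematics", "national_id": "28000111", "name": "Jose Gomez"},
]

matched, review = reconcile_drivers(records)
print(sorted(matched), len(review))
```

The review queue is the point of the design: an unresolved record is visible and auditable, while a silently wrong merge is not.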

How the work was run

The work was carried out with a code agent harness (Claude Code) governed by its own instruments, not by improvised prompting. The concrete, verifiable elements:

Two custom MCP servers, in Python, that expose the domain to the agents:
- one for the fleet operations system (assets, runs sheet, fuel report, generic request), with mapping to the canonical model;
- one for the telematics platform positions API (vehicles, kilometer count, daily summary, last position), with OAuth2 ROPC authentication against Azure B2C and a smoke test via CLI.
Each server carries its OpenAPI spec versioned in the repo and a degraded mode without credentials: the agent can reason about the shape of the API even without production access. This is layered governance applied to the tool: the agent accesses the data contract, not the secret.
The domain documentation (canonical model, glossary, analyses) is also exposed via MCP, so that the agent works against the versioned source of truth and not against its memory.
The repetitive tasks of the analysis —querying an API, contrasting its response against the model, recording a gap— become reproducible workflows over these tools, instead of manual steps that degrade with fatigue.
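The degraded mode mentioned above can be illustrated with a small Python sketch. It deliberately uses no MCP SDK calls and makes no network request; the `PositionsTool` class, the spec fragment and its paths are hypothetical stand-ins, not the project's servers or the provider's real API.

```python
# Hypothetical fragment of a versioned OpenAPI spec, kept in the repo.
SPEC = {
    "paths": {
        "/vehicles": {"get": {"summary": "List vehicles"}},
        "/vehicles/{id}/last-position": {"get": {"summary": "Last known position"}},
    }
}


class PositionsTool:
    """Exposes the API contract always, and live data only with a token."""

    def __init__(self, spec, token=None):
        self.spec = spec
        self.token = token

    @property
    def degraded(self):
        return self.token is None

    def describe(self):
        """List (METHOD, path, summary) from the spec; needs no credentials."""
        return [
            (method.upper(), path, op.get("summary", ""))
            for path, ops in sorted(self.spec["paths"].items())
            for method, op in sorted(ops.items())
        ]

    def call(self, method, path):
        ops = self.spec["paths"].get(path, {})
        if method.lower() not in ops:
            raise ValueError(f"{method.upper()} {path} is not in the spec")
        if self.degraded:
            # No secret available: return the contract, never invented data.
            return {"mode": "degraded", "operation": ops[method.lower()], "data": None}
        raise NotImplementedError("live HTTP call omitted from this sketch")


tool = PositionsTool(SPEC)  # no token: degraded mode
print(tool.describe())
print(tool.call("get", "/vehicles"))
```

In degraded mode the response says so explicitly and carries no data, so an agent cannot mistake the shape of the contract for a real reading.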

The point: AI is not in the product delivered to the client; it is in how I build, the way any serious engineer uses their toolset today. Showing it this way, factually and without adjectives, is what distinguishes real command of the tool from fashionable talk.

What this case proves

Real data engineering: a multi-source canonical model with a key that preserves the meaning, not a schema drawn from memory.
Software engineering with regulated-environment discipline: two real MCP servers, with specs and tests, not a generic mention of “I use AI.”
Sociotechnical reading: the driver identity risk and Law 25.326 were detected by reading the human process behind the data, not by auditing the code.