From Raw to Certified: How a Dataset Gets Promoted in a Data-Driven Organization
Not all data is created equal. A raw CSV dropped into a landing zone by an upstream system is very different from a clean, validated, business-approved dataset that an executive trusts to make a quarterly decision. Yet in many organizations, both live side by side in the same data lake — indistinguishable to the average analyst.
This is the problem that dataset certification solves. It is the organizational process by which a dataset moves from "someone put this here" to "this is officially trusted data." It answers three questions that matter to every data consumer:
- Is this data accurate? Has it been validated against known quality rules?
- Is this data complete? Does it cover the scope it claims to cover?
- Is this data approved? Has a responsible person signed off that it is fit for its intended purpose?
This article walks through how a formal dataset certification process works in a modern data organization — the stages, the roles, the criteria, and the governance workflow.
Figure: Data moves through defined quality stages before reaching business consumers.
The Four Stages of Dataset Maturity
Before any formal process can exist, organizations need a shared vocabulary for data maturity. A widely used model maps onto the Medallion Architecture, a tiered approach to data quality that separates raw ingestion from business-ready data. Bronze and Silver each correspond to one stage below, while Gold is split into a pending stage and an approved stage.
Stage 1 — Raw (Bronze)
Data arrives as-is from source systems. No transformations, no validation, no guarantees. The Bronze layer is an immutable archive: every record that was ever received is preserved, including duplicates, nulls, and schema inconsistencies. No dataset at this stage should be used for business decisions.
Who can access it: Data engineers only. What it means: "We received this. We make no promises about it."
Stage 2 — Validated (Silver)
The data has been cleaned, deduplicated, and conformed to a standard schema. Basic quality rules have been applied: expected columns exist, data types are correct, referential integrity checks pass. The dataset is technically usable but has not yet been reviewed for business accuracy.
Who can access it: Data engineers, data scientists, and analysts doing exploratory work. What it means: "This data is technically clean. It has not been business-approved."
Stage 3 — Recommended (Gold — Pending)
A data steward or domain owner has reviewed the dataset and determined it is ready for broader use. It has been documented, tagged, and submitted for final certification review. This is the "candidate" stage — the dataset is good enough to recommend, but not yet formally certified.
Who can access it: Analysts and BI tools, with the understanding that it is under review. What it means: "A human expert has reviewed this and recommends it for use."
Stage 4 — Certified (Gold — Approved)
The dataset has passed all quality gates, received sign-off from a data owner, and been formally endorsed for business use. It is the authoritative source for its domain. Reports, dashboards, and ML models built on certified datasets are considered production-grade.
Who can access it: All authorized consumers across the organization. What it means: "This is the official, trusted source. Use this."
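The access notes for the four stages can be written down as a small lookup, which makes the policy explicit before anyone tries to enforce it. The following is a minimal Python sketch, not a definitive implementation: the `Stage` enum, the group labels such as `data_engineer` and `bi_tool`, and the helper names are illustrative assumptions rather than part of any catalog product. It encodes the "Who can access it" notes as they are listed above.

```python
from enum import Enum


class Stage(Enum):
    RAW = "bronze"
    VALIDATED = "silver"
    RECOMMENDED = "gold_pending"
    CERTIFIED = "gold_approved"


# Consumer groups per stage, copied from the "Who can access it" notes.
# Group labels are hypothetical; None means "every authorized consumer".
ACCESS = {
    Stage.RAW: {"data_engineer"},
    Stage.VALIDATED: {"data_engineer", "data_scientist", "analyst"},
    Stage.RECOMMENDED: {"analyst", "bi_tool"},
    Stage.CERTIFIED: None,
}


def can_access(group: str, stage: Stage) -> bool:
    """Return True if the consumer group may read datasets at this stage."""
    allowed = ACCESS[stage]
    return allowed is None or group in allowed


def is_production_grade(stage: Stage) -> bool:
    """Only certified datasets back production reports, dashboards, and models."""
    return stage is Stage.CERTIFIED
```

In a real platform these rules would more likely live in catalog grants than in application code; the sketch only shows that the policy is simple enough to state as data.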
The Roles Involved
Dataset certification is not a technical process — it is an organizational one. Technology enables it, but people make it work. Three distinct roles are required.
Data Engineer
Responsible for building and maintaining the pipeline that moves data from Raw to Validated. The data engineer implements quality checks, applies transformations, and documents schema changes. They do not certify data — their role ends at making data technically sound.
Data Steward
A domain expert (often a senior analyst or business analyst) who understands the business meaning of the data. The data steward validates that the dataset accurately represents what it claims to represent — not just that the columns have correct types, but that a customer_status = 'active' record actually corresponds to what the business considers an active customer. The steward submits the dataset for certification and owns the documentation.
Data Owner
A senior stakeholder (often a department head or VP of Data) who gives final sign-off on certification. The data owner is accountable for the dataset — if a certified dataset turns out to be wrong, the data owner is responsible. This accountability is what gives certification its organizational weight.
The critical insight: certification requires accountability. A dataset certified by nobody is trusted by nobody. The moment a named person stakes their reputation on a dataset's accuracy, the entire organization's relationship with that data changes.
The Certification Criteria
Before a dataset can move from Stage 3 to Stage 4, it must satisfy a defined set of criteria. These criteria should be documented in a Data Certification Checklist maintained by the data governance team.
A typical checklist covers six dimensions:
| Dimension | What Is Checked | Example Criterion |
|---|---|---|
| Completeness | Are all expected records present? | No more than 0.1% null values in key columns |
| Accuracy | Does the data match source-of-truth systems? | Customer count matches CRM system ±0.5% |
| Timeliness | Is the data current? | Refreshed within SLA (e.g., daily by 6 AM) |
| Consistency | Is the data consistent across joins? | No orphaned foreign keys |
| Uniqueness | Are there duplicate records? | Primary key has zero duplicates |
| Documentation | Is the dataset described? | Data dictionary exists, all columns have descriptions |
Every criterion must be met before certification is granted. Failing even one criterion sends the dataset back to the Data Engineer for remediation.
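The all-or-nothing rule is easy to automate once each dimension has a measurable threshold. Below is a minimal Python sketch that uses the example criteria from the table. The `DatasetMetrics` fields and function names are hypothetical, and it assumes the measurements have already been collected by whatever quality tooling the pipeline uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetMetrics:
    """Measurements for one dataset; all field names are hypothetical."""
    key_column_null_rate: float   # fraction of nulls in key columns (0.001 = 0.1%)
    dataset_customer_count: int
    crm_customer_count: int
    refreshed_within_sla: bool
    orphaned_foreign_keys: int
    duplicate_primary_keys: int
    has_data_dictionary: bool
    undocumented_columns: int


def relative_gap(observed: int, reference: int) -> float:
    """Relative difference between the dataset and the source of truth."""
    if reference == 0:
        return 0.0 if observed == 0 else float("inf")
    return abs(observed - reference) / reference


def evaluate_checklist(m: DatasetMetrics) -> dict[str, bool]:
    """Map each of the six dimensions to pass (True) or fail (False)."""
    gap = relative_gap(m.dataset_customer_count, m.crm_customer_count)
    return {
        "completeness": m.key_column_null_rate <= 0.001,   # at most 0.1% nulls
        "accuracy": gap <= 0.005,                          # within 0.5% of CRM
        "timeliness": m.refreshed_within_sla,
        "consistency": m.orphaned_foreign_keys == 0,
        "uniqueness": m.duplicate_primary_keys == 0,
        "documentation": m.has_data_dictionary and m.undocumented_columns == 0,
    }


def failed_dimensions(m: DatasetMetrics) -> list[str]:
    """List the dimensions that block certification, in checklist order."""
    return [name for name, ok in evaluate_checklist(m).items() if not ok]


def is_certifiable(m: DatasetMetrics) -> bool:
    """Every criterion must pass; a single failure blocks certification."""
    return not failed_dimensions(m)
```

Returning the list of failed dimensions, rather than a bare yes or no, gives the engineer a concrete remediation list when the dataset is sent back.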
The Governance Workflow
The certification process follows a defined workflow with clear handoffs between roles. Here is how it works end-to-end in a mature data organization.
Step 1 — Engineer Submits for Steward Review
The Data Engineer completes the technical work: the Silver layer pipeline is stable, automated quality checks pass consistently over at least two weeks of production runs, and the dataset schema is documented. The engineer opens a Data Certification Request — a ticket or form that includes:
- Dataset name, schema, and location
- Pipeline documentation (how data is produced)
- Quality check results (success rate over the past 30 days)
- Known limitations or caveats
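The contents of the request map naturally onto a small record type. The Python sketch below is one possible shape, with hypothetical field names. It also reads "pass consistently over at least two weeks" as "the last 14 production runs all passed", which is an assumption; an organization may choose a looser rule.

```python
from dataclasses import dataclass, field


@dataclass
class CertificationRequest:
    """A Data Certification Request; field names are illustrative only."""
    dataset_name: str
    schema: dict[str, str]            # column name -> data type
    location: str
    pipeline_doc: str                 # link or path to pipeline documentation
    check_results: list[bool]         # one entry per production run, oldest first
    known_limitations: list[str] = field(default_factory=list)

    def success_rate_30d(self) -> float:
        """Share of passing quality-check runs among the 30 most recent runs."""
        recent = self.check_results[-30:]
        return sum(recent) / len(recent) if recent else 0.0

    def ready_for_steward_review(self, stable_runs: int = 14) -> bool:
        """True if the most recent `stable_runs` runs exist and all passed."""
        recent = self.check_results[-stable_runs:]
        return len(recent) == stable_runs and all(recent)
```

Keeping the raw run history on the request, instead of only a summary percentage, lets the steward see when failures occurred and not just how many there were.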
Step 2 — Steward Conducts Business Review
The Data Steward receives the certification request and conducts a business-level review over 5–10 business days. This is not a rubber stamp — the steward runs sample queries, compares the dataset against other known-good sources, and validates business logic. Common questions the steward asks:
- Does the definition of "active customer" in this dataset match our agreed business definition?
- Are historical records preserved correctly, or has something been silently backfilled?
- Are there any known periods where source data was unreliable (system migrations, outages)?
If the steward is satisfied, they fill in the documentation gaps, apply the appropriate tag (for example, certification_status = recommended), and escalate to the Data Owner.
Step 3 — Owner Grants Final Certification
The Data Owner reviews the steward's findings and the certification checklist. This review is typically brief (1–2 days) — the owner is not re-doing the steward's work, they are exercising judgment about whether the dataset is ready to be the official source of truth for their domain. If approved, the owner:
- Applies the Certified endorsement in the data catalog
- Sets a recertification date (typically 6 or 12 months)
- Notifies data consumers that the dataset is now the authoritative source
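The handoffs in Steps 1 to 3 amount to a small state machine in which each promotion belongs to exactly one role. The sketch below illustrates that idea in Python. The stage and role labels, `PromotionError`, and `promote` are hypothetical names, and a real implementation would also record who acted and when.

```python
# Promotion steps from the workflow, each gated by the role that owns it.
# Stage and role names are hypothetical labels chosen for this sketch.
PROMOTIONS = {
    ("raw", "validated"): "data_engineer",
    ("validated", "recommended"): "data_steward",
    ("recommended", "certified"): "data_owner",
}


class PromotionError(Exception):
    """Raised when a promotion skips a stage or comes from the wrong role."""


def promote(current: str, target: str, actor_role: str) -> str:
    """Return the new stage if the actor is allowed to make this promotion."""
    required_role = PROMOTIONS.get((current, target))
    if required_role is None:
        raise PromotionError(f"no promotion path from {current!r} to {target!r}")
    if actor_role != required_role:
        raise PromotionError(
            f"{target!r} requires sign-off by {required_role}, not {actor_role}"
        )
    return target
```

Because only adjacent stages appear in `PROMOTIONS`, a dataset cannot jump from validated straight to certified, and an engineer cannot certify their own pipeline output.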
Step 4 — Catalog Update and Communication
The certification is recorded in the central data catalog (Unity Catalog in a Databricks environment). All metadata is updated:
-- Record certification in Unity Catalog
ALTER TABLE prod_catalog.customers.customer_master
SET TAGS (
'certification_status' = 'certified',
'certified_by' = 'jane.doe@company.com',
'certified_date' = '2026-06-03',
'recertification_due' = '2027-06-03',
'data_owner' = 'VP Data & Analytics',
'sla_refresh' = 'daily_06h00_UTC'
);
-- Apply the Unity Catalog native endorsement
-- (done via the UI or REST API — visible to all workspace users)
A communication is sent to all data consumers announcing the new certified dataset and, importantly, deprecating any previous ad-hoc versions that analysts may have been using.
What Certification Means in Practice
Certification changes the relationship between data and its consumers at every level of the organization.
For the Executive: A certified dataset is the number they present to the board. When the CFO asks "how many active customers do we have?", the answer comes from a certified dataset — not from an analyst's personal query.
For the Analyst: Certified datasets are the starting point for every report and dashboard. The analyst does not need to verify the data's quality — that work has already been done. This dramatically reduces the time spent on data validation and the number of "but my numbers don't match yours" conversations.
For the Data Scientist: ML models trained on certified data are considered production-eligible. Models trained on non-certified data require additional documentation explaining why certified data was not used.
For Compliance and Audit: When a regulator asks "what data was used to produce this regulatory report?", the answer is a certified dataset with a documented lineage trail. The certification record — who certified it, when, and against what criteria — is part of the audit trail.
Maintaining Certification: Recertification and Revocation
Certification is not permanent. Data pipelines change, source systems are replaced, and business definitions evolve. A dataset certified today may no longer accurately represent the business reality twelve months from now.
Scheduled Recertification
Every certified dataset has a recertification date — typically set at 6 or 12 months. As the date approaches, the Data Steward is automatically notified to re-run the certification checklist. If the dataset passes, certification is renewed. If it fails, the dataset is downgraded to "Recommended" until issues are resolved.
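The scheduling logic is simple date arithmetic. The Python sketch below shows one way to compute the due date, decide when to notify the steward, and pick the resulting status. The function names are hypothetical, and the 30-day notification lead time is an assumption, since the process above only says "as the date approaches".

```python
import calendar
from datetime import date, timedelta


def add_months(start: date, months: int) -> date:
    """Return the same day-of-month `months` later, clamped to the month's end."""
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


def recertification_due(certified_on: date, interval_months: int = 12) -> date:
    """Due date for the next review; the interval is typically 6 or 12 months."""
    return add_months(certified_on, interval_months)


def steward_should_be_notified(today: date, due: date, lead_days: int = 30) -> bool:
    """True once `today` falls within `lead_days` of the due date (or past it)."""
    return today >= due - timedelta(days=lead_days)


def status_after_review(checklist_passed: bool) -> str:
    """Renew on a pass; downgrade to Recommended on a failure."""
    return "certified" if checklist_passed else "recommended"
```

For example, `recertification_due(date(2026, 6, 3))` returns `date(2027, 6, 3)`, the same pair of dates used in the catalog tags shown in Step 4.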
Triggered Recertification
Certain events automatically trigger an out-of-cycle recertification review:
- Source system migration: A new CRM or ERP system means the data extraction logic has changed
- Schema change: Adding or removing columns in the underlying pipeline
- SLA breach: The dataset fails to refresh on time for more than three consecutive cycles
- Quality alert: Automated monitoring detects a significant anomaly (e.g., record count drops 40% in one day)
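These four triggers can be evaluated by a scheduled job after each pipeline run. The Python sketch below is a hedged illustration: the function and parameter names are hypothetical, the 40% drop threshold is taken from the example above rather than from any standard, and "schema change" is reduced to comparing column names only.

```python
def recertification_triggers(
    *,
    source_system_migrated: bool,
    previous_columns: list[str],
    current_columns: list[str],
    consecutive_missed_refreshes: int,
    previous_row_count: int,
    current_row_count: int,
    drop_threshold: float = 0.40,
) -> list[str]:
    """Return the names of the triggers that fired, in the order listed above."""
    fired = []

    if source_system_migrated:
        fired.append("source_system_migration")

    # Columns added or removed; column order alone does not count as a change.
    if set(previous_columns) != set(current_columns):
        fired.append("schema_change")

    # "More than three consecutive cycles" means four or more misses in a row.
    if consecutive_missed_refreshes > 3:
        fired.append("sla_breach")

    # Flag a day-over-day drop in record count at or above the threshold.
    if previous_row_count > 0:
        drop = (previous_row_count - current_row_count) / previous_row_count
        if drop >= drop_threshold:
            fired.append("quality_alert")

    return fired
```

An empty list means the dataset keeps its current status; any non-empty result would open an out-of-cycle review with the steward.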
Revocation
If a serious data quality issue is discovered in a certified dataset, the Data Owner can immediately revoke certification. Revocation is a rare but important mechanism — it signals to all consumers that the dataset should not be used until the issue is resolved. A revocation notice is sent to all known consumers with details of the issue and expected resolution timeline.
Figure: Delta Lake provides the ACID transactions and time travel that make certification auditable.
Why Delta Lake Makes Certification Auditable
A certification process is only as strong as its audit trail. Delta Lake — the open-source storage format underlying Databricks — provides two capabilities that are essential for dataset certification:
ACID Transactions: Every write to a Delta table is atomic. There are no partial updates, no half-loaded files, no data corruption from failed jobs. When a dataset is certified, its state at the moment of certification is guaranteed to be consistent.
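Time Travel: Delta Lake records every change to a table in a transaction log, so earlier versions remain queryable for as long as the log entries and the underlying data files are retained. When an auditor asks "what did this dataset look like on the date we filed our Q1 report?", you can query it: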
-- Reconstruct the exact state of the dataset on a past date
SELECT COUNT(*) as customer_count
FROM prod_catalog.customers.customer_master
TIMESTAMP AS OF '2026-03-31 23:59:59';
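This is not an approximation or a backup restore: it is the state of the table as of that timestamp, reconstructed from the transaction log. Two caveats apply. First, time travel only reaches as far back as the table's retention settings allow. By default, Delta Lake keeps log history for 30 days and VACUUM removes data files that have been unreferenced for more than 7 days, so audit use cases need longer retention or a snapshot archived at certification time. Second, the capability is not unique to Delta Lake: other open table formats, such as Apache Iceberg and Apache Hudi, offer comparable time travel.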
Common Pitfalls to Avoid
Certifying too early: The pressure to have certified data can lead organizations to certify datasets that are not truly ready. A certification granted too quickly, revoked three months later, destroys trust — both in the dataset and in the certification process itself.
No clear ownership: Certification without a named owner is meaningless. If nobody is accountable, nobody takes the process seriously. Every certified dataset must have a specific person — not a team, not a department — who is responsible for its accuracy.
Ignoring recertification: Organizations invest heavily in the initial certification workflow but neglect recertification. A dataset certified three years ago that has never been reviewed is not trustworthy, regardless of its label.
Over-certifying: Not every dataset needs to be certified. Certification is expensive (in time and organizational attention) and should be reserved for datasets that drive decisions. Internal experiment tables, exploratory analysis outputs, and one-time data extracts do not need the full certification treatment.
Conclusion
Dataset certification is the organizational process that transforms a data lake from a repository of files into a trusted business asset. It requires technical rigor (quality gates, automated checks, documented lineage), organizational structure (defined roles, formal workflows, clear accountability), and ongoing commitment (recertification, monitoring, revocation when needed).
The Medallion Architecture provides the technical scaffolding: Bronze holds the raw record of history, Silver ensures technical quality, and Gold serves certified, business-approved data to the organization. But the architecture alone is not enough. The certification process — the roles, the criteria, the workflow — is what gives the Gold layer its meaning.
Organizations that invest in dataset certification tend to see fewer "data trust" issues, faster onboarding of new analysts, and more confident executive decision-making. The cost is real: certification takes time and requires genuine organizational buy-in. The return is a data platform that the entire organization trusts, and that trust is worth the hours invested.
Ready to build a dataset governance and certification framework for your organization? Talk to the LanaCloud team. We specialize in Databricks Lakehouse architecture, Unity Catalog governance, and end-to-end data platform design.