Definition
CRISP-DM (Cross-Industry Standard Process for Data Mining) is a process model developed in the late 1990s by a consortium of European companies (SPSS, NCR, DaimlerChrysler, OHRA) and published as CRISP-DM 1.0 in 2000. It structures data mining and data science projects into six cyclical phases: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment.
Despite its age, CRISP-DM remains the most widely adopted framework: a 2014 KDnuggets poll reported 43% adoption, more than double any alternative. Its strength lies in being both industry-agnostic and tool-agnostic.
The six phases
1. Business Understanding (15-20% of time)
- Define business objectives: what does the organization want to achieve?
- Translate into analytics objectives: what question to answer with data?
- Assess situation: available resources, constraints, risks
- Define success criteria: measurable metrics
2. Data Understanding (20-25% of time)
- Collect initial data: identify and access data sources
- Describe data: volume, format, coverage, data dictionary
- Explore data: descriptive statistics, visualizations, correlations
- Verify data quality: completeness, accuracy, outliers
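The exploration and quality checks above can be sketched with pandas. The DataFrame, column names, and the plausibility threshold below are illustrative assumptions, not part of CRISP-DM itself:

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for the initial data collection
df = pd.DataFrame({
    "age": [25, 31, np.nan, 45, 120],        # 120 looks implausible
    "income": [32000, 48000, 51000, np.nan, 60000],
    "segment": ["A", "B", "B", "A", "A"],
})

# Describe data: volume, types, summary statistics
print(df.shape)
print(df.describe(include="all"))

# Verify data quality: fraction of missing values per column
missing = df.isna().mean()
print(missing)

# Flag values outside a plausible domain range (an assumed rule)
print(df.loc[df["age"] > 100, "age"])
```

Findings from this phase (missing-value rates, suspicious values) feed directly into the Data Preparation plan.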
3. Data Preparation (50-70% of time)
- Select data: choose relevant variables and records
- Clean data: handle missing values, outliers, duplicates
- Construct data: feature engineering, aggregations, derive new features
- Integrate data: merge from different sources
- Format data: transform for modeling tools (normalization, encoding)
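A minimal pandas sketch of these preparation steps, using a hypothetical customer table (all column names and values are invented for illustration):

```python
import pandas as pd

# Hypothetical raw table with a duplicate row and a missing value
raw = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "age": [34, 34, None, 51],
    "plan": ["basic", "basic", "premium", "basic"],
    "monthly_spend": [20.0, 20.0, 80.0, 25.0],
})

# Clean: drop exact duplicates, impute missing ages with the median
clean = raw.drop_duplicates()
clean = clean.assign(age=clean["age"].fillna(clean["age"].median()))

# Construct: derive a new feature (annualized spend)
clean = clean.assign(annual_spend=clean["monthly_spend"] * 12)

# Format: one-hot encode the categorical column for modeling tools
model_ready = pd.get_dummies(clean, columns=["plan"])
print(model_ready)
```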
4. Modeling (10-20% of time)
- Select modeling technique: regression, classification, clustering, etc.
- Design test plan: train/validation/test split, cross-validation
- Build model: train algorithms and tune their hyperparameters
- Assess model: accuracy, precision, recall, F1, AUC, etc.
- Iterate: return to data preparation if performance insufficient
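These modeling steps can be illustrated with scikit-learn on synthetic data; the estimator choice (logistic regression) and the generated dataset are arbitrary placeholders:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic data stands in for the prepared dataset
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Design test plan: hold out a test set, cross-validate on the rest
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000)

# Cross-validation estimates generalization before touching the test set
cv_scores = cross_val_score(model, X_train, y_train, cv=5)

# Build and assess the model on the held-out test set
model.fit(X_train, y_train)
test_acc = accuracy_score(y_test, model.predict(X_test))
print(f"CV mean={cv_scores.mean():.3f}  test accuracy={test_acc:.3f}")
```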
5. Evaluation (5-10% of time)
- Evaluate results: does the model meet business success criteria?
- Review process: check whether any steps were skipped or need revisiting
- Determine next steps: deployment, new iterations, or project termination
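A toy sketch of this evaluation gate, assuming hypothetical success criteria were fixed during Business Understanding (all numbers are illustrative):

```python
# Hypothetical business success criteria from Business Understanding
criteria = {"recall": 0.80, "precision": 0.60}

# Metrics measured in the Modeling phase (illustrative values)
measured = {"recall": 0.85, "precision": 0.55}

# Evaluate results: does the model meet every business criterion?
failed = [m for m, threshold in criteria.items() if measured[m] < threshold]

# Determine next steps based on the outcome
next_step = "deploy" if not failed else f"iterate on: {', '.join(failed)}"
print(next_step)
```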
6. Deployment (5-10% of time)
- Plan deployment: how to put into production (batch, real-time, embedded)
- Plan monitoring: how to monitor performance and data drift
- Produce final report: document findings and recommendations
- Review project: lessons learned for future projects
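Monitoring for data drift can be as simple as comparing live feature statistics against a training-time baseline. This sketch assumes a single numeric feature and an arbitrary threshold; real deployments typically use dedicated drift tests per feature:

```python
import statistics

# Baseline statistics captured at training time (hypothetical feature)
baseline_mean = 50.0
baseline_stdev = 10.0

def drift_alert(live_values, threshold=2.0):
    """Flag drift when the live mean moves more than `threshold`
    baseline standard deviations away from the training mean."""
    shift = abs(statistics.mean(live_values) - baseline_mean) / baseline_stdev
    return shift > threshold

# Incoming production data has shifted upward, so the alert fires
print(drift_alert([78, 82, 75, 90, 85]))
```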
Iterative nature
CRISP-DM is not waterfall. The arrows in the circular diagram indicate you can return to previous phases:
- Modeling reveals data quality issues → back to Data Preparation
- Evaluation shows insufficient model → back to Modeling or Data Understanding
- Deployment discovers edge cases → back to Data Preparation or Business Understanding
The outer cycle (from Deployment to Business Understanding) represents successive projects that refine the solution.
Modern adaptations
CRISP-ML(Q): 2020 extension for production ML, adds monitoring, maintenance, and quality assurance phases. Addresses ML-specific concerns like model drift, retraining, and A/B testing.
Agile Data Science: integration of CRISP-DM with Agile sprints. Each sprint executes mini CRISP-DM cycles, delivering value increments. Favored in teams adopting DataOps.
TDSP (Team Data Science Process) by Microsoft: more prescriptive version with templates, checklists, and Azure-specific tooling. Emphasis on collaboration and reproducibility.
Practical considerations
Data Preparation dominates: 50-70% of time goes into this phase. Underestimating this effort is a common cause of project delays. Investing in upfront data quality (data governance, catalogs) reduces this overhead.
Business Understanding is critical: projects that start from “we have data, let’s find insights” (data-first) fail more often than those that start from a business problem. CRISP-DM forces the project to begin with business understanding.
Deployment is often overlooked: many projects end with Jupyter notebooks or PowerPoint reports. CRISP-DM is a reminder that value is realized only through deployment and user adoption.
Skill gap: CRISP-DM requires both technical skills (modeling, data engineering) and business skills (domain knowledge, stakeholder management). Junior data scientists tend to over-focus on modeling.
Alternatives and comparisons
SEMMA (Sample, Explore, Modify, Model, Assess): SAS’s process model; more tool-specific, with less emphasis on business understanding.
KDD (Knowledge Discovery in Databases): academic predecessor of CRISP-DM, more theoretical and less practical.
Agile/Lean: complementary frameworks. CRISP-DM defines “what to do”, Agile defines “how to organize the team”. Many orgs combine CRISP-DM with sprints and retrospectives.
Common misconceptions
“CRISP-DM is waterfall”
No. The phases are iterative. You regularly return to previous phases when discovering new information. The circular diagram represents this cyclicality.
“CRISP-DM is obsolete, superseded by Agile”
False. CRISP-DM and Agile operate at different levels. CRISP-DM structures analytical workflow, Agile structures team and delivery. They complement each other.
“CRISP-DM ignores production and monitoring”
No. The Deployment phase explicitly includes monitoring and maintenance planning. Many projects neglect this phase, but the framework includes it.
“CRISP-DM is only for classical data mining, not for deep learning”
Not true. The principles (understand business, prepare data, model, evaluate, deploy) apply to any ML approach, including deep learning. CRISP-ML(Q) modernizes specific details.
Related terms
- DataOps: methodology to accelerate CRISP-DM through automation
- Agile Software Development: framework to organize iterative sprints
- LLM (Large Language Model): modern ML systems whose development still follows the understand-prepare-model-evaluate-deploy lifecycle that CRISP-DM describes
- DevOps: parallel discipline for software deployment related to CRISP-DM deployment phase
Sources
- Chapman, P. et al. (2000). CRISP-DM 1.0: Step-by-step data mining guide
- Provost, F. & Fawcett, T. (2013). Data Science for Business
- KDnuggets (2014). “Poll: What main methodology are you using for your analytics, data mining, or data science projects?”
- Studer, S. et al. (2020). “Towards CRISP-ML(Q): A Machine Learning Process Model with Quality Assurance Methodology”