Definition
CRISP-DM (Cross-Industry Standard Process for Data Mining) is a process model developed in the late 1990s by a consortium of European companies (SPSS, NCR, DaimlerChrysler, OHRA) and published as CRISP-DM 1.0 in 2000. It structures data mining and data science projects into six cyclical phases: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment.
Despite its age, CRISP-DM remains the most widely adopted framework: a 2014 KDnuggets poll reported 43% adoption, more than double any alternative. Its strength lies in being both industry-agnostic and tool-agnostic.
The six phases
1. Business Understanding (15-20% of time)
- Define business objectives: what does the organization want to achieve?
- Translate into analytics objectives: what question to answer with data?
- Assess situation: available resources, constraints, risks
- Define success criteria: measurable metrics
2. Data Understanding (20-25% of time)
- Collect initial data: identify and access data sources
- Describe data: volume, format, coverage, data dictionary
- Explore data: descriptive statistics, visualizations, correlations
- Verify data quality: completeness, accuracy, outliers
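The exploration and quality checks above can be sketched with pandas. The DataFrame, column names, and the plausibility threshold below are illustrative assumptions, not part of CRISP-DM itself:

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for the initial data collection
df = pd.DataFrame({
    "age": [25, 31, np.nan, 45, 120],        # 120 looks implausible
    "income": [32000, 48000, 51000, np.nan, 60000],
    "segment": ["A", "B", "B", "A", "A"],
})

# Describe data: volume, types, summary statistics
print(df.shape)
print(df.describe(include="all"))

# Verify data quality: fraction of missing values per column
missing = df.isna().mean()
print(missing)

# Flag values outside a plausible domain range (an assumed rule)
print(df.loc[df["age"] > 100, "age"])
```

Findings from this phase (missing-value rates, suspicious values) feed directly into the Data Preparation plan.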
3. Data Preparation (50-70% of time)
- Select data: choose relevant variables and records
- Clean data: handle missing values, outliers, duplicates
- Construct data: feature engineering, aggregations, derive new features
- Integrate data: merge from different sources
- Format data: transform for modeling tools (normalization, encoding)
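A minimal pandas sketch of these preparation steps, using a hypothetical customer table (all column names and values are invented for illustration):

```python
import pandas as pd

# Hypothetical raw table with a duplicate row and a missing value
raw = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "age": [34, 34, None, 51],
    "plan": ["basic", "basic", "premium", "basic"],
    "monthly_spend": [20.0, 20.0, 80.0, 25.0],
})

# Clean: drop exact duplicates, impute missing ages with the median
clean = raw.drop_duplicates()
clean = clean.assign(age=clean["age"].fillna(clean["age"].median()))

# Construct: derive a new feature (annualized spend)
clean = clean.assign(annual_spend=clean["monthly_spend"] * 12)

# Format: one-hot encode the categorical column for modeling tools
model_ready = pd.get_dummies(clean, columns=["plan"])
print(model_ready)
```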
4. Modeling (10-20% of time)
- Select modeling technique: regression, classification, clustering, etc.
- Design test plan: train/validation/test split, cross-validation
- Build model: train algorithms and tune their hyperparameters
- Assess model: accuracy, precision, recall, F1, AUC, etc.
- Iterate: return to data preparation if performance insufficient
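These modeling steps can be illustrated with scikit-learn on synthetic data; the estimator choice (logistic regression) and the generated dataset are arbitrary placeholders:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic data stands in for the prepared dataset
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Design test plan: hold out a test set, cross-validate on the rest
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000)

# Cross-validation estimates generalization before touching the test set
cv_scores = cross_val_score(model, X_train, y_train, cv=5)

# Build and assess the model on the held-out test set
model.fit(X_train, y_train)
test_acc = accuracy_score(y_test, model.predict(X_test))
print(f"CV mean={cv_scores.mean():.3f}  test accuracy={test_acc:.3f}")
```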
5. Evaluation (5-10% of time)
- Evaluate results: does the model meet business success criteria?
- Review process: check whether any steps were skipped or need revisiting
- Determine next steps: deployment, new iterations, or project termination
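A toy sketch of this evaluation gate, assuming hypothetical success criteria were fixed during Business Understanding (all numbers are illustrative):

```python
# Hypothetical business success criteria from Business Understanding
criteria = {"recall": 0.80, "precision": 0.60}

# Metrics measured in the Modeling phase (illustrative values)
measured = {"recall": 0.85, "precision": 0.55}

# Evaluate results: does the model meet every business criterion?
failed = [m for m, threshold in criteria.items() if measured[m] < threshold]

# Determine next steps based on the outcome
next_step = "deploy" if not failed else f"iterate on: {', '.join(failed)}"
print(next_step)
```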
6. Deployment (5-10% of time)
- Plan deployment: how to put into production (batch, real-time, embedded)
- Plan monitoring: how to monitor performance and data drift
- Produce final report: document findings and recommendations
- Review project: lessons learned for future projects
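Monitoring for data drift can be as simple as comparing live feature statistics against a training-time baseline. This sketch assumes a single numeric feature and an arbitrary threshold; real deployments typically use dedicated drift tests per feature:

```python
import statistics

# Baseline statistics captured at training time (hypothetical feature)
baseline_mean = 50.0
baseline_stdev = 10.0

def drift_alert(live_values, threshold=2.0):
    """Flag drift when the live mean moves more than `threshold`
    baseline standard deviations away from the training mean."""
    shift = abs(statistics.mean(live_values) - baseline_mean) / baseline_stdev
    return shift > threshold

# Incoming production data has shifted upward, so the alert fires
print(drift_alert([78, 82, 75, 90, 85]))
```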
Iterative nature
CRISP-DM is not waterfall. The arrows in the circular diagram indicate you can return to previous phases:
- Modeling reveals data quality issues → back to Data Preparation
- Evaluation shows insufficient model → back to Modeling or Data Understanding
- Deployment discovers edge cases → back to Data Preparation or Business Understanding
The outer cycle (from Deployment to Business Understanding) represents successive projects that refine the solution.
Modern adaptations
CRISP-ML(Q): 2020 extension for production ML, adds monitoring, maintenance, and quality assurance phases. Addresses ML-specific concerns like model drift, retraining, and A/B testing.
Agile Data Science: integration of CRISP-DM with Agile sprints. Each sprint executes mini CRISP-DM cycles, delivering value increments. Favored in teams adopting DataOps.
TDSP (Team Data Science Process) by Microsoft: more prescriptive version with templates, checklists, and Azure-specific tooling. Emphasis on collaboration and reproducibility.
Practical considerations
Data Preparation dominates: 50-70% of time goes into this phase. Underestimating this effort is a common cause of project delays. Investing in upfront data quality (data governance, catalogs) reduces this overhead.
Business Understanding is critical: projects that start from “we have data, let’s find insights” (data-first) fail more often than those that start from a business problem. CRISP-DM forces the project to begin with business understanding.
Deployment is often overlooked: many projects end with Jupyter notebooks or PowerPoint reports. CRISP-DM is a reminder that value is realized only through deployment and user adoption.
Skill gap: CRISP-DM requires both technical skills (modeling, data engineering) and business skills (domain knowledge, stakeholder management). Junior data scientists tend to over-focus on modeling.
Alternatives and comparisons
SEMMA (Sample, Explore, Modify, Model, Assess): SAS’s process model; more tool-specific, with less emphasis on business understanding.
KDD (Knowledge Discovery in Databases): academic predecessor of CRISP-DM, more theoretical and less practical.
Agile/Lean: complementary frameworks. CRISP-DM defines “what to do”, Agile defines “how to organize the team”. Many orgs combine CRISP-DM with sprints and retrospectives.
Common misconceptions
“CRISP-DM is waterfall”
No. The phases are iterative. You regularly return to previous phases when discovering new information. The circular diagram represents this cyclicality.
“CRISP-DM is obsolete, superseded by Agile”
False. CRISP-DM and Agile operate at different levels. CRISP-DM structures analytical workflow, Agile structures team and delivery. They complement each other.
“CRISP-DM ignores production and monitoring”
No. The Deployment phase explicitly includes monitoring and maintenance planning. Many projects neglect this phase, but the framework includes it.
“CRISP-DM is only for classical data mining, not for deep learning”
Not true. The principles (understand business, prepare data, model, evaluate, deploy) apply to any ML approach, including deep learning. CRISP-ML(Q) modernizes specific details.
Related terms
- DataOps: methodology to accelerate CRISP-DM through automation
- Agile Software Development: framework to organize iterative sprints
- LLM (Large Language Model): modern ML systems whose development still follows the understand-prepare-model-evaluate-deploy lifecycle that CRISP-DM describes
- DevOps: parallel discipline for software deployment related to CRISP-DM deployment phase
Sources
- Chapman, P. et al. (2000). CRISP-DM 1.0: Step-by-step data mining guide
- Provost, F. & Fawcett, T. (2013). Data Science for Business
- KDnuggets (2014). “Poll: What main methodology are you using for your analytics, data mining, or data science projects?”
- Studer, S. et al. (2020). “Towards CRISP-ML(Q): A Machine Learning Process Model with Quality Assurance Methodology”