Data Mining And Predictive Analysis

Learn the data mining and predictive analysis essentials with hands-on techniques that turn raw data into actionable insights.

(DM-PA.AE1) / ISBN : 978-1-64459-374-5

About This Course

This Data Mining and Predictive Analytics course cuts through the noise to teach you the practical skills you need to analyze data and make accurate predictions. You’ll learn how to apply real-world data mining techniques, work with machine learning (ML) models, and extract insights that drive smarter decisions. We break down complex concepts into straightforward lessons so that you can uncover the most profitable nuggets of knowledge in your data while avoiding the pitfalls that may cost your company millions of dollars.

Skills You’ll Get

  • Understand how to gather, clean, and organize raw data for analysis
  • Capitalize on core methods like classification, clustering, and association rule mining
  • Build predictive models using ML algorithms 
  • Represent data insights using visual elements for better interpretation and decision-making 
  • Apply statistical methods to analyze and interpret data trends
  • Understand and implement ML algorithms for predictive tasks
  • Learn how to select and create features to improve model accuracy 
  • Evaluate model performance and fine-tune for better accuracy

1

Preface

  • What is Data Mining? What is Predictive Analytics?
  • Why is this Course Needed?
  • Who Will Benefit from this Course?
  • Danger! Data Mining is Easy to do Badly
  • “White-Box” Approach
  • Algorithm Walk-Throughs
  • Exciting New Topics
  • The R Zone
  • Appendix: Data Summarization and Visualization
  • The Case Study: Bringing it all Together
  • How the Course is Structured
2

An Introduction to Data Mining and Predictive Analytics

  • What is Data Mining? What Is Predictive Analytics?
  • Wanted: Data Miners
  • The Need For Human Direction of Data Mining
  • The Cross-Industry Standard Process for Data Mining: CRISP-DM
  • Fallacies of Data Mining
  • What Tasks can Data Mining Accomplish?
  • The R Zone
  • R References
  • Exercises
3

Data Preprocessing

  • Why do We Need to Preprocess the Data?
  • Data Cleaning
  • Handling Missing Data
  • Identifying Misclassifications
  • Graphical Methods for Identifying Outliers
  • Measures of Center and Spread
  • Data Transformation
  • Min–Max Normalization
  • Z-Score Standardization
  • Decimal Scaling
  • Transformations to Achieve Normality
  • Numerical Methods for Identifying Outliers
  • Flag Variables
  • Transforming Categorical Variables into Numerical Variables
  • Binning Numerical Variables
  • Reclassifying Categorical Variables
  • Adding an Index Field
  • Removing Variables that are not Useful
  • Variables that Should Probably not be Removed
  • Removal of Duplicate Records
  • A Word About ID Fields
  • The R Zone
  • R Reference
  • Exercises
4

Exploratory Data Analysis

  • Hypothesis Testing Versus Exploratory Data Analysis
  • Getting to Know The Data Set
  • Exploring Categorical Variables
  • Exploring Numeric Variables
  • Exploring Multivariate Relationships
  • Selecting Interesting Subsets of the Data for Further Investigation
  • Using EDA to Uncover Anomalous Fields
  • Binning Based on Predictive Value
  • Deriving New Variables: Flag Variables
  • Deriving New Variables: Numerical Variables
  • Using EDA to Investigate Correlated Predictor Variables
  • Summary of Our EDA
  • The R Zone
  • R References
  • Exercises
5

Dimension-Reduction Methods

  • Need for Dimension-Reduction in Data Mining
  • Principal Components Analysis
  • Applying PCA to the Houses Data Set
  • How Many Components Should We Extract?
  • Profiling the Principal Components
  • Communalities
  • Validation of the Principal Components
  • Factor Analysis
  • Applying Factor Analysis to the Adult Data Set
  • Factor Rotation
  • User-Defined Composites
  • An Example of a User-Defined Composite
  • The R Zone
  • R References
  • Exercises
6

Univariate Statistical Analysis

  • Data Mining Tasks in Discovering Knowledge in Data
  • Statistical Approaches to Estimation and Prediction
  • Statistical Inference
  • How Confident are We in Our Estimates?
  • Confidence Interval Estimation of the Mean
  • How to Reduce the Margin of Error
  • Confidence Interval Estimation of the Proportion
  • Hypothesis Testing for the Mean
  • Assessing The Strength of Evidence Against The Null Hypothesis
  • Using Confidence Intervals to Perform Hypothesis Tests
  • Hypothesis Testing for The Proportion
  • Reference
  • The R Zone
  • R Reference
  • Exercises
7

Multivariate Statistics

  • Two-Sample t-Test for Difference in Means
  • Two-Sample Z-Test for Difference in Proportions
  • Test for the Homogeneity of Proportions
  • Chi-Square Test for Goodness of Fit of Multinomial Data
  • Analysis of Variance
  • Reference
  • The R Zone
  • R Reference
  • Exercises
8

Preparing to Model the Data

  • Supervised Versus Unsupervised Methods
  • Statistical Methodology and Data Mining Methodology
  • Cross-Validation
  • Overfitting
  • Bias–Variance Trade-Off
  • Balancing The Training Data Set
  • Establishing Baseline Performance
  • The R Zone
  • R Reference
  • Exercises
9

Simple Linear Regression

  • An Example of Simple Linear Regression
  • Dangers of Extrapolation
  • How Useful is the Regression? The Coefficient of Determination, r²
  • Standard Error of the Estimate, s
  • Correlation Coefficient r
  • ANOVA Table for Simple Linear Regression
  • Outliers, High Leverage Points, and Influential Observations
  • Population Regression Equation
  • Verifying The Regression Assumptions
  • Inference in Regression
  • t-Test for the Relationship Between x and y
  • Confidence Interval for the Slope of the Regression Line
  • Confidence Interval for the Correlation Coefficient ρ
  • Confidence Interval for the Mean Value of y Given x
  • Prediction Interval for a Randomly Chosen Value of y Given x
  • Transformations to Achieve Linearity
  • Box–Cox Transformations
  • The R Zone
  • R References
  • Exercises
10

Multiple Regression and Model Building

  • An Example of Multiple Regression
  • The Population Multiple Regression Equation
  • Inference in Multiple Regression
  • Regression With Categorical Predictors, Using Indicator Variables
  • Adjusting R²: Penalizing Models For Including Predictors That Are Not Useful
  • Sequential Sums of Squares
  • Multicollinearity
  • Variable Selection Methods
  • Gas Mileage Data Set
  • An Application of Variable Selection Methods
  • Using the Principal Components as Predictors in Multiple Regression
  • The R Zone
  • R References
  • Exercises
11

k-Nearest Neighbor Algorithm

  • Classification Task
  • k-Nearest Neighbor Algorithm
  • Distance Function
  • Combination Function
  • Quantifying Attribute Relevance: Stretching the Axes
  • Database Considerations
  • k-Nearest Neighbor Algorithm for Estimation and Prediction
  • Choosing k
  • Application of k-Nearest Neighbor Algorithm Using IBM/SPSS Modeler
  • The R Zone
  • R References
  • Exercises
12

Decision Trees

  • What is a Decision Tree?
  • Requirements for Using Decision Trees
  • Classification and Regression Trees
  • C4.5 Algorithm
  • Decision Rules
  • Comparison of the C5.0 and CART Algorithms Applied to Real Data
  • The R Zone
  • R References
  • Exercises
13

Neural Networks

  • Input and Output Encoding
  • Neural Networks for Estimation and Prediction
  • Simple Example of a Neural Network
  • Sigmoid Activation Function
  • Back-Propagation
  • Gradient-Descent Method
  • Back-Propagation Rules
  • Example of Back-Propagation
  • Termination Criteria
  • Learning Rate
  • Momentum Term
  • Sensitivity Analysis
  • Application of Neural Network Modeling
  • The R Zone
  • R References
  • Exercises
14

Logistic Regression

  • Simple Example of Logistic Regression
  • Maximum Likelihood Estimation
  • Interpreting Logistic Regression Output
  • Inference: Are the Predictors Significant?
  • Odds Ratio and Relative Risk
  • Interpreting Logistic Regression for a Dichotomous Predictor
  • Interpreting Logistic Regression for a Polychotomous Predictor
  • Interpreting Logistic Regression for a Continuous Predictor
  • Assumption of Linearity
  • Zero-Cell Problem
  • Multiple Logistic Regression
  • Introducing Higher Order Terms to Handle Nonlinearity
  • Validating the Logistic Regression Model
  • WEKA: Hands-On Analysis Using Logistic Regression
  • The R Zone
  • R References
  • Exercises
15

Naïve Bayes and Bayesian Networks

  • Bayesian Approach
  • Maximum A Posteriori (MAP) Classification
  • Posterior Odds Ratio
  • Balancing The Data
  • Naïve Bayes Classification
  • Interpreting The Log Posterior Odds Ratio
  • Zero-Cell Problem
  • Numeric Predictors for Naïve Bayes Classification
  • WEKA: Hands-on Analysis Using Naïve Bayes
  • Bayesian Belief Networks
  • Clothing Purchase Example
  • Using The Bayesian Network to Find Probabilities
  • The R Zone
  • R References
  • Exercises
16

Model Evaluation Techniques

  • Model Evaluation Techniques for the Description Task
  • Model Evaluation Techniques for the Estimation and Prediction Tasks
  • Model Evaluation Measures for the Classification Task
  • Accuracy and Overall Error Rate
  • Sensitivity and Specificity
  • False-Positive Rate and False-Negative Rate
  • Proportions of True Positives, True Negatives, False Positives, and False Negatives
  • Misclassification Cost Adjustment to Reflect Real-World Concerns
  • Decision Cost/Benefit Analysis
  • Lift Charts and Gains Charts
  • Interweaving Model Evaluation with Model Building
  • Confluence of Results: Applying a Suite of Models
  • The R Zone
  • R References
  • Exercises
  • Hands-On Analysis
17

Cost-Benefit Analysis Using Data-Driven Costs

  • Decision Invariance Under Row Adjustment
  • Positive Classification Criterion
  • Demonstration Of The Positive Classification Criterion
  • Constructing The Cost Matrix
  • Decision Invariance Under Scaling
  • Direct Costs and Opportunity Costs
  • Case Study: Cost-Benefit Analysis Using Data-Driven Misclassification Costs
  • Rebalancing as a Surrogate for Misclassification Costs
  • The R Zone
  • R References
  • Exercises
18

Cost-Benefit Analysis for Trinary and k-Nary Classification Models

  • Classification Evaluation Measures for a Generic Trinary Target
  • Application of Evaluation Measures for Trinary Classification to the Loan Approval Problem
  • Data-Driven Cost-Benefit Analysis for Trinary Loan Classification Problem
  • Comparing CART Models With and Without Data-Driven Misclassification Costs
  • Classification Evaluation Measures for a Generic k-Nary Target
  • Example of Evaluation Measures and Data-Driven Misclassification Costs for k-Nary Classification
  • The R Zone
  • R References
  • Exercises
19

Graphical Evaluation of Classification Models

  • Review of Lift Charts and Gains Charts
  • Lift Charts and Gains Charts Using Misclassification Costs
  • Response Charts
  • Profits Charts
  • Return on Investment (ROI) Charts
  • The R Zone
  • R References
  • Exercises
  • Hands-On Exercises
20

Hierarchical and k-Means Clustering

  • The Clustering Task
  • Hierarchical Clustering Methods
  • Single-Linkage Clustering
  • Complete-Linkage Clustering
  • k-Means Clustering
  • Example of k-Means Clustering at Work
  • Behavior of MSB, MSE, and Pseudo-F as the k-Means Algorithm Proceeds
  • Application of k-Means Clustering Using SAS Enterprise Miner
  • Using Cluster Membership to Predict Churn
  • The R Zone
  • R References
  • Exercises
  • Hands-On Analysis
21

Kohonen Networks

  • Self-Organizing Maps
  • Kohonen Networks
  • Example of a Kohonen Network Study
  • Cluster Validity
  • Application of Clustering Using Kohonen Networks
  • Interpreting The Clusters
  • Using Cluster Membership as Input to Downstream Data Mining Models
  • The R Zone
  • R References
  • Exercises
22

BIRCH Clustering

  • Rationale for BIRCH Clustering
  • Cluster Features
  • Cluster Feature Tree
  • Phase 1: Building The CF Tree
  • Phase 2: Clustering The Sub-Clusters
  • Example of BIRCH Clustering, Phase 1: Building The CF Tree
  • Example of BIRCH Clustering, Phase 2: Clustering The Sub-Clusters
  • Evaluating The Candidate Cluster Solutions
  • Case Study: Applying BIRCH Clustering to The Bank Loans Data Set
  • The R Zone
  • R References
  • Exercises
23

Measuring Cluster Goodness

  • Rationale for Measuring Cluster Goodness
  • The Silhouette Method
  • Silhouette Example
  • Silhouette Analysis of the IRIS Data Set
  • The Pseudo-F Statistic
  • Example of the Pseudo-F Statistic
  • Pseudo-F Statistic Applied to the IRIS Data Set
  • Cluster Validation
  • Cluster Validation Applied to the Loans Data Set
  • The R Zone
  • R References
  • Exercises
24

Association Rules

  • Affinity Analysis and Market Basket Analysis
  • Support, Confidence, Frequent Itemsets, and the A Priori Property
  • How Does The A Priori Algorithm Work (Part 1)? Generating Frequent Itemsets
  • How Does The A Priori Algorithm Work (Part 2)? Generating Association Rules
  • Extension From Flag Data to General Categorical Data
  • Information-Theoretic Approach: Generalized Rule Induction Method
  • Association Rules are Easy to do Badly
  • How Can We Measure the Usefulness of Association Rules?
  • Do Association Rules Represent Supervised or Unsupervised Learning?
  • Local Patterns Versus Global Models
  • The R Zone
  • R References
  • Exercises
25

Segmentation Models

  • The Segmentation Modeling Process
  • Segmentation Modeling Using EDA to Identify the Segments
  • Segmentation Modeling using Clustering to Identify the Segments
  • The R Zone
  • R References
  • Exercises
26

Ensemble Methods: Bagging and Boosting

  • Rationale for Using an Ensemble of Classification Models
  • Bias, Variance, and Noise
  • When to Apply, and Not to Apply, Bagging
  • Bagging
  • Boosting
  • Application of Bagging and Boosting Using IBM/SPSS Modeler
  • References
  • The R Zone
  • R Reference
  • Exercises
27

Model Voting and Propensity Averaging

  • Simple Model Voting
  • Alternative Voting Methods
  • Model Voting Process
  • An Application of Model Voting
  • What is Propensity Averaging?
  • Propensity Averaging Process
  • An Application of Propensity Averaging
  • The R Zone
  • R References
  • Exercises
  • Hands-On Analysis
28

Genetic Algorithms

  • Introduction To Genetic Algorithms
  • Basic Framework of a Genetic Algorithm
  • Simple Example of a Genetic Algorithm at Work
  • Modifications and Enhancements: Selection
  • Modifications and Enhancements: Crossover
  • Genetic Algorithms for Real-Valued Variables
  • Using Genetic Algorithms to Train a Neural Network
  • WEKA: Hands-On Analysis Using Genetic Algorithms
  • The R Zone
  • R References
  • Exercises
29

Imputation of Missing Data

  • Need for Imputation of Missing Data
  • Imputation of Missing Data: Continuous Variables
  • Standard Error of the Imputation
  • Imputation of Missing Data: Categorical Variables
  • Handling Patterns in Missingness
  • Reference
  • The R Zone
  • R References
30

Case Study, Part 1: Business Understanding, Data Preparation, and EDA

  • Cross-Industry Standard Process for Data Mining
  • Business Understanding Phase
  • Data Understanding Phase, Part 1: Getting a Feel for the Data Set
  • Data Preparation Phase
  • Data Understanding Phase, Part 2: Exploratory Data Analysis
31

Case Study, Part 2: Clustering and Principal Components Analysis

  • Partitioning the Data
  • Developing the Principal Components
  • Validating the Principal Components
  • Profiling the Principal Components
  • Choosing the Optimal Number of Clusters Using BIRCH Clustering
  • Choosing the Optimal Number of Clusters Using k-Means Clustering
  • Application of k-Means Clustering
  • Validating the Clusters
  • Profiling the Clusters
32

Case Study, Part 3: Modeling And Evaluation For Performance And Interpretability

  • Do You Prefer The Best Model Performance, Or A Combination Of Performance And Interpretability?
  • Modeling And Evaluation Overview
  • Cost-Benefit Analysis Using Data-Driven Costs
  • Variables to be Input To The Models
  • Establishing The Baseline Model Performance
  • Models That Use Misclassification Costs
  • Models That Need Rebalancing as a Surrogate for Misclassification Costs
  • Combining Models Using Voting and Propensity Averaging
  • Interpreting The Most Profitable Model
33

Case Study, Part 4: Modeling and Evaluation for High Performance Only

  • Variables to be Input to the Models
  • Models that use Misclassification Costs
  • Models that Need Rebalancing as a Surrogate for Misclassification Costs
  • Combining Models using Voting and Propensity Averaging
  • Lessons Learned
  • Conclusions
A

Appendix A

  • Data Summarization and Visualization
  • Part 1: Summarization 1: Building Blocks Of Data Analysis
  • Part 2: Visualization: Graphs and Tables For Summarizing And Organizing Data
  • Part 3: Summarization 2: Measures Of Center, Variability, and Position
  • Part 4: Summarization And Visualization Of Bivariate Relationships

1

An Introduction to Data Mining and Predictive Analytics

  • Analyzing a Dataset
2

Data Preprocessing

  • Handling Missing Data
  • Creating a Histogram
  • Creating a Scatterplot
  • Creating a Normal Q-Q Plot
  • Creating Indicator Variables
3

Exploratory Data Analysis

  • Analyzing the churn Dataset
  • Exploring Categorical Variables
  • Exploring Numeric Variables
  • Exploring Multivariate Relationships
  • Investigating Correlation Values and p-values in Matrix Form
4

Dimension-Reduction Methods

  • Creating a Scree Plot
  • Profiling the Principal Components
  • Calculating Communalities
  • Validating the Principal Components
  • Applying Factor Analysis to a Dataset
5

Univariate Statistical Analysis

  • Estimating the Confidence Interval for the Mean
  • Estimating the Confidence Interval of the Population Proportion
6

Multivariate Statistics

  • Performing a t-test for Finding the Difference in Means
  • Performing a z-test for Finding the Difference in Proportions
  • Performing a Chi-Square Test for Homogeneity of Proportions
  • Performing a Chi-Square Test for Goodness of Fit of Multinomial Data
  • Performing an Analysis of Variance
7

Preparing to Model the Data

  • Balancing the Training and Testing Datasets
8

Simple Linear Regression

  • Plotting Data with a Regression Line
  • Measuring the Goodness of Fit of the Regression
  • Performing Regression with Other Hikers
  • Verifying the Regression Assumptions
  • Determining Prediction and Confidence Intervals
  • Assessing Normality in Scrabble
  • Applying Box-Cox Transformations
9

Multiple Regression and Model Building

  • Approximating the Relationship between the Variables in a Scatterplot
  • Identifying Confidence Intervals
  • Creating a Dot Plot
  • Determining the Sequential Sums of Squares
  • Analyzing Multicollinearity
  • Applying the Best Subsets Procedure in a Regression Model
  • Applying the Stepwise Selection Procedure in a Regression Model
  • Applying the Backward Elimination Procedure
  • Applying the Forward Selection Procedure
  • Using the Principal Components as Predictors in Multiple Regression
10

k-Nearest Neighbor Algorithm

  • Running KNN
  • Calculating the Euclidean Distance
11

Decision Trees

  • Plotting a Classification Tree
12

Neural Networks

  • Running a Neural Network
13

Logistic Regression

  • Creating a Plot for Logistic Regression
  • Interpreting Logistic Regression and Odds Ratio for a Dichotomous Predictor
14

Naïve Bayes and Bayesian Networks

  • Calculating Posterior Odds Ratio
  • Calculating the Log Posterior Odds Ratio
  • Calculating the Numeric Predictors for Naïve Bayes Classification
15

Model Evaluation Techniques

  • Estimating Costs for Benefit Analysis
16

Cost-Benefit Analysis Using Data-Driven Costs

  • Analyzing Cost-Benefit Using Data-Driven Misclassification Costs
17

Cost-Benefit Analysis for Trinary and k-Nary Classification Models

  • Analyzing the Cost-Benefit for the Trinary Loan Classification Problem
18

Hierarchical and k-Means Clustering

  • Using Single-linkage Clustering
  • Using Complete-linkage Clustering
  • Finding Clusters in Data
19

Kohonen Networks

  • Using a 3x2 Kohonen Network
  • Interpreting Clusters
20

Measuring Cluster Goodness

  • Plotting Silhouette Values of a Dataset
  • Applying Cluster Validation to a Dataset
21

Association Rules

  • Viewing the Output Sorted by Support
22

Segmentation Models

  • Predicting Income Using Caps and No Caps Groups
23

Genetic Algorithms

  • Using Genetic Algorithms to Train a Neural Network

Any questions?
Check out the FAQs

Check out our FAQs for more information on data mining and predictive analytics courses.


Data mining is the process of discovering patterns and relationships in large datasets using statistical and ML techniques. It helps in extracting useful information from raw data. 
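As a rough illustration of what "discovering patterns" can mean in practice, the sketch below counts how often pairs of items appear together in a handful of shopping baskets. The transactions and the `pair_support` helper are made up for this example and are not taken from the course labs; real projects would typically rely on a dedicated association-rule library.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy transactions; not data from the course.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "eggs"},
]

def pair_support(baskets):
    """Return the fraction of baskets that contain each item pair."""
    counts = Counter()
    for basket in baskets:
        # Sort first so each pair has a single canonical ordering.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n / len(baskets) for pair, n in counts.items()}

support = pair_support(transactions)

# The pair with the highest support is the strongest co-occurrence pattern.
top_pair, top_support = max(support.items(), key=lambda kv: kv[1])
print(top_pair, top_support)
```

In this toy data, bread and milk appear together in three of the five baskets, which is the kind of simple pattern that association rule mining generalizes.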

Predictive analytics uses historical data, statistical algorithms, and ML to predict future outcomes. It builds models to forecast trends and behaviors, helping in decision-making. 
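To make the idea of forecasting from historical data concrete, here is a minimal sketch that fits a least-squares trend line to a few past observations and extrapolates one step ahead. The monthly sales figures and the `fit_line` helper are hypothetical and chosen only for illustration; a simple straight-line fit stands in for the richer models a real analysis would use.

```python
def fit_line(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical historical data: month number and sales in thousands.
months = [1, 2, 3, 4, 5]
sales = [10.0, 12.0, 14.0, 16.0, 18.0]

slope, intercept = fit_line(months, sales)

# Forecast month 6 by extending the fitted trend.
forecast = intercept + slope * 6
print(f"slope={slope:.2f}, intercept={intercept:.2f}, forecast={forecast:.2f}")
```

Extending a fitted line beyond the observed range is exactly the kind of extrapolation that should be treated with caution, since the trend may not hold outside the historical data.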

Data mining helps discover patterns, trends, and insights from large data sets, enabling businesses to make informed decisions and drive efficiency.

Yes, data analysts are generally well-paid. Entry-level data analysts can earn around $40,000 to $66,000 annually, while mid-level analysts can make approximately $74,000. Senior data analysts often earn six-figure salaries, especially with specialized skills.

This data mining and predictive analysis training course is designed for data analysts, business professionals, and anyone interested in leveraging data for predictive decision-making.

Because this is an intermediate-to-advanced course, a basic understanding of data analysis, statistics, or programming is helpful.

This data mining and analysis training course focuses on commonly used tools like Python, R, Excel, and popular ML libraries such as scikit-learn and TensorFlow.

After completing this course, you’ll have a skillset for roles like data analyst, data scientist, business analyst, and ML engineer, among others.
