Predictive Analytics for Text

Model Understanding and Generative Alignment Laboratory (MUGA LAB)
Academic Year 2025–2026


Course Overview

Predictive Analytics for Text introduces supervised learning techniques for text-based prediction,
emphasizing reproducibility, interpretability, and uncertainty estimation.

Students learn to preprocess and vectorize raw text, train and evaluate predictive models, and interpret model behavior and uncertainty; the specific objectives are listed below.

Course Objectives

By the end of the course, students should be able to:

| Objective | Description |
| --- | --- |
| Text Preprocessing | Tokenize, lemmatize, and normalize text using spaCy/NLTK |
| Vectorization | Implement BoW, TF–IDF, and embedding-based representations |
| Supervised Learning | Train and evaluate classification and regression models |
| Feature Analysis | Use SHAP and calibration metrics to interpret model behavior |
| Deep Learning | Construct MLPs, RNNs, and Transformer-based architectures |
| Uncertainty Modeling | Estimate predictive uncertainty using BayesFlow |
| Model Integration | Combine deterministic and probabilistic models into ensembles |
| Scientific Communication | Write structured, interpretable reports and presentations |
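
For orientation, the kind of pipeline these objectives build toward can be sketched in a few lines of scikit-learn. This is a minimal sketch, not course code; the texts and labels are illustrative placeholders:

```python
# Minimal sketch: TF-IDF vectorization feeding a logistic regression classifier.
# The texts and labels are illustrative placeholders, not course data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

texts = ["great product", "terrible service", "loved it", "awful experience"]
labels = [1, 0, 1, 0]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.5, random_state=42, stratify=labels
)

# Chaining the vectorizer and classifier keeps the workflow reproducible:
# the vectorizer is fit only on training data, which avoids leakage.
clf = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True, ngram_range=(1, 2))),
    ("model", LogisticRegression(max_iter=1000)),
])
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```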

Weekly Schedule

| Week | Topics | Key Models / Methods |
| --- | --- | --- |
| 1 | Text preprocessing: tokenization, lemmatization, POS, NER | spaCy, NLTK |
| 2 | Vectorization and embeddings | TF–IDF, Word2Vec, GloVe, FastText |
| 3 | Classification I | Perceptron, Adaline, Logistic Regression, Naive Bayes |
| 4 | Classification II | SVM, Random Forest, XGBoost |
| 5 | Regression models | Linear, SVR, Random Forest, XGBoost |
| 6 | Evaluation and validation | Accuracy, F1, ROC AUC, MSE, MAE, R² |
| 7 | Feature analysis and tuning | Feature importance, SHAP, Optuna |
| 8 | Deep Learning I | MLP with TF–IDF or embedding inputs |
| 9 | Deep Learning II | GRU, LSTM sequence modeling |
| 10 | Transformers + TabNet | Attention, positional encoding, mask interpretation |
| 11 | BayesFlow for Uncertainty | Posterior inference, calibration |
| 12 | Ensemble Learning | Stacking classical + deep + probabilistic models |
| 13–15 | Capstone Project | Implementation, checkpoints, final defense |
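
As a concrete starting point for Week 1, the whole preprocessing stack fits in a few lines of spaCy. A minimal sketch, assuming the en_core_web_sm model has been downloaded:

```python
# Week 1 sketch: tokenization, lemmatization, POS tagging, and NER with spaCy.
# Assumes: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")

for token in doc:
    # token.lemma_ is the normalized base form; token.pos_ is the coarse POS tag.
    print(f"{token.text:12} lemma={token.lemma_:12} pos={token.pos_}")

for ent in doc.ents:
    # Named entities found by the pretrained NER component.
    print(ent.text, ent.label_)
```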

Tools and Frameworks

| Domain | Libraries / Tools |
| --- | --- |
| Text Processing | spaCy, NLTK, pandas, NumPy |
| ML & Evaluation | scikit-learn, XGBoost, Optuna |
| Deep Learning | PyTorch, torchtext, transformers, pytorch-tabnet |
| Uncertainty & Interpretability | BayesFlow, SHAP, matplotlib, seaborn |
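
To show how these tools interlock, here is one possible Optuna study that tunes the regularization strength of a TF–IDF + logistic-regression pipeline via cross-validation. The search space and toy data are assumptions, not course settings:

```python
# Sketch: Optuna hyperparameter search over a TF-IDF + LogisticRegression pipeline.
# The texts, labels, and search range are illustrative assumptions.
import optuna
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

texts = ["good", "bad", "great", "awful", "fine", "poor"] * 10
labels = [1, 0, 1, 0, 1, 0] * 10

def objective(trial):
    # Sample the inverse regularization strength on a log scale.
    C = trial.suggest_float("C", 1e-3, 1e2, log=True)
    pipe = Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("model", LogisticRegression(C=C, max_iter=1000)),
    ])
    # Mean cross-validated accuracy is the quantity Optuna maximizes.
    return cross_val_score(pipe, texts, labels, cv=3, scoring="accuracy").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
print(study.best_params, study.best_value)
```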

Assessment Breakdown

| Component | Weight |
| --- | --- |
| Chapter Critiques (6 total) | 30% |
| Long Tests (2) | 30% |
| Capstone Project | 30% |
| Assignments | 10% |

Capstone Project

Each student (or team) develops an end-to-end predictive pipeline for text data.
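
One plausible backbone for such a pipeline is the stacked ensemble covered in Week 12. The sketch below uses simple classical base learners as stand-ins; a capstone might instead stack classical, deep, and probabilistic models behind the same interface:

```python
# Sketch: stacking heterogeneous classifiers on TF-IDF features (Week 12 topic).
# The base learners here are simple stand-ins chosen for illustration only.
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

stack = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("ensemble", StackingClassifier(
        estimators=[
            ("nb", MultinomialNB()),
            ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ],
        # A logistic meta-learner combines the base models' predicted probabilities.
        final_estimator=LogisticRegression(max_iter=1000),
    )),
])
# stack.fit(train_texts, train_labels) trains the base learners and, via
# internal cross-validation, the meta-learner on their out-of-fold predictions.
```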

Requirements:

Evaluation Criteria:


Alignment with MUGA LAB Framework

This course operates within the MUGA LAB Pedagogical Stack, integrating:

| Layer | Description |
| --- | --- |
| Mathematical Foundation | Linear models, optimization, calibration theory |
| Computational Layer | Reproducible ML pipelines (TF–IDF, embeddings, SHAP, DEHB) |
| Pedagogical Layer | Structured prompt engineering and interpretive reflection |
| Research Integration | Links to Model Understanding and Generative Alignment (2025) |

Course materials:

| Module | Description | Path |
| --- | --- | --- |
| Vectorization: Traditional | Bag-of-Words and TF–IDF workflows | /courses/predictive_analytics_for_text/notebooks/W02_Traditional_Vectorization_v1.1.ipynb |
| Vectorization: Embeddings | Word2Vec, FastText, GloVe, BayesFlow integration | /courses/predictive_analytics_for_text/notebooks/W02_Embedding_Vectorization_v1.1.ipynb |
| Master Split Generation | Reproducible train/val/test index generation | /courses/predictive_analytics_for_text/scripts/generate_master_split.py |
| Indexing & Preprocessing Helpers | Text and split utilities | /helpers/ |
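
The generate_master_split.py script is not reproduced here; the sketch below only illustrates the underlying idea of a seeded, persisted master split that every experiment reloads. The ratios, file name, and JSON format are assumptions:

```python
# Sketch of reproducible split generation: fix a seed, permute indices once,
# persist them, and have every experiment load the same index file.
import json
import numpy as np

def generate_master_split(n_samples, seed=42, val_frac=0.15, test_frac=0.15):
    rng = np.random.default_rng(seed)   # seeded RNG => identical split every run
    idx = rng.permutation(n_samples)
    n_test = int(n_samples * test_frac)
    n_val = int(n_samples * val_frac)
    return {
        "test": idx[:n_test].tolist(),
        "val": idx[n_test:n_test + n_val].tolist(),
        "train": idx[n_test + n_val:].tolist(),
    }

if __name__ == "__main__":
    with open("master_split.json", "w") as f:
        json.dump(generate_master_split(1000), f)
```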

Contact

MUGA LAB — mugalab.research@gmail.com
https://lexmuga.github.io/mugalab