Project Goal
The primary objective is to build a predictive model that identifies Medicare Part D prescribers (by NPI) who are likely to exceed a peer-normalized opioid prescribing threshold in the following year.
Target Variable
The model predicts the probability that a provider's opioid prescription share next year will be in the top 20% of their peer group (defined by specialty and state). The target is formally defined as:
Pr(opioid_share_{t+1} ≥ peer_q80_{t})
Intended Action
The model's output is intended to be a supportive tool. It should be used to prioritize providers for positive interventions like education, clinical reviews, and promoting naloxone co-prescribing. It is explicitly designed for non-punitive actions.
Description of the Datasets
The model integrates five public datasets to create a comprehensive view of each provider's prescribing patterns and their local community context.
Why XGBoost?
XGBoost is a fast, regularized, tree-boosting method that is well-suited for this project's tabular data, heterogeneous features, missing values, and non-linear interactions. It provides strong baselines and remains explainable for clinical and governance review.
1. Tabular + Mixed Feature Types
Works well with numeric and one-hot categorical features, learning complex interaction terms automatically.
2. Missing-Value Handling
Trees in XGBoost learn a "default direction," so imputation pipelines are not needed for many features.
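The "default direction" idea can be illustrated with a small, simplified sketch. XGBoost scores candidate splits with a gradient-based gain; the version below swaps in plain squared error to keep the example short, so it shows the principle rather than the library's actual computation. The function names are hypothetical.

```python
import math

def sse(values):
    """Sum of squared errors around the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_default_direction(xs, ys, threshold):
    """For one candidate split, try routing missing rows left, then right,
    and keep whichever direction gives the lower total squared error."""
    left = [y for x, y in zip(xs, ys) if not math.isnan(x) and x < threshold]
    right = [y for x, y in zip(xs, ys) if not math.isnan(x) and x >= threshold]
    missing = [y for x, y in zip(xs, ys) if math.isnan(x)]
    loss_if_left = sse(left + missing) + sse(right)
    loss_if_right = sse(left) + sse(right + missing)
    return "left" if loss_if_left <= loss_if_right else "right"

# Made-up data: rows with a missing feature value have labels that
# resemble the right-hand branch, so missing values are routed right.
xs = [0.1, 0.2, 0.8, 0.9, math.nan, math.nan]
ys = [0, 0, 1, 1, 1, 1]
direction = best_default_direction(xs, ys, threshold=0.5)
```

Because the direction is chosen per split from the training data, missingness itself can carry signal, which is one reason explicit imputation is often unnecessary.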
3. Nonlinear Signals
Captures threshold effects and interactions (e.g., state × specialty × opioid share) effectively.
Implementation Notes
NDC Normalization & Opioid Flag
- The pipeline converts openFDA's 10-digit dashed product_ndc to an 11-digit format by zero-padding according to FDA rules to match Part D data.
- An opioid flag is created using a conservative heuristic based on text in pharmacologic class fields and DEA schedule signals (e.g., CII/CIII).
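A minimal sketch of both steps follows. It assumes the input is a three-segment dashed NDC in one of the FDA 10-digit layouts (4-4-2, 5-3-2, or 5-4-1) and pads the short segment with a leading zero to reach the 11-digit 5-4-2 layout. The function names, and the exact rule in `is_opioid` (class text mentions "opioid" and the schedule is CII or CIII), are illustrative assumptions, not the project's actual heuristic.

```python
from typing import Iterable, Optional

VALID_LAYOUTS = {(4, 4, 2), (5, 3, 2), (5, 4, 1)}

def ndc10_to_ndc11(ndc: str) -> str:
    """Convert a dashed 10-digit NDC to an undashed 11-digit (5-4-2) NDC."""
    parts = ndc.strip().split("-")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        raise ValueError(f"expected three numeric segments, got {ndc!r}")
    if tuple(len(p) for p in parts) not in VALID_LAYOUTS:
        raise ValueError(f"unrecognized 10-digit NDC layout: {ndc!r}")
    labeler, product, package = parts
    # Only the short segment actually gains a leading zero.
    return labeler.zfill(5) + product.zfill(4) + package.zfill(2)

def is_opioid(pharm_classes: Iterable[str], dea_schedule: Optional[str]) -> bool:
    """Assumed conservative rule: require BOTH an 'opioid' class mention
    and a CII/CIII schedule, so unscheduled products are not flagged."""
    has_opioid_text = any("opioid" in c.lower() for c in pharm_classes)
    return has_opioid_text and dea_schedule in {"CII", "CIII"}

ndc11 = ndc10_to_ndc11("0002-1433-80")
```

Requiring both signals trades recall for precision; a looser rule would flag more products but also more false positives.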
Feature Engineering (NPI×year)
- From Part D: Features include total_claims, total_30ds, total_cost, opioid_claims (which is used to derive opioid_share), and avg_cost_per_30ds.
- County Features: Data is merged via a ZIP-to-FIPS bridge. This includes ACS data (population, poverty), CHR indicators (smoking, obesity, etc.), and a rolling 12-month mean overdose rate from the CDC.
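The derived Part D features can be sketched as simple ratios. The denominators are assumptions: `opioid_share` is taken here as `opioid_claims / total_claims`, and `avg_cost_per_30ds` as `total_cost / total_30ds`. The helper names and the example row are made up for illustration.

```python
import math

def safe_ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else math.nan

def derive_features(row: dict) -> dict:
    """Add the two derived ratio features to one NPI x year record."""
    out = dict(row)
    out["opioid_share"] = safe_ratio(row["opioid_claims"], row["total_claims"])
    out["avg_cost_per_30ds"] = safe_ratio(row["total_cost"], row["total_30ds"])
    return out

# Made-up record for a single provider-year.
example = derive_features({
    "npi": "1234567890",
    "year": 2021,
    "total_claims": 200,
    "total_30ds": 250.0,
    "total_cost": 5000.0,
    "opioid_claims": 30,
})
```

Returning NaN for an undefined ratio leaves the value as missing, which a tree model with native missing-value handling can consume without a separate imputation step.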
Labeling (Peer-Normalized)
- Providers are grouped by state × specialty.
- The 80th percentile of opioid_share is computed within each peer group to define the threshold peer_q80.
- The label is 1 if a provider's opioid_share is greater than or equal to their peer group's threshold.
- For true forecasting, the model is trained on features at year t to predict the label at year t+1.
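The labeling steps above can be sketched as follows. Two points are assumptions: the percentile uses linear interpolation between order statistics, and, following the target formula, the threshold comes from year t while the share being compared comes from year t+1. Function and field names are hypothetical.

```python
import math
from collections import defaultdict

def percentile(values, q):
    """q-th quantile (0 <= q <= 1) with linear interpolation."""
    xs = sorted(values)
    if not xs:
        raise ValueError("cannot take a percentile of an empty list")
    pos = (len(xs) - 1) * q
    lo, hi = math.floor(pos), math.ceil(pos)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)

def peer_thresholds(rows, q=0.80):
    """Map each (state, specialty) peer group to its q-th percentile share."""
    groups = defaultdict(list)
    for r in rows:
        groups[(r["state"], r["specialty"])].append(r["opioid_share"])
    return {peer: percentile(shares, q) for peer, shares in groups.items()}

def label_next_year(rows_t, rows_t1, q=0.80):
    """Label = 1 when the year t+1 share meets or exceeds the year t threshold."""
    thresholds = peer_thresholds(rows_t, q)
    labels = {}
    for r in rows_t1:
        peer = (r["state"], r["specialty"])
        if peer in thresholds:  # skip peer groups unseen in year t
            labels[r["npi"]] = int(r["opioid_share"] >= thresholds[peer])
    return labels

# Made-up peer group of five providers in year t.
rows_t = [{"npi": str(i), "state": "OH", "specialty": "Family Practice",
           "opioid_share": s} for i, s in enumerate([0.0, 0.1, 0.2, 0.3, 0.4])]
rows_t1 = [
    {"npi": "0", "state": "OH", "specialty": "Family Practice", "opioid_share": 0.35},
    {"npi": "1", "state": "OH", "specialty": "Family Practice", "opioid_share": 0.10},
]
labels = label_next_year(rows_t, rows_t1)
```

Very small peer groups give unstable percentiles, so a real pipeline would likely need a minimum group size or a fallback grouping; that choice is not specified here.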
Modeling
Overall Model Performance
The model identifies high-risk prescribers well above the rate of random selection and rank-orders providers by risk effectively: lift is highest in the top 1% of scored providers and declines gradually through the top 10%.
Prioritization Performance (Lift Analysis)
Lift measures the model's effectiveness at finding high-risk cases compared to random selection, which is crucial for prioritizing resources.
- Lift @ 1% (k=1%): 4.384
  Providers in the top 1% by predicted risk are about 4.4 times as likely to be high-risk as a randomly selected provider.
- Lift @ 5% (k=5%): 3.989
  In the top 5% by predicted risk, high-risk cases are nearly 4 times as frequent as under random selection.
- Lift @ 10% (k=10%): 3.441
  In the top 10%, the model is still about 3.4 times as effective as random selection.
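For reference, lift at k can be computed as the share of true high-risk providers among the top k fraction of scores, divided by the overall share of high-risk providers. The sketch below is one common way to compute it, not necessarily the project's exact evaluation code; the function name and sample data are made up.

```python
import math

def lift_at_k(y_true, scores, k):
    """Positive rate in the top-k fraction of scores divided by the base rate."""
    n = len(y_true)
    if n == 0 or len(scores) != n:
        raise ValueError("y_true and scores must be non-empty and equal length")
    base_rate = sum(y_true) / n
    if base_rate == 0:
        raise ValueError("lift is undefined when there are no positives")
    top_n = max(1, math.ceil(n * k))
    # Rank providers from highest to lowest predicted risk.
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    top_rate = sum(y for _, y in ranked[:top_n]) / top_n
    return top_rate / base_rate

# Made-up example: 10 providers, 2 truly high-risk, both ranked at the top.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
lift = lift_at_k(y_true, scores, k=0.2)
```

Lift is bounded above by 1 / base rate, so if roughly 20% of providers carry a positive label, as the target definition suggests, the maximum achievable lift is about 5.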
Data Governance & Ethics
Demo
A short walkthrough showing how the workflow identifies high-risk prescribers and how the scoring output can be used for non-punitive interventions.