Cross-sectional Functional Stratification Based on Psychometric Profiling and Machine Learning in Patients With Substance Use Disorders (SUD)

Lauro Gutiérrez Castro
155 enrolled

Overview

Substance use disorders (SUDs) show considerable clinical heterogeneity that limits the usefulness of traditional categorical diagnoses. This observational, cross-sectional study aims to apply an unsupervised deep learning method - an autoencoder - to learn continuous latent representations from standardised psychometric data and to explore whether those representations can help stratify clinical subpopulations. The investigators will recruit 155 adults undergoing residential treatment for SUD. Participants will complete six validated instruments assessing impulsivity (BIS-11), anger regulation (STAXI-2), behavioural activation/avoidance (BADS), borderline symptomatology (BSL-23), generalised anxiety (GAD-7), and environmental reward (EROS). Demographic and clinical variables (age, sex, primary substance, years of use, prior treatments) will also be recorded. After data cleaning and standardisation (z-scores), a symmetric autoencoder with a 12-dimensional bottleneck (architecture 21-32-24-12-24-32-21) will be trained using mean squared error loss. Regularisation includes L2 weight decay and dropout. The model will be trained 30 times with different random seeds to assess stability; the five best models (by validation pseudo-R²) will be combined into a weighted ensemble. Five-fold cross-validation will evaluate generalisation. For comparison, principal component analysis (PCA) will be applied to the same data. Gaussian mixture models (GMM) will be fitted on the latent space to explore potential clinical subgroups. The primary outcome is the stability of the latent representation (coefficient of variation of validation MSE across runs). Secondary outcomes include reconstruction performance (pseudo-R²) of the ensemble, comparison with PCA, and the interpretability of latent dimensions via correlations with original variables. GMM results will be described using BIC, silhouette width, bootstrap stability, and clinical characterisation of clusters. 
This study does not involve any intervention. Results will be hypothesis-generating and require external validation. No automated clinical decisions will be made.

Substance use disorders (SUDs) are characterised by substantial heterogeneity in clinical presentation, behavioural patterns, emotional regulation difficulties, impulsivity, and treatment response. Individuals with the same categorical diagnosis may differ considerably in symptom severity, comorbid psychopathology, and psychosocial functioning. This variability limits the explanatory value of traditional diagnostic classifications and supports the development of dimensional and data-driven approaches for patient characterisation.

Recent advances in machine learning provide methods capable of identifying latent structures within complex clinical datasets. Autoencoders, a form of unsupervised deep learning, can learn compact nonlinear representations of multidimensional data while preserving relevant information from the original variables. Compared with traditional linear dimensionality reduction methods such as principal component analysis (PCA), autoencoders may better capture complex interactions among psychological and behavioural variables. When combined with probabilistic clustering approaches such as Gaussian mixture models (GMM), these latent representations may facilitate the identification of clinically meaningful patient subgroups.

The purpose of this observational study is to apply an autoencoder model to psychometric and clinical data obtained from adults receiving residential treatment for substance use disorders. The study aims to explore latent dimensions underlying symptom and behavioural variability and to evaluate whether these dimensions support stable subgroup identification.

Primary Objective: To learn a 12-dimensional latent representation from standardised psychometric and clinical variables using an autoencoder model and evaluate the stability of this representation across repeated training procedures.

Secondary Objectives:
To compare the reconstruction performance of the autoencoder with principal component analysis (PCA).
To characterise the clinical meaning of the latent dimensions through correlations with the original variables.
To explore potential patient subgroups using Gaussian mixture models (GMM) applied to the latent space.
To assess the stability and interpretability of the identified subgroups.

Study Design: This is a single-centre, observational, cross-sectional, non-interventional study conducted in a residential addiction treatment facility. Recruitment is planned from February 2024 through December 2025. The study is registered prior to dissemination of results.

Study Population: Approximately 155 adults diagnosed with substance use disorder according to DSM-5 criteria will be included. Eligible participants must be 18 years of age or older, currently receiving residential treatment, capable of completing study questionnaires, and willing to provide written informed consent. Participants with active psychotic disorders, severe cognitive impairment, significant language or literacy barriers, or imminent discharge from treatment will be excluded.

Measures and Data Collection: Participants will complete a battery of validated self-report instruments assessing impulsivity, anger regulation, behavioural activation and avoidance, borderline symptomatology, anxiety, and environmental reward. Additional demographic and clinical variables will include age, sex, primary substance of use, years of substance use, and prior treatment history.

Questionnaires include:
Barratt Impulsiveness Scale (BIS-11)
State-Trait Anger Expression Inventory-2 (STAXI-2)
Behavioral Activation for Depression Scale (BADS)
Borderline Symptom List-23 (BSL-23)
Generalized Anxiety Disorder-7 (GAD-7)
Environmental Reward Observation Scale (EROS)

Data Analysis: Clinical variables will be standardised prior to analysis. Missing values are expected to be minimal and will be handled using median imputation procedures. Redundant variables with excessive multicollinearity may be removed before modelling. An autoencoder neural network will be trained to generate a reduced latent representation of the clinical data. Model performance and stability will be evaluated across repeated training runs and cross-validation procedures. Reconstruction accuracy will be compared with PCA using equivalent dimensionality. The resulting latent space will subsequently be analysed using Gaussian mixture models to explore potential patient subgroups. Model selection will consider statistical fit, cluster stability, and clinical interpretability. Correlations between latent dimensions and original clinical variables will be examined to facilitate interpretation of the learned representations.

Ethical Considerations: The study protocol has been approved by the corresponding Institutional Ethics Committee. All participants will provide written informed consent prior to participation. Data will be anonymised after collection, and no direct identifiers will be retained. This study is observational and will not modify routine clinical treatment. No automated clinical decisions will be made based on model outputs. Participants may experience mild emotional discomfort or fatigue while completing questionnaires; psychological support will be available if needed. The study will be conducted in accordance with the Declaration of Helsinki and applicable local ethical regulations.

Dissemination: Results will be submitted for publication in peer-reviewed scientific journals and presented at academic conferences. De-identified data and analysis code may be shared publicly after publication to support transparency and reproducibility.
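The preprocessing named under Data Analysis (median imputation followed by z-score standardisation) could look roughly like the sketch below. The function name, the toy data, and the use of the population standard deviation are illustrative assumptions; the record does not say whether the statistics are computed on the full sample or on the training split only.

```python
import numpy as np

def impute_and_standardise(X):
    """Median-impute missing values per variable, then convert to z-scores."""
    X = np.asarray(X, dtype=float)
    medians = np.nanmedian(X, axis=0)             # per-variable median, ignoring NaN
    X_filled = np.where(np.isnan(X), medians, X)  # fill each gap with its column median
    mean = X_filled.mean(axis=0)
    sd = X_filled.std(axis=0, ddof=0)             # population SD (assumption)
    sd[sd == 0] = 1.0                             # guard against constant columns
    return (X_filled - mean) / sd

# Toy data: 4 hypothetical participants, 2 variables, one missing value.
X_toy = np.array([[10.0, 1.0], [12.0, np.nan], [14.0, 3.0], [16.0, 5.0]])
Z = impute_and_standardise(X_toy)
```

In a cross-validated setting, the medians, means, and standard deviations would normally be estimated on the training folds and then applied to the held-out fold, so that no information from validation data enters the preprocessing.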

Outcomes

Primary Outcomes

Latent dimension scores

Twelve continuous latent dimensions derived from the bottleneck layer of a symmetric autoencoder trained on 21 standardized clinical variables. Each dimension represents a compressed, nonlinear combination of the original psychometric indicators (impulsivity, emotion regulation, behavioral activation, borderline symptoms, anxiety, and environmental reward). The dimensions are extracted for each participant after averaging the predictions of an ensemble of the five best autoencoder runs. Unit of Measure: Standardized z-score (mean = 0, SD = 1 in the training sample)
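To make the bottleneck extraction concrete, the following sketch passes synthetic data through the encoder half of the stated 21-32-24-12-24-32-21 architecture. It is not the investigators' implementation: the weights are random and untrained, and the fully connected layers with biases, the ReLU hidden activations, and the linear bottleneck are all assumptions, since the record does not specify them.

```python
import numpy as np

LAYER_SIZES = [21, 32, 24, 12, 24, 32, 21]  # widths from the stated architecture
ENCODER_SIZES = LAYER_SIZES[:4]             # 21 -> 32 -> 24 -> 12

def count_parameters(sizes):
    """Weights plus biases for a chain of fully connected layers (assumed)."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))

def init_layers(sizes, rng):
    """Random, untrained weights; for shape illustration only."""
    return [(rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out))
            for n_in, n_out in zip(sizes, sizes[1:])]

def encode(X, layers):
    """Forward pass up to the bottleneck."""
    h = X
    for idx, (W, b) in enumerate(layers):
        h = h @ W + b
        if idx < len(layers) - 1:
            h = np.maximum(h, 0.0)  # ReLU on hidden layers (assumed); bottleneck left linear
    return h

rng = np.random.default_rng(0)
X = rng.standard_normal((155, 21))          # synthetic stand-in for the z-scored data
latent = encode(X, init_layers(ENCODER_SIZES, rng))
n_params = count_parameters(LAYER_SIZES)
```

Under these assumptions the full network has 3,601 trainable parameters, and `latent` holds one 12-dimensional row per participant.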

Time frame: Baseline (single assessment, cross-sectional)

Secondary Outcomes

Gaussian mixture model cluster membership

Categorical assignment of each participant to one of the clusters obtained by fitting a Gaussian mixture model with full covariance matrices to the 12-dimensional latent space. The number of clusters is determined by the Bayesian Information Criterion (BIC) and clinical interpretability. This outcome is exploratory and does not imply discrete subtypes. Unit of Measure: Nominal (cluster number: 1, 2, …)
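As a rough illustration of how BIC trades fit against complexity here, the sketch below counts the free parameters of a full-covariance Gaussian mixture and applies the standard BIC formula. The function names are hypothetical, and the log-likelihood would come from whichever fitting routine is actually used.

```python
import math

def gmm_full_cov_n_params(k, d):
    """Free parameters: (k - 1) mixing weights, k*d means, k*d*(d+1)/2 covariance terms."""
    return (k - 1) + k * d + k * d * (d + 1) // 2

def bic(log_likelihood, n_params, n_obs):
    """BIC = p * ln(n) - 2 * ln(L); lower values indicate a better trade-off."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood

# Parameter counts for the 12-dimensional latent space.
param_counts = {k: gmm_full_cov_n_params(k, d=12) for k in (1, 2, 3)}
```

With 12 dimensions each component adds 91 parameters, so a two-component full-covariance model already has 181 free parameters against a planned sample of 155. That may be worth keeping in mind when reading the BIC values and the bootstrap stability results.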

Time frame: Baseline

Autoencoder reconstruction pseudo-R²

Proportion of variance in the original 21 clinical variables that is explained by the autoencoder's reconstructions, defined as 1 - (MSE_model / MSE_null), where MSE_null is the mean squared error of a model predicting only the mean. This metric is calculated for the ensemble of the five best models and for each of the 30 independent runs separately. Unit of Measure: Proportion (range 0 to 1)
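The definition above translates into a few lines of code. This is a minimal sketch that assumes the null model predicts each variable's own mean, computed on the same data being scored; the record does not say whether the training-set mean is used instead when scoring the validation split.

```python
import numpy as np

def pseudo_r2(X, X_hat):
    """1 - MSE_model / MSE_null, with a per-variable mean as the null prediction."""
    X = np.asarray(X, dtype=float)
    X_hat = np.asarray(X_hat, dtype=float)
    mse_model = np.mean((X - X_hat) ** 2)
    mse_null = np.mean((X - X.mean(axis=0)) ** 2)
    return 1.0 - mse_model / mse_null

X_toy = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
r2_perfect = pseudo_r2(X_toy, X_toy)                                # exact reconstruction
r2_null = pseudo_r2(X_toy, np.tile(X_toy.mean(axis=0), (3, 1)))     # mean-only reconstruction
r2_bad = pseudo_r2(X_toy, -X_toy)                                   # worse than the mean
```

A perfect reconstruction scores 1 and a mean-only reconstruction scores 0. A reconstruction that is worse than the mean gives a negative value, so results below 0 are possible in practice, particularly on held-out data.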

Time frame: Baseline (computed on the validation split and on the full sample after training)

Autoencoder reconstruction mean squared error

Average squared difference between the original 21 standardized input variables and the reconstructed outputs produced by the autoencoder. Lower values indicate better reconstruction. Reported for the ensemble model and for each independent run. Unit of Measure: Mean squared error (dimensionless, as data are z-standardized)

Time frame: Baseline

Coefficient of variation of reconstruction MSE

Coefficient of variation (CV = standard deviation / mean) of the reconstruction MSE computed over 30 independent autoencoder training runs with different random seeds. This metric assesses the stability and reproducibility of the model. Unit of Measure: Percentage (%)
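A minimal sketch of the stability metric follows. The MSE values are hypothetical placeholders, not study results, and the use of the sample standard deviation (n - 1 denominator) is an assumption, as the record does not state which estimator is intended.

```python
import statistics

def coefficient_of_variation(values, as_percent=True):
    """CV = standard deviation / mean, optionally expressed as a percentage."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample SD with n - 1 denominator (assumption)
    cv = sd / mean
    return 100.0 * cv if as_percent else cv

# Hypothetical validation MSEs from five runs (the protocol plans 30).
mse_runs = [0.50, 0.52, 0.48, 0.51, 0.49]
cv_percent = coefficient_of_variation(mse_runs)
```

Because the CV divides by the mean, it is only meaningful when the mean MSE is clearly above zero, which holds for reconstruction error on z-scored data.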

Time frame: Baseline (after all runs are completed)

Cross-validated reconstruction R²

Mean R² (and standard deviation) obtained from 5-fold cross-validation repeated 3 times, using the same autoencoder architecture and hyperparameters. This evaluates how well the model generalises to unseen patients. Unit of Measure: Proportion (range 0 to 1)
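The resampling scheme (5 folds, repeated 3 times) can be sketched with the standard library alone. The generator below is a hypothetical helper rather than the study's code; it assumes simple unstratified shuffling, whereas the investigators might stratify, for example by primary substance.

```python
import random

def repeated_kfold(n, k=5, repeats=3, seed=0):
    """Yield (repeat, fold, train_indices, val_indices) for repeated k-fold CV."""
    rng = random.Random(seed)
    for r in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)                       # fresh shuffle for every repeat
        folds = [idx[i::k] for i in range(k)]  # k near-equal folds
        for f in range(k):
            val = folds[f]
            train = [i for j, fold in enumerate(folds) if j != f for i in fold]
            yield r, f, train, val

splits = list(repeated_kfold(155))  # 155 = planned sample size
```

With 155 participants this yields 15 train/validation splits of 124 and 31 participants each. The reported mean and standard deviation of R² would then be taken over those 15 validation scores.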

Time frame: Baseline

Explained variance by 12 principal components

Total proportion of variance explained by the first 12 principal components obtained from PCA applied to the same 21 standardized variables. This serves as a comparator for the autoencoder's reconstruction performance. Unit of Measure: Proportion (range 0 to 1)
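For the PCA comparator, the cumulative explained variance can be computed directly from a singular value decomposition. The sketch uses synthetic data in place of the 21 standardised variables, and the function name is illustrative.

```python
import numpy as np

def pca_explained_variance(X, n_components):
    """Fraction of total variance captured by the first n_components principal components."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                        # centre each variable
    s = np.linalg.svd(Xc, compute_uv=False)        # singular values, in descending order
    var = s ** 2                                   # proportional to component variances
    return float(var[:n_components].sum() / var.sum())

rng = np.random.default_rng(0)
X_synth = rng.standard_normal((155, 21))           # synthetic stand-in, not study data
ev12 = pca_explained_variance(X_synth, 12)
ev_all = pca_explained_variance(X_synth, 21)
```

On the sample used to fit the PCA, this fraction equals the pseudo-R² of a rank-12 linear reconstruction when the null model is the per-variable mean, so the two figures are directly comparable in-sample. A like-for-like comparison of generalisation would need the PCA evaluated on the same held-out folds as the autoencoder.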

Time frame: Baseline
