IFoDS 2026 : Sessions

Sessions

Session 1: Advanced Statistical Methods for Complex Data Modeling
Session 2: Frontiers in Statistical Learning and Inference
Session 3: Statistical Learning Methods for Biomedical Data and Precision Medicine
Session 4: Statistical and Computational Foundations of Machine Learning
Session 5: AI-Powered Statistical Inference
Session 6: Geometric and Statistical Modeling of Complex Data
Session 7: Modern Statistical Methods for High-Dimensional, Fairness and Deep Learning Models
Session 8: Statistical Aspects of Modern Machine Learning
Session 9: Frontier Methods in Biostatistics
Session 10: Trustworthy AI and Statistical Governance

Session 1: Advanced Statistical Methods for Complex Data Modeling

Inference for Mark-Specific Causal Effects

Lianqiang Qu, Central China Normal University

Abstract: This paper presents a framework for causal inference in the presence of censored data, where the failure time is marked by a continuous variable referred to as a mark. The mark is observed after treatment and is not meaningful when the failure time is censored. In addition, due to the continuous nature of the marks, observations at each given mark are sparse. These facts make the identification and estimation of causality a challenging task. To address these issues, we define a new mark-specific treatment effect within the potential outcomes framework and characterize its identifying conditions. We then propose a local smoothing estimator for the causal effects and establish its asymptotic properties. We further develop testing methods to evaluate whether the treatment has an effect on the failure time when controlling the values of the mark at certain points or within a defined interval, and develop a Gaussian approximation method to obtain the critical values. We evaluate our method using simulation studies as well as a real dataset from the Antibody Mediated Prevention trials.

Joint Modeling for Zero-Inflation in Microbiome-Metabolome Association Analysis

Peng Ye, University of International Business and Economics

Abstract: When studying microbiome-metabolome association, where microbiome data serve as predictors and metabolites as outcomes, several major analytical challenges arise. One key issue is the presence of zero-inflation in microbiome data. It is essential to distinguish between structural zeros (true absence of the taxa) and random zeros (due to sampling variability), as they may have different relationships with the outcome. Failing to make this distinction will lead to biased inference. Another critical challenge is appropriately handling censored metabolomics data due to detection limits. Common approaches such as deleting censored observations or substituting them with the detection limit can lead to biased estimates. The problem is compounded when censored values stem from both the exposed population, where the metabolite is present but falls below the detection limit, and the unexposed population, where the metabolite is truly absent. To address these challenges, we propose a joint modeling framework that explicitly separates the effects of structural and random zeros in microbiome predictors while accounting for censoring in metabolomic outcomes. This integrated approach improves the accuracy and efficiency of association estimates and enhances the validity of inference. We also evaluate the proposed approach through extensive simulations and apply it to examine the association between the microbiome and the metabolome in the Bogalusa Heart Study.

Distributed privacy-preserving group inference for high-dimensional generalized linear models

Dongxiao Han, Nankai University

Abstract: This article introduces a novel method for distributed, differentially private group inference in the high-dimensional generalized linear model. We consider a setting with an untrusted server and a group of trusted, data-holding clients. Each client first constructs a local nonprivate test statistic based on a weighted quadratic function of the regression sub-vector corresponding to the group, and the debiasing and re-weighting techniques. We then formulate our global differentially private test statistic by using the one-shot method, the Gaussian mechanism, and encryption and decryption procedures. Our proposed approach offers several key advantages. By operating under the untrusted server setting, our method resolves the critical weaknesses endemic to trust-based architectures, namely, systemic fragility and exposure to adversarial or governmental targeting. Moreover, the approach incorporates a bounded encryption procedure to ensure secure communication, eliminate the risk of the server actively leaking data, enable easy computation of the estimator’s sensitivity, and circumvent the computation of intractable sensitivities of complex statistics. Furthermore, our proposed method is capable of handling highly correlated covariates and preserving high power for identifying dense but weak signals. Unlike conventional methods, it also avoids the need to handle the Hessian and precision matrices. Simulation studies are carried out to examine the finite-sample behaviour of the proposed method. An application to an adult income dataset is provided.
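For readers unfamiliar with the Gaussian mechanism mentioned in this abstract, the following is a minimal sketch of its classical form: noise with standard deviation proportional to the L2 sensitivity is added to a statistic before release. It is not the speaker's procedure. The function names, the example statistic, and the sensitivity value are hypothetical, and the calibration shown is the textbook one, which assumes 0 < epsilon < 1.

```python
import math
import numpy as np


def gaussian_mechanism_sigma(sensitivity: float, epsilon: float, delta: float) -> float:
    """Noise scale of the classical Gaussian mechanism.

    Textbook calibration: sigma = sensitivity * sqrt(2 * ln(1.25 / delta)) / epsilon,
    stated for 0 < epsilon < 1 and 0 < delta < 1.
    """
    if not (0.0 < epsilon < 1.0) or not (0.0 < delta < 1.0):
        raise ValueError("classical calibration assumes 0 < epsilon < 1 and 0 < delta < 1")
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon


def privatize(statistic, sensitivity, epsilon, delta, rng):
    """Return the statistic plus independent Gaussian noise of the calibrated scale."""
    sigma = gaussian_mechanism_sigma(sensitivity, epsilon, delta)
    statistic = np.asarray(statistic, dtype=float)
    return statistic + rng.normal(0.0, sigma, size=statistic.shape)


rng = np.random.default_rng(0)
local_stat = np.array([0.8, -0.3])  # hypothetical local statistic held by one client
private_stat = privatize(local_stat, sensitivity=0.1, epsilon=0.5, delta=1e-5, rng=rng)
```

The sensitivity must be a valid bound on how much the statistic can change when one record changes. The abstract indicates that the bounded encryption step is what makes this bound easy to compute in the proposed method.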

Session 2: Frontiers in Statistical Learning and Inference

Decoupled Functional Central Limit Theorems for Two-Time-Scale Stochastic Approximation

Yuze Han, Renmin University of China

Abstract: Two-time-scale stochastic approximation is a variant of the classic stochastic approximation (SA), devised to find the roots of systems with two interconnected equations based on noisy observations. In this approach, two iterates are updated at varying speeds using different step sizes, with each update influencing the other. Previous studies have shown that, under suitable conditions, the convergence rates of the fast and slow iterates are determined only by their own step sizes. This phenomenon is termed decoupled convergence. However, existing results mainly concern convergence properties at a single iteration index, while a trajectory-level counterpart of decoupled convergence remains largely unexplored. In this talk, we establish decoupled functional central limit theorems (FCLTs) for the trajectories of two-time-scale SA. Our results characterize how the coupling between the fast and slow updates influences their asymptotic trajectory behavior. In particular, the limiting process on the fast time scale coincides with that of standard SA, whereas the limiting process on the slow time scale can be interpreted as the limit of SA applied to a composed operator with a modified noise term. We further discuss the implications of our FCLTs, including simultaneous convergence over a range of iterates and possible insights for algorithm design.
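As a toy illustration of the two-time-scale scheme described in this abstract, the sketch below runs two coupled noisy updates with different step sizes on a hypothetical linear system. The system, the step-size exponents, and the noise level are assumptions chosen for illustration and are not taken from the talk.

```python
import numpy as np


def two_time_scale_sa(n_iter=20000, noise_sd=0.1, seed=0):
    """Two-time-scale SA on a toy linear system (hypothetical example).

    Target: the root of f(x, y) = 1 - 0.5 x - 0.5 y and g(x, y) = x - y,
    which is (x, y) = (1, 1). Only noisy evaluations of f and g are used.
    """
    rng = np.random.default_rng(seed)
    x, y = 0.0, 0.0
    for n in range(1, n_iter + 1):
        alpha = 1.0 / n         # slow step size, drives x
        beta = 1.0 / n ** 0.6   # fast step size, drives y (decays more slowly)
        xi, psi = rng.normal(0.0, noise_sd, size=2)
        f_noisy = 1.0 - 0.5 * x - 0.5 * y + xi
        g_noisy = x - y + psi
        # Both iterates are updated from the same previous pair (x, y),
        # so each update influences the other at the next step.
        x, y = x + alpha * f_noisy, y + beta * g_noisy
    return x, y


x_hat, y_hat = two_time_scale_sa()
```

Because the fast step size dominates the slow one, the fast iterate y tracks the current value of x, while x moves slowly toward the joint root.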

Design-Based Edge-Level Causal Inference with Machine Learning Assisted Covariate Adjustment

Hanzhong Liu, Tsinghua University

Abstract: We study design-based causal inference for edge-level outcomes in directed networks under dyadic interference. In this setting, outcomes are defined on directed edges and depend on the joint treatment assignments of pairs of units, inducing a complex dependence structure that invalidates standard estimation and inference procedures developed for node-level data. We construct Horvitz--Thompson estimators for a general class of edge-level causal effects and establish their asymptotic normality under mild regularity conditions. To enable valid inference, we develop variance estimators that exploit identifiable components of network dependence, yielding substantially less conservative bounds than classical approaches. To improve efficiency, we incorporate auxiliary covariates through a sample splitting and cross-fitting procedure. A key technical challenge is that standard two-fold sample splitting fails in the presence of edge-level outcomes due to the dependence induced by shared units. To address this issue, we introduce a three-fold sample splitting and cross-fitting scheme that restores the conditional independence required for unbiased estimation. Under a stability condition, the resulting covariate-adjusted estimator is asymptotically normal and accommodates both linear adjustment and flexible machine learning methods. We further introduce a calibration step that guarantees no asymptotic efficiency loss relative to the unadjusted estimator. Simulation studies and a real-data application confirm the theoretical results and demonstrate substantial efficiency gains.

A Weighted Subspace Approach to Variable Importance Evaluation

Yiwei Fan, Beijing Institute of Technology

Abstract: A central goal of feature screening in high-dimensional settings is to rank predictors by an importance measure that quantifies their association with the response. Marginal screening methods ignore dependence among predictors, whereas forward screening partially mitigates this issue by conditioning on a selected subset. However, forward screening discards the remaining predictors and typically requires an expanding conditioning set as the iterations proceed. To overcome these limitations, we propose a systematic screening framework via weighted subspace iteration from a fixed-point perspective. The framework aggregates information across predictors while keeping the conditioning variables low-dimensional. Unlike existing one-shot or randomly constructed subspace methods, our framework uses screening feedback to iteratively refine the conditioning subspace and ultimately reaches a fixed point. The proposed framework applies to both linear and nonlinear settings and can be paired with a broad class of association measures. It is shown to achieve rank consistency under mild conditions. Extensive simulations demonstrate that the framework outperforms existing approaches, and its practical advantages are further illustrated through a real data example.

Transfer Learning with Heterogeneous Feature Spaces in Linear Regression

Junlong Zhao, Beijing Normal University

Abstract: Transfer learning is a promising technique for enhancing the performance of a target task by leveraging information from source data. Most transfer learning methods are designed for homogeneous feature spaces, where the target and source domains share the same feature space. However, heterogeneous feature spaces, where the target and sources have different feature spaces, are commonly encountered yet less investigated. In this paper, we focus on linear models in which the feature spaces of the sources are subsets of that of the target, considering both low- and high-dimensional settings. An importance weighting transfer learning method combined with projection-based imputation is proposed to effectively utilize source information. Our method includes a sample selection process to reduce the risk of negative transfer and is less sensitive to the choice of the projection matrix used in the imputation step. Both entry-wise and global convergence rates of our estimator are established. Numerical results and real data analysis confirm the effectiveness of our proposed method.

Session 3: Statistical Learning Methods for Biomedical Data and Precision Medicine

Modeling Time-Varying Effects of Recurrent Exposures: A Time-Adapted Exponential Model to Assess Impact of Post-LVAD Bleeding on Mortality

Guangyu Yang, Renmin University of China

Abstract: Bleeding is a common and recurrent adverse event in patients following left ventricular assist device implantation and is associated with increased mortality risk. Our method is motivated by the need to understand the impact of bleeding on mortality, which poses several challenges: (i) bleeding can occur at any time post-implantation; (ii) its effect varies over time; and (iii) bleeding events often recur. However, no existing method addresses all these challenges. This study introduces the Time-Adapted Exponential (TAE) model, which accommodates recurrent bleeding events and incorporates an exponential time-adapted term in the Cox model to characterize both the transient effect at the moment of onset and the evolving effect over time for each bleeding occurrence. The TAE model also provides a framework to assess whether the effect of bleeding events varies over time and whether each bleeding has a homogeneous effect. Asymptotic properties of the TAE estimator are derived. Extensive simulation studies demonstrate that the TAE method exhibits strong numerical performance. Application to the INTERMACS database reveals a decaying pattern in mortality risk following each bleeding event. Furthermore, all bleeding events are associated with an increased risk of mortality, with the first event having a greater impact than subsequent ones.

Neural Network on Interval-Censored Data: Application to the Prediction of Alzheimer's Disease

Tao Sun, Renmin University of China

Abstract: Alzheimer’s disease (AD) is a progressive neurodegenerative disorder with complex genetic, cognitive, and functional determinants. Given the limited effectiveness of current treatments, accurate prediction of AD progression is important for early prevention, individualized monitoring, and timely intervention. However, prediction is challenging because AD onset is often interval censored due to intermittent clinical assessments, and available predictors may include high-dimensional genetic variants and repeatedly measured cognitive and functional scores with nonlinear effects. Motivated by data from all four phases of the Alzheimer’s Disease Neuroimaging Initiative (ADNI), this talk presents neural-network-based methods for interval-censored data. For large-scale genetic data, we develop a semiparametric transformation model with sieve estimation and a computationally efficient generalized score test to identify variants associated with AD progression. Selected variants are then used to construct a neural network prediction model for interval-censored outcomes. We further extend the framework to dynamic prediction by integrating multivariate functional principal component analysis with neural networks, allowing longitudinal cognitive and functional scores to update individualized AD risk predictions at each follow-up visit. Simulation studies and ADNI applications show that the proposed methods improve prediction accuracy over existing approaches and identify subgroups with distinct progression risk profiles. An online dynamic prediction platform has also been developed at http://olap.ruc.edu.cn.

Incorporating External Data for Analyzing Randomized Clinical Trials: A Transfer Learning Approach

Wei Ma, Renmin University of China

Abstract: Randomized clinical trials are the gold standard for analyzing treatment effects, but high costs and ethical concerns can limit recruitment, potentially leading to invalid inferences. Incorporating external trial data with similar characteristics into the analysis using transfer learning appears promising for addressing these issues. In this paper, we present a formal framework for applying transfer learning to the analysis of clinical trials, considering three key perspectives: transfer algorithm, theoretical foundation, and inference method. For the algorithm, we adopt a parameter-based transfer learning approach to enhance the lasso-adjusted stratum-specific estimator developed for estimating treatment effects. A key component in constructing the transfer learning estimator is deriving the regression coefficient estimates within each stratum, accounting for the bias between source and target data. To provide a theoretical foundation, we derive the l1 convergence rate for the estimated regression coefficients and establish the asymptotic normality of the transfer learning estimator. Our results show that when external trial data resembles current trial data, the sample size requirements can be reduced compared to using only the current trial data. Finally, we propose a consistent nonparametric variance estimator to facilitate inference. Numerical studies demonstrate the effectiveness and robustness of our proposed estimator across various scenarios.

Kernel Smoothing-Based Methods for Estimating the Optimal Individualized Treatment Rule for Binary Outcomes

Min Zhang, Tsinghua University

Abstract: Estimating optimal individualized treatment rules (ITRs) is central to precision medicine, where treatment decisions are tailored to patient characteristics. While substantial progress has been made for continuous outcomes, methodological development for binary outcomes remains limited. A popular approach frames ITR estimation as a weighted classification problem using the augmented inverse probability weighted estimator (AIPWE) of the contrast function. Although AIPWE is doubly robust and unbiased under standard causal assumptions, applying it to binary outcomes presents unique challenges. The discrete nature of binary outcomes can cause misalignment between the signs of the estimated and true contrast functions, leading to suboptimal classification. To address this, we propose a kernel-smoothing enhancement of AIPWE that improves contrast function estimation by borrowing information across similar individuals. The proposed method retains double robustness, reduces classification error, and accommodates high-dimensional covariates through a dimension-reduction strategy. We evaluate the proposed method via simulations under various model misspecifications and apply it to a breast cancer clinical trial to demonstrate its practical utility.
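For orientation, the weighted-classification framing mentioned in this abstract is commonly written as follows. This is the standard textbook form rather than the speaker's kernel-smoothed version, and the notation is an assumption: A_i is the binary treatment, Y_i the outcome, X_i the covariates, \hat\pi the estimated propensity score, and \hat\mu_a the estimated outcome regression under treatment a.

```latex
% AIPW estimate of the individual contrast (standard form; notation assumed)
\widehat{C}_i
  = \left[ \frac{A_i Y_i}{\hat\pi(X_i)}
         - \frac{A_i - \hat\pi(X_i)}{\hat\pi(X_i)}\,\hat\mu_1(X_i) \right]
  - \left[ \frac{(1 - A_i) Y_i}{1 - \hat\pi(X_i)}
         + \frac{A_i - \hat\pi(X_i)}{1 - \hat\pi(X_i)}\,\hat\mu_0(X_i) \right]

% ITR estimation as weighted classification: label 1(\widehat{C}_i > 0), weight |\widehat{C}_i|
\hat d = \arg\min_{d} \sum_{i=1}^{n}
  \bigl|\widehat{C}_i\bigr| \,
  \mathbf{1}\bigl\{ \mathbf{1}(\widehat{C}_i > 0) \neq d(X_i) \bigr\}
```

With a binary Y_i, the sign of \widehat{C}_i can differ from the sign of the true contrast at X_i, which is the misalignment the abstract describes.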

Session 4: Statistical and Computational Foundations of Machine Learning

Mild Over-Parameterization Benefits Tensor PCA

Cong Fang, Peking University

Abstract: In this talk, we investigate how mild over-parameterization enhances gradient descent methods for tensor principal component analysis (Tensor PCA). Traditional gradient methods typically require an enormous number of samples to recover a hidden signal. We show that introducing a modest amount of over-parameterization can dramatically reduce sample complexity without relying on expensive initialization procedures, thereby significantly narrowing the statistical-to-computational gap. Specifically, in the symmetric setting, our proposed normalized stochastic gradient ascent method breaks the previous belief that gradient-based algorithms need a prohibitive number of samples, achieving the best-known statistical efficiency for polynomial algorithms under suitable settings. In the asymmetric setting, under a limited memory budget, we show that mild over-parameterization not only improves sample efficiency but also adapts naturally to the underlying problem structure, further reducing the required sample size when signals are more aligned. Together, these results provide evidence that mild over-parameterization offers both optimization and statistical benefits, effectively bridging the gap between what is statistically possible and what is computationally tractable.

Understanding the Deep Learning from Data Statistics

Zhiqin Xu, Shanghai Jiao Tong University

Abstract: Fundamental research in deep learning constitutes a paramount field of study, within which the phenomenon-driven approach has emerged as a crucial methodology. By discovering and systematically exploring significant phenomena, fundamental research can more effectively facilitate the comprehension and practical application of neural networks. This report will introduce several key phenomena, including the Frequency Principle and the Condensation Phenomenon, to elucidate the underlying mechanisms of deep learning. The discussion will span from the learning dynamics of elementary perceptrons to the memory and reasoning capabilities of large-scale models. (Reference: An Introduction to Deep Learning Phenomena)

Data Selection for LLM: Evolving from Closed to Open

Jun Shu, Xi'an Jiaotong University

Abstract: The current machine learning methods represented by foundation models have an urgent demand for large-scale training data, which often requires collecting massive amounts of training data from the open Internet environment. As a result, the closed-data assumption is no longer valid. This issue severely constrains the effective application of machine learning methods in real-world scenarios and has become a bottleneck that the field urgently needs to address. This talk will focus in particular on sample weighting or sample selection, a typical methodology for handling data noise. It will introduce how this methodology has evolved from traditional manually designed weighting schemes under the closed-data assumption to more advanced automated weighting methods under the open-data assumption.

Implicit Models Made Explicit: From Structural Understanding to Efficient Inference

Zenan Ling, Huazhong University of Science and Technology

Abstract: Implicit neural networks define outputs through fixed points, equilibria, or continuous dynamics, offering a flexible alternative to conventional finite-depth architectures. Yet their implicit nature also makes them difficult to analyze and expensive to evaluate. This talk presents a unified perspective on making implicit neural networks explicit, with a focus on Deep Equilibrium Models. I will discuss how their structure can be understood through explicit theoretical characterizations, and how their inference can be accelerated by exploiting the underlying dynamics of fixed-point computation.
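To make the fixed-point definition of an implicit layer concrete, here is a minimal sketch of a Deep Equilibrium style forward pass computed by plain fixed-point iteration. It does not implement the acceleration discussed in the talk. The layer form z = tanh(Wz + Ux + b), the dimensions, and the rescaling of W to spectral norm 0.5 (which makes the map a contraction) are all assumptions for illustration.

```python
import numpy as np


def deq_forward(W, U, b, x, tol=1e-10, max_iter=500):
    """Solve z = tanh(W z + U x + b) by fixed-point iteration.

    Returns the approximate equilibrium and the number of iterations used.
    """
    z = np.zeros(W.shape[0])
    for k in range(max_iter):
        z_next = np.tanh(W @ z + U @ x + b)
        if np.linalg.norm(z_next - z) < tol:
            return z_next, k + 1
        z = z_next
    return z, max_iter


rng = np.random.default_rng(0)
d, p = 8, 3                           # hypothetical hidden and input sizes
W = rng.normal(size=(d, d))
W *= 0.5 / np.linalg.norm(W, 2)       # rescale to spectral norm 0.5
U = rng.normal(size=(d, p))
b = rng.normal(size=d)
x = rng.normal(size=p)

z_star, n_steps = deq_forward(W, U, b, x)
residual = np.linalg.norm(z_star - np.tanh(W @ z_star + U @ x + b))
```

Since tanh is 1-Lipschitz, the rescaling guarantees a unique equilibrium and geometric convergence of the iteration. Without such a condition, convergence is not guaranteed and other root-finding solvers are typically used.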

Session 5: AI-Powered Statistical Inference

A Conditional Distribution Equality Testing Framework Using Deep Generative Learning

Siming Zheng, Southeast University

Abstract: In this work, we propose a general framework for testing the conditional distribution equality in a two-sample problem, which is most relevant to covariate shift and causal discovery. Our framework is built on neural network-based generative methods and sample splitting techniques by transforming the conditional distribution testing problem into an unconditional one. We introduce two special tests: the generative permutation-based conditional distribution equality test and the generative classification accuracy-based conditional distribution equality test. Theoretically, we establish a minimax lower bound for statistical inference in testing the equality of two conditional distributions under certain smoothness conditions. We demonstrate that the generative permutation-based conditional distribution equality test and its modified version can attain this lower bound precisely or up to some iterated logarithmic factor. Moreover, we prove the testing consistency of the generative classification accuracy-based conditional distribution equality test. We also establish the convergence rate for the learned conditional generator by deriving new results related to the recently-developed offset Rademacher complexity and approximation properties using neural networks. Empirically, we conduct numerical studies including synthetic datasets and two real-world datasets, demonstrating the effectiveness of our approach.

Generative Doubly Robust Estimation for General Treatment Effects

Qixian Zhong, Xiamen University

Abstract: This paper introduces a unified framework for doubly robust (DR) estimation of a broad class of causal functionals, including average, quantile, and asymmetric least squares treatment effects, as well as their conditional counterparts. While DR estimators are well-established for average treatment effects, their development for distributional parameters like quantile treatment effects has not yet been investigated. We bridge this gap by integrating conditional generative models into a loss-based estimating framework. Our approach uses generative models to synthesize counterfactual samples, defines a target loss whose minimizer corresponds to the causal functional of interest, and constructs a final DR estimator by combining these elements with inverse probability weighting. The resulting estimators are shown to be root-n consistent, asymptotically normal, and semiparametrically efficient for unconditional effects, provided either the propensity score or the generative model is correctly specified. For conditional effects, we employ deep neural networks, establishing minimax-optimal convergence rates that adapt to low intrinsic data structures. Simulations confirm the double robustness and finite-sample performance of the proposed methods. This work provides a robust and flexible tool for distributional and heterogeneous causal inference in observational studies, where model misspecification is a persistent concern.

Optimal Semi-supervised Inference for Estimating Equations: A Nonparametric Projection Approach Using ReQU Neural Networks

Shanshan Song, Tongji University

Abstract: Semi-supervised learning aims to effectively leverage unlabeled data to improve prediction or estimation accuracy over the supervised counterparts. In this work, we propose a class of Nonparametric Projection-based Adaptive Semi-Supervised estimators for a general class of estimating equation problems (NPASS). The NPASS is a two-step estimator, where the key step is to estimate a parameter-dependent projection function via a randomization technique and deep differentiable neural networks with Rectifier Quadratic Unit (ReQU) activation functions. A debiasing strategy and a one-step update are developed to effectively integrate the information from unlabeled data. The asymptotic normality of the resulting estimator is established under general conditions, demonstrating the adaptability of our method in some sense. A major challenge in our theoretical analysis is designing the ReQU network class to balance the prediction errors for the target projection function and its derivatives. Our method allows the input dimension to be high-dimensional, and more importantly, it is robust to the slow convergence of the nonparametric neural network estimator. Furthermore, when the asymptotic covariance matrix involves the density function, our method can readily provide a consistent plug-in estimator, thereby avoiding the delicate density estimation or bootstrapping for inference. We perform numerical experiments to demonstrate the effectiveness and superiority of our proposed method over some existing methods in various scenarios.

Metric Conformal Prediction Based on the Expected Local Radius

Rui Qiu, East China Normal University

Abstract: We propose a conformal prediction method for responses taking values in general metric spaces. Central to our approach is the introduction of the expected local radius, a geometrically interpretable quantity that characterizes the metric effort required to accumulate probability mass around a candidate point. This quantity is intrinsic to the metric, does not require linear operations or densities, and yields adaptive prediction sets capturing the distributional contours. We further utilize a metric distributional random forest to estimate the conditional distribution, mitigating the practical impact of multivariate covariates through adaptive splitting. Theoretically, we establish uniform consistency of the proposed estimators and prove asymptotic conditional validity of the resulting prediction sets. Empirical results demonstrate favorable performance in settings with multivariate covariates and complex response structures.

Session 6: Geometric and Statistical Modeling of Complex Data

Deep Isometric Manifold Embedding for Video

Feng Li, Peking University

Abstract: Financial analysts increasingly communicate through video, yet statistical models rarely use non-verbal behavior because raw pose trajectories are confounded by camera angle, scale, occlusion, and individual heterogeneity. We propose a geometry-aware statistical framework for constructing invariant and interpretable features from hand-gesture trajectories. The framework treats a gesture window as a landmark trajectory observed up to nuisance transformations, removes translation and scale, aligns rotations through Procrustes geometry, and learns a low-dimensional embedding that approximately preserves the resulting shape distances. Segment-level summaries of the embedding---gesture entropy and kinematic instability---are then used as generated covariates in a predictive analysis of sector-level returns. In a large corpus of financial video broadcasts, coordinate-based measures of physical movement are not statistically informative after adjustment for text, vocal, and market variables, whereas the proposed shape-based summaries show detectable conditional associations. These associations are most visible in segments with neutral textual attitude, suggesting that invariant gesture features may contain behavioral information not captured by transcripts alone. The empirical findings should be interpreted as predictive associations rather than causal evidence of psychological states. More broadly, the paper illustrates how nuisance-invariant feature construction can make high-dimensional behavioral video data usable in statistical modeling.
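The nuisance-removal steps named in this abstract (remove translation and scale, then align rotations through Procrustes geometry) can be sketched for a single pair of landmark configurations as follows. This is a generic orthogonal Procrustes computation, not the speaker's pipeline. The function names and the 21-point, 2-D landmark layout are hypothetical.

```python
import numpy as np


def normalize_shape(X):
    """Center a (landmarks x dims) configuration and scale it to unit Frobenius norm."""
    X = np.asarray(X, dtype=float)
    X = X - X.mean(axis=0)
    return X / np.linalg.norm(X)


def procrustes_distance(X, Y):
    """Distance between two configurations after removing translation, scale, and rotation.

    Returns min over proper rotations R of ||A - B R||_F, where A and B are the
    normalized versions of X and Y, together with the optimal R.
    """
    A, B = normalize_shape(X), normalize_shape(Y)
    U, s, Vt = np.linalg.svd(B.T @ A)
    D = np.ones(len(s))
    if np.linalg.det(U @ Vt) < 0:
        D[-1] = -1.0                  # flip one axis to exclude reflections
    R = (U * D) @ Vt                  # equals U @ diag(D) @ Vt
    return np.linalg.norm(A - B @ R), R


rng = np.random.default_rng(0)
X = rng.normal(size=(21, 2))          # hypothetical 21 hand landmarks in 2-D
theta = 0.7
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
Y = 3.0 * X @ rot + np.array([5.0, -2.0])   # same shape: rotated, rescaled, shifted

dist_same, _ = procrustes_distance(X, Y)
dist_diff, _ = procrustes_distance(X, rng.normal(size=(21, 2)))
```

Here `dist_same` is numerically zero because Y differs from X only by nuisance transformations, while `dist_diff` is positive for an unrelated configuration. Pairwise distances of this kind are the quantities a shape-preserving embedding would aim to approximate.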

Federated LoRA Fine-Tuning for LLMs via Collaborative Alignment

Long Feng, University of Hong Kong

Abstract: Low-rank adaptation (LoRA) has emerged as a powerful tool for parameter-efficient fine-tuning of large language models (LLMs). This paper studies LoRA under a federated learning setting, enabling collaborative fine-tuning across clients while preserving parameter efficiency. We focus on a highly heterogeneous regime in which clients share only partial structure and a substantial subset may be contaminated. We propose Collaborative Low-rank Alignment and Identifiable Recovery (CLAIR), a contamination-aware framework that relies only on preliminary local estimators. Its formulation applies broadly, from linear regression to neural network and LLM modules, whenever local adaptation can be represented by matrix-valued updates. CLAIR recovers the shared LoRA subspace and detects contaminated clients via a structured low-rank plus block-sparse decomposition. We prove exact recovery of the shared LoRA subspace in the noiseless case, stable recovery under preliminary estimation error, and consistent collaborative-set recovery under mild separation conditions. We further quantify the gain from CLAIR refinement: it reduces off-subspace estimation error through cross-client averaging while preserving client-specific variation within the shared LoRA subspace, and thus improves over local fine-tuning whenever this oracle gain outweighs the costs of subspace estimation and benign-client heterogeneity. Empirically, we demonstrate the benefits of CLAIR by fine-tuning a Transformer architecture on a text-copying task. The results show accurate contamination detection and improved benign-client performance compared with local fine-tuning and non-robust federated averaging.

Hidden Block Regression: A General Framework for Multi-Response Models with Group Structures and Hidden Variables

Yuehan Yang, Central University of Finance and Economics

Abstract: Multi-response models with group structures and hidden variables are prevalent in complex data analysis. However, existing methods often fall short in capturing the intricate interactions between observed and unobserved components. In this paper, we introduce Hidden Block Regression (HBR), a novel method that provides a unified framework for modeling grouped structures influenced by unobservable factors. By integrating hidden block detection with group structure preservation, HBR offers a flexible strategy that generalizes to various forms of hidden factor involvement, enabling accurate prediction and variable selection. We derive theoretical deviation bounds for the HBR estimator, demonstrating its capacity to identify hidden variables while maintaining essential group structure information. Extensive simulations reveal that HBR consistently outperforms existing methods, particularly in scenarios involving complex interactions between observed and latent factors. A case study on a yeast dataset underscores its practical utility in identifying key predictors in such contexts.

Ball Impurity: Measuring Heterogeneity in General Metric Spaces

Ting Li, Southern University of Science and Technology

Abstract: Data in various domains, such as neuroimaging and network data analysis, often come in complex forms without possessing a Hilbert structure. This complexity necessitates innovative approaches for effective analysis. We propose a novel measure of heterogeneity, ball impurity, which is designed to work with complex non-Euclidean objects. Our approach extends the notion of impurity to general metric spaces, providing a versatile tool for feature selection and tree models. The ball impurity measure exhibits desirable properties, such as the triangle inequality, and is computationally tractable, enhancing its practicality and usefulness. Extensive experiments on synthetic data and real data from the UK Biobank validate the efficacy of our approach in capturing data heterogeneity. Remarkably, our results compare favorably with state-of-the-art methods in metric spaces, highlighting the potential of ball impurity as a valuable tool for addressing complex data analysis tasks.

Session 7: Modern Statistical Methods for High-Dimensional, Fairness and Deep Learning Models

Allocation of Large Portfolio by Sparse Group Lasso

Lei Huang, Southwest Jiaotong University

Abstract: We propose Regression Model with Sparse Group Lasso (RM-SPAGL), a high-dimensional portfolio selection method that combines the unconstrained regression formulation of mean-variance optimization with Sparse Group Lasso regularization and a factor-based covariance structure. The procedure incorporates industry grouping, allows simultaneous group and within-group sparsity, and selects the tuning parameter by risk-constrained cross-validation to target a prespecified portfolio risk level. Under standard regularity conditions, we establish convergence of the resulting portfolio return to its theoretical target at a rate determined by both elementwise and group-level sparsity. In simulations, RM-SPAGL attains Sharpe ratios close to the theoretical benchmark, keeps realized risk near the target level, and selects fewer than 0.02% white noise assets. Out-of-sample analyses using S&P 500 constituents and the Chinese A-share market further show comparable risk-adjusted performance and lower turnover than several benchmark procedures. The evidence suggests that exploiting group structure can improve the stability and interpretability of high-dimensional portfolio selection.
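
For readers unfamiliar with the penalty named in the abstract above, the standard Sparse Group Lasso objective for a generic regression is sketched below. The abstract does not state the RM-SPAGL objective, so every symbol here is an assumption for illustration: y and X are a generic response and design matrix, the coefficients are split into G groups (for example, industries) with p_g members each, lambda is the overall tuning parameter, and alpha in [0, 1] balances elementwise against group-level shrinkage.

```latex
\hat{\beta}
  = \arg\min_{\beta}\;
    \frac{1}{2n}\,\lVert y - X\beta \rVert_2^2
    + \lambda \Big[ (1-\alpha) \sum_{g=1}^{G} \sqrt{p_g}\,\lVert \beta_g \rVert_2
    + \alpha\,\lVert \beta \rVert_1 \Big]
```

The group term can set an entire group to zero, while the elementwise term produces sparsity within the groups that remain, which matches the "simultaneous group and within-group sparsity" described in the abstract.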

A High-Dimensional Regression Model Based on Residual-Driven Nonlinear Screening

Shouri Hu, University of Electronic Science and Technology of China

Abstract: In many regression problems, predictors can influence the response through both dominant linear effects and more complex nonlinear deviations. A central challenge is that nonparametric regression in high dimensions is severely limited by the curse of dimensionality, while common feature screening and selection strategies typically rely on strong sparsity assumptions on the full set of predictors. We propose Residual-Driven Hybrid Regression (RDHR), a two-stage framework for high-dimensional prediction that mitigates both difficulties by separating linear and nonlinear learning. In the first stage, RDHR fits a conventional linear learner to capture the dominant linear signal and form residuals; importantly, this step does not require sparsity of the original predictors. In the second stage, RDHR performs residual-based feature screening to identify a small subset of predictors with remaining nonlinear association, and then applies nonparametric regression to the residuals to refine predictions. This design reduces the effective dimension of the nonparametric task and thereby alleviates the curse of dimensionality, while retaining interpretability and parsimony through variable selection and a mild additive structure in the refinement stage. We establish theoretical guarantees for the second stage, including a sure screening property and consistency with explicit convergence rates, which together imply controlled prediction error. Extensive simulations across diverse correlation structures, dimensional regimes, sample sizes, and outcome types demonstrate that RDHR improves predictive accuracy and yields stable variable identification relative to competing approaches. An application to real data further illustrates the practical effectiveness and flexibility of RDHR.

Random Subset Averaging

Jie Hu, Tsinghua University

Abstract: We propose a new ensemble prediction method, Random Subset Averaging (RSA), tailored for settings with many correlated covariates, including extreme regimes in which the number of predictors far exceeds the sample size and the covariates exhibit strong dependence. RSA constructs candidate models via a binomial random subset strategy and aggregates their predictions through a two-round weighting scheme, yielding a hierarchical aggregation structure that separates model-fit and subset-construction uncertainty. All tuning parameters are selected via cross-validation, requiring no prior knowledge of covariate relevance. We establish the asymptotic optimality of RSA under mild rate conditions, allowing for data-dependent first-round weights. Under orthogonal designs, RSA incurs no asymptotic approximation loss relative to flat Mallows averaging while substantially relaxing the associated rate conditions, and achieves a lower finite-sample risk bound than both nested Mallows averaging and random subset regression. Simulation studies demonstrate that RSA consistently delivers accurate and stable predictive performance across a wide range of sample sizes, dimensional settings, sparsity levels and correlation structures, outperforming conventional model selection and ensemble learning methods. An empirical application to financial return forecasting further illustrates its practical utility.

Nonparametric Based Fairness Auditing

Xianli Zeng, Xiamen University

Abstract: Machine learning models used in high-stakes settings, such as recidivism prediction and hiring, may exhibit substantial disparities across sensitive groups. Fairness auditing seeks to address this issue through two main tasks: certification, which tests whether a model satisfies a fairness criterion, and flagging, which identifies subpopulations experiencing unfair treatment. Existing methods often rely on restrictive assumptions, are computationally expensive, or mainly focus on discrete protected attributes. In this talk, I will present two statistical frameworks for fairness auditing. The first approach is based on empirical likelihood. It is computationally efficient, supported by asymptotic theory for valid inference, and enables both fairness certification and subgroup discovery. The second approach is designed for settings with continuous protected attributes, where standard methods may fail to identify unfair intervals. This framework applies nonparametric techniques to estimate disparity as a function of the protected attribute, based on which we design a test with valid size control. Applications to the COMPAS dataset show that these methods can reveal both intersectional and age-related disparities that may be missed by standard approaches.

Session 8: Statistical Aspects of Modern Machine Learning

Leading Science and Clinical Research-Biostatistics in the Age of AI

Haoda Fu, Amgen

Abstract: AI is rapidly transforming society and has emerged as the defining technology of our generation. Its influence spans across industries, with significant implications for scientific research, healthcare, and drug development. In this work, we review the historical progression of AI and its growing role in pharmaceutical research, highlighting how AI-driven methodologies are revolutionizing drug discovery and development processes. As the integration of AI in biostatistics and clinical research deepens, scientists must cultivate essential skills to remain effective in this evolving landscape. We discuss four critical competencies for scientists working in the AI era: AI Mindsets, AI Communication, AI Integration, and AI-Enabled Innovation. Looking ahead, we explore the potential future impact of AI on the pharmaceutical data analytics ecosystem. The convergence of AI with biostatistics presents both challenges and opportunities, requiring a thoughtful balance between leveraging AI’s capabilities and maintaining rigorous scientific and ethical standards. By embracing AI-driven approaches while upholding core statistical principles, the next generation of scientists can contribute to more efficient, data-driven advancements in clinical research. This discussion aims to provide insights into the evolving role of AI in biostatistics and inspire forward-thinking strategies for navigating the intersection of AI and scientific discovery in the pharmaceutical industry.

Non-Asymptotic Theories of Neural Networks for Dependent Data

Gan Yuan, City University of Hong Kong

Abstract: Recent years have seen substantial progress in the theoretical analysis of deep neural networks, though the majority of existing results assume independent observations. In contrast, this work investigates the statistical properties of deep ReLU neural networks for modelling nonlinear dependent data, encompassing a broad class of time series and spatial models. We establish sharp non-asymptotic error bounds for the DNN estimator and show that these bounds depend explicitly on the architectural characteristics of the network and on the underlying dependence structure of the data under different mixing scenarios. Systematic simulation studies demonstrate that the empirical results align closely with the theoretical findings.

Robustness and Fairness in Medical Machine Learning: Optimization Strategies for Reliable and Equitable Chest X-ray AI

Lu Tang, University of Pittsburgh

Abstract: This paper studies fairness and robustness in chest X-ray machine learning under class imbalance and distribution shift. Its primary methodological contribution is an extension of tilted empirical risk minimization (TERM), denoted Stratified TERM (StraTERM), which partitions data into clinically coherent strata and applies group-level tilted optimization within each stratum. The resulting objective is designed to emphasize poorly served subgroups within clinically comparable contexts while preserving a practical training procedure. The paper develops a fairness framework that distinguishes marginal from conditional criteria, formalizes StraTERM in relation to existing TERM variants, and evaluates the method empirically on the MIMIC-CXR dataset. Experiments use patient-level splits for binary Lung Opacity versus No Finding classification from chest radiographs. This setting provides a concrete testbed for assessing whether objective-level fairness interventions can improve worst-group reliability and conditional fairness with minimal loss of clinically relevant discriminative performance.
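
The tilted objective that the abstract above builds on can be illustrated with a minimal sketch. The function `tilted_loss` follows the usual TERM form, (1/t) log of the average of exp(t * loss). The two-level helper `group_tilted_loss` is a simplified illustration that weights groups equally; both function names and the equal weighting are assumptions, and the stratification step of StraTERM is not reproduced here.

```python
import math


def tilted_loss(losses, t):
    """Tilted empirical risk: (1/t) * log(mean(exp(t * loss_i))).

    t > 0 emphasizes large losses, t < 0 emphasizes small ones,
    and t == 0 is treated as the plain average (the limiting case).
    """
    if not losses:
        raise ValueError("losses must be non-empty")
    if t == 0:
        return sum(losses) / len(losses)
    scaled = [t * loss for loss in losses]
    shift = max(scaled)  # subtract the max for a numerically stable log-sum-exp
    mean_exp = sum(math.exp(s - shift) for s in scaled) / len(scaled)
    return (shift + math.log(mean_exp)) / t


def group_tilted_loss(groups, t_within, t_across):
    """Two-level tilt: tilt inside each group, then tilt across group risks.

    Groups are weighted equally here, which is a simplifying assumption.
    """
    group_risks = [tilted_loss(group, t_within) for group in groups]
    return tilted_loss(group_risks, t_across)


if __name__ == "__main__":
    per_group_losses = [[0.1, 0.2, 0.1], [1.5, 2.0]]  # hypothetical values
    print(group_tilted_loss(per_group_losses, t_within=0.0, t_across=5.0))
```

With a large positive across-group tilt the objective approaches the worst group's risk, which is the mechanism by which tilting emphasizes poorly served subgroups.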

Learning from the Unseen: Offline Reinforcement Learning with Hidden Actions

Ying Zhou, University of Connecticut

Abstract: Offline reinforcement learning typically assumes that actions in the dataset are observed without error. In many applications, however, the true actions may be unobserved and only noisy proxies are available, leading to bias in standard off-policy evaluation and potentially misleading conclusions. We study off-policy evaluation in infinite-horizon discounted Markov decision processes with hidden actions. By leveraging the next-state variable as a natural proxy for the unobserved action, we establish identification of the policy value and propose an influence function-based estimator LURE (Learning from the Unseen: Robust Estimator). The LURE estimator is multiply robust, remaining consistent under several combinations of correctly specified nuisance components, and is asymptotically normal, enabling valid inference. To our knowledge, this is the first work on offline reinforcement learning with hidden actions. Simulations and a sepsis management application using the MIMIC-III database show that LURE substantially reduces bias compared to baseline methods.

Session 9: Frontier Methods in Biostatistics

BayesRare: Bayesian Mixture Model for Population-Level Rare Cell Type Detection in Multi-Subject Single-Cell RNA Sequencing

Yinqiao Yan, Beijing University of Technology

Abstract: Rare cell types in single-cell RNA sequencing (scRNA-seq) data often encode essential biological signals, such as early disease markers or key immune regulators. With advancing technologies, large-scale scRNA-seq cohorts from multiple subjects now enable population-level analyses of the prevalence, heterogeneity, and disease associations of rare cell populations. However, existing methods for rare cell detection are typically limited to single datasets and cannot effectively leverage cross-subject information. To tackle this challenge, we present BayesRare, a hierarchical Bayesian framework for population-level rare cell discovery in multi-subject scRNA-seq data. The method augments a Bayesian mixture model with a rare cluster indicator, supporting joint cell-type clustering and rare-population identification. By explicitly characterizing the statistical properties of rare cell types, BayesRare integrates evidence across subjects, quantifies uncertainty via posterior probabilities, and enables inference of group-level differences (e.g., patients versus controls). Across synthetic and three real datasets, BayesRare achieves superior precision, reduces false positives, and uncovers biologically meaningful disease-specific rare subtypes. The R package of BayesRare is available at https://github.com/yinqiaoyan/BayesRare.

An Efficient Two-Dimensional Functional Mixed-Effect Model Framework for Wearable Device Data Analysis in Large Population Studies

Xinyue Li, City University of Hong Kong

Abstract: Recent advances in wearable device technology allow accelerometers to continuously record minute-by-minute physical activity over consecutive days, generating rich, densely sampled curves across a longitudinal design. Such repeatedly measured functional data exhibit complex interactions along two distinct axes: an intraday (functional) dimension capturing within-day activity patterns, and an interday (longitudinal) dimension reflecting how these patterns evolve across the week. Modeling this dual structure poses substantial methodological and computational challenges. In this talk, I will introduce a novel two-dimensional functional mixed-effect model (2dFMM) framework designed to characterize both longitudinal and functional cross-variability while incorporating two-dimensional fixed effects and a four-dimensional correlation structure. To address the computational burden inherent in large-scale wearable datasets, I will present a fast three-stage estimation procedure that delivers accurate fixed-effect inference and preserves model interpretability. Extensive simulation studies demonstrate that the proposed approach outperforms existing methods in both estimation accuracy and computational efficiency. Further application of 2dFMM to a large cohort of Shanghai school adolescents uncovers strong evidence of intraday- and interday-varying associations between physical activity and mental health outcomes. These findings offer actionable insights into intervention strategies targeting daily activity patterns to support adolescent mental health.

Joint Modeling of Longitudinal Biomarkers and Time-to-Event Data via Ordinary Differential Equations

Ziyang Gong, Southwestern University of Finance and Economics

Abstract: Joint modeling of longitudinal biomarker measurements and time-to-event outcomes is widely used in survival analysis. However, event risk may depend not only on the current biomarker level but also on its rate of change, or velocity. Motivated by post-transplant biomarker data from St. Jude Children's Research Hospital, we propose a joint model in which the underlying biomarker process follows a subject-specific second-order ordinary differential equation. The formulation treats biomarker level and velocity as a coupled dynamic state, allowing recovery, damping, overshoot, and oscillation to be represented through interpretable ODE parameters. The event hazard is modeled as a function of this state, so risk is linked directly to the evolving biomarker process. In simulations where the hazard depends on velocity, the proposed model estimates the velocity-hazard association with lower bias and more reliable interval coverage than spline-based joint models that include a slope effect. Applied to the St. Jude data, the model separates post-transplant risk into level- and velocity-associated components, showing how biomarker dynamics can contribute information beyond the current value alone.

LLM-Assisted Clinical Trial Emulation Using Diverse Private Data

Hao Mei, Renmin University of China

Abstract: Clinical trial emulation has emerged as an important approach in real-world drug research, enabling investigators to replicate the design and analysis of randomized controlled trials using observational data. However, traditional emulation typically relies on expert knowledge and extensive literature review to construct a hypothetical trial, a process that is often time-consuming and constrained by limited scalability. In this work, we develop a domain-specific large language model (LLM) trained with advanced direct preference optimization techniques to facilitate semi-automated trial emulation. Given the drugs or interventions of interest, the LLM generates a complete hypothetical trial design, including detailed inclusion and exclusion criteria, treatment allocation strategies, follow-up protocols, and outcome definitions. Building on this, the system leverages the LLM to align the hypothetical trial with the user’s diverse private datasets, producing tailored data extraction schemes that enable efficient retrieval of relevant patient cohorts and variables. This LLM-assisted framework significantly improves the efficiency and reduces the cost of conducting emulation studies for researchers with heterogeneous, privately held data sources, expanding the accessibility and scalability of real-world evidence generation.

Session 10: Trustworthy AI and Statistical Governance

Asymptotic Theory and Sequential Test for General Multi-Armed Bandit Process

Li Yang, Xi'an Jiaotong University

Abstract: Multi-armed bandit (MAB) processes constitute a foundational subclass of reinforcement learning problems and represent a central topic in statistical decision theory, but their use for simultaneous adaptive allocation and sequential testing is limited by the absence of asymptotic theory under non-i.i.d. sequences and sublinear information. To address this open challenge, we propose the Urn Bandit (UNB) process, which integrates the reinforcement mechanism of urn probabilistic models with MAB principles, ensuring almost sure convergence of resource allocation to optimal arms. We establish a joint functional central limit theorem (FCLT) for consistent estimators of expected rewards under non-i.i.d., non-sub-Gaussian and sublinear reward samples with pairwise correlations across arms. To overcome the limitations of existing methods that focus mainly on cumulative regret, we establish the asymptotic theory together with adaptive allocation, which supports powerful sequential tests such as arm comparison, A/B testing, and policy valuation. Simulation studies and real data analysis demonstrate that UNB maintains the statistical test performance of the equal randomization (ER) design while obtaining higher average rewards, as classical MAB processes do.

Generalized Boundary FDR Control under Arbitrary Dependence: An Approach on Closure Principle

Haojie Ren, Shanghai Jiao Tong University

Abstract: False discovery rate (FDR) is a cornerstone of modern multiple testing. However, it often fails to guarantee the reliability of “marginal” discoveries that lie at the boundary of the rejection set, which are often crucial in high-precision applications. While recent works (Soloff et al., 2024; Xiang et al., 2025) introduced the boundary false discovery rate (bFDR) to control the error probability at the marginal discovery, these methods rely on restrictive assumptions such as independence or specific prior distributions. In this paper, we first propose k-bFDR, a novel generalization that controls the error probability of the k least significant discoveries. We then provide a systematic investigation into the theoretical relationship between k-bFDR and existing error metrics. Furthermore, building upon the closure principle, we develop Domino, a unified framework that guarantees k-bFDR control under arbitrary dependence, applicable to both p-values and e-values. We prove the theoretical validity of the proposed Domino algorithm and demonstrate through extensive numerical experiments that it consistently achieves rigorous k-bFDR control while identifying trustworthy marginal discoveries. Analyses of real data reveal that k-bFDR control yields higher-quality rejection sets with greater practical significance.

Online Differentially Private Inference with Streaming Data

Jinhan Xie, Yunnan University

Abstract: In this talk, we present a general privacy-preserving optimization-based framework for statistical inference in real-time environments. We first consider online settings in which observations arrive sequentially, and develop a noisy stochastic gradient descent algorithm under local differential privacy. We then introduce an online federated learning framework including synchronous and asynchronous scenarios, where data remain distributed across clients and are generated over time. Our proposed algorithms are one-pass, depending only on the current data and the previous estimate, which effectively reduces both time and space complexity. To construct private confidence intervals efficiently in an online manner, two methods are proposed: private plug-in and random scaling. We also establish the convergence rates and functional central limit theorems for the proposed estimators, providing a theoretical foundation for our online inference tools. Numerical experiments demonstrate the finite-sample performance of our proposed procedures, underscoring their efficacy and reliability.
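
The one-pass noisy update described in the abstract above can be illustrated with a minimal sketch for the simplest case, estimating a scalar mean. This is not the speaker's algorithm: the function names, the squared-error loss, the clipping bound `clip`, the Laplace mechanism, and the step-size schedule are all assumptions chosen for illustration. Each observation's gradient is clipped to [-clip, clip] and perturbed with Laplace noise of scale 2 * clip / epsilon before it is used, and the update depends only on the current observation and the previous estimate.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def ldp_sgd_mean(stream, epsilon, clip=1.0, lr0=1.0, decay=0.6, seed=0):
    """One-pass noisy SGD for a scalar mean (illustrative sketch).

    For each observation x, the gradient of 0.5 * (theta - x)**2 is
    clipped to [-clip, clip], so its sensitivity in x is at most 2 * clip,
    and Laplace noise with scale 2 * clip / epsilon is added.
    """
    rng = random.Random(seed)
    theta = 0.0
    for step, x in enumerate(stream, start=1):
        grad = max(-clip, min(clip, theta - x))        # clipped gradient
        noisy_grad = grad + laplace_noise(2.0 * clip / epsilon, rng)
        theta -= (lr0 / step ** decay) * noisy_grad    # decaying step size
    return theta


if __name__ == "__main__":
    data_rng = random.Random(1)
    observations = (data_rng.gauss(2.0, 1.0) for _ in range(20000))
    print(ldp_sgd_mean(observations, epsilon=1.0))
```

Because the stream is consumed by a single loop that stores only `theta`, memory use does not grow with the number of observations, which is the space saving the abstract attributes to one-pass algorithms.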

Neural Wasserstein Two-Sample Tests

Xiaoyu Hu, Xi'an Jiaotong University

Abstract: The two-sample homogeneity testing problem is fundamental in statistics and becomes particularly challenging in high dimensions, where classical tests can suffer substantial power loss. We develop a learning-assisted procedure based on the projection 1-Wasserstein distance, which we call the neural Wasserstein test. The method is motivated by the observation that there often exists a low-dimensional projection under which the two high-dimensional distributions differ. In practice, we learn the projection directions via manifold optimization and a witness function using deep neural networks. To adapt to unknown projection dimensions and sparsity levels, we aggregate a collection of candidate statistics through a max-type construction, avoiding explicit tuning while potentially improving power. We establish the validity and consistency of the proposed test and prove a Berry–Esseen type bound for the Gaussian approximation. In particular, under the null hypothesis, the aggregated statistic converges to the absolute maximum of a standard Gaussian vector, yielding an asymptotically pivotal (distribution-free) calibration that bypasses resampling. Simulation studies and a real-data example demonstrate the strong finite-sample performance of the proposed method.
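
The pivotal calibration mentioned at the end of the abstract above can be illustrated with a short sketch. It assumes the limit is the maximum of |Z_1|, ..., |Z_m| for m independent standard normal coordinates, where m (the number of aggregated candidate statistics) and the function name are hypothetical. Under that reading, the level-alpha critical value c solves (2 * Phi(c) - 1)^m = 1 - alpha and has a closed form, so no resampling is needed.

```python
from statistics import NormalDist


def max_abs_gaussian_critical_value(m, alpha=0.05):
    """Upper-alpha critical value of max_j |Z_j| for m iid N(0, 1) coordinates.

    Solves (2 * Phi(c) - 1) ** m = 1 - alpha for c.
    """
    if m < 1:
        raise ValueError("m must be a positive integer")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    per_coordinate = (1.0 - alpha) ** (1.0 / m)  # required P(|Z_j| <= c)
    return NormalDist().inv_cdf((1.0 + per_coordinate) / 2.0)


if __name__ == "__main__":
    for m in (1, 5, 20):
        print(m, round(max_abs_gaussian_critical_value(m), 4))
```

A test of this form would reject when the aggregated statistic exceeds the returned value; the threshold grows slowly with m, which is the price of aggregating more candidates.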