Abstract: This paper presents a framework for causal inference in the presence of censored data, where the failure time is marked by a continuous variable referred to as a mark. The mark is observed after treatment and is not meaningful when the failure time is censored. In addition, due to the continuous nature of the marks, observations at each given mark are sparse. These facts make the identification and estimation of causality a challenging task. To address these issues, we define a new mark-specific treatment effect within the potential outcomes framework and characterize its identifying conditions. We then propose a local smoothing estimator for the causal effects and establish its asymptotic properties. We further develop testing methods to evaluate whether the treatment has an effect on the failure time when controlling the values of the mark at certain points or within a defined interval, and develop a Gaussian approximation method to obtain the critical values. We evaluate our method using simulation studies as well as a real dataset from the Antibody Mediated Prevention trials.
Abstract: When studying microbiome-metabolome association, where microbiome data serve as predictors and metabolites as outcomes, several major analytical challenges arise. One key issue is the presence of zero-inflation in microbiome data. It is essential to distinguish between structural zeros (true absence of the taxa) and random zeros (due to sampling variability), as they may have different relationships with the outcome. Failing to make this distinction will lead to biased inference. Another critical challenge is appropriately handling censored metabolomics data due to detection limits. Common approaches such as deleting censored observations or substituting them with the detection limit can lead to biased estimates. The problem is compounded when censored values stem from both the exposed population, where the metabolite is present but falls below the detection limit, and the unexposed population, where the metabolite is truly absent. To address these challenges, we propose a joint modeling framework that explicitly separates the effects of structural and random zeros in microbiome predictors while accounting for censoring in metabolomic outcomes. This integrated approach improves the accuracy and efficiency of association estimates and enhances the validity of inference. We also evaluate the proposed approach through extensive simulations and apply it to examine the association between the microbiome and the metabolome in the Bogalusa Heart Study.
Abstract: This article introduces a novel method for distributed, differentially private group inference in the high-dimensional generalized linear model. We consider a setting with an untrusted server and a group of trusted, data-holding clients. Each client first constructs a local nonprivate test statistic based on a weighted quadratic function of the regression sub-vector corresponding to the group, using debiasing and re-weighting techniques. We then formulate our global differentially private test statistic by using the one-shot method, the Gaussian mechanism, and encryption and decryption procedures. Our proposed approach offers several key advantages. By operating under the untrusted server setting, our method resolves the critical weaknesses endemic to trust-based architectures, namely, systemic fragility and exposure to adversarial or governmental targeting. Moreover, the approach incorporates a bounded encryption procedure to ensure secure communication, eliminate the risk of the server actively leaking data, enable easy computation of the estimator’s sensitivity, and circumvent the computation of intractable sensitivities of complex statistics. Furthermore, our proposed method is capable of handling highly correlated covariates and preserving high power for identifying dense but weak signals. Unlike conventional methods, it also avoids the need to handle the Hessian and precision matrices. Simulation studies are carried out to examine the finite-sample behaviour of the proposed method. An application to an adult income dataset is provided.
Abstract: Conditional statistical inference for high-dimensional survival data remains a fundamental yet challenging problem, particularly when the effects of nuisance variables are complex and difficult to specify parametrically. In this paper we propose a Deep Partial Linear Cox model that leverages neural networks to flexibly capture nonlinear nuisance effects while preserving interpretability for variables of primary interest. To test the conditional significance of high-dimensional variable sets within this framework, we develop an orthogonalized score test that effectively removes the influence of estimated nuisance components, thereby achieving valid inference in the presence of complex data dependencies. Our method accommodates settings where the number of tested parameters exceeds the sample size, without imposing sparsity assumptions. We establish the limiting null distribution of the proposed test statistic. Extensive numerical studies and an application to the TCGA breast cancer dataset demonstrate the superior performance and practical utility of our approach.
Abstract: Two-time-scale stochastic approximation is a variant of the classic stochastic approximation (SA), devised to find the roots of systems with two interconnected equations based on noisy observations. In this approach, two iterates are updated at varying speeds using different step sizes, with each update influencing the other. Previous studies have shown that, under suitable conditions, the convergence rates of the fast and slow iterates are determined only by their own step sizes. This phenomenon is termed decoupled convergence. However, existing results mainly concern convergence properties at a single iteration index, while a trajectory-level counterpart of decoupled convergence remains largely unexplored. In this talk, we establish decoupled functional central limit theorems (FCLTs) for the trajectories of two-time-scale SA. Our results characterize how the coupling between the fast and slow updates influences their asymptotic trajectory behavior. In particular, the limiting process on the fast time scale coincides with that of standard SA, whereas the limiting process on the slow time scale can be interpreted as the limit of SA applied to a composed operator with a modified noise term. We further discuss the implications of our FCLTs, including simultaneous convergence over a range of iterates and possible insights for algorithm design.
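Illustrative sketch (not taken from the talk): the two-time-scale update described above can be tried on a toy linear system. The drift functions, the noise model, and the step-size exponents below are assumptions chosen only so that the coupled system has the root x = y = 0.5; they are not the setting analyzed in the abstract.

```python
import numpy as np

def two_time_scale_sa(n_iter=50_000, seed=0):
    """Toy two-time-scale SA on a hypothetical linear system.

    Slow equation: g(x, y) = 1 - x - y = 0
    Fast equation: h(x, y) = x - y = 0
    The joint root is x = y = 0.5.
    """
    rng = np.random.default_rng(seed)
    noise = rng.normal(size=(n_iter, 2))  # additive observation noise (assumed i.i.d.)
    x, y = 0.0, 0.0
    for n in range(n_iter):
        alpha = 1.0 / (n + 1)             # slow step size
        beta = 1.0 / (n + 1) ** (2 / 3)   # fast step size, decays more slowly
        g = 1.0 - x - y                   # noiseless slow drift
        h = x - y                         # noiseless fast drift
        # Both iterates are updated from the same previous pair (x, y),
        # so each update influences the other at the next step.
        x, y = x + alpha * (g + noise[n, 0]), y + beta * (h + noise[n, 1])
    return x, y

x_hat, y_hat = two_time_scale_sa()
print(x_hat, y_hat)
```

The fast iterate y uses the larger step size and tracks the solution of the fast equation for the current x, while x moves on the slower schedule.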
Abstract: We study design-based causal inference for edge-level outcomes in directed networks under dyadic interference. In this setting, outcomes are defined on directed edges and depend on the joint treatment assignments of pairs of units, inducing a complex dependence structure that invalidates standard estimation and inference procedures developed for node-level data. We construct Horvitz--Thompson estimators for a general class of edge-level causal effects and establish their asymptotic normality under mild regularity conditions. To enable valid inference, we develop variance estimators that exploit identifiable components of network dependence, yielding substantially less conservative bounds than classical approaches. To improve efficiency, we incorporate auxiliary covariates through a sample splitting and cross-fitting procedure. A key technical challenge is that standard two-fold sample splitting fails in the presence of edge-level outcomes due to the dependence induced by shared units. To address this issue, we introduce a three-fold sample splitting and cross-fitting scheme that restores the conditional independence required for unbiased estimation. Under a stability condition, the resulting covariate-adjusted estimator is asymptotically normal and accommodates both linear adjustment and flexible machine learning methods. We further introduce a calibration step that guarantees no asymptotic efficiency loss relative to the unadjusted estimator. Simulation studies and a real-data application confirm the theoretical results and demonstrate substantial efficiency gains.
Abstract: Transfer learning is a promising technique for enhancing the performance of a target task by leveraging information from source data. Most transfer learning methods are designed for homogeneous feature spaces, where the target and source domains share the same feature space. However, heterogeneous feature spaces, where the target and sources have different feature spaces, are commonly encountered yet less investigated. In this paper, we focus on linear models in which the feature spaces of the sources are subsets of that of the target, considering both low- and high-dimensional settings. An importance weighting transfer learning method combined with projection-based imputation is proposed to effectively utilize source information. Our method includes a sample selection process to reduce the risk of negative transfer and is less sensitive to the choice of the projection matrix used in the imputation step. Both entry-wise and global convergence rates of our estimator are established. Numerical results and real data analysis confirm the effectiveness of our proposed method.
Abstract: A central goal of feature screening in high-dimensional settings is to rank predictors by an importance measure that quantifies their association with the response. Marginal screening methods ignore dependence among predictors, whereas forward screening partially mitigates this issue by conditioning on a selected subset. However, forward screening discards the remaining predictors and typically requires an expanding conditioning set as the iterations proceed. To overcome these limitations, we propose a systematic screening framework via weighted subspace iteration from a fixed-point perspective. The framework aggregates information across predictors while keeping the conditioning variables low-dimensional. Unlike existing one-shot or randomly constructed subspace methods, our framework uses screening feedback to iteratively refine the conditioning subspace and ultimately reaches a fixed point. The proposed framework applies to both linear and nonlinear settings and can be paired with a broad class of association measures. It is shown to achieve rank consistency under mild conditions. Extensive simulations demonstrate that the framework outperforms existing approaches, and its practical advantages are further illustrated through a real data example.
Abstract: Data in various domains, such as neuroimaging and network data analysis, often come in complex forms without possessing a Hilbert structure. The complexity necessitates innovative approaches for effective analysis. We propose a novel measure of heterogeneity, ball impurity, which is designed to work with complex non-Euclidean objects. Our approach extends the notion of impurity to general metric spaces, providing a versatile tool for feature selection and tree models. The ball impurity measure exhibits desirable properties, such as the triangular inequality, and is computationally tractable, enhancing its practicality and usefulness. Extensive experiments on synthetic data and real data from the UK Biobank validate the efficacy of our approach in capturing data heterogeneity. Remarkably, our results compare favorably with state-of-the-art methods in metric spaces, highlighting the potential of ball impurity as a valuable tool for addressing complex data analysis tasks.
Abstract: Low-rank adaptation (LoRA) has emerged as a powerful tool for parameter-efficient fine-tuning of large language models (LLMs). This paper studies LoRA under a federated learning setting, enabling collaborative fine-tuning across clients while preserving parameter efficiency. We focus on a highly heterogeneous regime in which clients share only partial structure and a substantial subset may be contaminated. We propose Collaborative Low-rank Alignment and Identifiable Recovery (CLAIR), a contamination-aware framework that relies only on preliminary local estimators. Its formulation applies broadly, from linear regression to neural network and LLM modules, whenever local adaptation can be represented by matrix-valued updates. CLAIR recovers the shared LoRA subspace and detects contaminated clients via a structured low-rank plus block-sparse decomposition. We prove exact recovery of the shared LoRA subspace in the noiseless case, stable recovery under preliminary estimation error, and consistent collaborative-set recovery under mild separation conditions. We further quantify the gain from CLAIR refinement: it reduces off-subspace estimation error through cross-client averaging while preserving client-specific variation within the shared LoRA subspace, and thus improves over local fine-tuning whenever this oracle gain outweighs the costs of subspace estimation and benign-client heterogeneity. Empirically, we demonstrate the benefits of CLAIR by fine-tuning a Transformer architecture on a text-copying task. The results show accurate contamination detection and improved benign-client performance compared with local fine-tuning and non-robust federated averaging.
Abstract: Multi-response models with group structures and hidden variables are prevalent in complex data analysis. However, existing methods often fall short in capturing the intricate interactions between observed and unobserved components. In this paper, we introduce Hidden Block Regression (HBR), a novel method that provides a unified framework for modeling grouped structures influenced by unobservable factors. By integrating hidden block detection with group structure preservation, HBR offers a flexible strategy that generalizes to various forms of hidden factor involvement, enabling accurate prediction and variable selection. We derive theoretical deviation bounds for the HBR estimator, demonstrating its capacity to identify hidden variables while maintaining essential group structure information. Extensive simulations reveal that HBR consistently outperforms existing methods, particularly in scenarios involving complex interactions between observed and latent factors. A case study on a yeast dataset underscores its practical utility in identifying key predictors in such contexts.
Abstract: Financial analysts increasingly communicate through video, yet statistical models rarely use non-verbal behavior because raw pose trajectories are confounded by camera angle, scale, occlusion, and individual heterogeneity. We propose a geometry-aware statistical framework for constructing invariant and interpretable features from hand-gesture trajectories. The framework treats a gesture window as a landmark trajectory observed up to nuisance transformations, removes translation and scale, aligns rotations through Procrustes geometry, and learns a low-dimensional embedding that approximately preserves the resulting shape distances. Segment-level summaries of the embedding---gesture entropy and kinematic instability---are then used as generated covariates in a predictive analysis of sector-level returns. In a large corpus of financial video broadcasts, coordinate-based measures of physical movement are not statistically informative after adjustment for text, vocal, and market variables, whereas the proposed shape-based summaries show detectable conditional associations. These associations are most visible in segments with neutral textual attitude, suggesting that invariant gesture features may contain behavioral information not captured by transcripts alone. The empirical findings should be interpreted as predictive associations rather than causal evidence of psychological states. More broadly, the paper illustrates how nuisance-invariant feature construction can make high-dimensional behavioral video data usable in statistical modeling.
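Illustrative sketch (not the authors' pipeline): the nuisance-removal steps named above, removing translation and scale and then aligning rotations through Procrustes geometry, can be written for a single pair of landmark configurations as follows. The function names `preshape` and `procrustes_distance` and the use of plain k-by-d landmark arrays are assumptions made for this example.

```python
import numpy as np

def preshape(X):
    """Remove translation (center at the origin) and scale (unit Frobenius norm)."""
    X = np.asarray(X, dtype=float)
    X = X - X.mean(axis=0)
    return X / np.linalg.norm(X)

def procrustes_distance(X, Y):
    """Distance between two k-by-d landmark sets after removing
    translation, scale, and rotation (reflections are not allowed)."""
    A, B = preshape(X), preshape(Y)
    # The orthogonal R minimizing ||B R - A||_F is U V^T, where B^T A = U S V^T.
    U, _, Vt = np.linalg.svd(B.T @ A)
    R = U @ Vt
    if np.linalg.det(R) < 0:
        # Flip the direction of the smallest singular value to get a proper rotation.
        U[:, -1] *= -1
        R = U @ Vt
    return np.linalg.norm(B @ R - A)

# Usage: a rotated, rescaled, and shifted copy of a shape is at distance ~0.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))
c, s = np.cos(0.7), np.sin(0.7)
Q = np.array([[c, -s], [s, c]])
Y = 3.0 * X @ Q + np.array([2.0, -1.0])
print(procrustes_distance(X, Y))
```

A low-dimensional embedding such as the one described in the abstract would then be fitted to the matrix of pairwise distances of this kind; that step is not shown here.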
Abstract: We propose Regression Model with Sparse Group Lasso (RM-SPAGL), a high-dimensional portfolio selection method that combines the unconstrained regression formulation of mean-variance optimization with Sparse Group Lasso regularization and a factor-based covariance structure. The procedure incorporates industry grouping, allows simultaneous group and within-group sparsity, and selects the tuning parameter by risk-constrained cross-validation to target a prespecified portfolio risk level. Under standard regularity conditions, we establish convergence of the resulting portfolio return to its theoretical target at a rate determined by both elementwise and group-level sparsity. In simulations, RM-SPAGL attains Sharpe ratios close to the theoretical benchmark, keeps realized risk near the target level, and selects fewer than 0.02% white noise assets. Out-of-sample analyses using S&P 500 constituents and the Chinese A-share market further show comparable risk-adjusted performance and lower turnover than several benchmark procedures. The evidence suggests that exploiting group structure can improve the stability and interpretability of high-dimensional portfolio selection.
Abstract: In many regression problems, predictors can influence the response through both dominant linear effects and more complex nonlinear deviations. A central challenge is that nonparametric regression in high dimensions is severely limited by the curse of dimensionality, while common feature screening and selection strategies typically rely on strong sparsity assumptions on the full set of predictors. We propose Residual-Driven Hybrid Regression (RDHR), a two-stage framework for high-dimensional prediction that mitigates both difficulties by separating linear and nonlinear learning. In the first stage, RDHR fits a conventional linear learner to capture the dominant linear signal and form residuals; importantly, this step does not require sparsity of the original predictors. In the second stage, RDHR performs residual-based feature screening to identify a small subset of predictors with remaining nonlinear association, and then applies nonparametric regression to the residuals to refine predictions. This design reduces the effective dimension of the nonparametric task and thereby alleviates the curse of dimensionality, while retaining interpretability and parsimony through variable selection and a mild additive structure in the refinement stage. We establish theoretical guarantees for the second stage, including a sure screening property and consistency with explicit convergence rates, which together imply controlled prediction error. Extensive simulations across diverse correlation structures, dimensional regimes, sample sizes, and outcome types demonstrate that RDHR improves predictive accuracy and yields stable variable identification relative to competing approaches. An application to real data further illustrates the practical effectiveness and flexibility of RDHR.
Abstract: Machine learning models used in high-stakes settings, such as recidivism prediction and hiring, may exhibit substantial disparities across sensitive groups. Fairness auditing seeks to address this issue through two main tasks: certification, which tests whether a model satisfies a fairness criterion, and flagging, which identifies subpopulations experiencing unfair treatment. Existing methods often rely on restrictive assumptions, are computationally expensive, or mainly focus on discrete protected attributes. In this talk, I will present two statistical frameworks for fairness auditing. The first approach is based on empirical likelihood. It is computationally efficient, supported by asymptotic theory for valid inference, and enables both fairness certification and subgroup discovery. The second approach is designed for settings with continuous protected attributes, where standard methods may fail to identify unfair intervals. This framework applies nonparametric techniques to estimate disparity as a function of the protected attribute, based on which we design a test with valid size control. Applications to the COMPAS dataset show that these methods can reveal both intersectional and age-related disparities that may be missed by standard approaches.
Abstract: Multi-armed bandit (MAB) processes constitute a foundational subclass of reinforcement learning problems and represent a central topic in statistical decision theory, but their use for simultaneous adaptive allocation and sequential testing is limited by the absence of asymptotic theory under non-i.i.d. sequences and sublinear information. To address this open challenge, we propose the Urn Bandit (UNB) process, which integrates the reinforcement mechanism of urn probabilistic models with MAB principles, ensuring almost sure convergence of resource allocation to optimal arms. We establish the joint functional central limit theorem (FCLT) for consistent estimators of expected rewards under non-i.i.d., non-sub-Gaussian, and sublinear reward samples with pairwise correlations across arms. To overcome the limitations of existing methods that focus mainly on cumulative regret, we establish the asymptotic theory along with adaptive allocation that supports powerful sequential tests, such as arm comparison, A/B testing, and policy valuation. Simulation studies and real data analysis demonstrate that UNB maintains the statistical test performance of the equal randomization (ER) design while obtaining higher average rewards, as classical MAB processes do.
Abstract: False discovery rate (FDR) is a cornerstone of modern multiple testing. However, it often fails to guarantee the reliability of “marginal” discoveries that lie at the boundary of the rejection set, which are often crucial in high-precision applications. While recent works (Soloff et al., 2024; Xiang et al., 2025) introduced the boundary false discovery rate (bFDR) to control the error probability at the marginal discovery, their method relies on restrictive assumptions such as independence or specific prior distributions. In this paper, we first propose k-bFDR, a novel generalization that controls the error probability of the k least significant discoveries. We then provide a systematic investigation into the theoretical relationship between k-bFDR and existing error metrics. Furthermore, building upon the closure principle, we develop Domino, a unified framework that guarantees k-bFDR control under arbitrary dependence, applicable for both p-values and e-values. We prove the theoretical validity of the proposed Domino algorithm and demonstrate through extensive numerical experiments that it consistently achieves rigorous k-bFDR control while identifying trustworthy marginal discoveries. Analyses of real data reveal that k-bFDR control yields higher-quality rejection sets with greater practical significance.
Abstract: In this talk, we present a general privacy-preserving optimization-based framework for statistical inference in real-time environments. We first consider online settings in which observations arrive sequentially, and develop a noisy stochastic gradient descent algorithm under local differential privacy. We then introduce an online federated learning framework including synchronous and asynchronous scenarios, where data remain distributed across clients and are generated over time. Our proposed algorithms are one-pass, depending only on the current data and the previous estimate, which effectively reduces both time and space complexity. To construct private confidence intervals efficiently in an online manner, two methods are proposed: private plug-in and random scaling. We also establish the convergence rates and functional central limit theorems for the proposed estimators, providing a theoretical foundation for our online inference tools. Numerical experiments demonstrate the finite-sample performance of our proposed procedures, underscoring their efficacy and reliability.
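Illustrative sketch (not the speakers' algorithm): a one-pass noisy SGD update under local differential privacy can be tried on scalar mean estimation. The abstract does not specify the privacy mechanism. This example assumes gradient clipping followed by the Laplace mechanism, a polynomially decaying step size, and Polyak averaging; the name `ldp_online_mean` and all constants are hypothetical.

```python
import numpy as np

def ldp_online_mean(data, epsilon=1.0, clip=2.0, seed=0):
    """One-pass noisy SGD for a location parameter with per-sample Laplace noise.

    Each observation contributes one clipped gradient. A clipped scalar
    gradient lies in [-clip, clip], so its sensitivity is 2 * clip, and
    Laplace noise with scale 2 * clip / epsilon makes each released
    gradient epsilon-locally differentially private.
    """
    rng = np.random.default_rng(seed)
    lap = rng.laplace(scale=2.0 * clip / epsilon, size=len(data))
    theta, theta_bar = 0.0, 0.0
    for n, x in enumerate(data):
        grad = np.clip(theta - x, -clip, clip)      # gradient of 0.5 * (theta - x)^2, clipped
        eta = 0.5 / (n + 1) ** 0.6                  # assumed step-size schedule
        theta -= eta * (grad + lap[n])              # uses only the current point and previous estimate
        theta_bar += (theta - theta_bar) / (n + 1)  # running (Polyak) average
    return theta_bar

rng = np.random.default_rng(1)
stream = rng.normal(loc=1.0, scale=1.0, size=100_000)
print(ldp_online_mean(stream))
```

Because the loop stores only the current estimate and its running average, memory use does not grow with the length of the stream, which is the one-pass property the abstract refers to. Clipping changes the target in general; here the data are symmetric about their mean, so the clipped gradient still has its root at the mean.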
Abstract: The two-sample homogeneity testing problem is fundamental in statistics and becomes particularly challenging in high dimensions, where classical tests can suffer substantial power loss. We develop a learning-assisted procedure based on the projection 1-Wasserstein distance, which we call the neural Wasserstein test. The method is motivated by the observation that there often exists a low-dimensional projection under which the two high-dimensional distributions differ. In practice, we learn the projection directions via manifold optimization and a witness function using deep neural networks. To adapt to unknown projection dimensions and sparsity levels, we aggregate a collection of candidate statistics through a max-type construction, avoiding explicit tuning while potentially improving power. We establish the validity and consistency of the proposed test and prove a Berry–Esseen type bound for the Gaussian approximation. In particular, under the null hypothesis, the aggregated statistic converges to the absolute maximum of a standard Gaussian vector, yielding an asymptotically pivotal (distribution-free) calibration that bypasses resampling. Simulation studies and a real-data example demonstrate the strong finite-sample performance of the proposed method.
Abstract: Alzheimer’s disease (AD) is a progressive neurodegenerative disorder with complex genetic, cognitive, and functional determinants. Given the limited effectiveness of current treatments, accurate prediction of AD progression is important for early prevention, individualized monitoring, and timely intervention. However, prediction is challenging because AD onset is often interval censored due to intermittent clinical assessments, and available predictors may include high-dimensional genetic variants and repeatedly measured cognitive and functional scores with nonlinear effects. Motivated by data from all four phases of the Alzheimer’s Disease Neuroimaging Initiative (ADNI), this talk presents neural-network-based methods for interval-censored data. For large-scale genetic data, we develop a semiparametric transformation model with sieve estimation and a computationally efficient generalized score test to identify variants associated with AD progression. Selected variants are then used to construct a neural network prediction model for interval-censored outcomes. We further extend the framework to dynamic prediction by integrating multivariate functional principal component analysis with neural networks, allowing longitudinal cognitive and functional scores to update individualized AD risk predictions at each follow-up visit. Simulation studies and ADNI applications show that the proposed methods improve prediction accuracy over existing approaches and identify subgroups with distinct progression risk profiles. An online dynamic prediction platform has also been developed at http://olap.ruc.edu.cn.
Abstract: Randomized clinical trials are the gold standard for analyzing treatment effects, but high costs and ethical concerns can limit recruitment, potentially leading to invalid inferences. Incorporating external trial data with similar characteristics into the analysis using transfer learning appears promising for addressing these issues. In this paper, we present a formal framework for applying transfer learning to the analysis of clinical trials, considering three key perspectives: transfer algorithm, theoretical foundation, and inference method. For the algorithm, we adopt a parameter-based transfer learning approach to enhance the lasso-adjusted stratum-specific estimator developed for estimating treatment effects. A key component in constructing the transfer learning estimator is deriving the regression coefficient estimates within each stratum, accounting for the bias between source and target data. To provide a theoretical foundation, we derive the l1 convergence rate for the estimated regression coefficients and establish the asymptotic normality of the transfer learning estimator. Our results show that when external trial data resembles current trial data, the sample size requirements can be reduced compared to using only the current trial data. Finally, we propose a consistent nonparametric variance estimator to facilitate inference. Numerical studies demonstrate the effectiveness and robustness of our proposed estimator across various scenarios.
Abstract: Fundamental research in deep learning constitutes a paramount field of study, within which the phenomenon-driven approach has emerged as a crucial methodology. By discovering and systematically exploring significant phenomena, fundamental research can more effectively facilitate the comprehension and practical application of neural networks. This report will introduce several key phenomena, including the Frequency Principle and the Condensation Phenomenon, to elucidate the underlying mechanisms of deep learning. The discussion will span from the learning dynamics of elementary perceptrons to the memory and reasoning capabilities of large-scale models. (Reference: "An Introduction to Deep Learning Phenomena")
Abstract: In this talk, we investigate how mild over-parameterization enhances gradient descent methods for tensor principal component analysis (Tensor PCA). Traditional gradient methods typically require an enormous number of samples to recover a hidden signal. We show that introducing a modest amount of over-parameterization can dramatically reduce sample complexity without relying on expensive initialization procedures, thereby significantly narrowing the statistical-to-computational gap. Specifically, in the symmetric setting, our proposed normalized stochastic gradient ascent method breaks the previous belief that gradient-based algorithms need a prohibitive number of samples, achieving the best-known statistical efficiency for polynomial algorithms under suitable settings. In the asymmetric setting, under a limited memory budget, we show that mild over-parameterization not only improves sample efficiency but also adapts naturally to the underlying problem structure, further reducing the required sample size when signals are more aligned. Together, these results provide evidence that mild over-parameterization offers both optimization and statistical benefits, effectively bridging the gap between what is statistically possible and what is computationally tractable.
Abstract: Implicit neural networks define outputs through fixed points, equilibria, or continuous dynamics, offering a flexible alternative to conventional finite-depth architectures. Yet their implicit nature also makes them difficult to analyze and expensive to evaluate. This talk presents a unified perspective on making implicit neural networks explicit, with a focus on Deep Equilibrium Models. I will discuss how their structure can be understood through explicit theoretical characterizations, and how their inference can be accelerated by exploiting the underlying dynamics of fixed-point computation.
Abstract: Rare cell types in single-cell RNA sequencing (scRNA-seq) data often encode essential biological signals, such as early disease markers or key immune regulators. With advancing technologies, large-scale scRNA-seq cohorts from multiple subjects now enable population-level analyses of the prevalence, heterogeneity, and disease associations of rare cell populations. However, existing methods for rare cell detection are typically limited to single datasets and cannot effectively leverage cross-subject information. To tackle this challenge, we present BayesRare, a hierarchical Bayesian framework for population-level rare cell discovery in multi-subject scRNA-seq data. The method augments a Bayesian mixture model with a rare cluster indicator, supporting joint cell-type clustering and rare-population identification. By explicitly characterizing the statistical properties of rare cell types, BayesRare integrates evidence across subjects, quantifies uncertainty via posterior probabilities, and enables inference of group-level differences (e.g., patients versus controls). Across synthetic and three real datasets, BayesRare achieves superior precision, reduces false positives, and uncovers biologically meaningful disease-specific rare subtypes. The R package of BayesRare is available at https://github.com/yinqiaoyan/BayesRare.
Abstract: Recent advances in wearable device technology allow accelerometers to continuously record minute-by-minute physical activity over consecutive days, generating rich, densely sampled curves across a longitudinal design. Such repeatedly measured functional data exhibit complex interactions along two distinct axes: an intraday (functional) dimension capturing within-day activity patterns, and an interday (longitudinal) dimension reflecting how these patterns evolve across the week. Modeling this dual structure poses substantial methodological and computational challenges. In this talk, I will introduce a novel two-dimensional functional mixed-effect model (2dFMM) framework designed to characterize both longitudinal and functional cross-variability while incorporating two-dimensional fixed effects and a four-dimensional correlation structure. To address the computational burden inherent in large-scale wearable datasets, I will present a fast three-stage estimation procedure that delivers accurate fixed-effect inference and preserves model interpretability. Extensive simulation studies demonstrate that the proposed approach outperforms existing methods in both estimation accuracy and computational efficiency. Further application of 2dFMM to a large cohort of Shanghai school adolescents uncovers strong evidence of intraday- and interday-varying associations between physical activity and mental health outcomes. These findings offer actionable insights into intervention strategies targeting daily activity patterns to support adolescent mental health.
Abstract: Joint modeling of longitudinal biomarker measurements and time-to-event outcomes is widely used in survival analysis. However, event risk may depend not only on the current biomarker level but also on its rate of change, or velocity. Motivated by post-transplant biomarker data from St. Jude Children's Research Hospital, we propose a joint model in which the underlying biomarker process follows a subject-specific second-order ordinary differential equation. The formulation treats biomarker level and velocity as a coupled dynamic state, allowing recovery, damping, overshoot, and oscillation to be represented through interpretable ODE parameters. The event hazard is modeled as a function of this state, so risk is linked directly to the evolving biomarker process. In simulations where the hazard depends on velocity, the proposed model estimates the velocity-hazard association with lower bias and more reliable interval coverage than spline-based joint models that include a slope effect. Applied to the St. Jude data, the model separates post-transplant risk into level- and velocity-associated components, showing how biomarker dynamics can contribute information beyond the current value alone.
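To make the level-and-velocity state concrete, here is a minimal sketch under stated assumptions. The abstract does not give the ODE, so the damped linear form x'' + 2*zeta*omega*x' + omega^2*(x - mu) = 0, the log-linear hazard h0*exp(alpha*x + gamma*v), and all function and parameter names below are hypothetical choices for illustration, not the authors' model.

```python
import math


def simulate_biomarker(x0, v0, mu, omega, zeta, t_end, dt=0.01):
    """Integrate x'' + 2*zeta*omega*x' + omega**2*(x - mu) = 0 with classical RK4.

    Assumed illustrative dynamics: mu is the recovery level, omega the natural
    frequency, zeta the damping ratio (zeta < 1 gives overshoot and
    oscillation, zeta >= 1 gives monotone recovery from rest).
    Returns a list of (t, level, velocity) tuples.
    """
    def deriv(x, v):
        # First-order system: d(level)/dt = velocity, d(velocity)/dt = acceleration.
        return v, -2.0 * zeta * omega * v - omega ** 2 * (x - mu)

    n_steps = int(round(t_end / dt))
    x, v = x0, v0
    path = [(0.0, x, v)]
    for i in range(n_steps):
        k1x, k1v = deriv(x, v)
        k2x, k2v = deriv(x + 0.5 * dt * k1x, v + 0.5 * dt * k1v)
        k3x, k3v = deriv(x + 0.5 * dt * k2x, v + 0.5 * dt * k2v)
        k4x, k4v = deriv(x + dt * k3x, v + dt * k3v)
        x += dt * (k1x + 2.0 * k2x + 2.0 * k3x + k4x) / 6.0
        v += dt * (k1v + 2.0 * k2v + 2.0 * k3v + k4v) / 6.0
        path.append(((i + 1) * dt, x, v))
    return path


def hazard(level, velocity, h0, alpha, gamma):
    """Assumed log-linear hazard in the (level, velocity) state."""
    return h0 * math.exp(alpha * level + gamma * velocity)
```

In this sketch, a nonzero gamma is what lets two subjects with the same current biomarker level carry different risk when one is recovering and the other is declining.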
Abstract: Clinical trial emulation has emerged as an important approach in real-world drug research, enabling investigators to replicate the design and analysis of randomized controlled trials using observational data. However, traditional emulation typically relies on expert knowledge and extensive literature review to construct a hypothetical trial, a process that is often time-consuming and constrained by limited scalability. In this work, we develop a domain-specific large language model (LLM) trained with advanced direct preference optimization techniques to facilitate semi-automated trial emulation. Given the drugs or interventions of interest, the LLM generates a complete hypothetical trial design, including detailed inclusion and exclusion criteria, treatment allocation strategies, follow-up protocols, and outcome definitions. Building on this, the system leverages the LLM to align the hypothetical trial with the user’s diverse private datasets, producing tailored data extraction schemes that enable efficient retrieval of relevant patient cohorts and variables. This LLM-assisted framework significantly improves the efficiency and reduces the cost of conducting emulation studies for researchers with heterogeneous, privately held data sources, expanding the accessibility and scalability of real-world evidence generation.
Abstract: In this work, we propose a general framework for testing the conditional distribution equality in a two-sample problem, which is most relevant to covariate shift and causal discovery. Our framework is built on neural network-based generative methods and sample splitting techniques by transforming the conditional distribution testing problem into an unconditional one. We introduce two special tests: the generative permutation-based conditional distribution equality test and the generative classification accuracy-based conditional distribution equality test. Theoretically, we establish a minimax lower bound for statistical inference in testing the equality of two conditional distributions under certain smoothness conditions. We demonstrate that the generative permutation-based conditional distribution equality test and its modified version can attain this lower bound precisely or up to some iterated logarithmic factor. Moreover, we prove the testing consistency of the generative classification accuracy-based conditional distribution equality test. We also establish the convergence rate for the learned conditional generator by deriving new results related to the recently-developed offset Rademacher complexity and approximation properties using neural networks. Empirically, we conduct numerical studies including synthetic datasets and two real-world datasets, demonstrating the effectiveness of our approach.
Abstract: AI is rapidly transforming society and has emerged as the defining technology of our generation. Its influence spans across industries, with significant implications for scientific research, healthcare, and drug development. In this work, we review the historical progression of AI and its growing role in pharmaceutical research, highlighting how AI-driven methodologies are revolutionizing drug discovery and development processes. As the integration of AI in biostatistics and clinical research deepens, scientists must cultivate essential skills to remain effective in this evolving landscape. We discuss four critical competencies for scientists working in the AI era: AI Mindsets, AI Communication, AI Integration, and AI-Enabled Innovation. Looking ahead, we explore the potential future impact of AI on the pharmaceutical data analytics ecosystem. The convergence of AI with biostatistics presents both challenges and opportunities, requiring a thoughtful balance between leveraging AI’s capabilities and maintaining rigorous scientific and ethical standards. By embracing AI-driven approaches while upholding core statistical principles, the next generation of scientists can contribute to more efficient, data-driven advancements in clinical research. This discussion aims to provide insights into the evolving role of AI in biostatistics and inspire forward-thinking strategies for navigating the intersection of AI and scientific discovery in the pharmaceutical industry.
Abstract: Recent years have seen substantial progress in the theoretical analysis of deep neural networks, though the majority of existing results assume independent observations. In contrast, the statistical properties of deep ReLU neural networks for modelling nonlinear, dependent data are investigated, encompassing a broad class of time series and spatial models. Sharp non-asymptotic error bounds for the DNN estimator are established, demonstrating that these bounds depend explicitly on the underlying dependence structure of the data, the architectural characteristics of the network, and the dependence structure under different mixing scenarios. Systematic simulation studies demonstrate that the empirical results align closely with the theoretical findings.
Abstract: This paper studies fairness and robustness in chest X-ray machine learning under class imbalance and distribution shift. Its primary methodological contribution is an extension of tilted empirical risk minimization (TERM), denoted Stratified TERM (StraTERM), which partitions data into clinically coherent strata and applies group-level tilted optimization within each stratum. The resulting objective is designed to emphasize poorly served subgroups within clinically comparable contexts while preserving a practical training procedure. The paper develops a fairness framework that distinguishes marginal from conditional criteria, formalizes StraTERM in relation to existing TERM variants, and evaluates the method empirically on the MIMIC-CXR dataset. Experiments use patient-level splits for binary Lung Opacity versus No Finding classification from chest radiographs. This setting provides a concrete testbed for assessing whether objective-level fairness interventions can improve worst-group reliability and conditional fairness with minimal loss of clinically relevant discriminative performance.
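For readers unfamiliar with tilted objectives, the sketch below shows the standard TERM aggregate, (1/t) * log(mean(exp(t * loss))), and one possible way to apply it at the group level within strata. The abstract does not spell out how StraTERM combines strata, so the unweighted average across strata, the use of group mean losses, and the names tilted_mean and stratified_tilted_risk are assumptions made for illustration only.

```python
import math


def tilted_mean(losses, t):
    """Tilted aggregate (1/t) * log(mean(exp(t * loss))).

    t = 0 is treated as the plain mean (the limiting case); t > 0 moves the
    aggregate toward the largest loss, t < 0 toward the smallest.
    """
    if abs(t) < 1e-12:
        return sum(losses) / len(losses)
    # Shift by the largest exponent before exponentiating, for numerical stability.
    shift = max(t * loss for loss in losses)
    total = sum(math.exp(t * loss - shift) for loss in losses)
    return (shift + math.log(total / len(losses))) / t


def stratified_tilted_risk(strata, t):
    """Hypothetical stratified group-level objective.

    strata maps stratum name -> {group name -> list of per-sample losses}.
    Within each stratum, group mean losses are combined with the tilted
    aggregate; strata are then averaged with equal weight (an assumption).
    """
    per_stratum = []
    for groups in strata.values():
        group_means = [sum(v) / len(v) for v in groups.values()]
        per_stratum.append(tilted_mean(group_means, t))
    return sum(per_stratum) / len(per_stratum)
```

With t > 0, a group whose mean loss is high relative to the other groups in the same stratum contributes more to the objective, which is the sense in which a tilted objective emphasizes poorly served subgroups.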
Abstract: Offline reinforcement learning typically assumes that actions in the dataset are observed without error. In many applications, however, the true actions may be unobserved and only noisy proxies are available, leading to bias in standard off-policy evaluation and potentially misleading conclusions. We study off-policy evaluation in infinite-horizon discounted Markov decision processes with hidden actions. By leveraging the next-state variable as a natural proxy for the unobserved action, we establish identification of the policy value and propose an influence function-based estimator LURE (Learning from the Unseen: Robust Estimator). The LURE estimator is multiply robust, remaining consistent under several combinations of correctly specified nuisance components, and is asymptotically normal, enabling valid inference. To our knowledge, this is the first work on offline reinforcement learning with hidden actions. Simulations and a sepsis management application using the MIMIC-III database show that LURE substantially reduces bias compared to baseline methods.