Statistics Seminar
The Statistics Seminar was organized by Chuan-Fa Tang and Qiwei Li in Fall 2021.
Date & Location | Speaker | Title |
---|---|---|
Oct 8, 3:30 pm | Chuan-Fa Tang, Department of Mathematical Sciences, University of Texas at Dallas | Taylor’s law for semivariance and higher moments of heavy-tailed distributions |
Oct 15, 3:30 pm | Sam Efromovich, Department of Mathematical Sciences, University of Texas at Dallas | Missing and modified data in nonparametric curve estimation |
Oct 22, 3:30 pm | Karl Ho, Department of Political and Policy Science, University of Texas at Dallas | Text analytics applications |
Oct 29, 11:00 am (in-person only) | Chul Moon, Department of Statistical Science, Southern Methodist University | Using topological shape features to characterize medical images: Case studies on lung and brain cancers |
Nov 5, 11:00 am (in-person only) | Sunyoung Shin, Department of Mathematical Sciences, University of Texas at Dallas | Scalable DNA-protein binding changer test for insertion and deletion of bases in the genome |
Nov 12, 3:30 pm | Yunan Wu, Department of Mathematical Sciences, University of Texas at Dallas | Model-assisted uniformly honest inference for optimal treatment regimes in high dimension |
Nov 19, 3:30 pm | Xiaowei Zhan, Department of Population and Data Sciences, University of Texas Southwestern Medical Center | A supervised contrastive learning framework to improve microbiome multi-omics prediction model |
Nov 30, 4:00 pm | Yanxun Xu, Department of Applied Mathematics and Statistics, Johns Hopkins University | Personalized dynamic treatment regimes in continuous time: A Bayesian approach for optimizing clinical decisions with timing |
Dec 3, 3:30 pm | Feng Chen, Department of Computer Science, University of Texas at Dallas | Multidimensional uncertainty quantification for graph neural networks |
List of talk abstracts in Fall 2021
Taylor’s Law for Semivariance and Higher Moments of Heavy-tailed Distributions
Chuan-Fa Tang (UTD), Oct 8, 2021
The power law relating the population mean and variance is known as Taylor’s law, proposed by Taylor in 1961. We generalize Taylor’s law from light-tailed distributions to heavy-tailed distributions with infinite mean. Instead of population moments, we consider the power law between the sample mean and many other sample statistics, such as the sample upper and lower semivariance, the skewness, the kurtosis, and higher moments of a random sample. We show that, as the sample size increases, these sample statistics increase asymptotically in direct proportion to a power of the sample mean. These power laws characterize the asymptotic behavior of commonly used measures of the risk-adjusted performance of investments, such as the Sortino ratio, the Sharpe ratio, the upside potential ratio, and the Farinelli-Tibiletti ratio, when returns follow a heavy-tailed nonnegative distribution. In addition, we find the asymptotic distribution and moments of the number of observations exceeding the sample mean. We propose tail-index estimators based on these scaling laws and on the number of observations exceeding the sample mean, and compare them with some prior estimators.
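As a quick numerical companion to the scaling laws described above, the sketch below simulates an infinite-mean Pareto sample and reads off the power-law exponent relating the sample variance to the sample mean on the log-log scale. The tail index, sample sizes, and plain Monte Carlo design are illustrative assumptions, not taken from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.8                          # tail index < 1, so the mean is infinite

def pareto_sample(n):
    # Pareto(alpha) on [1, inf) via inverse transform: X = (1 - U)^(-1/alpha)
    return (1.0 - rng.uniform(size=n)) ** (-1.0 / alpha)

sizes = np.logspace(3, 6, 10).astype(int)
log_mean, log_var = [], []
for n in sizes:
    x = pareto_sample(n)
    log_mean.append(np.log(x.mean()))
    log_var.append(np.log(x.var()))

# The slope of log sample variance against log sample mean estimates the
# Taylor's-law exponent that the talk characterizes analytically.
slope = np.polyfit(log_mean, log_var, 1)[0]
print(f"empirical Taylor's-law exponent: {slope:.2f}")
```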
Missing and Modified Data in Nonparametric Curve Estimation
Sam Efromovich (UTD), Oct 15, 2021
Two topics in modern nonparametric statistics are highlighted. The first is missing data: theory and examples are presented, and results are illustrated via regression with missing predictors and responses. The second topic is devoted to survival analysis, specifically to efficient estimation of the hazard rate function for truncated and censored data. All statistical notions used will be explained.
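The toy sketch below illustrates one small corner of the first topic, assuming responses missing completely at random and a complete-case Nadaraya-Watson smoother; it is a minimal stand-in rather than the efficient estimators developed in the talk.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x = rng.uniform(0, 1, n)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n)
observed = rng.uniform(size=n) < 0.7     # ~30% of responses go missing (MCAR)

def nw_estimate(x0, xs, ys, h=0.05):
    # Gaussian-kernel Nadaraya-Watson estimate of E[Y | X = x0]
    w = np.exp(-0.5 * ((xs - x0) / h) ** 2)
    return np.sum(w * ys) / np.sum(w)

grid = np.linspace(0.1, 0.9, 5)
fit = [nw_estimate(g, x[observed], y[observed]) for g in grid]
print(np.round(fit, 2), "vs truth", np.round(np.sin(2 * np.pi * grid), 2))
```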
Text Analytics Applications
Karl Ho (UTD), Oct 22, 2021
Text data account for over half of all big data, and most of these data are unstructured, presenting challenges to data scientists in managing and modeling such sizable, noisy data. However, text analytics such as text mining and Natural Language Processing (NLP) constitute one of the most important drivers of the big data market, responsible for over 20 percent growth every year (Technavio 2020). This tech talk introduces applications of analytics and visualization for text data. It will cover topics including the collection and management of text data from social media and the internet, as well as unsupervised learning methods including pattern detection and visualization. The latest developments and applications using text data will be illustrated with examples from multiple fields, including the social sciences and military studies.
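As a hint of what such a pipeline can look like, here is a minimal sketch of unsupervised pattern detection on text using TF-IDF features and k-means, with scikit-learn assumed as the toolkit and a four-document placeholder corpus; the talk's own applications are far richer.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "election turnout and party identification",
    "voters, campaigns, and party platforms",
    "naval strategy in the pacific theater",
    "military logistics and supply strategy",
]
X = TfidfVectorizer(stop_words="english").fit_transform(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)   # documents on similar topics should land in the same cluster
```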
Using Topological Shape Features to Characterize Medical Images: Case Studies on Lung and Brain Cancers
Chul Moon (SMU), Oct 29, 2021
Tumor shape is a key factor that affects tumor growth and metastasis. This talk presents a topological feature, computed by persistent homology, that characterizes tumor progression from digital pathology and radiology images, and examines its effect on time-to-event data. The proposed topological features are invariant to scale-preserving transformations and can summarize various tumor shape patterns. The topological features are represented in functional space and used as functional predictors in a functional Cox proportional hazards model. The proposed model enables interpretable inference about the association between topological shape features and survival risks. Two case studies are conducted using 143 consecutive lung cancer patients and 77 brain tumor patients. The results of both studies show that the topological features predict survival prognosis after adjusting for clinical variables, and that the predicted high-risk groups have significantly (at the 0.001 level) worse survival outcomes than the low-risk groups. Also, the topological shape features found to be positively associated with survival hazards are irregular and heterogeneous shape patterns, which are known to be related to tumor progression.
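The sketch below gestures at the pipeline with heavy simplifications: it assumes the `ripser` and `lifelines` Python packages, uses synthetic point clouds in place of tumor images, and collapses the persistence diagram to a single scalar (total persistence) rather than the functional predictors the talk employs in its functional Cox model.

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter
from ripser import ripser

rng = np.random.default_rng(2)

def total_persistence(points):
    # Sum of (death - birth) over the H1 (loop) features of the point cloud
    dgm = ripser(points)["dgms"][1]
    return float(np.sum(dgm[:, 1] - dgm[:, 0]))

rows = []
for _ in range(60):                       # 60 synthetic "patients"
    cloud = rng.normal(size=(40, 2))      # stand-in for a sampled tumor shape
    rows.append({
        "topo": total_persistence(cloud),
        "time": rng.exponential(10.0),    # synthetic survival time
        "event": int(rng.integers(0, 2)), # 1 = event observed, 0 = censored
    })

cph = CoxPHFitter().fit(pd.DataFrame(rows), duration_col="time", event_col="event")
cph.print_summary()                       # hazard ratio for the shape summary
```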
Scalable DNA-Protein Binding Changer Test for Insertion and Deletion of Bases in the Genome
Sunyoung Shin (UTD), Nov 5, 2021
Noncoding regions make up the majority of the genome; for example, about 99% of the human genome is noncoding DNA. Mutations in the noncoding genome have been crucial to understanding disease mechanisms through the dysregulation of disease-associated genes. One key element of gene regulation that noncoding mutations mediate is the binding of proteins to DNA sequences. Insertions and deletions of bases (InDels) are the second most common type of mutation that may impact DNA-protein binding. However, no existing method can be used to determine the quantitative effects of InDels on DNA-protein binding. We develop a novel statistical test, named the binding changer test (BC test), to evaluate the impact of InDels on DNA binding changes using DNA-binding motifs and single-sequence modeling. The test predicts binding changer InDels of regulatory importance with an efficient importance sampling algorithm, generating background sequences from an importance distribution that puts more weight on large binding affinity changes. We derive the importance distribution with the optimal tilting parameter. The BC test provides a general statistical framework for any disease type in any species' genome. Simulation studies demonstrate its excellent performance. An application to genome sequencing datasets from human leukemia samples uncovers candidate pathologic InDels that modulate MYC binding in leukemic genomes.
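A toy version of the underlying comparison is sketched below: score a reference and an InDel-carrying sequence against a made-up position weight matrix and calibrate the change with plain Monte Carlo background sampling. The actual BC test replaces the plain sampling with importance sampling from an exponentially tilted background using the optimal tilting parameter.

```python
import numpy as np

rng = np.random.default_rng(3)
BASES = "ACGT"
# Hypothetical 4-position log-odds position weight matrix (columns are ACGT)
PWM = np.log(np.array([[.7, .1, .1, .1],
                       [.1, .1, .7, .1],
                       [.1, .7, .1, .1],
                       [.1, .1, .1, .7]]) / 0.25)

def best_score(seq):
    # Max PWM score over all windows, a proxy for the binding affinity
    idx = [BASES.index(b) for b in seq]
    L = PWM.shape[0]
    return max(sum(PWM[j, idx[i + j]] for j in range(L))
               for i in range(len(idx) - L + 1))

ref, alt = "TTAGCATT", "TTAGTT"          # toy deletion of "CA" from ref
obs = best_score(alt) - best_score(ref)  # observed binding affinity change

# Null distribution: affinity changes from random 2-base deletions in
# random background sequences (the BC test would tilt this distribution).
null = []
for _ in range(2000):
    s = "".join(rng.choice(list(BASES), size=len(ref)))
    cut = int(rng.integers(0, len(s) - 2))
    null.append(best_score(s[:cut] + s[cut + 2:]) - best_score(s))
pval = np.mean(np.abs(null) >= abs(obs))
print(f"binding change = {obs:.2f}, Monte Carlo p = {pval:.3f}")
```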
Model-Assisted Uniformly Honest Inference for Optimal Treatment Regimes in High Dimension
Yunan Wu (UTD), Nov 12, 2021
We develop new tools to quantify uncertainty in optimal decision making and to gain insight into which variables one should collect information about given the potential cost of measuring a large number of variables. We investigate simultaneous inference to determine if a group of variables is relevant for estimating an optimal decision rule in a high-dimensional semiparametric framework. The unknown link function permits flexible modeling of the interactions between the treatment and the covariates, but leads to nonconvex estimation in high dimension and imposes significant challenges for inference. We first establish that a local restricted strong convexity condition holds with high probability and that any feasible local sparse solution of the estimation problem can achieve the near-oracle estimation error bound. We further rigorously verify that a wild bootstrap procedure based on a debiased version of the local solution can provide asymptotically honest uniform inference for the effect of a group of variables on optimal decision making. The advantage of honest inference is that it does not require the initial estimator to achieve perfect model selection and does not require the zero and nonzero effects to be well-separated. We also propose an efficient algorithm for estimation. Our simulations suggest satisfactory performance. An example from a diabetes study illustrates the method in a real application.
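As a conceptual stand-in, far simpler than the talk's high-dimensional semiparametric setting, the sketch below applies a wild (multiplier) bootstrap with Rademacher weights to the maximum statistic over a group of OLS coefficients; the talk applies the same multiplier idea to a debiased local solution of a nonconvex problem.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, group = 200, 5, [2, 3, 4]        # H0: beta_j = 0 for every j in group
X = rng.normal(size=(n, p))
beta = np.array([1.0, -0.5, 0.0, 0.0, 0.0])
y = X @ beta + rng.normal(size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
XtX_inv = np.linalg.inv(X.T @ X)

T_obs = np.max(np.abs(beta_hat[group]))
T_boot = np.empty(2000)
for b in range(2000):
    e = rng.choice([-1.0, 1.0], size=n)       # Rademacher multipliers
    beta_b = XtX_inv @ (X.T @ (resid * e))    # coefficient fluctuation draw
    T_boot[b] = np.max(np.abs(beta_b[group]))
print(f"T = {T_obs:.3f}, wild bootstrap p = {np.mean(T_boot >= T_obs):.3f}")
```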
A Supervised Contrastive Learning Framework to Improve Microbiome Multi-omics Prediction Model
Xiaowei Zhan (UTSW), Nov 19, 2021
The human microbiome consists of trillions of cells that collectively affect host health. Recently, advances in next-generation sequencing technology have enabled the high-throughput profiling of metagenomes and accelerated the study of the microbiome. In biomedical research, the microbiome is a potentially promising non-invasive biomarker for many diseases. However, a microbiome-based prediction model is challenging to construct and often suffers from low performance. We propose a novel supervised contrastive learning model. It can leverage multi-omics data (e.g., metabolomics) to enhance the accuracy and robustness of microbiome-based prediction models. I will present use cases for this framework using real datasets. Furthermore, I will demonstrate that this model is broadly applicable to multi-omics data models.
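A minimal PyTorch sketch of a supervised contrastive (SupCon-style) loss is given below to make the "supervised contrastive learning" ingredient concrete; the embeddings and labels are placeholders, and the talk's pairing of microbiome profiles with matched metabolomics is not reproduced here.

```python
import torch
import torch.nn.functional as F

def sup_con_loss(z, labels, tau=0.1):
    """z: (n, d) embeddings; labels: (n,) integer class labels."""
    z = F.normalize(z, dim=1)                        # unit-norm embeddings
    sim = z @ z.T / tau                              # scaled cosine similarity
    self_mask = torch.eye(len(z), dtype=torch.bool)
    sim = sim.masked_fill(self_mask, float("-inf"))  # exclude self-pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos = (labels[:, None] == labels[None, :]) & ~self_mask
    # average log-probability of same-label pairs, per anchor, then overall
    per_anchor = log_prob.masked_fill(~pos, 0.0).sum(1) / pos.sum(1).clamp(min=1)
    return -per_anchor.mean()

z = torch.randn(8, 16, requires_grad=True)        # stand-in embeddings
labels = torch.tensor([0, 0, 1, 1, 0, 1, 0, 1])   # e.g., disease status
loss = sup_con_loss(z, labels)
loss.backward()                           # gradients flow back to embeddings
print(float(loss))
```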
Personalized Dynamic Treatment Regimes in Continuous Time: A Bayesian Approach for Optimizing Clinical Decisions with Timing
Yanxun Xu (Hopkins), Nov 30, 2021
Accurate models of clinical actions and their impacts on disease progression are critical for estimating personalized optimal dynamic treatment regimes (DTRs) in medical/health research, especially in managing chronic conditions. Traditional statistical methods for DTRs usually focus on estimating the optimal treatment or dosage at each given medical intervention, but overlook the important question of "when this intervention should happen." We fill this gap by developing a two-step Bayesian approach to optimize clinical decisions with timing. In the first step, we build a generative model for a sequence of medical interventions, which are discrete events in continuous time, with a marked temporal point process (MTPP) where the mark is the assigned treatment or dosage. This clinical action model is then embedded into a Bayesian joint framework where the other components model clinical observations, including longitudinal medical measurements and time-to-event data conditional on treatment histories. In the second step, we propose a policy gradient method to learn the personalized optimal clinical decision that maximizes patient survival by having the MTPP interact with the model of clinical observations, while accounting for uncertainties in clinical observations learned from the posterior inference of the Bayesian joint model in the first step. A signature application of the proposed approach is to schedule follow-up visitations and assign a dosage at each visitation for patients after kidney transplantation. We evaluate our approach with comparison to alternative methods on both simulated and real-world datasets. In our experiments, the personalized decisions made by the proposed method are clinically useful: they are interpretable and successfully help improve patient survival.
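The sketch below shows only the first modeling ingredient, simulating a marked temporal point process (event times as clinic visits, marks as dosages) by Ogata thinning under an assumed self-exciting intensity; the Bayesian joint model and the policy-gradient step are not shown, and all parameter values are made up.

```python
import numpy as np

rng = np.random.default_rng(5)
mu, a, b = 0.2, 0.5, 1.0       # baseline rate, excitation jump, decay rate

def intensity(t, times):
    # Hawkes-type intensity: baseline plus decaying kicks from past visits
    past = np.array([s for s in times if s < t])
    return mu + (a * np.exp(-b * (t - past))).sum()

times, marks, t, horizon = [], [], 0.0, 50.0
while t < horizon:
    lam_bar = intensity(t, times)        # valid bound: intensity decays
    t += rng.exponential(1.0 / lam_bar)  # propose the next candidate time
    if t < horizon and rng.uniform() < intensity(t, times) / lam_bar:
        times.append(t)                    # accept: a visit happens at t
        marks.append(rng.gamma(2.0, 1.0))  # mark: dosage assigned at the visit
print(len(times), "visits; first dosages:", np.round(marks[:3], 2))
```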
Multidimensional Uncertainty Quantification for Graph Neural Networks
Feng Chen (UTD), Dec 3, 2021
Inherent uncertainties arising from different root causes are serious hurdles to finding effective solutions for real-world problems. Critical safety concerns arise when diverse causes of uncertainty are not considered, resulting in high risk from the misinterpretation of uncertainties (e.g., misclassification by an autonomous vehicle). Graph neural networks (GNNs) have received tremendous attention in the data science community. Despite their superior learning performance, they do not consider various types of uncertainty in their decision process. In this talk, I will present a general approach to quantifying the inherent multidimensional uncertainties in GNN predictions.
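To make the flavor of such measures concrete, the sketch below computes two subjective-logic uncertainty measures that this line of work builds on, vacuity (lack of evidence) and dissonance (conflicting evidence), from per-node Dirichlet evidence such as an evidential GNN head might output; the GNN itself is omitted, and the formulas follow standard subjective logic rather than the talk's specific model.

```python
import numpy as np

def vacuity_dissonance(evidence):
    """evidence: length-K nonnegative evidence vector for K classes."""
    e = np.asarray(evidence, dtype=float)
    K = len(e)
    S = e.sum() + K            # Dirichlet strength with alpha = evidence + 1
    belief = e / S             # subjective-logic belief masses
    vacuity = K / S            # 1 when there is no evidence at all
    dissonance = 0.0
    for k in range(K):         # evidence spread across conflicting classes
        others = np.delete(belief, k)
        if others.sum() > 0:
            balance = 1 - np.abs(others - belief[k]) / (others + belief[k] + 1e-12)
            dissonance += belief[k] * (others * balance).sum() / others.sum()
    return vacuity, dissonance

print(vacuity_dissonance([0, 0, 0]))    # no evidence: vacuity 1, dissonance 0
print(vacuity_dissonance([10, 10, 0]))  # split evidence: high dissonance
```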