Statistics Seminar, Fall 2014
Date | Speaker | Institution | Title | Abstract |
---|---|---|---|---|
Sept. 12 | Frank Konietschke | Mathematical Sciences, UTD | Multiple Contrast Tests and Simultaneous Confidence Intervals in High-Dimensional Repeated Measures Designs | A high-dimensional setting, in which the number of subjects is substantially smaller than the number of conditions to be tested, is widely encountered in a variety of modern longitudinal and repeated measures studies, with applications ranging from medicine to the social sciences. Recently, several global testing procedures have been proposed for high-dimensional repeated measures designs; these can be employed to assess a global null hypothesis, e.g., of no global time effect. In statistical practice, however, the key question of interest is frequently the identification of the significant factor levels, along with the computation of simultaneous confidence intervals for treatment effects. In this talk we consider different resampling methods that can be employed to derive multiple contrast tests and simultaneous confidence intervals in a high-dimensional setting. We discuss asymptotic properties of the proposed testing procedures and illustrate their finite-sample performance through simulations and case studies. (A simplified resampling sketch appears after the schedule.) This is joint work with Yulia R. Gel and Markus Pauly. |
Sept. 19 | Larry Ammann | Mathematical Sciences, UTD | Sparse SVD and Visualization of Variable Interaction Networks | Principal components and the singular value decomposition (SVD) play an important role in exploratory data analysis. They provide a natural coordinate system that de-correlates variables, and they can provide a starting point for dimension reduction with high-dimensional data by separating the data into signal and noise subspaces. However, they can be difficult to interpret in large data sets because the linear combinations defined by principal components typically have a few dominant loadings combined with many small loadings. Furthermore, data sets with many variables may contain multiple signals that interact weakly, in addition to large amounts of noise. For these reasons there have been recent efforts to develop methods that produce sparse singular value decompositions, that is, linear combinations of relatively small numbers of variables that possess properties close to the optimal properties of the SVD. This talk describes an approach, motivated by the relationships among least squares regression, the QR decomposition, variable selection, and the SVD, for obtaining sparse SVDs. Visualization tools based on sparse SVDs are also developed that aid in the identification and interpretation of variable interaction networks. These tools are applied to a genomic data set to identify clusters of interacting genes. (A generic sparse-SVD sketch appears after the schedule.) |
Sept. 26 | Dunlei Cheng | American Thrombosis and Hemostasis Network, Riverwoods, IL | Testing Hypotheses about Medical Test Accuracy: Considerations for Design and Inference | We introduce a method for sample size determination for studies designed to test hypotheses about medical test or biomarker sensitivity and specificity. We show how a sample size can be determined to guard against making type I and/or type II errors by calculating Bayes factors from multiple data sets simulated under null and/or alternative models. The approach can be implemented across a variety of study designs, including investigations into one test or two conditionally independent or dependent tests. We focus on a general setting that involves non-identifiable models for the data when true disease status is unavailable, owing to the nonexistence of a gold-standard test or to its undesirable side effects. We illustrate our approach to sample size determination with a thorough simulation study. (A simplified sketch of the Bayes-factor simulation idea appears after the schedule.) |
Oct. 3 | Joel Dobson | Texas Instruments | Statistical Challenges and Opportunities in the Semiconductor Industry | We in the semiconductor industry can take pride in our ongoing legacy of seeking to apply the best available statistical modeling methods. In this talk we will review some tried-and-true modeling methods and discuss opportunities for new ones. We see a bright hope for the future. The data structures in semiconductor manufacturing are usually hierarchical, often include random effects, and lend themselves to mixed-model approaches. Our databases are big, often pushing storage technology to its limits, and joining data from different databases may be difficult. A variety of statistical software packages are used by our internal clients across the various business units and groups, so we must be knowledgeable about many packages and fluent in several. The semiconductor statistician must collaborate well with information technology engineers; data science bridges these disciplines. Data are everywhere and will only increase as memory becomes more affordable and faster, and tools have advanced interfaces generating streams of data. In our industry, our engineering clients may assume a depth of knowledge of device physics on our part, but some may not be reciprocally deep in their knowledge of statistics and modeling; the democratization of statistics may lead to mistaken assumptions. Some statistical methods must remain proprietary, such as surrogate modeling of device-structure simulators and methods for setting guard bands and specification limits. We must learn to deploy statistical methods more effectively throughout the organization, consistent with recent trends in the literature: thinking statistically, statistical engineering, and big data. With the ever-expanding wisdom available on the internet and on company intranets, the consulting statistician must have a marketing plan for her or his own brand. Emerging opportunities include blogging, GitHub participation, local meetups, and participation in professional societies such as the American Statistical Association. Change continues to come at a faster and faster pace, and those who embrace change will reap the benefits that accompany it. Are you ready? |
Oct. 10 | Brian Lucena | Parkland Health & Hospital System | Real-Time Predictive Analytics for Hospital Decision Support | With the advent of Electronic Medical Record (EMR) systems at hospitals, we are now able to develop decision support algorithms that function in real time. This raises a number of theoretical and practical challenges. Rather than all of the predictor variables being available at once, they arrive over the course of time, and our predictions must update accordingly. Performance metrics for comparing algorithms are complicated by the time dimension, as we may want to trade off accuracy for speed. The real-time data on which we deploy may have different properties than the historical data on which we develop the model. We will discuss these challenges and the approaches used to deal with them. |
Oct. 17 | Vyacheslav Lyubchich | University of Waterloo, Canada and Mathematical Sciences, UTD | A new local regression nonparametric test for trend synchronism in multiple time series | The problem of identifying joint trend dynamics in multiple time series is essential in a wide spectrum of applications, from economics and finance to climate and environmental studies. However, most of the available tests for comparing multiple mean functions either deal with independent errors or are applicable only to the case of two time series, which constitutes a substantial limitation in many high-dimensional studies. We propose a new nonparametric test for synchronism of trends exhibited by multiple linear time series, where the number of time series $N$ can be large but fixed. The core idea of the new approach is to employ a local regression test statistic, which allows detection of possibly non-monotonic, nonlinear trends. The finite-sample performance of the new synchronism test statistic is enhanced by a nonparametric hybrid bootstrap approach. The proposed methodology is illustrated by simulations and a case study on climate dynamics. (A simplified sketch of the smoothing-and-bootstrap idea appears after the schedule.) |
Oct. 24 | Daniel A. Griffith | School of Economic, Political and Policy Sciences, UTD | Positive Spatial Autocorrelation, Mixture Distributions, and Histograms Constructed With Geospatial Data | Researchers commonly construct histograms as a first step in representing and visualizing their geospatial data, or when simulating geospatial data. Because of the presence of spatial autocorrelation in these data, these graphs fail to align closely with any of the several hundred existing ideal frequency distributions. The purpose of this paper is to address how positive spatial autocorrelation, the kind most frequently encountered in practice, can distort histograms constructed with geospatial data. Following the auto-normal parameter specification employed in WinBUGS for Bayesian analysis, this paper summarizes results for normal, Poisson, and binomial random variables (RVs), three of those most commonly employed by geospatial scientists, in terms of mixture distributions. An eigenvector spatial filter description of positive spatial autocorrelation is shown to approximate a normal distribution in its initial form, a gamma distribution when exponentiated, and a beta distribution when embedded in a logistic equation. In turn, these conceptualizations allow: the mean of a normal distribution to be distributed as a normal random variable (RV) with a zero mean and a specific variance; the mean of a Poisson distribution to be distributed as a gamma RV with specific parameters (i.e., a negative binomial distribution); and the probability of a binomial distribution to be distributed as a beta RV with specific parameters (i.e., a beta-binomial distribution). These results allow the impacts of positive spatial autocorrelation on histograms and geospatial simulation experiments to be better understood. (The mixture relationships are written out after the schedule.) |
Oct. 31 | Uditha Wijesuriya | Mathematical Sciences, UTD | Nonparametric quantiles and outlier detection for functional data using the spatial depth approach | The spatial depth and outlyingness approach for multivariate data has been very successful for its tractability, computational ease, and convenient asymptotics. Here its extension to the setting of outlier identification in functional data analysis is treated. Computations may be carried out in the Hilbert space of curves or in a corresponding Euclidean space obtained by discretization. For a data set of real-valued curves, methods are described for useful display of the sample median curve, the 50% central region of curves, and sample outlier curves, including both location and shape outliers. A spatial functional boxplot approach is used to identify outliers. We illustrate with several actual and simulated data sets, comparing the spatial approach with several leading competing methods with respect to the false positive rate, the false negative rate, and the computational burden. The spatial approach is seen to be among the very best in performance. (A minimal sketch of spatial outlyingness for discretized curves appears after the schedule.) (Joint work with Professor Robert Serfling) |
Nov. 7 | Lilia L. Ramirez Ramirez | Department of Statistics and Actuarial Science, ITAM, Mexico | Infectious Outbreak Prediction based on network epidemic models, official and social media information | Developing a timely prediction of the evolution of an outbreak is a challenge that health institutions face in order to define lines of action and secure the health resources that the population may need. This problem is exacerbated when the outbreak involves an infectious agent that is highly transmissible or virulent. This talk revolves around the prediction of cases in outbreaks where the infectious agent is transmitted directly between individuals. The epidemic model uses networks to describe the population's interactions, which allows us to generalize the homogeneous-susceptibility hypothesis present in some models. Since up-to-date public data on the number of confirmed cases are often unavailable, the proposed epidemic model employs alternative sources of information, such as social media. For example, the Google Flu Trends project uses the number of searches for flu-related words to describe flu activity. Twitter, in turn, has proven to be a source of information that can be harvested, along with location and socio-demographic information about its members. In this talk we discuss a proposed epidemic model that incorporates social networks and information from Twitter to delineate short-term evolutionary scenarios for influenza outbreaks. (A toy network-SIR sketch appears after the schedule.) |
Nov. 14 | Ejaz Ahmed | Brock University, Canada | Big Data, Big Bias, Big Surprise? | In high-dimensional settings, where the number of variables is greater than the number of observations or increases with the sample size, many penalized regularization strategies have been studied for simultaneous variable selection and post-estimation. However, a model may contain sparse strong signals together with a number of predictors carrying only weak signals. In this scenario variable selection methods may not distinguish the predictors with weak signals from those with sparse signals, and prediction based on a selected submodel may not be preferable. For this reason, we propose a high-dimensional shrinkage estimation strategy to improve the prediction performance of a submodel. Such a high-dimensional shrinkage estimator (HDSE) is constructed by shrinking a full-model ridge estimator in the direction of a candidate submodel. We demonstrate that the proposed HDSE performs uniformly better than the ridge estimator. Interestingly, it improves the prediction performance of a given candidate submodel generated by existing variable selection methods. The relative performance of the proposed HDSE strategy is appraised through both simulation studies and real data analysis. (A generic shrinkage template is written out after the schedule.) |
Nov. 21 | Shanshan Wang | Mathematical Sciences, UTD | Masking and Swamping Robustness of Outlier Detection Procedures | In the wide-ranging scope of modern statistical data analysis, a key task is the identification of outliers. For any outlier identification procedure, one needs to know its robustness against masking (an outlier goes undetected) and swamping (a nonoutlier is classified as an outlier). Masking and swamping robustness are interrelated aspects that must be studied together. For such purposes, we provide a general framework applicable in any data space. Implementation with particular outlier identifiers in particular types of data space, however, requires additional theoretical development specialized to the chosen setting; even the case of univariate data presents nontrivial challenges. Here we apply the framework to the leading types of multivariate outlyingness functions: Mahalanobis distance outlyingness, spatial outlyingness, Mahalanobis spatial outlyingness, and projection outlyingness, as well as their univariate counterparts. Our results shed new light on choices of outlyingness functions for outlier identification. Our findings also explain how the boxplot, a leading descriptive tool, performs using a hybrid outlyingness function that incorporates a quantile-based component to describe the middle half of a data set and a scaled-deviation outlyingness component for outlier detection. For both of these goals, the boxplot greatly favors swamping robustness over masking robustness. We formulate a variant boxplot offering a more favorable trade-off between these two robustness criteria. (Small sketches of two of the outlyingness functions appear after the schedule.) (Joint work with Professor Robert Serfling) |
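
Illustrative sketches

Sept. 12 (multiple contrast tests). A minimal sketch of one generic way to obtain simultaneous confidence intervals for several contrasts of condition means in a repeated measures design, by bootstrapping a studentized max statistic over subjects. It is only an illustration of the resampling idea, not the speakers' specific procedure; the data, contrast matrix, and all parameter values are hypothetical.

```python
# Bootstrap max-t simultaneous confidence intervals (generic illustration).
import numpy as np

rng = np.random.default_rng(0)
n, d = 15, 40                      # few subjects, many conditions (hypothetical)
X = rng.normal(size=(n, d))        # subjects x conditions data matrix

# Dunnett-type contrasts: each condition compared with the first one
C = np.hstack([-np.ones((d - 1, 1)), np.eye(d - 1)])

def contrast_stats(data):
    est = C @ data.mean(axis=0)                                   # contrast estimates
    se = np.sqrt(np.diag(C @ np.cov(data, rowvar=False) @ C.T) / len(data))
    return est, se

est, se = contrast_stats(X)

# Resample subjects (rows) to approximate the distribution of the max-t statistic
B = 2000
max_t = np.empty(B)
for b in range(B):
    Xb = X[rng.integers(0, n, n)]
    eb, sb = contrast_stats(Xb)
    max_t[b] = np.max(np.abs(eb - est) / sb)

q = np.quantile(max_t, 0.95)       # simultaneous critical value
lower, upper = est - q * se, est + q * se
```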
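Sept. 19 (sparse SVD). A generic sketch of a sparse rank-one SVD obtained by alternating soft-thresholding of the loading vector, in the spirit of penalized SVD methods; it is not the QR-based approach described in the talk, and the planted-signal example and tuning value are made up.

```python
# Sparse rank-one SVD via alternating soft-thresholding (generic sketch).
import numpy as np

def soft(x, lam):
    """Soft-threshold x toward zero by lam."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def sparse_rank1_svd(X, lam, iters=100):
    u, _, vt = np.linalg.svd(X, full_matrices=False)
    u, v = u[:, 0], vt[0]                    # warm start at the leading singular pair
    for _ in range(iters):
        v = soft(X.T @ u, lam)               # sparsify the variable loadings
        v /= np.linalg.norm(v) or 1.0
        u = X @ v
        u /= np.linalg.norm(u) or 1.0
    return u, v                              # v contains exact zeros -> sparse loadings

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 200))
X[:, :5] += 3 * np.outer(rng.normal(size=50), np.ones(5))   # planted signal in 5 columns
u, v = sparse_rank1_svd(X, lam=3.0)
print(np.nonzero(v)[0])                      # variables retained in the sparse loading
```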
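Sept. 26 (Bayes-factor sample size). A deliberately simplified sketch of the simulation idea: for a single test with a gold standard available, simulate data sets at a candidate sample size under an alternative sensitivity and record how often the Bayes factor exceeds a cutoff. The prior, null value, cutoff, and alternative sensitivity are all hypothetical, and the non-identifiable, no-gold-standard setting emphasized in the talk is not treated here.

```python
# Simulation-based assessment of sample size using Bayes factors (simplified sketch).
import numpy as np
from scipy.stats import binom, betabinom

def bf10(y, n, p0=0.80, a=8, b=2):
    """Bayes factor of H1: sensitivity ~ Beta(a, b) against H0: sensitivity = p0."""
    m1 = betabinom.pmf(y, n, a, b)      # marginal likelihood under H1
    m0 = binom.pmf(y, n, p0)            # likelihood under the point null
    return m1 / m0

def prop_decisive(n, true_sens=0.95, threshold=3.0, sims=5000, seed=0):
    """Proportion of simulated data sets (under an alternative) with BF10 > threshold."""
    rng = np.random.default_rng(seed)
    y = rng.binomial(n, true_sens, size=sims)   # y positives among n diseased subjects
    return np.mean(bf10(y, n) > threshold)

for n in (25, 50, 100, 200):
    print(n, prop_decisive(n))
```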
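Oct. 17 (trend synchronism). A simplified sketch of the testing idea: smooth each series with local regression, measure how far the individual trends deviate from a common trend, and calibrate by a residual bootstrap. Serial dependence is ignored here, unlike the hybrid bootstrap proposed in the talk, and the simulated data and smoothing span are hypothetical.

```python
# Local-regression synchronism statistic with a naive residual bootstrap (sketch).
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(0)
T, N = 100, 5
t = np.linspace(0, 1, T)
Y = np.sin(2 * np.pi * t)[:, None] + rng.normal(scale=0.3, size=(T, N))  # common trend

def smooth(y):
    return lowess(y, t, frac=0.3, return_sorted=False)   # local regression fit

def sync_stat(Y):
    trends = np.column_stack([smooth(Y[:, i]) for i in range(Y.shape[1])])
    common = trends.mean(axis=1, keepdims=True)
    return np.sum((trends - common) ** 2)                 # departure from a shared trend

obs = sync_stat(Y)
common = smooth(Y.mean(axis=1))[:, None]                  # estimated common trend
resid = Y - common

boot = np.empty(500)
for b in range(500):
    Yb = common + resid[rng.integers(0, T, T)]            # i.i.d. resampling of residual rows
    boot[b] = sync_stat(Yb)
print(np.mean(boot >= obs))                               # bootstrap p-value
```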
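Oct. 24 (mixture distributions). The three mixture relationships named in the abstract, written out in a standard parameterization (a sketch; the exact parameters in the talk follow the auto-normal specification used in WinBUGS):

$$
\begin{aligned}
&\text{Normal: } && Y \mid \mu \sim \mathcal{N}(\mu, \sigma^2),\ \mu \sim \mathcal{N}(0, \tau^2)
  &&\Rightarrow\ Y \sim \mathcal{N}(0, \sigma^2 + \tau^2),\\
&\text{Poisson: } && Y \mid \lambda \sim \mathrm{Poisson}(\lambda),\ \lambda \sim \mathrm{Gamma}(r, \theta)
  &&\Rightarrow\ Y \sim \mathrm{NegBin}\!\left(r, \tfrac{1}{1+\theta}\right),\\
&\text{Binomial: } && Y \mid p \sim \mathrm{Binomial}(n, p),\ p \sim \mathrm{Beta}(\alpha, \beta)
  &&\Rightarrow\ Y \sim \mathrm{BetaBinomial}(n, \alpha, \beta).
\end{aligned}
$$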
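Oct. 31 (spatial outlyingness for curves). A minimal sketch of sample spatial outlyingness for curves discretized on a common grid, with a boxplot-type fence applied to the outlyingness values. The thresholding rule and the simulated curves are illustrative only, not the exact spatial functional boxplot studied in the talk.

```python
# Spatial outlyingness of discretized curves and a simple flagging rule (sketch).
import numpy as np

def spatial_outlyingness(curves):
    """curves: array of shape (n_curves, n_gridpoints); returns values in [0, 1]."""
    n = len(curves)
    out = np.empty(n)
    for i, x in enumerate(curves):
        diffs = x - np.delete(curves, i, axis=0)                 # x - X_j for j != i
        units = diffs / np.linalg.norm(diffs, axis=1, keepdims=True)
        out[i] = np.linalg.norm(units.mean(axis=0))              # norm of average unit vector
    return out

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 50)
curves = np.sin(2 * np.pi * t) + rng.normal(scale=0.2, size=(40, 50))
curves[0] += 2.0                                # a location outlier

o = spatial_outlyingness(curves)
median_curve = curves[np.argmin(o)]             # deepest curve = sample median curve
q1, q3 = np.quantile(o, [0.25, 0.75])
flagged = np.where(o > q3 + 1.5 * (q3 - q1))[0] # boxplot-type fence on outlyingness
print(flagged)                                  # the shifted curve 0 should be flagged
```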
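Nov. 7 (network epidemic model). A toy discrete-time SIR simulation on a contact network, illustrating how network structure replaces the homogeneous-mixing assumption. It is not the model proposed in the talk; the network, parameters, and time horizon are made up, and no social-media data stream is used here.

```python
# Toy stochastic SIR dynamics on a random contact network (sketch).
import numpy as np
import networkx as nx

rng = np.random.default_rng(0)
G = nx.erdos_renyi_graph(n=2000, p=0.004, seed=0)    # hypothetical contact network
beta, gamma = 0.08, 0.2                               # per-contact infection / recovery prob.

state = np.zeros(G.number_of_nodes(), dtype=int)      # 0 = S, 1 = I, 2 = R
state[rng.choice(G.number_of_nodes(), 5, replace=False)] = 1   # initial cases

daily_cases = []
for day in range(60):
    new_inf = []
    for i in np.where(state == 1)[0]:                 # each infective contacts its neighbors
        for j in G.neighbors(i):
            if state[j] == 0 and rng.random() < beta:
                new_inf.append(j)
    recovered = np.where((state == 1) & (rng.random(len(state)) < gamma))[0]
    state[recovered] = 2
    state[new_inf] = 1
    daily_cases.append(len(set(new_inf)))             # incidence curve to be predicted
print(daily_cases)
```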
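Nov. 14 (shrinkage toward a submodel). One generic Stein-type template for shrinking a full-model ridge estimator toward a candidate submodel (an illustrative form only; the specific HDSE construction and its theory are given in the speaker's work):

$$
\hat{\beta}^{\mathrm{HDSE}}
  = \hat{\beta}^{\mathrm{SM}}
  + \left(1 - \frac{c}{T_n}\right)
    \left(\hat{\beta}^{\mathrm{ridge}} - \hat{\beta}^{\mathrm{SM}}\right),
$$

where $\hat{\beta}^{\mathrm{ridge}}$ is the full-model ridge estimator, $\hat{\beta}^{\mathrm{SM}}$ is the estimator based on the candidate submodel, $T_n$ is a distance statistic measuring the departure of the submodel from the full model, and $c$ is a shrinkage constant; large $T_n$ leaves the ridge estimator nearly untouched, while small $T_n$ pulls it toward the submodel.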
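Nov. 21 (outlyingness functions). A small sketch of two of the multivariate outlyingness functions named in the abstract, Mahalanobis distance outlyingness and spatial outlyingness, computed with the ordinary sample mean and covariance; robust choices of center and scatter, which matter for masking robustness, are omitted for brevity, and the toy data are hypothetical.

```python
# Two multivariate outlyingness functions (sketch with non-robust center/scatter).
import numpy as np

def mahalanobis_outlyingness(X):
    center = X.mean(axis=0)
    prec = np.linalg.inv(np.cov(X, rowvar=False))
    d = np.sqrt(np.einsum('ij,jk,ik->i', X - center, prec, X - center))
    return d / (1.0 + d)                     # map distances into [0, 1)

def spatial_outlyingness(X):
    out = np.empty(len(X))
    for i, x in enumerate(X):
        diffs = x - np.delete(X, i, axis=0)
        units = diffs / np.linalg.norm(diffs, axis=1, keepdims=True)
        out[i] = np.linalg.norm(units.mean(axis=0))
    return out

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[:3] += 6                                    # a small cluster of outliers
print(np.argsort(mahalanobis_outlyingness(X))[-3:])   # indices 0-2 should appear
print(np.argsort(spatial_outlyingness(X))[-3:])       # indices 0-2 should appear
```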