Statistical Modeling for Biomedical Researchers: A Simple Introduction to the Analysis of Complex Data

William Dupont’s Statistical Modeling for Biomedical Researchers, Second Edition is ideal for a one-semester graduate course in biostatistics and epidemiology. Dupont assumes only a basic knowledge of statistics, such as that obtained from a standard introductory statistics course. Stata is used extensively throughout the text, making it possible to introduce computationally complex methods with little or no higher-level mathematics. As a result, Dupont focuses on concepts and model assumptions rather than on the underlying mathematics. The text covers linear regression, logistic regression, Poisson regression, survival analysis, and analysis of variance. Two chapters are devoted to each topic: an introductory chapter that uses simple data to develop the concept, and a more advanced chapter that covers more complex models, case studies, and diagnostic measures.


Dupont pays equal attention to the methods and to using Stata to apply them. When Stata output is displayed, its most important elements are highlighted and explained in notes that follow the output. These notes help the reader make sense of the output by focusing on what matters for the problem at hand. The notes also include instructions for reproducing the analysis via Stata’s point-and-click user interface. The text is replete with examples featuring real medical data and uses Stata graphics extensively, with enough explanation and detail for readers to reproduce the graphs.

1 Introduction
1.1 Algebraic notation
1.2 Descriptive statistics

1.2.1 Dot plot
1.2.2 Sample mean
1.2.3 Residual
1.2.4 Sample variance
1.2.5 Sample standard deviation
1.2.6 Percentile and median
1.2.7 Box plot
1.2.8 Histogram
1.2.9 Scatter plot

1.3 The Stata Statistical Software Package

1.3.1 Downloading data from my website
1.3.2 Creating histograms with Stata
1.3.3 Stata command syntax
1.3.4 Obtaining interactive help from Stata
1.3.5 Stata log files
1.3.6 Stata graphics and schemes
1.3.7 Stata do files
1.3.8 Stata pulldown menus
1.3.9 Displaying other descriptive statistics with Stata

1.4 Inferential statistics

1.4.1 Probability density function
1.4.2 Mean, variance, and standard deviation
1.4.3 Normal distribution
1.4.4 Expected value
1.4.5 Standard error
1.4.6 Null hypothesis, alternative hypothesis, and P-value
1.4.7 95% confidence interval
1.4.8 Statistical power
1.4.9 The z and Student’s t distributions
1.4.10 Paired t test
1.4.11 Performing paired t tests with Stata
1.4.12 Independent t test using a pooled standard error estimate
1.4.13 Independent t test using separate standard error estimates
1.4.14 Independent t tests using Stata
1.4.15 The chi-squared distribution

1.5 Overview of methods discussed in this text

1.5.1 Models with one response per patient
1.5.2 Models with multiple responses per patient

1.6 Additional reading
1.7 Exercises


2 Simple linear regression
2.1 Sample covariance
2.2 Sample correlation coefficient
2.3 Population covariance and correlation coefficient
2.4 Conditional expectation
2.5 Simple linear regression model
2.6 Fitting the linear regression model
2.7 Historical trivia: origin of the term regression
2.8 Determining the accuracy of linear regression estimates
2.9 Ethylene glycol poisoning example
2.10 95% confidence interval for y[x] = α + βx evaluated at x
2.11 95% prediction interval for the response of a new patient
2.12 Simple linear regression with Stata
2.13 Lowess regression
2.14 Plotting a lowess regression curve in Stata
2.15 Residual analyses
2.16 Studentized residual analysis using Stata
2.17 Transforming the x and y variables

2.17.1 Stabilizing the variance
2.17.2 Correcting for non-linearity
2.17.3 Example: research funding and morbidity for 29 diseases

2.18 Analyzing transformed data with Stata
2.19 Testing the equality of regression slopes

2.19.1 Example: the Framingham Heart Study

2.20 Comparing slope estimates with Stata
2.21 Density-distribution sunflower plots
2.22 Creating density-distribution sunflower plots with Stata
2.23 Additional reading
2.24 Exercises


3 Multiple linear regression
3.1 The model
3.2 Confounding variables
3.3 Estimating the parameters for a multiple linear regression model
3.4 R² statistic for multiple regression models
3.5 Expected response in the multiple regression model
3.6 The accuracy of multiple regression parameter estimates
3.7 Hypothesis tests
3.8 Leverage
3.9 95% confidence interval for ŷ_i
3.10 95% prediction intervals
3.11 Example: the Framingham Heart Study

3.11.1 Preliminary univariate analyses

3.12 Scatter plot matrix graphs

3.12.1 Producing scatter plot matrix graphs with Stata

3.13 Modeling interaction in multiple linear regression

3.13.1 The Framingham example

3.14 Multiple regression modeling of the Framingham data
3.15 Intuitive understanding of a multiple regression model

3.15.1 The Framingham example

3.16 Calculating 95% confidence and prediction intervals
3.17 Multiple linear regression with Stata
3.18 Automatic methods of model selection

3.18.1 Forward selection using Stata
3.18.2 Backward selection
3.18.3 Forward stepwise selection
3.18.4 Backward stepwise selection
3.18.5 Pros and cons of automated model selection

3.19 Collinearity
3.20 Residual analyses
3.21 Influence

3.21.1 Δβ̂ influence statistic
3.21.2 Cook’s distance
3.21.3 The Framingham example

3.22 Residual and influence analyses using Stata
3.23 Using multiple linear regression for non-linear models
3.24 Building non-linear models with restricted cubic splines

3.24.1 Choosing the knots for a restricted cubic spline model

3.25 The SUPPORT Study of hospitalized patients

3.25.1 Modeling length-of-stay and MAP using restricted cubic splines
3.25.2 Using Stata for non-linear models with restricted cubic splines

3.26 Additional reading
3.27 Exercises


4 Simple logistic regression
4.1 Example: APACHE score and mortality in patients with sepsis
4.2 Sigmoidal family of logistic regression curves
4.3 The log odds of death given a logistic probability function
4.4 The binomial distribution
4.5 Simple logistic regression model
4.6 Generalized linear model
4.7 Contrast between logistic and linear regression
4.8 Maximum likelihood estimation

4.8.1 Variance of maximum likelihood parameter estimates

4.9 Statistical tests and confidence intervals

4.9.1 Likelihood ratio tests
4.9.2 Quadratic approximations to the log likelihood ratio function
4.9.3 Score tests
4.9.4 Wald tests and confidence intervals
4.9.5 Which test should you use?

4.10 Sepsis example
4.11 Logistic regression with Stata
4.12 Odds ratios and the logistic regression model
4.13 95% confidence interval for the odds ratio associated with a unit increase in x

4.13.1 Calculating this odds ratio with Stata

4.14 Logistic regression with grouped response data
4.15 95% confidence interval for π[x]
4.16 Exact 100(1 − α)% confidence intervals for proportions
4.17 Example: the Ibuprofen in Sepsis Study
4.18 Logistic regression with grouped data using Stata
4.19 Simple 2 × 2 case–control studies

4.19.1 Example: the Ille-et-Vilaine study of esophageal cancer and alcohol
4.19.2 Review of classical case–control theory
4.19.3 95% confidence interval for the odds ratio: Woolf’s method
4.19.4 Test of the null hypothesis that the odds ratio equals one
4.19.5 Test of the null hypothesis that two proportions are equal

4.20 Logistic regression models for 2 × 2 contingency tables

4.20.1 Nuisance parameters
4.20.2 95% confidence interval for the odds ratio: logistic regression

4.21 Creating a Stata data file
4.22 Analyzing case–control data with Stata
4.23 Regressing disease against exposure
4.24 Additional reading
4.25 Exercises


5 Multiple logistic regression
5.1 Mantel–Haenszel estimate of an age-adjusted odds ratio
5.2 Mantel–Haenszel χ² statistic for multiple 2 × 2 tables
5.3 95% confidence interval for the age-adjusted odds ratio
5.4 Breslow–Day–Tarone test for homogeneity
5.5 Calculating the Mantel–Haenszel odds ratio using Stata
5.6 Multiple logistic regression model

5.6.1 Likelihood ratio test of the influence of the covariates on the response variable

5.7 95% confidence interval for an adjusted odds ratio
5.8 Logistic regression for multiple 2 × 2 contingency tables
5.9 Analyzing multiple 2 × 2 tables with Stata
5.10 Handling categorical variables in Stata
5.11 Effect of dose of alcohol on esophageal cancer risk

5.11.1 Analyzing model (5.25) with Stata

5.12 Effect of dose of tobacco on esophageal cancer risk
5.13 Deriving odds ratios from multiple parameters
5.14 The standard error of a weighted sum of regression coefficients
5.15 Confidence intervals for weighted sums of coefficients
5.16 Hypothesis tests for weighted sums of coefficients
5.17 The estimated variance–covariance matrix
5.18 Multiplicative models of two risk factors
5.19 Multiplicative model of smoking, alcohol, and esophageal cancer
5.20 Fitting a multiplicative model with Stata
5.21 Model of two risk factors with interaction
5.22 Model of alcohol, tobacco, and esophageal cancer with interaction terms
5.23 Fitting a model with interaction using Stata
5.24 Model fitting: nested models and model deviance
5.25 Effect modifiers and confounding variables
5.26 Goodness-of-fit tests

5.26.1 The Pearson χ² goodness-of-fit statistic

5.27 Hosmer–Lemeshow goodness-of-fit test

5.27.1 An example: the Ille-et-Vilaine cancer data set

5.28 Residual and influence analysis

5.28.1 Standardized Pearson residual
5.28.2 Δβ̂_j influence statistic
5.28.3 Residual plots of the Ille-et-Vilaine data on esophageal cancer

5.29 Using Stata for goodness-of-fit tests and residual analyses
5.30 Frequency matched case–control studies
5.31 Conditional logistic regression
5.32 Analyzing data with missing values

5.32.1 Imputing data that is missing at random
5.32.2 Cardiac output in the Ibuprofen in Sepsis Study
5.32.3 Modeling missing values with Stata

5.33 Logistic regression using restricted cubic splines

5.33.1 Odds ratios from restricted cubic spline models
5.33.2 95% confidence intervals for ψ̂[x]

5.34 Modeling hospital mortality in the SUPPORT Study
5.35 Using Stata for logistic regression with restricted cubic splines
5.36 Regression methods with a categorical response variable

5.36.1 Proportional odds logistic regression
5.36.2 Polytomous logistic regression

5.37 Additional reading
5.38 Exercises


6 Introduction to survival analysis
6.1 Survival and cumulative mortality functions
6.2 Right censored data
6.3 Kaplan–Meier survival curves
6.4 An example: genetic risk of recurrent intracerebral hemorrhage
6.5 95% confidence intervals for survival functions
6.6 Cumulative mortality function
6.7 Censoring and bias
6.8 Log-rank test
6.9 Using Stata to derive survival functions and the log-rank test
6.10 Log-rank test for multiple patient groups
6.11 Hazard functions
6.12 Proportional hazards
6.13 Relative risks and hazard ratios
6.14 Proportional hazards regression analysis
6.15 Hazard regression analysis of the intracerebral hemorrhage data
6.16 Proportional hazards regression analysis with Stata
6.17 Tied failure times
6.18 Additional reading
6.19 Exercises


7 Hazard regression analysis
7.1 Proportional hazards model
7.2 Relative risks and hazard ratios
7.3 95% confidence intervals and hypothesis tests
7.4 Nested models and model deviance
7.5 An example: the Framingham Heart Study

7.5.1 Kaplan–Meier survival curves for DBP
7.5.2 Simple hazard regression model for CHD risk and DBP
7.5.3 Restricted cubic spline model of CHD risk and DBP
7.5.4 Categorical hazard regression model of CHD risk and DBP
7.5.5 Simple hazard regression model of CHD risk and gender
7.5.6 Multiplicative model of DBP and gender on risk of CHD
7.5.7 Using interaction terms to model the effects of gender and DBP on CHD
7.5.8 Adjusting for confounding variables
7.5.9 Interpretation
7.5.10 Alternative models

7.6 Proportional hazards regression analysis using Stata
7.7 Stratified proportional hazards models
7.8 Survival analysis with ragged study entry

7.8.1 Kaplan–Meier survival curve and the log-rank test with ragged entry
7.8.2 Age, sex, and CHD in the Framingham Heart Study
7.8.3 Proportional hazards regression analysis with ragged entry
7.8.4 Survival analysis with ragged entry using Stata

7.9 Predicted survival, log–log plots, and the proportional hazards assumption

7.9.1 Evaluating the proportional hazards assumption with Stata

7.10 Hazard regression models with time-dependent covariates

7.10.1 Testing the proportional hazards assumption
7.10.2 Modeling time-dependent covariates with Stata

7.11 Additional reading
7.12 Exercises


8 Introduction to Poisson regression: inferences on morbidity and mortality rates
8.1 Elementary statistics involving rates
8.2 Calculating relative risks from incidence data using Stata
8.3 The binomial and Poisson distributions
8.4 Simple Poisson regression for 2 × 2 tables
8.5 Poisson regression and the generalized linear model
8.6 Contrast between Poisson, logistic, and linear regression
8.7 Simple Poisson regression with Stata
8.8 Poisson regression and survival analysis

8.8.1 Recoding survival data on patients as patient–year data
8.8.2 Converting survival records to person–years of follow-up using Stata

8.9 Converting the Framingham survival data set to person–time data
8.10 Simple Poisson regression with multiple data records
8.11 Poisson regression with a classification variable
8.12 Applying simple Poisson regression to the Framingham data
8.13 Additional reading
8.14 Exercises


9 Multiple Poisson regression
9.1 Multiple Poisson regression model
9.2 An example: the Framingham Heart Study

9.2.1 A multiplicative model of gender, age, and coronary heart disease
9.2.2 A model of age, gender, and CHD with interaction terms
9.2.3 Adding confounding variables to the model

9.3 Using Stata to perform Poisson regression
9.4 Residual analyses for Poisson regression models

9.4.1 Deviance residuals

9.5 Residual analysis of Poisson regression models using Stata
9.6 Additional reading
9.7 Exercises


10 Fixed effects analysis of variance
10.1 One-way analysis of variance
10.2 Multiple comparisons
10.3 Reformulating analysis of variance as a linear regression model
10.4 Non-parametric methods
10.5 Kruskal–Wallis test
10.6 Example: a polymorphism in the estrogen receptor gene
10.7 User contributed software in Stata
10.8 One-way analyses of variance using Stata
10.9 Two-way analysis of variance, analysis of covariance, and other models
10.10 Additional reading
10.11 Exercises


11 Repeated-measures analysis of variance
11.1 Example: effect of race and dose of isoproterenol on blood flow
11.2 Exploratory analysis of repeated measures data using Stata
11.3 Response feature analysis
11.4 Example: the isoproterenol data set
11.5 Response feature analysis using Stata
11.6 The area-under-the-curve response feature
11.7 Generalized estimating equations
11.8 Common correlation structures
11.9 GEE analysis and the Huber–White sandwich estimator
11.10 Example: analyzing the isoproterenol data with GEE
11.11 Using Stata to analyze the isoproterenol data set using GEE
11.12 GEE analyses with logistic or Poisson models
11.13 Additional reading
11.14 Exercises




A Summary of statistical models discussed in this text
A.1 Models for continuous response variables with one response per patient
A.2 Models for dichotomous or categorical response variables with one response per patient
A.3 Models for survival data (follow-up time plus fate at exit observed on each patient)
A.4 Models for response variables that are event rates or the number of events during a specified number of patient–years of follow-up. The event must be rare
A.5 Models with multiple observations per patient or matched or clustered patients


B Summary of Stata commands used in this text
B.1 Data manipulation and description
B.2 Analysis commands
B.3 Graph commands
B.4 Common options for graph commands (insert after comma)
B.5 Post-estimation commands (affected by preceding regression-type command)
B.6 Command prefixes
B.7 Command qualifiers (insert before comma)
B.8 Logical and relational operators and system variables (see Stata User’s Guide)
B.9 Functions (see Stata Data Management Manual)


Author: William D. Dupont
Edition: Second Edition
ISBN-13: 978-0-521-61480-1
Copyright: © 2009 Cambridge University Press
