William Dupont’s Statistical Modeling for Biomedical Researchers, Second Edition is ideal for a one-semester graduate course in biostatistics and epidemiology. Dupont assumes only a basic knowledge of statistics, such as that obtained from a standard introductory statistics course. Stata is used extensively throughout the text, making it possible to introduce computationally complex methods with little or no higher-level mathematics. As a result, Dupont focuses on concepts and model assumptions rather than on the underlying mathematics. The text covers linear regression, logistic regression, Poisson regression, survival analysis, and analysis of variance. Two chapters are devoted to each topic: an introductory chapter that uses simple data to develop the concept, and a more advanced chapter that covers more complex models, case studies, and diagnostic measures.
Dupont pays equal attention to the methods and to using Stata to apply them. When Stata output is displayed, its most important elements are highlighted and explained in notes that follow the output. These notes help the reader make sense of the output by focusing on what matters for the problem at hand. The notes also include instructions for reproducing the analysis via Stata’s point-and-click user interface. The text is replete with examples featuring real medical data and uses Stata graphics extensively, with enough explanation and detail for readers to reproduce them.
1.2 Descriptive statistics
1.2.2 Sample mean
1.2.3 Residual
1.2.4 Sample variance
1.2.5 Sample standard deviation
1.2.6 Percentile and median
1.2.7 Box plot
1.2.8 Histogram
1.2.9 Scatter plot
1.3 The Stata Statistical Software Package
1.3.2 Creating histograms with Stata
1.3.3 Stata command syntax
1.3.4 Obtaining interactive help from Stata
1.3.5 Stata log files
1.3.6 Stata graphics and schemes
1.3.7 Stata do files
1.3.8 Stata pulldown menus
1.3.9 Displaying other descriptive statistics with Stata
1.4 Inferential statistics
1.4.2 Mean, variance, and standard deviation
1.4.3 Normal distribution
1.4.4 Expected value
1.4.5 Standard error
1.4.6 Null hypothesis, alternative hypothesis, and P-value
1.4.7 95% confidence interval
1.4.8 Statistical power
1.4.9 The z and Student’s t distributions
1.4.10 Paired t test
1.4.11 Performing paired t tests with Stata
1.4.12 Independent t test using a pooled standard error estimate
1.4.13 Independent t test using separate standard error estimates
1.4.14 Independent t tests using Stata
1.4.15 The chi-squared distribution
1.5 Overview of methods discussed in this text
1.5.2 Models with multiple responses per patient
1.6 Additional reading
1.7 Exercises
2.2 Sample correlation coefficient
2.3 Population covariance and correlation coefficient
2.4 Conditional expectation
2.5 Simple linear regression model
2.6 Fitting the linear regression model
2.7 Historical trivia: origin of the term regression
2.8 Determining the accuracy of linear regression estimates
2.9 Ethylene glycol poisoning example
2.10 95% confidence interval for y[x] = α + βx evaluated at x
2.11 95% prediction interval for the response of a new patient
2.12 Simple linear regression with Stata
2.13 Lowess regression
2.14 Plotting a lowess regression curve in Stata
2.15 Residual analyses
2.16 Studentized residual analysis using Stata
2.17 Transforming the x and y variables
2.17.2 Correcting for non-linearity
2.17.3 Example: research funding and morbidity for 29 diseases
2.18 Analyzing transformed data with Stata
2.19 Testing the equality of regression slopes
2.20 Comparing slope estimates with Stata
2.21 Density-distribution sunflower plots
2.22 Creating density-distribution sunflower plots with Stata
2.23 Additional reading
2.24 Exercises
3.2 Confounding variables
3.3 Estimating the parameters for a multiple linear regression model
3.4 R² statistic for multiple regression models
3.5 Expected response in the multiple regression model
3.6 The accuracy of multiple regression parameter estimates
3.7 Hypothesis tests
3.8 Leverage
3.9 95% confidence interval for ŷi
3.10 95% prediction intervals
3.11 Example: the Framingham Heart Study
3.12 Scatter plot matrix graphs
3.13 Modeling interaction in multiple linear regression
3.14 Multiple regression modeling of the Framingham data
3.15 Intuitive understanding of a multiple regression model
3.16 Calculating 95% confidence and prediction intervals
3.17 Multiple linear regression with Stata
3.18 Automatic methods of model selection
3.18.2 Backward selection
3.18.3 Forward stepwise selection
3.18.4 Backward stepwise selection
3.18.5 Pros and cons of automated model selection
3.19 Collinearity
3.20 Residual analyses
3.21 Influence
3.21.2 Cook’s distance
3.21.3 The Framingham example
3.22 Residual and influence analyses using Stata
3.23 Using multiple linear regression for non-linear models
3.24 Building non-linear models with restricted cubic splines
3.25 The SUPPORT Study of hospitalized patients
3.25.2 Using Stata for non-linear models with restricted cubic splines
3.26 Additional reading
3.27 Exercises
4.2 Sigmoidal family of logistic regression curves
4.3 The log odds of death given a logistic probability function
4.4 The binomial distribution
4.5 Simple logistic regression model
4.6 Generalized linear model
4.7 Contrast between logistic and linear regression
4.8 Maximum likelihood estimation
4.9 Statistical tests and confidence intervals
4.9.2 Quadratic approximations to the log likelihood ratio function
4.9.3 Score tests
4.9.4 Wald tests and confidence intervals
4.9.5 Which test should you use?
4.10 Sepsis example
4.11 Logistic regression with Stata
4.12 Odds ratios and the logistic regression model
4.13 95% confidence interval for the odds ratio associated with a unit increase in x
4.14 Logistic regression with grouped response data
4.15 95% confidence interval for π[x]
4.16 Exact 100(1 − α)% confidence intervals for proportions
4.17 Example: the Ibuprofen in Sepsis Study
4.18 Logistic regression with grouped data using Stata
4.19 Simple 2 × 2 case–control studies
4.19.2 Review of classical case–control theory
4.19.3 95% confidence interval for the odds ratio: Woolf’s method
4.19.4 Test of the null hypothesis that the odds ratio equals one
4.19.5 Test of the null hypothesis that two proportions are equal
4.20 Logistic regression models for 2 × 2 contingency tables
4.20.2 95% confidence interval for the odds ratio: logistic regression
4.21 Creating a Stata data file
4.22 Analyzing case–control data with Stata
4.23 Regressing disease against exposure
4.24 Additional reading
4.25 Exercises
5.2 Mantel–Haenszel χ² statistic for multiple 2 × 2 tables
5.3 95% confidence interval for the age-adjusted odds ratio
5.4 Breslow–Day–Tarone test for homogeneity
5.5 Calculating the Mantel–Haenszel odds ratio using Stata
5.6 Multiple logistic regression model
5.7 95% confidence interval for an adjusted odds ratio
5.8 Logistic regression for multiple 2 × 2 contingency tables
5.9 Analyzing multiple 2 × 2 tables with Stata
5.10 Handling categorical variables in Stata
5.11 Effect of dose of alcohol on esophageal cancer risk
5.12 Effect of dose of tobacco on esophageal cancer risk
5.13 Deriving odds ratios from multiple parameters
5.14 The standard error of a weighted sum of regression coefficients
5.15 Confidence intervals for weighted sums of coefficients
5.16 Hypothesis tests for weighted sums of coefficients
5.17 The estimated variance–covariance matrix
5.18 Multiplicative models of two risk factors
5.19 Multiplicative model of smoking, alcohol, and esophageal cancer
5.20 Fitting a multiplicative model with Stata
5.21 Model of two risk factors with interaction
5.22 Model of alcohol, tobacco, and esophageal cancer with interaction terms
5.23 Fitting a model with interaction using Stata
5.24 Model fitting: nested models and model deviance
5.25 Effect modifiers and confounding variables
5.26 Goodness-of-fit tests
5.27 Hosmer–Lemeshow goodness-of-fit test
5.28 Residual and influence analysis
5.28.2 Δβ̂j influence statistic
5.28.3 Residual plots of the Ille-et-Vilaine data on esophageal cancer
5.29 Using Stata for goodness-of-fit tests and residual analyses
5.30 Frequency matched case–control studies
5.31 Conditional logistic regression
5.32 Analyzing data with missing values
5.32.2 Cardiac output in the Ibuprofen in Sepsis Study
5.32.3 Modeling missing values with Stata
5.33 Logistic regression using restricted cubic splines
5.33.2 95% confidence intervals for ψ̂[x]
5.34 Modeling hospital mortality in the SUPPORT Study
5.35 Using Stata for logistic regression with restricted cubic splines
5.36 Regression methods with a categorical response variable
5.36.2 Polytomous logistic regression
5.37 Additional reading
5.38 Exercises
6.2 Right censored data
6.3 Kaplan–Meier survival curves
6.4 An example: genetic risk of recurrent intracerebral hemorrhage
6.5 95% confidence intervals for survival functions
6.6 Cumulative mortality function
6.7 Censoring and bias
6.8 Log-rank test
6.9 Using Stata to derive survival functions and the log-rank test
6.10 Log-rank test for multiple patient groups
6.11 Hazard functions
6.12 Proportional hazards
6.13 Relative risks and hazard ratios
6.14 Proportional hazards regression analysis
6.15 Hazard regression analysis of the intracerebral hemorrhage data
6.16 Proportional hazards regression analysis with Stata
6.17 Tied failure times
6.18 Additional reading
6.19 Exercises
7.2 Relative risks and hazard ratios
7.3 95% confidence intervals and hypothesis tests
7.4 Nested models and model deviance
7.5 An example: the Framingham Heart Study
7.5.2 Simple hazard regression model for CHD risk and DBP
7.5.3 Restricted cubic spline model of CHD risk and DBP
7.5.4 Categorical hazard regression model of CHD risk and DBP
7.5.5 Simple hazard regression model of CHD risk and gender
7.5.6 Multiplicative model of DBP and gender on risk of CHD
7.5.7 Using interaction terms to model the effects of gender and DBP on CHD
7.5.8 Adjusting for confounding variables
7.5.9 Interpretation
7.5.10 Alternative models
7.6 Proportional hazards regression analysis using Stata
7.7 Stratified proportional hazards models
7.8 Survival analysis with ragged study entry
7.8.2 Age, sex, and CHD in the Framingham Heart Study
7.8.3 Proportional hazards regression analysis with ragged entry
7.8.4 Survival analysis with ragged entry using Stata
7.9 Predicted survival, log–log plots, and the proportional hazards assumption
7.10 Hazard regression models with time-dependent covariates
7.10.2 Modeling time-dependent covariates with Stata
7.11 Additional reading
7.12 Exercises
8.2 Calculating relative risks from incidence data using Stata
8.3 The binomial and Poisson distributions
8.4 Simple Poisson regression for 2 × 2 tables
8.5 Poisson regression and the generalized linear model
8.6 Contrast between Poisson, logistic, and linear regression
8.7 Simple Poisson regression with Stata
8.8 Poisson regression and survival analysis
8.8.2 Converting survival records to person–years of follow-up using Stata
8.9 Converting the Framingham survival data set to person–time data
8.10 Simple Poisson regression with multiple data records
8.11 Poisson regression with a classification variable
8.12 Applying simple Poisson regression to the Framingham data
8.13 Additional reading
8.14 Exercises
9.2 An example: the Framingham Heart Study
9.2.2 A model of age, gender, and CHD with interaction terms
9.2.3 Adding confounding variables to the model
9.3 Using Stata to perform Poisson regression
9.4 Residual analyses for Poisson regression models
9.5 Residual analysis of Poisson regression models using Stata
9.6 Additional reading
9.7 Exercises
10.2 Multiple comparisons
10.3 Reformulating analysis of variance as a linear regression model
10.4 Non-parametric methods
10.5 Kruskal–Wallis test
10.6 Example: a polymorphism in the estrogen receptor gene
10.7 User contributed software in Stata
10.8 One-way analyses of variance using Stata
10.9 Two-way analysis of variance, analysis of covariance, and other models
10.10 Additional reading
10.11 Exercises
11.2 Exploratory analysis of repeated measures data using Stata
11.3 Response feature analysis
11.4 Example: the isoproterenol data set
11.5 Response feature analysis using Stata
11.6 The area-under-the-curve response feature
11.7 Generalized estimating equations
11.8 Common correlation structures
11.9 GEE analysis and the Huber–White sandwich estimator
11.10 Example: analyzing the isoproterenol data with GEE
11.11 Using Stata to analyze the isoproterenol data set using GEE
11.12 GEE analyses with logistic or Poisson models
11.13 Additional reading
11.14 Exercises
A.2 Models for dichotomous or categorical response variables with one response per patient
A.3 Models for survival data (follow-up time plus fate at exit observed on each patient)
A.4 Models for response variables that are event rates or the number of events during a specified number of patient–years of follow-up. The event must be rare
A.5 Models with multiple observations per patient or matched or clustered patients
B.2 Analysis commands
B.3 Graph commands
B.4 Common options for graph commands (insert after comma)
B.5 Post-estimation commands (affected by preceding regression-type command)
B.6 Command prefixes
B.7 Command qualifiers (insert before comma)
B.8 Logical and relational operators and system variables (see Stata User’s Guide)
B.9 Functions (see Stata Data Management Manual)