The Workflow of Data Analysis Using Stata, by J. Scott Long, is an essential productivity tool for data analysts. Aimed at anyone who analyzes data, this book presents an effective strategy for designing and doing data-analytic projects.
In this book, Long presents lessons drawn from his experience writing numerous academic publications, coauthoring the immensely popular Regression Models for Categorical Dependent Variables Using Stata, and coauthoring the SPOST routines, which are downloaded over 20,000 times a year.
A workflow of data analysis is a process for managing all aspects of data analysis. Planning, documenting, and organizing your work; cleaning the data; creating, renaming, and verifying variables; performing and presenting statistical analyses; producing replicable results; and archiving what you have done are all integral parts of your workflow.
Long shows how to design and implement efficient workflows for both one-person and team projects. He guides you toward streamlining your workflow because a good workflow is essential for replicating your work, and replication is essential for good science.
An efficient workflow reduces the time you spend doing data management and lets you produce datasets that are easier to analyze. When you methodically clean your data and carefully choose names and effective labels for your variables, the time you spend doing statistical and graphical analyses will be more productive and more enjoyable.
After introducing workflows and explaining how a better workflow makes it easier to work with data, Long describes planning, organizing, and documenting your work. He then explains how to write and debug Stata do-files and how to use local and global macros. Long presents conventions for naming, labeling, documenting, and verifying variables that greatly simplify data analysis. He also covers cleaning, analyzing, and protecting your data.
While describing effective workflows, Long also introduces the basics of data management in Stata and of writing Stata do-files. Using real-world examples, Stata commands, and Stata scripts, he illustrates effective techniques for managing your data and analyses. If you analyze data, this book is for you.
© Copyright 1996–2023 StataCorp LLC
List of tables
List of figures
List of examples
Preface
A word about fonts, files, commands, and examples
1.2 Steps in the workflow
1.2.2 Running analysis
1.2.3 Presenting results
1.2.4 Protecting files
1.3 Tasks within each step
1.3.2 Organization
1.3.3 Documentation
1.3.4 Execution
1.4 Criteria for choosing a workflow
1.4.2 Efficiency
1.4.3 Simplicity
1.4.4 Standardization
1.4.5 Automation
1.4.6 Usability
1.4.7 Scalability
1.5 Changing your workflow
1.6 How the book is organized
2.2 Planning
2.3 Organization
2.3.2 Organizing files and directories
2.3.3 Creating your directory structure
A directory structure for a large, one-person project
Directories for collaborative projects
Special-purpose directories
Remembering what directories contain
Planning your directory structure
Naming files
Batch files
2.3.4 Moving into a new directory structure (advanced topic)
2.4 Documentation
2.4.2 Levels of documentation
2.4.3 Suggestions for writing documentation
2.4.4 The research log
A template for research logs
2.4.5 Codebooks
2.4.6 Dataset documentation
2.5 Conclusions
3.1 Three ways to execute commands
3.1.2 Dialog boxes
3.1.3 Do-files
3.2 Writing effective do-files
3.2.1 Making do-files robust
Use version control
Exclude directory information
Include seeds for random numbers
3.2.2 Making do-files legible
Use alignment and indentation
Use short lines
Limit your abbreviations
Be consistent
3.2.3 Templates for do-files
A template for simple do-files
A more complex do-file template
3.3 Debugging do-files
3.3.1 Simple errors and how to fix them
Log file already exists
Incorrect command name
Incorrect variable name
Incorrect option
Missing comma before options
3.3.2 Steps for resolving errors
Step 2: Start with a clean slate
Step 3: Try other data
Step 4: Assume everything could be wrong
Step 5: Run the program in steps
Step 6: Exclude parts of the do-file
Step 7: Starting over
Step 8: Sometimes it is not your mistake
3.3.3 Example 1: Debugging a subtle syntax error
3.3.4 Example 2: Debugging unanticipated results
3.3.5 Advanced methods for debugging
3.4 How to get help
3.5 Conclusions
4.1 Macros
4.1.1 Local and global macros
Global macros
Using double quotes when defining macros
Creating long strings
4.1.2 Specifying groups of variables and nested models
4.1.3 Setting options with locals
4.2 Information returned by Stata commands
4.3 Loops: foreach and forvalues
The forvalues command
4.3.1 Ways to use loops
Loop example 2: Creating interaction variables
Loop example 3: Fitting models with alternative measures of education
Loop example 4: Recoding multiple variables the same way
Loop example 5: Creating a macro that holds accumulated information
Loop example 6: Retrieving information returned by Stata
4.3.2 Counters in loops
4.3.3 Nested loops
4.3.4 Debugging loops
4.4 The include command
4.4.2 Recoding data using include files
4.4.3 Caution when using include files
4.5 Ado-files
4.5.2 Loading and deleting ado-files
4.5.3 Listing variable names and labels
4.5.4 A general program to change your working directory
4.5.5 Words of caution
4.6 Help files
4.6.2 help me
4.7 Conclusions
5.2 The dual workflow of data management and statistical analysis
5.3 Names, notes, and labels
5.4 Naming do-files
5.4.2 Naming do-files to reproduce statistical analysis
5.4.3 Using master do-files
5.4.4 A template for naming do-files
5.5 Naming and internally documenting datasets
5.5.2 Datasets for larger projects
5.5.3 Labels and notes for datasets
5.5.4 The datasignature command
Changes datasignature does not detect
5.6 Naming variables
5.6.2 Systems for naming variables
Source naming systems
Mnemonic naming systems
5.6.3 Planning names
5.6.4 Principles for selecting names
Use simple, unambiguous names
Try names before you decide
5.7 Labeling variables
5.7.1 Listing variable labels and other information
5.7.2 Syntax for label variable
5.7.3 Principles for variable labels
Test labels before you post the file
5.7.4 Temporarily changing variable labels
5.7.5 Creating variable labels that include the variable name
5.8 Adding notes to variables
5.8.1 Commands for working with notes
Removing notes
Searching notes
5.8.2 Using macros and loops with notes
5.9 Value labels
5.9.1 Creating value labels is a two-step process
Step 2: Assigning labels
Why a two-step system?
Removing labels
5.9.2 Principles for constructing value labels
2) Include the category number
3) Avoid special characters
4) Keeping track of where labels are used
5.9.3 Cleaning value labels
5.9.4 Consistent value labels for missing values
5.9.5 Using loops when assigning value labels
5.10 Using multiple languages
5.10.2 Using label language for short and long labels
5.11 A workflow for names and labels
Step 2: Archive, clone, and rename
Step 3: Revise variable labels
Step 4: Revise value labels
Step 5: Verify the changes
5.11.1 Step 1: Check the source data
Step 1b: Try the current names and labels
5.11.2 Step 2: Create clones and rename variables
Step 2b: Create rename commands
Step 2c: Rename variables
5.11.3 Step 3: Revise variable labels
Step 3b: Revise variable labels
5.11.4 Step 4: Revise value labels
Step 4b: Create label define commands to edit
Step 4c: Revise labels and add them to dataset
5.11.5 Step 5: Check the new names and labels
5.12 Conclusions
6.1 Importing data
6.1.1 Data formats
Binary-data formats
6.1.2 Ways to import data
Using other statistical packages to export data
Using a data conversion program
6.1.3 Verifying data conversion
6.2 Verifying variables
6.2.1 Values review
Values review of data on family values
6.2.2 Substantive review
Examining high-frequency values
Links among variables
Changes in survey questions
6.2.3 Missing-data review
Creating indicators of whether cases are missing
Using extended missing values
Verifying and expanding missing-data codes
Using include files
6.2.4 Internal consistency review
6.2.5 Principles for fixing data inconsistencies
6.3 Creating variables for analysis
6.3.1 Principles for creating new variables
Verify that new variables are correct
Document new variables
Keep the source variables
6.3.2 Core commands for creating variables
The clonevar command
The replace command
6.3.3 Creating variables with missing values
6.3.4 Additional commands for creating variables
The egen command
The tabulate, generate() command
6.3.5 Labeling variables created by Stata
6.3.6 Verifying that variables are correct
Listing variables
Plotting continuous variables
Tabulating variables
Constructing variables multiple ways
6.4 Saving datasets
6.4.1 Selecting observations
6.4.2 Dropping variables
6.4.3 Ordering variables
6.4.4 Internal documentation
6.4.5 Compressing variables
6.4.6 Running diagnostics
Checking for unique ID variables
6.4.7 Adding a data signature
6.4.8 Saving the file
6.4.9 After a file is saved
6.5 Extended example of preparing data for analysis
Creating binary indicators of positive attitudes
Creating four-category scales of positive attitudes
6.6 Merging files
6.6.1 Match-merging
6.6.2 One-to-one merging
6.6.3 Forgetting to match-merge
6.7 Conclusions
7.1 Planning and organizing statistical analysis
7.1.2 Planning in the middle
7.1.3 Planning in the small
7.2 Organizing do-files
7.2.2 What belongs in your do-file?
7.3 Documentation for statistical analysis
7.3.2 Documenting the provenance of results
7.4 Analyzing data using automation
7.4.2 Loops for repeated analyses
Loops for alternative model specifications
7.4.3 Matrices to collect and print results
Saving results from nested regressions
Saving results from different transformations of articles
7.4.4 Creating a graph from a matrix
7.4.5 Include files to load data and select your sample
7.5 Baseline statistics
7.6 Replication
7.6.2 Software and version control
7.6.3 Unknown seed for random numbers
Letting Stata set the seed
Training and confirmation samples
7.6.4 Using a global that is not in your do-file
7.7 Presenting results
7.7.1 Creating tables
Regression tables with esttab
7.7.2 Creating graphs
Font size
7.7.3 Tips for papers and presentations
Presentations
7.8 A project checklist
7.9 Conclusions
8.2 Causes of data loss and issues in recovering a file
8.3 Murphy’s law and rules for copying files
8.4 A workflow for file protection
Part 2: Offline backups
8.5 Archival preservation
8.6 Conclusions
A.1 How Stata works
The working directory
A.2 Working on a network
A.3 Customizing Stata
A.3.2 Commands to change preferences
Options that need to be set each session
A.3.3 profile.do
A.4 Additional resources
References
Author index
Subject index