A very first introduction to Stata

Introduction to Stata

Mr Aidan Horn 

University of Cape Town

aidan@econometrics.co.za

Note: Please use the Stata User's Guide. The latest version can be found at https://www.stata.com/manuals/u.pdf . This page is structured around the Stata User's Guide (for v17). You can use the .do file version of this page.


Contents

0. Code to use at the beginning of a script

15. Saving and printing output — log files


Stata basics

1. Read this — it will help

2. A brief description of Stata

3. Resources for learning and using Stata

4. Stata's help and search facilities

5. Editions of Stata

6. Managing memory

7. more conditions

8. Error messages and return codes

9. The Break key

10. Keyboard use

28. Commands everyone should know


Elements of Stata

11. Language syntax

12. Data

13. Functions and expressions

16. Do files

18. Programming Stata

20. Estimation and postestimation commands

21. Creating reports


Advice

22. Entering and importing data

23. Combining datasets



0. Code to use at the beginning of a script

Click on View > Wrap lines, so that long lines get displayed as a paragraph.


* Comments can be made by starting the line with an asterisk, or by ending a line with // and writing the comment after the code on the same line, or by surrounding a block of text with 

/* multi-line 

text */ 

You can include comments so that others understand your code better, and so that you remember what you were trying to do when you read your code again later on. It is best practice to comment above or on the same line as the corresponding code. For example:

* When the .do file is run, the results window will come forward.

window manage forward results


* Clear datasets in memory, so that analysis can start afresh.

clear all

pause on // switch off for no pauses


From here onwards on this web page, I will mostly keep comments in the regular HTML text, for better presentation.


My suggestion for how to set up a project directory:

Project

├─ DataIN

├─ DataOUT

├─ Scripts

│    ├─ Logs

│    └─ Graphs

├─ Writing

└─ Info


For this tutorial, go to the DataFirst website (go to the Open Data Portal), and search for and download PALMS. Log in with your account, and state that you are exploring PALMS for educational purposes (when they ask for a reason why you are downloading the dataset).


Copy the folder path from the file explorer. Use forward-slashes in folder paths, as that works on Mac, and eliminates the chance of characters being 'escaped'. When collaborating with others, you can both run the same .do file and use different directories if you ask the script to check the username.

if c(username)=="hrnaid001" {

* Data source

global PALMS "C:/Users/hrnaid001/Dropbox/Economics/Survey data/PALMS"

* Project folder

global USER "C:/Users/hrnaid001/Dropbox/Economics/Tutoring/SoE/2023ECO5011F/Stata"

}


cd "$USER/Scripts"

It is easiest if you create the appropriate folders on your computer manually.


15. Saving and printing output — log files

cap log close _all

capture (abbreviated to cap) means that the script won't stop if the line produces an error. log close closes the running log if it is open (so that we can restart the log from the top). The _all specification closes all the open log files (e.g. if multiple logs are being kept).

A log file saves the results into a file, so that we can check the results later if needed, without having to run the script again (which could take time for large, complex analyses).

The Stata Markup and Control Language (smcl format) log file is responsive to the screen size, and has colours.

log using "$USER/Scripts/Logs/Practice", smcl replace name(smcl1)

Supervisors or clients sometimes don't use Stata themselves, so it is easier for them to open a text version of the log file.

log using "$USER/Scripts/Logs/Practice", text replace name(text1)

You can also convert .smcl log files to their text-based version with translate filename.smcl filename.log, replace, which saves filename.log (the text version). You can also convert an SMCL file to PDF, to share it more easily with others: translate filename.smcl filename.pdf
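Logs stay open until they are closed, and a named log can be closed on its own. The following is a minimal, self-contained sketch that writes to a temporary file; the log name demo is an arbitrary choice for illustration. For the two logs opened above, the equivalent lines would be log close smcl1 and log close text1.

```stata
* Open a named text log in a temporary location (the name "demo" is arbitrary).
tempfile demolog
log using "`demolog'", text replace name(demo)

display "This line is recorded in the log."

log query _all      // lists the logs that are currently open
log close demo      // closes only the log named demo
```

Run these lines together from a .do file, because the local macro created by tempfile disappears once the selection has finished running.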


Loading data into memory

On Windows, press Ctrl + Shift + Esc to open Task Manager, and look at the available RAM (random access memory) in the computer. When a statistical package uses a dataset, it loads the data from the hard drive to the RAM, because reading data from RAM is quicker. The following command loads data into memory. Watch how the space used on RAM goes up as the data is loaded in.

use "$PALMS/palmsv3.3.dta", clear

Data analysts need more RAM than regular computer users, because of this.



Stata basics

1. Read this — it will help

If you are confused about a command, type help <command> in the console. It is very important that you use Stata's help files to learn how to use commands, and the correct syntax to use. You will need to continue using the help command to remind yourself of different commands' usage, even after you have become accomplished at using Stata. Run the following line:

help


2. A brief description of Stata

Stata is a statistical package for managing, analyzing, and graphing data.


Throughout this tutorial, we will introduce commands.

Create a variable:

generate x = 5

This creates a variable named "x" that equals 5 in every observation.


The base month of realearnings is December 2017.

Go to https://www.statssa.gov.za/publications/P0141/CPIHistory.pdf to see the indices, and https://www.aidanhorn.co.za/inflation/app for tidy inflation data. Only run the following command once (or generate a new variable instead).

replace realearnings = realearnings/84.3*104.2

Now the base period is 2022. A "global macro" saves a small piece of information (a "local macro" does not carry on saving it after the .do file has finished running).

global BASEyr "2022"

We can abbreviate commands, and they will still run normally (see the underlined part of a command in a help file). "lab var" stands for "label variable".

lab var realearnings "Real gross monthly earnings, in $BASEyr rands"


You need to use two equals signs when doing boolean logic. You should only use one equals sign when defining a variable.

count if realearnings==0


gen logrealearnings = log(realearnings)


The summarize command quickly computes the mean. The detail option quickly shows the distribution, and moments. Do you know what the moments of a distribution are?

summarize realearnings

summarize realearnings, detail


Note that . (i.e. "missing") is treated as larger than any number, as if it were positive infinity. So when an if statement has a condition that a variable needs to be greater than an amount, you also need to include "and the variable is also less than missing". There are multiple missing categories (.a, .b, .c, etc.) that rank above the standard .


Inspect the data for outliers

Sort the dataset in descending order.

gen minusrealearnings = -realearnings

sort minusrealearnings

list realearnings if realearnings > 10^7 & realearnings <.

* Note that in Python, the power symbol must be two asterisks: **

format realearnings %12.0fc

list realearnings if realearnings > 10^7 & realearnings <. // These are monthly earnings values for individuals, from the labour force surveys

count if realearnings > 10^6 & realearnings <.  // In the raw data, 177 people have earnings above R 1 million per month, over the years 1993-2017.
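An alternative to creating a negated copy of the variable is the gsort command, which sorts in descending order when the variable name is prefixed with a minus sign. The sketch below uses the auto dataset that ships with Stata (an assumption for illustration, not the PALMS data), so that it can be run on its own.

```stata
* Example data shipped with Stata (not PALMS).
sysuse auto, clear

* The minus sign asks for descending order.
gsort -price

* The most expensive cars now come first.
list make price in 1/5
```

With PALMS loaded, the same idea would be gsort -realearnings. Note that gsort places missing values first in a descending sort, so an if condition such as realearnings<. is still needed when listing.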

The != means 'not equal to'

levelsof year if realearnings !=. // Note that years 2008 and 2009 do not have earnings data.

DataFirst has imputed earnings values for outliers, which we show at the bottom of this tutorial.


You can view the actual dataset by typing

browse

or browse <varlist>


3. Resources for learning and using Stata


4. Stata's help and search facilities

Use the help function whenever you are unsure about how to write code.


5. Editions of Stata

There are three editions/sizes for Stata. In order from most expensive to cheapest: Stata/MP, Stata/SE and Stata/BE. It costs in the region of ZAR 4700 to purchase a Stata licence.


6. Managing memory

Number the observations from 1 to the end:

gen n = _n

Preserve… restore

If you type preserve in your .do file, then you can change your dataset, save it, and restore it. The original data that you were working with will not have changed afterwards. For example, I often use collapse or reshape within this environment.

preserve

help pause

help collapse

pause  // type "end"

keep if year >= 2000

* Median real earnings by gender and main occupational category, from 2000 onwards.

collapse (median) realearnings (count) n, by(gender jobocccode)

lab var realearnings "Median gross real earnings ($BASEyr rands)"

format realearnings %12.0fc

browse

pause

save "collapse_medianrearn_jobocc.dta", replace

help import excel

export excel using "Median_realearnings_occupation.xlsx", sheet("Gender") sheetreplace keepcellfmt firstrow(varlabels)

restore


7. more conditions

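By default, Stata does not pause when the output fills the results window (set more off is the default), so the results scroll past at full speed. Type set more on if you want Stata to pause with a --more-- prompt.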


8. Error messages and return codes

If there is a (small) mistake in your code, then Stata will stop running the script where the error occurs. Make sure to read the error message carefully, in order to debug what has gone wrong with your code. This way, you can fix problems yourself, without necessarily having to ask others for help.


9. The Break key

Click on the red cross to stop the execution (for example, in case you realise your code wasn't adequate). You can try this with the above preserve section, as I noticed that that section is slow.


10. Keyboard use

Make sure to save your .do file regularly while typing (Ctrl + S), in case Stata crashes after you have developed code. I do this as frequently as every 20–40 seconds, while typing, or after a few lines (because losing work that took mental effort and creativity is frustrating). On a side note, you should sync your files to a cloud, to avoid losing work. See the "Software > Cloud storage" section in http://toolkit.uctecossoc.co.za/


10.6. Tab expansion of variable names. 

A quick way to enter a variable name is to take advantage of Stata’s tab-completion feature. Simply type the first few letters of the variable name in the Command Window and press the Tab key. Stata will automatically type the rest of the variable name for you. If more than one variable name matches the letters you have typed, Stata will complete as much as it can and beep at you to let you know that you have typed a nonunique variable abbreviation. The tab-completion feature also applies to typing filenames.


28. Commands everyone should know

To make sure that you have fre installed, run

capture fre

if _rc == 199 {

ssc install fre

}

fre helps with quickly inspecting the values of a variable (similar to tab, an abbreviation for tabulate).


Here is a list of commands that "everyone" should know (go through this list, with the help files):


Getting help

help, net search, search [U] 4 Stata’s help and search facilities


Operating system interface

pwd, cd


Using and saving data from disk

save

use

compress


Inputting data into Stata

import

edit


Basic data reporting

describe

codebook

list

browse

count

inspect

table

tabulate [R] tabulate oneway and tabulate twoway

fre Similar to tabulate, but includes missing values

summarize


Data manipulation

append, merge [U] 23 Combining datasets

generate, replace

egen

rename

clear

drop, keep

sort

encode, decode

order

by [U] 11.5 by varlist: construct

reshape

frames [D] frames


Graphing data

graph


Keeping track of your work

log [U] 15 Saving and printing output—log files

notes [D] notes


Convenience

display

Elements of Stata

11. Language syntax

NB: Oxford Languages defines "syntax" as (2): "The structure of statements in a computer language." 

It is very important that you know what "syntax" means! This is what you're looking for when you read the help files.


11.1. Overview

With few exceptions, the basic Stata language syntax is:

[by varlist:] command [varlist] [=exp] [if exp] [in range] [weight] [, options]

Square brackets denote optional elements. Take note of how weights are included in estimation, from the line above. A command can be customised with options that come after the comma. There are often multiple options (settings) available, which can be found in the help file. When an option takes an argument, the argument is enclosed in parentheses.
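To make the syntax diagram concrete, here is a small sketch in which each line exercises one element of it. It uses the auto dataset shipped with Stata, and using the car's weight as an analytic weight is purely an illustrative assumption (it has no statistical meaning here).

```stata
sysuse auto, clear

* by varlist: prefix, an if condition, and an option after the comma
bysort foreign: summarize price if mpg > 20, detail

* in range restricts a command to a range of observation numbers
list make price in 1/5

* =exp is used by commands such as generate
generate price_thousands = price/1000

* [weight] goes in square brackets, before the comma
summarize price [aweight=weight], detail
```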


11.1.8. numlist

A numlist is a list of numbers. Stata allows certain shorthands to indicate ranges. Practice editing and running the following loop, with (some of) the various examples listed below.

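* Replace <numlist> with one of the examples below. (forvalues only accepts ranges such as 1/3 or 1(2)9; foreach ... of numlist accepts any numlist.)

foreach v of numlist <numlist> {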

display `v'

}


Numlist                  Meaning
2                        just one number
1 2 3                    three numbers
3 2 1                    three numbers in reversed order
.5 1 1.5                 three different numbers
1 3 -2.17 5.12           four numbers in jumbled order
1/3                      three numbers: 1, 2, 3
3/1                      the same three numbers in reverse order
5/8                      four numbers: 5, 6, 7, 8
-8/-5                    four numbers: −8, −7, −6, −5
-5/-8                    four numbers: −5, −6, −7, −8
-1/2                     four numbers: −1, 0, 1, 2
1 2 to 4                 four numbers: 1, 2, 3, 4
4 3 to 1                 four numbers: 4, 3, 2, 1
10 15 to 30              five numbers: 10, 15, 20, 25, 30
1 2:4                    same as 1 2 to 4
4 3:1                    same as 4 3 to 1
10 15:30                 same as 10 15 to 30
1(1)3                    three numbers: 1, 2, 3
1(2)9                    five numbers: 1, 3, 5, 7, 9
1(2)10                   the same five numbers: 1, 3, 5, 7, 9
9(-2)1                   five numbers: 9, 7, 5, 3, and 1
-1(.5)2.5                the numbers −1, −.5, 0, .5, 1, 1.5, 2, 2.5
1[1]3                    same as 1(1)3
1[2]9                    same as 1(2)9
1[2]10                   same as 1(2)10
9[-2]1                   same as 9(-2)1
-1[.5]2.5                same as -1(.5)2.5
1 2 3/5 8(2)12           eight numbers: 1, 2, 3, 4, 5, 8, 10, 12
1,2,3/5,8(2)12           the same eight numbers
1 2 3/5 8 10 to 12       the same eight numbers
1,2,3/5,8,10 to 12       the same eight numbers
1 2 3/5 8 10:12          the same eight numbers

11.1.10. Prefix commands

The quietly prefix suppresses output in the results window (and log file). It is usually used when you merely want to collect an estimate or perform an operation, without making the screen too busy. I have used this when running many loops in a program.
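As a minimal sketch of this, the lines below run two commands silently and then display only the numbers of interest. The auto dataset shipped with Stata is used as stand-in data, and the local macro name slope is an arbitrary choice.

```stata
sysuse auto, clear

* Run summarize silently, then show only the mean.
quietly summarize price
display "Mean price: " %9.2f r(mean)

* quietly can also wrap a block of commands in braces.
quietly {
    regress price mpg
    local slope = _b[mpg]
}
display "Slope on mpg: " %9.3f `slope'
```

Run the block in one go, because the local macro is discarded once the selected lines have finished running.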


11.2. Abbreviation rules

As mentioned in Section 2, you can run commands even if they are abbreviated. The minimum abbreviation is shown by underlining in the help files. For example:

summ hrslstwk, detail

Here, summ is an abbreviation for summarize. Other commonly-used abbreviations include gen for generate, and lab for label.

11.2.3. Variable-name abbreviation

Variable names may be abbreviated to the shortest string of characters that uniquely identifies them given the data currently loaded in memory. For example:

summarize hrs // Shows count, mean, standard deviation, min and max of hrslstwk.

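11.2.4. Variable-name abbreviation can give unexpected results in long, complicated scripts, when you think that you've typed the full variable name but Stata has matched a different variable. In such cases, you can turn off this feature for a single command with the novarabbrev prefix, or for the whole session with set varabbrev off.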


11.4.1. You may use * to indicate that "zero or more characters go here". For instance, if you suffix * to a partial variable name (for example, educ*), you are referring to all variable names that start with that letter combination. If you prefix * to a letter combination (for example, *_derived), you are referring to all variables that end in that letter combination. If you put * in the middle (for example, self*emp), you are referring to all variables that begin and end with the specified letters.

You may use ? to specify that one character goes here.

You may place a dash - between two variable names to specify all the variables stored between the two listed variables (in the order saved in the dataset), inclusive. You can determine storage order by using describe, which lists variables in the order in which they are stored.
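The three shorthands can be tried out on the auto dataset that ships with Stata (used here as an assumption, since its variable names are short and well known). The ds command simply lists the variable names that a varlist expands to.

```stata
sysuse auto, clear

* Suffix *: all variables starting with t (trunk and turn)
ds t*

* Prefix *: all variables ending in _ratio
ds *_ratio

* ?: exactly one character in that position (matches mpg)
ds ?pg

* Dash: all variables stored from price to rep78, inclusive
summarize price-rep78
```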


11.4.3. Factor variables

You do not need to create dummy variables for categorical variables: when doing estimation, i.varname separates the variable into categorical dummies, and ## gives the main effects plus the interaction terms (a single # gives the interaction terms only), as: i.varname##c.varname  where c. indicates a continuous variable. For example:

fre province

reg realearnings i.province##c.year c.age##c.age

11.4.3.2. When we typed i.province in the regression command above, the smallest level of province became the base level, as that is the default when we do not specify otherwise. You can specify the base level of a factor variable by using the ib. operator.

reg realearnings ib4.province##c.year c.age##c.age // 4. Free State is set as the base.


11.4.4. Time-series variables

You would need to specify what the time variable is, with tsset (e.g. tsset time). Then, L., F., D. and S. are the lag, lead, difference and seasonal operators respectively. For panel data, xtset unit time sets up the dataset for panel data analysis.
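A small sketch of these operators on made-up data (the variable names t and y, and the values, are invented for illustration):

```stata
* Eight time periods of toy data.
clear
set obs 8
generate t = _n
generate y = t^2

* Declare t as the time variable.
tsset t

generate y_lag  = L.y    // value of y in the previous period
generate y_lead = F.y    // value of y in the next period
generate y_diff = D.y    // y minus its lag

list t y y_lag y_lead y_diff
```

The lag is missing in the first period and the lead is missing in the last, since there is no neighbouring observation to draw on.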


11.5. by varlist: construct

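by varlist: (or by varlist, sort: which sorts the data first; bysort varlist: is an equivalent shorthand) runs the command separately within each group defined by the combinations of values of the variables in varlist. Within each by-group, the system variables _n and _N are set relative to that group: _n is the observation number within the group (starting from 1 at the group's first observation), and _N is the number of observations in the group. (Without by, _n is the row of the observation in the dataset, and _N is the total number of observations.)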
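The behaviour of _n and _N within by-groups can be seen in a short sketch. It uses the auto dataset shipped with Stata as stand-in data, and the new variable names are arbitrary.

```stata
sysuse auto, clear

* Sort by foreign, and by price within foreign; then number the cars within each group.
bysort foreign (price): generate rank_in_group = _n

* _N is the number of observations in the by-group.
by foreign: generate group_size = _N

* The cheapest car in each group, along with the size of its group.
list foreign make price rank_in_group group_size if rank_in_group == 1
```

Variables placed in parentheses after the by-variables are used for sorting only; they do not define the groups.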


11.6. You can load a data file from the Internet. For example (covid-19 deaths by country):

import delimited "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv", clear

rename v4 longitude

rename lat latitude

help reshape

reshape long v, i(countryregion provincestate latitude longitude) j(time) 

rename v deaths

lab var deaths "Covid-19 deaths"

browse

replace time = time - 4

gen datetime = date("2020-01-22", "YMD") if time==1

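by countryregion provincestate (time), sort: replace datetime = datetime[1] + time-1  if time>1 // (time) sorts within each group, so that datetime[1] is the first date

format datetime %td // display the number as a date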


Let's go back to using the PALMS data.

use "$PALMS/palmsv3.3.dta", clear

replace realearnings = realearnings/84.3*104.2

* Note that the global macro is still in memory, if this is the same session.

lab var realearnings "Real gross monthly earnings, in $BASEyr rands"


11.6.1. The characters .. refer to the folder containing the current folder. Thus ../myfile refers to myfile in the folder containing the current folder, and "../nextdoor/myfile" refers to "myfile" in the folder "nextdoor" in the folder containing the current folder. This can be useful when saving files in a project.

11.6.2. Stata understands ~ to mean your home directory.


12. Data

12.1. A dataset is data plus labelings, formats, notes, and characteristics.

12.2.1. Some data collectors use "extended" missing values to indicate why a certain value is unknown: the question was not asked, the person refused to answer, etc. The ordering of extended missing values is: 

all numbers < . < .a < .b < ··· < .z

Thus,

count if age>50

may not return the wanted result, as it includes missing ages. You should remember to include "and less than missing" when using a greater-than sign, or a not equals sign:

count if age>50 & age<.

Compare

count if age!=50

to

count if age!=50 & age<.
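The effect is easy to verify on a tiny made-up dataset, in which two of the five ages are missing (one standard missing value and one extended missing value):

```stata
clear
input age
30
50
60
.
.a
end

count if age > 50                     // 3: the value 60, plus both missing values
count if age > 50 & age < .           // 1: only the value 60
count if age != 50 & !missing(age)    // 2: the values 30 and 60
```

The missing() function is an alternative way of writing the condition; !missing(age) is equivalent to age<.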


In a regression, if any of the variables have missing values, those observations will not be in the sample. That is why it is wise to understand the amount of missing data in each variable before doing analysis, as well as how the nonmissing subsets intersect, because the intersection determines the sample size of your analysis. It makes sense to draw a Venn diagram as part of your report.
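One way of checking this before estimation is sketched below, using the auto dataset shipped with Stata (an assumption for illustration; in that dataset only rep78 has missing values). The misstable command reports missing values per variable and how they overlap, and the missing() function accepts several variables at once.

```stata
sysuse auto, clear

misstable summarize    // which variables have missing values, and how many
misstable patterns     // how the missing values overlap across variables

* Observations with complete data on all the variables of a planned regression
count if !missing(price, mpg, rep78)

* The estimation sample has the same size.
regress price mpg rep78
display e(N)
```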


12.2.2. Numeric storage types

Numbers can be stored in one of five variable types: byte, int, long, float (the default), or double. The number storage type can be set when generating a variable, for example:

gen byte age50 = age>=50 & age<.

replace age50=. if missing(age) // missing() also catches the extended missing values .a, .b, etc.

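The compress command shrinks each variable to the smallest storage type that can hold its values without loss of information. bytes are stored in 1 byte; ints in 2 bytes; longs and floats in 4 bytes (float is short for 'floating point'); and doubles in 8 bytes. The table below shows the minimum and maximum values for each storage type.

Storage type    Minimum                     Maximum                    Bytes
byte            -127                        100                        1
int             -32,767                     32,740                     2
long            -2,147,483,647              2,147,483,620              4
float           -1.70141173319 × 10^38      1.70141173319 × 10^38      4
double          -8.9884656743 × 10^307      8.9884656743 × 10^307      8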

12.4. Strings

A "string" is a sequence of characters. They are defined within double-quotation marks (Python can also use single quotation marks), so if you want quotation marks in your string, use `" and "'. For example:

label define reason 17 `"Their mobile phone was "lost"."', modify

label list reason

12.4.4. String data in Stata is usually encoded: stored as numbers, but labelled with the string interpretation. This makes programming and analysis more efficient. Encoding can be done with the encode (and undone with the decode) command, but then the programmer will not have control over the order of the values, if order matters (for example, from survey responses).
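One way of keeping control over the order is to define the value label first and then pass it to encode through its label() option, so that encode uses the existing mapping instead of assigning codes alphabetically. The sketch below uses made-up survey responses; the variable and label names are invented for illustration.

```stata
clear
input str10 response
"Agree"
"Disagree"
"Neutral"
"Agree"
end

* Define the order we want (alphabetical order would put Agree first).
label define likert 1 "Disagree" 2 "Neutral" 3 "Agree"

* encode uses the existing value label rather than creating its own.
encode response, generate(response_n) label(likert)

tabulate response_n             // shown with labels, in the defined order
tabulate response_n, nolabel    // the underlying numeric codes
```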


12.5.1. Numeric formats

I suggest that you use %15.0fc as the numeric format, to look at large values. For example, compare

quietly summ realearnings if year==2017

di r(sum)

di %15.0fc  r(sum)

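table year, statistic(sum realearnings) // Stata 17 and later. In Stata 16 and earlier, use: table year, contents(sum realearnings)

br realearnings // View the data before and after we change the format!

format realearnings %15.0fc

help table

* Note that we still have outliers in these data.

table year, statistic(sum realearnings) nformat(%15.0fc) // In Stata 16 and earlier, use: table year, contents(sum realearnings) format(%15.0fc)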


The number to the left of the decimal point in the format is the total width of the output in characters, including the decimal point, any commas, and a minus sign. The number to the right is the number of decimal places.

Example 2:

cap drop y

gen double y = 234567890.34 in 1

replace y = 654321098765.5432 in 2

list y in 1/2

format y %20.1fc

list y in 1/2


12.6. Dataset, variable, and value labels

Labels are strings used to label elements in Stata, such as labels for datasets, variables, and values.


12.6.2. Variable labels

You can label variables, as we have already shown, with

label variable y "My variable"

This is useful for users of your dataset to understand in more detail what the variables mean, as variable names can only be a short string.

12.6.3. Value labels

Variables usually take on numerical values, and these values are labelled, with 'value labels'. This labelling is important, and is often the main activity of cleaning, once variables are created. For example:

cap drop age10cats

cap label drop bin10


gen age10cats = int(age/10)*10

lab var age10cats "Age categories (bins of 10)"

tab age10cats if age<130

// Run both the following lines at the same time. The /// can be used to break a line.

label define bin10 0 "0-9" 10 "10-19" 20 "20-29" 30 "30-39" 40 "40-49" 50 "50-59" /// 

60 "60-69" 70 "70-79" 80 "80-89" 90 "90-99" 100 "100-109" 110 "110-119" 120 "120-129"

You can list the contents of a value label with label list.

The bin10 value label can now be used on multiple variables. It must be attached to the variable, to put it to work:

label values age10cats bin10

tab age10cats if age<130

Practice highlighting the table in the results window, right click > Copy table, and paste the table into Excel.


12.7. Notes attached to data

You can attach notes to the dataset, with

note: realearnings inflated to $BASEyr

* Display notes

notes

* You can attach notes to variables as well:

note age10cats: There may be outlier ages above 129.

notes age10cats

This enables you to save information longer than just a variable label. See help notes for more guidance!


12.10. Data frames

Similar to R, Stata can now hold multiple datasets in memory.
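A minimal sketch of working with a second frame follows. It assumes Stata 16 or later and uses the auto dataset shipped with Stata; the frame name prices is an arbitrary choice.

```stata
sysuse auto, clear

* Remove the frame if it is left over from an earlier run.
capture frame drop prices

* Copy the data in the current frame into a second frame called prices.
frame copy `c(frame)' prices

* Run commands in the other frame with the frame prefix.
frame prices: collapse (mean) price, by(foreign)
frame prices: list

frame dir    // lists the frames in memory
count        // the current frame still holds all the cars
```

This achieves something similar to the preserve ... restore pattern shown earlier, except that both datasets remain available at the same time.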


13. Functions and expressions

13.2.4. Logical operators

Note that & means 'and', | means 'or', and ! means 'not'. For example:

summarize pweight if (age<15 | (age>=65 & age<130)) & year==2016 // Non-working-age population (younger than 15, or 65 and older) in 2016

di %-15.0fc r(sum)/4 // the QLFS is conducted every quarter, so we divide the sum of the weights by four.

scalar nonworking2016 = r(sum) // saves the number, taking up a small amount of space.

summ pweight if year==2016 // comparison

di %-15.0fc r(sum)/4

di nonworking2016/r(sum) // proportion of non-working age population, out of total population


13.6. Accessing results from commands

Note that you can view what results are saved in memory, by running return list  or  ereturn list (for estimation results).


13.7. Explicit subscripting

You can access the value of a variable, by suffixing the variable with square brackets, and putting the observation number in the brackets. E.g.:

di age[1000]

* The last observation:

di age[_N]


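13.12. When comparing a variable stored as float with a decimal number in a conditional statement, it is sometimes better to wrap the number in the float() function (e.g. if x==float(0.1)), because Stata evaluates expressions in double precision, and many decimal numbers have slightly different float and double representations.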
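The issue can be reproduced in a few lines on made-up data, assuming that the default storage type has not been changed from float:

```stata
clear
set obs 1

generate x = 0.1            // stored as a float by default

count if x == 0.1           // 0: the float version of 0.1 differs from the double version
count if x == float(0.1)    // 1: both sides are now rounded to float precision
```

Storing the variable as a double (generate double x = 0.1) is another way of avoiding the mismatch, at the cost of more memory.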


16. Do files

16.1.3. In order to break lines, when writing a long command, write your code in-between

#delimit ;

; #delimit cr

This changes the "line break" that Stata recognizes to a semi-colon so you must use a semi-colon at the end of the command! Writing #delimit cr changes the "line break" back to a "carriage return". Note that a comment written with * or // will not be read as a separate "line" to the rest of the code, so you need to put comments within /* and */ within this environment. It is good to use this environment when writing code for graphs.
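As a sketch of this environment, the graph command below is spread over several lines and ends with a semi-colon. It uses the auto dataset shipped with Stata, and the titles are invented for illustration. Note that #delimit works in .do files, not when typed into the command window.

```stata
sysuse auto, clear

#delimit ;

twoway scatter price mpg,
    title("Price against mileage")   /* comments must use this style here */
    ytitle("Price (USD)")
    xtitle("Miles per gallon") ;

#delimit cr
```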


18. Programming Stata

You can save a small piece of information (e.g. a number or a string) in a global or local "macro". The global macro will continue to be saved after the (section of the) .do file has run, but the local macro will be discarded once the .do file is not running. For example:

global Intro "Hello world!"

display "$Intro"

display `"I told the computer to say, "$Intro""'

local Short 4.567

display `Short'*4

Note that when running the last line by itself, `Short' does not exist.


You can save entire sections of code, in case you want to re-run it multiple times, in a "program". E.g.:

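cap program drop earnings_tenths // so that the program can be re-defined when the script is re-run

program earnings_tenths

    forvalues y=2011(1)2017 {

        preserve // preserve before dropping any observations, so that every year starts from the full dataset

        keep if year==`y' & jobindcode==`1'

        xtile earningstenth = realearnings [pw=bracketweight], nquantiles(10)

        collapse (mean) earnings_mean = realearnings [pw=bracketweight], by(earningstenth)

        save "earnings_means_tenths_`1'_`y'.dta", replace

        restore

    }

end


* preserve cannot be nested, so the industry filter sits inside the program rather than in this loop.

forvalues j=1/10 {

    earnings_tenths `j'

}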

Note that the first argument of the program earnings_tenths is turned into a local macro `1' inside the program. In this case, it holds the main industry code, which we include in the file name of the collapsed dataset. Global macros can be used with ${global}restoffile in file names, and forward-slashes should be used for folder paths to avoid escaping the dollar sign or back-tick.


You can delete the files you just created, with

forvalues d=1/10 {

forvalues y=2011/2017 {

erase "earnings_means_tenths_`d'_`y'.dta"

}

}


20. Estimation and postestimation commands

20.11. Obtaining predicted values

After running a regression, you can type predict dephat to calculate the predicted values for each observation, based on the estimated coefficients, where dephat is the name of the variable you want to create. To restrict the variable to the sample, state predict dephat if e(sample)


20.12. Accessing estimated coefficients

After estimation, you can access the coefficients with _b[varname]


See 20.13. for performing tests after estimation (such as F-tests).


20.14. Obtaining linear combinations of coefficients

lincom computes point estimates, standard errors, t or z statistics, p-values, and confidence intervals for a linear combination of coefficients after estimation.
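The postestimation tools from sections 20.11 to 20.14 can be tried together in a short sketch. It uses the auto dataset shipped with Stata, and the regression itself is only an illustrative assumption, not a meaningful model.

```stata
sysuse auto, clear

regress price mpg weight

display _b[mpg]                   // coefficient on mpg
display _se[mpg]                  // its standard error

predict price_hat if e(sample)    // fitted values, for the estimation sample only

test mpg weight                   // joint F-test that both coefficients are zero

lincom mpg + weight               // a linear combination of the two coefficients
```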


Please make sure that you have covered the material in Wooldridge, J. 2013. Introduction to econometrics. (EMEA edition).


21. Creating reports

See putexcel and putdocx for creating files programmatically.


Advice

22. Entering and importing data

A dataset can easily be opened in Stata by clicking File > Import.


23. Combining datasets

append combines datasets vertically.

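merge combines datasets horizontally, matching observations on one or more key variables. This can be done 1:1, 1:m, m:1, or m:m, i.e. merging the observations one-to-one, one-to-many, many-to-one, or many-to-many. The Stata manual advises against m:m merges; joinby is usually what is wanted instead.


joinby combines datasets by forming all pairwise combinations of observations within groups.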
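A minimal sketch of append and merge on made-up data follows. The datasets, variable names and values are all invented for illustration: three people plus one late addition, who belong to two households.

```stata
* Household-level data: one row per household.
clear
input hhid income
1 5000
2 8000
end
tempfile households
save `households'

* One extra person, to be appended below.
clear
input personid hhid age
4 2 19
end
tempfile newperson
save `newperson'

* Person-level data: one row per person.
clear
input personid hhid age
1 1 34
2 1 36
3 2 51
end

* append stacks the extra person underneath (vertical combination).
append using `newperson'

* merge adds the household variables to each person (horizontal combination).
* Many people map onto one household, hence m:1.
merge m:1 hhid using `households'

tabulate _merge    // shows which observations were matched
list
```

Run the block in one go, because the temporary files (and the local macros that point to them) disappear once the selected lines have finished running.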