Data Analysis with Stata
Variables and data types
There are different types of variables and data types, which we are going to see in this section.
Indicators or data variables
To inspect the data before drawing any conclusions from it, the browse and edit commands are helpful. Data variables store the fundamental data. As shown in the following table, the income data for different nations is stored in the Cccgdp variable and the population data is stored in the pops variable. To know what each value refers to, identifier (indicator) variables are needed. In the following case, Countrycode and yr identify the country and the year to which each GDP and population (pops) value belongs. The data might be as follows:
[Figure: sample dataset with country, year, GDP, and population columns]
After importing the data in Stata, it is always a good practice to examine the data. It gives you an advantage in any modeling or visualization exercise.
Examining the data
When you first read data into Stata, check that all the variables and observations are present and in the correct format.
While the browse and edit commands open the raw data in a spreadsheet-like window, the list command prints the data in the Results window. Listing a small dataset in full is practical with this command. For bigger datasets, restrict the listing to selected variables or observations. An example is shown as follows:
list country* yr pops
country     countrycode   yr     pops
India       IND           2010   23452.9
U.S.        USA           2010   22222.1
Pakistan    PAK           2010   11111.2
China       CHN           2010   98765
Russia      RUS           2010   19876
Germany     GER           2010   23467
In the preceding command, the star (*) is a wildcard; it instructs Stata to include every variable whose name begins with country. Alternatively, we could list all variables but only a limited number of observations, for example, observations 14 to 19:
The following output shows country, countrycode, yr, and pops for observations 14 to 19, selected with in 14/19:
[Figure: list output for observations 14 to 19]
The preceding example used the in qualifier, which restricts a command to a range of observations. Other ranges can be specified in the same way, for example:
- list in 14/19
- list in 90/l
- list in 30/l
These three commands do the following:
- The first command lists observations from 14 to 19
- The second command lists observations from 90 till the last observation (l, a lowercase L, stands for the last observation)
- The third command lists observations from 30 till the last observation
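The ranges above can be tried on any dataset. The following is a minimal sketch that assumes Stata's bundled auto dataset (74 observations) rather than the country data used in the text, and is meant to be run as a do-file:

```stata
* Sketch only: uses the example dataset shipped with Stata,
* not the country dataset discussed in the text.
sysuse auto, clear

list make price in 14/19     // observations 14 through 19
list make price in 70/l      // observation 70 through the last one (l = last)
list make price in -5/l      // a negative number counts from the end: last five
count in 14/19               // number of observations in the range
```

Because in works with most commands, count in 14/19 is a quick way to confirm how many observations a range covers before listing it.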
The if qualifier is the other way of subsetting data; it restricts a command to the observations for which a logical expression is true. The following example lists the observations for the year 2010, where the variable name is yr:
list if yr == 2010
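As a minimal sketch of the if qualifier, the following do-file types in the six rows from the earlier listing with input and then subsets them. The storage types and the 20000 threshold are assumptions chosen for illustration:

```stata
* Sketch only: rebuilds the small table shown earlier in the text.
clear
input str10 country str3 countrycode int yr double pops
"India"    "IND" 2010 23452.9
"U.S."     "USA" 2010 22222.1
"Pakistan" "PAK" 2010 11111.2
"China"    "CHN" 2010 98765
"Russia"   "RUS" 2010 19876
"Germany"  "GER" 2010 23467
end

list if yr == 2010                              // == tests equality
list country pops if pops > 20000 & yr == 2010  // & combines two conditions
count if pops > 20000                           // how many observations qualify
```

Note that Stata treats a numeric missing value as larger than any number, so a condition such as pops > 20000 would also select observations where pops is missing.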
The browse window is used to examine the raw data. In big datasets, however, it is hard to find the variables of interest among all the columns. In this case, name the variables you want to examine after the browse command:
browse country yr popscon
Note that browse only displays the data, whereas the edit command lets you change the dataset manually. The assert command checks whether a condition holds for every observation. With bigger data (or big data, as it is called in today's world), checking individual values through the browse or edit commands becomes impractical, and the assert command is helpful: it tells you whether a statement about the data is true or false. For example, for the population of the country (popscon), it can confirm that all values are positive:
assert popscon>0
By contrast, the following assertion fails unless every value is negative:
assert popscon<0
If the condition is true for every observation, assert does not give any output. However, if the condition is false for any observation, Stata stops with an "assertion is false" error message.
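In a do-file, a failed assert stops execution. A minimal sketch of checking the result without stopping is shown below; the three popscon values are made up for illustration, and capture is used to trap the error so that the return code _rc can be inspected:

```stata
* Sketch only: hypothetical values for popscon.
clear
input str3 countrycode double popscon
"IND" 23452.9
"USA" 22222.1
"PAK" 11111.2
end

assert popscon > 0            // true for every observation: no output
capture assert popscon < 0    // false: capture traps the error
display _rc                   // nonzero return code signals the failed assertion
```

Leaving assert uncaptured is usually the better choice in a data-cleaning do-file, since stopping on a violated assumption is the point of the check.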
The describe command reports basic information about the dataset and its variables, such as the size of the dataset, the number of observations and variables, and the storage type and display format of each variable. Typed on its own, describe reports on the dataset in memory. With using, it can also be applied to a file that has not been read into Stata. An example is given as follows:
describe using "E:\Ind-Health-sample.dta"
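The codebook command gives detailed information on each variable, such as its type, range, number of unique values, and number of missing values. Without a list of variables, it covers every variable in the dataset; an example for a single variable is codebook country.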
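The difference between describe, describe using, and codebook can be sketched as follows. This assumes the bundled auto dataset and a temporary file (the macro name sample is hypothetical) in place of the E:\ path above, and should be run as a do-file so that the local macro stays defined:

```stata
* Sketch only: a temporary file stands in for a dataset on disk.
sysuse auto, clear
tempfile sample               // temporary file name, removed when the do-file ends
save `sample'
clear                         // memory is now empty

describe using `sample'       // reports on the file without loading it
use `sample', clear           // now read the file into memory
describe                      // reports on the dataset in memory
codebook make                 // detailed information on one variable
```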
The summarize command delivers summary statistics: the number of observations, mean, standard deviation, minimum, and maximum. The following output of summarize has the columns Variable, Obs, Mean, Std. Dev., Min, and Max:
[Figure: summarize output table]
As we can see in the preceding table, string variables such as Cntry and Countrycode contain no numeric values; this is why no summary statistics are available for them. Yr is a numeric variable; therefore, it has summary statistics. For more details, such as percentiles, skewness, and kurtosis, the detail option can be used (summarize, detail).
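A minimal sketch of both forms of summarize, assuming the bundled auto dataset, where make is a string variable and price is numeric:

```stata
* Sketch only: uses the example dataset shipped with Stata.
sysuse auto, clear

summarize make price          // make is a string: zero observations, no statistics
summarize price, detail       // adds percentiles, variance, skewness, kurtosis
display r(p50)                // the median, taken from the stored results
```

After summarize, the statistics remain available as stored results such as r(N), r(mean), and r(sd), which is convenient when a later command needs them.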
Stata's wide range of graphics capabilities is one of its strengths. Help on them is available by typing help graph in Stata. A histogram can be created through the following command:
graph twoway histogram cccgdps
For a scatter plot, use the following command:
graph twoway scatter cccgdps popscon
Stata's advanced graphs have their benefits, but they are slow to draw. When the goal is to inspect the data quickly rather than to produce figures for papers or presentations, it can be better to use the faster version 7 graphics. This can be seen as follows:
graph7 cccgdps popscon
Saving the dataset is done with the save command, as follows:
save "E:\Stata1\t1 less India pwt 80-2010.dta", replace
If a file with the same name already exists, the replace option is needed. It overwrites the previous version with the current data. If the old version is to be kept for some reason, then save the data under a different name. Keep in mind that the contents of the original file are lost once it is overwritten with a revised dataset. Conversely, to discard changes made in memory since the last save and start again, just reopen the saved file.
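The save, revise, and reopen cycle can be sketched as follows. This assumes the bundled auto dataset and a temporary file (the macro name work is hypothetical) instead of the E:\ path above, and should be run as a do-file:

```stata
* Sketch only: a temporary file stands in for the saved dataset.
sysuse auto, clear
tempfile work
save `work'                   // first save: the file does not exist yet

drop if foreign == 1          // revise the data in memory
save `work', replace          // overwrite the file with the revised data

drop if price > 5000          // a further change that is not saved
use `work', clear             // reopen the file: the unsaved change is discarded
count                         // observations as of the last save
```

The clear option on use is needed here because the data in memory have changed since they were last saved.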
There are two ways to protect the data while you experiment with it. One option is to save the current data, revise it, and later, if you don't want to keep the changes, reopen the saved version. Another option is to use the preserve and restore commands; preserve takes a snapshot of the data, and the data will come back to that state after you type restore.
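A minimal sketch of preserve and restore, assuming the bundled auto dataset:

```stata
* Sketch only: uses the example dataset shipped with Stata.
sysuse auto, clear

preserve                      // take a snapshot of the data in memory
keep if foreign == 1          // destructive change: only foreign cars remain
count                         // the reduced dataset
restore                       // bring back the snapshot
count                         // all observations are back
```

When preserve is used inside a do-file, Stata restores the data automatically once the do-file finishes, even if restore was never reached.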