SAS is a useful programming language for people who use large datasets. The SAS datastep is good at scoring new/existing data (assuming you’ve found some model you like) and things like the Teradata interface are also quite nice. Yet, the SAS programming language is by its nature extremely bulky. Often times, when I’m coding in SAS I find myself annoyed that steps that take 1 line in other languages (like R), take 20-30 lines.
This post offers one piece of programming code that has helped me reduce the complexity of my SAS programs and as a result make them much more understandable/replicable. As the title of this post suggests, I’m going to illustrate an easier way to merge two datasets together.
To illustrate this code, let’s create two 5-row datasets. The first dataset has variables x1 and x2, the second dataset has variables x1 and x3. The following dataline programs create these datasets.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
data work.table1; input x1 x2; datalines; 1 3 2 12.2 3 1.1 5 8.2 4 0; run; data work.table2; input x1 x3; readlines; 2 10 4 3 3 -14 1 -100 5 3.14; run;
Clearly, we want to merge the two datasets on the variable (column) x1 and create a new dataset that has x1, x2, and x3. Typically in SAS you would write the following code to accomplish this task.
1 2 3 4 5 6 7 8 9 10 11 12
proc sort data = work.table1; by x1; run; proc sort data = work.table2; by x1; run; data work.table_merged; merge work.table1 work.table2; by x1; run;
So in all you are required to write 10 lines of code (12 if you include adequate spacing) that includes 2 sort procedures and one datastep to merge two datasets together. To be sure, this is not a ridiculous amount of code. Yet, when rereading what I’ve done I find code like this to be overly confusing. As a result, I was open to alternatives that would help me escape some of this unnecessary complexity.
An alternative approach to this would be to use proc sql. SQL is a language that many people already know. Even if you don’t, the language itself is extremely simple and you could learn enough to use the sql procedure in SAS fairly easily. The code below uses proc sql to merge together my two hypothetical databases above. Notice that it does the sorts automatically without me telling SAS to do them. This saves code and makes my program easier to read.
1 2 3 4 5 6 7
proc sql noprint; create table work.table_merged as ( select t1.x1, t1.x2, t2.x3 from work.table1 as t1 inner join work.table2 as t2 on t1.x1=t2.x1) quit;
Some caveats. First, a join is a merge in sql. Second, an inner join will only keep observations that have a match in both datasets. Be sure to look up figures which diagram what an inner, left, and right join do respectively.