Coding Merges Efficiently in SAS using Proc SQL

SAS is a useful programming language for people who use large datasets. The SAS datastep is good at scoring new/existing data (assuming you’ve found some model you like) and things like the Teradata interface are also quite nice. Yet, the SAS programming language is by its nature extremely bulky. Often times, when I’m coding in SAS I find myself annoyed that steps that take 1 line in other languages (like R), take 20-30 lines.

This post offers one piece of programming code that has helped me reduce the complexity of my SAS programs and as a result make them much more understandable/replicable. As the title of this post suggests, I’m going to illustrate an easier way to merge two datasets together.

To illustrate this code, let’s create two 5-row datasets. The first dataset has variables x1 and x2, the second dataset has variables x1 and x3. The following dataline programs create these datasets.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
data work.table1;
    input x1 x2;
    datalines;
    1 3
    2 12.2
    3 1.1
    5 8.2
    4 0;
run;
 
data work.table2;
    input x1 x3;
    readlines;
    2 10
    4 3
    3 -14
    1 -100
    5 3.14;
run;

Clearly, we want to merge the two datasets on the variable (column) x1 and create a new dataset that has x1, x2, and x3. Typically in SAS you would write the following code to accomplish this task.

1
2
3
4
5
6
7
8
9
10
11
12
proc sort data = work.table1;
    by x1;
run;
 
proc sort data = work.table2;
    by x1;
run;
 
data work.table_merged;
    merge work.table1 work.table2;
    by x1;
run;

So in all you are required to write 10 lines of code (12 if you include adequate spacing) that includes 2 sort procedures and one datastep to merge two datasets together. To be sure, this is not a ridiculous amount of code. Yet, when rereading what I’ve done I find code like this to be overly confusing. As a result, I was open to alternatives that would help me escape some of this unnecessary complexity.

An alternative approach to this would be to use proc sql. SQL is a language that many people already know. Even if you don’t, the language itself is extremely simple and you could learn enough to use the sql procedure in SAS fairly easily. The code below uses proc sql to merge together my two hypothetical databases above. Notice that it does the sorts automatically without me telling SAS to do them. This saves code and makes my program easier to read.

1
2
3
4
5
6
7
proc sql noprint;
    create table work.table_merged as (
    select t1.x1, t1.x2, t2.x3
    from work.table1 as t1
    inner join work.table2 as t2
    on t1.x1=t2.x1)
quit;

Some caveats. First, a join is a merge in sql. Second, an inner join will only keep observations that have a match in both datasets. Be sure to look up figures which diagram what an inner, left, and right join do respectively.

Plot Function for PS Matching Sensitivity Analysis

SocialĀ scientists (as well as other researchers) increasingly rely on matching methods to assess correlation. While the embrace of this methodology represents a useful endeavor, several scholars suggest two methods of improvement. First, scholars should report the balance achieved after a matching procedure has been used. Simply put, it’s possible that the matched sample possesses a larger degree of imbalance between the treatment and control group than the non-matched sample. Second, several authors suggest that reported findings must be robust to many model specifications.

The R function included on this page (ps.robust) easily provides information regarding these concerns. This tool utilizes Sekhon (2011)’s Matching package for R. However, instead of producing one matched result, the function plots as many estimates as the user desires. This function generates many matched samples by sampling from the user provided predictors of the treatment assignment. In other words, if the user provides 20 variables that can be used in the propensity model, this function randomly samples those 20 variables as many time as the user desires. At the end, the function plots these estimates.

How is balance addressed? This function evaluates the matched data and determines how many variables possess a statistically significant difference between the treatment and control group. The balance metric represents the percent of the variables in the sample that do not posses a statistically significant difference (t-test). To be sure, other balance metrics can and should be used. At this time however, the function only includes the t-test results.

Please keep in mind that a lack of statistical significance does not imply that there is no difference between the treatment and control groups for some covariate. Ā It’s best to consider this graph in the context of the specific application you have decided to use (i.e. choice of matching variables, matching method, etc).

Zip File with R Function