SAS and R perform Merge differently

Question:

What is the difference between the way R and SAS merge data sets?
On the same data, SAS's MERGE statement returns 205546 rows, while R's merge() returns 207208 rows.
Here is an example.

I'm working with the IBGE file available at:
ftp://ftp.ibge.gov.br/PNS/2013/microdados/pns_2013_microdados.zip

I use the DOMPNS2013.txt and PESPNS2013.txt files.

SAS:
1) Assignment of variables: execute the files "input DOMPNS2013" and "input PESPNS2013"
2) Selection of the value of interest and merge:

data dompns2013v3;  
set dompns2013;  
if V0015 = 1;  
run;  
/*NOTE: There were 81187 observations read from the data set WORK.DOMPNS2013.
NOTE: The data set WORK.DOMPNS2013V2 has 64348 observations and 20 variables.*/  

data arq.dompes2013v3;  
merge dompns2013v3 pespns2013;   
by v0001 v0024 upa_pns v0006;  
run;  
/*NOTE: There were 64348 observations read from the data set WORK.DOMPNS2013V2.
NOTE: There were 205546 observations read from the data set WORK.PESPNS2013.
NOTE: The data set ARQ.DOMPES2013V2 has 205546 observations and 388 variables.
NOTE: DATA statement used (Total process time):*/  

R:
1) Assignment of variables:

d2013 = read.fwf(file='DOMPNS2013.txt',widths=c(2,8,7,4,2,6,1,1))  

names(d2013) = c("v0001","v0024","upa_pns","v0006","v0015","skip1","v0026","v0031")  

d2013 = subset(d2013,select=c("v0001","v0024","upa_pns","v0006","v0015","v0026","v0031"))  

p2013 = read.fwf(file='PESPNS2013.txt',widths=c(2,8,7,4,1,2,2,2,1,8,3))  

names(p2013)=c("v0001","v0024","upa_pns","v0006","v0025","skip1","c00301","c004","c006","skip2","c008")  

p2013=subset(p2013,select=c("v0001","v0024","upa_pns","v0006","v0025","c00301","c004","c006","c008"))  

2) Selection of the value of interest and merge:

dim(d2013)  
[1] 81187     7  

d2013 = subset(d2013, d2013$v0015 == 1)  
dim(d2013)  
[1] 64348     7  

dim(p2013)  
[1] 205546      9  

dpmerge = merge( p2013,d2013,by=c("v0001","v0024","upa_pns","v0006"))  
dim(dpmerge)  
[1] 207208     12  

Answer:

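SAS is effectively dropping the duplicate records from DOMPNS when it merges: within each BY group, the MERGE statement pairs rows one by one instead of forming every combination. R's merge() returns every matching pair, so each duplicate household record adds extra rows.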
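
To see how duplicate household records inflate the row count in R, here is a minimal sketch on toy data. The data frames `dom` and `pes` and the single key column `id` are made up for illustration; they are not the PNS files.

```r
# Toy data (hypothetical): household "B" appears twice as an exact duplicate.
dom <- data.frame(id = c("A", "B", "B"), v0015 = c(1, 1, 1))
pes <- data.frame(id = c("A", "A", "B"), person = c(1, 2, 1))

# merge() returns every matching pair, so the single person in
# household "B" is matched to both copies of that household.
m <- merge(pes, dom, by = "id")

nrow(pes)  # 3
nrow(m)    # 4
```

The merged result has one more row than the person table, the same kind of surplus as 207208 versus 205546 in the question.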

If you run d2013 <- unique(d2013) before merging in R, the number of observations will be the same.
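
A short sketch of how one might check for duplicates and apply the fix. The small data frames `d` and `p` are hypothetical stand-ins for d2013 and p2013, built only so the example runs on its own; the key column names are the ones used in the question.

```r
keys <- c("v0001", "v0024", "upa_pns", "v0006")

# Hypothetical stand-ins for d2013 and p2013; rows 1 and 2 of d are identical.
d <- data.frame(v0001 = c(11, 11, 12), v0024 = c(1, 1, 2),
                upa_pns = c(5, 5, 6), v0006 = c(1, 1, 1),
                v0015 = c(1, 1, 1))
p <- data.frame(v0001 = c(11, 11, 12), v0024 = c(1, 1, 2),
                upa_pns = c(5, 5, 6), v0006 = c(1, 1, 1),
                c006 = c(1, 2, 1))

# Count household rows whose key repeats an earlier row.
sum(duplicated(d[, keys]))   # 1

# Drop exact duplicate rows, then merge as in the question.
d_unique <- unique(d)
m <- merge(p, d_unique, by = keys)
nrow(m) == nrow(p)           # TRUE
```

Note that unique() only removes rows that are identical in every column. If two household rows shared the same key but differed elsewhere, something like d[!duplicated(d[, keys]), ] would be needed to keep one row per key.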
