/*--------------------------------------------------------------------------- © Copyright 2011-2013 University of Manitoba This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . ----------------------------------------------------------------------------- Referential Integrity Check Macro The following program is used to test data tables that were originally derived from a relational data base or two database tables where you expect there to be referential integrity. Referential integrity means that there is a matching key between two databases. One database (primary) contains a single record for each key variable - client information and variable labels (formats) are examples of primary tables. The other database (foreign) may contain any number of records for each key variable. Primary Key should contain only unique values - no missing values are allowed. Foreign key may contain any number of values but all existing values must be in the primary table. If tables are built from a RDBM system that has enforced referential integrity contstraints then the there should always be a match in the foreign table for every primary table key. If this is not the case then there was an error in the install, documentation, original database (it may originally have been a sloppy model), some data is missing such as a secondary linkage table, or the key is more complex (requries another variable for example). This macro may be useful: - Installing or rebuilding tables provided from an RDBM source - Installing or checking tables that come from different souces but you expect a primary/foreign key relationship. e.g. md mumbers in Physician master file compared to physician billing, MIMS, Hospital abstracts, etc... PHIN attached to clients in education, housing, etc... after linkage - Checking linkage quality between multiple sources - When you are going to be automatically building label formats from datasets (before proc format with cntlin option). This macro check for the following criteria. 1. Primary Key is checked for any duplicate or missing values. 2. Values in the foreign table are matched to the primary table. Orphan values (those in the foreign table but not in the primary table) are identified. A warning is written to the log and a printed table created if either condition is found (failure of referential integrity). Users should note that there are times when referential integrity is not expected - from RDBM systems this should be rare. In other systems or comparison of values between systems this may be fairly common as there was no way to inforce matching values between the systems. It is possible to run this macro on two (or more) variables in the same table - where you expect one variable in the table to have a key relationship to another variable. An example would be MDNO to BASENO in a single physican master table. Once datasets are installed you can enforce referential integrity constraints within SAS. I originally wrote this macro when we started to look at installing data provided from RDBM sources in SAS while maintaining the original table structure (not reducing the data to a datamart or data warehouse type model). SAS had lots of documentation on setting integrity constraints either at the dataset level or within SQL but there didn't seem to be any convienent way to check (I guess setting the constraints would do this). Usage: ------ The macro can be called as a statement style macro. If there are multiple keys then they must be quoted in the statement ref_int_check primary = Primary Dataset containing the primary key. This data set should contain only one record/key value. Primary key must not contain any missing values foreign = Foreign dataset. This dataset may contain multiple values of the key variable. All key values in the foreign data must appear in the primary dataset. key = Key variable or variables. In statement style call this must be quoted. f_key = OPTIONAL foreign key variable(s). This option is not required if the key variables have the same name on both datasets. The order of the key variables must be the same in key and f_key. debug OPTIONSAL. If the word debug or debug=1 or debug=YES is passed then mprint and notes are turned on. Otherwise notes and mprint are turned off. odsout = location and name to save the output (output must end with an extension of .rtf). If blank, then output will be shown in sas output windows only. ------------------------------------------------------------------------------------- This macro has been modified to exclude/include the following: 1. Exclude the original macro parameter print= 2. This macro can now run for a group of foreign tables given that all foreign tables have the same foreign keys. 3. Allow to filter each datasets from within macro call. 4. Duplicate or missing values of primary keys and orphan values of foreign keys if found is summarized in table form 5. Adding option to output the summarize table in RTF format. Say Hong Please direct questions and comments to info@cpe.umanitoba.ca Date : August 12, 2013. ------------------------------------------------------------------------------------- ***/ %macro dq_ref_int_check(primary= , foreign= , key=, f_key= , debug=0, odsout=); %put NOTE: Referential Integrity Check v0.1 ; %put NOTE: Charles Burchill ; %put ; **** Get and print size of input datasets ; %if %index(&primary, %str(%()) ^= 0 %then %do; %if %qsubstr(&primary, 1, 1) ^= %str(%") & %qsubstr(&primary, 1, 1) ^= %str(%') %then %do; %put WARNING: Primary dataset must be quoted when where options is used; %put WARNING: This macro stop without executing; %return; %end; %let primary = %substr(&primary, 2, %length(&primary) - 2); proc sql noprint; select count(*) into :p_num from &primary; quit; %end; %else %do; %if %qsubstr(&primary, 1, 1) = %str(%") or %qsubstr(&primary, 1, 1) = %str(%') %then %let primary = %substr(&primary, 2, %length(&primary) - 2); %let dsid=%sysfunc(open(&primary)); %let p_num=%sysfunc(attrn(&dsid,nlobs)); %let rc=%sysfunc(close(&dsid)); %end; %put NOTE: There are &p_num observations in &primary.; /**/ **** Loop through all foreign datasets get size; %do i = 1 %to %sysfunc(countw(&foreign, %str( ), q)); %let _foreign&i = %scan(&foreign, &i, %str( ), q); %if %index(&&_foreign&i, %str(%()) ^= 0 %then %do; %let foreign&i = %substr(&&_foreign&i, 2, %length(&&_foreign&i) - 2); proc sql noprint; select count(*) into :f_num&i from &&foreign&i; quit; %end; %else %do; %if %qsubstr(&&_foreign&i, 1, 1) = %str(%") or %qsubstr(&&_foreign&i, 1, 1) = %str(%') %then %let foreign&i = %substr(&&_foreign&i, 2, %length(&&_foreign&i) - 2); %else %let foreign&i = &&_foreign&i; %let dsid = %sysfunc(open(&&foreign&i)); %let f_num&i = %sysfunc(attrn(&dsid, nlobs)); %let rc = %sysfunc(close(&dsid)); %end; %end; /**/ * Strip quotes from key variable list ;; %if &key ^= %str() %then %do ; %let qte = %qsubstr(&key,1,1) ; %if &qte = %str(%') | &qte = %str(%") %then %do ; %let rq=%length(&key) ; %let key=%substr(&key,2,%eval(&rq-2)); %end ; %end; * Strip quotes from key variable list ;; %if &f_key ^= %str() %then %do ; %let qte = %qsubstr(&f_key,1,1) ; %if &qte = %str(%') | &qte = %str(%") %then %do ; %let rq=%length(&f_key) ; %let f_key=%substr(&f_key,2,%eval(&rq-2)); %end ; %end; *** Add error checking here. Check if f_key is missing and re-set to key ; %if &f_key=%str() %then %let f_key=&key ; %do i = 1 %to %sysfunc(countw(&foreign, %str( ), q)); %put NOTE: There are &&f_num&i observations in &&foreign&i.; %end; /*%put NOTE: Primary Key %upcase(&key) in %upcase(&primary) ; %put NOTE: Foreign Key %upcase(&f_key) in %upcase(&foreign) ;*/ %put ; *** capture options here into &options ; %if &debug=1 or %upcase(&debug)=YES %then options notes mprint ; %else options nonotes nomprint ; ; *** Create a macro variable for frequency tables if multiple key values are provided ; %let key_freq=%scan(&key,1) ; %let f_key_freq=%scan(&f_key,1) ; %let n=2 ; %do %while (%scan(&key,&n)^=%str()) ; %let key_freq = &key_freq * %scan(&key,&n) ; %let n = %eval(&n+1) ; %end ; %let n=2 ; %do %while (%scan(&f_key,&n)^=%str()) ; %let f_key_freq = &f_key_freq * %scan(&f_key,&n) ; %let n = %eval(&n+1) ; %end ; **** Create comma list for cmiss function - I know there is a function that returns a comma list but I can not remember what it is ; %let key_cmiss=%scan(&key,1) ; %let f_key_cmiss=%scan(&f_key,1) ; %let n=2 ; %do %while (%scan(&key,&n)^=%str()) ; %let key_cmiss = &key_cmiss %str(,) %scan(&key,&n) ; %let n = %eval(&n+1) ; %end ; %let n=2 ; %do %while (%scan(&f_key,&n)^=%str()) ; %let f_key_cmiss = &f_key_cmiss %str(,) %scan(&f_key,&n) ; %let n = %eval(&n+1) ; %end ; **** create rename for freq ; %let key_rename = %scan(&f_key,1) = %scan(&key,1) ; %let n= 2; %do %while (%scan(&f_key,&n)^=%str()) ; %let key_rename = &key_rename %scan(&f_key,&n) = %scan(&key,&n) ; %let n = %eval(&n+1) ; %end ; *** add code to allow multiple key variables here ; *** code will need to be added to rename all key variables to a constant name - if we could assume that key variables were always the same in both datasets this would not be necessary ; /* too many key variables will crash proc freq use proc summary instead*/ /*proc freq data=&primary ; tables &key_freq / out=_primary noprint missing ; run;*/ proc summary data=&primary nway missing; class &key; output out=_primary(rename=(_freq_=Count) drop=_type_); run; data _all ; set _primary(obs=1) ; where count >= 2 or cmiss(&key_cmiss) >0 ; run; %let dsid=%sysfunc(open(_all)); %let p_failed=%sysfunc(attrn(&dsid,nlobs)); %let rc=%sysfunc(close(&dsid)); *** might want to bail if the primary key failed ; title ; *** modify the rename in the next line for a rename for each value of key ; %do i = 1 %to %sysfunc(countw(&foreign, %str( ), q)); /* too many key variables crash proc freq use proc summary instead */ /*proc freq data=&&foreign&i ; tables &f_key_freq / out=_foreign&i(rename=(&key_rename )) noprint ; run;*/ proc summary data=&&foreign&i nway missing; class &f_key; output out=_foreign&i(rename=(_freq_=Count &key_rename) drop=_type_); run; data _all&i ; merge _primary(in=primaryin keep=&key) _foreign&i(in=foreignin rename=(count=F_count /*percent=F_Percent*/)) ; by &key ; *** put in original &key ; Primary = primaryin ; Foreign = Foreignin ; retain primary_n &p_num foreign_n &&f_num&i ; if f_count < 6 then do ; f_count=. ; /*f_percent=.*/ ; end ; run; *** split the foreign datasets name and the where option; %let f_pos&i = %index(&&foreign&i, %str(%()); %if &&f_pos&i ^= 0 %then %do; %if %index(&&foreign&i, .) ^= 0 %then %do; %let f_dsn&i = %upcase(%scan(%substr(&&foreign&i, 1, %index(&&foreign&i, %str(%())-1), 2, .)); %end; %else %let f_dsn&i = %upcase(%substr(&&foreign&i, 1, %index(&&foreign&i, %str(%())-1)); %let f_filter&i = %sysfunc(compbl(%upcase(%substr(&&foreign&i, %index(&&foreign&i, %str(%())+1, %length(&&foreign&i)-%index(&&foreign&i, %str(%())-1 )))); %end; %else %do; %if %index(&&foreign&i, .) ^= 0 %then %let f_dsn&i = %upcase(%scan(&&foreign&i, 2, .)); %else %let f_dsn&i = %upcase(&&foreign&i); %let f_filter&i = %str( ); %end; proc sql; create table orphan&i as select "&&f_dsn&i" as Foreign length=32 label = 'Foreign Table', %if &&f_pos&i ^= 0 %then %do; "%bquote(&&f_filter&i)" as filter length=250 label = 'Filter Condition', %end; count(*) as orphan label = 'Orphan Values' format=15., &&f_num&i as records label = 'Total Records' format=15. from _all&i where ^primary & Foreign; quit; %end; data orphan; set orphan1-orphan%sysfunc(countw(&foreign, %str( ), q)); run; /**/ *** split the primary dataset name and where option; %let p_pos = %index(&primary, %str(%()); %if &p_pos ^= 0 %then %do; %if %index(&primary, .) ^= 0 %then %let dsn = %upcase(%scan(%substr(&primary, 1, %index(&primary, %str(%())-1 ), 2, .)); %else %let dsn = %upcase(%substr(&primary, 1, %index(&primary, %str(%())-1 )); %let p_filter = %sysfunc(compbl(%upcase(%substr(&primary, %index(&primary, %str(%())+1, %length(&primary)-%index(&primary, %str(%())-1 )))); %end; %else %do; %if %index(&primary, .) ^= 0 %then %let dsn = %upcase(%scan(&primary, 2, .)); %else %let dsn = %upcase(&primary); %let p_filter = %str( ); %end; %if &p_failed ^= 0 %then %do; *** generate summarize dataset for output; proc freq data=_primary(rename=(count=cnt)) noprint; where cnt >= 2 and not cmiss(&key_cmiss); tables cnt/out=cntdup; run; data _null_; set _primary; if cmiss(&key_cmiss) then call symputx('miss', count); run; data cntdup; set cntdup; if _n_ = 1 then do; primary = "&dsn"; %if &p_pos ^= 0 %then %do; filter = "&p_filter"; %end; %if %symexist(miss) = 1 %then missing = symget('miss')%str(;); %else missing = '0'%str(;); records = "&p_num"; end; duplicate = catx(' ', count, cats('(x', cnt, ')')); run; %end; %else %do; data cntdup; primary = "&dsn"; %if &p_pos ^= 0 %then %do; filter = "&p_filter"; %end; missing = '0'; records = "&p_num"; duplicate = 0; output; run; %end; %if &odsout ^= %then %do; %let qt = %qsubstr(&odsout,1,1); %if &qt ^= %str(%') & &qt ^= %str(%") %then %let odsout = "&odsout"; *** output to RTF format; options nodate nonumber orientation=landscape leftmargin=0.5in rightmargin=0.5in topmargin=0.25in bottommargin=0.25in; ods rtf file=&odsout bodytitle startpage=no; ODS ESCAPECHAR='^'; footnote; title "Primary Key: &key"; proc print data=cntdup noobs label; var primary/style=[just=c cellwidth=2.5in]; %if &p_pos ^= 0 %then %do; var filter/style=[just=c]; label filter='Filter Condition'; %end; var duplicate missing/style=[just=c]; var records/style=[just=c cellwidth=1.2in]; label primary='Primary Table' duplicate = 'Duplicate' missing = 'Missing' records = 'Total Records'; run; title "Foreign Key: &f_key"; proc print data=orphan noobs label; var foreign; %if &f_pos1 ^= 0 %then %str(var filter;); var orphan records/style=[just=c]; run; ods rtf close; options date number; %end; %else %do; title "Primary Key: &key"; proc print data=cntdup noobs label; var primary; %if &p_pos ^= 0 %then %do; var filter; label filter='Filter Condition'; %end; var duplicate missing records; label primary='Primary Table' duplicate = 'Duplicate' missing = 'Missing' records = 'Total Records'; run; title "Foreign Key: &f_key"; proc print data=orphan noobs label; var foreign; %if &f_pos1 ^= 0 %then %str(var filter;); var orphan records; run; %end; /**/ %do i = 1 %to %sysfunc(countw(&foreign, %str( ), q)); data _all&i ; set _all&i(obs=1) ; where ^primary & Foreign ; run; %let dsid=%sysfunc(open(_all&i)); %let f_failed&i=%sysfunc(attrn(&dsid,nlobs)); %let rc=%sysfunc(close(&dsid)); %end; title ; *** Write note to log if primary key data has failed ; *** Write note to log if there is a missing primary key for a foreign key ; *** put error checking code here and reset options ; proc datasets library=work nolist ; delete _all _primary cntdup orphan %do i = 1 %to %sysfunc(countw(&foreign, %str( ), q)); orphan&i _all&i _foreign&i %end; ; quit ; *** reset options here ; %if &debug=1 or %upcase(&debug)=YES %then options notes nomprint ; %else options notes nomprint ; ; %if &p_failed=0 %then %put NOTE: Primary Key (%upcase(&key)) passed in %upcase(&primary) ; %else %put WARNING: Primary Key (%upcase(&key)) failed in %upcase(&primary) ; ; %do i = 1 %to %sysfunc(countw(&foreign, %str( ), q)); %if &&f_failed&i ^=0 %then %put WARNING: Orphan Foreign Key (%upcase(&f_key)) values exist in %upcase(&&foreign&i) ; ; %end; %put ; %put NOTE: Finished Referential Integrity Check ; %mend ; /*** * Example of running reference integrity check for a group of foreign tables with common foreign key variables; %dq_ref_int_check(primary=health.Wrha_edis_client_2007jan , foreign=health.Wrha_edis_status_2007jan health.Wrha_edis_provider_2007jan health.Wrha_edis_nacrs_2007jan health.Wrha_edis_consults_2007jan health.Wrha_edis_plan_2007jan health.Wrha_edis_location_2007jan, key=CLIENT_VISIT_GUID, debug=1, odsout='T:\Sayh\test\rtftest1.rtf') ; * Example of filtering date variable from within macro call; data med(index=(md mdyear)) ; set health.mhmed_1997apr(where=(acqdt>'01jan2010'd) keep=acqdt md servdt acqdt); drop servdt ; if servdt< '30sep2010'd then mdyear='201009' ; else if servdt<'31dec2010'd then mdyear='201012' ; else if servdt<'31mar2011'd then mdyear='201103' ; else mdyear='201106' ; run; %dq_ref_int_check(primary="health.mhpmf_1998sep(where=(acqdt>'01jan2011'd))", foreign = "med(where=(acqdt>'01jan2011'd))", key=mdno mdyear, f_key=md mdyear, debug=1, odsout='T:\Sayh\test\rtftest2.rtf') * Example of filtering datetime variable from within macro call; %dq_ref_int_check(primary="health.wrha_edis_client_2007jan(WHERE=(discharged_dttm > '31dec2010:0:0'dt))", foreign="health.wrha_edis_status_2007jan(WHERE=(status_end_dttm > '31dec2010:0:0'dt))" "health.wrha_edis_provider_2007jan(where=(provider_end_dttm > '31dec2010:0:0'dt))", key=CLIENT_VISIT_GUID, debug=1, odsout="T:\Sayh\test\rtftest3.rtf") ***/