This paper discusses methods of joining SAS data sets. The different methods and the reasons for choosing a particular method of joining are contrasted and compared. Potential problems and limitations when joining data sets are also discussed. The need to combine data sets constantly arises during software development, as does the need to. Multiple SAS data sets can be merged based on a specific common variable to give a single data set. This is done using the MERGE statement and BY statement. The total number of observations in the merged data set is often less than the sum of the number of observations in the original data sets.
|Previous Page||Next Page|
|Merging SAS Data Sets|
|Merging with a BY Statement|
Merging with a BY statement enables youto match observations according to the values of the BY variables that youspecify. Before you can perform a match-merge, all data sets must be sortedby the variables that you want to use for the merge.
In order to understand match-merging, you must understandthree key concepts:
is a variable named in a BY statement.
is the value of a BY variable.
is the set of all observations withthe same value for the BY variable (if there is only one BY variable). Ifyou use more than one variable in a BY statement, then a BY group is the setof observations with a unique combination of values for those variables. Indiscussions of match-merging, BY groups commonly span more than one data set.
|Input SAS Data Set for Examples|
For example, the director of a small repertory theatercompany, the Little Theater, maintains company records in two SAS data sets,COMPANY and FINANCE.
|IdNumber||player's employee ID number|
|Salary||player's annual salary|
The following program creates,sorts, and displays COMPANYand FINANCE:
The following output displays the data sets. Notice that the FINANCEdata set does not contain an observation for Michael Morrison.
To avoid having to maintain two separate data sets,the director wants to merge the records for each player from both data setsinto a new data set that contains all the variables. The variable that iscommon to both data sets is Name. Therefore, Name is the appropriate BY variable.
The data sets are already sorted by NAME, so no furthersorting is required. The following program merges them by NAME:
The following output displays the merged data set:
The new data set contains one observation for each playerin the company. Each observation contains all the variables from both datasets. Notice in particular the fourth observation. The data set FINANCE doesnot have an observation for Michael Morrison. In this case, the values ofthe variables that are unique to FINANCE (IdNumber and Salary) are missing.
|Match-Merging Data Sets with Multiple Observations in a BY Group|
Input SAS Data Set for Examples
The Little Theaterhas a third data set, REPERTORY, that tracks the casting assignments in eachof the season's plays. REPERTORY contains these variables:
is the name of one of the plays inthe repertory.
is the name of a character in Play.
is the employee ID number of theplayer playing Role.
The following program creates and displays REPERTORY:
The following output displays the REPERTORY data set:
To maintain confidentiality during preliminary casting,this data set identifies players by employee ID number. However, casting decisionsare now final, and the manager wants to replace each employee ID number withthe player's name. Of course, it is possible to re-create the data set, enteringeach player's name instead of the employee ID number in the raw data. However,it is more efficient to make use of the data set FINANCE, which already containsthe name and employee ID number of all players (see The COMPANY and FINANCE Data Sets). When the data sets aremerged, SAS takes care of adding the players' names to the data set.
Of course, before you can merge the data sets, you mustsort them by IdNumber.
The following output displays the FINANCE and REPERTORY datasets, sorted by IdNumber:
These two data sets contain seven BY groups; that is,among the 23 observations are seven different values for the BY variable,IdNumber. The first BY group has a value of 029-46-9261 for IdNumber. FINANCEhas one observation in this BY group; REPERTORY has two. The last BY grouphas a value of 929-75-0218 for IdNumber. FINANCE has one observation in thisBY group; REPERTORY has three.
The following program merges the data sets FINANCE andREPERTORY and illustrates what happens when a BY group in one data set hasmore observations in it than the same BY group in the other data set.
The resulting data set contains all variables from bothdata sets.
Note: The OPTIONS statement extends theline size to 120 so that PROC PRINT can display all variables on one line.Most output in this section is created with line size set to 76 in the OPTIONSstatement. An OPTIONS statement appears only in examples using a differentline size. When you set the LINESIZE= option, it remains in effect untilyou reset it or end the SAS session.
The following output displays the merged data set:
Carefully examine the first few observations in thenew data set and consider how SAS creates them.
Before executing the DATA step, SAS reads thedescriptor portion of the two data sets and creates a program data vectorthat contains all variables from both data sets:
IdNumber, Name, and Salary from FINANCE
Play and Role fromREPERTORY.
Program Data Vector before Reading from Data Sets
SAS looks at the first BY group in each data setto determine which BY group should appear first. In this case, the first BYgroup, observations with the value 029-46-9261 for IdNumber, is the same inboth data sets.
SAS reads and copies the first observation fromFINANCE into the program data vector, as the next figure illustrates.
Program Data Vector after Reading FINANCE Data Set
SAS reads and copies the first observation fromREPERTORY into the program data vector, as the next figure illustrates. Ifa data set does not have any observations in a BY group, then the programdata vector contains missing values for the variables that are unique to thatdata set.
Program Data Vector after Reading REPERTORY Data Set
SAS writes the observation to the new data setand retains the values in the program data vector. (If the program data vectorcontained variables created by the DATA step, then SAS would set them to missingafter writing to the new data set.)
SAS looks for a second observation in the BY groupin each data set. REPERTORY has one; FINANCE does not. The MERGE statementreads the second observation in the BY group from REPERTORY. Because FINANCEhas only one observation in the BY group, the statement uses the values ofName (Rudelich , Herbert) and Salary (35000) retained in the programdata vector for the second observation in the new data set. The next figureillustrates this behavior.
Program Data Vector with Second Observation in the BY Group
SAS writes the observation to the new data set.Neither data set contains any more observations in this BY group. Therefore,as the final figure illustrates, SAS sets all values in the program data vectorto missing and begins processing the next BY group. It continues processingobservations until it exhausts all observations in both data sets.
Program Data Vector before New BY Groups
|Match-Merging Data Sets with Dropped Variables|
Now that casting decisions are final,the director wants to post the casting list, but does not want to includesalary or employee ID information. As the next program illustrates, Salaryand IdNumber can be eliminated by using the DROP= data set option when creatingthe new data set.
Sas Data Set Example
Note: The difference in placement of thetwo DROP= data set options is crucial. Dropping IdNumber in the DATA statementmeans that the variable is available to the MERGE and BY statements (to whichit is essential) but that it does not go into the new data set. Dropping Salaryin the MERGE statement means that the MERGE statement does not even read thisvariable, so Salary is unavailable to the program statements. Because thevariable Salary is not needed for processing, it is more efficient to preventit from being read into the PDV in the first place.The followingoutput displays the merged data set without the IdNumber and Salary variables:
|Match-Merging Data Sets with the Same Variables|
You canmatch-merge data sets that contain the same variables(variables with the same name) by using the RENAME= data set option, justas you would when performing a one-to-one merge (see Performing a One-to-One Merge on Data Sets with the Same Variables).
If you do not use the RENAME= option and avariableexists in more than one data set, then the value of that variable in the lastdata set read is the value that goes into the new data set.
|Match-Merging Data Sets That Lack a Common Variable|
You can name any number of datasets in the MERGE statement. However, if you are match-merging the data sets,then you must be sure they all have a common variable and are sorted by thatvariable. If the data sets do not have a common variable, then you mightbe able to use another data set that has variables common to the originaldata sets to merge them.
For instance, consider the data sets that are used inthe match-merge examples. The table that follows shows the names of the datasets and the names of the variables in each data set.
|COMPANY||Name, Age, Gender|
|FINANCE||Name, IdNumber, Salary|
|REPERTORY||Play, Role, IdNumber|
These data sets donot share a common variable. However,COMPANY and FINANCE share the variable Name. Similarly, FINANCE and REPERTORYshare the variable IdNumber. Therefore, as the next program shows, you canmerge the data sets into one with two separate DATA steps. As usual, you mustsort the data sets by the appropriate BY variable. (REPERTORY is already sortedby IdNumber.)
In order to merge the three data sets, this program
sorts FINANCE and COMPANY byName
merges COMPANY and FINANCE into a temporary dataset, TEMP
sorts TEMP by IdNumber
merges TEMP and REPERTORY by IdNumber.
|Previous Page||Next Page||Top of Page|
|Previous Page||Next Page|
|Reading, Combining, and Modifying SAS Data Sets|
|What You Need to Know before Combining Information Stored In Multiple SAS Data Sets|
Many applications require input data to be in aspecificformat before the data can be processed to produce meaningful results. Thedata typically comes from multiple sources and might be in different formats.Therefore, you often, if not always, have to take intermediate steps to logicallyrelate and process data before you can analyze it or create reports from it.
Application requirements vary, but there are commonfactors for all applications that access, combine, and process data. Onceyou have determined what you want the output to look like, you must
determine how the input data is related
ensurethat the data is properly sorted or indexed,if necessary
select the appropriate access method to processthe input data
select the appropriate SAS tools to complete thetask
Sas Join Datasets
|The Four Ways That Data Can Be Related|
Relationshipsamong multiple sources of input data exist when each of the sources containscommon data, either at the physical or logical level. For example, employeedata and department data could be related through an employee ID variablethat shares common values. Another data set could contain numeric sequencenumbers whose partial values logically relate it to a separate data set byobservation number.
You must be able to identify the existing relationshipsin your data. This knowledge is crucial for understanding how to process inputdata in order to produce desired results. All related data fall into one ofthese four categories, characterized by how observations relate among thedata sets:
Toobtain the results you want, you should understandhow each of these methods combines observations, how each method treats duplicatevalues of common variables, and how each method treats missing values or nonmatchedvalues of common variables. Some of the methods also require that you preprocessyour data sets by sorting them or by creating indexes. See the descriptionof each method in Combining SAS Data Sets: Methods.
In a one-to-one relationship, typically a single observationin one data set is related to a single observation from another based on thevalues of one or more selected variables. A one-to-one relationship impliesthat each value of the selected variable occurs no more than once in eachdata set. When you work with multiple selected variables, this relationshipimplies that each combination of values occurs no more than once in each dataset.
In the following example, observations in data setsSALARY and TAXES are related by common values for EmployeeNumber.
One-to-Many and Many-to-One
A one-to-many or many-to-one relationship between input datasets implies that one data set has at most one observation with a specificvalue of the selected variable, but the other input data set can have morethan one occurrence of each value. When you work with multiple selected variables,this relationship implies that each combination of values occurs no more thanonce in one data set, but can occur more than once in the other data set.The order in which the input data sets are processed determines whether therelationship is one-to-many or many-to-one.
In the following example, observations in data setsONE and TWO are related by common values for variable A. Values of A are uniquein data set ONE but not in data set TWO.
In the following example, observations in data setsONE, TWO, and THREE are related by common values for variable ID. Values ofID are unique in data sets ONE and THREE but not in TWO. For values 2 and3 of ID, a one-to-many relationship exists between observations in data setsONE and TWO, and a many-to-one relationship exists between observations indata sets TWO and THREE.
One-to-Many and Many-to-One Relationships
The many-to-many category implies that multiple observationsfrom each input data set can be related based on values of one or more commonvariables.
In the following example, observations in data setsBREAKDOWN and MAINTENANCE are related by common values for variable Vehicle.Values of Vehicle are not unique in either data set. A many-to-many relationshipexists between observations in these data sets for values AAA and CCC of Vehicle.
|Access Methods: Sequential versus Direct|
Once you have established data relationships, the nextstep is to determine the best mode of data access to relate the data. Youcan access observations sequentially in the order in which they appear inthe physical file. Or you can access them directly, that is, you can go straightto an observation in a SAS data set without having to process each observationthat precedes it.
The simplest and perhaps most common way to process data witha DATA step is to read observations in a data set sequentially. You can readobservations sequentially using the SET, MERGE, UPDATE, or MODIFY statements.You can also use the SAS File I/O functions, such as OPEN, FETCH, and FETCHOBS.
Direct access allows a program to access specific observationsbased on one of two methods:
by an observation number
by the value of one or more variables througha simple or composite index.
To access observations directly by their observationnumber, use the POINT= option with the SET or MODIFY statement. The POINT=option names a variable whose current value determines which observation aSET or MODIFY statement reads.
Toaccess observations directly based on the values of one or more specifiedvariables, you must first create an index for the variables and then readthe data set using the KEY= statement option with the SET or MODIFY statement.An index is a separate structure that contains the data valuesof the key variable or variables, paired with a location identifier for theobservations containing the value.
Note: You can alsouse the SAS File I/O functions such as CUROBS, NOTE, POINT, and FETCHOBS toaccess observations by observation number.
|Overview of Methods for Combining SAS Data Sets|
You can usethese methods to combine SAS data sets:
Thefollowing figure shows the results of concatenating two SAS data sets. Concatenatingthe data sets appends the observations from one data set to another data set.The DATA step reads DATA1 sequentially until all observations have been processed,and then reads DATA2. Data set COMBINED contains the results of the concatenation.Note that the data sets are processed in the order in which they are listedin the SET statement.
Concatenating Two Data Sets
Thefollowing figure shows the results of interleaving two SAS data sets. Interleavingintersperses observations from two or more data sets, based on one or morecommon variables. Data set COMBINED shows the result.
Interleaving Two Data Sets
One-to-One Reading and One-to-One Merging
The following figure shows theresults of one-to-one reading and one-to-one merging. One-to-one reading combinesobservations from two or more SAS data sets by creating observations thatcontain all of the variables from each contributing data set. Observationsare combined based on their relative position in each data set, that is, thefirst observation in one data set with the first in the other, and so on.The DATA step stops after it has read the last observation from the smallestdata set. One-to-one merging is similar to a one-to-one reading, with twoexceptions: you use the MERGE statement instead of multiple SET statements,and the DATA step reads all observations from all data sets. Data set COMBINEDshows the result.
One-to-One Reading and One-to-One Merging
The following figure showsthe results of match-merging. Match-merging combines observations from twoor more SAS data sets into a single observation in a new data set based onthe values of one or more common variables. Data set COMBINED shows the results.
Match-Merging Two Data Sets
The followingfigure shows the results of updating a master data set. Updating uses informationfrom observations in a transaction data set to delete, add, or alter informationin observations in a master data set. You can update a master data set byusing the UPDATE statement or the MODIFY statement. If you use the UPDATEstatement, your input data sets must be sorted by the values of the variableslisted in the BY statement. (In this example, MASTER and TRANSACTION are bothsorted by Year.) If you use the MODIFY statement, your input data does notneed to be sorted.
UPDATE replaces an existing file with a new file, allowingyou to add, delete, or rename columns. MODIFY performs an update in placeby rewriting only those records that have changed, or by appending new recordsto the end of the file.
Note that by default, UPDATE and MODIFY do not replacenonmissing values in a master data set with missing values from a transactiondata set.
Updating a Master Data Set
|Overview of Tools for Combining SAS Data Sets|
Using Statements and Procedures
Once you understand the basics of establishing relationshipsamong data, the ways to access data, and the ways that you can combine SASdata sets, you can choose from a variety of SAS tools for accessing, combining,and processing your data. The following table lists and briefly describesthe DATA step statements and the procedures that you can use for combiningSAS data sets.
|Statement or Procedure||Action Performed||Sequential||Direct||Can Use with BY statement||Comments|
|BY||Controls the operation of a SET, MERGE, UPDATE, or MODIFYstatement in the DATA step and sets up special grouping variables.||NA||NA||NA||BY-group processing is a means of processing observationsthat have the same values of one or more variables.|
|MERGE||Reads observations from two or more SAS data sets andjoins them into a single observation.||X||X||When using MERGE with BY, the data must be sorted orindexed on the BY variable.|
|MODIFY||Processes observations in a SAS data set in place. (Contrastwith UPDATE.)||X||X||X||Sorted or indexed data are not required for use withBY, but are recommended for performance.|
|SET||Reads an observation from one or more SAS data sets.||X||X||X||Use KEY= or POINT= statement options for directly accessingdata.|
|UPDATE||Applies transactions to observations in a master SASdata set. UPDATE does not update observations in place; it produces an updatedcopy of the current data set.||X||X||Both the master and transaction data sets must be sortedby or indexed on the BY variable.|
|PROC APPEND||Adds the observations from one SAS data set to the endof another SAS data set.||X|
|PROC SQL||Reads an observation from one or more SAS data sets;reads observations from up to 32 SAS data sets and joins them into singleobservations; manipulates observations in a SAS data set in place; easilyproduces a Cartesian product.||X||X||X||All three access methods are available in PROC SQL,but the access method is chosen by the internal optimizer.|
TABLE NOTE 1: PROC SQL is the SAS implementation of Structured Query Language. In additionto expected SQL capabilities, PROC SQL includes additional capabilities specificto SAS, such as the use of formats and SAS macro language.
Using Error Checking
You can use the _IORC_ automatic variable and the SYSRC autocallmacro to perform error checking in a DATA step. Use these tools with the MODIFYstatement or with the SET statement and the KEY= option. For more informationabout these tools, see Error Checking When Using Indexes to Randomly Access or Update Data.
|How to Prepare Your Data Sets|
Before combining SASdata sets, follow these guidelines to producethe results you want:
Sas Set Two Datasets
Know the structure and the contents of the datasets.
Look at sources of common problems.
Ensure that observations are in the correct order,or that they can be retrieved in the correct order (for example, by usingan index).
Test your program.
Knowing the Structure and Contents of the Data Sets
Tohelp determine how your data is related, look at the structureof the data sets. To see the data set structure, execute the DATASETS procedure,the CONTENTS procedure, or access the SAS Explorer window in your windowingenvironment to display the descriptor information. Descriptor informationincludes the number of observations in each data set, the name and attributesof each variable, and which variables are included in indexes. To print asample of the observations, use the PRINT procedure or the REPORT procedure.
You can also use functions such as VTYPE, VLENGTH, andVLENGTHX to show specific descriptor information. For complete informationabout these functions, see SAS Language Reference: Dictionary.
Looking at Sources of Common Problems
If your programdoes not execute correctly, review your inputdata for the following errors:
variables that have the same name but thatrepresent different data
SAS includes only one variable of a given name in thenew data set. If you are merging two data sets that have variables with thesame names but different data, the values from the last data set that wasread are written over the values from other data sets.
To correct the error, you can rename variables beforeyou combine the data sets by using the RENAME= data set option in the SET,UPDATE, or MERGE statement, or you can use the DATASETS procedure.
common variables with the same data butdifferent attributes
The way SAS handles these differences depends on whichattributes are different:
If the type attribute is different, SAS stops processingthe DATA step and issues an error message stating that the variables are incompatible.
To correct this error, you must use a DATA step to re-createthe variables. The SAS statements you use depend on the nature of the variable.
If the lengthattribute is different, SAS takes thelength from the first data set that contains the variable. In the followingexample, all data sets that are listed in the MERGE statement contain thevariable Mileage. In QUARTER1, the length of the variable Mileage is fourbytes; in QUARTER2, it is eight bytes and in QUARTER3 and QUARTER4, it issix bytes. In the output data set YEARLY, the length of the variable Mileageis four bytes, which is the length derived from QUARTER1.
To override the default and set the length yourself,specify the appropriate length in a LENGTH statement that precedesthe SET, MERGE, or UPDATE statement.
Note: If the length of a variable changes as a result of combining datasets, SAS prints a warning message to the log and issues a nonzero returncode (for example, on z/OS, SYSRC=4). If you expect truncation of data--forexample, when removing insignificant blanks from the end of character values,the warning is expected and you do not want SAS to issue a nonzero returncode. In this case, you can turn this warning off by setting the VARLENCHKsystem option to NOWARN. For more information, see the VARLENCHK system option in SAS Language Reference: Dictionary.
label, format, and informat attributes
If any of theseattributes are different, SAS takesthe attribute from the first data set that contains the variable with thatattribute. However, any label, format, or informat that you explicitly specifyoverrides a default. If all data sets contain explicitly specified attributes,the one specified in the first data set overrides the others. To ensure thatthe new output data set has the attributes you prefer, use an ATTRIB statement.
You can also use the SAS File I/O functions such asVLABEL, VLABELX, and other Variable Information functions to access this information. For complete information about these functions, see SAS Language Reference: Dictionary.
Ensuring Correct Order
If you use BY-group processingwith the UPDATE, SET, and MERGEstatements to combine data sets, ensure that the observations in the datasets are sorted in the order of the variables that are listed in the BY statement,or that the data sets have an appropriate index. If you use BY-group processingin a MODIFY statement, your data does not need to be sorted, but sorting thedata improves efficiency. The BY variable or variables must be common to bothdata sets, and they must have the same attributes. For more information, seeBY-Group Processing in the DATA Step.
Sas Join Datasets
Testing Your Program
Sas Merge Datasets Left Join
Asa final step in preparing your data sets, you should test your program. Createsmall temporary SAS data sets that contain a sample of observations that testall of your program's logic. If your logic is faulty and you get unexpectedoutput, you can use the DATA step debugger to debug your program. For completeinformation about the DATA Step Debugger, see SAS Language Reference: Dictionary.
Sas Add Variable To Dataset
|Previous Page||Next Page||Top of Page|