Sas Join Datasets

This paper discusses methods of joining SAS data sets. The different methods and the reasons for choosing a particular method of joining are contrasted and compared. Potential problems and limitations when joining data sets are also discussed. The need to combine data sets constantly arises during software development, as does the need to. Multiple SAS data sets can be merged based on a specific common variable to give a single data set. This is done using the MERGE statement and BY statement. The total number of observations in the merged data set is often less than the sum of the number of observations in the original data sets.

Previous PageNext Page
Sas
Merging SAS Data Sets
Merging with a BY Statement

Merging with a BY statement enables youto match observations according to the values of the BY variables that youspecify. Before you can perform a match-merge, all data sets must be sortedby the variables that you want to use for the merge.

In order to understand match-merging, you must understandthree key concepts:

BY variable

is a variable named in a BY statement.

BY value

is the value of a BY variable.

BY group

is the set of all observations withthe same value for the BY variable (if there is only one BY variable). Ifyou use more than one variable in a BY statement, then a BY group is the setof observations with a unique combination of values for those variables. Indiscussions of match-merging, BY groups commonly span more than one data set.

Input SAS Data Set for Examples

For example, the director of a small repertory theatercompany, the Little Theater, maintains company records in two SAS data sets,COMPANY and FINANCE.

Data SetVariableDescription
COMPANYNameplayer's name
Ageplayer's age
Genderplayer's gender
FINANCENameplayer's name
IdNumberplayer's employee ID number
Salaryplayer's annual salary

The following program creates,sorts, and displays COMPANYand FINANCE:

The following output displays the data sets. Notice that the FINANCEdata set does not contain an observation for Michael Morrison.

The Program

To avoid having to maintain two separate data sets,the director wants to merge the records for each player from both data setsinto a new data set that contains all the variables. The variable that iscommon to both data sets is Name. Therefore, Name is the appropriate BY variable.

The data sets are already sorted by NAME, so no furthersorting is required. The following program merges them by NAME:

The following output displays the merged data set:

Explanation

The new data set contains one observation for each playerin the company. Each observation contains all the variables from both datasets. Notice in particular the fourth observation. The data set FINANCE doesnot have an observation for Michael Morrison. In this case, the values ofthe variables that are unique to FINANCE (IdNumber and Salary) are missing.

Match-Merging Data Sets with Multiple Observations in a BY Group

Input SAS Data Set for Examples

The Little Theaterhas a third data set, REPERTORY, that tracks the casting assignments in eachof the season's plays. REPERTORY contains these variables:

Play

is the name of one of the plays inthe repertory.

Role

is the name of a character in Play.

IdNumber

is the employee ID number of theplayer playing Role.

The following program creates and displays REPERTORY:

The following output displays the REPERTORY data set:

To maintain confidentiality during preliminary casting,this data set identifies players by employee ID number. However, casting decisionsare now final, and the manager wants to replace each employee ID number withthe player's name. Of course, it is possible to re-create the data set, enteringeach player's name instead of the employee ID number in the raw data. However,it is more efficient to make use of the data set FINANCE, which already containsthe name and employee ID number of all players (see The COMPANY and FINANCE Data Sets). When the data sets aremerged, SAS takes care of adding the players' names to the data set.

Of course, before you can merge the data sets, you mustsort them by IdNumber.

The following output displays the FINANCE and REPERTORY datasets, sorted by IdNumber:

These two data sets contain seven BY groups; that is,among the 23 observations are seven different values for the BY variable,IdNumber. The first BY group has a value of 029-46-9261 for IdNumber. FINANCEhas one observation in this BY group; REPERTORY has two. The last BY grouphas a value of 929-75-0218 for IdNumber. FINANCE has one observation in thisBY group; REPERTORY has three.


The Program

The following program merges the data sets FINANCE andREPERTORY and illustrates what happens when a BY group in one data set hasmore observations in it than the same BY group in the other data set.

The resulting data set contains all variables from bothdata sets.

Note: The OPTIONS statement extends theline size to 120 so that PROC PRINT can display all variables on one line.Most output in this section is created with line size set to 76 in the OPTIONSstatement. An OPTIONS statement appears only in examples using a differentline size. When you set the LINESIZE= option, it remains in effect untilyou reset it or end the SAS session.

The following output displays the merged data set:


Explanation

Carefully examine the first few observations in thenew data set and consider how SAS creates them.

Proc datasets repair example
  1. Before executing the DATA step, SAS reads thedescriptor portion of the two data sets and creates a program data vectorthat contains all variables from both data sets:

    • IdNumber, Name, and Salary from FINANCE

    • Play and Role fromREPERTORY.

    IdNumber is already in the program data vector because itis in FINANCE. SAS sets the values of all variables to missing, as the followingfigure illustrates.

    Program Data Vector before Reading from Data Sets

  2. SAS looks at the first BY group in each data setto determine which BY group should appear first. In this case, the first BYgroup, observations with the value 029-46-9261 for IdNumber, is the same inboth data sets.

  3. SAS reads and copies the first observation fromFINANCE into the program data vector, as the next figure illustrates.

    Program Data Vector after Reading FINANCE Data Set

  4. SAS reads and copies the first observation fromREPERTORY into the program data vector, as the next figure illustrates. Ifa data set does not have any observations in a BY group, then the programdata vector contains missing values for the variables that are unique to thatdata set.

    Program Data Vector after Reading REPERTORY Data Set

  5. SAS writes the observation to the new data setand retains the values in the program data vector. (If the program data vectorcontained variables created by the DATA step, then SAS would set them to missingafter writing to the new data set.)

  6. SAS looks for a second observation in the BY groupin each data set. REPERTORY has one; FINANCE does not. The MERGE statementreads the second observation in the BY group from REPERTORY. Because FINANCEhas only one observation in the BY group, the statement uses the values ofName (Rudelich , Herbert) and Salary (35000) retained in the programdata vector for the second observation in the new data set. The next figureillustrates this behavior.

    Program Data Vector with Second Observation in the BY Group

  7. SAS writes the observation to the new data set.Neither data set contains any more observations in this BY group. Therefore,as the final figure illustrates, SAS sets all values in the program data vectorto missing and begins processing the next BY group. It continues processingobservations until it exhausts all observations in both data sets.

    Program Data Vector before New BY Groups

Match-Merging Data Sets with Dropped Variables

Now that casting decisions are final,the director wants to post the casting list, but does not want to includesalary or employee ID information. As the next program illustrates, Salaryand IdNumber can be eliminated by using the DROP= data set option when creatingthe new data set.

Sas Data Set Example

Note: The difference in placement of thetwo DROP= data set options is crucial. Dropping IdNumber in the DATA statementmeans that the variable is available to the MERGE and BY statements (to whichit is essential) but that it does not go into the new data set. Dropping Salaryin the MERGE statement means that the MERGE statement does not even read thisvariable, so Salary is unavailable to the program statements. Because thevariable Salary is not needed for processing, it is more efficient to preventit from being read into the PDV in the first place.

The followingoutput displays the merged data set without the IdNumber and Salary variables:
Match-Merging Data Sets with the Same Variables

You canmatch-merge data sets that contain the same variables(variables with the same name) by using the RENAME= data set option, justas you would when performing a one-to-one merge (see Performing a One-to-One Merge on Data Sets with the Same Variables).

If you do not use the RENAME= option and avariableexists in more than one data set, then the value of that variable in the lastdata set read is the value that goes into the new data set.

Match-Merging Data Sets That Lack a Common Variable

You can name any number of datasets in the MERGE statement. However, if you are match-merging the data sets,then you must be sure they all have a common variable and are sorted by thatvariable. If the data sets do not have a common variable, then you mightbe able to use another data set that has variables common to the originaldata sets to merge them.

For instance, consider the data sets that are used inthe match-merge examples. The table that follows shows the names of the datasets and the names of the variables in each data set.

Data SetVariables
COMPANYName, Age, Gender
FINANCEName, IdNumber, Salary
REPERTORYPlay, Role, IdNumber

These data sets donot share a common variable. However,COMPANY and FINANCE share the variable Name. Similarly, FINANCE and REPERTORYshare the variable IdNumber. Therefore, as the next program shows, you canmerge the data sets into one with two separate DATA steps. As usual, you mustsort the data sets by the appropriate BY variable. (REPERTORY is already sortedby IdNumber.)

In order to merge the three data sets, this program

  • sorts FINANCE and COMPANY byName

  • merges COMPANY and FINANCE into a temporary dataset, TEMP

  • sorts TEMP by IdNumber

  • merges TEMP and REPERTORY by IdNumber.

The followingoutput displays the resulting data set, ALL:
Previous PageNext PageTop of Page
Previous PageNext Page
Reading, Combining, and Modifying SAS Data Sets
What You Need to Know before Combining Information Stored In Multiple SAS Data Sets

Many applications require input data to be in aspecificformat before the data can be processed to produce meaningful results. Thedata typically comes from multiple sources and might be in different formats.Therefore, you often, if not always, have to take intermediate steps to logicallyrelate and process data before you can analyze it or create reports from it.

Application requirements vary, but there are commonfactors for all applications that access, combine, and process data. Onceyou have determined what you want the output to look like, you must

Datasets
  • determine how the input data is related

  • ensurethat the data is properly sorted or indexed,if necessary

  • select the appropriate access method to processthe input data

  • select the appropriate SAS tools to complete thetask

Sas Join Datasets

The Four Ways That Data Can Be Related

Relationshipsamong multiple sources of input data exist when each of the sources containscommon data, either at the physical or logical level. For example, employeedata and department data could be related through an employee ID variablethat shares common values. Another data set could contain numeric sequencenumbers whose partial values logically relate it to a separate data set byobservation number.

You must be able to identify the existing relationshipsin your data. This knowledge is crucial for understanding how to process inputdata in order to produce desired results. All related data fall into one ofthese four categories, characterized by how observations relate among thedata sets:

  • one-to-one

  • one-to-many

  • many-to-one

  • many-to-many

Toobtain the results you want, you should understandhow each of these methods combines observations, how each method treats duplicatevalues of common variables, and how each method treats missing values or nonmatchedvalues of common variables. Some of the methods also require that you preprocessyour data sets by sorting them or by creating indexes. See the descriptionof each method in Combining SAS Data Sets: Methods.


One-to-One

In a one-to-one relationship, typically a single observationin one data set is related to a single observation from another based on thevalues of one or more selected variables. A one-to-one relationship impliesthat each value of the selected variable occurs no more than once in eachdata set. When you work with multiple selected variables, this relationshipimplies that each combination of values occurs no more than once in each dataset.

In the following example, observations in data setsSALARY and TAXES are related by common values for EmployeeNumber.

One-to-One Relationship


One-to-Many and Many-to-One

A one-to-many or many-to-one relationship between input datasets implies that one data set has at most one observation with a specificvalue of the selected variable, but the other input data set can have morethan one occurrence of each value. When you work with multiple selected variables,this relationship implies that each combination of values occurs no more thanonce in one data set, but can occur more than once in the other data set.The order in which the input data sets are processed determines whether therelationship is one-to-many or many-to-one.

In the following example, observations in data setsONE and TWO are related by common values for variable A. Values of A are uniquein data set ONE but not in data set TWO.

One-to-Many Relationship

In the following example, observations in data setsONE, TWO, and THREE are related by common values for variable ID. Values ofID are unique in data sets ONE and THREE but not in TWO. For values 2 and3 of ID, a one-to-many relationship exists between observations in data setsONE and TWO, and a many-to-one relationship exists between observations indata sets TWO and THREE.

One-to-Many and Many-to-One Relationships


Join

Many-to-Many

The many-to-many category implies that multiple observationsfrom each input data set can be related based on values of one or more commonvariables.

In the following example, observations in data setsBREAKDOWN and MAINTENANCE are related by common values for variable Vehicle.Values of Vehicle are not unique in either data set. A many-to-many relationshipexists between observations in these data sets for values AAA and CCC of Vehicle.

Many-to-Many Relationship

Access Methods: Sequential versus Direct

Overview

Once you have established data relationships, the nextstep is to determine the best mode of data access to relate the data. Youcan access observations sequentially in the order in which they appear inthe physical file. Or you can access them directly, that is, you can go straightto an observation in a SAS data set without having to process each observationthat precedes it.


Sequential Access

The simplest and perhaps most common way to process data witha DATA step is to read observations in a data set sequentially. You can readobservations sequentially using the SET, MERGE, UPDATE, or MODIFY statements.You can also use the SAS File I/O functions, such as OPEN, FETCH, and FETCHOBS.


Direct Access

Direct access allows a program to access specific observationsbased on one of two methods:

  • by an observation number

  • by the value of one or more variables througha simple or composite index.

To access observations directly by their observationnumber, use the POINT= option with the SET or MODIFY statement. The POINT=option names a variable whose current value determines which observation aSET or MODIFY statement reads.

Toaccess observations directly based on the values of one or more specifiedvariables, you must first create an index for the variables and then readthe data set using the KEY= statement option with the SET or MODIFY statement.An index is a separate structure that contains the data valuesof the key variable or variables, paired with a location identifier for theobservations containing the value.

Note: You can alsouse the SAS File I/O functions such as CUROBS, NOTE, POINT, and FETCHOBS toaccess observations by observation number.

Overview of Methods for Combining SAS Data Sets

You can usethese methods to combine SAS data sets:

  • concatenating

  • interleaving

  • one-to-onereading

  • one-to-one merging

  • matchmerging

  • updating


Concatenating

Thefollowing figure shows the results of concatenating two SAS data sets. Concatenatingthe data sets appends the observations from one data set to another data set.The DATA step reads DATA1 sequentially until all observations have been processed,and then reads DATA2. Data set COMBINED contains the results of the concatenation.Note that the data sets are processed in the order in which they are listedin the SET statement.

Concatenating Two Data Sets


Interleaving

Thefollowing figure shows the results of interleaving two SAS data sets. Interleavingintersperses observations from two or more data sets, based on one or morecommon variables. Data set COMBINED shows the result.

Interleaving Two Data Sets


One-to-One Reading and One-to-One Merging

The following figure shows theresults of one-to-one reading and one-to-one merging. One-to-one reading combinesobservations from two or more SAS data sets by creating observations thatcontain all of the variables from each contributing data set. Observationsare combined based on their relative position in each data set, that is, thefirst observation in one data set with the first in the other, and so on.The DATA step stops after it has read the last observation from the smallestdata set. One-to-one merging is similar to a one-to-one reading, with twoexceptions: you use the MERGE statement instead of multiple SET statements,and the DATA step reads all observations from all data sets. Data set COMBINEDshows the result.

One-to-One Reading and One-to-One Merging


Match-Merging

The following figure showsthe results of match-merging. Match-merging combines observations from twoor more SAS data sets into a single observation in a new data set based onthe values of one or more common variables. Data set COMBINED shows the results.

Match-Merging Two Data Sets


Updating

The followingfigure shows the results of updating a master data set. Updating uses informationfrom observations in a transaction data set to delete, add, or alter informationin observations in a master data set. You can update a master data set byusing the UPDATE statement or the MODIFY statement. If you use the UPDATEstatement, your input data sets must be sorted by the values of the variableslisted in the BY statement. (In this example, MASTER and TRANSACTION are bothsorted by Year.) If you use the MODIFY statement, your input data does notneed to be sorted.

UPDATE replaces an existing file with a new file, allowingyou to add, delete, or rename columns. MODIFY performs an update in placeby rewriting only those records that have changed, or by appending new recordsto the end of the file.

Note that by default, UPDATE and MODIFY do not replacenonmissing values in a master data set with missing values from a transactiondata set.

Updating a Master Data Set

Overview of Tools for Combining SAS Data Sets

Using Statements and Procedures

Once you understand the basics of establishing relationshipsamong data, the ways to access data, and the ways that you can combine SASdata sets, you can choose from a variety of SAS tools for accessing, combining,and processing your data. The following table lists and briefly describesthe DATA step statements and the procedures that you can use for combiningSAS data sets.

Statements or Procedures for Combining SAS Data Sets


Access Method

Statement or ProcedureAction PerformedSequentialDirectCan Use with BY statementComments
BYControls the operation of a SET, MERGE, UPDATE, or MODIFYstatement in the DATA step and sets up special grouping variables.NANANABY-group processing is a means of processing observationsthat have the same values of one or more variables.
MERGEReads observations from two or more SAS data sets andjoins them into a single observation.X
XWhen using MERGE with BY, the data must be sorted orindexed on the BY variable.
MODIFYProcesses observations in a SAS data set in place. (Contrastwith UPDATE.)XXXSorted or indexed data are not required for use withBY, but are recommended for performance.
SETReads an observation from one or more SAS data sets.XXXUse KEY= or POINT= statement options for directly accessingdata.
UPDATEApplies transactions to observations in a master SASdata set. UPDATE does not update observations in place; it produces an updatedcopy of the current data set.X
XBoth the master and transaction data sets must be sortedby or indexed on the BY variable.
PROC APPENDAdds the observations from one SAS data set to the endof another SAS data set.X


PROC SQLReads an observation from one or more SAS data sets;reads observations from up to 32 SAS data sets and joins them into singleobservations; manipulates observations in a SAS data set in place; easilyproduces a Cartesian product.XXXAll three access methods are available in PROC SQL,but the access method is chosen by the internal optimizer.

TABLE NOTE 1: PROC SQL is the SAS implementation of Structured Query Language. In additionto expected SQL capabilities, PROC SQL includes additional capabilities specificto SAS, such as the use of formats and SAS macro language.


Using Error Checking

You can use the _IORC_ automatic variable and the SYSRC autocallmacro to perform error checking in a DATA step. Use these tools with the MODIFYstatement or with the SET statement and the KEY= option. For more informationabout these tools, see Error Checking When Using Indexes to Randomly Access or Update Data.

How to Prepare Your Data Sets
Sas data set example

Before combining SASdata sets, follow these guidelines to producethe results you want:

Sas Set Two Datasets

  • Know the structure and the contents of the datasets.

  • Look at sources of common problems.

  • Ensure that observations are in the correct order,or that they can be retrieved in the correct order (for example, by usingan index).

  • Test your program.


Knowing the Structure and Contents of the Data Sets

Tohelp determine how your data is related, look at the structureof the data sets. To see the data set structure, execute the DATASETS procedure,the CONTENTS procedure, or access the SAS Explorer window in your windowingenvironment to display the descriptor information. Descriptor informationincludes the number of observations in each data set, the name and attributesof each variable, and which variables are included in indexes. To print asample of the observations, use the PRINT procedure or the REPORT procedure.

You can also use functions such as VTYPE, VLENGTH, andVLENGTHX to show specific descriptor information. For complete informationabout these functions, see SAS Language Reference: Dictionary.


Looking at Sources of Common Problems

If your programdoes not execute correctly, review your inputdata for the following errors:

  • variables that have the same name but thatrepresent different data

    SAS includes only one variable of a given name in thenew data set. If you are merging two data sets that have variables with thesame names but different data, the values from the last data set that wasread are written over the values from other data sets.

    To correct the error, you can rename variables beforeyou combine the data sets by using the RENAME= data set option in the SET,UPDATE, or MERGE statement, or you can use the DATASETS procedure.

  • common variables with the same data butdifferent attributes

    The way SAS handles these differences depends on whichattributes are different:

    • type attribute

      If the type attribute is different, SAS stops processingthe DATA step and issues an error message stating that the variables are incompatible.

      To correct this error, you must use a DATA step to re-createthe variables. The SAS statements you use depend on the nature of the variable.

    • length attribute

      If the lengthattribute is different, SAS takes thelength from the first data set that contains the variable. In the followingexample, all data sets that are listed in the MERGE statement contain thevariable Mileage. In QUARTER1, the length of the variable Mileage is fourbytes; in QUARTER2, it is eight bytes and in QUARTER3 and QUARTER4, it issix bytes. In the output data set YEARLY, the length of the variable Mileageis four bytes, which is the length derived from QUARTER1.

      To override the default and set the length yourself,specify the appropriate length in a LENGTH statement that precedesthe SET, MERGE, or UPDATE statement.

      Note: If the length of a variable changes as a result of combining datasets, SAS prints a warning message to the log and issues a nonzero returncode (for example, on z/OS, SYSRC=4). If you expect truncation of data--forexample, when removing insignificant blanks from the end of character values,the warning is expected and you do not want SAS to issue a nonzero returncode. In this case, you can turn this warning off by setting the VARLENCHKsystem option to NOWARN. For more information, see the VARLENCHK system option in SAS Language Reference: Dictionary.

    • label, format, and informat attributes

      If any of theseattributes are different, SAS takesthe attribute from the first data set that contains the variable with thatattribute. However, any label, format, or informat that you explicitly specifyoverrides a default. If all data sets contain explicitly specified attributes,the one specified in the first data set overrides the others. To ensure thatthe new output data set has the attributes you prefer, use an ATTRIB statement.

      You can also use the SAS File I/O functions such asVLABEL, VLABELX, and other Variable Information functions to access this information. For complete information about these functions, see SAS Language Reference: Dictionary.


Ensuring Correct Order

If you use BY-group processingwith the UPDATE, SET, and MERGEstatements to combine data sets, ensure that the observations in the datasets are sorted in the order of the variables that are listed in the BY statement,or that the data sets have an appropriate index. If you use BY-group processingin a MODIFY statement, your data does not need to be sorted, but sorting thedata improves efficiency. The BY variable or variables must be common to bothdata sets, and they must have the same attributes. For more information, seeBY-Group Processing in the DATA Step.


Sas Join Datasets

Testing Your Program

Sas Merge Datasets Left Join

Asa final step in preparing your data sets, you should test your program. Createsmall temporary SAS data sets that contain a sample of observations that testall of your program's logic. If your logic is faulty and you get unexpectedoutput, you can use the DATA step debugger to debug your program. For completeinformation about the DATA Step Debugger, see SAS Language Reference: Dictionary.

Sas Add Variable To Dataset

Previous PageNext PageTop of Page