Left Join Merge R

We will practice on our continents data.frame from module 2 and the gapminder data.frame. Note how these are tidy data: we have observations at the level of continent and at the level of country, so they go in different tables. The continent column in the gapminder data.frame allows us to link them. If the continents data.frame isn’t in your Environment, load it and recall what it consists of:

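Full join, inner join, left join, and right join in R: how two data frames can be joined. You will often need to merge two datasets. Broadly, there are two ways to do so: by column and by row. Here we focus on merging by column. Suppose there are two datasets, a and b, whose shared columns are "chr" and "bin"; the bin values 1, 2, 3, 4, and 5 are common to a and b.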

We can join the two data.frames using any of the dplyr join functions. We will pass the results to str to avoid printing more than we can read, and to get more high-level information on the resulting data.frames.

Join
  • In SQL database terminology, the default value of all = FALSE gives a natural join, a special case of an inner join. Specifying all.x = TRUE gives a left (outer) join, all.y = TRUE a right (outer) join, and both (all = TRUE) a (full) outer join. DBMSes do not match NULL records, equivalent to incomparables = NA in R.
  • Left (outer) join in R. A left join in R consists of matching all the rows in the first data frame with the corresponding values in the second. Recall that ‘Jack’ was in the first table but not in the second. To create the join, you just have to set all.x = TRUE as follows: merge(x = df1, y = df2, all.x = TRUE).
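As a minimal sketch of how the all, all.x, and all.y arguments change the result, the example below uses two small made-up data frames. The names df1 and df2 follow the call above, but their contents (including the columns age and city) are assumptions for illustration only.

```r
# Hypothetical example data: 'Jack' appears only in df1, 'Peter' only in df2.
df1 <- data.frame(name = c("Anna", "Jack", "Maria"),
                  age  = c(34, 28, 41),
                  stringsAsFactors = FALSE)
df2 <- data.frame(name = c("Anna", "Maria", "Peter"),
                  city = c("Lisbon", "Madrid", "Oslo"),
                  stringsAsFactors = FALSE)

# merge() joins on the columns the two data frames share (here: name).
inner <- merge(x = df1, y = df2)                # natural/inner join: Anna, Maria
left  <- merge(x = df1, y = df2, all.x = TRUE)  # left join: keeps Jack, city is NA
right <- merge(x = df1, y = df2, all.y = TRUE)  # right join: keeps Peter, age is NA
full  <- merge(x = df1, y = df2, all = TRUE)    # full outer join: all four names

print(left)
```

In the left join, Jack's row survives even though df2 has no entry for him; the columns that would have come from df2 are filled with NA.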

These operations produce slightly different results: either 1704 or 1705 observations. Can you figure out why? Antarctica contains no countries, so it doesn’t appear in the gapminder data.frame. With gapminder as the left table, left_join keeps only gapminder’s rows, so Antarctica is dropped from the results; right_join keeps every row of continents, so Antarctica appears, with missing values for all of the country-level variables:
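The same effect can be reproduced on a much smaller scale. The sketch below assumes dplyr is installed and uses made-up stand-ins for the two tables (three countries instead of the full gapminder data, and rough placeholder areas), so the row counts are 3 and 4 rather than 1704 and 1705.

```r
library(dplyr)

# Hypothetical miniature tables; values are placeholders, not real data.
countries <- data.frame(
  country   = c("Kenya", "Peru", "Japan"),
  continent = c("Africa", "Americas", "Asia"),
  stringsAsFactors = FALSE
)
continents <- data.frame(
  continent = c("Africa", "Americas", "Asia", "Antarctica"),
  area_km2  = c(30e6, 42e6, 44e6, 14e6),
  stringsAsFactors = FALSE
)

# Left join: one row per row of countries; Antarctica has no match and is absent.
left <- left_join(countries, continents, by = "continent")

# Right join: every row of continents is kept, so Antarctica appears with
# NA in the country-level column.
right <- right_join(countries, continents, by = "continent")

str(left)
str(right)
```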

There’s another problem in this data.frame – it has two population measures, one by continent and one by country and it’s not clear which is which! Let’s rename a couple of columns.
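One way to do the renaming is with dplyr's rename, sketched here on made-up miniature tables. The column names pop and population, and the new names country_pop and continent_pop, are assumptions chosen for illustration.

```r
library(dplyr)

# Hypothetical miniature tables; population figures are made up.
countries <- data.frame(country   = c("Kenya", "Japan"),
                        continent = c("Africa", "Asia"),
                        pop       = c(50, 126),
                        stringsAsFactors = FALSE)
continents <- data.frame(continent  = c("Africa", "Asia"),
                         population = c(1300, 4600),
                         stringsAsFactors = FALSE)

# After the join, 'pop' and 'population' sit side by side; renaming them
# makes the level of each measure explicit. rename() takes new = old pairs.
joined <- left_join(countries, continents, by = "continent") %>%
  rename(country_pop = pop, continent_pop = population)

names(joined)
```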

Challenge – Putting the pieces together

A colleague suggests that the more land area an individual has, the greater their GDP will be, and that this relationship will be observable at any scale of observation. You chuckle and mutter “Not at the continental scale,” but your colleague insists. Test your colleague’s hypothesis by:

  • Calculating the total GDP of each continent,
    • Hint: Use dplyr’s group_by and summarize
  • Joining the resulting data.frame to the continents data.frame,
  • Calculating the per-capita GDP for each continent, and
  • Plotting per-capita GDP versus population density.

Challenge solutions

Solution to Challenge – Putting the pieces together
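A minimal sketch of one possible approach, assuming dplyr is installed. The tables gap_toy and continents_toy are made-up, single-year stand-ins for gapminder and continents, and the column names (pop, gdpPercap, area_km2, population) are assumptions; with the real gapminder data you would first restrict the rows to a single year.

```r
library(dplyr)

# Hypothetical stand-in for gapminder (one year only, made-up numbers).
gap_toy <- data.frame(
  country   = c("A", "B", "C", "D"),
  continent = c("North", "North", "South", "South"),
  pop       = c(100, 300, 200, 200),
  gdpPercap = c(10, 20, 5, 15),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for continents (made-up numbers).
continents_toy <- data.frame(
  continent  = c("North", "South"),
  area_km2   = c(40, 100),
  population = c(400, 400),
  stringsAsFactors = FALSE
)

result <- gap_toy %>%
  group_by(continent) %>%
  summarize(total_gdp = sum(gdpPercap * pop)) %>%    # total GDP per continent
  left_join(continents_toy, by = "continent") %>%    # add continent-level columns
  mutate(gdp_per_capita = total_gdp / population,    # per-capita GDP
         pop_density    = population / area_km2)     # people per km^2

print(result)

# Draw the plot only in an interactive session.
if (interactive()) {
  plot(result$pop_density, result$gdp_per_capita,
       xlab = "Population density", ylab = "Per-capita GDP")
}
```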


This lesson is adapted from the Software Carpentry: R for Reproducible Scientific Analysis Multi-Table Joins materials and Brandon Hurr’s dplyr II: Joins and Set Ops presentation to the Davis R Users Group on February 2, 2016.

This post is a translation of «Unir datos – un repaso de las diferencias entre merge, inner join, left join, right join, full join, cbind y rbind cuando se usa objetos tipo data.table en R», written in response to a request on TW, so pardon my English.

This week someone asked me how to make joins using data.table objects. This person was hesitating over whether to use merge, rbind, or cbind, so I’ve written this blog post based on the script that we used to explain the differences.

When we want to join two datasets, we usually do one of these:

  • Add Rows: Append the rows of one dataset below the other.
  • Add Columns: Append the columns of one dataset to another.
  • Join (vlookup): Here some columns or variables act as a «key» or «id»; data from the first set is added to the second wherever the «key» or «id» is the same in both datasets. Vlookup is an MS Excel user’s term, but it is really a particular case of a join, which is the proper term among computing people. There are different types of join, in summary:
    • Inner Join: Returns only the data which has «matched keys» in both datasets.
    • Left Join: Returns all data from the left dataset and the data with matched keys from the right dataset (vlookup is a left join).
    • Right Join: Returns all data from the right dataset and the data with matched keys from the left dataset.
    • Full Join: Returns all data from both datasets, combining the data for the matched keys.
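The first two operations in the list above, adding rows and adding columns, can be sketched with base R's rbind and cbind. The tables below are made up for illustration and are plain data.frames.

```r
# Hypothetical tables for illustration.
sales_q1 <- data.frame(id = 1:2, amount = c(10, 20))
sales_q2 <- data.frame(id = 3:4, amount = c(30, 40))
regions  <- data.frame(region = c("north", "south"),
                       stringsAsFactors = FALSE)

# Add rows: stack one dataset under the other (both have the same columns).
all_sales <- rbind(sales_q1, sales_q2)

# Add columns: place one dataset beside the other (same number of rows).
with_region <- cbind(sales_q1, regions)

print(all_sales)
print(with_region)
```

Note that cbind pairs rows purely by position and never looks at a key, so it is only safe when both tables are already in the same row order. When rows must be matched by a «key» or «id», a join is the right tool.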

The data.table package is an excellent choice for performing tasks more efficiently in R, but learning to use it requires a bit of reading and patience. Its vignettes are a good introduction.

To perform joins you can use the Dt[X] syntax from the data.table package or use the merge command as if they were data.frame objects. Dt[X] is more efficient than merge, but merge is more intuitive (at least for the average R user). To understand the Dt[X] syntax a little, you have to know that when you write Dt[X] the software searches for the keys in the Dt object based on X’s key, i.e., the basis for the merge is the object X.

The following script shows how to do an Inner Join, Left Join (vlookup), Right Join, and Full Join, and how to add columns and rows, using three data.table objects (note that the output starts with # # #, as you see when using knitr).
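As a minimal sketch of the join part of such a script, assuming the data.table package is installed, the example below uses two made-up keyed tables named Dt and X (their contents and the column names left_val and right_val are hypothetical, not the original data).

```r
library(data.table)

# Hypothetical tables sharing the key column 'id'.
Dt <- data.table(id = 1:3, left_val  = c("a", "b", "c"))
X  <- data.table(id = 2:4, right_val = c("x", "y", "z"))
setkey(Dt, id)
setkey(X, id)

# merge(): the same arguments as for data.frame objects.
inner_m <- merge(Dt, X, by = "id")                # ids 2, 3
left_m  <- merge(Dt, X, by = "id", all.x = TRUE)  # ids 1, 2, 3
right_m <- merge(Dt, X, by = "id", all.y = TRUE)  # ids 2, 3, 4
full_m  <- merge(Dt, X, by = "id", all = TRUE)    # ids 1, 2, 3, 4

# Dt[X]: look up X's key values in Dt. Every row of X is kept, so relative
# to merge(Dt, X) this behaves like a right join.
right_b <- Dt[X]

# X[Dt]: every row of Dt is kept (a left join relative to Dt), but X's
# columns now come before Dt's non-key columns.
left_b <- X[Dt]

# nomatch = 0L drops the rows of X that have no match in Dt (inner join).
inner_b <- Dt[X, nomatch = 0L]

print(names(left_m))
print(names(left_b))
```

Comparing names(left_m) with names(left_b) shows the column-order difference between the two ways of keeping all rows of Dt.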

The perceptive reader will have noticed how to do a Left Join without running into the problem with the order of the columns… I leave that question to you.