Combine Data From Multiple Workbooks

Introduction

One of the most commonly used pandas functions is read_excel. This short article shows how you can read in all the tabs in an Excel workbook and combine them into a single pandas dataframe using one command.

For those of you that want the TLDR, here is the command:

Read on for an explanation of when to use this and how it works.

Excel Worksheets

For the purposes of this example, we assume that the Excel workbook is structured like this:

The process I will describe works when:

  • The data is not duplicated across tabs (sheet1 is one full month and the subsequent sheets have only a single month’s worth of data)
  • The columns are all named the same
  • You wish to read in all tabs and combine them

Understanding read_excel

The read_excel function is a feature-packed pandas function. For this specific case, we can use the sheet_name parameter to streamline the reading in of all the sheets in our Excel file.

Most of the time, you will read in a specific sheet from an Excel file:

If you look carefully at the documentation, you may notice that if you use sheet_name=None, you can read in all the sheets in the workbook at one time. Let’s try it:

Pandas will read in all the sheets and return a dictionary of dataframes keyed by sheet name (older pandas versions returned a collections.OrderedDict). For the purposes of the readability of this article, I’m defining the full url and passing it to read_excel. In practice, you may decide to make this one command.
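A minimal sketch of the two calls described above. The helper names read_one_sheet and read_all_sheets and the default sheet name "Sheet1" are illustrative assumptions, not part of pandas; only read_excel and its sheet_name parameter come from the library.

```python
import pandas as pd


def read_one_sheet(workbook_path, sheet="Sheet1"):
    """Read a single named sheet into one DataFrame."""
    return pd.read_excel(workbook_path, sheet_name=sheet)


def read_all_sheets(workbook_path):
    """Read every sheet; the result maps sheet name -> DataFrame."""
    # sheet_name=None asks read_excel to load all sheets at once.
    return pd.read_excel(workbook_path, sheet_name=None)
```

With a workbook on disk, all_dfs = read_all_sheets(path) would give the dictionary inspected in the next step. Reading .xlsx files also needs an Excel engine such as openpyxl installed.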

Let’s inspect the resulting all_dfs:

If you want to access a single sheet as a dataframe:

   account number             name       sku  quantity  unit price  ext price                 date
0          412290    Jerde-Hilpert  S2-77896        43       76.66    3296.38  2018-03-04 23:10:28
1          383080         Will LLC  S1-93683        28       90.86    2544.08  2018-03-05 05:11:49
2          729833        Koepp Ltd  S1-30248        13       44.84     582.92  2018-03-05 17:33:52
3          424914    White-Trantow  S2-82423        38       50.93    1935.34  2018-03-05 21:40:10
4          672390  Kuhn-Gusikowski  S1-50961        34       48.20    1638.80  2018-03-06 11:59:00
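Inspecting the dictionary and pulling out one sheet can be sketched as follows. To keep the example runnable without a workbook, all_dfs here is a small in-memory stand-in for what read_excel returns with sheet_name=None; the sheet names and the two-column layout are assumptions for illustration.

```python
import pandas as pd

# Stand-in for the dict returned by read_excel(..., sheet_name=None).
all_dfs = {
    "Sheet1": pd.DataFrame({"name": ["Jerde-Hilpert", "Will LLC"], "quantity": [43, 28]}),
    "Sheet2": pd.DataFrame({"name": ["Koepp Ltd"], "quantity": [13]}),
}

# The keys are the sheet names.
sheet_names = list(all_dfs.keys())
print(sheet_names)

# Each value is an ordinary DataFrame, so the usual methods apply.
first_sheet = all_dfs["Sheet1"]
print(first_sheet.head())
```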

If we want to join all the individual dataframes into one single dataframe, use pd.concat:

In this case, we use ignore_index since the automatically generated indices of Sheet1, Sheet2, etc. are not meaningful.
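The effect of ignore_index is easiest to see side by side. This sketch again uses a small in-memory dictionary as a stand-in for the sheets (the data values are illustrative): without ignore_index the sheet names end up in the index, and with it the rows are simply renumbered.

```python
import pandas as pd

# Stand-in for the dict of sheets returned by read_excel(..., sheet_name=None).
all_dfs = {
    "Sheet1": pd.DataFrame({"sku": ["S2-77896", "S1-93683"], "quantity": [43, 28]}),
    "Sheet2": pd.DataFrame({"sku": ["S1-30248"], "quantity": [13]}),
}

# Without ignore_index, the dict keys become the outer level of a MultiIndex.
keyed = pd.concat(all_dfs)
print(keyed.index.tolist())

# With ignore_index=True, the rows get a fresh 0..n-1 index.
combined = pd.concat(all_dfs, ignore_index=True)
print(combined.index.tolist())
```

If you do want to know which sheet each row came from, the keyed version keeps that information in the first index level.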

If your data meets the structure outlined above, this one-liner will return a single pandas dataframe that combines the data in each Excel worksheet:
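A sketch of that one-liner, wrapped in a function so the workbook location stays a parameter. The function name combine_all_sheets is a hypothetical helper; the body is just read_excel with sheet_name=None passed straight to pd.concat.

```python
import pandas as pd


def combine_all_sheets(workbook_path):
    """Read every sheet in the workbook and stack them into one DataFrame."""
    return pd.concat(pd.read_excel(workbook_path, sheet_name=None), ignore_index=True)
```

This assumes every sheet shares the same column names, as listed earlier; sheets with different columns would still concatenate, but with missing values where columns do not line up.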

Summary

This trick can be useful in the right circumstances. It also illustrates how much power there is in a pandas command that “just” reads in an Excel file. The full notebook is available on github if you would like to try it out for yourself.

In last week’s post we looked at how to combine multiple files together using Power Query. This week we’re going to stay within the same workbook, and combine multiple worksheets using Power Query.

Let’s consider a case where the user has been creating a transactional history in an Excel file. It is all structured as per the image below, but resides across multiple worksheets; one for each month:

As you can see, they’ve carefully named each sheet with the month and year. But unfortunately, they haven’t formatted any of the data using Excel tables.

Now the file lands in our hands (you can download a copy here if you’d like to follow along), and we’d like to turn this into one consolidated table so that we can do some analysis on it.

Naturally we’re going to reach for Power Query to do this, but how do we get started? We could just go and format the data on each worksheet as a table, but what if there were hundreds? That would take way too much work!

But so far we’ve only seen how to pull Tables, Named Ranges or files into Power Query. How do we get at the worksheets?

Basically, we’re going to start with two lines of code:

  • Go to Power Query –> From Other Sources –> Blank Query
  • View –> Advanced Editor

You’ll now see the following blank query:

let
    Source = ""
in
    Source

What we need to do is replace the second line (Source = "") with the following two lines of code:

    FullFilePath = "D:\Temp\Combine Worksheets.xlsx",
    Source = Excel.Workbook(File.Contents(FullFilePath))

Of course, you’ll want to update the path to the full file path for where the file is saved on your system.

Once you click Done, you should see the following:

Cool! We’ve got a list of all the worksheets in the file!

The next step is to prep the fields we want to preserve as we combine the worksheets. Obviously the Name and Item columns are redundant, so let’s do a bit of cleanup here.

  • Remove the Kind column
  • Select the Name column –> Transform –> Data Type –> Date
  • Select the Name column –> Transform –> Date –> Month –> End of Month
  • Rename the Name column to “Date”
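For readers following along in pandas rather than Power Query, the same idea (turning sheet names like "Jan 2008" into month-end dates) can be sketched as below. This is a pandas analogue, not the Power Query steps themselves, and the sheet names are assumed examples based on the naming pattern described earlier.

```python
import pandas as pd

# Hypothetical sheet names following the "Mon YYYY" pattern.
sheet_names = pd.Series(["Jan 2008", "Feb 2008", "Mar 2008"])

# Parse each name as a date; this yields the first day of the month.
parsed = pd.to_datetime(sheet_names, format="%b %Y")

# MonthEnd(0) rolls each date forward to the last day of its own month.
month_end = parsed + pd.offsets.MonthEnd(0)
print(month_end)
```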

At this point, the query should look like so:

Next we’ll click the little double headed arrow to the top right of the data column to expand our records, and commit to expanding all the columns offered:

Hmm… well that’s a bit irritating. It looks like we’re going to need to promote the top row to headers, but that means we’re going to overwrite the Date column header in column 1. Oh well, nothing to be done about it now, so:

  • Transform –> Use First Row As Headers –> Use First Row As Headers
  • Rename Column1 (the header won’t accept 1/31/2008 as a column name) to “Date” again
  • Rename the Jan 2008 column (far right) to “Original Worksheet”

We’re almost done, but let’s just do a bit of final cleanup here. As we set the data types correctly, let’s also make sure that we remove any errors that might come up from invalid data types.

  • Select the Date column
  • Home –> Remove Errors
  • Set Account and Dept to Text
  • Set Amount to Decimal Number
  • Select the Amount column
  • Home –> Remove Errors
  • Set Original Worksheet to Text
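The type-setting and error-removal steps above have a rough pandas counterpart, sketched here for comparison. It is an analogue rather than the Power Query method: values that fail conversion become NaN and are then dropped, which plays the role of Remove Errors. The sample rows, including the stray repeated header row, are made up for illustration.

```python
import pandas as pd

# Hypothetical rows after stacking sheets: a repeated header row is mixed in.
raw = pd.DataFrame({
    "Account": [1001, "Account", 1002],
    "Dept": [10, "Dept", 20],
    "Amount": ["250.50", "Amount", "99.00"],
})

# Values that cannot be converted become NaN instead of raising an error.
raw["Amount"] = pd.to_numeric(raw["Amount"], errors="coerce")

# Dropping the NaN rows removes the entries that failed conversion.
clean = raw.dropna(subset=["Amount"]).copy()

# Account and Dept are identifiers, so store them as text.
clean["Account"] = clean["Account"].astype(str)
clean["Dept"] = clean["Dept"].astype(str)
print(clean)
```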

Rename the query to “Consolidated”, and load it to a worksheet.

Before you do anything else, Save the File.

To be fair, our query has enough safeguards in it that we don’t actually have to do this, but I always like to play it safe. Let’s review the completed query…

Edit the Consolidated query, and step into the Source line step. Check out that preview pane:

Interesting… two more objects! This makes sense, as we created a new table and worksheet when we retrieved this into a worksheet. We need to filter those out.

Getting rid of the table is easy:

  • Select the drop down arrow on the Kind column
  • Uncheck “Table”, then confirm when asked if you’d like to insert a step

Select the next couple of steps as well, and take a look at the output as you do.

Aha! When you hit the “ChangedType” step, something useful happens… we generate an error:

Let’s remove that error from the Name column.

  • Select the Name column –> Home –> Remove Errors

And we’re done. We’ve managed to successfully combine all the data worksheets in our file into one big table!

This method creates a bit of a loop in that I’m essentially having to reach outside Excel to open a copy of the workbook to pull the sheet listing in. And it causes issues for us, since Power Query only reads from the last save point of the external file we’re connecting to (in this case this very workbook.) I’d way rather have an Excel.CurrentWorkbook() style method to read from inside the file, but unfortunately that method won’t let you read your worksheets.

It would also be super handy to have an Excel.CurrentWorkbookPath() method. Hard coding the path here is a real challenge if you move the file. I’ve asked Microsoft for this, but if you think it is a good idea as well, please leave a comment on the post. (They’ll only count one vote from me, but they’ll count yours if you leave it here!)