Python Merge Two Excel Files

Using the pandas library in Python, we can read data from a source Excel file, insert it into a new Excel file, and then name and save that file. To combine several Excel files, read each one into a DataFrame and concatenate them into a single DataFrame: df = pd.concat(combinedexcels). To bring in matching columns from another file, use pandas’ merge function with a left join, which is similar to Excel’s VLOOKUP function: alldatast = pd.merge(alldata, status, how='left'), then alldatast.head(). This looks pretty good, but let’s look at a specific account: alldatast[alldatast['account number'] == 737550].head().
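As a rough illustration of the concat and left-join steps mentioned above, here is a minimal sketch. The frames jan, feb and status, their columns, and every account number other than 737550 are made up for the example; in practice each frame would come from pd.read_excel.

```python
import pandas as pd

# Hypothetical stand-ins for sheets that would normally be read with pd.read_excel
jan = pd.DataFrame({"account number": [737550, 146832], "sale": [100, 250]})
feb = pd.DataFrame({"account number": [737550, 218895], "sale": [75, 300]})
status = pd.DataFrame({"account number": [737550, 146832],
                       "status": ["gold", "silver"]})

# Stack the per-file frames into one DataFrame
alldata = pd.concat([jan, feb], ignore_index=True)

# Left join: keep every row of alldata and attach status where the key matches
alldatast = pd.merge(alldata, status, how="left", on="account number")

# Look at one specific account
print(alldatast[alldatast["account number"] == 737550].head())
```

Rows whose account number has no match in status keep their data and get a missing value in the status column, which is the same behaviour as a VLOOKUP that finds nothing.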

Python Merge Two Excel Files Into One

Introduction

As part of my continued exploration of pandas, I am going to walk through a real world example of how to use pandas to automate a process that could be very difficult to do in Excel. My business problem is that I have two Excel files that are structured similarly but have different data, and I would like to easily understand what has changed between the two files.

Basically, I want an Excel diff tool.

Here is a snapshot of the type of data I’m looking at:

account number | name | street | city | state | postal code
935480 | Bruen Group | 5131 Nienow Viaduct Apt. 290 | Port Arlie | Alabama | 14118
371770 | Cruickshank-Boyer | 839 Lana Expressway Suite 234 | South Viviana | Alabama | 57838
548367 | Spencer, Grady and Herman | 65387 Lang Circle Apt. 516 | Greenholtbury | Alaska | 58394
296620 | Schamberger, Hagenes and Brown | 26340 Ferry Neck Apt. 612 | McCulloughstad | Alaska | 74052
132971 | Williamson, Schumm and Hettinger | 89403 Casimer Spring | Jeremieburgh | Arkansas | 62785

In this example, I have two customer address lists and I would like to understand:

  • which customers are new
  • which customers are removed
  • which customers have changed information between the two files

You can envision this being fairly useful when auditing changes in a system or potentially providing a list of changes so you can have your sales team contact new customers.

Research

My first thought was that I wanted to evaluate existing tools that could easily perform a diff on two Excel files. I did some Google searching and found a Stack Overflow discussion on the topic.

There are some decent solutions in the thread but nothing that I felt would meet my requirements. One of my requirements is that I’d like to make it as easy as possible to replicate for someone who may not be very technically inclined. Before pandas, I might have created a script to loop through each file and do my comparison. However, I thought that I might be able to come up with a better solution using pandas. In hindsight, this was a useful exercise to help me understand more about working with pandas.

Once I decided to work with pandas, I did another search and found a Stack Overflow thread that looked like a good start.

First Attempt

Like I did in my previous article, I am using an IPython notebook to test out my solution. If you would like to follow along, here are sample-address-1 and sample-address-2.

The first step is my normal imports.

Next, read both of our Excel files into DataFrames.

Order by account number and reindex so that it stays this way.
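The import, read and sort steps can be sketched as follows. The original listing is not reproduced here, so this is only an approximation: the file extension in the comment is an assumption, and small hand-typed frames stand in for the workbooks so the example runs without them.

```python
import pandas as pd

# In the article each frame comes from a workbook, for example:
#   df1 = pd.read_excel("sample-address-1.xlsx")  # file extension assumed
# Small hand-typed frames are used here so the sketch runs without the files.
df1 = pd.DataFrame({
    "account number": [935480, 132971, 371770],
    "name": ["Bruen Group", "Williamson, Schumm and Hettinger", "Cruickshank-Boyer"],
})
df2 = pd.DataFrame({
    "account number": [371770, 935480, 132971],
    "name": ["Cruickshank-Boyer", "Bruen Group", "Williamson, Schumm and Hettinger"],
})

# Sort by the key, then rebuild a clean 0..n-1 index so that row i of one
# frame lines up with row i of the other
df1 = df1.sort_values("account number").reset_index(drop=True)
df2 = df2.sort_values("account number").reset_index(drop=True)
```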

Create a diff function to show what the changes are.
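A diff helper of this kind might look like the sketch below. It takes a pair of values (old, new) and produces the "old ---> new" text seen in the output tables; the exact body is an assumption based on that output.

```python
def report_diff(x):
    """Return the value unchanged when both entries match, else 'old ---> new'."""
    old_value, new_value = x
    if old_value == new_value:
        return old_value
    return f"{old_value} ---> {new_value}"


print(report_diff((64318, 64918)))
print(report_diff(("Wyoming", "Wyoming")))
```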

Merge the two datasets together in a Panel. I will admit that I haven’t fully grokked the panel concept yet, but the only way to learn is to keep pressing on!

Once the data is in a panel, we use the report_diff function to highlight all the changes. I think this is a very intuitive way (for this data set) to show changes. It is relatively simple to see what the old value is and the new one. For example, someone could easily check and see why that postal code changed for account number 880043.
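pd.Panel has been removed from current pandas releases, so the sketch below swaps in a plain column-by-column comparison instead of the Panel approach described above. It assumes both frames have the same shape, the same columns and the same row order; the sample rows are taken from the tables in this article.

```python
import pandas as pd


def report_diff(x):
    """Return the value unchanged when both entries match, else 'old ---> new'."""
    return x[0] if x[0] == x[1] else f"{x[0]} ---> {x[1]}"


df1 = pd.DataFrame({
    "account number": [880043, 899885],
    "name": ["Beatty Inc", "Kessler and Sons"],
    "postal code": [64318, 37996],
})
df2 = pd.DataFrame({
    "account number": [880043, 899885],
    "name": ["Beatty Inc", "Kessler and Sons"],
    "postal code": [64918, 37996],
})

# Pair each cell of df1 with the cell at the same position in df2
diff_output = pd.DataFrame(
    {col: [report_diff(pair) for pair in zip(df1[col], df2[col])]
     for col in df1.columns},
    index=df1.index,
)
print(diff_output)
```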

index | account number | name | street | city | state | postal code
95 | 677936 | Hodkiewicz-Koch | 604 Lemke Knoll Suite 661 | East Laurence | Wisconsin | 98576
96 | 880043 | Beatty Inc | 3641 Schaefer Isle Suite 171 | North Gardnertown | Wyoming | 64318 ---> 64918
97 | 899885 | Kessler and Sons | 356 Johnson Isle Suite 991 | Casiehaven | Wyoming | 37996
98 | 704567 | Yundt-Abbott | 8338 Sauer Highway | Jennyfort | Wyoming | 19932
99 | 880729 | Huels PLC | 695 Labadie Lakes Apt. 256 | Port Orland | Wyoming | 42977

One of the things we want to do is flag rows that have changes so it is easier to see the changes. We will create a has_change function and use apply to run the function against each row.
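One possible shape for has_change is sketched below: it looks for the diff arrow in any cell of the row. The function body is an assumption; only its name and the use of apply come from the text.

```python
import pandas as pd


def has_change(row):
    """Return 'Y' if any cell in the row contains the diff arrow, else 'N'."""
    return "Y" if any("--->" in str(value) for value in row) else "N"


diff_output = pd.DataFrame({
    "account number": [677936, 880043],
    "name": ["Hodkiewicz-Koch", "Beatty Inc"],
    "postal code": [98576, "64318 ---> 64918"],
})

# axis=1 passes one row at a time to has_change
diff_output["has_change"] = diff_output.apply(has_change, axis=1)
print(diff_output)
```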

index | account number | name | street | city | state | postal code | has_change
95 | 677936 | Hodkiewicz-Koch | 604 Lemke Knoll Suite 661 | East Laurence | Wisconsin | 98576 | N
96 | 880043 | Beatty Inc | 3641 Schaefer Isle Suite 171 | North Gardnertown | Wyoming | 64318 ---> 64918 | Y
97 | 899885 | Kessler and Sons | 356 Johnson Isle Suite 991 | Casiehaven | Wyoming | 37996 | N
98 | 704567 | Yundt-Abbott | 8338 Sauer Highway | Jennyfort | Wyoming | 19932 | N
99 | 880729 | Huels PLC | 695 Labadie Lakes Apt. 256 | Port Orland | Wyoming | 42977 | N

It is simple to show all the rows with a change by filtering on has_change:

index | account number | name | street | city | state | postal code | has_change
24 | 595932 | Kuhic, Eichmann and West | 4059 Tobias Inlet ---> 4059 Tobias St | New Rylanfurt | Illinois | 89271 | Y
30 | 558879 | Watsica Group | 95616 Enos Grove Suite 139 ---> 829 Big street | West Atlas ---> Smithtown | Iowa ---> Ohio | 47419 ---> 47919 | Y
96 | 880043 | Beatty Inc | 3641 Schaefer Isle Suite 171 | North Gardnertown | Wyoming | 64318 ---> 64918 | Y

Finally, let’s write it out to an Excel file:
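Writing the result out could be as small as the helper below. The function name save_diff and the file name my-diff-1.xlsx are placeholders, and to_excel needs an Excel writer engine such as openpyxl to be installed.

```python
import pandas as pd


def save_diff(diff_output, path="my-diff-1.xlsx"):
    """Write the diff table to an Excel workbook without the row index.

    The default file name is a placeholder chosen for this sketch.
    """
    diff_output.to_excel(path, index=False)


# Usage, once diff_output has been built:
#   save_diff(diff_output)
```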

Here is a simple program that does what I’ve just shown:
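The original listing is not reproduced here. The following is a compact sketch of the first attempt under the same assumptions as the steps above (both frames hold the same accounts, and a column-wise comparison stands in for the removed Panel class); diff_same_shape is a name invented for the example.

```python
import pandas as pd


def report_diff(x):
    """Return the value unchanged when both entries match, else 'old ---> new'."""
    return x[0] if x[0] == x[1] else f"{x[0]} ---> {x[1]}"


def has_change(row):
    """Return 'Y' if any cell in the row contains the diff arrow, else 'N'."""
    return "Y" if any("--->" in str(value) for value in row) else "N"


def diff_same_shape(df1, df2, key="account number"):
    """Cell-by-cell diff of two frames that hold the same accounts."""
    df1 = df1.sort_values(key).reset_index(drop=True)
    df2 = df2.sort_values(key).reset_index(drop=True)
    out = pd.DataFrame(
        {col: [report_diff(pair) for pair in zip(df1[col], df2[col])]
         for col in df1.columns}
    )
    out["has_change"] = out.apply(has_change, axis=1)
    return out


old = pd.DataFrame({"account number": [880043, 677936],
                    "name": ["Beatty Inc", "Hodkiewicz-Koch"],
                    "postal code": [64318, 98576]})
new = pd.DataFrame({"account number": [880043, 677936],
                    "name": ["Beatty Inc", "Hodkiewicz-Koch"],
                    "postal code": [64918, 98576]})

result = diff_same_shape(old, new)
print(result[result["has_change"] == "Y"])
```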

Scaling Up

I have to be honest, I was feeling pretty good, so I decided to run this on a more complex dataset and see what happened. I’ll spare you the steps but show you the output:

index | account number | name | street | city | state | postal code
19 | 878977.0 ---> 869125 | Swift PLC ---> Wiza LLC | 5605 Hodkiewicz Views ---> 9824 Noemi Harbors | Summerfurt ---> North Tristin | Vermont ---> Maine | 98029.0 ---> 98114
20 | 880043.0 ---> 875910 | Beatty Inc ---> Lowe, Tremblay and Bruen | 3641 Schaefer Isle Suite 171 ---> 3722 Tatyana… | North Gardnertown ---> Selmafurt | Wyoming ---> NorthDakota | 64318.0 ---> 17496
21 | 880729.0 ---> 878977 | Huels PLC ---> Swift PLC | 695 Labadie Lakes Apt. 256 ---> 5605 Hodkiewic… | Port Orland ---> Summerfurt | Wyoming ---> Vermont | 42977.0 ---> 98029
22 | nan ---> 880043 | nan ---> Beatty Inc | nan ---> 3641 Schaefer Isle Suite 171 | nan ---> North Gardnertown | nan ---> Wyoming | nan ---> 64318
23 | nan ---> 880729 | nan ---> Huels PLC | nan ---> 695 Labadie Lakes Apt. 256 | nan ---> Port Orland | nan ---> Wyoming | nan ---> 42977

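Hmmm. This isn’t going to work, is it? Because the comparison is positional, a single added or removed account shifts every later row, so nearly every cell shows up as changed.

I am going to rethink this and see if I can come up with an approach that will scale on a bigger data set.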

Second Attempt

I will use a similar approach but build it out to show more details on the changes and make the solution more robust for bigger data sets. Here are the data sets for those interested: sample-address-new and sample-address-old.

Start with the standard imports.

We will define our report_diff function like we did in the previous exercise.

Read in the values in the two different sheets.

Label the two data sets so that when we combine them, we know which is which.

We can look at the data to see what the format looks like and how many records we ended up with.

index | account number | name | street | city | state | postal code | version
0 | 935480 | Bruen and Jones Group | 5131 Nienow Viaduct Apt. 290 | Port Arlie | Alabama | 14118 | new
1 | 371770 | Cruickshank-Boyer | 839 Lana Expressway Suite 234 | South Viviana | Alabama | 57838 | new
2 | 548367 | Spencer, Grady and Herman | 65387 Lang Circle Apt. 516 | Greenholtbury | Alaska | 58394 | new
3 | 132971 | Williamson, Schumm and Hettinger | 89403 Casimer Spring | Jeremieburgh | Arkansas | 6278 | new
4 | 985603 | Bosco-Upton | 89 Big Street | Small Town | Texas | 19033 | new

index | account number | name | street | city | state | postal code | version
0 | 935480 | Bruen Group | 5131 Nienow Viaduct Apt. 290 | Port Arlie | Alabama | 14118 | old
1 | 371770 | Cruickshank-Boyer | 839 Lana Expressway Suite 234 | South Viviana | Alabama | 57838 | old
2 | 548367 | Spencer, Grady and Herman | 65387 Lang Circle Apt. 516 | Greenholtbury | Alaska | 58394 | old
3 | 296620 | Schamberger, Hagenes and Brown | 26340 Ferry Neck Apt. 612 | McCulloughstad | Alaska | 74052 | old
4 | 132971 | Williamson, Schumm and Hettinger | 89403 Casimer Spring | Jeremieburgh | Arkansas | 62785 | old

We will add all the data together into a new table, with the old records first and the new records after them.
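The labelling and combining steps can be sketched like this, with three sample rows per version standing in for the real sheets. The variable names old, new and full_set are assumptions.

```python
import pandas as pd

old = pd.DataFrame({
    "account number": [935480, 371770, 296620],
    "name": ["Bruen Group", "Cruickshank-Boyer", "Schamberger, Hagenes and Brown"],
})
new = pd.DataFrame({
    "account number": [935480, 371770, 121213],
    "name": ["Bruen and Jones Group", "Cruickshank-Boyer", "Bauch-Goldner"],
})

# Tag every row with the data set it came from
old["version"] = "old"
new["version"] = "new"

# Stack old on top of new; ignore_index gives one continuous 0..n-1 index
full_set = pd.concat([old, new], ignore_index=True)
print(full_set)
```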

As expected, the full set includes 46 records.

index | account number | name | street | city | state | postal code | version
0 | 935480 | Bruen Group | 5131 Nienow Viaduct Apt. 290 | Port Arlie | Alabama | 14118 | old
1 | 371770 | Cruickshank-Boyer | 839 Lana Expressway Suite 234 | South Viviana | Alabama | 57838 | old
2 | 548367 | Spencer, Grady and Herman | 65387 Lang Circle Apt. 516 | Greenholtbury | Alaska | 58394 | old
3 | 296620 | Schamberger, Hagenes and Brown | 26340 Ferry Neck Apt. 612 | McCulloughstad | Alaska | 74052 | old
4 | 132971 | Williamson, Schumm and Hettinger | 89403 Casimer Spring | Jeremieburgh | Arkansas | 62785 | old

index | account number | name | street | city | state | postal code | version
41 | 869125 | Wiza LLC | 9824 Noemi Harbors | North Tristin | Maine | 98114 | new
42 | 875910 | Lowe, Tremblay and Bruen | 3722 Tatyana Springs Apt. 464 | Selmafurt | NorthDakota | 17496 | new
43 | 878977 | Swift PLC | 5605 Hodkiewicz Views | Summerfurt | Vermont | 98029 | new
44 | 880043 | Beatty Inc | 3641 Schaefer Isle Suite 171 | North Gardnertown | Wyoming | 64318 | new
45 | 880729 | Huels PLC | 695 Labadie Lakes Apt. 256 | Port Orland | Wyoming | 42977 | new

We use drop_duplicates to get rid of the obvious rows where there has not been any change. Note that we keep the last one using take_last=True (spelled keep='last' in current pandas) so we can tell which accounts have been removed in the new data set.

One interesting note about drop_duplicates: you can specify which columns you care about. This functionality is really useful if you have extra columns (say sales, or notes) that you expect to change but don’t really care about for these purposes.
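A minimal sketch of the de-duplication step follows. It uses keep="last", the current spelling of the older take_last=True, and passes subset so that the version label is ignored when comparing rows. The sample data and the name compare_cols are assumptions.

```python
import pandas as pd

full_set = pd.DataFrame({
    "account number": [935480, 371770, 296620, 935480, 371770, 121213],
    "name": ["Bruen Group", "Cruickshank-Boyer", "Schamberger, Hagenes and Brown",
             "Bruen and Jones Group", "Cruickshank-Boyer", "Bauch-Goldner"],
    "postal code": [14118, 57838, 74052, 14118, 57838, 49681],
    "version": ["old", "old", "old", "new", "new", "new"],
})

# Only these columns decide whether two rows count as the same record
compare_cols = ["account number", "name", "postal code"]

# Unchanged accounts collapse to a single row; keep="last" keeps the 'new' copy
changes = full_set.drop_duplicates(subset=compare_cols, keep="last")
print(changes)
```

Account 371770 is identical in both versions, so only its "new" row survives, while the changed account 935480 keeps both rows.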

We have cut down our data set to 28 records.

Sort and take a look at what the data looks like. If you look at account number 132971, you can get an idea of how the data is structured.

index | account number | name | street | city | state | postal code | version
27 | 121213 | Bauch-Goldner | 7274 Marissa Common | Shanahanchester | California | 49681 | new
4 | 132971 | Williamson, Schumm and Hettinger | 89403 Casimer Spring | Jeremieburgh | Arkansas | 62785 | old
25 | 132971 | Williamson, Schumm and Hettinger | 89403 Casimer Spring | Jeremieburgh | Arkansas | 6278 | new
28 | 214098 | Goodwin, Homenick and Jerde | 649 Cierra Forks Apt. 078 | Rosaberg | Colorado | 47743 | new
3 | 296620 | Schamberger, Hagenes and Brown | 26340 Ferry Neck Apt. 612 | McCulloughstad | Alaska | 74052 | old

Use the get_duplicates function to get a list of all the account numbers which are duplicated.

Get a list of all the dupes into one frame using isin.
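Index.get_duplicates is no longer available in recent pandas releases, so this sketch uses Series.duplicated to build the same list of account numbers and then isin to collect the matching rows. The sample frame is a stand-in for the de-duplicated data.

```python
import pandas as pd

changes = pd.DataFrame({
    "account number": [935480, 296620, 935480, 371770, 121213],
    "name": ["Bruen Group", "Schamberger, Hagenes and Brown",
             "Bruen and Jones Group", "Cruickshank-Boyer", "Bauch-Goldner"],
    "version": ["old", "old", "new", "new", "new"],
})

# Account numbers that still appear more than once have an old and a new row
is_repeat = changes["account number"].duplicated()
dupe_accts = changes.loc[is_repeat, "account number"].unique().tolist()

# Pull every row (old and new) for those accounts into one frame
dupes = changes[changes["account number"].isin(dupe_accts)]
print(dupes)
```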

index | account number | name | street | city | state | postal code | version
0 | 935480 | Bruen Group | 5131 Nienow Viaduct Apt. 290 | Port Arlie | Alabama | 14118 | old
4 | 132971 | Williamson, Schumm and Hettinger | 89403 Casimer Spring | Jeremieburgh | Arkansas | 62785 | old
5 | 985603 | Bosco-Upton | 03369 Moe Way | Port Casandra | Arkansas | 86014 | old
22 | 935480 | Bruen and Jones Group | 5131 Nienow Viaduct Apt. 290 | Port Arlie | Alabama | 14118 | new
25 | 132971 | Williamson, Schumm and Hettinger | 89403 Casimer Spring | Jeremieburgh | Arkansas | 6278 | new
26 | 985603 | Bosco-Upton | 89 Big Street | Small Town | Texas | 19033 | new

We need two data frames of the same size, so split them into a new and an old version.

Drop the version columns since we don’t need them any more.
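Splitting on the version label and then dropping it can be sketched as below. The names change_old and change_new are assumptions, and the sample rows come from the tables in this article.

```python
import pandas as pd

dupes = pd.DataFrame({
    "account number": [935480, 132971, 935480, 132971],
    "name": ["Bruen Group", "Williamson, Schumm and Hettinger",
             "Bruen and Jones Group", "Williamson, Schumm and Hettinger"],
    "postal code": [14118, 62785, 14118, 6278],
    "version": ["old", "old", "new", "new"],
})

# One frame per version; the label has done its job, so drop it
change_old = dupes[dupes["version"] == "old"].drop(columns="version")
change_new = dupes[dupes["version"] == "new"].drop(columns="version")
print(change_old)
```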

index | account number | name | street | city | state | postal code
0 | 935480 | Bruen Group | 5131 Nienow Viaduct Apt. 290 | Port Arlie | Alabama | 14118
4 | 132971 | Williamson, Schumm and Hettinger | 89403 Casimer Spring | Jeremieburgh | Arkansas | 62785
5 | 985603 | Bosco-Upton | 03369 Moe Way | Port Casandra | Arkansas | 86014

Index on the account number.

account number | name | street | city | state | postal code
935480 | Bruen and Jones Group | 5131 Nienow Viaduct Apt. 290 | Port Arlie | Alabama | 14118
132971 | Williamson, Schumm and Hettinger | 89403 Casimer Spring | Jeremieburgh | Arkansas | 6278
985603 | Bosco-Upton | 89 Big Street | Small Town | Texas | 19033

account number | name | street | city | state | postal code
935480 | Bruen Group | 5131 Nienow Viaduct Apt. 290 | Port Arlie | Alabama | 14118
132971 | Williamson, Schumm and Hettinger | 89403 Casimer Spring | Jeremieburgh | Arkansas | 62785
985603 | Bosco-Upton | 03369 Moe Way | Port Casandra | Arkansas | 86014

Run our diff process like we did in our first attempt, now that we have the data structured in the way we need.
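With both frames indexed on the account number, the diff can be sketched without Panel (which current pandas no longer provides) by sorting both indexes and comparing column by column. This assumes the two frames contain exactly the same account numbers, which holds here because both were cut from the list of duplicated accounts.

```python
import pandas as pd


def report_diff(x):
    """Return the value unchanged when both entries match, else 'old ---> new'."""
    return x[0] if x[0] == x[1] else f"{x[0]} ---> {x[1]}"


change_old = pd.DataFrame({
    "account number": [935480, 132971],
    "name": ["Bruen Group", "Williamson, Schumm and Hettinger"],
    "postal code": [14118, 62785],
}).set_index("account number")
change_new = pd.DataFrame({
    "account number": [132971, 935480],
    "name": ["Williamson, Schumm and Hettinger", "Bruen and Jones Group"],
    "postal code": [6278, 14118],
}).set_index("account number")

# Sorting both indexes guarantees that position i refers to the same account
change_old = change_old.sort_index()
change_new = change_new.sort_index()

diff_output = pd.DataFrame(
    {col: [report_diff(pair) for pair in zip(change_old[col], change_new[col])]
     for col in change_old.columns},
    index=change_old.index,
)
print(diff_output)
```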

account number | name | street | city | state | postal code
935480 | Bruen Group ---> Bruen and Jones Group | 5131 Nienow Viaduct Apt. 290 | Port Arlie | Alabama | 14118
132971 | Williamson, Schumm and Hettinger | 89403 Casimer Spring | Jeremieburgh | Arkansas | 62785 ---> 6278
985603 | Bosco-Upton | 03369 Moe Way ---> 89 Big Street | Port Casandra ---> Small Town | Arkansas ---> Texas | 86014 ---> 19033

Looks pretty good!

We know our diff; now we need to figure out which accounts were removed in the new list. We need to find records from the “old” version that are no longer in the “new” version.
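One way to express this, assuming the de-duplicated frame from the earlier step, is to flag accounts that still appear twice and keep the "old" rows that are not flagged. The sample data is a stand-in, and duplicated(keep=False) is used here to build the flag.

```python
import pandas as pd

changes = pd.DataFrame({
    "account number": [935480, 296620, 935480, 371770, 121213],
    "name": ["Bruen Group", "Schamberger, Hagenes and Brown",
             "Bruen and Jones Group", "Cruickshank-Boyer", "Bauch-Goldner"],
    "version": ["old", "old", "new", "new", "new"],
})

# True for every row whose account number appears more than once
changes["duplicate"] = changes["account number"].duplicated(keep=False)

# An 'old' row with no partner has no counterpart in the new list
removed_accounts = changes[~changes["duplicate"] & (changes["version"] == "old")]
print(removed_accounts)
```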

index | account number | name | street | city | state | postal code | version | duplicate
3 | 296620 | Schamberger, Hagenes and Brown | 26340 Ferry Neck Apt. 612 | McCulloughstad | Alaska | 74052 | old | False

The final portion is figuring out which accounts are new.

We will go back to the full set and take only the first duplicate row.

index | account number | name | street | city | state | postal code | version
0 | 935480 | Bruen Group | 5131 Nienow Viaduct Apt. 290 | Port Arlie | Alabama | 14118 | old
1 | 371770 | Cruickshank-Boyer | 839 Lana Expressway Suite 234 | South Viviana | Alabama | 57838 | old
2 | 548367 | Spencer, Grady and Herman | 65387 Lang Circle Apt. 516 | Greenholtbury | Alaska | 58394 | old
3 | 296620 | Schamberger, Hagenes and Brown | 26340 Ferry Neck Apt. 612 | McCulloughstad | Alaska | 74052 | old
4 | 132971 | Williamson, Schumm and Hettinger | 89403 Casimer Spring | Jeremieburgh | Arkansas | 62785 | old

Add a duplicate column again.

index | account number | name | street | city | state | postal code | version | duplicate
0 | 935480 | Bruen Group | 5131 Nienow Viaduct Apt. 290 | Port Arlie | Alabama | 14118 | old | True
1 | 371770 | Cruickshank-Boyer | 839 Lana Expressway Suite 234 | South Viviana | Alabama | 57838 | old | False
2 | 548367 | Spencer, Grady and Herman | 65387 Lang Circle Apt. 516 | Greenholtbury | Alaska | 58394 | old | False
3 | 296620 | Schamberger, Hagenes and Brown | 26340 Ferry Neck Apt. 612 | McCulloughstad | Alaska | 74052 | old | False
4 | 132971 | Williamson, Schumm and Hettinger | 89403 Casimer Spring | Jeremieburgh | Arkansas | 62785 | old | True

We want to find the accounts that aren’t duplicated and are only in the new data set.
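The added-accounts step can be sketched as follows: de-duplicate the full set keeping the first copy (so unchanged accounts stay labelled "old"), flag account numbers that still appear twice, and keep the unflagged "new" rows. The sample data and variable names are assumptions.

```python
import pandas as pd

full_set = pd.DataFrame({
    "account number": [935480, 371770, 296620, 935480, 371770, 121213],
    "name": ["Bruen Group", "Cruickshank-Boyer", "Schamberger, Hagenes and Brown",
             "Bruen and Jones Group", "Cruickshank-Boyer", "Bauch-Goldner"],
    "version": ["old", "old", "old", "new", "new", "new"],
})

# keep="first" leaves unchanged accounts with their 'old' label
new_account_set = full_set.drop_duplicates(
    subset=["account number", "name"], keep="first"
).copy()

# Flag accounts that exist in both versions with different details
new_account_set["duplicate"] = new_account_set["account number"].duplicated(keep=False)

# A 'new' row that is not flagged exists only in the new data set
added_accounts = new_account_set[
    ~new_account_set["duplicate"] & (new_account_set["version"] == "new")
]
print(added_accounts)
```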

Let’s look at all the new accounts we have added:

index | account number | name | street | city | state | postal code | version | duplicate
27 | 121213 | Bauch-Goldner | 7274 Marissa Common | Shanahanchester | California | 49681 | new | False
28 | 214098 | Goodwin, Homenick and Jerde | 649 Cierra Forks Apt. 078 | Rosaberg | Colorado | 47743 | new | False
29 | 566618 | Greenfelder, Wyman and Harris | 17557 Romaguera Field | South Tamica | Colorado | 50037 | new | False

Finally we can save all of this into three different sheets in an Excel file.
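Saving three frames to three sheets might look like the helper below. The function name save_report, the sheet names and the .xlsx extension are assumptions made for this sketch, and an Excel writer engine such as openpyxl must be installed.

```python
import pandas as pd


def save_report(diff_output, removed_accounts, added_accounts, path="my-diff-2.xlsx"):
    """Write the three result frames to separate sheets of one workbook.

    Sheet names and the default file extension are assumptions for this sketch.
    """
    with pd.ExcelWriter(path) as writer:
        # The diff frame is indexed by account number, so keep its index
        diff_output.to_excel(writer, sheet_name="changed")
        removed_accounts.to_excel(writer, sheet_name="removed", index=False)
        added_accounts.to_excel(writer, sheet_name="added", index=False)


# Usage, once the three frames exist:
#   save_report(diff_output, removed_accounts, added_accounts)
```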

Here is a full streamlined code example:
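The original listing is not reproduced here. The sketch below condenses the second attempt into one function under a few assumptions: the key column is unique within each version, Panel is replaced by a column-wise comparison, and added accounts are found by checking the key against the old data rather than by a second de-duplication pass. The name diff_frames is invented for the example.

```python
import pandas as pd


def report_diff(x):
    """Return the value unchanged when both entries match, else 'old ---> new'."""
    return x[0] if x[0] == x[1] else f"{x[0]} ---> {x[1]}"


def diff_frames(old, new, key="account number"):
    """Return (changed, removed, added) frames for two versions of a table."""
    cols = list(old.columns)
    full_set = pd.concat(
        [old.assign(version="old"), new.assign(version="new")], ignore_index=True
    )
    # Rows identical in both versions collapse to a single 'new' row
    changes = full_set.drop_duplicates(subset=cols, keep="last")
    # Keys still present twice have differing old and new rows
    is_dupe = changes[key].duplicated(keep=False)
    dupes = changes[is_dupe].drop(columns="version")
    version = changes.loc[is_dupe, "version"]

    change_old = dupes[version == "old"].set_index(key).sort_index()
    change_new = dupes[version == "new"].set_index(key).sort_index()
    changed = pd.DataFrame(
        {c: [report_diff(p) for p in zip(change_old[c], change_new[c])]
         for c in change_old.columns},
        index=change_old.index,
    )
    removed = changes[~is_dupe & (changes["version"] == "old")][cols]
    added = changes[
        ~is_dupe & (changes["version"] == "new") & ~changes[key].isin(old[key])
    ][cols]
    return changed, removed, added


old = pd.DataFrame({
    "account number": [935480, 371770, 296620, 132971],
    "name": ["Bruen Group", "Cruickshank-Boyer",
             "Schamberger, Hagenes and Brown", "Williamson, Schumm and Hettinger"],
    "postal code": [14118, 57838, 74052, 62785],
})
new = pd.DataFrame({
    "account number": [935480, 371770, 132971, 121213],
    "name": ["Bruen and Jones Group", "Cruickshank-Boyer",
             "Williamson, Schumm and Hettinger", "Bauch-Goldner"],
    "postal code": [14118, 57838, 6278, 49681],
})

changed, removed, added = diff_frames(old, new)
print(changed, removed, added, sep="\n\n")
```

The three returned frames correspond to the three sheets described above and can be passed to an Excel writer as a final step.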

Here is the final output Excel file: my-diff-2

Conclusion

I would not be surprised if someone looks at this and finds a simpler way to do this. However, the final code is relatively straightforward and with minimal tweaks could be applied to your custom data set. I also think this was a good exercise for me to walk through and learn more about the various pandas functions and how to use them to solve my real world problem.

I hope it is as helpful to you as it was to me!

Changes

  • 28-Jan-2019: New and updated code is available in a new article

Comments

The EasyXLS Excel library can be used to export Excel files with Python on Windows, Linux, Mac or other operating systems. The integration varies depending on the operating system and on whether the bridge for .NET Framework or for Java is chosen:

1. EasyXLS on Windows using .NET Framework with Python


2. EasyXLS on Linux, Mac, Windows using Java with Python

EasyXLS on Windows using .NET Framework with Python

If you opt for the .NET version of EasyXLS, the code below requires Pythonnet, a bridge between Python and .NET Framework.

Step 1: Download and install EasyXLS Excel Library for .NET

Download the trial version of EasyXLS Excel Library.

If you already own a license key, you may log in and download EasyXLS from your account.

Step 2: Install Pythonnet

To install Pythonnet, run the pip command as follows. Pip is a package-management system used to install and manage software packages written in Python.
<Python installation path>\Scripts>pip install 'pythonnet.whl'

Step 3: Include EasyXLS library into project

EasyXLS.dll must be added to your project. After installing EasyXLS, EasyXLS.dll can be found in the 'Dot NET version' folder.

Step 4: Run Python code that merges cells in Excel sheet

Execute the following Python code that exports an Excel file with merged cells.

EasyXLS on Linux, Mac, Windows using Java with Python

If you opt for the Java version of EasyXLS, code similar to the above requires Py4J, Pyjnius or any other bridge between Python and Java.

Step 1: Download and install EasyXLS Excel Library for Java

Download the trial version of EasyXLS Excel Library.

If you already own a license key, you may log in and download EasyXLS from your account.

Step 2: Install Py4J

To install Py4J, run the pip command as follows. Pip is a package-management system used to install and manage software packages written in Python.
<Python installation path>\Scripts>pip install 'py4j.whl'

Step 3: Create additional Java program

The following Java code needs to be running in the background prior to executing the Python code.


Step 4: Add py4j library to CLASSPATH

py4j.jar must be added to the classpath of the additional Java program. After installing Py4J, py4j.jar can be found in the '<Python installation path>\share\py4j' folder.

Step 5: Add EasyXLS library to CLASSPATH

EasyXLS.jar must be added to the classpath of the additional Java program. After installing EasyXLS, EasyXLS.jar can be found in the 'Lib' folder.

Step 6: Run additional Java program

Start the gateway server application; this implicitly starts the Java Virtual Machine as well.

Step 7: Run Python code that merges cells in Excel sheet

Execute Python code like the following, which exports an Excel file with merged cells.

Related sections

See also

  • How to format Excel cells?
  • How to export to XLSX file?
  • How to export to XLSM file?
  • How to export to XLSB file?
  • How to export to XLS file?

Related methods

ExcelTable.easy_mergeCells
ExcelTable.easy_removeCellMerging
ExcelTable.MergeCellRangesCount
ExcelTable.easy_getCellMergingFirstRow
ExcelTable.easy_getCellMergingFirstCol
ExcelTable.easy_getCellMergingLastRow
ExcelTable.easy_getCellMergingLastCol