Data Merging

  1. Data Merging
  2. Data Merging Process
  3. Data Merging In Stata
  4. Data Merging Definition

The Data Cleansing Group is an Australian provider of data merging and data migration services. We are passionate data geeks who will do whatever it takes to solve your data problems.

In this tutorial, you can use Power Query's Query Editor to import data from a local Excel file that contains product information and from an OData feed that contains product order information. You perform transformation and aggregation steps, and combine data from both sources to produce a 'Total Sales per Product and Year' report.

  1. Data Merging & Data Migration Services. The Data Cleansing Group is an Australian provider of data merging and data migration services.We are passionate data geeks who will do whatever it takes to solve your data problems. Our services include extracting data from legacy or modern systems, restructuring, reorganising and merging the data into a homogeneous format.
  2. To merge two data frames (datasets) horizontally, use the merge function. In most cases, you join two data frames by one or more common key variables (i.e., an inner join). # merge two data frames by ID total.

Our services include extracting data from legacy or modern systems, restructuring, reorganising and merging the data into a homogeneous format. In addition, this data can be cleansed, de-duplicated and verified before being exported into your target system’s format.

Sometimes Easy, Sometimes Difficult…

Data merging & migration is a serious task due to the way that data is uniquely defined, structured and identified within the source and target systems. These issues are further aggravated when multiple sources are merged together. Typical problems include:

Data Merging

  • Contradicting Records: (eg “Part #: 123; Description: Widgit”, “Part #: 123; Description: Shiny Widgit” – The same item had different definitions
  • Cryptic Abbreviations: (eg occupation: Mkt. D. B)
  • Duplicated Records: (eg. Employee: John Smith, J. Smith, J.A. Smith) – Same employee included three times
  • Embedded Values: Multiple values included in a single field (eg First Name=”Bob Smith Melbourne”
  • Illegal Values: (eg. an employees date of birth is 1 Jan 1860) – Values outside of the domain’s range.
  • Incorrect References: (eg. John Smith, Dept: 27) . Whilst Dept: 27 is defined, Mr Smith actually works in dept 41)
  • Measurement Units: Cost = 10000, but this can be in dollars in System 1 but in cents or another currency in System 2
  • Misfielded Values: (eg: City=”Australia”). Correct value placed in the incorrect field
  • Missing Values: (eg data not provided or has been replaced with null or dummy data)
  • Misspellings: (eg typos, transposed values and spelling mistakes)
  • Naming Conflicts: Same name used to describe different fields in the different systems, or different names describe the same field.
  • Value Representations: (Eg Male, Female vs M,F vs 1,2
  • Violated Dependencies (eg Age = 20, but date of birth = 1 Jan 1965) – In this case, the employees age does not equal the current date minus the date of birth)
  • Violated References: ( eg: “Name: Bob Smith; Department:199” but department 199 is not defined in the old system)
  • Violated Uniqueness: (eg: different employees with the same tax file number). Tax file numbers are unique for every individual
  • Word Transpositions: (eg: ‘Smith John’ instead of ‘John Smith’)

The seriousness of the above problems depend on the project brief. Some clients just want their data migrated “warts and all” to the target system. And once it’s in the target system, they continue as before or initiate an in-house data quality project. These tend to be quick and easy migration projects.

For other clients, the brief is that the data must be as-correct-as-it-can-practically-be. These tend to be more challenging and time consuming projects.

In either case, however one requires a sound data migration strategy to ensure a cost effective solution.

Data Merging Process

A Logical Approach to Data Migration

A typical data migration and merging project includes the following phases:

  1. Data and Schema Analysis:
    • Data profiling provides an indication of data quality, and includes information such as data type, length, value range, discrete values and their frequency, uniqueness, typical data pattern, etc.
    • Data mining helps discover integrity constraints and business rules. These can be used to correct illegal values, identify duplicate records and rebuild missing values across data sources.
    • Schema review through discussions with domain experts, or system support forums, to obtain insight into the definition, structure and identities within the source and target data. Where this is not possible, we rely on the data mining results.
  2. Define data formats, mapping rules and transformation workflows: Results from Step 1 above are translated into a series of programmable instructions. Their purpose is to convert the data from its source to target format, plus resolve any associated data quality issues. These transformation steps may require user input where there is no applicable logic.
  3. Verification: The accuracy and effectiveness of the transformation steps are evaluated on the source data and the transformational steps tweaked if required. Steps 2 & 3 may require multiple iterations, since some errors may only become apparent after applying the initial transformations.
  4. Transformation: The verified transformation steps are applied. Any errors requiring manual identification are identified and resolved.
  5. Export cleaned data and import into target system: The cleaned and transformed data is exported, and imported into the target system.
  6. Test, Test, Test: Key reports from the target system are compared to those from the source system to ensure that the data was correctly transformed and uploaded. Significant errors may require a reiteration of Steps 1 to 6.

Typical Data Merging & Migration Projects

Our custom database migration tools and VBA scripts enable us to quickly automate many laborious processes, providing you with a high quality, cost effective solution. This service is targeted at organisations:

  • Migrating data from old to new systems.
  • Merging data from multiple sources into an existing system.
  • Extracting data from legacy systems for archive purposes.
  • Looking to clean up their database in order to improve decision making, reduce associated data costs, or improve their data’s ROI.

Our typical data merging & migration projects include:

Data Merging
  • CRM data.
  • Financial data.
  • Enterprise Resource Planning (ERP) data.
  • Health, Safety and Environment data.
  • Maintenance data.
  • Product & inventory data.
  • Production data.
  • Project data.
  • Server & internet log data.

Give your new system the best possible chance of success by starting with clean data foundation.

Adding Columns

Data Merging In Stata

To merge two data frames (datasets) horizontally, use the merge function. In most cases, you join two data frames by one or more common key variables (i.e., an inner join).

# merge two data frames by ID
total <- merge(data frameA,data frameB,by='ID')

# merge two data frames by ID and Country
total <- merge(data frameA,data frameB,by=c('ID','Country'))

Adding Rows

To join two data frames (datasets) vertically, use the rbind function. The two data frames must have the same variables, but they do not have to be in the same order.

total <- rbind(data frameA, data frameB)

If data frameA has variables that data frameB does not, then either:

  1. Delete the extra variables in data frameA or
  2. Create the additional variables in data frameB and set them to NA (missing)

before joining them with rbind( ).

Going Further

Data Merging Definition

To practice manipulating data frames with the dplyr package, try this interactive course on data frame manipulation in R.