There are few people in the world who think data migration from an existing system to a new one is a thrilling activity (if you’re one of those people, embrace it, there’s a niche out there for you!). It’s a tedious process typically involving countless spreadsheets, one application to view the source schema/data and another to view the schema of the new data model. Furthermore, in the ESRI GIS world, data migrations are complicated by subtypes, which allow the coded value domain on a given attribute in a single table to vary depending on the subtype of each row. The title of this blog may be a bit of a stretch; there aren’t going to be fireworks or even balloons, but I’d like to show off two Jupyter notebooks I’ve written for data migration. “The Albatross” is the first notebook, built around a lot of interactive widgets; it presents the relevant information for making decisions about data migration all in one view, enabling the user to configure a data migration easily. “The Hummingbird” is the second, which Extracts/Transforms/Loads (ETL) the data efficiently (like a hummingbird flying back and forth between North and South America) into a target geodatabase.

The Old Way
Most of the design and build phases of data migrations I’ve seen revolve around hopping back and forth between tabs on a spreadsheet, or going to ArcMap/ArcCatalog to view your existing data, then over to ArcGIS Pro to review the destination database’s schema to decide what source data goes where. After the decision of what goes where is made, that information is entered into another spreadsheet (while hoping you don’t copy/paste something wrong or type a field or table name incorrectly). After the mapping spreadsheet is created, it gets handed to someone else (who may or may not have been part of the design sessions that decided what goes where) to translate into yet another application to configure/perform the migration, and/or to write code to do the ETL and any pre/post-processing that the main ETL wasn’t suited to do. This approach takes a long time just to sort thru the various data sources for the relevant information, and it’s prone to human error.
The New Way
To configure a migration within the albatross notebook, we have a series of steps to follow:
- Run a data assessment on both the source and the destination databases and then open them in the albatross.
- Map the tables/subtypes from the source to the destination.
- For each table/subtype mapping, map the fields from the source to the destination.
- For each field in the destination database with a coded value domain applied, provide any transformations for the source values so they match the codes in the destination field.
- Set default values on the fields in the destination database if needed.
In each step, the albatross presents the user with the information about the source data and destination database needed to perform that step, while hiding the extra information that isn’t relevant to the task.
To perform the migration, the user needs to open the hummingbird notebook and provide the following (a quick sketch of these inputs follows the list):
- The path to the data mapping spreadsheet that the albatross makes.
- The path to the source database.
- The path to the destination database.
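For context, here’s what those inputs might look like as a notebook cell. The variable names and paths are hypothetical, purely for illustration, not the hummingbird’s actual parameter names.

```python
# Hypothetical input cell for the hummingbird notebook; names and paths are
# illustrative only, not the notebook's real parameter names.
mapping_spreadsheet = r"C:\migration\data_mapping.xlsx"   # produced by the albatross
source_database = r"C:\data\legacy_gas.gdb"
destination_database = r"C:\data\un_asset_package.gdb"
```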
The Data Assessment
Unsurprisingly, a data migration is entirely driven by the data/schema in the source and destination databases. Before the albatross can begin to configure a migration, a summary of the schema for both the source and destination databases needs to be created, along with a summary of the data within the source system. There are various tools that can be used to generate summaries of schemas. I’ve got a Python class I wrote, which I’ve used for a few years, that reads all the information about domains, tables, feature classes, fields, and subtypes into a set of pandas data frames which can be exported to a spreadsheet. (If you’re reading this thinking that I should be storing this information in a database of some sort, not a spreadsheet: yes, I agree, that would make a lot of sense, but I wanted people who may not be very database savvy to be able to interact with the output, and just about everyone involved in a data migration knows how to work Excel.) I call these spreadsheets AGC spreadsheets (ArcGIS Content) because they contain information specific to a database in the ArcGIS world.

The class/spreadsheet’s original purpose was to make it easy to change the schema of a database or configure a database schema in a new environment, but since it already had all the necessary information about the schema packaged up into data frames, it was the logical place for me to start. I extended the class with another function that uses the ArcGIS Python API to read the tables/feature classes of a database into pandas dataframes, compute the occurrences of values in each field of each table, and store that information in a separate dataframe that can also be exported to the spreadsheet. In the albatross this is done simply by providing the path to the database, and it will create a spreadsheet that has all the schema/database content goodness neatly packaged up and ready for the albatross to work with. The content summary portion of the data assessment can take a bit of time to churn thru every table/feature class in the database, but since each table is independent, the albatross uses multiple processes to expedite the creation of the assessment.

With AGC spreadsheets created from the source and destination databases, we won’t need to open ArcMap or ArcCatalog to view the schema or get a feel for the contents of the data. We won’t even need to be on the same network as the database if we have the spreadsheet!
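To make that a bit more concrete, here’s a minimal sketch of the content-summary idea, assuming the ArcGIS API for Python and pandas are available. The function name, output layout, and database paths are my own illustration, not the actual AGC class.

```python
# A minimal sketch of the content-summary idea, NOT the actual AGC class:
# read each table/feature class into a pandas dataframe with the ArcGIS API for
# Python and tally how often each value occurs in each field, one process per table.
from concurrent.futures import ProcessPoolExecutor

import pandas as pd
from arcgis.features import GeoAccessor  # registers the DataFrame.spatial accessor


def summarize_table(table_path: str) -> pd.DataFrame:
    """Return a long-format summary of (table, field, value, count) for one table."""
    df = pd.DataFrame.spatial.from_featureclass(table_path)
    rows = []
    for field in df.columns.drop("SHAPE", errors="ignore"):
        for value, count in df[field].value_counts(dropna=False).items():
            rows.append({"table": table_path, "field": field,
                         "value": value, "count": count})
    return pd.DataFrame(rows)


if __name__ == "__main__":
    # Hypothetical feature class paths, purely for illustration.
    tables = [r"C:\data\source.gdb\Valves", r"C:\data\source.gdb\Mains"]
    # Each table is independent, so summarize them in separate processes.
    with ProcessPoolExecutor() as pool:
        summaries = list(pool.map(summarize_table, tables))
    # One sheet of an AGC-style spreadsheet could simply be this combined summary.
    pd.concat(summaries).to_excel("content_summary.xlsx", index=False)
```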
Creating the AGC Spreadsheet:

The type of information in the data assessment spreadsheet:

Table/Subtype Mapping
The first real part of configuring the migration is to open the spreadsheets within the albatross:

With the spreadsheets loaded in the albatross, we switch to table/subtype mapping mode and, for all the data that needs to be migrated, select which source table/subtype goes to which destination table/subtype. If only a subset of the source data should be migrated for some reason, a query can be applied using the widgets: select rows with specific values, use a ‘like’ statement, greater than, less than, in, not in, perform a subquery with another table, or even apply a spatial filter based on whether a feature intersects another feature. It’s like a supercharged version of a select by attributes window! At this point in the process, we aren’t concerned with what field goes where, so the albatross doesn’t even display the fields in the destination tables; field mapping comes in the next step!
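As a rough illustration (not the albatross’s actual query builder or configuration format), here’s how a few of those filter types could translate to operations on a pandas dataframe once the source table has been read in; the column names and values are made up.

```python
import pandas as pd

# Invented source rows purely for demonstration.
source = pd.DataFrame({
    "DIAMETER": [2, 4, 6, 8],
    "MATERIAL": ["PVC", "STEEL", "CAST IRON", "PVC COATED"],
    "STATUS": ["IN SERVICE", "ABANDONED", "IN SERVICE", "PROPOSED"],
})

# 'in', 'like', and 'greater than' style filters combined into one subset,
# similar in spirit to what the query widgets configure.
subset = source[
    source["STATUS"].isin(["IN SERVICE", "PROPOSED"])   # in (...)
    & source["MATERIAL"].str.contains("PVC")             # like '%PVC%'
    & (source["DIAMETER"] > 2)                           # greater than
]
print(subset)
```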

Field Mapping
After all the table/subtypes are mapped, we can begin the field mapping. In this step, for each table/subtype mapping, we map the fields in the source data to where they belong in the destination table. When you click on a source field, it shows the field type information and the feature counts for the existing source data. When you click a destination field, any field property differences between the source and the destination are highlighted, and if there is a domain applied to the field, the potential domain values show up. We can select multiple fields from the source table to be combined with a mathematical function, coalesced, or concatenated together and migrated to a single destination field. We can apply a regular expression to a field and only migrate part of it, or, if the source table is related to another table via a primary/foreign key, we can even join the tables together and migrate a value from the related table too! At this point in the process, we aren’t too terribly concerned with the values in the source and destination fields beyond whether the data types can work together. For example, we can’t take a text value and migrate it to an integer field (unless that text field happens to contain only integers). For this reason the albatross does show the values that are present in the fields, but doesn’t present them as the main focus.
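To make those kinds of combinations a bit more concrete, here’s a hedged sketch of coalesce, concatenate, and regex-extract transforms done on pandas dataframes. The column names are invented, and this isn’t the albatross’s actual transform code.

```python
import pandas as pd

# Invented source columns for illustration.
src = pd.DataFrame({
    "INSTALL_DATE": [None, "1998", "2004"],
    "EST_DATE": ["1990", "1995", None],
    "STREET": ["MAIN ST", "ELM ST", "OAK AVE"],
    "CITY": ["SPRINGFIELD", "SPRINGFIELD", "SHELBYVILLE"],
    "FACILITYID": ["GM-001", "GM-002", "GM-003"],
})

dest = pd.DataFrame(index=src.index)

# Coalesce: the first non-null of two source fields wins.
dest["installdate"] = src["INSTALL_DATE"].combine_first(src["EST_DATE"])

# Concatenate: multiple source fields into one destination field.
dest["location"] = src["STREET"] + ", " + src["CITY"]

# Regular expression: migrate only part of a field (the numeric suffix here).
dest["assetid"] = src["FACILITYID"].str.extract(r"(\d+)$", expand=False)

print(dest)
```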

Domain Values
The last big part of the migration configuration in the albatross world is the domain value mapping. In this step, for every destination field that is controlled by a coded value domain, the user specifies which source values need to be converted to which destination values. The albatross presents the user with the values in the source field and the potential values for the domain assigned to the destination field, and the user selects from the lists to configure the transformations.
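Under the hood, a configured domain-value mapping boils down to translating source values into the destination domain’s codes. Here’s a minimal sketch of that idea; the mapping dictionary, field, and values are invented, not the real configuration format.

```python
import pandas as pd

# Hypothetical source-value -> destination-domain-code mapping.
material_map = {
    "PVC": 1,         # Plastic
    "POLY": 1,        # Plastic
    "STEEL": 2,       # Steel
    "CAST IRON": 3,   # Cast Iron
}

src_material = pd.Series(["PVC", "STEEL", "POLY", "DUCTILE"], name="MATERIAL")

# Unmapped source values come back as NaN, which makes it easy to spot
# anything the configuration hasn't covered yet.
dest_codes = src_material.map(material_map)
print(dest_codes.tolist())
print("unmapped:", src_material[dest_codes.isna()].unique())
```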

Default Values
The last step of the migration process I’ll cover in detail is setting default values. In this step we can set a default value for a field across all destination tables, for a specific destination table, for a specific subtype of a destination table, or for a specific table/subtype mapping. This allows us to set values on the migrated data even if there isn’t a source field for the information. It’s extra helpful for non-nullable fields and for the fields the Utility Network wants populated that may not exist in the source data as an attribute.
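A tiny sketch of the default-value idea, again with invented field names and codes: after the mapped values are applied, anything still empty in a destination field gets the configured default.

```python
import pandas as pd

dest = pd.DataFrame({
    "assetgroup": [1, 1, 2],
    "lifecyclestatus": [None, 8, None],  # a field the destination schema requires
})

# Hypothetical defaults keyed by destination field name.
defaults = {"lifecyclestatus": 8}

for field, value in defaults.items():
    dest[field] = dest[field].fillna(value)

print(dest)
```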

Other Things That Can Be Configured
There are a few other things we can configure that I won’t go into a lot of detail about, partially because this article is getting rather long, but also because they don’t necessarily apply to all migrations. Within the albatross we can also set up value overrides to overwrite specific values in fields, and we can do some geometric manipulation of the data to snap features together or spread stacked devices apart from one another. There is an option to create utility network junctions at assetgroup/assettype changes between lines if there isn’t a junction or device present. We can configure associations to be created automatically between features based on keys in the source data. We can re-establish relationship class keys, configure attachments to be migrated, set UN subnetwork controllers, configure terminals on lines, etc. Last but certainly not least, we can configure utility network-specific manipulation of lines/junctions/assemblies/devices for a high-fidelity UN model, where assemblies and devices aren’t spatially coincident with lines but the network connectivity is maintained via associations. Once the migration is configured in the albatross, we can then perform the migration with the hummingbird.
The Hummingbird
Now that the migration is configured, we can fire up the hummingbird notebook and get to migrating the data efficiently. All we need to do is tell the hummingbird the location of the data mapping spreadsheet the albatross created to store the configuration, along with the paths to the source and destination databases, and it handles the rest. It’s built using the ArcGIS Python API to read only the tables and fields needed by the migration into a set of pandas dataframes (tables are read in parallel using multiprocessing to expedite the process), and all the value transformations occur very quickly because they’re performed on the dataframes in memory; there is no iterating over the rows of a table to change values one at a time. If the migration is configured to do the assembly/device offsetting for the high-fidelity UN model, it uses a lot of functions from numpy/scipy to quickly compute what needs to be moved and where to, and what junctions need to be generated, and it keeps track of how all the associations need to be created so it can write them to the associations table in an asset package. With the data all transformed, it then loads the data to the destination database, again using multiple processes to expedite the writing. After the data is loaded, any relationships/attachments/associations that were configured to be established/migrated are handled, since they all need the data loaded in the destination database first to establish the GLOBALIDs for the records.

So, how fast is it, you might be asking? The answer can vary wildly based on the configuration, and the assembly manipulations can take some time to process, but I’ve run 2.5 million features thru it with a full UN assembly/device offset/associate, re-establishing ~500K relationships and ~1 million associations, in around an hour. I recently did a gas migration with over 6 million features and 2.5 million attachments on those features, reading from an SDE and writing to a file geodatabase asset package, in under 8 hours (over 6 hours of which was the attachment migration, that’s a lot of attachment BLOBs to be moving around!).

I know I said at the beginning that there wouldn’t be any balloons or fireworks, just data migration, but you stuck around, so ya know what? Here’s a balloon for ya! 🎈 Thanks for reading, be safe, be friendly, I’ll catch ya in the next one!