Sometimes, when we have to rewrite a very old system, we may also want to change the structure of the database tables and move to a more suitable database server. This is not an easy task. We can't simply dump the database and restore it, because the two database servers are different. Even if we manage to restore a dump, we still need to populate the data into the new table structures. One solution to this problem is to use a data integration tool. Data integration tools are also used to aggregate data from a data warehouse into new tables in order to speed up data retrieval, and they have many other applications as well. One popular open source data integration tool is Pentaho Data Integration (PDI), also known as Kettle. Below I am going to walk you through a simple example of database migration with PDI. The implementation can be found in this GitHub repository: https://github.com/vtopavlov/db_migration_example
Preparation of example databases:
Let’s say that we have a database and we want to migrate it to another database server with a more normalized structure. In order to show you how the PDI tool works, I installed a MySQL server and created a database migration_example with the following table:
CREATE TABLE `migration_example`.`client` (
  `id` INT NOT NULL,
  `name` VARCHAR(45) NULL,
  `phone_number` VARCHAR(45) NULL,
  PRIMARY KEY (`id`)
);
I also installed a PostgreSQL server and created a database migration_example with the following tables:
CREATE TABLE public.client (
  id serial NOT NULL,
  client_name text,
  CONSTRAINT client_pkey PRIMARY KEY (id),
  CONSTRAINT unique_client UNIQUE (client_name)
);

CREATE TABLE public.phone_number (
  phone_number text NOT NULL,
  client_id integer NOT NULL,
  CONSTRAINT phone_number_pkey PRIMARY KEY (phone_number, client_id),
  CONSTRAINT phone_number_client_fkey FOREIGN KEY (client_id)
    REFERENCES public.client (id) MATCH FULL
);
The idea here is to separate the clients from the phone numbers. By doing this, if one client has several phone numbers, the name won’t be repeated in the database.
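For illustration, here is how a hypothetical client with two phone numbers would be stored in each schema (the name and the numbers are made up):

-- source rows in MySQL: the client name is repeated for every phone number
INSERT INTO migration_example.client (id, name, phone_number)
VALUES (1, 'John Doe', '555-0100'),
       (2, 'John Doe', '555-0101');

-- the same data after migration to PostgreSQL: the name is stored only once
INSERT INTO public.client (id, client_name) VALUES (1, 'John Doe');
INSERT INTO public.phone_number (phone_number, client_id)
VALUES ('555-0100', 1), ('555-0101', 1);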
Downloading and running the PDI tool:
In order to transfer the data from the first database to the second one, we can use the PDI tool. It is available for download (at the time of writing this article) at this location: https://community.pentaho.com/projects/data-integration/
Once we download the ZIP containing PDI and extract it, we can start the visual environment by executing spoon.sh on Linux or spoon.bat on Windows. Java must be installed in order to start the application.
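On Linux, the whole sequence might look like this (the archive name is a placeholder for whatever version you downloaded):

# extract the PDI archive and start the Spoon graphical editor
unzip pdi-ce-<version>.zip
cd data-integration
./spoon.sh

Once Spoon starts, it looks like this: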
Implementing a PDI solution that will migrate the data:
There are two major components in PDI. The first one is called a job and the second one a transformation. The difference between them is that a job executes each of its steps synchronously, meaning that each step waits for the previous one to finish, while a transformation runs all of its steps asynchronously, in parallel.
For our example we will first populate the postgres client table, and after that we will populate the postgres phone_number table with data from the MySQL database. We need to populate the client table first, because the phone_number table has a foreign key pointing to it. So we will need one job and two transformations: the job will execute the transformations one after another, the first transformation will populate the client table, and the second transformation will populate the phone_number table. We use transformations to copy the data from one database to another, because we want the transfer to happen asynchronously.
1. We can start by creating the transformation that will populate the client table in postgres. Select File -> New -> Transformation from the menu.
2. Let’s save it first, so we can give it a name. Go to File -> Save As, and enter the name “MigrateClientNames.ktr”.
3. On the left panel, select the tab Design and search for a step called Table Input.
4. Drag and drop the step from the left panel onto the empty panel on the right. You will see the step appear. Now double-click on the newly added step.
5. You will see a pop-up window with the configuration for this step. Give the new step a name, for example “Get distinct client names from mysql”. Now we need to create a database connection to the MySQL database. You will notice a “Connection” label, followed by a drop-down and three buttons. Click on the “New…” button. A screen like this appears:
6. Set the Connection Name to “MySQL_DB”, select MySQL as the connection type and fill in your server host, port, user name, password and database. When you click the Test button, it will tell you whether the connection settings are correct. If you encounter the following problem:
Driver class 'org.gjt.mm.mysql.Driver' could not be found, make sure the 'MySQL' driver (jar file) is installed. org.gjt.mm.mysql.Driver
you will need to download the latest MySQL JDBC driver (Connector/J), add the jar to the lib folder of the PDI tool and restart the graphical editor (Spoon).
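Assuming the PDI archive was extracted into a folder called data-integration and you downloaded Connector/J into the current directory (both names below are assumptions; adjust them to your setup), the fix might look like this:

# copy the MySQL JDBC driver into PDI's lib folder
cp mysql-connector-java-8.0.33.jar data-integration/lib/
# restart Spoon so the new driver is picked up
cd data-integration && ./spoon.sh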
7. After we create and choose the connection in the Table Input step, we can write the SQL that will select the data. Put the following code in the SQL section:
SELECT distinct name FROM client;
The step configuration should look like this:
8. Now that the data retrieval is ready, let’s write the data to the postgres database. Drag and drop a Table Output step. Give the step a name, for example “Populate distinct names into client table in postgresql”. Create a new connection for the postgres database (we will call it “PostgreSQL_DB”). Set the target schema to “public” and the target table to “client”. Tick the checkboxes “Ignore insert errors” and “Specify database fields”.
9. Select the “Database fields” tab and populate the fields as in the screenshot. The Table field is the column that we want to populate in the client table, and the Stream field is the incoming field from which we will get the data.
10. In order for the second step to get the data from the first step, we need to connect them. If we hold Shift and click on the first step, we can drag the mouse to the second step and they will become connected. Alternatively, if we click on the first step, a small context menu appears below it, and from this menu we can click the button for connecting steps. Finally, the transformation should look like this:
11. Save the transformation. If you want to execute it, you can click on the triangle (run) button marked with a circle in the screenshot.
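After the run finishes, a quick sanity check in PostgreSQL (not part of the transformation, just a way to verify the result) could be:

-- every distinct client name from MySQL should now appear here exactly once
SELECT id, client_name FROM public.client ORDER BY client_name;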
Populate the phone_number table in the PostgreSQL database
Now that the first transformation is ready, we can create the second one. It is a bit more complex, because now we need data from both databases. On one hand we need the name and phone number from the MySQL database, and on the other hand we need the ID and name of the client from the postgres database, in order to populate the foreign key. We will create two Table Input steps, one Merge Join step and one Table Output step.
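Before building it, it helps to see what we are after. Conceptually, the rows we want to insert into phone_number are the result of the join below. It is written as if both tables lived in one database purely to illustrate the logic (the aliases mysql_client and postgres_client are hypothetical); in reality they live on different servers, which is exactly why we use the Merge Join step:

SELECT pg.id AS client_id, my.phone_number
FROM mysql_client my                -- name, phone_number coming from MySQL
LEFT OUTER JOIN postgres_client pg  -- id, client_name coming from PostgreSQL
  ON my.name = pg.client_name;

Note also that both input queries below order their rows by the join key (name and client_name); the Merge Join step expects its inputs to be sorted this way.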
1. Create a new transformation “MigrateClientPhoneNumbers.ktr” and save it.
2. Create a step Table Input, choose the existing connection MySQL_DB, give a name for the step, for example “Get name and phone number from mysql” and write the following query:
SELECT name, phone_number FROM client order by name;
The step should look like this:
3. Create another step Table Input, give the step a name, for example “Get ID of the client”, choose the existing PostgreSQL_DB connection and write the following query:
SELECT id, client_name FROM client order by client_name;
The step should look like this:
4. Now comes the slightly more complicated step. Find and add a Merge Join step. The purpose of this step is to perform a LEFT OUTER join between the results of the first and the second step. Notice how powerful PDI is: it can join rows coming from two different databases. Connect the steps as follows:
5. Configure the step Merge Join according to the screenshot:
6. Create a Table Output step with the name “Populate phone numbers in postgres db” and the PostgreSQL_DB connection, set the target schema to “public” and the target table to “phone_number”, check “Ignore insert errors” and “Specify database fields”, and configure the database fields according to the screenshot:
7. The steps and the connections between them should look like this:
8. Save the transformation. Click on the triangle button below the name of the transformation if you want to execute it.
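As with the first transformation, a quick check in PostgreSQL can confirm that every phone number ended up attached to the right client (again, this is only a sanity check, not part of the transformation):

SELECT c.client_name, p.phone_number
FROM public.client c
JOIN public.phone_number p ON p.client_id = c.id
ORDER BY c.client_name;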
Create a job that executes both transformations
We are now ready with the transformations; however, it is inconvenient to execute them one after another by hand. We can create a job that will do the work for us.
1. Create a new job and save it with the name “MigrationJob.kjb”.
2. Find the START step. It is a step without an action; it simply marks the beginning of the execution.
3. Find a Transformation step. In the step configuration you can select a transformation file. Select the first transformation that we created.
4. Create another Transformation step and point it to the second transformation.
5. Now add a Success step, which also has no action. It marks the end of the job.
6. The connections between the steps should look like the screenshot:
Execution of the transformations and the job:
There is more than one way to execute a transformation or a job. The most obvious one is through the Spoon visual environment. We can run a transformation by clicking the triangle button or by selecting Action -> Run.
Sometimes PDI runs on a server, when the operations are too heavy to execute on a regular workstation. Such servers usually don’t have a graphical environment installed, so PDI offers command-line tools that can execute transformations and jobs. We can execute a transformation from the command line by calling pan.sh and passing the transformation file as a parameter. To execute a job we can use kitchen.sh and pass the job file as a parameter.
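For example, assuming the files are stored under /opt/migration (the path is just an assumption), the calls might look like this:

# run a single transformation with Pan
./pan.sh -file=/opt/migration/MigrateClientNames.ktr -level=Basic

# run the whole job with Kitchen; it executes both transformations in order
./kitchen.sh -file=/opt/migration/MigrationJob.kjb -level=Basic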
I hope you enjoyed the article. Do you know another way to handle such a scenario?