Convert Oracle XML BLOB data using Amazon EMR and load to Amazon Redshift

In legacy relational database management systems, data is stored in several complex data types, such as XML, JSON, BLOB, or CLOB. This data might contain valuable information that's often difficult to transform into insights, so you might be looking for ways to load and use this data in a modern cloud data warehouse such as Amazon Redshift. One such example is migrating data from a legacy Oracle database with XML BLOB fields to Amazon Redshift, by performing preprocessing and conversion of XML using Amazon EMR. In this post, we describe a solution architecture for this use case, and show you how to implement the code to handle the XML conversion.

Solution overview

The first step in any data migration project is to capture and ingest the data from the source database. For this task, we use AWS Database Migration Service (AWS DMS), a service that helps you migrate databases to AWS quickly and securely. In this example, we use AWS DMS to extract data from an Oracle database with XML BLOB fields and stage the same data in Amazon Simple Storage Service (Amazon S3) in Apache Parquet format. Amazon S3 is an object storage service offering industry-leading scalability, data availability, security, and performance, and is the storage of choice for setting up data lakes on AWS.

After the data is ingested into an S3 staging bucket, we use Amazon EMR to run a Spark job that converts the XML fields to semi-structured fields, and the results are loaded into a curated S3 bucket. Amazon EMR runtime for Apache Spark can be over three times faster than clusters without EMR runtime, and has 100% API compatibility with standard Apache Spark. This improved performance means your workloads run faster and you save on compute costs, without making any changes to your application.

Finally, the transformed and curated data is loaded into Amazon Redshift tables using the COPY command. The Amazon Redshift table structure should match the number of columns and the column data types in the source file. Because we stored the data as a Parquet file, we specify the SERIALIZETOJSON option in the COPY command. This allows us to load complex types, such as structure and array, in a column defined as SUPER data type in the table.

The following architecture diagram shows the end-to-end workflow.

In detail, AWS DMS migrates data from the source database tables into Amazon S3, in Parquet format. Apache Spark on Amazon EMR reads the raw data, transforms the XML data type into a struct data type, and saves the data to the curated S3 bucket. In our code, we used an open-source library, called spark-xml, to parse and query the XML data.

In the rest of this post, we assume that the AWS DMS tasks have already run and created the source Parquet files in the S3 staging bucket. If you want to set up AWS DMS to read from an Oracle database with LOB fields, refer to Effectively migrating LOB data to Amazon S3 from Amazon RDS for Oracle with AWS DMS or watch the video Migrate Oracle to S3 Data lake via AWS DMS.


If you want to follow along with the examples in this post using your AWS account, we provide an AWS CloudFormation template you can launch by choosing Launch Stack:


Provide a stack name and leave the default settings for everything else. Wait for the stack to display Create Complete (this should only take a few minutes) before moving on to the other sections.

The template creates the following resources:

  • A virtual private cloud (VPC) with two private subnets that have routes to an Amazon S3 VPC endpoint
  • The S3 bucket {stackname}-s3bucket-{xxx}, which contains the following folders:
    • libs – Contains the JAR file to add to the notebook
    • notebooks – Contains the notebook to interactively test the code
    • data – Contains the sample data
  • An Amazon Redshift cluster, in one of the two private subnets, with a database named rs_xml_db and a schema named rs_xml
  • A secret (rs_xml_db) in AWS Secrets Manager
  • An EMR cluster

The CloudFormation template shared in this post is for demonstration purposes only. Please conduct your own security review and incorporate best practices prior to any production deployment using artifacts from the post.

Finally, some basic knowledge of Python and Spark DataFrames can help you review the transformation code, but isn't mandatory to complete the example.

Understanding the sample data

In this post, we use sample data that we created about college students' courses and subjects. In the source system, data consists of flat structure fields, like course_id and course_name, and an XML field that includes all the course material and subjects involved in the respective course. The following screenshot is an example of the source data, which is staged in an S3 bucket as a prerequisite step.

We can observe that the column study_material_info is an XML type field and contains nested XML tags. Let's see how to convert this nested XML field to a generic struct field in the subsequent steps.

Run a Spark job in Amazon EMR to transform the XML fields in the raw data to Parquet

In this step, we use an Amazon EMR notebook, which is a managed environment to create and open Jupyter Notebook and JupyterLab interfaces. It enables you to interactively analyze and visualize data, collaborate with peers, and build applications using Apache Spark on EMR clusters. To open the notebook, follow these steps:

  1. On the Amazon S3 console, navigate to the bucket you created as a prerequisite step.
  2. Download the file in the notebooks folder.
  3. On the Amazon EMR console, choose Notebooks in the navigation pane.
  4. Choose Create notebook.
  5. For Notebook name, enter a name.
  6. For Cluster, select Choose an existing cluster.
  7. Choose the cluster you created as a prerequisite.
  8. For Security Groups, choose BDB1909-EMR-LIVY-SG and BDB1909-EMR-Notebook-SG.
  9. For AWS Service Role, choose the role bdb1909-emrNotebookRole-{xxx}.
  10. For Notebook location, specify the S3 path in the notebooks folder (s3://{stackname}-s3bucket-{xxx}/notebooks/).
  11. Choose Create notebook.
  12. When the notebook is created, choose Open in JupyterLab.
  13. Upload the file you downloaded earlier.
  14. Open the new notebook.

    The notebook should look as shown in the following screenshot, and it contains a script written in Scala.
  15. Run the first two cells to configure Apache Spark with the open-source spark-xml library and import the needed modules. The spark-xml package allows reading XML files in local or distributed file systems as Spark DataFrames. Although primarily used to convert (portions of) large XML documents into a DataFrame, spark-xml can also parse XML in a string-valued column in an existing DataFrame with the from_xml function, in order to add it as a new column with parsed results as a struct.
  16. To do so, in the third cell, we load the data from the Parquet file generated by AWS DMS into a DataFrame, then we extract the attribute that contains the XML code (STUDY_MATERIAL_INFO) and map it to a string variable named payloadSchema.
  17. We can now use payloadSchema in the from_xml function to convert the field STUDY_MATERIAL_INFO into a struct data type and add it as a column named course_material in a new DataFrame named parsed.
  18. Finally, we can drop the original field and write the parsed DataFrame to our curated zone in Amazon S3.
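
The notebook itself is written in Scala and relies on spark-xml, so the following is only a plain-Python analogy of steps 16 to 18 on in-memory rows, not the notebook code. The sample row, the element names inside the XML string, and the helper names xml_to_struct and parse_row are hypothetical; XML attributes are ignored here for brevity.

```python
import xml.etree.ElementTree as ET


def xml_to_struct(elem):
    """Recursively turn an element into a dict of its children (text for leaves).

    Attributes and repeated child tags are ignored in this simplified sketch.
    """
    children = list(elem)
    if not children:
        return (elem.text or "").strip()
    return {child.tag: xml_to_struct(child) for child in children}


def parse_row(row, xml_field="STUDY_MATERIAL_INFO", target="course_material"):
    """Parse the XML string field, add it as a struct-like field, drop the original."""
    parsed = dict(row)  # copy so the raw row stays untouched
    parsed[target] = xml_to_struct(ET.fromstring(parsed.pop(xml_field)))
    return parsed


# Hypothetical raw row, standing in for one record of the staged Parquet data.
raw_rows = [
    {
        "course_id": 1,
        "course_name": "Biology",
        "STUDY_MATERIAL_INFO": "<material><book><author>Demo Author 1</author></book></material>",
    }
]

parsed_rows = [parse_row(row) for row in raw_rows]
print(parsed_rows[0])
```

In Spark, the same idea is expressed on whole columns rather than row by row: from_xml produces the struct column, and the original string column is dropped before writing to the curated bucket.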

Because of the structural differences between DataFrame and XML, there are some conversion rules from XML data to DataFrame and from DataFrame to XML data. More details and documentation are available in XML Data Source for Apache Spark.

When we convert from XML to DataFrame, attributes are converted to fields whose names start with the attributePrefix (underscore (_) is the default). For example, see the following code:

  <book category="undergraduate">
    <title lang="en">Introduction to Biology</title>
    <author>Demo Author 1</author>
  </book>

It produces the following schema (the year and price fields come from child elements of book that the fragment above omits):

 |-- _category: string (nullable = true)
 |-- title: struct (nullable = true)
 |    |-- _VALUE: string (nullable = true)
 |    |-- _lang: string (nullable = true)
 |-- author: string (nullable = true)
 |-- year: string (nullable = true)
 |-- price: string (nullable = true)

Next, consider a value in an element that has no child elements but does have attributes. The value is put in a separate field whose name is set by valueTag (_VALUE by default). See the following code:

<title lang="en">Introduction to Biology</title>

It produces the following schema, and the attribute lang is converted into the _lang field inside the DataFrame:

|-- title: struct (nullable = true)
 |    |-- _VALUE: string (nullable = true)
 |    |-- _lang: string (nullable = true)
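
To make the two conversion rules concrete, here is a small standard-library sketch that mimics them on a single XML fragment. It is not how spark-xml is implemented; the function name element_to_record is hypothetical, and the two constants simply mirror the default prefix and value tag described above.

```python
import xml.etree.ElementTree as ET

ATTRIBUTE_PREFIX = "_"   # mirrors the default attributePrefix
VALUE_TAG = "_VALUE"     # mirrors the default valueTag


def element_to_record(elem):
    """Convert an XML element to a nested dict using the two rules above.

    Repeated child tags (arrays) are not handled in this sketch.
    """
    children = list(elem)
    text = (elem.text or "").strip()
    # Rule 1: attributes become fields named with the attribute prefix.
    record = {ATTRIBUTE_PREFIX + name: value for name, value in elem.attrib.items()}
    if not children:
        if not record:
            # Plain leaf element without attributes: just its text value.
            return text
        # Rule 2: a leaf with attributes keeps its text under the value tag.
        record[VALUE_TAG] = text
        return record
    for child in children:
        record[child.tag] = element_to_record(child)
    return record


book = ET.fromstring(
    '<book category="undergraduate">'
    '<title lang="en">Introduction to Biology</title>'
    '<author>Demo Author 1</author>'
    '</book>'
)
record = element_to_record(book)
print(record)
```

The title element ends up as a nested record holding both _lang and _VALUE, while author, which has no attributes, stays a plain string.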

Copy curated data into Amazon Redshift and query tables seamlessly

Because our semi-structured nested dataset is already written in the S3 bucket as Apache Parquet formatted files, we can use the COPY command with the SERIALIZETOJSON option to ingest data into Amazon Redshift. The Amazon Redshift table structure should match the metadata of the Parquet files. Amazon Redshift can replace any Parquet columns, including structure and array types, with SUPER data columns.

The following code is a CREATE TABLE example that creates a staging table:

CREATE TABLE rs_xml_db.public.stg_edw_course_catalog
(
    course_id bigint,
    course_name character varying(5000),
    course_material super
);

The following code uses the COPY command to load from Parquet format:

COPY rs_xml_db.public.stg_edw_course_catalog FROM 's3://<<your Amazon S3 Bucket for curated data>>/data/target/<<your output parquet file>>'
IAM_ROLE '<<your IAM role>>'
FORMAT PARQUET SERIALIZETOJSON;
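
If you run the load from a script, the COPY command with the SERIALIZETOJSON option described above can be assembled as a string first. The sketch below only builds the statement text; the bucket name and role ARN are hypothetical placeholders, build_copy_statement is an illustrative helper, and plain string formatting like this should only be used with trusted values.

```python
def build_copy_statement(table, s3_path, iam_role):
    """Return a Redshift COPY statement for Parquet files with nested columns."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_path}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        "FORMAT PARQUET SERIALIZETOJSON;"
    )


statement = build_copy_statement(
    table="rs_xml_db.public.stg_edw_course_catalog",
    s3_path="s3://example-curated-bucket/data/target/",  # hypothetical bucket
    iam_role="arn:aws:iam::123456789012:role/example-redshift-role",  # hypothetical role
)
print(statement)
```

The resulting text can then be submitted through whichever SQL client you already use to connect to the cluster.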

By using semistructured data support in Amazon Redshift, you can ingest and store semistructured data in your Amazon Redshift data warehouses. With the SUPER data type and PartiQL language, Amazon Redshift expands the data warehouse capability to integrate with both SQL and NoSQL data sources. The SUPER data type only supports up to 1 MB of data for an individual SUPER field or object. Note that an individual value within a SUPER object is limited to the maximum length of the corresponding Amazon Redshift type. For example, a single string value loaded to SUPER is limited to the maximum VARCHAR length of 65535 bytes. See Limitations for more details.
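
The two limits quoted above can be checked roughly before loading. The following sketch assumes that 1 MB means 1,048,576 bytes and that the UTF-8 size of the JSON serialization is a reasonable approximation of what Amazon Redshift measures; both are assumptions, and the helper names are hypothetical.

```python
import json

MAX_SUPER_BYTES = 1024 * 1024   # assumed reading of the 1 MB limit quoted above
MAX_VARCHAR_BYTES = 65535       # maximum VARCHAR length quoted above


def iter_strings(value):
    """Yield every string value nested inside a JSON-like structure."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        for item in value.values():
            yield from iter_strings(item)
    elif isinstance(value, list):
        for item in value:
            yield from iter_strings(item)


def super_limit_problems(document):
    """Return a list of human-readable limit violations for one document."""
    problems = []
    encoded = json.dumps(document, ensure_ascii=False).encode("utf-8")
    if len(encoded) > MAX_SUPER_BYTES:
        problems.append(f"document is {len(encoded)} bytes, over {MAX_SUPER_BYTES}")
    for text in iter_strings(document):
        size = len(text.encode("utf-8"))
        if size > MAX_VARCHAR_BYTES:
            problems.append(f"string value is {size} bytes, over {MAX_VARCHAR_BYTES}")
    return problems


small = {"book": {"author": "Demo Author 1"}}
too_long = {"book": {"author": "x" * 70000}}
print(super_limit_problems(small))
print(super_limit_problems(too_long))
```

A check like this is only a pre-flight estimate; the COPY command remains the authority on whether a row is accepted.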

The following example shows how nested structures in a SUPER column can be easily accessed using SELECT statements:

SELECT DISTINCT bk._category
FROM rs_xml_db.public.stg_edw_course_catalog main
INNER JOIN main.course_material.book bk ON true;

The following screenshot shows our results.
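
For readers less familiar with PartiQL, joining a table to an array inside its own SUPER column behaves like a nested loop over each row's array. The sketch below mimics the query in plain Python, assuming course_material holds an array named book; the rows and their values are hypothetical sample data, not the post's dataset.

```python
# Hypothetical rows, each with a nested array of books in course_material.
rows = [
    {"course_id": 1, "course_material": {"book": [
        {"_category": "undergraduate", "author": "Demo Author 1"},
        {"_category": "graduate", "author": "Demo Author 2"},
    ]}},
    {"course_id": 2, "course_material": {"book": [
        {"_category": "undergraduate", "author": "Demo Author 3"},
    ]}},
]


def distinct_categories(rows):
    """Mimic SELECT DISTINCT bk._category with the book array unnested."""
    categories = set()
    for main in rows:                               # outer loop: one table row
        for bk in main["course_material"]["book"]:  # inner loop: one array element
            categories.add(bk["_category"])         # DISTINCT via a set
    return sorted(categories)


print(distinct_categories(rows))
```

Each array element is paired with its parent row, which is why the join condition in the SQL version can simply be true.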

Clean up

To avoid incurring future charges, first delete the notebook and the related files in the Amazon S3 bucket as explained in this EMR documentation page, then delete the CloudFormation stack.


This post demonstrated how to use AWS services like AWS DMS, Amazon S3, Amazon EMR, and Amazon Redshift to seamlessly work with complex data types like XML and perform historical migrations when building a cloud data lake house on AWS. We encourage you to try this solution and take advantage of all the benefits of these purpose-built services.

If you have questions or suggestions, please leave a comment.

About the authors

Abhilash Nagilla is a Sr. Specialist Solutions Architect at AWS, helping public sector customers on their cloud journey with a focus on AWS analytics services. Outside of work, Abhilash enjoys learning new technologies, watching movies, and visiting new places.

Avinash Makey is a Specialist Solutions Architect at AWS. He helps customers with data and analytics solutions in AWS. Outside of work, he plays cricket, tennis, and volleyball in his free time.

Fabrizio Napolitano is a Senior Specialist SA for DB and Analytics. He has worked in the analytics space for the last 20 years, and has recently and quite by surprise become a Hockey Dad after moving to Canada.
