Migrate a petabyte-scale information warehouse from Actian Vectorwise to Amazon Redshift

May 30, 2024

5

Amazon Redshift is a quick, scalable, and absolutely managed cloud information warehouse that lets you course of and run your complicated SQL analytics workloads on structured and semi-structured information. It additionally helps you securely entry your information in operational databases, information lakes, or third-party datasets with minimal motion or copying of information. Tens of 1000’s of shoppers use Amazon Redshift to course of massive quantities of information, modernize their information analytics workloads, and supply insights for his or her enterprise customers.

On this submit, we focus on how a monetary companies {industry} buyer achieved scalability, resiliency, and availability by migrating from an on-premises Actian Vectorwise information warehouse to Amazon Redshift.

Challenges

The shopper’s use case required a high-performing, extremely accessible, and scalable information warehouse to course of queries towards massive datasets in a low-latency atmosphere. Their Actian Vectorwise system was designed to exchange Excel plugins and inventory screeners however ultimately advanced right into a a lot bigger and bold portfolio evaluation resolution operating a number of API clusters on premises, serving among the largest monetary companies corporations worldwide. The shopper noticed rising demand that wanted excessive efficiency and scalability as a consequence of 30% year-over-year enhance in utilization from the success of their merchandise. The shopper wanted to maintain up with elevated quantity of learn requests, however they couldn’t do that with out deploying extra {hardware} within the information middle. There was additionally a buyer mandate that business-critical merchandise will need to have their {hardware} up to date to cloud-based options or be deemed on the trail to obsolescence. As well as, the enterprise began transferring clients onto a brand new industrial mannequin, and due to this fact new initiatives would want to provision a brand new cluster, which meant that they wanted improved efficiency, scalability, and availability.

They confronted the next challenges:

Scalability – The shopper understood that infrastructure upkeep was a rising difficulty and, though operations have been a consideration, the present implementation didn’t have a scalable and environment friendly resolution to fulfill the superior sharding necessities wanted for question, reporting, and evaluation. Over-provisioning of information warehouse capability to fulfill unpredictable workloads resulted in underutilized capability throughout regular operations by 30%.
Availability and resiliency – As a result of the client was operating business-critical analytical workloads, it required the best ranges of availability and resiliency, which was a priority with the on-premises information warehouse resolution.
Efficiency – A few of their queries wanted to be processed in precedence, and customers have been beginning to expertise efficiency degradation with longer-running question occasions as their resolution began getting used increasingly. The necessity for a scalable and environment friendly resolution to handle buyer demand, handle infrastructure upkeep issues, change legacy tooling, and sort out availability led to them selecting Amazon Redshift as the longer term state resolution. If these issues weren’t addressed, the client could be prevented from rising their consumer base.

Legacy structure

The shopper’s platform was the principle supply for one-time, batch, and content material processing. It served many enterprise use circumstances throughout API feeds, content material mastering, and analytics interfaces. It was additionally the only strategic platform inside the firm for entity screening, on-the-fly aggregation, and different one-time, complicated request workflows.

The next diagram illustrates the legacy structure.

The structure consists of many layers:

Guidelines engine – The foundations engine was chargeable for intercepting each incoming request. Based mostly on the character of the request, it routed the request to the API cluster that might optimally course of that particular request based mostly on the response time requirement.
API – Scalability was one of many main challenges with the present on-premises system. It wasn’t potential to rapidly scale up and down API service capability to fulfill rising enterprise demand. Each the API and information retailer needed to assist a extremely unstable workload sample. This included easy information retrieval requests that needed to be processed inside a couple of milliseconds vs. energy user-style batch requests with complicated analytics-based workloads that might take a number of seconds and vital compute assets to course of. To separate these totally different workload patterns, the API and information retailer infrastructure was break up into a number of remoted bodily clusters. This made certain every workload group was provisioned with ample reserved capability to fulfill the respective response time expectations. Nevertheless, this mannequin of reserving capability for every workload kind resulted in suboptimal utilization of compute assets as a result of every cluster would solely course of a selected workload kind.
Information retailer – The info retailer used a customized information mannequin that had been extremely optimized to fulfill low-latency question response necessities. The present on-premises information retailer wasn’t horizontally scalable, and there was no built-in replication or information sharding functionality. Because of this limitation, a number of database cases have been created to fulfill concurrent scalability and availability necessities as a result of the schema wasn’t generic per dataset. This mannequin triggered operational upkeep overhead and wasn’t simply expandable.
Information ingestion – Pentaho was used to ingest information sourced from a number of information publishers into the information retailer. The ingestion framework itself didn’t have any main challenges. Nevertheless, the first bottleneck was as a consequence of scalability points related to the information retailer. As a result of the information retailer didn’t assist sharding or replication, information ingestion needed to explicitly ingest the identical information concurrently throughout a number of database nodes inside a single transaction to supply information consistency. This considerably impacted general ingestion pace.

General, the present structure didn’t assist workload prioritization, due to this fact a bodily mannequin of assets was reserved because of this. The draw back right here is over-provisioning. The system had an integration with legacy backend companies that have been all hosted on premises.

Resolution overview

Amazon Redshift is an industry-leading cloud information warehouse. Amazon Redshift makes use of SQL to research structured and semi-structured information throughout information warehouses, operational databases, and information lakes utilizing AWS-designed {hardware} and machine studying (ML) to ship one of the best price-performance at any scale.

Amazon Redshift is designed for high-performance information warehousing, which offers quick question processing and scalable storage to deal with massive volumes of information effectively. Its columnar storage format minimizes I/O and improves question efficiency by studying solely the related information wanted for every question, leading to quicker information retrieval. Lastly, you’ll be able to combine Amazon Redshift with information lakes like Amazon Easy Storage Service (Amazon S3), combining structured and semi-structured information for complete analytics.

The next diagram illustrates the structure of the brand new resolution.

Within the following sections, we focus on the options of this resolution and the way it addresses the challenges of the legacy structure.

Guidelines engine and API

Amazon API Gateway is a completely managed service that assist builders ship safe, sturdy, API-driven software backends at any scale. To handle scalability and availability necessities of the principles and routing layer, we launched API Gateway to do the routing of the consumer requests to totally different integration paths utilizing routes and parameter mappings. Having API Gateway because the entry level allowed the client to maneuver away from the design, testing, and upkeep of their guidelines engine growth workload. Of their legacy atmosphere, dealing with fluctuating quantities of site visitors posed a major problem. Nevertheless, API Gateway seamlessly addressed this difficulty by performing as a proxy and routinely scaling to accommodate various site visitors calls for, offering optimum efficiency and reliability.

Information storage and processing

Amazon Redshift allowed the client to fulfill their scalability and efficiency necessities. Amazon Redshift options akin to workload administration (WLM), massively parallel processing (MPP) structure, concurrency scaling, and parameter teams helped handle the necessities:

WLM offered the power for question prioritization and managing assets successfully
The MPP structure mannequin offered horizontal scalability
Concurrency scaling added extra cluster capability to deal with unpredictable and spiky workloads
Parameter teams outlined configuration parameters that management database conduct

Collectively, these capabilities allowed them to fulfill their scalability and efficiency necessities in a managed vogue.

Information distribution

The legacy information middle structure was unable to partition the information with out deploying extra {hardware} within the information middle, and it couldn’t deal with learn workloads effectively.

The MPP structure of Amazon Redshift provides environment friendly information distribution throughout all of the compute nodes, which helped run heavy workloads in parallel and subsequently lowered response occasions. With the information distributed throughout all of the compute nodes, it permits information to be processed in parallel. Its MPP engine and structure separates compute and storage for environment friendly scaling and efficiency.

Operational effectivity and hygiene

Infrastructure upkeep and operational effectivity was a priority for the client of their present state structure. Amazon Redshift is a completely managed service that takes care of information warehouse administration duties akin to {hardware} provisioning, software program patching, setup, configuration, and monitoring nodes and drives to get well from failures or backups. Amazon Redshift periodically performs upkeep to use fixes, enhancements, and new options to your Redshift information warehouse. Consequently, the client’s operational prices lowered by 500%, and they’re now capable of spend extra time innovating and constructing mission-critical purposes.

Workload administration

Amazon Redshift WLM was capable of resolve points with the legacy structure the place longer-running queries have been consuming all of the assets, inflicting different queries to run slower, impacting efficiency SLAs. With computerized WLM, the client was capable of create separate WLM queues with totally different priorities, which allowed them to handle the priorities for the essential SLA-bound workloads and different non-critical workloads. With quick question acceleration (SQA) enabled, it prioritized chosen short-running queries forward of longer-running queries. Moreover, the client benefited through the use of question monitoring guidelines in WLM to use efficiency boundaries to manage poorly designed queries and take motion when a question goes past these boundaries. To be taught extra about WLM, discuss with Implementing workload administration.

Workload isolation

Within the legacy structure, all of the workloads—extract, rework, and cargo (ETL); enterprise intelligence (BI); and one-time workloads—have been operating on the identical on-premises information warehouse, resulting in the noisy neighbor downside and efficiency points with the rise in customers and workloads.

With the brand new resolution structure, this difficulty is remediated utilizing information sharing in Amazon Redshift. With information sharing, the client is ready to share reside information with safety and ease throughout Redshift clusters, AWS accounts, or AWS Areas for learn functions, with out the necessity to copy any information.

Information sharing improved the agility of the client’s group. It does this by giving them instantaneous, granular, and high-performance entry to information throughout Redshift clusters with out the necessity to copy or transfer it manually. With information sharing, clients have reside entry to information, so their customers can see probably the most up-to-date and constant data because it’s up to date in Redshift clusters. Information sharing offers workload isolation by operating ETL workloads in its personal Redshift cluster and sharing information with different BI and analytical workloads of their respective Redshift clusters.

Scalability

With the legacy structure, the client was dealing with scalability challenges throughout massive occasions to deal with unpredictable spiky workloads and over-provisioning of the database capability. Utilizing concurrency scaling and elastic resize allowed the client to fulfill their scalability necessities and deal with unpredictable and spiky workloads.

Information migration to Amazon Redshift

The shopper used a home-grown course of to extract the information from Actian Vectorwise and retailer it in Amazon S3 and CSV recordsdata. The info from Amazon S3 was then ingested into Amazon Redshift.

The loading course of used a COPY command and ingested the information from Amazon S3 in a quick and environment friendly means. A greatest follow for loading information into Amazon Redshift is to make use of the COPY command. The COPY command is probably the most environment friendly approach to load a desk as a result of it makes use of the Amazon Redshift MPP structure to learn and cargo information in parallel from a file or a number of recordsdata in an S3 bucket.

To find out about one of the best practices for supply information recordsdata to load utilizing the COPY command, see Loading information recordsdata.

After the information is ingested into Redshift staging tables from Amazon S3, transformation jobs are run from Pentaho to use the incremental adjustments to the ultimate reporting tables.

The next diagram illustrates this workflow.

Key concerns for the migration

There are 3 ways of migrating an on-premises information warehouse to Amazon Redshift: one-step, two-step, and wave-based migration. To attenuate the danger of migrating over 20 databases that modify in complexity, we selected the wave-based strategy. The elemental idea behind wave-based migration entails dividing the migration program into initiatives based mostly on components akin to complexity and enterprise outcomes. The implementation then migrates every undertaking individually or by combining sure initiatives right into a wave. Subsequent waves observe, which can or might not be depending on the outcomes of the previous wave.

This technique requires each the legacy information warehouse and Amazon Redshift to function concurrently till the migration and validation of all workloads are efficiently full. This offers a clean transition whereas ensuring the on-premises infrastructure could be retired solely after thorough migration and validation have taken place.

As well as, inside every wave, we adopted a set of phases to be sure that every wave was profitable:

Assess and plan
Design the Amazon Redshift atmosphere
Migrate the information
Take a look at and validate
Carry out cutover and optimizations

Within the course of, we didn’t wish to rewrite the legacy code for every migration. With minimal code adjustments, we migrated the information to Amazon Redshift as a result of SQL compatibility was crucial within the course of as a consequence of current data inside the group and downstream software consumption. After the information was ingested into the Redshift cluster, we adjusted the tables for greatest efficiency.

One of many predominant advantages we realized as a part of the migration was the choice to combine information in Amazon Redshift with different enterprise teams sooner or later that use AWS Information Change, with out vital effort.

We carried out blue/inexperienced deployments to be sure that the end-users didn’t encounter any latency degradation whereas retrieving the information. We migrated the end-users in a phased method to measure the impression and modify the cluster configuration as wanted.

Outcomes

The shopper’s resolution to make use of Amazon Redshift for his or her resolution was additional strengthened by the platform’s potential to deal with each structured and semi-structured information seamlessly. Amazon Redshift permits the client to effectively analyze and derive priceless insights from their numerous vary of datasets, together with equities and institutional information, all whereas utilizing customary SQL instructions that groups are already snug with.

By rigorous testing, Amazon Redshift constantly demonstrated outstanding efficiency, assembly the client’s stringent SLAs and delivering distinctive subsecond question response occasions with a formidable latency. With the AWS migration, the client achieved a 5% enchancment in question efficiency. Scalability of the clusters was executed in minutes in comparison with 6 months within the information middle. Operational value lowered by 500% as a result of simplicity of the Redshift cluster operations in AWS. Stability of the clusters improved by 100%. Upgrades and patching cycle time improved by 200%. General, enchancment in operational posture and whole financial savings for the footprint has resulted in vital financial savings for the crew and platform generally. As well as, the power to scale the general structure based mostly on market information tendencies in a resilient and extremely accessible means not solely met the client demand when it comes to time to market, but additionally considerably lowered the operational prices and whole value of possession.

Conclusion

On this submit, we lined how a big monetary companies buyer improved efficiency and scalability, and lowered their operational prices by migrating to Amazon Redshift. This enabled the client to develop and onboard new workloads into Amazon Redshift for his or her business-critical purposes.

To find out about different migration use circumstances, discuss with the next:

In regards to the Authors

Krishna Gogineni is a Principal Options Architect at AWS serving to monetary companies clients. Krishna is Cloud-Native Structure evangelist serving to clients rework the best way they construct software program. Krishna works with clients to be taught their distinctive enterprise objectives, after which super-charge their potential to fulfill these objectives by software program supply that leverages {industry} greatest practices/instruments akin to DevOps, Information Lakes, Information Analytics, Microservices, Containers, and Steady Integration/Steady Supply.

Dayananda Shenoy is a Senior Resolution Architect with over 20 years of expertise designing and architecting backend companies for monetary companies merchandise. At present, he leads the design and structure of distributed, high-performance, low latency analytics companies for an information supplier. He’s keen about fixing scalability and efficiency challenges in distributed methods leveraging rising expertise which enhance current tech stacks and add worth to the enterprise to boost buyer expertise.

Vishal Balani is a Sr. Buyer Options Supervisor based mostly out of New York. He works carefully with Monetary Companies clients to assist them leverage cloud for companies agility, innovation and resiliency. He has in depth expertise main large-scale cloud migration applications. Exterior of labor he enjoys spending time with household, tinkering with a brand new undertaking or using his bike.

Ranjan Burman is a Sr. PostgreSQL Database Specialist SA. He focuses on RDS & Aurora PostgreSQL. He has greater than 18 years of expertise in numerous database and information warehousing applied sciences. He’s keen about automating and fixing buyer issues with the usage of cloud options.

Muthuvelan Swaminathan is an Enterprise Options Architect based mostly out of New York. He works with enterprise clients offering architectural steerage in constructing resilient, cost-effective and progressive options that handle enterprise wants.