Copy data management solutions are being added to architectures to address the demands of CCPA and GDPR. Copy data management is a capability available as a stand-alone data management tool, as well as within data virtualization platforms. Advantages of copy data management discussed in this research include the ability to stop orphan data copies, eliminate extra storage costs, keep copies up to date and consistent with one another, simplify governance, and accelerate development.
These Whisper Studies demonstrate two initial use cases that prompted organizations to add data virtualization to their architectures. The first study involves a North American railroad’s evolution to Precision Railroading, which required entirely new operational metrics. To develop the metrics, the team required a replica of production data without affecting production. The railroad leveraged Teradata QueryGrid. The second study is from a consumer electronics manufacturer that wished to reduce its cost basis for data storage and computation. The original application did not require any changes. The manufacturer leveraged Gluent on Oracle and Hadoop to achieve its desired result.
The Studies
Railroad Access to On-Premise Data
Scenario: Production data in Teradata – needed parallel dev environment
Start State: Production data in Teradata Active Enterprise Data Warehouse
End State: Teradata QueryGrid provides data access to dev environment
Data Size: 5 terabytes
Set-up Time: Half day to install, 6 months to optimize tuning
Interview(s): August 2019
Consumer Products Manufacturer Lowers the Cost of Data Management
Scenario: Application leverages Oracle production database
Start State: Oracle storage growing too fast, too expensive
End State: Leveraged Gluent to offload data and computations to Hadoop
Data Size: 31 terabytes of application data in Oracle Database
Set-up Time: One-hour installation, two months to implement
Interview(s): March 2020
Key Takeaways
Data virtualization is frequently brought into environments to provide access to production data without disturbing the production environment.
Data virtualization is also frequently brought into environments to reduce the cost structure, and it is useful in bringing legacy applications into a modern data management solution.
Set-up for data virtualization is often quick; initial installation frequently takes less than a day, while most organizations become fluent in tuning the environment within two to six months.
These Whisper Studies center on two cases leveraging a data virtualization platform. In both scenarios, this was the organization’s first data virtualization use case, or the one that caused it to add the technology to its architecture. The first study involves an organization that needed to update its operational models and related analytics. It needed to leverage production data to develop and confirm new metrics without disrupting production. The organization in the second Whisper Study wished to reduce the cost profile of its production and analytics without a major architecture change. Both leveraged data virtualization to address their needs.
Scenario: Production data in Teradata – needed parallel dev environment
Start State: Production data in Teradata Active Enterprise Data Warehouse
End State: Teradata QueryGrid provides data access to Dev
Data Size: 5 terabytes
Set-up Time: Half day to install, 6 months to optimize tuning
Interview(s): August 2019
A North American railroad company needed new operational analytics as part of its move toward Precision Scheduled Railroading1. To accomplish this, the railroad wanted to evaluate the new operational metrics in development before updating production.
To evaluate the new metrics, the organization required an environment parallel to production. This parallel environment required some 30 tables and millions of rows of data, all without interrupting or burdening production. The data was primarily transportation data with some finance data mixed in.
To accomplish this, development needed an exact copy of the large volumes of production data. The copy needed to be scheduled, properly updated based on all dependent processes, and complete. In addition, to compare the new set of operational metrics to the current operational models, production’s target tables were also required in the parallel environment.
Note that because the railroad is land rich, with significant bandwidth available along its lines, it owns and operates two of its own data centers. This also allows the organization to control the highly sensitive data regarding its operations, which affect multiple industries, since it ships raw ingredients across the continent. As such, the entire solution is considered on-premise.
Because the majority of the data was on-premise in the Teradata Active Enterprise Data Warehouse, it was natural to reach out to Teradata, which provided Teradata QueryGrid2, a data virtualization solution. Additional research detailing QueryGrid’s capabilities can be found in “Whisper Report: Six Data Engineering Capabilities Provided by Modern Data Virtualization Platforms.”
By leveraging QueryGrid, the railroad had a perfect replica of production without the concern of interfering with production. A data virtualization platform provides a view of the data against which to execute workloads. This view is independent of the original form of the data and may or may not involve an additional complete physical copy of the data. More importantly, the data virtualization technology is able to maintain an up-to-date view of the data, as depicted in Figure 1.
To leverage Teradata’s QueryGrid, the following steps were required.
Connect the source: As with all data layers, the specific sources to be used by the platform must be connected. When connecting the sources, the majority of the time was spent tracking down and setting up the permissions to connect.
Configure the views: Data virtualization platforms such as QueryGrid operate by providing data views. The second step was creating the data views required for the Precision Railroading project (see the sketch following these steps).
To protect production, only official DBAs within IT could create views leveraging QueryGrid; they did not want production data to be wrongly exploited. The project encountered no major problems.
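As a rough illustration of this pattern, the sketch below shows a DBA exposing a read-only development view over a production table. It assumes Teradata’s open-source Python driver, teradatasql; the schema names, credentials, and the QueryGrid foreign-server reference (prod_querygrid_server) are hypothetical, and the exact view syntax depends on how the connectors are configured at the site.

```python
# Illustrative sketch only: all object names, credentials, and the foreign-server
# reference are hypothetical; actual QueryGrid view syntax depends on the site's
# connector configuration and security policy.
import teradatasql  # Teradata's Python DB-API driver

with teradatasql.connect(host="dev-td.example.internal",
                         user="dba_user", password="********") as con:
    with con.cursor() as cur:
        # A DBA-created view in the development system that reaches back to
        # production through a QueryGrid foreign server, so developers query
        # current production data without holding production credentials.
        cur.execute("""
            CREATE VIEW dev_views.train_movements_v AS
            SELECT * FROM prod_edw.train_movements@prod_querygrid_server
        """)
        # Developers receive SELECT rights on the view, not on the base table.
        cur.execute("GRANT SELECT ON dev_views.train_movements_v TO dev_role")
```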
Figure 1. Develop with Production Data without Affecting Production
With the exact replica of production data and related current operational metrics, the railroad was able to perform a side-by-side comparison with the incoming Precision Railroading Metrics. It was critical for the business to get comfortable with the impact of the new metrics before they became the official operating metrics for the company. Accuracy was critical, as the railroad’s operational metrics are publicly released. Note that formal data validation platforms were not used to compare the data; rather, SQL scripts were leveraged (see Whisper Report: Decision Integrity Requires Data Validation for related research).
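The sketch below illustrates the kind of SQL-script comparison described above. The metric tables, columns, and tolerance are invented for illustration, and the teradatasql driver is assumed as in the earlier sketch.

```python
# Hypothetical comparison script: flag dates where the new metric diverges
# from the current operational metric beyond an agreed tolerance.
# Table and column names are invented for illustration.
import teradatasql

CHECK_SQL = """
SELECT c.metric_date,
       c.terminal_dwell_hours AS current_value,
       n.terminal_dwell_hours AS new_value
FROM   prod_metrics.terminal_dwell    c
JOIN   dev_metrics.psr_terminal_dwell n
       ON c.metric_date = n.metric_date
WHERE  ABS(c.terminal_dwell_hours - n.terminal_dwell_hours) > 0.1
"""

with teradatasql.connect(host="dev-td.example.internal",
                         user="analyst", password="********") as con:
    with con.cursor() as cur:
        cur.execute(CHECK_SQL)
        # Any rows returned are dates where the two metric sets disagree.
        for metric_date, current_value, new_value in cur.fetchall():
            print(metric_date, current_value, new_value)
```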
The new corporate reporting metrics tracked events such as how long a train took to go from Point A to Point B, as well as how long the train stayed or stopped at each of the stations between the two points. Overall, there was an assortment of metrics within Precision Railroading that the railroad wanted to realize. As a result of the new operational insights, the railroad found numerous opportunities to improve the process. For example, visibility was given to instances where certain customers required multiple attempts to successfully deliver a load. With the waste identified, the organization could now address issues that negatively impacted its efficiency.
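For concreteness, here is a hypothetical sketch of how two such metrics, end-to-end transit time and intermediate dwell time, could be derived from station arrival and departure events; the event data and schema are invented.

```python
# Hypothetical sketch of the kind of metric the railroad tracked: total transit
# time from origin to destination and dwell time at intermediate stops.
from datetime import datetime, timedelta

# (station, arrival, departure) events for one train, origin to destination.
events = [
    ("Origin",      None,                          datetime(2019, 8, 1, 6, 0)),
    ("Stop A",      datetime(2019, 8, 1, 9, 30),   datetime(2019, 8, 1, 10, 15)),
    ("Stop B",      datetime(2019, 8, 1, 14, 0),   datetime(2019, 8, 1, 14, 20)),
    ("Destination", datetime(2019, 8, 1, 18, 45),  None),
]

# Departure at the origin to arrival at the destination.
transit_time = events[-1][1] - events[0][2]
# Time spent stopped at each intermediate station.
dwell_time = sum((dep - arr for _, arr, dep in events[1:-1]), timedelta())

print(f"transit: {transit_time}, dwell: {dwell_time}")
```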
This project was the railroad’s first Teradata QueryGrid project. With that success under its belt, the next project will expand the business’s ability to be more involved in self-service.
The second study involves a large electronics manufacturer seeking to reduce its cost profile. The manufacturer has a large amount of sensor data coming from its machines (a single data set is 10 terabytes). The data regarding the machines is stored in an Oracle database, which worked well at first but could not maintain the cost profile the organization desired. Continuing to leverage the application meant an annual expense for additional Oracle storage. The organization wished to reduce this cost profile without rewriting the application.
The consumer electronics manufacturer decided to leverage the Gluent data virtualization solution3. Gluent was installed on the Oracle server and Hadoop. The application simply connected to Oracle without any changes whatsoever. Behind the scenes, the data and the work on the data were now spread between Oracle and Hadoop, significantly reducing the cost structure and eliminating the need for the organization to expand its Oracle footprint. The fact that the data was spread between Oracle and Hadoop was invisible to the application and its users, as depicted in Figure 2.
In order to leverage Gluent, the following steps were required.
Install Gluent: Gluent is installed on all data sources, in this case Hadoop and Oracle. When Oracle or Hadoop is called today, users are actually using the Gluent code installed on the server. Work can now be seamlessly offloaded to Hadoop as needed and is cost optimized. The install took less than one hour. Once again, it is critical to have access passwords available, and permissions must be set correctly.
Use the migration tool: Gluent has a built-in migration tool the consumer electronics manufacturer was able to leverage to handle the initial set-up. This automatically migrated some of the Oracle data to Hadoop while maintaining a single view of the data.
Query tuning: This is a continual effort that gets easier over time. When optimizations turn out not to be optimal, Gluent allows “Hints,” which are methods one can design to optimize specific scenarios.
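To make the offload and tuning decisions more concrete, the sketch below shows the kind of policy such steps involve. It is a conceptual illustration only, not Gluent’s actual interface, and the partition names and retention threshold are invented.

```python
# Conceptual sketch only; this is not Gluent's actual interface. It illustrates
# the kind of policy the migration and tuning steps involve: older, colder
# partitions are offloaded to Hadoop while recent, hot data stays in Oracle.
from datetime import date, timedelta

OFFLOAD_AGE = timedelta(days=90)  # hypothetical "cold data" threshold

partitions = [
    {"name": "SENSOR_2019_Q2", "high_value": date(2019, 6, 30),  "size_gb": 840},
    {"name": "SENSOR_2019_Q4", "high_value": date(2019, 12, 31), "size_gb": 910},
]

def should_offload(partition, today=date(2020, 3, 1)):
    """Offload partitions whose newest data is older than the threshold."""
    return today - partition["high_value"] > OFFLOAD_AGE

to_offload = [p["name"] for p in partitions if should_offload(p)]
print("candidate partitions for offload:", to_offload)  # ['SENSOR_2019_Q2']
```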
Figure 2. Gluent is Used to Extend Oracle Data to Hadoop Reducing Cost
The Oracle application still calls on and uses Oracle. Behind the scenes, Gluent, installed on Oracle, is able to leverage Hadoop for storage and compute power. The application itself did not require any changes, and the cost profile for data storage and computation is now reduced. The plan is not to change the Oracle application at all but, rather, to simply continue reducing the actual data and computations handled by Oracle. Fortunately, this also moved the application group in line with other internal groups that are using big data solutions on Hadoop. The Hadoop environment is familiar to those teams, and because Gluent exposes the data through Hadoop, those users can now leverage the Oracle application and related data without Oracle skills. This capability is due to two functionalities that are common in data virtualization.
Remote Function Execution: The ability of data virtualization to parse a query and have portions of it executed on another, remote system. In this instance, one can access Oracle and run the query on Hadoop. Likewise, one can access Hadoop and have a query run on Oracle. Where a query runs is subject to configuration and constraints such as cost and time.
Functional Compensation: The ability to perform an Oracle operation, specifically SQL, on Hadoop data, even though Hadoop does not natively support Oracle SQL.
Together, these two capabilities enable the manufacturer to leverage its Oracle experts without retraining. This benefit is in addition to reducing storage and computational costs.
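A toy sketch of how these two capabilities interact is shown below. It is not any vendor’s implementation; the rewrite table and routing rule are invented purely to illustrate the idea.

```python
# Toy illustration of the two capabilities above, not any vendor's code.
# Remote function execution: the query fragment runs on the system holding
# the data. Functional compensation: Oracle-specific expressions are
# rewritten into forms a Hadoop SQL engine understands.

ORACLE_TO_HADOOP = {
    # Hypothetical rewrite table: Oracle expression -> Hive-style equivalent.
    "SYSDATE": "current_timestamp()",
    "NVL(": "coalesce(",
}

def compensate(sql_fragment: str) -> str:
    """Rewrite Oracle-specific expressions for a Hadoop SQL engine."""
    for oracle_expr, hadoop_expr in ORACLE_TO_HADOOP.items():
        sql_fragment = sql_fragment.replace(oracle_expr, hadoop_expr)
    return sql_fragment

def route(sql_fragment: str, data_location: str) -> str:
    """Decide where a fragment executes and adapt it for that engine."""
    if data_location == "hadoop":
        return "Hadoop executes: " + compensate(sql_fragment)
    return "Oracle executes: " + sql_fragment

print(route("SELECT NVL(reading, 0) FROM sensor_data WHERE ts < SYSDATE", "hadoop"))
```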
Data virtualization platforms provide numerous benefits to organizations that add them to their architectures. This research examines five quantifiable advantages experienced by enterprises that adopt data virtualization: user access to data, copy data management, centralized governance, increased agility, and reduced infrastructure costs. Data virtualization platforms provide measurable advantages to the digitally transformed and contribute significantly to the return on investment (ROI) realized by the architecture.
Digitally transformed organizations expect reliable, data-driven decisions. Data validation platforms are able to test data ingestions and transformations within an enterprise. Many data validation platforms can test structured data, big data, BI Tools and ERP Systems, as well as non-standard data types, be it flat files or streaming. Data validation platforms can conduct regression tests and monitor production data. Likewise, data validation platforms are being used to support six different and critically important use cases. This research evaluates and ranks the various modern data validation platforms according to their architectural capabilities and ability to successfully meet popular use cases.
When selecting technologies for your data architecture, it is important to understand common use cases enabled by the technology. This research examines six use cases enabled by data validation and the architecture capabilities used to support each use case. To this end, we examine the validation of ingestions and transformations, data migrations, and cloud updates. The use cases for production monitoring, completeness of data sets, the ability to compare BI tools’ values, and data DevOps are also evaluated.
CONEXPO-CON/AGG and IFPE, North America’s largest construction trade shows, are held together every three years in Las Vegas. There were endless examples of the Industrial Internet of Things (IIoT) and edge computing. The conference featured 2.7 million net square feet of exhibits, over 2,300 exhibitors, 150 educational courses, and 130,000 attendees from 150 countries. The intelligence enabled by sensors continues to expand, as does the visibility of the results. The digital transformation of the construction site is underway, with multiple vendors offering supply chain and job site integrated views. The show closed one day early due to COVID-19.
TDWI Las Vegas was an educational and strategy event for 425 attendees, including 66 international attendees from 15 countries. Four major educational tracks featured over 50 full-day and half-day sessions, as well as exams available for credit toward becoming a Certified Business Intelligence Professional (CBIP). The educational tracks included Modern Data Management, Platform and Architecture, Data Strategy and Leadership, and Analytics and Business Intelligence. The Strategy Summit featured 14 sessions, including many case studies and a special session on Design Thinking. The exhibit hall featured 20 exhibitors and 6 vendor demonstrations and hosted lunches and cocktail hours.
Modern data virtualization platforms are increasingly becoming a critical part of data architectures for the digitally transformed. Many data virtualization platforms provide data as a service, data preparation, data catalog, logically centralized governance, the ability to join disparate data sets, and an extensive list of query performance optimizations. Likewise, data virtualization platforms are increasingly being used to support six different and critically important use cases. This research evaluates and ranks the various modern data virtualization platforms according to their architectural capabilities and ability to successfully meet popular use cases.
Playlist for Conference Whispers: Oracle’s Next Gen Cloud Analyst Summit 2020
ABSTRACT
Oracle’s 2020 Next Gen Cloud Analyst Summit allowed 40 industry analysts from North America to see executive product managers and customers talk about Oracle’s newest offerings. There were four breakout session topics: Data and Analytics, Cloud Infrastructure, Identity Management and Core Cloud Security, as well as Application Development and Integration. Attendees were able to hear the story of Oracle’s digital transformation and migration to the cloud in addition to four other Oracle customers’ stories.