Under the title “Hadoop and Oracle technologies on BI projects” Mark Rittman flew to The Netherlands on the 14th of July to visit the Oracle Usergroup Holland.
As I had obviously heard a lot about Hadoop, I never really did anything further with it and left it to a synaptic link to Gwen Shapira. This lack of action created a kind of threshold in the understanding of the technology. When I heard about this session I realized this would be the moment to take a step further. It turned out the be the first real talk that puts “Big Data” in the perspective it needs to be consumable and realistic.
In these current times where “The Internet of Things”, more and more social media and ever further digitization we are heading to a Big Data Disruption. This is both a conceptual as a very real thing if you take a moment to think about it. According to real world experience it is also not something “which will once be”, it is something which is actually here today!
On the technical side of things, data is captured in something that is called a “data reservoir” (or “data lake” or “data dump (yard)”). Compared with “regular” data storage, you can conclude that data-governance, or a data-structure, in a Big Data system is applied later We are used to apply this structure, this governance, beforehand, by applying data definition. Using Hadoop in combination with noSQL give you “schema on read” capabilities making quering of the Hadoop data reservoir possible.
Adding this structure later is harder! This leads to the following:
- Data is much easier to get into Hadoop then into a star-schema
- Data is much easier to get out of a star-schema then out of Hadoop
This could be one of the essential things to consider when thinking about engaging in a Big Data project!
As Tanel Poder concluded: “High value, high density data will remain in the Oracle database” which I think is a very true conclusion. In the end, the high value conclusions (or the engineering of Big Data results) will also happen within the Oracle database.
On the horizon is “Oracle Big Data Discovery” which will help with the time consuming and tedious work of sorting and interpreting raw data in the data reservoir. The use of ‘R’, as the data exploration tool of duty, is expected to be replaced by this discovery tooling, over time…
To sum up the concept of the first half of the presentation, to my taste:
- Hadoop changes business
- NoSQL scales business
- Oracle runs business
“It takes eons to list all names of the Buddha” nicely sums up the number of different applications that make up and are needed to execute a successful Big Data project.
Plus, “You’d better keep the 13 rules for relational databases close at hand“!
Part two of the evening was spent on mapping these concepts on actually tools, disclosing data through Hadoop to Oracle SQL and making actual use of Big Data. The exercise was completed by demos and illustrated by screenshots from the slides (link below).
A special word of warning goes out to the security aspect of Big Data, which is something to really pay close attention to. Kerberos authentication and apache Sentry are imperative things to implement in your Big Data environment.
All in all, this evening turned out to be 110% more informative and necessary as I expected when I embarked on the journey to Utrecht! Thank you for sharing, Mark!
Thanks to Piet de Visser for the nice quotes! And a great “hi there” to Klaas-jan Jongsma, René Kuipers and Marti Koppelmans.
If you want to work with Big Data on your Smal(ler) Device, please download the Big data light VM from OTN.
The link to the slides for anyone who wants to review the “extended remix”!