I will be live blogging the SQLPASS Summit keynote again today. Today’s keynote will be delivered by community favorite Dr. David DeWitt.
Rick Heiges opens us up and introduces a special musical number by Rob Farley and Buck Woody. It was hilarious, with a special lyrical shout-out to Paul Randal. Rick is now walking us through the leadership changes in the PASS organization, with a very special thank you to past President Wayne Snyder as Rushabh Mehta becomes the immediate past President.
Rick is announcing upcoming events with SQLRally in Dallas on May 10-11. He also announces the next Summit in Seattle November 6-9. He also tells everyone that all Summit attendees will receive an e-book copy of the MVP Deep Dives book and to keep an eye on their email.
Dr. David DeWitt enters the stage and gets off to a great start by explaining why his wife is not here to watch him speak. There are lots of laughs. David starts his talk on big data and shares some size statistics for the systems behind larger web sites like Facebook and other social media. The data sizes are astounding.
David is explaining the NoSQL movement and points out that it does NOT mean "no SQL" but "not only SQL". He wants us to think about large data and a mix of systems to support it. We may have one entry point or front end to get the data, but on the back end some data might be in SQL and some might be in Hadoop. He is trying to get us to think about the data and to recognize that not all data is relational; some of it is better suited to other storage systems.
David explains that NOSQL is not a paradigm shift and that RDBMS are still the best way to store data efficiently. However, some data like unstructured data does not work best in an RDBMS. He plans on talking about Hadoop and how it works.
David is explaining the Hadoop file system, called HDFS. HDFS does not replace the Windows file system or NTFS but sits on top of it. The blocks are stored and replicated by a factor of three. It sounds like RAID 5 but spread across multiple nodes with separate storage systems. The first copy is written to the node where the transaction originated, the second to a node in the same rack, and the third to a node in a different rack. The replication of this data is handled by a name node or primary node (which also has a backup node). It monitors all the other nodes with heartbeats and decides how to distribute the data among the nodes.
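To make the placement rule concrete for myself, here is a minimal Python sketch of it as I understood it from the talk. This is not Hadoop code: the function name `place_replicas` and the `nodes_by_rack` dictionary are my own inventions, and I am assuming the writer's rack has at least one other node and that there is at least one other rack. Real HDFS placement policies may differ in detail.

```python
import random


def place_replicas(writer, nodes_by_rack, rng=random):
    """Pick three nodes for one block, following the rule as described in the talk.

    writer        -- hypothetical name of the node where the write originated
    nodes_by_rack -- hypothetical mapping of rack name -> list of node names
    """
    # Find the rack that holds the writing node.
    writer_rack = next(
        rack for rack, nodes in nodes_by_rack.items() if writer in nodes
    )
    # Candidates for the second copy: other nodes in the same rack.
    same_rack = [n for n in nodes_by_rack[writer_rack] if n != writer]
    # Candidates for the third copy: any node in a different rack.
    other_racks = [
        n
        for rack, nodes in nodes_by_rack.items()
        if rack != writer_rack
        for n in nodes
    ]
    return [writer, rng.choice(same_rack), rng.choice(other_racks)]


cluster = {"rack1": ["a", "b", "c"], "rack2": ["d", "e"]}
replicas = place_replicas("a", cluster)
```

The point of the third copy landing in another rack is that losing a whole rack still leaves one replica reachable.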
David is now explaining how Hadoop handles failures and that it was designed to expect failures. It does use checksums for reads and writes, but expects that hardware and software failures will occur. When a failure occurs on a node then the name node finds the blocks that are missing and replicates them to other nodes to maintain that factor of three. I can’t help but think this sounds like RAID except it is supported by replication as opposed to multiple writes across disk. With a factor of three I envision it like a RAID 5 on top of a RAID 5, but all data is written by replication instead of multiple writes on a single disk array.
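Here is a rough Python sketch of that recovery idea, just to illustrate the bookkeeping. Everything in it is hypothetical: `re_replicate`, the `block_locations` mapping, and the way it picks the first available live node are my simplifications, not how the Hadoop name node actually chooses targets.

```python
def re_replicate(block_locations, live_nodes, factor=3):
    """Return a new block -> nodes map with lost replicas restored.

    block_locations -- hypothetical mapping of block id -> list of node names
    live_nodes      -- set of nodes that are still sending heartbeats
    factor          -- desired number of copies per block
    """
    live_sorted = sorted(live_nodes)
    repaired = {}
    for block, nodes in block_locations.items():
        # Copies that survived the failure.
        survivors = [n for n in nodes if n in live_nodes]
        # Live nodes that do not already hold this block.
        candidates = [n for n in live_sorted if n not in survivors]
        needed = max(0, factor - len(survivors))
        # Simplification: take the first candidates in name order.
        repaired[block] = survivors + candidates[:needed]
    return repaired


locations = {"blk1": ["a", "b", "c"], "blk2": ["b", "d", "e"]}
after_failure = re_replicate(locations, live_nodes={"a", "c", "d", "e"})
```

With node `b` gone, each block in this toy example ends up with three copies again, all on live nodes.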
David is explaining how Hadoop finds the data when you request it, since you don’t know where it is stored. David is moving on to how MapReduce works, with an animation he jokes took him 6 hours to come up with. The audience loves the animated explanation, with some clapping. He shows how map tasks find the data across nodes and then hand it off to the reduce step, which takes the data from the multiple nodes and reduces it to a single output.
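For anyone who did not see the animation, the classic word count example gives the flavor. This is a single-process Python sketch with made-up function names (`map_task`, `shuffle`, `reduce_task`, `run_job`); it is not the Hadoop API, and in a real cluster the map tasks would run on the nodes that hold the data.

```python
from collections import defaultdict


def map_task(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped):
    """Group the values from all map outputs by key."""
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_task(key, values):
    """Reduce: collapse all values for one key into a single total."""
    return key, sum(values)


def run_job(chunks):
    """Run map over every chunk, then reduce the grouped results."""
    mapped = [map_task(chunk) for chunk in chunks]
    return dict(reduce_task(k, v) for k, v in shuffle(mapped).items())


# Each string stands in for a block stored on a different node.
counts = run_job(["big data big", "data systems"])
# counts == {"big": 2, "data": 2, "systems": 1}
```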
Now we are hearing that after Hadoop came out, Facebook and Yahoo started using it. However, they came to different conclusions on the language. Facebook came up with something very SQL-like, but Yahoo came up with something more procedural. David brings up a slide with a lot of writing and jokes that it is not meant to be read and that he will not be using ZoomIt. The crowd loves the comment, with lots of laughs given how little it was used in other keynotes this week. David now points out that of the 150k jobs Facebook runs, only 500 are MapReduce jobs and the rest are Hive SQL. Now we are seeing how Hive tables are designed and that they are partitioned by a particular attribute.
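My own way of picturing partitioning by an attribute is the small Python sketch below. It is only an illustration: `partition_rows`, `scan_partition`, and the sample rows are invented for this post and have nothing to do with how Hive is actually implemented.

```python
from collections import defaultdict


def partition_rows(rows, attribute):
    """Split rows into separate buckets keyed by one attribute's value."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[attribute]].append(row)
    return dict(partitions)


def scan_partition(partitions, value):
    """Read only the bucket for one attribute value, skipping all others."""
    return partitions.get(value, [])


# Hypothetical sample data, partitioned by country.
rows = [
    {"country": "US", "clicks": 3},
    {"country": "CA", "clicks": 5},
    {"country": "US", "clicks": 7},
]
by_country = partition_rows(rows, "country")
us_rows = scan_partition(by_country, "US")
```

A query that filters on the partitioning attribute only has to touch one bucket, which is where the benefit comes from.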
Now we are seeing how Hive relates to parallel data warehousing. We can see how Hive is great for unstructured data that is not related, but how SQL is much better with relational data due to a common schema and partitioning method. Now David is talking about putting the two things together and connecting the universe. We see the performance difficulties in getting data from both worlds. He explains the Sqoop approach and its challenges, and says there must be a better way. Sqoop moves the data from one world to the other in order to query it.
David asks what would happen if we don’t move the data, but instead put a management system between the two that understands how to get the data from both systems separately. This is something he is working on in his lab, as it becomes clear that we will be living in a world with both types of data and a need to get information from both and relate it.
David is wrapping things up with a recap and driving home the major points. The biggest one is that SQL is not going away, and neither is Hadoop or other unstructured data systems, so we need to work with both.
That’s the end of the last SQLPASS Summit 2011 keynote. The crowd is wild about David, with a huge standing ovation!