MySQL's architect discusses open source, database in a cloud, other IT issues
LinuxWorld - MySQL AB architect Brian Aker discusses a wide range of issues in an interview with Don Marti, editor of LinuxWorld.
When I look at your Web site, I see some pretty unusual storage engines for MySQL. You can use a Web site as back-end storage or even memcached for memory-backed storage. Do those engines have any practical application? Or are they more in the nature of sample code? Well, they actually have quite a bit of practical application. In the database industry, we've been hearing, especially from companies like IBM, for some time now, about federating sources of data. So taking data from different sites or, in this case, just different data strategies and putting them together. This is kind of a very early concept that Monty Widenius had when he first came up with MySQL, though it was more around analytics and transactional engines. What we've done is we've kind of spread that concept out.
The HTTP engine is an interesting one to look at. It was written as a piece of sample code, designed so that it can communicate with a Web site. It can fetch basic data through HTTP methods and then translate that into something you can use as SQL. OK, so what's the big deal? Well, one big deal of this has been the S3 engine by Mark Atwood. Here's an engine where you've got Amazon, who's got this large infrastructure of available storage, and what Mark has done is he's made that available through an engine.
For instance, I know one of his early cases right now is a real estate agency that is collecting ongoing statistics and data about sales. And instead of creating a local repository of terabytes or petabytes of data storage, they've architected it to work just like normal SQL. They insert it into their database. But instead of having to store that data locally, it's actually placed into S3 for long-term archival. So they can take a data set that they may not really need access to all that often and put it into an environment where they don't have to pay for anything but the actual storage costs. When you work out the numbers between storing that data locally, with more sys admins and more infrastructure, or just storing it in S3, the numbers lean pretty strongly toward using the S3 servers.
And just to give you an idea of some things that can be done, we were at the MySQL users' conference and I was explaining the HTTP engine to a group at the bar there. And this one guy is watching, and he talks to me a little bit, and he goes off. We're sitting around the bar for like another hour, and he comes back and says, "Look at this." He had been to Google Spreadsheets, and what he had done was written a RESTful URL and placed it into the HTTP engine and then started collecting data off of a Google spreadsheet and displaying it inside the database. So every time he did a SELECT on the database, it was actually pulling data from this Google spreadsheet that he had connected to. So that whole concept of taking different data sources and federating them has some very practical applications we see.
In the case of the memcache engine, it's just about trying to make an architecture which is common even simpler to use. If you were to go and look at CNET, Slashdot, Wikipedia, LiveJournal -- all these Web sites use memcache with some kind of a coordinated effort with MySQL. The usual pattern is that I have to insert data into my database, and then I have to insert that data in the memcache. The idea was instead to make it simple, so that you only deal with the database for your inserts. Any kind of cache coherency problem goes away because, once the data is inserted in the database, the engine automatically blows out the cache and inserts the data directly inside of memcache. And you can always do this stuff in your application layer. But can we make it a lot simpler for users to grow these types of architectures without having to write a lot of pieces themselves? So you could say they're niche, but especially with the Web method, I don't think that we'll see that being nearly as niche a few years from now, especially as we watch more and more Web sites becoming API enabled.
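As a rough illustration of the write-through pattern described above, here is a minimal Python sketch. It is not the memcache engine's code: an in-memory SQLite database stands in for MySQL, a plain dict stands in for memcached, and the `WriteThroughStore` class and `articles` table are hypothetical names chosen for the example.

```python
import sqlite3


class WriteThroughStore:
    """Hypothetical sketch: one insert path that updates database and cache."""

    def __init__(self):
        # Assumption: in-memory SQLite stands in for MySQL,
        # and a dict stands in for memcached.
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT)"
        )
        self.cache = {}

    def insert(self, article_id, title):
        # Write to the database first, then refresh the cache entry,
        # so the cache never holds a value the database lacks.
        with self.db:
            self.db.execute(
                "INSERT OR REPLACE INTO articles (id, title) VALUES (?, ?)",
                (article_id, title),
            )
        self.cache[article_id] = title

    def get(self, article_id):
        # Serve from the cache when possible; fall back to the database.
        if article_id in self.cache:
            return self.cache[article_id]
        row = self.db.execute(
            "SELECT title FROM articles WHERE id = ?", (article_id,)
        ).fetchone()
        if row is None:
            return None
        self.cache[article_id] = row[0]
        return row[0]
```

The caller only ever talks to `insert` and `get`, which is the simplification being described: the application no longer has to remember to update the cache separately after each database write.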
You mentioned a RESTful interface. Can you explain how the concept of REST and the way a SQL database expects to look at data match up? Well, that's interesting, and frankly, we only have a part of this actually solved today. So what we mean by RESTful is that you generate a URL, the same thing as you would normally type inside your browser. But instead, the server would respond back with some kind of XML, some kind of data source which is immediately easy to parse and easy to use. So, unlike say, SOAP or XML-RPC, it's a really, really simple API. And it's spread pretty quickly among Web sites to generate their Web pages in very simple RESTful ways, which makes it really simple for someone to write an application to extract content from a Web site. And at the same time as you extract content from Web sites, we can take that same data source and express it in a kind of a table-like format for users to make use of.
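To make the "table-like format" idea concrete, here is a small Python sketch that turns an XML response into rows. It is only an illustration, not how the HTTP engine is implemented: the XML body, the `listing` element and the `xml_to_rows` helper are all hypothetical, and a real client would obtain the body with an HTTP GET rather than a hard-coded string.

```python
import xml.etree.ElementTree as ET

# Hypothetical response body; a real client would fetch it with an HTTP GET.
SAMPLE_XML = """<listings>
  <listing><city>Seattle</city><price>450000</price></listing>
  <listing><city>Portland</city><price>380000</price></listing>
</listings>"""


def xml_to_rows(xml_text, row_tag):
    """Turn each <row_tag> element into a dict mapping child tag to text."""
    root = ET.fromstring(xml_text)
    return [
        {child.tag: (child.text or "").strip() for child in row}
        for row in root.iter(row_tag)
    ]


# Each dict is one "row"; its keys play the role of column names.
rows = xml_to_rows(SAMPLE_XML, "listing")
for row in rows:
    print(row)
```

Once the data is in this row-and-column shape, presenting it to a SQL layer is mostly a matter of mapping the keys to column definitions.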
It seems like a lot of the core work of REST is just in getting your HTTP status codes right and using GET and POST where appropriate. And then once you create the ability to put your data into an alternate template, a chunk of XML data instead of HTML, then you've taken what was just a Web application and turned it into something that becomes a Web API. Yes, exactly. And for the database, it turns it into a data engine that we can extract data from. And the good thing is that for someone who's standing at the development side, sitting there wanting to extract data, it is in the best interest for Web sites generally to provide data. So these Web sites are coming up with these APIs and extending them constantly now. So more and more data environments are being made available to the database every day.
Amazon doesn't just offer storage. It has another Web API where it's offering computing power. Yes, the "Elastic Compute Cloud" -- it's easier just to remember EC2. So, yes, that's one of the other environments. What they're doing there is they're taking that whole idea of timesharing that we've been listening to since the 1970s and making it very practical. So you can say if I need more computing power, if I need five more databases there, because I've got this much data I need to be reading, and I have this many coordinated applications, you can take MySQL on what's called an [Amazon Machine Image] running Linux and MySQL, throw it into the EC2 cloud and, as you need more computing nodes, you can just request more computing nodes. And you can keep the computing nodes around for as long as you need them.
What about the implications of this for software build and test? It sounds like a great way to get an on-demand compile farm or a farm of machines to run all your tests. Exactly. For those of us in the side of the industry where we're building and testing software, it makes complete sense. So instead of building on our own infrastructures, let's instead just throw up AMIs as we need them to test our own software. But you're seeing it replicated throughout the industry. You're seeing petroleum companies doing exploration calculations. You're seeing film companies doing rendering on it. So, for us in the database world, it's really nice for when we have databases which are in more analytic use. And you can just take more and more analytic nodes and make them available. And the thing with the S3 engine combined with EC2 is that the cost of transferring data between EC2 and S3 doesn't exist. So if you have MySQL nodes running inside EC2 and your back end storing the data inside S3, you have a storage infrastructure where you're not paying for any bandwidth. And yet you still have access to all that data for analytics.
What other opportunities do you see for companies to move software to this kind of infrastructure? Look at what it cost to set up a Web site in 1995. What does it cost to set up a Web site today? When we were working on Slashdot, one of the things we had to do was keep providing more Web nodes. If we had a rush of traffic that came in, we had to get more nodes up and running, so frequently we had probably a third of our hardware sitting idle just for those times when it needed to be there to take an ongoing traffic rush. The nice thing is if you were building a modern Web application, you can actually just build it inside of the EC2 cloud. You could have your databases sitting there running inside of it, and your Web servers all communicating with one another. As your traffic load increases, you could just bring up more AMIs for your front-end application servers and be good to go.
I think it's incredibly exciting, because the cost of keeping infrastructure around is pretty expensive. And we're getting really smart with virtualization. That's why you see even virtualization inside corporations; that's why you see so much being done right now with KVM inside the Linux kernel, Xen and on the commercial side of things Parallels and, of course, VMware.
In your internal development at your company, I understand you use BitKeeper. Are you following some of the other revision-control systems out there? We follow them pretty much constantly. This is probably one of our favorite internal topics of debate. BitKeeper is an awesome tool. Larry McVoy cracked a nut with its design that really changed revision control and made people really think differently about revision control. And what's been interesting to see is what is happening in the open-source world as far as people adopting that same methodology and trying to make use of it.
In our view today, BitKeeper is still the strongest player and much stronger than the three contenders right now -- Bazaar-NG, Mercurial and Git. And Git's only recent. And they're not quite there just yet. And it's interesting to see who can out-innovate whom first. Can Larry and BitKeeper keep out-innovating the open-source guys, or will the open-source guys pass him up? It's interesting to watch. But I think it's making all the different products in that market better in the end, because they all have to compete with one another.
So does your internal development model involve a lot of disconnected development and major sync-ups? Of course. We have developers in, I think, 32 countries. So we are a modern distributed company that has distributed development going on. There is no way possible to make a CVS-like environment work for us in any way that wouldn't just slow us down.
Distributed development, the ability for groups to do commits locally, to revise commits, to then synchronize commits among collections of developers, and then to pass that upstream, that is core and critical to how we do development. We couldn't do development nearly as efficiently or nearly as quickly without that ability.
X.org recently moved over to Git, but they're using it in a much simpler model than the way that the Linux developers are using it. It's been interesting to just see which groups have been taking on which source control systems, and exactly how they're making use of them. I mean, any system -- even Git, BitKeeper or whatever -- can still be used in a simplified manner if you want to. But, really, the flexibility of being able to push around commit patches and share those commit patches allows groups to work in the way that they want to work.
And it's a lot quicker to set up than something that's based on a server. Yes. If you look at the different models, BitKeeper has an SSH model, and that's really nice for ones who have server setups. I think one of the strong pieces that pulls for Mercurial, is that Mercurial early on used an HTTPS model for committing. It also has SSH. So, for instance, you can just set up a public Apache server and throw a CGI in it. And then, it can handle being itself a node in a repository group. And that was one bit of innovation I thought was pretty interesting that they did. But if you're working inside a company, perhaps SSH works better for you. If you're working in a very large distributed environment, perhaps HTTPS is a better method.
On your blog recently, you posted a fairly long item about economic motivations of open-source software. And you mentioned that there's really no difference between community open source and commercial open source. A lot of the experts out there seem to be saying that there is a big difference. Well, I think that you have to look first at the motivation for why something was open-sourced. What exactly was the reason for it? Developers do this to share. Companies have that same reason. Now a developer may be interested in sharing because it's better for their career. They get more recognition. They're more likely to be hired. Or it could be that they are interested in getting more testing done. On the company side, you can see the same things. They're interested in getting their product out, creating a larger channel. So the motivational factors are the same.
For the person who is purchasing it, the question at the end of the day has to be, what happens if the company fails? What happens if the developer decides they're no longer interested in the software? I think that the motivations … then also come back into play. Namely, if the software was that important, usually someone else will spring up to support the software, being that it's open source. Or a company can bring in-house a product that's a requirement of their business and support it themselves. So I think that a project that is run by a developer and one run by a company are … about the same as far as whether it's commercial or actually noncommercial.
Now an individual developer might be participating in order to improve their skills and to show off and make those skills visible to a potential employer or potential client. Which of those effects do you think is dominant? Or do you think they balance out? I have asked this question of a number of people. You'll notice that I left out one answer to this, which is brought up every so often. Which is, "Let's make the world a better place." I have found that that is usually not actually that common. And yet that is the motivator that I see mentioned most often.
Between the other two, I have found when asking that question, a lot of it had to do with that it was simpler to get it out and have other people extend it and make use of it along with the original developer. And often, that comes back more often from a case of "Oh, I discovered a bug for you." There's a myth sometimes in open source that as soon as you open source something, there's a stampede of people lining up at the door to hand you patches. That's just not the case. Even in the largest of large projects, that doesn't actually happen that often. And the pieces that come in, it's very hard to balance a new piece of code vs. a piece of code you're already writing.
So I think that for most developers, and what I hear most often, is that the real motivator is they get back components which are in the way of bug fixes. At a Usenix group about a year ago, there was a discussion on that. And Mike Olson, now at Oracle, at the time doing Sleepycat, to him the strongest motivator was that he got word of mouth and channel. He got his product more well known. And from that, he was able to bring in more customers. So I think it's really a coin flip between looking at the two.
Each project has a different reason or a different motivator for being out there and used, which goes back to even licensing. When we look at BSD vs. GPL, the strongest piece that I see there is if somebody's going BSD, the entire goal is complete ubiquity. A GPL play is not going for complete ubiquity. It's going for a large base of users. It's going back to try to extend more open source. And generally, there's more of a support leverage mechanism inside of the GPL. But for the BSD guys, they're going after complete ubiquity. So organizations that are more likely to need that ubiquity go after a BSD license. Like, say, for something like SQLite, Richard Hipp's database. Anyway, I think that the licensing will stack up and will kind of force the answer to your original question of, hey, which one is more of the motivator.
So if the BSD license or other nonreciprocal software licenses, like the Apache license, are supposed to promote ubiquity, how come MySQL is the ubiquitous database on every Web-hosting service, on every distribution? It seems like it's all over the place. There's ubiquity of use, and there's ubiquity of embedding. If you've got a library, for instance, and you want that library to be scattered out to the world and used in as many places as possible, that's a different market than an actual application that somebody would actually want to make use of. Look at something like MySQL. Look at the GPL desktops. Look at the kernel. There's a lot of ubiquity -- they're more of an application than, say, a library or a tiny little piece that makes up an actual product. And I think the difference is how far toward complete ubiquity you actually need to go.
I'm sure you get this question a lot. But let's say "wannabe open-source developer" is in college and wants to make a name for himself and make a living doing software. What kind of projects do you see out there that are available for people to take on now? There's a couple of different things that I would be motivated by. One, we have the guys who want to be the next big Web site. They want to be Facebook or something like that. And that's an entirely different model that they would go after. And I think it requires kind of a different kind of entrepreneurial kind of feeling to want to do that. For other developers, I think that generally scratching your own needs is one of the best things that you can do. Find something that you need. Find something that you see a need for and that you're going to get passionate about it and go do it. I think that's the best way that you can really fall in love with what you're doing, but at the same time, also find the energy to do it.
When you pick to do open source, and you probably picked to do it in college, that means you're going to give up a few Friday and Saturday nights to work on what you're actually doing. And if you're sitting around later, asking yourself, "Why did I spend my Friday and Saturday night doing that?" you had better really love it.
It was suggested to me recently by a friend who is a little more business-oriented that perhaps open-source developers should actually be looking to fill holes in companies' lineups. So go figure out what piece is missing right now, something you see companies needing or wanting in particular, and target those holes. But for a college student, I think that they should actually go for love and not necessarily money. I think they'll find that when they look back on their efforts, they'll be a little bit happier with the time they spent.


