MySQL's architect discusses open source, database in a cloud, other IT issues

MySQL AB architect Brian Aker discusses a wide range of issues in an interview with Don Marti, editor of LinuxWorld.

When I look at your Web site, I see some pretty unusual storage engines for MySQL. You can use a Web site as back-end storage, or even memcached for memory-backed storage. Do those engines have any practical application? Or are they more in the nature of sample code?

Well, they actually have quite a bit of practical application. In the database industry we've been hearing for some time now, especially from companies like IBM, about federating sources of data: taking data from different sites -- or, in this case, just different data strategies -- and putting them together. This is a very early concept that Monty Widenius had when he first came up with MySQL, though then it was more about combining analytic and transactional engines. What we've done is spread that concept out.

The HTTP engine is an interesting one to look at. It was written as a piece of sample code, designed to communicate with a Web site. It can fetch basic data through HTTP methods and then translate it into something you can query with SQL. OK, so what's the big deal? Well, one big application of this has been the S3 engine by Mark Atwood. Amazon has this large infrastructure of available storage, and what Mark has done is make that storage available through a MySQL engine.

For instance, one of his early cases is a real estate agency that is collecting ongoing statistics and data about sales. Instead of creating a local repository of terabytes or petabytes of storage, they've architected it just like a normal SQL application: they insert into their database, but instead of being stored locally, the data is placed into S3 for long-term archival. So they can take a data set they may not need access to all that often and put it into an environment where they don't pay for anything but the actual storage. When you work out the numbers between storing that data locally -- more sys admins, more infrastructure -- and just storing it in S3, the numbers lean strongly toward S3.
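The trade-off he describes can be sketched as a back-of-the-envelope cost model. All the figures below are invented for illustration -- they are not Amazon's or any data center's actual pricing:

```python
# Back-of-the-envelope comparison of local archival storage vs. S3.
# Every number here is an assumption for illustration only.

def monthly_cost_local(gb, cost_per_gb=0.25, admin_overhead=2000.0):
    """Local storage: per-GB hardware amortization plus a fixed monthly
    overhead for sys admins and infrastructure."""
    return gb * cost_per_gb + admin_overhead

def monthly_cost_s3(gb, cost_per_gb=0.15):
    """S3 archival: pay only for the bytes actually stored."""
    return gb * cost_per_gb

archive_gb = 10_000  # roughly 10 TB of sales history
print(monthly_cost_local(archive_gb))
print(monthly_cost_s3(archive_gb))
```

With these (made-up) numbers, the fixed overhead of running storage yourself dominates until the archive gets very large, which is the shape of the argument Brian is making.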

And just to give you an idea of some things that can be done: we were at the MySQL users' conference, and I was explaining the HTTP engine to a group at the bar there. This one guy is watching, talks to me a little bit, and goes off. We're sitting around the bar for another hour, and he comes back and says, "Look at this." He had gone to Google Spreadsheets, written a RESTful URL, placed it into the HTTP engine, and started collecting data off a Google spreadsheet and displaying it inside the database. Every time he did a SELECT on the database, it was actually pulling data from the Google spreadsheet he had connected to. So that whole concept of taking different data sources and federating them has some very practical applications, we see.
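The interesting semantics of that demo are that the table holds no local copy: every SELECT goes back to the source. A minimal sketch of that behavior, with hypothetical names and a stub fetch function standing in for the HTTP round trip to the spreadsheet:

```python
# Sketch of a federated "table" whose rows live at a remote URL: every
# SELECT re-fetches, so queries always reflect the live document.
# The fetcher is a stub standing in for an HTTP GET; these names are
# hypothetical, not actual MySQL HTTP-engine syntax.

class HttpTable:
    def __init__(self, url, fetch):
        self.url = url      # RESTful URL of the remote data source
        self.fetch = fetch  # callable performing the HTTP round trip

    def select(self):
        # No local copy: each query goes back to the source.
        return self.fetch(self.url)

calls = 0
def fake_fetch(url):
    global calls
    calls += 1
    return [("widgets", 10 + calls)]  # remote data changes between reads

t = HttpTable("https://spreadsheets.example/feed", fake_fetch)
first, second = t.select(), t.select()
print(first, second)  # two SELECTs, two fetches, different rows
```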

In the case of the memcache engine, it's about trying to make a common architecture even simpler to use. If you look at CNET, Slashdot, Wikipedia, LiveJournal -- all these Web sites use memcached in some kind of coordinated effort with MySQL. Normally, I have to insert data into my database and then insert that same data into memcached. The idea was instead to make it simple: you deal only with the database for your inserts, and any cache-coherency problems go away, because once the data is inserted into the database, it automatically blows out the stale cache entry and places the data directly inside memcached. You can always do this stuff in your application layer. But can we make it a lot simpler for users to grow these types of architectures without having to write a lot of the pieces themselves? So you could say these engines are niche, but especially with the Web method, I don't think we'll see that being nearly as niche a few years from now, as we watch more and more Web sites becoming API-enabled.
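The application-layer version of this pattern -- the one Brian says you can always write yourself -- looks roughly like the sketch below, with plain dicts standing in for the MySQL table and the memcached pool:

```python
# Application-layer write-through caching, the pattern the memcache
# engine automates. Plain dicts stand in for the MySQL table and the
# memcached pool; real code would use a DB driver and a memcached client.

database = {}
cache = {}

def insert(key, value):
    """One code path for writes: store in the database and refresh the
    cache in the same step, so the two can never disagree."""
    database[key] = value
    cache[key] = value          # blow out the stale entry, store the new one

def select(key):
    """Reads hit the cache first and fall back to the database."""
    if key in cache:
        return cache[key]
    value = database.get(key)
    if value is not None:
        cache[key] = value      # warm the cache for the next reader
    return value

insert("story:42", "Slashdot front-page item")
print(select("story:42"))
```

The point of pushing this into a storage engine is exactly that the two-step write above disappears from application code: the insert and the cache update become a single operation.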

You mentioned a RESTful interface. Can you explain how the concept of REST and the way a SQL database expects to look at data match up?

Well, that's interesting, and frankly, we only have part of this solved today. What we mean by RESTful is that you generate a URL, the same thing you would normally type into your browser, but the server responds with some kind of XML -- a data source that is immediately easy to parse and easy to use. So unlike, say, SOAP or XML-RPC, it's a really, really simple API. It has spread pretty quickly, with Web sites generating their pages in very simple RESTful ways, which makes it really simple for someone to write an application that extracts content from a site. And once you can extract content from a Web site, we can take that same data source and express it in a table-like format for users to make use of.
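That translation step -- XML response in, table-like rows out -- can be sketched in a few lines. The XML payload here is an invented sample of what a simple API might return; a real engine would fetch it over HTTP rather than hold it in a string:

```python
# Turning a RESTful XML response into table-like rows. The payload is an
# invented sample; a real engine would fetch it over HTTP.
import xml.etree.ElementTree as ET

RESPONSE = """\
<listings>
  <listing><city>Portland</city><price>289000</price></listing>
  <listing><city>Seattle</city><price>412000</price></listing>
</listings>"""

def rows_from_xml(payload, record_tag):
    """Each record element becomes one row: column name -> text value."""
    root = ET.fromstring(payload)
    return [{child.tag: child.text for child in rec}
            for rec in root.iter(record_tag)]

rows = rows_from_xml(RESPONSE, "listing")
print(rows)
# [{'city': 'Portland', 'price': '289000'}, {'city': 'Seattle', 'price': '412000'}]
```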

It seems like a lot of the core work of REST is just getting your HTTP status codes right and using GET and POST where appropriate. And then once you create the ability to put your data into an alternate template -- a chunk of XML instead of HTML -- you've taken what was just a Web application and turned it into a Web API.

Yes, exactly. And for the database, it turns that site into a data engine we can extract data from. The good thing is that, for someone sitting on the development side wanting to extract data, it is generally in Web sites' best interest to provide data. So these sites are coming up with these APIs and extending them constantly now, and more and more data environments are being made available to the database every day.
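The "alternate template" idea is small enough to show directly: the same data rendered once for browsers and once for programs. The record layout is invented for illustration:

```python
# The same data rendered two ways: HTML for browsers, XML for programs.
# Swapping the template is what turns a Web page into a Web API.

def render_html(stories):
    items = "".join(f"<li>{title}</li>" for title in stories)
    return f"<ul>{items}</ul>"

def render_xml(stories):
    items = "".join(f"<story>{title}</story>" for title in stories)
    return f"<stories>{items}</stories>"

stories = ["MySQL 5.1 released", "EC2 leaves beta"]
print(render_html(stories))
print(render_xml(stories))
```

Nothing about the data changes between the two renderings; only the wrapper does, which is why adding an XML template to an existing Web application is such a cheap way to get an API.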

Amazon doesn't just offer storage. It has another Web API where it's offering computing power.

Yes, the Elastic Compute Cloud -- it's easier just to remember EC2. That's one of the other environments. What they're doing there is taking that whole idea of timesharing we've been hearing about since the 1970s and making it very practical. If I need more computing power -- say, five more databases, because I've got this much data to read and this many coordinated applications -- I can take what's called an Amazon Machine Image [AMI] running Linux and MySQL, throw it into the EC2 cloud, and as I need more computing nodes, just request them. And you can keep the nodes around for as long as you need them.

What about the implications of this for software build and test? It sounds like a great way to get an on-demand compile farm, or a farm of machines to run all your tests.

Exactly. For those of us on the build-and-test side of the industry, it makes complete sense. Instead of building on our own infrastructure, we can just throw up AMIs as we need them to test our own software. But you're seeing it replicated throughout the industry: petroleum companies doing exploration calculations, film companies doing rendering. For us in the database world, it's really nice for databases in more analytic use, where you can just bring up more and more analytic nodes as you need them. And the thing with the S3 engine combined with EC2 is that there is no cost for transferring data between EC2 and S3. So if you have MySQL nodes running inside EC2 and your back end storing the data inside S3, you have a storage infrastructure where you're not paying for any bandwidth, and yet you still have access to all that data for analytics.

What other opportunities do you see for companies to move software to this kind of infrastructure?

Look at what it cost to set up a Web site in 1995 versus what it costs today. When we were working on Slashdot, one of the things we had to do was keep providing more Web nodes. If a rush of traffic came in, we had to get more nodes up and running, so we frequently had probably a third of our hardware sitting idle, just for the times it needed to be there to absorb a traffic rush. The nice thing is, if you were building a modern Web application, you could just build it inside the EC2 cloud. You could have your databases running inside it and your Web servers all communicating with one another, and as your traffic load increases, you could just bring up more AMIs for your front-end application servers and be good to go.
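The "third of our hardware sitting idle" point is really a capacity-planning calculation. A sketch with invented numbers shows the difference between provisioning for the peak and scaling with traffic:

```python
# Fixed fleets must be sized for the peak; elastic fleets follow traffic.
# Traffic figures and per-node capacity are invented for illustration.
import math

def nodes_needed(requests_per_sec, per_node_capacity=100):
    """Smallest whole number of nodes that can carry the load."""
    return max(1, math.ceil(requests_per_sec / per_node_capacity))

hourly_traffic = [120, 150, 900, 150, 120]  # one traffic rush in the middle

peak_nodes = nodes_needed(max(hourly_traffic))       # fixed fleet size
elastic = [nodes_needed(t) for t in hourly_traffic]  # fleet size per hour

print(peak_nodes)    # 9 nodes provisioned around the clock
print(sum(elastic))  # node-hours actually needed: 2+2+9+2+2 = 17
```

Over these five hours the fixed fleet burns 45 node-hours to do 17 node-hours of work; the rest is exactly the idle hardware Brian describes keeping around for Slashdot.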

I think it's incredibly exciting, because the cost of keeping infrastructure around is pretty high. And we're getting really smart with virtualization. That's why you see virtualization even inside corporations; that's why you see so much being done right now with KVM inside the Linux kernel, with Xen, and on the commercial side with Parallels and, of course, VMware.

In your internal development at your company, I understand you use BitKeeper. Are you following some of the other revision-control systems out there?

We follow them pretty much constantly. This is probably one of our favorite internal topics of debate. BitKeeper is an awesome tool. Larry McVoy cracked a tough nut with its design, one that really changed revision control and made people think differently about it. And what's been interesting to see is what is happening in the open-source world as people adopt that same methodology and try to make use of it.

In our view today, BitKeeper is still the strongest player, and much stronger than the three contenders right now -- Bazaar-NG, Mercurial and Git. Git is only recent, and they're not quite there just yet. It's interesting to see who can out-innovate whom. Can Larry and BitKeeper keep out-innovating the open-source guys, or will the open-source guys pass them up? It's interesting to watch. But I think it's making all the products in that market better in the end, because they all have to compete with one another.

So does your internal development model involve a lot of disconnected development and major sync-ups?

Of course. We have developers in, I think, 32 countries. So we are a modern distributed company with distributed development going on. There is no way to make a CVS-like environment work for us that wouldn't just slow us down.

Distributed development, the ability for groups to do commits locally, to revise commits, to then synchronize commits among collections of developers, and then to pass that upstream, that is core and critical to how we do development. We couldn't do development nearly as efficiently or nearly as quickly without that ability.

X.org recently moved over to Git, but they're using it in a much simpler model than the way the Linux developers are using it. It's been interesting to see which groups have been taking on which source-control systems, and exactly how they're making use of them. Any system -- Git, BitKeeper or whatever -- can still be used in a simplified manner if you want to. But, really, the flexibility of being able to push commits around and share them allows groups to work the way they want to work.

And it's a lot quicker to set up than something that's based on a server.

Yes. If you look at the different models, BitKeeper has an SSH model, and that's really nice for those who have server setups. I think one of the strong draws for Mercurial is that, early on, it used an HTTPS model for committing; it also has SSH. So, for instance, you can just set up a public Apache server and throw a CGI into it, and then it can itself handle being a node in a repository group. That was one bit of innovation of theirs I thought was pretty interesting. But if you're working inside a company, perhaps SSH works better for you; if you're working in a very large distributed environment, perhaps HTTPS is a better method.

On your blog recently, you posted a fairly long item about the economic motivations of open-source software. And you mentioned that there's really no difference between community open source and commercial open source. A lot of the experts out there seem to be saying that there is a big difference.

Well, I think you first have to look at the motivation for why something was open-sourced. What exactly was the reason for it? Developers do this to share, and companies have that same reason. A developer may be interested in sharing because it's better for their career: they get more recognition, they're more likely to be hired. Or it could be because they're interested in getting more testing done. On the company side, you see the same things: they're interested in getting their product out and creating a larger channel. So the motivational factors are the same.
