Why I built my own OS

Rob Gibbon
Oct 6, 2018

So, I’m working on a new data processing system that we’re calling the Barbarian Data System. Our idea is to build and ship a better implementation of the Apache Hadoop stack — or at least, the bits that are still relevant, which in our opinion are primarily Apache Hive and Apache SolrCloud.

It couldn’t have been better timed: Cloudera and Hortonworks announced they’re merging just this week, and their CEO expressed his aspiration for the new firm to be “the next Oracle”. Opinions will differ, but in my experience this is going to be a heads-up to most existing CDH/HDP customers to get off these platforms as fast as possible.

Our vision for Barbarian is for a cloud native, yet cloud agnostic data system that takes on all kinds of data just like it should, but that isn’t tied to any single cloud provider, and that gives Hadoop users a real choice, while being relatively easy to deploy and demonstrate value. So of course we went straight to Kubernetes, Helm, and the container revolution as the starting point for the solution.

By adopting the kube, we reasoned, we’ll be able to offer our users portability and ease of adoption, whilst still servicing the kinds of customers who prefer to stay behind a corporate firewall. We’ll be able to build something akin to SAP HANA in a fraction of the time, something fast, in-memory oriented, massively scalable and importantly, we will not be reinventing any wheels but instead leveraging the ocean of FOSS out there to build our own Hadoop remix.

Or so we thought, until we started thinking through the implications of container technology.

Containers and software distribution

If you want to ship and perhaps even sell software in Docker containers, you’re likely going to be considered a software distributor. As a distributor, you are obliged to comply with all of the license terms of the software that you ship in your container images. That includes the base image software. For example, if you post a PHP application that is based on an Ubuntu Linux base image on Azure AKS Marketplace, then you are responsible for complying with the license terms of all of the software that composes Ubuntu Linux, as well as the terms of the Ubuntu Linux distribution itself.

The GPL.

The GNU General Public License (GPL) is the license applied to the Linux kernel and much of its supporting cast, including the GNU tools. According to the GPL FAQ, one of the key principles underpinning the license is that users should be free to modify GPL licensed code as they see fit without limitation or restriction, which is a noble ideal.

The idea is that if the user wants to try a new version of, say, the bash shell (one that’s 10% faster than the original version, or that closes a security flaw, for instance), then they shouldn’t have to go back to the software distributor or vendor to have the application modified to use the new version of bash.

To this end, the GPL contains a clause that any code intimately linked to GPL code must be distributed under the same license terms as the GPL, or must otherwise not be distributed at all. This is often dubbed the “copyleft” idiom.

While other licenses, for example the CDDL license, also sometimes include clauses of this nature, few are so strongly framed as the clauses in the GPL.

Now, Hadoop depends on a lot of bash scripts. And in Hadoop 3.x, they’re intimately linked to the bash shell — in that those scripts cannot be run on any other system shell such as ksh without some major rework.
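A rough way to gauge how tightly a set of scripts is bound to bash is to scan them for a bash shebang and for constructs that other shells such as ksh do not provide. The sketch below is illustrative only: the function name and the short list of patterns are assumptions chosen for this example, not an exhaustive bashism detector and not a description of what the Hadoop scripts actually contain.

```python
import re

# Assumption: a script "requires bash" if its shebang names bash, or if it
# uses one of a few builtins/variables that are specific to bash.
BASH_SHEBANG = re.compile(r"^#!\s*(?:/usr/bin/env\s+bash|\S*/bash)\b")

# Hypothetical, deliberately small list of bash-only constructs.
BASH_ONLY = {
    "BASH_* variable": re.compile(r"\bBASH_[A-Z_]+\b"),
    "shopt builtin": re.compile(r"^\s*shopt\b"),
    "declare builtin": re.compile(r"^\s*declare\b"),
    "mapfile/readarray builtin": re.compile(r"^\s*(?:mapfile|readarray)\b"),
}


def bash_dependencies(script_text):
    """Return a sorted list of reasons the script appears to need bash."""
    lines = script_text.splitlines()
    reasons = set()
    if lines and BASH_SHEBANG.match(lines[0]):
        reasons.add("bash shebang")
    for line in lines:
        # Skip whole-line comments (this also skips the shebang line).
        if line.lstrip().startswith("#"):
            continue
        for label, pattern in BASH_ONLY.items():
            if pattern.search(line):
                reasons.add(label)
    return sorted(reasons)


if __name__ == "__main__":
    sample = '#!/usr/bin/env bash\nthis="${BASH_SOURCE-$0}"\ndeclare -a items\n'
    for reason in bash_dependencies(sample):
        print(reason)
```

An empty result does not prove a script is portable; it only means none of the listed markers were found.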

Not only that, but Hadoop is predominantly a Java codebase. Java is developed by Oracle Corporation and offered in a dual licensing model.

You can either pay Oracle for the proprietary version, which weighs in at about $3500 per CPU core, about $350 per named user, or a revenue share or other custom deal; or you can take the GPL licensed (with classpath exception) OpenJDK version of Java.

What it means

If you are developing a custom application to run on Kubernetes that you want to distribute for commercial gain, and you want to use Linux tools like bash, glibc, the GNU tools collection, or even busybox, and Java as the basis for that application, then you will need to comply with the terms of the GPL license.

There is currently little case law regarding Docker containers. In our opinion (please note, I Am Not A Lawyer; you should always seek legal advice from someone qualified in such matters, and not act solely on the basis of opinions that you read here), a container may present an intimate linking of code, in the sense that the GPL code contained in the container image is not necessarily user modifiable without intervention by the vendor. That is precisely one of the concerns that the GPL seeks to address. The exception is when all of the source code and build scripts for the container image are also provided, which is effectively the GPL’s copyleft clause in action.

Either way, as a software distributor the two options presented by the Java runtime in particular seem quite unpalatable. Complying with all of the terms of the GPL is onerous at best, and we wouldn’t want to be bullied into paying for the proprietary version of Java on the basis of non-compliance with the terms of the GPL.

Do it yourself

In the end the most viable option to comply with the GPL, in our opinion, was to build our own container OS image without any GPL code in it: BarbarianOS.

Our principle is this: by excluding GPL code from the base image, but providing facilities to download the user’s choice of GPL dependencies once deployed, we make every reasonable endeavour to ensure that the user retains freedom and control over which versions of the GPL licensed components are installed and used by the application. Those are at least some of the concerns that the GPL seeks to address.
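The split described here can be sketched as a simple partition over a component manifest: anything under a copyleft license is left out of the image and deferred to the user at deploy time, and everything else is bundled. This is a minimal sketch under stated assumptions, not the actual BarbarianOS build tooling. The `Component` type, the `plan_image` function, the prefix rule, and the example manifest (with its SPDX-style license identifiers) are all hypothetical and for illustration only.

```python
from dataclasses import dataclass

# Assumption: any SPDX-style identifier starting with one of these prefixes
# is treated as copyleft and kept out of the base image. Including "LGPL-"
# is a policy choice made for this sketch.
COPYLEFT_PREFIXES = ("GPL-", "LGPL-", "AGPL-")


@dataclass(frozen=True)
class Component:
    name: str
    spdx_license: str


def plan_image(components):
    """Split components into (bundled, deferred) lists of names.

    bundled:  shipped inside the base image
    deferred: downloaded by the user, in a version of their choice,
              after deployment
    """
    bundled, deferred = [], []
    for component in components:
        if component.spdx_license.startswith(COPYLEFT_PREFIXES):
            deferred.append(component.name)
        else:
            bundled.append(component.name)
    return sorted(bundled), sorted(deferred)


# Illustrative manifest only; verify real license data before relying on it.
EXAMPLE_MANIFEST = [
    Component("mksh", "MirOS"),
    Component("musl", "MIT"),
    Component("zlib", "Zlib"),
    Component("bash", "GPL-3.0-or-later"),
    Component("busybox", "GPL-2.0-only"),
    Component("openjdk", "GPL-2.0-only WITH Classpath-exception-2.0"),
]

if __name__ == "__main__":
    bundled, deferred = plan_image(EXAMPLE_MANIFEST)
    print("bundle in image:", ", ".join(bundled))
    print("defer to deploy time:", ", ".join(deferred))
```

A prefix match is a blunt instrument. A real check would need to handle compound license expressions and license exceptions case by case.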

BarbarianOS

To this end we built the BarbarianOS “gpl-free-base-image”. It was an arduous exercise in compliance, but the result is a fully functional container runtime with just what we need and no more, based on more liberally licensed software past and present, untainted by the GPL. We took the MirBSD Korn Shell (mksh) as the default system shell, along with Python 2.7, Vim, and a bunch of libraries and utilities like cawf, ncurses, libffi, and zlib. For Unix tools the choices were more limited. In the end we opted for the Heirloom tools code drop, a vintage set of shell tools donated to the open source community, mostly under liberal licenses, by SCO, Lucent, and Sun Microsystems about 20 years ago. They still compile and run well.

We chose Musl libc over glibc, primarily because glibc is LGPL licensed. The LGPL is not as strongly protective of user rights as the GPL, but given the choice we preferred the MIT licensed Musl.

So much for not reinventing the wheel with Docker containers! Our experience has been that compliance with the licensing terms of free software on new technology platforms and code sharing systems like Google Kubernetes Marketplace is not trivial, and that there is a compliance void not currently being filled by the traditional Linux distributions like Red Hat and SuSE. At least, we are not aware of a ubiquitous, automated way to assure easy compliance with software licenses when publishing container images. As soon as such a thing becomes available, we’ll be first in line to adopt it, but in the meantime we will be maintaining and supporting our own foundational BarbarianOS.

