I don’t agree with everything he wrote about systemd, but he isn’t wrong on a fair amount of it

Systemd has taken the Linux world by storm, replacing 20-ish year old init style processing with a centralized control plane. There are many things to like in it, such as the granularity of control. But there are any number of things that are badly broken by default. Some of these are specifically geared towards desktop users (which isn’t a bad thing if you are a desktop Linux user, as I am). But if you are building servers and images, you get a serious case of Whiskey Tango Foxtrot dealing with all the … er … features. Especially the ones you didn’t need, know you don’t need, and really want to avoid seeing live and in the field. Ever again.

My biggest beefs with systemd have been the lack of observability into specific inner workings during startup and shutdown, a seeming inability to control systemd’s more insane leanings (a start event, a shutdown event …), and its journaling infrastructure. The last item is still a sore point for us: we find it very hard to correctly/properly control logging for a system that should be running disklessly, when we see the logger daemon ignoring the limits we imposed on it in the relevant config files, and filling up the ramdisk. Yeah, not so much fun.
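For the diskless case, the relevant knobs live in /etc/systemd/journald.conf. A minimal sketch of what should work (the size caps are illustrative, and as noted above, we have seen them not always honored in practice):

    [Journal]
    Storage=volatile       # keep the journal in /run only, never on disk
    RuntimeMaxUse=64M      # cap total journal size in the ramdisk
    RuntimeKeepFree=128M   # leave headroom on the tmpfs
    MaxLevelStore=notice   # don't hoard debug chatter in RAM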

The startup/shutdown hang timeouts are also very annoying. Despite the fact that systemd provides a good control plane for some of this, these delays (which are strictly, and in the absolute sense, completely unneeded) cause a very poor UX for folks like me who value efficiency. I do not want systemd trying to automatically start all my network services, and hanging the whole system while it waits for the network to autoconfigure. I really … truly do not want that. I’ve been looking at how to change this so that it does the startup sanely, but the control plane seems … incomplete … at best here.
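Most of the boot stall traces back to the wait-online units. A sketch of the first things to try (assuming a networkd or NetworkManager based box; adjust for your setup):

    # stop boot from blocking on "network is fully up"
    systemctl disable systemd-networkd-wait-online.service
    systemctl mask NetworkManager-wait-online.service

    # see what actually ate the time on the last boot
    systemd-analyze blame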

The shutdown hangs are due in large part to complexities around the sequencing of things that systemd does, thinks it knows, and ignores. The things it ignores are what cause the problems, as the dependency graph for shutdown seems not to know how to deal correctly with things like parallel file systems, object stores, etc. We’ve been working on improving this, and with judicious use of the watchdogs, and some recrafting of various unit files, we have it saner. But it’s still not perfect.
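The recrafting largely amounts to drop-in overrides plus the watchdog. A sketch … the unit name here is made up, and the timeouts are examples rather than recommendations:

    # /etc/systemd/system/my-parallel-fs.service.d/override.conf
    [Service]
    TimeoutStopSec=30s        # don't let one stuck unmount wedge the whole shutdown

    # /etc/systemd/system.conf
    [Manager]
    ShutdownWatchdogSec=10min # hardware watchdog fires if shutdown hangs anyway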

And don’t get me started on the Intel MPSS bit with systemd.

My point is simple. Systemd tries to do too much, and IMO it messes up because of this. I’d like it to be a simple control plane. That’s it. Handle start/stop of daemons. Handle that level of its own logging.

I don’t want it to be DNS, and logger, and login, and …

Because when it is all that, things break. Badly.

Our systems are not vulnerable to this bug. And yes, he should have followed responsible disclosure protocol rather than posting a blog entry.

The net of why this bug exists is an assert. Assert is never, and I repeat, NEVER, something you should use in critical system software. Nor is BUG_ON.

When the revolution comes, the coders who write using BUG_ON and assert will be the first against the wall.

Crashing a core service because you get input you don’t like IS NOT A VALID MECHANISM OF ERROR HANDLING. I’ll argue it’s somewhat worse than throwing an exception for every “problem” instead of handling the problem gracefully and locally. Exceptions should only ever be thrown for serious things. So add the folks who established exception throwing as the “right” pattern for what amounts to a branch control point in a code to the folks first against the wall.

Divide by zero? Yeah, throw an exception (the processor will). Access memory outside of your allocated segments? Throw an error (the OS will). Input value for n to some routine is not 1 or greater? If your answer is “I must throw an exception, by asserting here that n > 0”, then you need to rethink your design. Very … very … carefully.

The bug the writer alluded to is a simple case of passing a function a value of zero where it expects 1 or greater. And rather than gracefully returning a no-op (which would make perfect sense in this context) …

THERE IS A )_(*&)(*&^*&^%&% assert(n > 0); in the code.

Seriously? WTF? In a core control plane? AYFKM?
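To make this concrete, here is a minimal sketch of the pattern (hypothetical names, and not systemd’s actual code). A zero-length datagram is perfectly legal on a datagram socket, so recv() returning 0 is not an error condition; the graceful version simply treats it as a no-op:

    #include <assert.h>
    #include <stddef.h>
    #include <sys/socket.h>
    #include <sys/types.h>

    /* stub for illustration */
    static void process_message(const char *buf, size_t len) {
        (void) buf;
        (void) len;
    }

    /* Fragile: any unprivileged client can abort the daemon. */
    static void handle_notify_fragile(int fd) {
        char buf[4096];
        ssize_t n = recv(fd, buf, sizeof(buf), 0);
        assert(n > 0);   /* empty datagram => abort() ... in your core control plane */
        process_message(buf, (size_t) n);
    }

    /* Graceful: an empty message is a no-op; the daemon keeps running. */
    static void handle_notify_graceful(int fd) {
        char buf[4096];
        ssize_t n = recv(fd, buf, sizeof(buf), 0);
        if (n <= 0)
            return;   /* nothing useful to do; carry on */
        process_message(buf, (size_t) n);
    }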

The level of fail for others is profound. A simple user, with no elevated privileges whatsoever, can trivially run a DoS attack against a system.

For us, because we are using a slightly dated version of systemd, we just see some daemons stop and restart, but not the significant loss of functionality that others see.

I’ve been non-committal about the whole systemd bit for a while; I have/had hope for it. I am not now against it, but I will now be looking for more ways to actively constrain what it can do. We already automatically defang some of its more annoying “features”. Now I am going to spend more time looking at how to turn off some of the functionality that I do not want it to handle itself.

I have no problem with it as a control plane. I have lots of problems with it as the Borg.


How’s this for a nice deskside system … one of our Cadence boxen

For a partner. They made a request for something we’ve not built in a while … it had been end-of-lifed.

One of our old Pegasus units. A portable deskside supercomputer. In this case, a deskside franken-computer … built out of the spare parts from other units in our lab.

It started out as a 24-core monster, but we had a power supply burn out and take the motherboard with it. So we switched to a newer motherboard and CPU, and it’s 16 cores now.

Then the memory. They wanted as much as we could put in. Well, 256GB should work (we hope).

And, of course, fast disk. As fast as possible. When I was done, this little deskside unit was cranking out 6GB/s writes and 10GB/s reads. So, yeah, fast. No NVMe (remember, spare parts, and I didn’t have any spare NVMe around … not quite true, one was in the unit that burnt out, and I wasn’t sure if it had burned as well).

Then the graphics. A nice Nvidia card. Ok, this one I bought, because we didn’t have any modern ones in stock.

And of course, the fans. Gotta keep this beast cool. So we got large silent CPU coolers, and large high-CFM, low-RPM fans. It is hard to hear the unit, even when you are next to it.

Our SIOS OS, with a nice desktop interface. Our SIOS Analytics Toolchain with all manner of analytical tool goodness.

As I used to call it, it is a muscular desktop.

Way way (way) back when I started at SGI, my manager got me an awesome desktop unit … an R8000 based workstation. Everyone else had R4000 or R3000 based units. I had this floating point monster on my desk. And I used it. I ran lots of my thesis calcs there. It was easily 20 times faster than the old Sun boxes I had access to in the physics department. It was my original muscular desktop.

This one runs circles around that one. Really quickly. I remember my old MD code used to take 1 hour wall clock per time step. Week-long runs were common for me. On the R8000, it would be 1 minute per time step (I had tuned the code a bit by then). On units from about 10 years ago (AMD Opterons) I was down to 10 seconds or so per time step.

I’ve not done a modern comparison … I really should …


Build me a big data analysis room

This was the request that showed up on our doorstep. A room. Not a system. But a room.

Visions of the Star Trek: TNG bridge came to mind. Then the old SGI power wall … 7 meters wide by 2 meters high, driven by an awesomely powerful Onyx system (now underpowered compared to a good Nvidia card).

Of course, the budget wouldn’t allow any of these, but it was still a cool request.

Hopefully the room concept/design we put together will fly.


A good read on realities behind cloud computing

In this article on the venerable Next Platform site, Addison Snell makes a case against some of the presumed truths of cloud computing. One of the points he makes is specifically something we run into all the time with customers, and yet this particular untruth isn’t really being reported the way our customers look at it.

Sure, you are paying for the unused capacity. This is how utility models work. Tenancy is the most important measure for the business providing the systems. The more virtual machines they can cram onto a single system, the better for them.

But … but …

This paying for vacancy/unused cycles isn’t really the expensive part.

The part that is expensive is getting your data out, or having significant volumes of data reside there for a long time. It’s designed to be expensive, and to capture data. This is a rent-seeking model … generally held to be a non-productive use of assets. It exists to generate time-extended monetization of assets. Like license fees for software you require to run your business.

We’ve worked through analyses for a number of customers based upon their use cases, comparing a few different cloud vendors with accurate usage models taken from their existing day-to-day work. One of the things we discovered rapidly: for a bursting big data analytics effort with sizeable on-site storage (a few hundred TB, pulling back 10% of the data per month), the cloud models, using specifically the most aggressive pricing available, were more expensive on a monthly basis … often significantly so … than the fully burdened cost (power/cooling, space/building, staff, network, …) of hosting an equivalent (and often far better/faster/more productive) system in house.
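To put rough, purely illustrative numbers on it (assumed rates, not anyone’s quote): take 300 TB resident and 30 TB/month pulled back. At around $0.02/GB-month for storage and $0.09/GB for egress, that is roughly $6,000/month for storage plus $2,700/month for egress … before a single compute cycle is billed. Run that against the fully burdened monthly cost of an amortized in-house system, and the comparison turns uncomfortable for the cloud very quickly.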

The major difference is that one of these is a capital expense (capex) and one is an operational expense (opex), and they come from different areas of the budget.

For occasional bursts, without a great deal of onsite data storage, and data return, clouds are great. This isn’t traditionally the HPC use case though. Nor is it the analytical services use case.

The article is an interesting read, and the other points are also quite good. But as noted, the vacancy cost is important; it is not the only cost involved, nor even the dominant one.


Running conditioning on 4x Forte #HPC #NVMe #storage units

This is our conditioning pass to get the units to a stable state for block allocations. We run a number of fill passes over the units. Each pass takes around 42 minutes for the denser units, 21 minutes for the less dense ones. After a few passes, we hit a nice equilibrium: performance is more deterministic, and less likely to drop as block allocations gradually fill the unit.

We run the conditioning over the complete device, one conditioning process per storage device, with multiple iterations of the passes. After 2 hours or so, and 3 passes, they are pretty stable and deterministic.
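The mechanics are straightforward. Something along these lines with fio is a reasonable sketch of one conditioning process (the device name is illustrative, this is not our exact tooling, and note it destroys any data on the device):

    fio --name=condition --filename=/dev/nvme0n1 \
        --rw=write --bs=1M --iodepth=32 --ioengine=libaio \
        --direct=1 --loops=3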

It’s always fun to watch the system IO bandwidth during these passes. Each system is rocking 18-21 GB/s right now, at about 90% idle on the CPUs. Banging interrupts/context switches hard, but the systems are responsive.

Actually, while this is going on, we usually do our OS installation if the unit has drives for this.

I like parallelism like this …


New #HPC #storage configs for #bigdata , up to 16PB at 160GB/s

This is an update to the Scalable Informatics “portable petabyte” offering. Basically: from 1 to 16PB of usable space, distributed and mirrored metadata, and a high performance (100Gb) network fabric. We’ve got a very dense, very fast system available now, at a very aggressive price point (starting configs around $0.20/GB).

Batteries included … long on features, functionality, performance. Short on cost.

We are leveraging the denser spinning rust drives (SRD), as well as a number of storage technologies that we’ve built or integrated into the systems. The systems provide parallel file systems, Amazon S3 compatible object storage, and common block storage formats, simultaneously.

See the page (https://scalableinformatics.com/petabyte) for more details. Happy to answer questions or discuss this in depth. Reach out to me at the day job.


Fully RAMdisk booted CentOS 7.2 based SIOS image for #HPC , #bigdata , #storage etc.

This is something we’ve been working on for a while … a completely clean, as baseline a distro as possible, version of our SIOS RAMdisk image using CentOS (and by extension, Red Hat … just point at those repositories). And it’s available to pull down and use as you wish from our download site.

Ok, so what does it do?

Simple.

It boots an entire OS, into RAM.

No disks to manage and worry over.

No configuration drift.

You can run ansible, puppet, cloud-init, kvm, gluster, … etc. It already communicates over the serial console by default, though you have complete control over that. The default password is randomly generated, though you can override it at boot time with an option.

Currently fits in 1.8 GB RAM, though with work, we can trim this down a bit.

We boot VMs, physical machines, etc. with this.
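For the physical machines, the usual path is PXE. A hypothetical pxelinux stanza (file names and console settings are illustrative, not the shipped config):

    LABEL sios-ramboot
      KERNEL vmlinuz-sios
      APPEND initrd=sios-ramdisk.img console=ttyS0,115200n8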

By default it will try to bring up the first 4 network interfaces, DHCPing the ones that show a carrier after the interface comes up. If your switch is not configured for portfast, you should be ashamed, and fix that. This way, the system doesn’t waste time waiting for switch ports to come up, and it only DHCPs on active ports (eliminating delays from starting DHCP on ports that have no connections).
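The carrier-gated behavior is roughly this idea (a sketch with illustrative interface names, not our actual boot scripts):

    # bring interfaces up, then only DHCP where a carrier is present
    for i in eth0 eth1 eth2 eth3; do
        ip link set "$i" up
        sleep 2   # give the PHY (and portfast) a moment
        if [ "$(cat /sys/class/net/$i/carrier 2>/dev/null)" = "1" ]; then
            dhclient "$i" &
        fi
    done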

We’ll have SIOS images with a number of other tools up soon as well.

Note: For copyright/trademark purposes, this is NOT CentOS or Red Hat. You should not confuse this image, built from CentOS binaries, with CentOS or Red Hat. It is an instance of an installation, built in such a way as to run entirely out of RAM. It includes the full RDMA stack and the latest-rev CentOS kernel.

Give it a whirl, let me know how it goes. More tools coming to this directory tree very soon. Stay tuned!


An article on Python vs Julia for scripting

For those who don’t know, Julia is a very powerful new language which aims to leverage JIT compilation to generate very fast numerical/computational code from a well thought out language.

I’ve argued for a while that it feels like a better Python than Python. Python, for those who aren’t aware, is a scripting language which has risen in popularity over recent years. It is generally fairly easy to work in, with a few caveats.

Indentation is the killer for me. The language is tolerable though, IMO, not nearly as “simple” as people claim, with a number of lower level abstractions peeking through. I am fine with those. I am not fine with (and have never been fine with) structure by indentation. This isn’t its only issue: there’s the global interpreter lock, and the incompatibility between Python 2.x and 3.x. Python does have a very nice interface to C/C++ libraries though, which makes extending it relatively easy.

Julia eschews this structure by indentation. It also tries hard to be convenient and consistent, and IMO it does a great job of it. We are experimenting with using it for more than basic analytics, and it is installed on every single machine we ship, in /opt/scalable/bin/julia, and has been for years. As are Python3 and Perl 5.xx.

These tools are part of our analytics stack, which has a few different versions depending upon physical footprint requirements.

Julia has made interacting with the underlying system trivial, as it should be, with few abstractions peeking out from underneath the syntax. This article discusses the differences from a pragmatic viewpoint.

Overall I agree with the points made. Perl, my go-to scripting language, has some of the Python issues (abstraction leakage). Perl6 is better. Much better. Really … I’ve been looking into it in depth … and it is pretty incredible. Julia is better, and much better at the stuff you’d want to use Python for.
