The flight of the condor - Lessons learnt running a Condor cluster
In 2009, while working at DS2 as a Sys Admin in Valencia, I found myself in charge of building and running a Condor cluster (now known as HTCondor) to support large-scale simulations for Power Line Communication (PLC) systems.
At the time, I didn’t think of it as “HPC” or “distributed computing.” It was just… a way to get simulations done before the deadline.
Looking back, that experience taught me more about systems architecture, parallelism, and infrastructure-driven engineering than I realized at the time. Today, as I start learning AI/ML, I keep recognizing echoes of those lessons.
The Challenge
DS2 was building cutting-edge chipsets for powerline broadband — before Wi-Fi was everywhere. The big question:
How do you transmit fast, reliable data over something as noisy and unpredictable as a building’s electrical wiring?
To answer that, the R&D team had to simulate everything:
- Noise models: impulse noise, colored background, electrical appliance interference
- Throughput tests across different protocol stack versions
- Compliance testing for emerging standards like UPA and IEEE 1901
The workloads were huge, and back then I was only aware of a small subset of the simulations and workloads passing through. Our Compaq desktops weren’t fast enough… and yes, we did try to use desktop computers at night, like the scrappy startup we were, but that is a story for another blog entry…
The solution… A Condor Cluster!
With limited resources and time, we built a Condor cluster by repurposing unused desktops in the lab and some engineering workstations after hours.
We ran:
- Mostly MATLAB, C/C++, and a few early Python scripts
- Simulation jobs in parameter sweeps (e.g., 100s of SNR levels × topologies), submitted roughly as in the sketch after this list
- Scripts that could checkpoint and resume, because jobs might fail or be evicted
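None of the original submit files survived, but a typical sweep looked roughly like the reconstruction below. The `run_sim` executable, its flags, and the directory layout are invented for illustration; the `universe`, `arguments`, `$(Process)` and `queue` keywords are standard Condor submit syntax.

```
# Reconstructed submit description (run_sim, its flags and the paths are invented;
# the keywords are standard Condor submit syntax).
universe   = vanilla
executable = run_sim
# $(Process) runs from 0 to 199; a small wrapper mapped it to an SNR level and topology
arguments  = --snr-index $(Process) --topology office_grid
log        = logs/sweep_$(Cluster).log
output     = out/sweep_$(Process).out
error      = err/sweep_$(Process).err
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
queue 200
```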
Why Condor?
- It was easy to deploy on Linux (and worked “well enough” on Windows XP with effort)
- It supported opportunistic computing — using idle machines
- It taught me to think about job scheduling, retries, and fault tolerance
How did we get started?
So, we purchased a bunch of powerful PCs (I can’t remember the exact specs, but yes, they were PCs!! With fancy towers!)
Back in 2009, we didn’t have Docker, Terraform, or cloud GPUs. We had quad-core desktop PCs with 4–8 GB of RAM, a shared LAN, and a lot of patience. Here’s how we built our Condor cluster to simulate PLC environments across multiple machines.
We used a mix of:
- Intel Core 2 Quad desktops (Q6600, Q9550)
- Some AMD Phenom-based workstations
- Running Ubuntu 8.04 LTS or Windows XP SP3 (yes, we tried cross-platform!)
- Gigabit Ethernet on a shared office switch
Shared NFS helped us keep input/output in sync across nodes.
We had to manually install MATLAB runtimes or compile native C++ binaries for each machine.
Some nodes ran Windows, so we had to compile dual versions or use Cygwin for compatibility.
We hacked a Python script to monitor job queues and requeue failed ones (early “resilience”).
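That script is long gone, but the idea was roughly the following. This is a from-memory sketch, not the original: `condor_q` and `condor_submit` are real Condor commands, while the `expected_jobs.txt` file, the naming convention and the `submits/<name>.sub` layout are invented for this example.

```python
#!/usr/bin/env python
# From-memory sketch of the requeue helper, not the original script.
# condor_q/condor_submit are real Condor commands; expected_jobs.txt and the
# submits/<name>.sub layout are invented for this example.
import subprocess

def executables_in_queue():
    """Return the set of executable paths Condor currently knows about."""
    out = subprocess.check_output(["condor_q", "-format", "%s\n", "Cmd"], text=True)
    return {line.strip() for line in out.splitlines() if line.strip()}

def main():
    active = executables_in_queue()
    with open("expected_jobs.txt") as fh:          # one "name /path/to/binary" per line
        expected = [line.split() for line in fh if line.strip()]
    for name, binary in expected:
        if binary not in active:
            # Job finished or died; blindly resubmitting was our early "resilience".
            print("requeueing", name)
            subprocess.call(["condor_submit", "submits/%s.sub" % name])

if __name__ == "__main__":
    main()
```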
What I remember most is how painful, complex and confusing the configuration files were. condor_config and condor_config.local had dozens of options, many of them poorly documented, and the R&D team looked to us to figure out how to launch simulations, which was often a nice excuse to work closely together, but not to debug or understand the files themselves. I remember how slight misconfigurations, like a wrong ALLOW_WRITE or mismatched hostnames, would fail silently, and good luck getting our good friend Nagios to catch that!
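To give a flavour of it, the kind of fragment that could silently take a node out of the pool looked something like this. It is a reconstructed example rather than our real config: CONDOR_HOST, DAEMON_LIST, ALLOW_READ and ALLOW_WRITE are real Condor macros, but the hostnames and domain are invented.

```
# Reconstructed condor_config.local fragment (hostnames and domain are invented).
CONDOR_HOST = condor-master.lab.local
DAEMON_LIST = MASTER, STARTD, SCHEDD

# If a node identified itself with a name that did not match these patterns
# (FQDN vs short name, a typo, a stale DNS entry), it was quietly ignored:
# no error on the node, nothing obvious in the daemon logs.
ALLOW_READ  = *.lab.local
ALLOW_WRITE = *.lab.local, condor-master.lab.local
```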
Often the error messages were cryptic, or worse, there were none! Jobs had zero retries, so we had to implement catch-all scripts that captured failed jobs and emailed the owner (that evening was fun!). Condor did not have any real-time monitoring either, so we created a bunch of Nagios checks in Python that dynamically built the list of jobs supposed to be running on the cluster and verified they were actually running as expected.
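A minimal sketch of the shape such a check takes is below. It is not one of the originals, and it is simplified to a queue-depth test: it assumes only that `condor_q` is on the PATH and follows the standard Nagios exit-code convention; the thresholds are arbitrary.

```python
#!/usr/bin/env python
# Reconstructed sketch of one Nagios check, simplified to a queue-depth test.
# It assumes only that condor_q is on the PATH; the thresholds are arbitrary.
import subprocess
import sys

WARN_IF_FEWER_THAN = 10   # hypothetical thresholds
CRIT_IF_FEWER_THAN = 1

def jobs_in_queue():
    """Count the jobs the schedd currently knows about."""
    out = subprocess.check_output(["condor_q", "-format", "%d\n", "ClusterId"], text=True)
    return len([line for line in out.splitlines() if line.strip()])

def main():
    try:
        count = jobs_in_queue()
    except Exception as exc:                     # condor_q missing or schedd down
        print("CRITICAL: condor_q failed: %s" % exc)
        sys.exit(2)
    if count < CRIT_IF_FEWER_THAN:
        print("CRITICAL: only %d jobs in the queue" % count)
        sys.exit(2)
    if count < WARN_IF_FEWER_THAN:
        print("WARNING: only %d jobs in the queue" % count)
        sys.exit(1)
    print("OK: %d jobs in the queue" % count)    # Nagios treats exit 0 as OK
    sys.exit(0)

if __name__ == "__main__":
    main()
```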
Condor required bidirectional communication over multiple ports (random high ports), which was also poorly documented back then, creating all sorts of colorful situations on our production firewalls.
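For anyone hitting the same wall today: Condor can be told to confine its dynamically chosen ports to a fixed range, so the firewall rule only needs to be written once. A minimal sketch (LOWPORT and HIGHPORT are real Condor knobs; the range itself is arbitrary):

```
# Reconstructed fragment: confine Condor's dynamically chosen ports to a fixed
# range so a single firewall rule covers them (9600-9700 is an arbitrary choice).
LOWPORT  = 9600
HIGHPORT = 9700
# The collector itself listens on 9618 by default, which still needs its own rule.
```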
I also remember the nice bash loops with scp for updating binaries across 20 nodes.
Setting this up felt like magic. Watching 20 machines light up and churn through simulation runs overnight gave me a deep respect for distributed systems — and started me on a path that led to where I am today: learning how AI pipelines depend just as much on compute orchestration as they do on model architectures.
Lessons learnt:
Infrastructure Matters: The algorithms were smart, but they were powerless without the compute. That lesson echoes today as I work with AI: training is nothing without hardware and a pipeline to support it.
Parallelism Isn’t Free: I had to learn when a task could be safely parallelized, and when I’d run into file locking, RAM constraints, or licensing headaches.
Real Systems Have Limits: In 2009 we didn’t have the fancy tools we have today. MATLAB had no built-in parallel pool, Python had no joblib. We hacked it together.
Simulation Is the First Step Toward Prediction: While we weren’t doing “machine learning,” our simulations were, in a sense, learning: teaching us how systems might behave under conditions we couldn’t physically replicate. In a way, that planted the seed for my current curiosity about AI/ML modeling.