Computer Center White Paper

Overview of the objective, design and implementation

Summary

In 2009 I was hired to build, bring online and operate a state-of-the-art computer data center in Toronto, Canada. The purpose of the data center was the processing of large amounts of raw binary data collected by soon-to-be-deployed low Earth orbit satellites. Processing this data required a significant amount of computing horsepower to ensure that one batch of raw data was completed before the next was received from the orbiting satellites. My objective also included staffing, along with process and procedure development and implementation, to operate the data center 24 hours a day, 7 days a week, 365 days a year and deliver the company’s primary service to global customers.

Over the course of 8 years this one data center and its operating staff evolved to encompass 5 additional ‘data centers’ located in Svalbard, Norway; Harwell, U.K.; Bangalore, India; Kitchener, Ontario; and mainland China. The infrastructure also included a feeder network of 30-plus ground station facilities strategically placed around the globe to collect raw data from 10 active in-orbit satellites. The entire infrastructure was maintained and operated by a small team of fewer than 30 individuals, all located in the Toronto, Canada region, using modern networking, software technology tools and techniques to deliver on Service Level Agreements (SLAs) with our customers requiring 99.9% uptime and a latency of 30 minutes or less from reception at a given satellite until data was delivered to customers.

Toronto Data Center

The Hardware

Initiated in 2009, the Toronto Data Center was the first facility and the hub for the entire infrastructure. It consisted of a combination of ‘bare metal’ and ‘virtual’ computing with expandable multi-terabyte storage (significant for the time). This computing infrastructure was leading edge for its time and one of the largest, if not the largest, computing data facilities in Canada.

All of the computing infrastructure was built using high-density computing cores (in excess of 730 processing cores), hot-swappable computing cards and rack-mounted hardware chassis with centralized redundant storage systems and automated backup tape storage. A private IP network of redundant cabling, switching and routing connected all components. Each computing platform had two connections to the network to ensure access in the event of a single-point hardware failure.

The private IP network was connected to the public Internet through redundant state-of-the-art packet-inspecting firewalls. The public Internet was used to connect to the initial satellite ground station, located in Svalbard, Norway, which collected raw data from the orbiting satellites. The public Internet also allowed the operations staff to VPN (Virtual Private Network) into the data center, which enabled ‘lights out’ operation.

The ‘hardware’ was housed in 22 racks with under-floor air intake and top exhaust, with cabling dressed, labelled and documented for quick troubleshooting and problem isolation. We equipped the racks with self-sustaining Uninterruptible Power Supplies (UPS).

The racks were located in a ‘shared’ computing facility operated by Bell Canada. This facility provided adjacent locked racks on ‘raised’ floor space with overhead tray cabling. Our computing equipment was delivered and assembled into these racks on-site by our own personnel.

In addition to floor space, the Bell Canada facility provided dual street power ingress, redundant high-availability conditioned power (i.e. UPS and generator), redundant high-availability air handling equipment (air conditioning), fire isolation and suppression, multiple disparate Internet connection points and a primary Internet node for all of Canada located in the same building.

The facility was secured by a 24-hour manned guard station and multi-factor authentication, including a biometric hand-vein scan and a man-trap entrance. Keys for rack access were maintained in an electronically secured lock box past the man-trap, which required an access code to release the rack keys for your specific cabinets.

The Software

Where possible, hardware was sourced from a single manufacturer, Hewlett Packard. This allowed use of the manufacturer’s multi-platform integration software for configuration management and maintenance. By doing so we were able to streamline operations with a small operating team and focused skill sets.

The computing platforms were operated both as bare metal and as virtual machines, using Linux and Windows operating systems.

Bare metal was preferred for the proprietary parallel processing of the binary data collected from the satellites. The increased speed obtained by the computing hardware when it did not carry the overhead of a virtual machine was critical to meeting SLA requirements. Our proprietary software was specifically designed to take large volumes of raw digitized radio frequency spectrum and run it through compute-intensive parallel processing to extract meaningful packets of information that were delivered to the customer as our product.
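
Purely as an illustration of the general pattern (the proprietary pipeline itself is not described in this paper), the sketch below splits one raw batch into chunks, processes them across all physical cores and fails loudly if the batch misses an assumed 30-minute deadline; the chunking, deadline and placeholder ‘processing’ step are assumptions, not our actual code:

# Illustrative sketch only -- not the proprietary pipeline described above.
# Shows the general bare-metal pattern: split one raw batch into chunks and
# process them across all physical cores before the next batch arrives.
import multiprocessing as mp
import time

BATCH_DEADLINE_SEC = 1800  # assumed 30-minute latency budget from the SLA

def process_chunk(chunk: bytes) -> list:
    """Stand-in for the compute-intensive extraction of packets from raw RF data."""
    # Real processing would demodulate/decode the raw spectrum; here we simply
    # return placeholder 'packets' so the sketch is runnable.
    return [chunk[i:i + 64] for i in range(0, len(chunk), 64)]

def process_batch(raw: bytes, workers: int = mp.cpu_count()) -> list:
    chunk_size = max(1, len(raw) // workers)
    chunks = [raw[i:i + chunk_size] for i in range(0, len(raw), chunk_size)]
    start = time.monotonic()
    with mp.Pool(workers) as pool:
        results = pool.map(process_chunk, chunks)
    if time.monotonic() - start > BATCH_DEADLINE_SEC:
        raise RuntimeError("batch missed its processing deadline")
    return [pkt for chunk_result in results for pkt in chunk_result]

if __name__ == "__main__":
    packets = process_batch(b"\x00" * 1_000_000)
    print(f"extracted {len(packets)} packets")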

Virtual machines were used for staging (preparing for parallel processing), post-processing, storage management, product distribution and miscellaneous other functions. We ‘spun up’ new virtual machines as needed, limited only by processing capacity, storage and licensing. Our baseline was approximately 500 active individual operating system (OS) instances performing a wide variety of dedicated functions that enabled the product to be delivered. This number varied up and down depending on the needs at the time.
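
The paper does not name the hypervisor or management tooling we used; purely for illustration, the sketch below assumes libvirt/KVM-managed guests and drives the standard virsh command line to start and stop pre-defined machines on demand (the guest name is hypothetical):

# Illustrative sketch only. This assumes libvirt/KVM-managed guests that can be
# controlled with the standard `virsh` CLI; the actual hypervisor is not named
# in this paper.
import subprocess

def start_vm(domain: str) -> None:
    """Boot an already-defined guest, e.g. an extra staging or post-processing node."""
    subprocess.run(["virsh", "start", domain], check=True)

def stop_vm(domain: str) -> None:
    """Gracefully shut a guest down when the extra capacity is no longer needed."""
    subprocess.run(["virsh", "shutdown", domain], check=True)

def list_vms() -> str:
    """Return the state of all defined guests (running or shut off)."""
    return subprocess.run(["virsh", "list", "--all"],
                          check=True, capture_output=True, text=True).stdout

if __name__ == "__main__":
    # 'staging-07' is a hypothetical guest name used only for this example.
    start_vm("staging-07")
    print(list_vms())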

Each of these OSs ran a wide variety of off-the-shelf and in-house or contractor-designed and -developed software applications. As product and operating needs evolved, the computing machines needed to support them could be started or stopped as needed, often within a matter of minutes.

Operations

Designing and building the computing infrastructure is only the beginning. The critical next objective is to ensure that the infrastructure continues to operate and be maintained for the duration of its useful life. While designers and engineers know the systems they build extensively, different skill sets and disciplines are usually needed for availability and sustainability.

A lights-out facility, such as the one described above, is specifically designed to work continuously without requiring personnel to physically access the hardware. If the purpose of the facility is static and the software never has to change, then the infrastructure can be built, tested and brought online with almost no further thought except to deal with the rare occasion when a component fails and needs to be replaced.

However, change is highly likely in most computing infrastructures, particularly in a fast paced start-up environment such as ours. The ‘trick’ is ‘managing’ change as tightly as possible while still being responsive to customer needs.

The best you can hope to do is to plan, review and test changes where possible and schedule them at times when the inevitable disruption causes the least amount of pain. To this end we implemented ITIL.

ITIL is an acronym for Information Technology Infrastructure Library, a set of detailed practices for IT service management that focuses on aligning IT services with the needs of the business.

We never became ITIL certified, as some businesses do, since certification was deemed of no benefit to us in the marketplace. However, we tailored the library of practices and procedures to provide high availability while minimizing risk, personnel and overall cost for our large, unique computing environment.
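
As a rough illustration of what such a tailored change process tracks, the sketch below models a change record with risk, test, rollback and scheduling fields; the field names and approval rule are assumptions, not our actual tooling:

# Minimal sketch of a change record as tracked by an ITIL-style change process.
# Field names and the approval rule are illustrative assumptions only.
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum

class Risk(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"

@dataclass
class ChangeRequest:
    summary: str
    requested_by: str
    risk: Risk
    test_plan: str                # how the change was verified before rollout
    rollback_plan: str            # how to back the change out if it fails
    scheduled_window: datetime    # low-impact window agreed with operations
    approvals: list[str] = field(default_factory=list)

    def approved(self) -> bool:
        """A change is only implemented once it carries at least one approval."""
        return len(self.approvals) > 0

if __name__ == "__main__":
    cr = ChangeRequest(
        summary="Patch firewall firmware",
        requested_by="tier3-network",
        risk=Risk.MEDIUM,
        test_plan="Applied and verified in the lab environment",
        rollback_plan="Re-flash previous firmware image",
        scheduled_window=datetime(2017, 1, 15, 2, 0),
    )
    cr.approvals.append("change-advisory-board")
    print(cr.approved())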

Root cause analysis of service interruptions over the years clearly indicated that ‘change’ was the largest contributor to reducing our service levels. Rare were the cases where hardware failures would cause a disruption. We had designed around those. It was a given that during the winter holiday season, when we implemented a moratorium on changes, we would achieve our highest level of service availability for the year.

Staffing

Our business, like most other technology businesses, would prefer to invest in hardware and software, turn it on to do its thing and then forget about it. Unfortunately, it requires a staff of costly personnel to ensure continuous product delivery.

Staffing level is highly driven by the needs of the customer. In essence, the fundamental question to answer is ‘How long is too long?’ to respond to a customer need, fix an issue or repair a component failure. Obviously, the instinctive answer is ‘immediately’, or as short as possible, but that alone does not help answer the question. What it boils down to is cost and what I term ‘buying 9’s’.

Buying 9’s is a reference to the SLA, which typically states that the ‘service’ is available for customer use some percentage of the time, often calculated over the period of a year. An SLA availability to the customer of 99.9% means that the service can be unavailable to the customer for up to roughly 525 minutes (8.75 hours) in any given year. This may not seem like much, but for ‘mission critical’ applications that operate 24 X 7 even a few minutes is usually unacceptable. Increasing availability involves spending more money to put more 9’s after the first ones – buying 9’s. 99.99% availability means an unavailability of roughly 52.5 minutes per year. Achieving an SLA is a combination of hardware, software, procedures and personnel.
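
The downtime budget behind buying 9’s is straightforward arithmetic; the short calculation below reproduces the figures quoted above:

# Yearly downtime budget implied by an availability SLA (buying 9s).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

def downtime_budget_minutes(availability_pct: float) -> float:
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

for sla in (99.9, 99.99, 99.999):
    print(f"{sla}% availability allows {downtime_budget_minutes(sla):.1f} minutes of downtime per year")
# Prints roughly 525.6, 52.6 and 5.3 minutes per year respectively.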

To keep our staffing needs to a minimum and still meet our SLA requirements we implemented what we called our ‘Eyes Open’ policy. The riskiest time period for achieving any SLA is during off-business hours, nights and weekends. Software and automation can raise all kinds of alarms and notifications when any number of situations, interruptions or failures occur. These alarms and notifications can be routed to a wide range of devices, from console alerts to email to personal devices such as cell phones and digital assistants. However, if the people being notified are not awake or available to take action, the condition will likely remain unaddressed and/or get worse. We needed staff with their ‘Eyes Open’ who were trained to follow procedures when events occurred.
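
Purely as an illustration of the kind of automated check that sat behind this policy, the sketch below probes a hypothetical health endpoint and emails the on-shift staff through an assumed SMTP relay; a person with their eyes open still has to read the alert and follow the runbook:

# Illustrative sketch of an automated check backing the 'Eyes Open' policy.
# The URL, recipients and SMTP relay below are hypothetical.
import smtplib
import urllib.request
from email.message import EmailMessage

HEALTH_URL = "http://ops.example.internal/health"   # hypothetical endpoint
ON_SHIFT = ["tier1-oncall@example.com"]             # hypothetical recipients

def service_healthy(url: str, timeout: float = 10.0) -> bool:
    """Probe the service; any connection error or non-200 status counts as unhealthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def notify(recipients: list[str], subject: str, body: str) -> None:
    """Send an alert email through the (assumed) internal SMTP relay."""
    msg = EmailMessage()
    msg["From"] = "noc-alerts@example.com"
    msg["To"] = ", ".join(recipients)
    msg["Subject"] = subject
    msg.set_content(body)
    with smtplib.SMTP("smtp.example.internal") as smtp:   # hypothetical relay
        smtp.send_message(msg)

if __name__ == "__main__":
    if not service_healthy(HEALTH_URL):
        notify(ON_SHIFT, "Service check failed",
               "Automated probe failed; follow the Tier 1 runbook and escalate "
               "to the on-call Tier 2 engineer if unresolved.")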

We staffed with 3 levels of expertise and 14 people. Maintaining ‘Eyes Open’ 24 X 7 X 365 requires a minimum of 7 people; this provides sufficient personnel to cover all hours while accounting for staff vacation and sick time. Tier 1 staff are computer literate and savvy in their computer skills, but they are entry level for the kind of expertise needed to maintain our infrastructure.
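
The minimum of 7 people follows from simple shift arithmetic; the sketch below works it out, with the standard work week and the vacation/sick/training allowance treated as assumptions:

# Rough shift arithmetic behind the minimum head count for 24x7x365 coverage.
# The work week and the relief allowance are assumptions for this sketch.
import math

HOURS_TO_COVER_PER_WEEK = 24 * 7       # 168 seat-hours of 'Eyes Open' coverage
HOURS_PER_PERSON_PER_WEEK = 40         # standard work week
RELIEF_FACTOR = 1.6                    # allowance for vacation, sick time, training

raw_heads = HOURS_TO_COVER_PER_WEEK / HOURS_PER_PERSON_PER_WEEK   # 4.2 people
with_relief = math.ceil(raw_heads * RELIEF_FACTOR)                # about 7 people
print(f"{raw_heads:.1f} people to fill the seats, {with_relief} once relief is included")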

Backing up each Tier 1 shift staff is at least one Tier 2 staff member who is on call. Tier 2 members are significantly more skilled, well versed across all systems, higher paid and financially motivated to pick up the phone and respond should a situation arise that cannot be handled by Tier 1 staff.

Tier 3 staff are specialists. They are as broadly skilled as Tier 2 staff and may or may not be on call in off hours. Tier 3 staff specialize in one or more particular subsets of the infrastructure such as Virtual Machines, Networking or other sub specialties. Their work hours are spent managing change implementation through planning, testing, implementation and documentation.

As our infrastructure grew and financial pressure increased, we began transitioning our operating staff to what is known as ‘DevOps’. DevOps addresses the all-too-common problem of engineers developing something that works in their world and handing it off to operational personnel while they move on to new projects, forgetting about or taking little responsibility for what they developed: it was handed off to operations, and hence if it does not work right, it is not my problem.

DevOps was popularized by large-scale operators such as Google to manage the massive infrastructure they created for their businesses. Fundamentally, the concept places developer-level staff in roles in which they are expected to support the products they create. Few things get a problem both solved and prevented from recurring faster than having the staff who created the product woken at 3 AM to fix a problem they created. Very quickly they will fix the issue so they will not be woken again.

Over the years, as our staff became more skilled and knowledgeable and our success grew, we expanded, using the same personnel and tools to operate additional remotely located data centers around the world.

After realizing that satellite ground stations are primarily networked computers with specialized hardware attached, we also extended our reach and operational support to nearly 30 antenna systems.

Similarly, satellites and the software used to control them are essentially networked computers. As a result, we integrated the 10 satellites we placed into orbit into the same staff, processes and procedures supporting the computer centers and ground stations.

Summary

In 2009, when I began this project, few organizations had built data centers of this caliber and established the procedures and staff to deliver software-derived products so reliably. Over the course of eight years the project grew to encompass multiple computer centers (each with some subset of the capability of the original in Toronto) and nearly 30 ground station facilities, which were controlled by software and dynamically scheduled by in-house developed software tools.

We also added scheduling and operation of our fleet of satellites to the responsibility of the operating personnel with minimal increases in our staff. As a reference point, at the time, the ‘operating’ staff for a single Low Earth Orbit satellite in the government sector could easily be in excess of 100 people. Our ‘operating’ staff was fewer than 30 people for 10 satellites.

Philip L. Miller March 20, 2020