Uptime Institute Data Center Site Infrastructure Tier Standard Topology
Tier I: Basic Data Center Site Infrastructure
Tier II: Redundant Site Infrastructure Capacity Components
Tier III: Concurrently Maintainable Site Infrastructure
Tier IV: Fault-Tolerant Site Infrastructure
|Feature||Tier I||Tier II||Tier III||Tier IV|
|Active components supporting IT load||N||N+1||N+1||2N or 2N + 1 (the point is to always have N after any failure)|
|Distribution paths||1||1||1 active and 1 alternate||2 simultaneously active|
|Temperature||min: 18 °C (64.4 °F), max: 27 °C (80.6 °F)|
|Raised floor height||at least 24"|
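The component-count row can be read as simple arithmetic. Here is a minimal sketch, assuming N is the number of capacity components the IT load needs. The function names are made up for illustration, and the count ignores distribution paths entirely.

```python
def installed_units(n: int, scheme: str) -> int:
    """Units installed under a redundancy scheme when the load needs n units."""
    schemes = {"N": n, "N+1": n + 1, "2N": 2 * n, "2N+1": 2 * n + 1}
    return schemes[scheme]


def survives_failures(n: int, scheme: str, failed: int = 1) -> bool:
    """True if at least n units are still running after `failed` units are lost."""
    return installed_units(n, scheme) - failed >= n


if __name__ == "__main__":
    # Example: a load that needs 4 units (an assumed figure, not from the standard).
    for scheme in ("N", "N+1", "2N", "2N+1"):
        print(scheme, installed_units(4, scheme), survives_failures(4, scheme))
```

With plain N, any single failure leaves the load short. With 2N, a whole side can be lost and N still remains.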
UPS and Generators
- UPS needs to run long enough for graceful shutdown
- Generator needs to be online before UPS fails
- Generator needs 12 hours of fuel, resupplying as needed within 12 hour windows
- Gasoline and diesel degrade in storage, so liquid propane is better
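The timing rules above can be turned into a quick sanity check. This is only a sketch: the `PowerPlan` fields and the example figures are hypothetical, and the 12-hour fuel figure is the one from these notes.

```python
from dataclasses import dataclass
from typing import List

REQUIRED_FUEL_HOURS = 12  # fuel requirement as stated in the notes above


@dataclass
class PowerPlan:
    ups_runtime_min: float      # minutes the UPS can carry the full load
    generator_start_min: float  # minutes until the generator carries the load
    shutdown_min: float         # minutes needed for a graceful shutdown
    fuel_hours: float           # hours of generator fuel stored on site


def find_gaps(plan: PowerPlan) -> List[str]:
    """Return a list of rule violations; an empty list means the plan passes."""
    problems = []
    if plan.ups_runtime_min < plan.shutdown_min:
        problems.append("UPS cannot last through a graceful shutdown")
    if plan.ups_runtime_min <= plan.generator_start_min:
        problems.append("UPS fails before the generator is online")
    if plan.fuel_hours < REQUIRED_FUEL_HOURS:
        problems.append("less than 12 hours of fuel on site")
    return problems


if __name__ == "__main__":
    print(find_gaps(PowerPlan(15, 1, 10, 24)))  # made-up numbers
```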
I've done work for Cummins, so here we go much further down the rabbit hole than is appropriate. Ignore this when preparing for the test:
- LNG or Liquified Natural Gas is a cryogenic fluid, about 99% methane or CH4. It's good for bus engines, stored in a Dewar tank like a giant Thermos bottle. Good energy density, easy to handle and store.
- CNG or Compressed Natural Gas might be vaporized LNG or maybe compressed pipeline gas, which might be more like 95-96% methane. It's stored at 5,000 PSI, so it needs sturdy tanks: either heavy steel or aluminum internal skin with carbon fiber overwrap, which is expensive.
- Propane is CH3—CH2—CH3 and automotive spec propane is pretty pure. The bottles for grills and heaters are less pure, with other hydrocarbons in the mix. Propane is stored at 30-50 PSI, so the tanks still need periodic hydrostatic testing but it's relatively cheap.
- LP or Liquified Petroleum gas is a mix of propane and other hydrocarbons, mostly heavier.
- Gas engines (meaning LNG/CNG/propane, not gasoline) take longer than diesel to get to full rated power, so the data center UPS will have to support full load a little longer as the genset spins up. Something like magnetically levitated flywheels spinning generators might cover that gap.
Time / Frequency Concepts
- Goal is usually "five nines", 99.999% availability, which allows under six minutes of downtime per year
- MAD = Maximum Allowable Downtime — Cannot be down longer than this. (or company fails, perhaps)
- RTO = Recovery Time Objective — We want to be back up this soon. (significantly faster than MAD)
- MTTR = Mean Time To Recovery — On average, recovery takes this long.
- RPO = Recovery Point Objective — We can afford to lose this much.
- MTBF = Mean Time Between Failures — On average, it fails this often.
- RSL = Recovery Service Level — During disaster and following recovery, we need at least this much.
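The "five nines" figure is easy to check. A minimal sketch, assuming a 365.25-day year; the function name is just for illustration.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Minutes of downtime per year allowed at a given availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for nines in (99.9, 99.99, 99.999):
        print(f"{nines}% -> {downtime_minutes_per_year(nines):.2f} minutes/year")
```

Five nines works out to a little over five minutes per year, which matches "under six minutes" above.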
"About twice a year we have a major storage failure. We make backups nightly starting at 1 AM. Our goal is to get data restored within 1 hour. If we went 8 hours without data, our company would financially suffer. Over the past year, our data recovery process has averaged 41 minutes. While recovering one file system, we need at least 80% normal performance on the other unaffected file systems." For that story:
- MTBF = 6 months
- RPO = Within the past 24 hours
- RTO = 1 hour
- MAD = 8 hours
- MTTR = 41 minutes
- RSL = 80% or 0.8
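The story's numbers can be checked against each other. This sketch assumes the common steady-state approximation, availability = MTBF / (MTBF + MTTR), and treats 6 months as half of a 365.25-day year. The class and function names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class RecoveryMetrics:
    mtbf_hours: float  # mean time between failures
    rpo_hours: float   # worst-case age of the last good backup
    rto_hours: float   # recovery time objective
    mad_hours: float   # maximum allowable downtime
    mttr_hours: float  # mean time to recovery
    rsl: float         # recovery service level, as a fraction


def is_consistent(m: RecoveryMetrics) -> bool:
    """Average recovery should meet the objective, which should beat the MAD."""
    return m.mttr_hours <= m.rto_hours <= m.mad_hours


def availability(m: RecoveryMetrics) -> float:
    """Steady-state approximation: uptime as a fraction of total time."""
    return m.mtbf_hours / (m.mtbf_hours + m.mttr_hours)


story = RecoveryMetrics(
    mtbf_hours=365.25 * 24 / 2,  # about twice a year
    rpo_hours=24,                # nightly backups
    rto_hours=1,
    mad_hours=8,
    mttr_hours=41 / 60,
    rsl=0.8,
)

if __name__ == "__main__":
    print(is_consistent(story), f"{availability(story):.4%}")
```

Under these assumptions the storage service in the story lands at roughly 99.98% availability, so it meets its own RTO but is well short of five nines.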
Maintenance mode is used when updating or reconfiguring the host, where the hypervisor runs.
- Customer access (starting new VMs) is blocked
- Live-migrate running VMs to other hosts
- Disable alerts
- Leave logging enabled
- Administrator access only, possibly restricted to physical console
- Follow vendor guidance and best practices
Clustered Hosts and Resource Sharing
- Reservations guarantee a minimum amount of resources to a specified VM
- Limits cap the maximum amount of resources a specified VM can use
- Shares divide up whatever is left when there is resource contention. Allocate reservations first, then use shares (prioritized, percentage-based) to distribute the remaining resources among the members.
"Everyone gets a sandwich, and Elite customers get 2. No one can have more than 4 sandwiches. After everyone gets their promised sandwiches, we'll fairly distribute what's left over."
- Reservations = 1 or 2, depending on customer
- Limits = 4
- Shares = Fair leftover distribution
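The sandwich rule can be sketched as an allocation routine. This is an illustration only, not how any particular hypervisor schedules resources. The `VM` tuple and `allocate` function are hypothetical names.

```python
from typing import Dict, NamedTuple


class VM(NamedTuple):
    reservation: float  # guaranteed minimum
    limit: float        # hard maximum
    shares: float       # relative weight for leftovers


def allocate(total: float, vms: Dict[str, VM]) -> Dict[str, float]:
    """Grant reservations first, then split the rest by shares, never past a limit."""
    alloc = {name: vm.reservation for name, vm in vms.items()}
    remaining = total - sum(alloc.values())
    if remaining < 0:
        raise ValueError("reservations exceed total capacity")
    active = {name for name, vm in vms.items() if alloc[name] < vm.limit}
    while remaining > 1e-9 and active:
        total_shares = sum(vms[name].shares for name in active)
        if total_shares == 0:
            break
        handed_out = 0.0
        for name in list(active):
            offer = remaining * vms[name].shares / total_shares
            take = min(offer, vms[name].limit - alloc[name])
            alloc[name] += take
            handed_out += take
            if alloc[name] >= vms[name].limit - 1e-9:
                active.discard(name)  # this VM is at its limit
        remaining -= handed_out
        if handed_out <= 1e-9:
            break
    return alloc


if __name__ == "__main__":
    customers = {"alice": VM(1, 4, 1), "bob": VM(1, 4, 1), "elite": VM(2, 4, 1)}
    print(allocate(10, customers))  # 10 sandwiches, an assumed total
```

Whatever a capped VM cannot take is offered again to the VMs still under their limits, so leftovers are only stranded when everyone is at the limit.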
The below is far deeper than you need to know for the test, but cloud services like Google Cloud, AWS, and Microsoft Azure must use SDN (software-defined networking). The AWS dashboard shows the orchestration parts of a multi-VM deployment with network orchestration. Amazon calls this "CloudFormation". Here we're starting multiple:
- Database instances
- Security groups (firewall rulesets)
- Load balancers