MacSlash A Daily Dose of Mac News and Discussion
TenCon Keynote - Dr. Srinidhi Varadarajan
posted by Cannonball on Tuesday October 28, @07:26PM
from the 20Gbit/sec-isn't-just-a-good-idea,-it's-the-law. dept.
Mac OS X

Dr. Srinidhi Varadarajan is the director of the Terascale Computing Facility at Virginia Tech and is going to be taking us through the process of the creation of the Terascale Computing Facility. It's going to cover the why, the goals, the hardware and facilities, the software it's based on, the performance results and the research that's going to take place.


This all began early in 2003, bringing all the high performance computing people together. They wanted to build a world class program, and they needed a big ticket facility to go with it. Instead of having a grant/proposal structure for their students, they wanted to build a facility for them to use. They wanted to tie them into computational grids all over the nation. They want to treat supercomputers like electrical generating stations, using them in concert with visualization facilities and data storage centers, making the whole much more intelligent.

Va. Tech is a part of the National Lambda Rail network, a huge fibre-optic network (15,000 miles +), a major network pipe for them to make use of with the new machine.

The Terascale Computing Facility was also a political success within the university, allowing many of the different departments to discuss and come together and share resources. They're all going in the same direction now.

Derrick Story: When did this all come together?

Dr. Varadarajan: This started in March of 2003. Within a month, they had financing. "We hope to continue on this."

This was built for dual usage, experimental and real world based.

High performance architectures were what they wanted, 64-bit and up only. People don't pay enough attention to communication. Clusters use gigabit Ethernet to talk to each other; supercomputers use tightly coupled cores. That's what separates them. They wanted NLR, Internet and Internet2 connectivity. It's become operational, and will then be ready for production runs this fall.

Derrick Story: 64 bit is essential, how did you make the jump to the G5 and the Macintosh? You weren't using this before? Why here?

Dr. V: We're coming to that, please be patient.

Usage Goals:
Provide easy access for new investigators and exploratory research.
Support collaborative multisite research activities.
Support on-demand access to computational cycles from external research partners.

They don't want to shut people out because they don't have a grant. This is about getting things together. The System can be strictly partitioned or it can be loosely handled.

The Future:
Computational Sciences and Engineering is a long term initiative. The current facility will be followed by another one in 2006.

June 23rd: Apple announced the G5
June 26th: VT contacted Apple
Sept 5-11: G5s arrive
Sept 23rd: Facility began preliminary ops
Oct 1 - Nov: Performance optimization
Mid Nov: Facility available for initial applications. Any user with operational HPC (MPI) codes can access the facility at this point.
Jan 1st: Facility available for full production use.

That's an incredible timeline, folks.

Here's the Hardware

Choosing the right architecture: limited budget and price/performance was the main consideration.

The total cost of the asset, including systems, memory, storage, primary and secondary communications fabrics and cables is $5.2mil. Facilities upgrade was $2mil. 1mil for the upgrades, 1mil for the UPS and generators. Arguably the cheapest world class supercomputer.

Definitely the most powerful student machine. "Just that it runs is a big deal in itself."

Architecture Options:

Dell could deliver in mid-August. All Itanium II, but it fell through.
After that, AMD and IBM, Opteron systems. But the prices failed.
HP, same problem. All $9-10mil.

IBM couldn't deliver the 970 before January. But Apple could.

Don't design a machine for 18 months and then build it. Buy it and build it Right Then. Do it in 3 months.

The time was just right for the G5.

1100 dual Apple G5 2GHz CPU based nodes. Each node has 4GB of main memory and 160GB of Serial ATA storage. 176TB total secondary storage. 4 head nodes for compilations/job startup. 1 management node.

"I came to the Mac by reading the kernel manual first." Dr. V did not use a Mac before any of this.

Each G5 has 2 double precision FPUs. Each unit can complete 1 fused multiply-add operation per cycle, which counts as 2 flops. This is the most common op in numerical computations. Thus, each processor can deliver 2 DP units * 2 flops/cycle * 2GHz = 8GFlops. That's more than one Cray X1 node. In a desktop. (shit.)

Each dual G5 can deliver a peak of 16GFlops of double precision performance leading to an Rpeak of 17.6TF.
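The peak numbers above are plain multiplication, so here is a small sketch that restates them. The constant and function names are my own; the values are the ones quoted in the talk.

```python
# Peak-performance arithmetic from the talk, restated as a calculation.
# All names are illustrative; the figures come from the keynote notes.
FPUS_PER_CPU = 2      # double precision floating point units per G5
FLOPS_PER_FMA = 2     # one fused multiply-add counts as a multiply plus an add
CLOCK_GHZ = 2.0       # G5 clock speed
CPUS_PER_NODE = 2     # dual-processor nodes
NODES = 1100


def peak_gflops_per_cpu():
    """Theoretical double precision GFlops for one processor."""
    return FPUS_PER_CPU * FLOPS_PER_FMA * CLOCK_GHZ


def peak_gflops_per_node():
    """Theoretical double precision GFlops for one dual G5."""
    return peak_gflops_per_cpu() * CPUS_PER_NODE


def rpeak_tflops():
    """Theoretical peak (Rpeak) of the whole machine in TFlops."""
    return peak_gflops_per_node() * NODES / 1000.0


if __name__ == "__main__":
    print(peak_gflops_per_cpu(), peak_gflops_per_node(), rpeak_tflops())
```

Rpeak is a ceiling, not a measurement: it assumes every FPU retires a fused multiply-add on every cycle.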

Primary Comm Architecture.
Based on Infiniband tech. Switched network. Each node connects into the network at 20Gbps full duplex. 24 96-port switches organized in a fat tree topology. Mellanox designed the switches and cards they're using. Every node has a connection to every other node. It can support 150,000 connections per node. It's a very nice piece of hardware. Less than 10µs latency.

18 leaf switches, each using 64 ports to nodes. 6 spine switches as a backplane, 32 ports per leaf switch interconnect to the spine switches. 5-6 ports per leaf switch are connected into each spine switch. Total switching capacity: 46Tbps. Um. Wow.
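The switch counts hang together if you do the arithmetic. A quick sketch, using only the figures quoted above (the names and the dictionary layout are my own):

```python
# Fat tree bookkeeping from the quoted figures. Names are illustrative.
LEAF_SWITCHES = 18
SPINE_SWITCHES = 6
PORTS_PER_SWITCH = 96
LEAF_PORTS_TO_NODES = 64   # downlinks per leaf switch
LEAF_PORTS_TO_SPINE = 32   # uplinks per leaf switch
PORT_GBPS = 20             # full duplex rate per port
NODES = 1100


def fabric_summary():
    """Derive the headline fabric numbers from the per-switch figures."""
    total_switches = LEAF_SWITCHES + SPINE_SWITCHES
    node_ports = LEAF_SWITCHES * LEAF_PORTS_TO_NODES
    total_uplinks = LEAF_SWITCHES * LEAF_PORTS_TO_SPINE
    return {
        "total_switches": total_switches,
        "node_ports": node_ports,
        "spare_node_ports": node_ports - NODES,
        # Average uplinks from one leaf to one spine (hence "5-6 ports").
        "uplinks_per_leaf_per_spine": LEAF_PORTS_TO_SPINE / SPINE_SWITCHES,
        # Links landing on each spine switch, spread evenly.
        "links_per_spine": total_uplinks // SPINE_SWITCHES,
        # Downlinks versus uplinks on a leaf switch.
        "oversubscription": LEAF_PORTS_TO_NODES / LEAF_PORTS_TO_SPINE,
        "capacity_tbps": total_switches * PORTS_PER_SWITCH * PORT_GBPS / 1000.0,
    }


if __name__ == "__main__":
    for key, value in fabric_summary().items():
        print(key, value)
```

The 2:1 ratio of node ports to uplinks on each leaf is what makes this a half-bandwidth design, and 24 switches * 96 ports * 20Gbps is where the 46Tbps figure comes from.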

Why a half CBB design? 625MBps of duplex bandwidth when half the nodes simultaneously communicate to the other half. The full duplex bandwidth is the theoretical limit of the PCI-X bus.

They designed for the bus being the limit. Scientific applications are not perfectly synchronous, and hence rarely encounter any bandwidth limitations from the half CBB design.

This is all above my head right now :D

Gigabit Ethernet management backplane. Carries NFS, job startup control and typical IP traffic. It's based on five Cisco 4500 enterprise series switches. 240 Gigabit Ethernet ports/switch. Managed fabric with integrated IP traffic.

Facilities
How do you house this beast?

9000 sqft data center, raised floor, environmental controls, dual backbone, dual feeds and generators, fire suppression.

They took 4000sqft for this. There's a 24/7 NOC right there.

3MW of power. Dual redundant with backup UPS. 2+ million BTUs of cooling capacity using Liebert's extreme density cooling. This system uses rack-mounted heat exchangers with R-134a refrigerant and an overhead chiller.

Front to back cooling. Traditional AC would have resulted in a wind velocity of over 60 mph under the raised floor. They have a great wind tunnel. So, they used 270 of cold water (40degree) pumped in through huge pipes to chill refrigerant, then they go through copper pipes and through the hot aisles.

They've built a giant fridge. It stays at 72 degrees. If it fails, it jumps to 100 degrees in 2 minutes. 2 minutes after, crispy G5s.

30 machines a day for a while.

2.5GHz signalling in the Infiniband cable (looks like wide monitor cabling). It's all copper cable. 20Gbit/sec off copper. Unreal.

They had to rebuild the area around the building with place.

Bring in a machine, power it up, then open it up, put in the infiniband card, then the RAM, then power it up. Then rack it. 2 hours total per machine.

Software

Runs Mac OS X 10.2.7; Mellanox then wrote the Infiniband drivers. They use MPI parallel comm libraries. C and C++ optimizing compilers: IBM xlc and gcc 3.3.

Fortran 95/90/77 compilers: IBM xlf and NAGWare.

They rewrote a kext for cache optimized memory management. Ported MVAPICH to OS X, added message cache and dynamic memory management systems to improve performance. Scalable job startup system for MVAPICH.

Reliability

Supercomputers based on commodity clusters face reliability concerns due to component numbers. They developed a transparent fault tolerance system - called déjà vu - for engineering reliability into large-scale supercomputers. VT is leading the collaboration with PSC and ISR. Déjà vu is being ported to the G5 platform, and will be deployed at the TCF, funded by the NSF. Currently working on a patent application.

The app should recover from any failure, because the system does it, transparent to the program. Apps shouldn't have to worry about this. The system does.

Salient features:
Transparent checkpoint, recovery and migration system; it's kernel independent
New Model to achieve global state consistency
Incremental checkpointing
Non-Blocking checkpointing
Integrates user-initiated and system initiated checkpointing
Supports process migration
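To give a feel for what "incremental checkpointing" means in general, here is a toy sketch. It is not how Déjà vu works (the talk gave no implementation details); the class, its methods and the idea of hashing named blocks of bytes are all my own assumptions for illustration. The point is only that each checkpoint saves the blocks that changed since the last one rather than the whole state.

```python
import hashlib


class IncrementalCheckpointer:
    """Toy incremental checkpointer over a dict of block name -> bytes.

    Hypothetical illustration only: a real system would track dirty memory
    pages and write to stable storage, not to an in-memory dict.
    """

    def __init__(self):
        self._digests = {}  # block name -> digest at the last checkpoint
        self._store = {}    # block name -> saved bytes (stand-in for disk)

    def checkpoint(self, blocks):
        """Save only blocks whose contents changed; return their names."""
        written = []
        for name, data in blocks.items():
            digest = hashlib.sha256(data).hexdigest()
            if self._digests.get(name) != digest:
                self._store[name] = bytes(data)
                self._digests[name] = digest
                written.append(name)
        return written

    def restore(self):
        """Return a copy of the most recently saved state."""
        return dict(self._store)


if __name__ == "__main__":
    cp = IncrementalCheckpointer()
    state = {"grid": b"\x00" * 1024, "step": b"1"}
    print(cp.checkpoint(state))   # first checkpoint writes every block
    state["step"] = b"2"
    print(cp.checkpoint(state))   # second one writes only the changed block
    print(cp.restore() == state)
```

This sketch ignores deleted blocks and everything that makes the real problem hard, such as getting a consistent global state across more than a thousand nodes that are still exchanging messages.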

Communications
First version of the Mellanox driver and Verbs API was delivered in mid-August. Infiniband achieved 800MBps, with MPI performance of 700MBps (MPI latency 8-14µs). Changes to PCI-X timing have increased Infiniband performance to 870MBps over the Verbs API.

There is translation between PCI-X and main memory; you go through an engine first. The card is full 64-bit, but not from the PCI bus in the Mac.

The LINPACK benchmark

It solves a very large system of linear equations. Dense matrix operations.

500,000 variables.
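For anyone wondering what "solving a very large dense system" looks like in miniature, here is a sketch: plain Gaussian elimination with partial pivoting on a tiny matrix, plus the usual leading-order operation count of (2/3)n^3 for an n by n solve. This is not the benchmark code they ran; the function names are mine, and the real thing uses blocked, distributed routines on top of tuned BLAS.

```python
def solve_dense(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting.

    a is a list of n rows of n floats and b a list of n floats.
    Toy version of the kind of problem LINPACK measures.
    """
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Choose the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0.0:
            raise ValueError("matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper triangular system.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def approx_flops(n):
    """Leading-order operation count for an n by n dense solve."""
    return (2.0 / 3.0) * n ** 3


def matrix_bytes(n):
    """Memory for an n by n matrix of 8-byte doubles."""
    return 8 * n * n


if __name__ == "__main__":
    print(solve_dense([[2.0, 1.0], [1.0, 3.0]], [3.0, 5.0]))
    n = 500_000  # the problem size quoted in the talk
    print(approx_flops(n), matrix_bytes(n))
```

At the quoted size the matrix alone is 2 * 10^12 bytes of doubles, which is why a problem like this has to be spread across the memory of many nodes.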

They used the BLAS libraries; the core routine has a GEMM efficiency of 84.1% (fairly phenomenal). Their benchmark used a mix of Goto's libs and Apple's vecLib framework. IBM is nowhere near this good. Goto has the fastest library on pretty much every proc.

Currently they're at 9.555 Teraflops. They want another 10% boost pretty quick, crossing the 10 Teraflop line and becoming the first academic machine to do so. That makes them #3. Worldwide. Period.
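Putting the measured number next to the theoretical peak and the hardware cost quoted earlier gives a rough sense of efficiency and price/performance. A sketch, with names of my own choosing and only figures from the talk ($5.2mil for the system, 17.6TF peak, 9.555TF measured):

```python
# Figures quoted in the talk; names are illustrative.
RPEAK_TFLOPS = 17.6          # theoretical peak
RMAX_TFLOPS = 9.555          # current LINPACK result
SYSTEM_COST_USD = 5_200_000  # systems, memory, storage, fabrics and cables


def efficiency(rmax, rpeak):
    """Fraction of theoretical peak actually achieved."""
    return rmax / rpeak


def after_boost(rmax, percent):
    """Result after improving the measured number by a given percentage."""
    return rmax * (1.0 + percent / 100.0)


def dollars_per_gflop(cost_usd, tflops):
    """Hardware cost per GFlop at the given performance level."""
    return cost_usd / (tflops * 1000.0)


if __name__ == "__main__":
    print(round(efficiency(RMAX_TFLOPS, RPEAK_TFLOPS), 3))
    print(round(after_boost(RMAX_TFLOPS, 10), 2))
    print(round(dollars_per_gflop(SYSTEM_COST_USD, RMAX_TFLOPS)))
```

The cost figure here leaves out the $2mil facilities upgrade, so treat the dollars-per-GFlop number as a lower bound on what the whole project cost per unit of performance.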

Q&A:

How much time did it take to develop the custom software?

2 months, 18 hours a day, almost all by himself.

When you first brought it up, did you stagger it?

Yep, otherwise they'd spike the power.

How are you handling data coming off the machine?

There's scratch space on each node, and they have external storage, plugged into the Infiniband mesh.

How much disk space?

We're not sure yet. 40-50 TB eventually.

How do you health check the nodes?

A daemon on each node. It does fault tolerance and such at startup, built into job scheduling

Examples of stuff that's run into the TCF?

Nanoscale Electronics
Quantum Chemistry
Computational Chemistry/Biochem
Aerodynamics
Cell Cycle Modeling
Molecular Statics
Computational Acoustics
and tons more I couldn't capture fast enough.

Would you attribute your success to the single source of code? Do you see the networking stuff impacting the private sector?

It wasn't just me, the coding was him, and there's a huge long list of Apple folks helping out, as well as Mellanox, Liebert and Cisco, as well as Goto from the JPO, Dr. Panda and Andy Petit (OSU and UTK).

Major thanks to VT and everyone else.

They have people asking for clones, and expect many, many G5 clusters in the not too distant future.

What's the status of the code behind this?

Is this going to be open source? Most of it will go back to OSU and their license style. The memory manager will go open source. Mellanox hasn't said, but most of their other stuff is open source.

What's the cost on Infiniband?

All the switches and cards $1.6 mil. $176k for the cables.

Why did you use the G5 instead of the Opteron or Itanium?

Both are fairly nice, but they're expensive. First, they didn't pass the price/performance ratio test. Opteron doesn't do what the G5 does: 4GFlops at peak, and the G5 is twice that. The Itanium is phenomenally efficient, but only at 1.5GHz, not 2GHz. The #4 is an 8.6 Teraflop Itanium II cluster (on 2000 procs).

You built it all to 10.2.7, are you planning on upgrading to Panther?

They're upgrading to Panther in the next few weeks. The driver runs, the memory manager runs, everything else, no problem.

There's a lot of interest in departmental clusters, Is there documentation anywhere?

We hope to put up a full fledged package to duplicate this from 64 nodes and above. They hope to see many after this one.

How do you deal with Error Correction in Memory?

There's a lot of traffic on Ars Technica and other places. We do failure recovery; memory doesn't report. One of the things we've noticed is that failures aren't an issue yet. The reason they can be confident is the LINPACK test, which is showing 16 digits of accuracy. We are planning on moving to ECC systems in the future. They may have to run things twice for a bit.

How much coke and how many pizzas?

500-600 pizzas.

    The Hum (Score:2)
    by daeley ({robert} {at} {celsius1414.com}) on Tuesday October 28, @07:39PM (#55758)
    User #804 Info | http://www.celsius1414.com/
    I would like to stand in that room and Feel The Hum. :)
    Re:The Hum (Score:2, Interesting)
    by Anonymous Coward on Tuesday October 28, @07:59PM (#55761)
    They were talking about rebooting the machines staggered so that it would not overload the electrics and he was specifically referring to the way the startup chime propagates through the room when that happens. There should be an audio recording of that.
    Re:The Hum (Score:1)
    by SolInvictus on Wednesday October 29, @09:24AM (#55801)
    User #10164 Info
    I took a tour of the TCF this past Monday. The facility is incredible. Everything is extremely dense.

    And yes, it does hum. Loudly. The cooling units, particularly so.
    --Lex Talionis---- Christopher Fox ACSA Senior Technician / Technical Lead Capital One
    Thanks for the in-depth details about the system (Score:1)
    by UR30 (juuhaa@mac.n0-span.com) on Wednesday October 29, @12:38AM (#55772)
    User #7618 Info | http://radio.weblogs.com/0112083/
    Now I can go back to update my impressions [weblogs.com] of the system. The work they did at Virginia Tech is really top-notch.
    ECC? (Score:2)
    by Johnny Mnemonic (mdinsmore@mac.com) on Wednesday October 29, @05:40AM (#55782)
    User #162 Info

      We are planning on moving to ECC systems in the future.
     
    Pulled this out of the interview. Does that mean a different arch? Or will Apple supply ECC on a new machine ie G5 Xserve?
    Re:ECC? (Score:1)
    by bill_mcgonigle on Wednesday October 29, @01:09PM (#55818)
    User #3000 Info | http://www.zettabyte.net
    Or will Apple supply ECC on a new machine ie G5 Xserve?

    I think you nailed it. That's my last objection to spec'ing XServe's for Real Work(tm). Apple already did multi-channel RAID and a JFS, my other two gripes from days-gone-by.
    Listening to the Good Dr. (Score:2)
    by Omnipah (omni@gotz.DONT SPAM ME.org) on Wednesday October 29, @06:52AM (#55789)
    User #3397 Info | http://www.gotz.org
    Makes you feel like you are rubbing two sticks together in the dark. This presentation was absolutely amazing. The speed that this project was pulled off, the ability to make quick decisions in an organization that is filled with bureaucracy and red tape, and the personal stake that those involved have offered make this a more than sensational project.

    I have also had a chance to meet with Cannonball a bit here, and he is a great representative for MacSlash.
    He flew to Apple, then placed the order online (Score:1)
    by Sailfish on Thursday October 30, @05:35AM (#55885)
    User #11153 Info
    Apple Store online employees work on commission, wonder who got the lucky call?
    Too cool (Score:1)
    by rixstep on Thursday October 30, @02:28PM (#55930)
    User #6971 Info | http://rixstep.com/
    That guy is simply too cool.
    Copyright © 1999-2006 MacSlash Inc.