From tskirvin at ks.uiuc.edu Mon Mar 19 09:33:06 2007
From: tskirvin at ks.uiuc.edu (Tim Skirvin)
Date: Mon, 19 Mar 2007 09:33:06 -0500
Subject: [cluster-l] [mdgrabe@pitt.edu: home directories]
Message-ID: <20070319143306.GB1261@ks.uiuc.edu>

----- Forwarded message from "Grabe, Michael David" -----

From: "Grabe, Michael David"
Subject: home directories
Date: Tue, 13 Mar 2007 21:19:09 -0400

Dear Tim,

I have a cluster question, and I was looking through the NAMD and
cluster archives to try to address it. Previously I set up a small
cluster of OS X machines and ran NAMD on them. I am looking to do this
again, but I have a question about how to set up the users' home
directories. Before, I made a home directory for each user on each
node, so that when they logged in their home directories were stored
locally. I did this because I had the idea that storing each home
directory on some other node of the cluster would keep NAMD from
running under charmrun. I don't know if this is still true, and I
wonder if you have any advice on this.

Thanks in advance, and I apologize if my question is poorly phrased.

Take care,
Michael

----- End forwarded message -----

From noberg at uiuc.edu Wed Mar 28 13:18:47 2007
From: noberg at uiuc.edu (Nils Oberg)
Date: Wed, 28 Mar 2007 13:18:47 -0500
Subject: [cluster-l] Networking Equipment
Message-ID: <7.0.1.0.2.20070328130830.00f4c950@uiuc.edu>

I'm purchasing a 10-node cluster and the interconnect is going to be
gigabit Ethernet. The plan is to purchase a 24-port unmanaged
switch. I have a few questions regarding equipment quality:

1. Does the vendor matter, or are all 24-port unmanaged switches
   created equal?

2. Are there better networking cables than others?

3. All of the nodes I'm purchasing have two gigabit Ethernet
   ports. Is there a way to bundle these ports together to get twice
   the bandwidth/half of the latency of a single port?

If anyone knows of documentation or other resources for doing this
under one of the Linux cluster distributions, please let me know.

Thanks for any help,

Nils

--
Nils Oberg, Research Programmer
Civil & Environmental Engineering, University of Illinois at U-C
phone: 217-333-8365, web: http://vtchl.uiuc.edu

From jim at ks.uiuc.edu Wed Mar 28 13:42:47 2007
From: jim at ks.uiuc.edu (Jim Phillips)
Date: Wed, 28 Mar 2007 13:42:47 -0500 (CDT)
Subject: [cluster-l] Networking Equipment
In-Reply-To: <7.0.1.0.2.20070328130830.00f4c950@uiuc.edu>
References: <7.0.1.0.2.20070328130830.00f4c950@uiuc.edu>

On Wed, 28 Mar 2007, Nils Oberg wrote:

>
> I'm purchasing a 10-node cluster and the interconnect is going to be
> gigabit Ethernet. The plan is to purchase a 24-port unmanaged
> switch. I have a few questions regarding equipment quality:
>
> 1. Does the vendor matter, or are all 24-port unmanaged switches
> created equal?

If they have the same backplane bandwidth then they probably have the
same performance, but I can't guarantee that. We've used SMC switches,
and while I noticed that the unmanaged switch we bought as a
replacement is slower than the older managed switch, it doesn't affect
the overall application performance.

> 2. Are there better networking cables than others?

The ones from ECE stores were pretty good. You want something with the
nice flexible guards on the end, rather than just crimped on.

> 3. All of the nodes I'm purchasing have two gigabit Ethernet
> ports. Is there a way to bundle these ports together to get twice
> the bandwidth/half of the latency of a single port?

You'll never get lower latency. The early Beowulf projects used
channel bonding to double bandwidth, but that needs a second switch,
and you can't boot over the network. You could try giving each machine
two addresses and plugging it into the switch twice, then setting up
routing to use different ports for even and odd nodes. With three
8-port switches you can do a flat network neighborhood
(http://aggregate.org/FNN/) with up to 12 nodes. I'd like to hear how
that works if you decide to try it.

-Jim

From noberg at uiuc.edu Wed Mar 28 13:57:02 2007
From: noberg at uiuc.edu (Nils Oberg)
Date: Wed, 28 Mar 2007 13:57:02 -0500
Subject: [cluster-l] Single- vs. Dual- vs. Quad-core CPUs
References: <7.0.1.0.2.20070119172450.00ec8dd8@uiuc.edu>
 <7.0.1.0.2.20070124095646.00ec0ec8@uiuc.edu>
Message-ID: <7.0.1.0.2.20070328132512.028a4598@uiuc.edu>

Thanks for your help Jim.

I performed some benchmarks on demo equipment from AMD and Intel and
there are some interesting differences between the two platforms for
our code. All times are in seconds.

Here are some results:

2x2 Opteron 2218 2.6 GHz with 4 GB RAM:
1 core:  697
2 cores: 323
4 cores: 211

4x2 Opteron 875 2.2 GHz with 8 GB RAM:
1 core:  531
2 cores: 333
4 cores: 181
6 cores: 143
8 cores: 139

1x4 Xeon 5355 2.6 GHz with 4 GB RAM:
1 core:  510
2 cores: 343
4 cores: 251

2x4 Xeon 5355 2.6 GHz with 8 GB RAM:
1 core:  516
2 cores: 314
4 cores: 228
6 cores: 195
8 cores: 167

I don't understand why the Xeon performs better than the Opteron on
one core, but worse than the Opteron on 4 cores. I tried a different
CFD code and the same pattern emerged. Why might this be happening?

I was getting better than linear speedup results for one of our
programs. Is this possible? Here are some results:

2x4 Xeon 5355 2.6 GHz with 8 GB RAM:
cores: 1  7590
cores: 2  4523  speedup: 1.68
cores: 4  2060  speedup: 3.68
cores: 8   916  speedup: 8.29

2x2 Opteron 2218 2.6 GHz with 4 GB RAM:
cores: 1  8497
cores: 2  4360  speedup: 1.95
cores: 4  1883  speedup: 4.51

Does this make sense?
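
The speedup column above is the 1-core time divided by the n-core
time. A minimal Python sketch that recomputes it from the quoted
timings and adds parallel efficiency (speedup divided by core count)
follows. The function name speedup_table and the "superlinear" flag
are illustrative choices, not part of any benchmark tool; an
efficiency above 1.0 is what "better than linear" means here.

```python
# Speedup and parallel efficiency from wall-clock times in seconds.
# The timings are the ones quoted in the message above; the helper
# name is a hypothetical one chosen for this sketch.

def speedup_table(times):
    """Return {cores: (speedup, efficiency)} relative to the 1-core time."""
    t1 = times[1]
    return {n: (t1 / t, t1 / (t * n)) for n, t in sorted(times.items())}

xeon_2x4 = {1: 7590, 2: 4523, 4: 2060, 8: 916}
opteron_2x2 = {1: 8497, 2: 4360, 4: 1883}

for name, times in (("2x4 Xeon 5355", xeon_2x4),
                    ("2x2 Opteron 2218", opteron_2x2)):
    print(name)
    for n, (s, e) in speedup_table(times).items():
        flag = "  (superlinear)" if e > 1.0 else ""
        print(f"  cores: {n}  speedup: {s:.2f}  efficiency: {e:.2f}{flag}")
```

Efficiency makes the comparison between machines with different core
counts a little easier to read than raw speedup.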
Thanks for any help.

Nils

At 15:11 2/22/2007, Jim Phillips wrote:
>You really need to run some benchmarks. Failing that, look at the
>SPEC FP Rate results at
>http://www.spec.org/cpu2006/results/rfp2006.html There are three
>different CFD codes in the benchmark suite.
>
>1x4 2.7 GHz Xeon  leslie3d = 15.0  total = 33.6
>2x4 2.7 GHz Xeon  leslie3d = 21.9  total = 54.1
>2x2 3.0 GHz Xeon  leslie3d = 25.8  total = 43.0
>2x2 2.6 GHz Optn  leslie3d = 28.3  total = 38.1
>2x2 2.8 GHz Optn  leslie3d = 36.3  total = 48.3 (PathScale compilers)
>
>So, the dual-socket, dual-core Opteron *may* be your best bet, if
>your workload is similar to leslie3d. Run some benchmarks.
>
>-Jim
>
>
>
>On Thu, 22 Feb 2007, Nils Oberg wrote:
>
>>Hi Jim,
>>
>>Thanks for your response. I should probably describe the
>>problem. Our application is a computational fluid dynamics (CFD)
>>code. My understanding of CFD codes is that they are primarily
>>memory bound. Since the domain to be modeled is broken up into
>>chunks, during the course of a time-step in the simulation a large
>>number of messages (not necessarily large amounts of data) are passed
>>between processors.
>>
>>We're trying to decide between the following:
>>
>>uni-processor quad-core Xeon 4 GB RAM ($2,300 / node)
>>dual-processor quad-core Xeon 16 GB RAM ($5,800 / node)
>>dual-processor quad-core Xeon 8 GB RAM ($4,600 / node)
>>dual-processor dual-core Xeon 8 GB RAM ($3,800 / node)
>>dual-processor dual-core Opteron 8 GB RAM ($3,200 / node)
>
>--
>Nils Oberg, Research Programmer
>Civil & Environmental Engineering, University of Illinois at U-C
>phone: 217-333-8365, web: http://vtchl.uiuc.edu

From jim at ks.uiuc.edu Wed Mar 28 14:30:40 2007
From: jim at ks.uiuc.edu (Jim Phillips)
Date: Wed, 28 Mar 2007 14:30:40 -0500 (CDT)
Subject: [cluster-l] Single- vs. Dual- vs. Quad-core CPUs
In-Reply-To: <7.0.1.0.2.20070328132512.028a4598@uiuc.edu>
References: <7.0.1.0.2.20070119172450.00ec8dd8@uiuc.edu>
 <7.0.1.0.2.20070124095646.00ec0ec8@uiuc.edu>
 <7.0.1.0.2.20070328132512.028a4598@uiuc.edu>

Hi,

Everything you're seeing makes sense for a memory-limited code. Each
Opteron core has its own cache, and each chip has its own memory
interface. Each pair of Xeon cores shares a single cache, and all
chips share a single memory interface. Thus the Opteron system scales
more linearly to higher numbers of cores. Of course, when you use
fewer cores it also slows down linearly, as IBM pointed out a few
years ago. It's a trade-off, and all that really matters is the
maximum performance you can get when using all of the cores on a node.

Superlinear scaling is usually seen when your memory usage per core
drops, resulting in fewer cache misses. Superlinear scaling may
continue as you add nodes, so a smaller single-node benchmark may be
more realistic.

If your code uses shared memory (OpenMP or pthreads) within a node,
then the shared cache on the Xeon may help performance if both cores
access the same data.

In the end, make sure you're using the Intel compiler with all the
bells and whistles turned on, and buy whatever gives the best bang per
buck.

-Jim

From jay at kreibi.ch Wed Mar 28 14:41:29 2007
From: jay at kreibi.ch (Jay A. Kreibich)
Date: Wed, 28 Mar 2007 14:41:29 -0500
Subject: [cluster-l] Networking Equipment
In-Reply-To: <7.0.1.0.2.20070328130830.00f4c950@uiuc.edu>
References: <7.0.1.0.2.20070328130830.00f4c950@uiuc.edu>
Message-ID: <20070328194129.GA17630@uiuc.edu>

On Wed, Mar 28, 2007 at 01:18:47PM -0500, Nils Oberg scratched on the wall:
>
> I'm purchasing a 10-node cluster and the interconnect is going to be
> gigabit Ethernet. The plan is to purchase a 24-port unmanaged
> switch. I have a few questions regarding equipment quality:

> 1. Does the vendor matter, or are all 24-port unmanaged switches
> created equal?

Switches are most definitely not created equal. Just because it has 24
GigE ports on it does not mean the backplane can deal with 24 Gb/s of
data. To make sure you can do that, the marketing terms you want to
look for are "wire-speed, fully non-blocking." This means that the
backplane has at least as much bandwidth as all the ports added up and
that the forwarding engine can deal with moving that much data around.

I'm not sure you're going to find that in an unmanaged switch.
A 24-port GigE switch is still a pretty serious piece of equipment,
especially if you expect any serious performance out of it. Expect to
spend a few thousand.

> 2. Are there better networking cables than others?

If they're pre-made from a reputable source, then not really,
especially if you're only talking about 25 feet or less. If they're
certified then they're certified. 1000BASE-T requires Cat 5 or better.

> 3. All of the nodes I'm purchasing have two gigabit Ethernet
> ports. Is there a way to bundle these ports together to get twice
> the bandwidth/half of the latency of a single port?

2x bandwidth, yes... at least in theory. It rarely works out that way.
1/2 latency, no... the speed of light is the speed of light. In fact,
because of queue and buffering issues in general-purpose NICs, many
times your latency can suffer.

This depends on a number of things, however. The OS needs to support a
common trunking model. Many do, but not very many do it all that well.
This isn't really the OS's fault-- on-board buffers and queues in the
NICs make it very difficult to balance them at a very fine level. This
doesn't really matter if you're pumping data through the NICs, but it
can work against you if you have lots of small packets and you need
very very high efficiency.

You will also need a switch that supports trunking. Since this
requires port-by-port configuration, there is no way you will find
these features in an unmanaged switch. It is definitely a higher-end
feature for the networking gear involved.

Whether it will actually buy you anything depends a great deal on what
you need the network to do. If your cluster is throwing around huge
data sets, it might help. If, however, your cluster is just sending
around small sync messages, it isn't going to buy you much, if
anything. It might even hurt your overall performance.

If you do need to move moderate to large amounts of data around, I'd
look at using GigE Jumbo Frames before I'd worry about trunking.
You're likely to see a lot better return. You still need to be sure
your switch (and NICs) support Jumbo Frames. Again, not a trivial
feature.

> If anyone knows of documentation or other resources for
> doing this under one of the Linux cluster distributions, please let
> me know.

Talk to your department IT Support Professional. Have them talk to the
Network Design Office, or whatever they're called (CITES renamed a
bunch of the groups a few months ago). The NDO needs to approve all
network related gear attached to UIUCnet anyways, so it would be best
to get them in the loop early. They can offer suggestions, depending
on your cluster need. They also have standing bids with many of the
major vendors, so they can often get gear at a discount.

-j

--
Jay A. Kreibich < J A Y @ K R E I B I.C H >

"'People who live in bamboo houses should not throw pandas.' Jesus said that."
   - "The Ninja", www.AskANinja.com, "Special Delivery 10: Pop!Tech 2006"

From rdrobert at uiuc.edu Wed Mar 28 14:44:02 2007
From: rdrobert at uiuc.edu (Ricky Robertson)
Date: Wed, 28 Mar 2007 14:44:02 -0500 (CDT)
Subject: [cluster-l] Single- vs. Dual- vs. Quad-core CPUs
Message-ID: <20070328144402.AMV20119@expms5.cites.uiuc.edu>

Isn't this about memory access? It would make sense if the Xeon were
ever so slightly faster than the Opteron in terms of raw processor
speed but had a memory access strategy that is less scalable. Hence,
when the memory bus (and/or interface to the network, as the case may
be) is wide open for the single core, the more efficient processor
wins (i.e., Xeon), whereas when a bunch of cores are fighting for
access to the memory, the one that can manage all that traffic better
wins (i.e., Opteron).

Just some thoughts from a non-expert.

quiescence,
ricky

---- Original message ----
>Date: Wed, 28 Mar 2007 13:57:02 -0500
>From: Nils Oberg
>Subject: Re: [cluster-l] Single- vs. Dual- vs. Quad-core CPUs
>To: Jim Phillips
>Cc: cluster-l at ks.uiuc.edu
>
>Thanks for your help Jim.
>
>I performed some benchmarks on demo equipment from AMD and Intel and
>there are some interesting differences between the two platforms for
>our code. All times are in seconds.
>
>Here are some results:
>
> ...

From jay at kreibi.ch Wed Mar 28 15:07:35 2007
From: jay at kreibi.ch (Jay A. Kreibich)
Date: Wed, 28 Mar 2007 15:07:35 -0500
Subject: [cluster-l] Single- vs. Dual- vs. Quad-core CPUs
In-Reply-To: <7.0.1.0.2.20070328132512.028a4598@uiuc.edu>
References: <7.0.1.0.2.20070119172450.00ec8dd8@uiuc.edu>
 <7.0.1.0.2.20070124095646.00ec0ec8@uiuc.edu>
 <7.0.1.0.2.20070328132512.028a4598@uiuc.edu>
Message-ID: <20070328200735.GB17630@uiuc.edu>

On Wed, Mar 28, 2007 at 01:57:02PM -0500, Nils Oberg scratched on the wall:

> I don't understand why the Xeon performs better than the Opteron on
> one core, but worse than the Opteron on 4 cores. I tried a different
> CFD code and the same pattern emerged. Why might this be happening?

This type of performance analysis is mostly smoke and mirrors. Really
understanding what is going on requires a lot of nitty-gritty analysis
at a very low level.

In the bigger picture, this is why you run these kinds of tests and
benchmarks _with your own software_. You find what works best and
don't worry too much about why.

That said, some intelligent guesses can be made. The big thing to
remember about cores is that the current generation of
chips/sockets/motherboards has only so much bandwidth in and out of
the chip. While you can add more cores, this usually isn't accounted
for in the processor bus design. From that, if you have a very
memory-intensive application (as most scientific computing stuff is),
extra cores don't always buy you much. Putting four cores behind one
processor bus will only drive up contention on the bus. Things are
even worse if the cores share upper-layer on-chip caches.

So my guess is that you're hitting a bus contention issue, and the
different hardware is showing this in different ways.
The fact that the AMDs are dual core, while the Xeon is quad core, can
be very significant-- even if the total number of cores is the same.
Also, when running "on four cores" for something like the dual-quad
Xeon, it can make a big difference if that's "four cores on one chip"
or "two cores on two chips." You might sneak two cores onto one
processor bus without too many issues. You're not going to get away
with that with four-core chips.

> I was getting better than linear speedup results for one of our
> programs. Is this possible? Here are some results:

Sure. Different cache usage, perhaps leading to less memory thrashing
as the problem is spread out. The OS is going to require some amount
of overhead, but you only pay that once, regardless of the number of
cores you're using (this is assuming the unused cores are turned off,
not simply left unutilized by the program). This is especially true of
something like networking, which can put a huge number of interrupts
on a processor, but usually only one. This leads to massive context
switching costs when using only one core.

There are a million different reasons we could come up with, but to be
honest, it's all a lot of guessing without a pretty in-depth analysis.

Additionally, wall-clock speed is not a great way to do performance
tests. In the end, it is usually what matters, but it isn't going to
answer very many questions of this nature.

-j

--
Jay A. Kreibich < J A Y @ K R E I B I.C H >

"'People who live in bamboo houses should not throw pandas.' Jesus said that."
   - "The Ninja", www.AskANinja.com, "Special Delivery 10: Pop!Tech 2006"

From cclausen at uiuc.edu Wed Mar 28 16:20:26 2007
From: cclausen at uiuc.edu (Christopher D. Clausen)
Date: Wed, 28 Mar 2007 16:20:26 -0500
Subject: [cluster-l] Networking Equipment
References: <7.0.1.0.2.20070328130830.00f4c950@uiuc.edu>

Jim Phillips wrote:
> On Wed, 28 Mar 2007, Nils Oberg wrote:
>
>>
>> I'm purchasing a 10-node cluster and the interconnect is going to be
>> gigabit Ethernet. The plan is to purchase a 24-port unmanaged
>> switch. I have a few questions regarding equipment quality:
>>
>> 1. Does the vendor matter, or are all 24-port unmanaged switches
>> created equal?
>
> If they have the same backplane bandwidth then they're probably the
> same performance, but I can't guarantee that. We've used SMC
> switches, and while I noticed that the unmanaged switch we bought as
> a replacement is slower than the older managed switch, it doesn't
> affect the overall application performance.

Note that some switches support jumbo frames and some do not. You may
or may not care about a 9000-byte MTU (over the normal 1500), though.

From jim at ks.uiuc.edu Wed Mar 28 16:54:09 2007
From: jim at ks.uiuc.edu (Jim Phillips)
Date: Wed, 28 Mar 2007 16:54:09 -0500
Subject: [cluster-l] Networking Equipment
References: <7.0.1.0.2.20070328130830.00f4c950@uiuc.edu>

I think jumbo frames were more of a benefit when gigabit was new,
cards were dumb, and processors were slow. I can measure 900 Mbit/s
with netperf on our cluster so I don't think jumbo frames would help
much. Also, since all machines on a switch need to have jumbo frames
enabled, you can't use them if you netboot or plug the switch into the
building network (not that you'd want to do that).

-Jim

From jay at kreibi.ch Wed Mar 28 22:29:55 2007
From: jay at kreibi.ch (Jay A. Kreibich)
Date: Wed, 28 Mar 2007 22:29:55 -0500
Subject: [cluster-l] Networking Equipment
References: <7.0.1.0.2.20070328130830.00f4c950@uiuc.edu>
Message-ID: <20070329032955.GA25330@uiuc.edu>

On Wed, Mar 28, 2007 at 04:54:09PM -0500, Jim Phillips scratched on the wall:
>
> I think jumbo frames were more of a benefit when gigabit was new, cards
> were dumb, and processors were slow. I can measure 900 Mbit/s with
> netperf on our cluster so I don't think jumbo frames would help much.

Yes and no. While you might get close to max capacity without jumbo
frames, that's not useful if you're using 95% of your available CPU to
do it.

In terms of the network stack (especially TCP), a very large percent
of the processing costs are "per packet" and not "per byte." This is
even more true if you have cards that off-load the IP checksums and
such. From that standpoint, jumbo frames have the ability to cut the
resources consumed by the networking stack to 1/6th their old cost.
Since the primary goal of a cluster is to spend CPU resources solving
problems, rather than running house-keeping tasks, that can be a
significant win.

On the other hand, if your cluster isn't filling 1500-byte frames to
start with, jumbos will buy you nothing but configuration headaches.
As with so many things, it depends a lot on the applications you're
trying to run and how those systems utilize the interconnect.
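
The "1/6th" figure follows from the frame sizes: a 9000-byte MTU
carries about six times the data of a 1500-byte MTU per packet, so
roughly one sixth as many packets are needed for the same traffic. A
rough Python sketch of that arithmetic follows. It assumes standard
Ethernet framing (38 bytes of preamble, header, FCS, and inter-frame
gap per frame) and IPv4 plus TCP headers without options (40 bytes);
the function names are illustrative only.

```python
# Back-of-the-envelope frame rates on a saturated gigabit link.
# Assumptions: standard Ethernet framing and IPv4+TCP without options.

ETH_OVERHEAD = 38    # preamble+SFD 8, MAC header 14, FCS 4, inter-frame gap 12
TCPIP_HEADERS = 40   # IPv4 header 20 + TCP header 20

def frames_per_second(mtu, line_rate_bps=1_000_000_000):
    """Full-size frames per second at the given line rate."""
    return line_rate_bps / 8 / (mtu + ETH_OVERHEAD)

def payload_mbit_per_second(mtu, line_rate_bps=1_000_000_000):
    """TCP payload throughput in Mbit/s with every frame full."""
    payload_bytes = mtu - TCPIP_HEADERS
    return frames_per_second(mtu, line_rate_bps) * payload_bytes * 8 / 1e6

for mtu in (1500, 9000):
    print(f"MTU {mtu}: {frames_per_second(mtu):,.0f} frames/s, "
          f"{payload_mbit_per_second(mtu):.0f} Mbit/s payload")

ratio = frames_per_second(1500) / frames_per_second(9000)
print(f"packets handled per second drop by a factor of {ratio:.2f}")
```

The payload gain is only a few percent; the benefit described above is
the drop in per-packet work, which only appears when the application
actually fills the larger frames.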

-j

--
Jay A. Kreibich < J A Y @ K R E I B I.C H >

"'People who live in bamboo houses should not throw pandas.' Jesus said that."
   - "The Ninja", www.AskANinja.com, "Special Delivery 10: Pop!Tech 2006"

From noberg at uiuc.edu Thu Mar 29 14:49:06 2007
From: noberg at uiuc.edu (Nils Oberg)
Date: Thu, 29 Mar 2007 14:49:06 -0500
Subject: [cluster-l] Single- vs. Dual- vs. Quad-core CPUs
In-Reply-To: <20070328200735.GB17630@uiuc.edu>
References: <7.0.1.0.2.20070119172450.00ec8dd8@uiuc.edu>
 <7.0.1.0.2.20070124095646.00ec0ec8@uiuc.edu>
 <7.0.1.0.2.20070328132512.028a4598@uiuc.edu>
 <20070328200735.GB17630@uiuc.edu>
Message-ID: <7.0.1.0.2.20070329144510.00ebba90@uiuc.edu>

At 15:07 3/28/2007, Jay A. Kreibich wrote:
> Additionally, wall-clock speed is not a great way to do performance
> tests. In the end, it is usually what matters, but it isn't going to
> answer very many questions of this nature.

What would be a good way to do performance tests? I looked at things
like valgrind and other performance testers, but the ones I saw were
intrusive and slowed performance down.

Thanks,

Nils

--
Nils Oberg, Research Programmer
Civil & Environmental Engineering, University of Illinois at U-C
phone: 217-333-8365, web: http://vtchl.uiuc.edu

From noberg at uiuc.edu Thu Mar 29 15:12:53 2007
From: noberg at uiuc.edu (Nils Oberg)
Date: Thu, 29 Mar 2007 15:12:53 -0500
Subject: [cluster-l] Single- vs. Dual- vs. Quad-core CPUs
References: <7.0.1.0.2.20070119172450.00ec8dd8@uiuc.edu>
 <7.0.1.0.2.20070124095646.00ec0ec8@uiuc.edu>
 <7.0.1.0.2.20070328132512.028a4598@uiuc.edu>
Message-ID: <7.0.1.0.2.20070329144916.00ebc598@uiuc.edu>

At 14:30 3/28/2007, Jim Phillips wrote:
>Everything you're seeing makes sense for a memory-limited code.

Will a memory-limited code benefit significantly from adding a
higher-speed interconnect? Do vendors lend demo equipment to test
these sorts of things?

>Each Opteron core has its own cache, and each chip has its own
>memory interface. Each pair of Xeon cores shares a single cache,
>and all chips share a single memory interface. Thus the Opteron
>system scales more linearly to higher numbers of cores. Of course,
>when you use fewer cores it also slows down linearly, as IBM pointed
>out a few years ago. It's a trade-off, and all that really matters
>is the maximum performance you can get when using all of the cores
>on a node.

Is there any way to determine how jobs are scheduled? I'm thinking of
the case where a user runs an mpich job with 10 processes on a 5-node
cluster with 8 cores in each node. Will mpich put 2 processes on each
node, or will it bunch them all on the first two nodes in its machines
file?

Thanks,

Nils

>--
>Nils Oberg, Research Programmer
>Civil & Environmental Engineering, University of Illinois at U-C
>phone: 217-333-8365, web: http://vtchl.uiuc.edu

From jim at ks.uiuc.edu Thu Mar 29 15:54:19 2007
From: jim at ks.uiuc.edu (Jim Phillips)
Date: Thu, 29 Mar 2007 15:54:19 -0500 (CDT)
Subject: [cluster-l] Single- vs. Dual- vs. Quad-core CPUs
In-Reply-To: <7.0.1.0.2.20070329144916.00ebc598@uiuc.edu>
References: <7.0.1.0.2.20070119172450.00ec8dd8@uiuc.edu>
 <7.0.1.0.2.20070124095646.00ec0ec8@uiuc.edu>
 <7.0.1.0.2.20070328132512.028a4598@uiuc.edu>
 <7.0.1.0.2.20070329144916.00ebc598@uiuc.edu>

On Thu, 29 Mar 2007, Nils Oberg wrote:

> Will a memory-limited code benefit significantly by adding a higher-speed
> interconnect? Do vendors lend demo equipment to test these sorts of things?

By memory-limited I meant that your performance within a node is
limited by memory bandwidth, memory latency, and/or cache size. Just
because you use a lot of data on a single node doesn't mean that you
need to communicate a lot of data to other nodes. It depends on the
algorithm.

You can get a 10,000 SU development allocation at NCSA in under a
month. It's a lot easier to test there than to borrow and install
cards.

> Is there any way to determine how jobs are scheduled? I'm thinking of the
> case where a user runs an mpich job with 10 processors on a 5-node cluster
> with 8 cores in each node. Will mpich put 2 processes on each node, or will
> it bunch them all on the first two nodes in its machines file?

The queueing system is generally responsible for that.
If you let Grid Engine schedule one slot per core then you have a
choice of fill-up or round-robin assignment in the config file. You
can also let Grid Engine just hand out nodes and tell mpirun to use 8
processes per node.

-Jim

From jay at kreibi.ch Thu Mar 29 19:01:17 2007
From: jay at kreibi.ch (Jay A. Kreibich)
Date: Thu, 29 Mar 2007 19:01:17 -0500
Subject: [cluster-l] Single- vs. Dual- vs. Quad-core CPUs
In-Reply-To: <7.0.1.0.2.20070329144510.00ebba90@uiuc.edu>
References: <7.0.1.0.2.20070119172450.00ec8dd8@uiuc.edu>
 <7.0.1.0.2.20070124095646.00ec0ec8@uiuc.edu>
 <7.0.1.0.2.20070328132512.028a4598@uiuc.edu>
 <20070328200735.GB17630@uiuc.edu>
 <7.0.1.0.2.20070329144510.00ebba90@uiuc.edu>
Message-ID: <20070330000117.GA4029@uiuc.edu>

On Thu, Mar 29, 2007 at 02:49:06PM -0500, Nils Oberg scratched on the wall:
>
> At 15:07 3/28/2007, Jay A. Kreibich wrote:
> > Additionally, wall-clock speed is not a great way to do performance
> > tests. In the end, it is usually what matters, but it isn't going to
> > answer very many questions of this nature.
>
> What would be a good way to do performance tests? I looked at things
> like valgrind and other performance testers, but the ones I saw were
> intrusive and slowed performance down.

It depends on what you're testing for. While wall-clock doesn't offer
you much idea of what is going on, in the end it is usually what you
care about. A high performance system is usually designed to answer a
question, and the only thing most people care about is that the
question is answered as quickly as possible in terms of real-life
minutes and seconds. If you can benchmark your actual loads with
actual data, that's what really counts.

It is only when you get to the question of tuning-- either the
algorithm or the hardware configuration-- that you need to ask more
detailed questions. If you're happy with the expected performance at
the prices you're looking at, it might not be worth any additional
testing. If, on the other hand, you're running thousands and thousands
of simulations and getting a 10% run-time improvement translates to
cutting out four or five weeks' worth of work, it might be worth
investing a few days in tuning (NOTE: it isn't worth much more than
that, however).

In order to improve runtimes, you need to learn a lot more details
about where your bottlenecks are and where your runtime is being
spent. Some of these are hard questions, however. Looking to see how
much time the process spends asleep waiting for network traffic is
fairly easy to answer. Looking to see how many runtime cycles are
spent waiting for memory due to cache performance is much more tricky.

Linux is not my OS of choice, so I can't really offer specific
suggestions beyond saying that cluster tuning is a bit of a Heisenberg
issue. If you slow down the process by putting all kinds of
instrumentation on it, you might find some issues with cache
performance. On the other hand, the fact that you have the process
under inspection might change its network performance and hide issues
that are happening there. The issues are very similar to
multi-threaded programming, only worse.

-j

--
Jay A. Kreibich < J A Y @ K R E I B I.C H >

"'People who live in bamboo houses should not throw pandas.' Jesus said that."
   - "The Ninja", www.AskANinja.com, "Special Delivery 10: Pop!Tech 2006"
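
Benchmarking the real workload by wall-clock time, as suggested above,
needs no instrumentation inside the program. A minimal Python sketch
follows: it runs a command several times from the outside and reports
the spread. The helper name time_command and the placeholder command
are assumptions made for illustration; in practice the command would
be the actual job launch line with real input data.

```python
# Non-intrusive wall-clock timing: the job runs unmodified and only
# its start and end times are recorded from outside the process.
import statistics
import subprocess
import sys
import time

def time_command(cmd, repeats=3):
    """Run cmd `repeats` times; return each run's wall-clock seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(cmd, check=True)   # raises if the job fails
        durations.append(time.perf_counter() - start)
    return durations

# Placeholder workload (hypothetical); replace with the real job command.
cmd = [sys.executable, "-c", "sum(range(10**6))"]
runs = time_command(cmd)
print(f"min {min(runs):.3f}s  median {statistics.median(runs):.3f}s  "
      f"max {max(runs):.3f}s")
```

Repeating the run and looking at the spread gives some idea of how
much of a measured difference between machines is just noise.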