# Thread: More precise orbital simulation program

1. Okay, try the program in post #59 first, then try this. It adds another loop, so a two-body system might take a little longer than the last one, but a multi-body system should run more quickly by shortening the main loop, increasingly so with a greater number of bodies, without affecting the rest of your program. By the way, I just noticed you've got h=0 for that second loop, which I've changed to h=1. It wouldn't make a difference to your program's values, but it caused the loop to run one extra time unnecessarily, so that'll save you a bit more time right there.
Code:
```
for (int h = 1; h <= NumObjects; h++) {
    objvx[h] = objvx[h] * ObjMass[h];
    objvy[h] = objvy[h] * ObjMass[h];
    objvz[h] = objvz[h] * ObjMass[h];
}

for (int k = 1; k <= NumObjects; k++) {
    for (int j = k + 1; j <= NumObjects; j++) {
        dx = objx[j] - objx[k];
        dy = objy[j] - objy[k];
        dz = objz[j] - objz[k];
        D = sqrt(dx*dx + dy*dy + dz*dz);
        if (((objsize[k] + objsize[j])/2) > D) {
            Collisions = Collisions + 1;
            CollisionObjectA[Collisions] = j;
            CollisionObjectB[Collisions] = k;
        }
        f = ObjMass[j] * ObjMass[k] * 398600440000000/D/D/D;
        if (f > 0) {
            objvx[j] = objvx[j] - dx*f;
            objvy[j] = objvy[j] - dy*f;
            objvz[j] = objvz[j] - dz*f;
            objvx[k] = objvx[k] + dx*f;
            objvy[k] = objvy[k] + dy*f;
            objvz[k] = objvz[k] + dz*f;
        }
    }
}

for (int h = 1; h <= NumObjects; h++) {
    objvx[h] = objvx[h]/ObjMass[h];
    objvy[h] = objvy[h]/ObjMass[h];
    objvz[h] = objvz[h]/ObjMass[h];
    objx[h] = objx[h] + objvx[h];
    objy[h] = objy[h] + objvy[h];
    objz[h] = objz[h] + objvz[h];
}
```

2. Originally Posted by tony873004
Code:
```
for (int k = 1; k <= NumObjects; k++) {
    tM = ObjMass[k] * 398600440000000;
    for (int j = k + 1; j <= NumObjects; j++) {
        dx = objx[j] - objx[k];
        dy = objy[j] - objy[k];
        dz = objz[j] - objz[k];
        D = sqrt(dx*dx + dy*dy + dz*dz);
        if (((objsize[k] + objsize[j])/2) > D) {
            Collisions = Collisions + 1;
            CollisionObjectA[Collisions] = j;
            CollisionObjectB[Collisions] = k;
        }
        if (tM > 0) {
            f = tM/D/D/D;
            objvx[j] = objvx[j] - dx*f;
            objvy[j] = objvy[j] - dy*f;
            objvz[j] = objvz[j] - dz*f;
        }
        tM2 = ObjMass[j] * 398600440000000;
        if (tM2 > 0) {
            f = tM2/D/D/D;
            objvx[k] = objvx[k] + dx*f;
            objvy[k] = objvy[k] + dy*f;
            objvz[k] = objvz[k] + dz*f;
        }
    }
}
```
So if I understand it correctly, if the code is parallelizable I should be able to go through the loop in any order I want? In the above block of code, I believe this is possible, as all it is doing is updating velocities. The positions are not altered in this embedded loop.
Take a look at this slight rewrite; it gives the compiler a lot more to work on for vectorization, because it doesn't require dataflow analysis to see that all j iterations are independent.
I'm trading space (keeping the temporary variables for all iterations) for speed (allowing calculations in different iterations to overlap cleanly).
Code:
```
for (int k = 1; k <= NumObjects; k++) {
    tM[k] = ObjMass[k] * 398600440000000;

    /* this loop is a very good candidate for vectorization */
    for (int j = k + 1; j <= NumObjects; j++) {
        dx[j] = objx[j] - objx[k];
        dy[j] = objy[j] - objy[k];
        dz[j] = objz[j] - objz[k];
        D[j] = sqrt(dx[j]*dx[j] + dy[j]*dy[j] + dz[j]*dz[j]);
    }

    for (int j = k + 1; j <= NumObjects; j++) {
        if (((objsize[k] + objsize[j])/2) > D[j]) {
            Collisions = Collisions + 1;
            CollisionObjectA[Collisions] = j;
            CollisionObjectB[Collisions] = k;
        }
    }

    /* this one too */
    for (int j = k + 1; j <= NumObjects; j++) {
        if (tM[k] > 0) {
            f[j] = tM[k]/D[j]/D[j]/D[j];
            objvx[j] = objvx[j] - dx[j]*f[j];
            objvy[j] = objvy[j] - dy[j]*f[j];
            objvz[j] = objvz[j] - dz[j]*f[j];
        }
        tM2[k] = ObjMass[j] * 398600440000000;
        if (tM2[k] > 0) {
            f[j] = tM2[k]/D[j]/D[j]/D[j];
            objvx[k] = objvx[k] + dx[j]*f[j];
            objvy[k] = objvy[k] + dy[j]*f[j];
            objvz[k] = objvz[k] + dz[j]*f[j];
        }
    }
}
```
I made multiple smaller j loops to make it clearer what I mean about easily parallelizable computations, and to help the compiler, since I don't know how well it handles a long inner loop; it's possible it will work better with a single inner loop over everything.

Incidentally, with zero offset arrays, you could do the loops as:
Code:
```
k = NumObjects;
while (k--) {
    j = k;
    while (j--) {
        .
        .
        .
    }
    j = k;
    while (j--) {
        .
        .
        .
    }
}
```

3. Henrik,

Well, I ran that through. The first loop vectorized, but the second spewed out a lot. Most interesting is this:

Code:
```.\CodeOp2.cpp(45) : (col. 3) remark: vector dependence: proven ANTI dependence between tM2 line 54, and tM2 line 52.
.\CodeOp2.cpp(45) : (col. 3) remark: vector dependence: proven ANTI dependence between tM2 line 53, and tM2 line 52.```
Note the compiler *proved* data dependence with the tM2 array. There were a ton of other *assumed* dependencies, which I suspect have to do with the fact that it doesn't know just what the obj and objv arrays are, since they are passed to it as arguments. For all it knows, those arrays might overlap in some way. I can (when I look it up) tell the compiler those arrays do not overlap, along with other information.

However, it's "darn sure" the above two are there.

ETA: I see it now -- I get confused over the 'k' and 'j' iterations. In the second loop, each iteration is doing something with the same [k] values.

-Richard

4. My mistake on the second loop; probably a better result with:
Code:
```
for (int j = k + 1; j <= NumObjects; j++) {
    if (tM[k] > 0) {
        f[j] = tM[k]/D[j]/D[j]/D[j];
        objvx[j] = objvx[j] - dx[j]*f[j];
        objvy[j] = objvy[j] - dy[j]*f[j];
        objvz[j] = objvz[j] - dz[j]*f[j];
    }
}
for (int j = k + 1; j <= NumObjects; j++) {
    tM2[j] = ObjMass[j] * 398600440000000;
    if (tM2[j] > 0) {
        f[j] = tM2[j]/D[j]/D[j]/D[j];
        objvx[k] = objvx[k] + dx[j]*f[j];
        objvy[k] = objvy[k] + dy[j]*f[j];
        objvz[k] = objvz[k] + dz[j]*f[j];
    }
}
```
splitting the calculation so the two loops are separate.
The one that's updating across k with values from j may be a real bummer to vectorize, so you should probably just accept it if it doesn't.

As for the strange dependency on the objv arrays, it's probably because you had an inner loop that updated in both j and k order.

5. Incidentally, is it just me who's reading things wrong, or what?

From what I read, the f in the loop is actually acceleration, not force, so it should really have been a. But the thing that strikes me is that it's the other mass divided by distance cubed, not by distance squared, as it would be if my recollection of physics were right. What's going on there?

6. Originally Posted by HenrikOlsen
Incidentally, is it just me who's reading things wrong, or what?

From what I read, the f in the loop is actually acceleration, not force, so it should really have been a. But the thing that strikes me is that it's the other mass divided by distance cubed, not by distance squared, as it would be if my recollection of physics were right. What's going on there?

Yeah, 'f' is acceleration -- the mass units are "earth masses", I think (Tony can confirm). The cube comes in from breaking the acceleration vector apart into Cartesian components:

a = -GM/R^2 * n, where n is a *unit* vector from the source to the field point, and R is the scalar distance. We can write n as r/R, where r is the distance vector. Thus,

a = -GM/R^3 * r. Now, r is just (x*i + y*j +z*k)

And that's what the loops are doing.

Now, Tony needs the actual R (D in the program), which we have to get from summing the squares and taking a square root. The Intel compiler (and MS's as well), in wide-open optimization mode, will do sqrt() as an intrinsic: no function call, just the hardware square root instruction. How many clock ticks that takes compared to a general power, I'm not sure (and it's different between the SSE2 and x87 instructions). But it has a general power instruction as well.

If Tony didn't need the D, I would just code that as an intrinsic 3/2 power and compare times. We're already taking a square root, and the 3/2 would eliminate the additional division entirely.

As is, since he needs the D, I modified it to just save the sum of squares as D2, then do the cube as dividing by (D*D2). But I haven't yet compared speeds.

-Richard

7. No, there isn't a single power instruction -- the compiler will do it directly, but it takes several instructions, several complex instructions -- the square root instruction is reasonably fast.

If you're interested, the way you do a general power is take the log (the base-2 log, at that), multiply by the exponent, then raise 2 back to the result. I've forgotten exactly, but years ago I played around with doing it using the x87 instructions.

Anyway, doing a general power is a costly operation all the way around, but they do have optimized square root instructions.

-Richard

8. Oh great, after changing things, I'm getting a "condition may protect exception" message preventing vectorization. Now, let's see if I can figure out what that means................

Oh, I think that means that, if the code executed sequentially, the test for >0 would "protect" from an exception. When the CPU gets in "pipeline" and vector mode, it does speculative execution of all paths of a branch......that is probably what that's about.

-Richard

9. Just drop the if, the only case where you have to be careful is if D is 0.

10. This is fun, isn't it?

11. Well, I got Henrik's version significantly vectorized, all but the Collisions thing. Now, to see a) if it works, and b) if it's faster. To do it, I used some compiler pragmas which made it ignore the "exception protection" thing...... I also used some to tell the compiler that the array pointers passed were not "aliased", that is, they point to distinct arrays. That alone killed all the "assumed" dependencies involving the obj arrays.

You may all place bets....................

-Richard

12. Well, it appears to work, and is *slightly* faster on the "barycenter" dataset than the non-vector, Intel compiled SSE2 version.

Tony tried the SSE2 version of the .dll on his Pentium 4 system (details I don't know) and reported it was slower than the old .dll. However, on my Athlon64 system, the SSE2 version looks about twice as fast.

In 30 seconds, the old .dll would advance to 2004 (starting from 1998), while the SSE2 would get to 2012.

The vector version I just compiled makes it to 2013 in those 30 secs.

And while I don't think it makes much of a difference, I'm running the Athlon64 in 64-bit mode using WinXP 64-bit. 32-bit code runs under that in a "WOW64" layer; sometimes 32-bit code runs slightly slower there, and sometimes it's actually slightly faster.

-Richard

13. Just tried and compared it on the "moonbuilder" dataset, which places a large number of small objects in earth orbit, then lets them do their thing. After a year or so, the program ends up with one large object in a tight earth orbit, and a smaller one in a very big, perturbed elliptical orbit (the path is not actually a closed ellipse, and I wouldn't exactly call it a simple precession either; it's just a "wig-waggling" near-ellipse, which is due to the sun, I'm assuming).

Anyway, all three .dlls produce the same result after time. The run starts out at a time stamp of just before 1/1/2005. In 30 secs, the old .dll makes it to just around 3/2/2005. The SSE2 .dll makes it to around 2/20/2006 in that same time period. "Blows away", I would say. However, as collisions occur, the number of objects is reduced, so a slight edge in speed early on may get the number of objects down, boosting total speed even more, which may explain that enormous difference.

The vector version made it up into 3/08/2006, which agrees with "slightly faster" above.

-Richard

14. Has anyone tried out the code in post #59 and/or post #61 yet? I'm eager to see how they do.

15. Originally Posted by grav
Has anyone tried out the code in post #59 and/or post #61 yet? I'm eager to see how they do.
No, I was concentrating on trying to get Henrik's vectorizable version to work. Your modifications might do something for the sequential version, but they will not vectorize as is.

I forced a loop to vectorize which the compiler says "seems inefficient". I may let that go back to serial, and just have the one loop its efficiency logic thought was good.

I may even let the PGO (profile guided) have a go on this, as that will track data statistics that can tell the compiler what's happening at run time. This assumes the PGO crap will work in a .dll called by a VB program..............

-Richard

16. Well, the PGO-instrumented build is going wide open on the Athlon64 box as I type. The idea is to let it gather a bunch of run-time data, which the compiler can then use to better decide what to do. It's slow as snails because it keeps track of all sorts of stuff.

So I'll let that burn CPU on several data sets, then run a feedback build and we'll see if that makes any difference. PGO can really shine, but I don't know what it'll do here.

Oh, and I'm just shaking my head at myself. I've got all these dadgumed compilers and tools that I don't really know how to use.......... I have to read a page or two of docs just to even know how to begin.......

-Richard

17. Well, heck's bells. I did the PGO, and on the feedback compile, the compiler says it was inefficient to vectorize all loops save for one, the last! This was based on the run-time data, now.

I'm going to run the optimized result of the build and see what the times are now.

Something else I noticed. There was divergence in the "moonbuilder" results between the PGO run and the others. During the PGO phase, all optimizations were turned off, and that indicates a precision thing. These simulations are somewhat chaotic, and there was a big difference. I was using the default timestep of 16 for all of them.

-Richard

18. Well, the compiler's PGO conclusions are sound. The PGO-optimized build just went to 5/06 in 30 secs, compared to the best result I had of 3/06 before.

I need to carefully look and compare the results. This latest result looks different than the previous as well.

-Richard

19. There are some differences between all the various versions I compiled and the original. These are slight precision differences in exactly how the compiler does things, and maybe some of the changes we made, too.

What I need to do, and will probably do tomorrow, is compile the various versions using the exact same source code, and then see how they differ.

Anyway, the latest and fastest PGO version seems to agree the most with the original .dll, I think, on the "moonbuilder" dataset. But they all are slightly different. Lowering the timestep changes them all.

If any of you have the program installed, do me a favor and run the moonbuilder dataset using time steps of 16, 8, and 4 and see what you get after about 1 or 2 years.

I need to give that poor Athlon64 a break. It has the "Cool N Quiet" throttling, and I've been running it wide open, 100%, which throttles up to the full 2GHz, and the poor thing is getting hot.

-Richard

20. The principal issue with accuracy among orbital simulators is that computed position t+n is always based upon the previously computed position t+n-1, even when there is no action on the body except the force of gravity.

The more often you perform the computations, the greater the number of errors that are introduced.

A much more accurate approach would be to compute the position at t+n as calculated (not iteratively) from its original position, t0.

The only time the program should use the iterative approach would be during significant changes in acceleration, such as liftoffs, orbital insertion burns, etc.

21. But isn't that the thing? That there is no analytic solution to the n-body problem?

22. Originally Posted by mugaliens
The more often you perform the computations, the greater the number of errors that are introduced.
Well, actually, that is what I am working on now. Publius and HenrikOlsen are taking care of the speed part of it, although I apparently have no clue what vectorization and parallelization mean. I think I can show that the mathematics in the loops can't be reduced any further, except for maybe some slight number crunching, which is also about as good as it can get, so what they are doing must have some kind of benefit for the compiler being used or something, which I know nothing about. So I've started working on the accuracy of the program instead. I think if they could vectorize what I've got in post #59, though, the program would be even faster (I know, relentless).

Anyway, I still don't get how that RK4 thing should be applied to this, but I've managed to add a few lines to this program, something like what I did to my original program, that increase the accuracy so much that I can't even measure the error on a ten-digit calculator! I've been doing it by running through the loops one by one manually on paper, finding the results when starting at different points in the orbit, and working through the amount of error with the extra lines to try to place the body closer to where it is supposed to be in a natural way. It is working out very nicely so far, but I've still got one detail to work out when finding the new coordinates. There is a variable for the coordinates in the final loop that seems to depend on the angle: for example, at any multiple of 90 degrees it is 3, but at an angle of 45 degrees it is 2. And after I figure it out, I'll have to test it with various other orbits as well, so I've still got a little way to go. All in all, the extra lines may slow the loops down to about 1/2 or 1/3 of their present speed, but the accuracy is apparently at least millions of times greater, so the entire program could potentially be sped up thousands of times over and still gain thousands of times as much accuracy. I'll keep working on it.
Last edited by grav; 2007-Apr-22 at 06:27 PM.

23. Grav,

Parallelizing/vectorizing has to do with "doing more than one thing at the same time". When you're doing a loop, rather than going through one iteration at a time, you do two or more of the iterations simultaneously.

"Vector" instructions work like this. Using pseudo assembly code, every CPU has, say, a multiply instruction sort of like this:

mul A, B

Where A and B are registers (or memory locations, and combinations). That instruction multiplies A and B and stores the result back in A.

That is a "scalar" instruction. A vector version of that would let A and B be *arrays* or some other structure where you had multiple numbers, i.e.

mul A(), B(). It multiplies A(i)*B(i) all at once. If you think of the arrays as two vectors, the thing does a vector multiply, and that's where the term "vector" comes from.

Vectorizing is a "fine grained" type of parallelism, done at the instruction level of a single CPU. Now, another type of parallelism is having two or more CPUs (or even logical CPUs, as with hyperthreading, where a single physical CPU processes two or more instruction streams at the same time).

The Intel compiler (and MS is getting in the act as well) can exploit that. Consider a big double loop:

For I = 1 to N

For J = 1 to M

..............

EndFor
EndFor

The compiler will try to vectorize the J loop, and parallel the I loop. The program spawns several worker threads (creating a thread is a costly operation, so you spawn your threads once, then let them sleep until they are needed for work). When program execution comes to the I loop, the compiler inserts code to "fork" that loop amongst the worker threads.

In the ideal case, you then have your multiple CPUs doing pieces of the I loop at the same time, and each one then vectorizes chunks of the J loop.

That is the ideal, of course. In practice, there's a lot more to it.

-Richard

24. Originally Posted by mugaliens
The only time the program should use the iterative approach would be during significant changes in acceleration, such as liftoffs, orbital insertion burns, etc.
Well, you see, what Tony is doing is solving the general N-body problem by iteratively integrating the equation of motion for each of the N bodies.

The N body problem has no analytic solution.

The one I was seeing the differences in was a very chaotic problem. About 50 or so little moonlet chunks in earth orbit (along with the rest of the solar system, sun + planets), and you see what happens. Very chaotic.

-Richard

25. Well, give Intel's compiler a pat on the back. Tony just reported via e-mail that the last PGO version of the loop ran 3 times faster on his Pentium-4 machine.

Again, the PGO version was a feedback build using run-time statistics. Based on that, the compiler decided that it was inefficient to vectorize all but one of the loops that could be vectorized. And that run-time data also gives the compiler some other information about "how the data flows" so to speak.

The question I have is what are the differences on "chaotic" datasets. Does this fast version maybe diverge too much..............

-Richard

26. So it's kind of like having many different computers working on different values for the same calculation all at the same time and giving back the results in one lump array? That's pretty neat. It sounds like just what I need for an AI program I've been working on, if only I could do it for many different kinds of calculations all at once instead, but using the same values for the information inputs.

By the way, could you vectorize for post #59 and see how that compares to what you are using now?
Would I be able to do this for UBasic, do you think, or does it depend on the compiler?

27. Originally Posted by grav
So it's kind of like having many different computers working on different values for the same calculation all at the same time and giving back the results in one lump array? That's pretty neat. It sounds like just what I need for an AI program I've been working on, if only I could do it for many different kinds of calculations all at once instead, but using the same values for the information inputs.

By the way, could you vectorize for post #59 and see how that compares to what you are using now?
Would I be able to do this for UBasic, do you think, or does it depend on the compiler?
Grav,

#59 will not vectorize well -- we'd just end up doing what Henrik did again. But I'll look at the differences and see if anything might be incorporated. Since the compiler learned that only one of the loops paid off to vectorize, I will probably get rid of the array logic for the non-vector ones anyway. That will save memory at the very least.

And back to the general subject of parallelism. The "game" of compilers like Intel's is doing this automatically: you write code, and the compiler itself tries to figure out how to improve it. There is a big, big market for that. Some lament the "dumb programmer vs smart compiler" trend, but that's the way it is playing out.

To really do it, you write the code from the ground up to be parallel. It requires you to think about things differently.

The machine I'm typing on now, "ol' reliable", is a dual PIII machine, ABIT VP-6 mobo (I'm attached to it; this board suffered from the cheap capacitor problem, and I spent $100 for a buddy of mine with micro-soldering skills to replace the caps on it). I would love to parallel the code so I could get both of those CPUs working on the same problem at once (but it turns out that both of my 1GHz PIIIs going wide open are slower than the single Athlon64 at 2GHz!)

Dual cores, which put two CPUs on the same chip, are the thing now, so parallel programming has become mainstream, and I think Tony's program could benefit from this.

But to do it right requires coding it from the start to be parallel. For example, say we had 4 logical CPUs (not that outlandish with a CoreDuo with each core hyperthreading).

The way I would do it is to spawn some threads whose job is "calculate the acceleration of the ith body". Each thread would come in, fetch the distances, spit out that acceleration, and then go back and see if there are any more accelerations it needs to calculate. If not, it sleeps until there is more work.

The problem with that, as is, is that we would do much of the same work twice or more (calculating the d_ij distances). So we'd have to get clever and, using synchronization techniques, write a d_ij when we calculated it; subsequent threads could use it if it were there, and calculate it otherwise.

That's the way to do it, I think. But that would require Tony's main program to get in the act and lots of communication.

-Richard

28. Tony,

While I'm thinking about it, there are some other things we could do to speed things up a bit. The main thing is the function call with that long list of arguments. For each time step, the calling routine has to push all that crap on the stack, then the called function has to set up the frame for it, and all that. That takes a good bit of overhead.

What we can do is write a "LoopInit" routine in the .dll that gets passed the various pointers, then stores them locally in static variables. We can add a "change time step" routine as well to change the time step on the fly.

Then, all that would be involved is the function call itself during the loop running and that will save some overhead.

------------------------

Second, and this would be a much bigger change: I'd like to get the time loop itself into C code, and let that C code allocate the arrays and manage them itself. That would give the Intel compiler control over the layout, and let it better arrange things if possible.

The way I would do it is to spawn an independent thread to run the time loop -- just go wide open running it. We would then add a "peek" function in the main thread (the one your VB code is executing) that would give you the object position and velocity data. Heck, we could even add "throttling" logic that would sleep a while if the user didn't want to run at 100% CPU all the time -- say they wanted it to run in the background but wanted responsiveness with foreground tasks. We could easily do that.

So, on the one hand, we have a thread running the integration loop without having to worry about function calls, and on the other, the graphics as a secondary thing.

-Richard

29. Sorry I haven't had a chance to read all the posts in this thread yet.

I experimented a while back with the .dll, wrapping a "for i=1, 10" around the whole thing. I wanted to see how much time was lost calling the .dll with its long list of arguments. I figured that if it kept control for 10 iterations before passing it back to VB, there would be a significant increase in speed at the cost of making the program seem jumpy. But there wasn't. I think it only sped things up by about 1%, so I abandoned that idea. I should test it again. I like the idea of not passing all those arguments once per iteration.

On my computer, Gravity Simulator never uses more than about 50% of the CPU cycles, so other applications run smoothly in the background. I wish I knew a way to make it run closer to 100%, so that if I didn't care about running other applications, I could make the sim run faster. But I don't know how. I've been told on coding forums that it can't be done.

The vb that calls it looks something like this:

Code:
```
Do
    other stuff

    xxx = RunLoop(objects, ObjMass(0), objx(0), objy(0), objz(0), objvx(0), objvy(0), objvz(0), objsize(0), CollisionObjectA(0), CollisionObjectB(0))

Loop until Pause
```
where "other stuff" does things like
*update the distance boxes
*update the orbital element boxes
*check to see if it is time to execute an autopilot command
*check to see if it is time to output data to a file
*check to see if the user has paused the simulation
*apply thrust to an object if the user has chosen to do so

In the "Don't Plot" mode, it ignores a lot of these things.

Another area where I can gain about a 1% increase in speed is to not multiply mass by G inside the loop. Rather than store objects' masses in the objMass array, I would store their mu's (mu=GM).

Thanks for all the suggestions and work so far. Speed is the most critical thing in these types of programs (next to accuracy). I've put a new beta version on my web forum that uses your latest .dll.

30. Originally Posted by tony873004
On my computer, Gravity Simulator never uses more than about 50% of the CPU cycles, so other applications run smoothly in the background. I wish I knew a way to make it run closer to 100%, so that if I didn't care about running other applications, I could make the sim run faster. But I don't know how. I've been told on coding forums that it can't be done.
Tony,

That sounds like a Hyperthreader. Are you running XP? If you are, pop up Task Manager, click on the "Performance" tab, and under "CPU History" select "One graph per CPU". If you now see two CPU graphs, you are hyperthreading.

The main loop runs only in one thread, and so maxing out one CPU amounts to just 50% on a two CPU system. Now, you can turn off hyperthreading in the BIOS. I forget how it works with XP. There are big differences between a hyperthreader and real multiple CPUs (including dual core). MS made the SMP kernel aware of those differences in XP, but it did not in the old Win2K SMP kernel, and many times Win2K would be slower because of those differences.

Anyway, try turning off the Hyperthreading in the BIOS, then boot XP. If it doesn't boot, don't worry, just turn it back on. But if it does boot, compare the times. You should see 100% CPU when you've only got one logical CPU. And it may be faster, as you're not wasting half the instruction pipeline.

Anyway, on multiple CPUs, this is where threading would really shine, but it would take a substantial rewrite of everything to do it properly.

ETA: Well, it may not increase things much. "Waste half the pipeline" is not correct. When one logical CPU is in the idle state, the thing is supposed to be able to devote all resources to the other. Supposedly. That just doesn't show up in the Task Manager CPU thing, because it doesn't know anything about how the hardware is weighing the two streams.

-Richard
