 |
www.gpgpu.org General Purpose Computation on GPUs
|
| View previous topic :: View next topic |
| Author |
Message |
gziegler .

Joined: 05 May 2004 Posts: 21 Location: Saarbrücken
|
Posted: Fri Nov 24, 2006 11:01 am Post subject: GPU Data Compaction V2.0 (aka GPU PointList Generation) |
|
|
Hej !
I just finished my presentation at VMV2006 and have finally come around to creating some presentation slides on data compaction, below is the presentation text:
We have created a GPU algorithm which generates lists of selected elements from a larger 2D data array, without any data transfer from GPU to CPU.
It can be used for point cloud generation, intermediate result compaction, sparse matrix extraction, image processing and binning operations - both in 2D and 3D.
It is further able to create geometry on SM3.0 GPUs (see slides for a sketch).
Sourcecode is available on request.
You find it on http://www.mpi-inf.mpg.de/~gziegler.
I am glad for any discussion and new project that arises from it  _________________ /Gernot |
|
| Back to top |
|
 |
jowens .
Joined: 11 Aug 2003 Posts: 6 Location: Davis, California
|
Posted: Mon Nov 27, 2006 8:58 pm Post subject: |
|
|
Super. Thanks for posting! First, I do encourage you to make source code available, as that's quite useful for the community.
Can you compare your implementation, both in algorithm and perhaps in performance, to Shubho Sengupta's February 06 EDGE paper or to Alexander Greß's EG paper from this past summer? Sengupta did the first O(n) implementation on a 1D compaction and showed the hybrid work-efficient step-efficient approach was a performance win; Greß did the first 2D O(n) implementation of which I'm aware; yours seems pretty similar to what Greß did, but in comparing Sengupta/Greß, it's really just a question of which radix to choose, the fundamental approaches are similar. Both approaches are O(n) which is the important advance over Horn's 2005 work. Aaron Lefohn has some more details of our approach in his dissertation. Anyway, no one's really done a performance comparison, which I think is overdue, and perhaps something in which you are interested.
I think the future work point you mention about geometry shaders is definitely relevant. I'll be interested to see how they compare to the fragment-shader scan-based approach we've all pursued in the past. |
|
| Back to top |
|
 |
gziegler .

Joined: 05 May 2004 Posts: 21 Location: Saarbrücken
|
Posted: Mon Nov 27, 2006 9:42 pm Post subject: |
|
|
Oh, the sourcecode is already available - find it under the SVN repository of http://geocast.sf.net, it's called gpu_tools -> dcompact if i remember correctly. You and others can even contribute if you register with Sourceforge, I gladly add new users to this project - but I have nearly no time for it myself until the end of January.
Which exact paper by Alexander Greß did you mean ? The one on Collision Detection ? I have written a lengthy comparison to that one one time (they did not take the full step to extract lists, and pipe decision making through the vertex shader), I will post it here if it was exactly the paper I think it is.
I am aware of the EDGE paper (and did mention it in my VMV06 submission - the GH2006 submission didnt make it, darn ), but its cache behaviour is probably worse since it is 1D, and it is implemented in Brook, which makes me doubt if it is that optimal. They discuss that the last steps are redundant if the exact number of parallel units are known, but I didnt like that specialization and wanted to keep it optimal.
A rather new optimization I only realized a bit ago would be to use 16bit in the lower levels, and 32 bit for only the higher pyramid levels. But that means more complicated shaders on both the histopyramid and point list builder side.
Anyways, glad I made a contribution which caught interest  _________________ /Gernot
Last edited by gziegler on Mon Nov 27, 2006 10:39 pm; edited 1 time in total |
|
| Back to top |
|
 |
jowens .
Joined: 11 Aug 2003 Posts: 6 Location: Davis, California
|
Posted: Mon Nov 27, 2006 9:55 pm Post subject: |
|
|
The Greß paper is "GPU-based Collision Detection for Deformable Parameterized Surfaces". It's very nice work. I will definitely be interested to see your comparison of it to your own work.
Sengupta's work is not done in Brook. The paper indicates it's done in Cg and OpenGL. If references refer to it being done in Brook please do let me know as we'd like to correct that. The 1D cache efficiency is perhaps an issue, but the overall numbers were 4-5x over Horn's work for large arrays, and I think the performance was pretty solid overall. We can directly compare if you'd like to run, say, a reduction on a 2k by 2k array. We don't count readback time in our performance numbers (since we consume the data on the card, as do you, I imagine).
I'd also be careful of characterizing Brook implementations out of hand as suboptimal - the Stanford folks have done some amazing things, both implementation-wise and performance-wise, with Brook. |
|
| Back to top |
|
 |
gziegler .

Joined: 05 May 2004 Posts: 21 Location: Saarbrücken
|
Posted: Mon Nov 27, 2006 10:36 pm Post subject: |
|
|
Hej John !
You're correct, Sengupta's work is not done in Brook - I gobbled it up with Daniel's original approach, sorry for that.
Then my only point is on the 1D vs. 2D thing, and the question on if generating the last levels "at once" makes a difference. I think that caching and addressing is easier in 2D, and that it isn't worth to bother with treating the last levels differently, as it binds you to a certain number of ALUs.
On the Brook issue: I know that the guys are good at optimizing, but it is a platform-independent language after all - I used NVidia specific optimizations and hacks (packing all the mipmap levels into a rectangle texture, early branch exit, no ping-pong) which I don't think you can do there since it would break the platform-independency. It's the classic problem: the closer you get to the hardware, the more optimizations you can make - but the less portable the code becomes - and mine most definitely does NOT run on ATI hardware
----
Here my comments on Alexander Gress' paper:
> were you in the room when Alexander Gress presented their GPU-based
> collision detection for on-GPU deforming objects? The 2D stream
> compaction they used sounded very much like the pointlist approach,
> and they did a scatter step using the vertex program to avoid the
> log(n) steps required by a gather approach.
Oh, no I wasnt, because I only had a ticket for Graphics Hardware - I havenow printed out the article, and read through it:
Yes, they do use a 2D-Mipmap structure already, and calculate indices - what they havent considered is that float16 will not suffice for more than 4096 entries, and if they would use 32bit float textures, that the internal mimap building will probably be _very_ slow - a custom fragment shader for the HistoPyramid building is nearly compulsory.
Further, it seems as if their approach got stuck half-ways ... they do calculate the right target indices for the non-zero entries, but place them back into the _original_ base layout, instead of constructing a new, compacted texture (which I do through top-down traversal of the HistoPyramid). Instead they turn the whole base layout into one huge point cloud, and discard the points that are irrelevant, while placing the rest where it is supposed to end up (through the prepared index texture).
In summary, they solved one problem (generating the right output indices), but did not solve the massive discarding of useless points in the vertex shader, which inevitably happens if you try to generate the compacted output using "point primitive scattering" on the GPU
- instead, the paper implies that this is negligible effort, which I am sure it isn't if there are only few entries in a large texture (I had e.g. 33000 points in a 4096x4096 texture with my teapot dicer - their approach would generate 16 million point primitives).
In summary, I think that their solution does have its purpose, but is not easily applicable to the applications that I am aiming at.
----
So, hope that people still talk to me after this lot of claims
Please always correct me in any factual errors I make, I do change my views easily if convinced. _________________ /Gernot |
|
| Back to top |
|
 |
jowens .
Joined: 11 Aug 2003 Posts: 6 Location: Davis, California
|
Posted: Tue Nov 28, 2006 1:03 am Post subject: |
|
|
Caching can make a difference, this is true; this decision is often based on your input. You have a spatial 2D input, as did Greß, so a 2D formulation makes a lot of sense; our input was linear and 1D made more sense.
As far as "treating the last levels differently", I assume you are referring to the switch between work-efficient and step-efficient. It is true this must be tuned to the hardware (and for us on a 1k by 1k stream, so doing yielded a 4x speedup from step-efficient alone and a 2x speedup from work-efficient alone). I think we'll see more and more applications that need to be tuned for good performance, as we already do in the CPU world.
I will point Alexander at this thread and hope he will address the questions we have about his paper. Meanwhile, any approach has to do some sort of scatter emulation (as described in any of the tutorial "tips and tricks" talks); we did gather-search, you say he uses vertex scattering. Relative performance depends a lot on the characteristics of the input and the fraction of output elements, and since we all attacked very different problems, it makes sense that we may have all chosen different approaches to doing the scatter. |
|
| Back to top |
|
 |
gziegler .

Joined: 05 May 2004 Posts: 21 Location: Saarbrücken
|
Posted: Tue Nov 28, 2006 10:57 pm Post subject: |
|
|
"our input was linear and 1D made more sense. "
Even if so, it might still be more efficient to apply 2D structure to 1D arrays on <= NV40 architecture as long as the output order is irrelevant. BUT, and this is actually one of the problems of the 2D approach: You get a weird, kind-of-fractal output order, and not the well-known line-based traversal that a CPU would produce.
1D hierarchies will become a lot cheaper on G80, if I understood it correctly, since you now can access a texture as one continous 1D data array.
To calling the gather-search similar to scattering: I am not so sure about that - I would rather call it a problem of the output modality: If you want a compacted list, then gather-search is most efficient; in the other case (assigning the found indices to the points who still reside in the original input stream position), scattering is probably the way to go. But you can still use a log(n) approach there, propagating the index offsets in each HistoPyramid level from top-down, effectively reconstructing the intermediate data (skip information) which Daniel Horns algorithm produced (*). Depending on how quick scattering is (ATIs had one scattered float instead of 16 float), it might still prove useful.
(* seemingly aka push/pull, I had an indeep discussion with Martin Krause on that topic recently). _________________ /Gernot |
|
| Back to top |
|
 |
mark Site Admin

Joined: 08 Aug 2003 Posts: 996 Location: NVIDIA (Brisbane, Australia)
|
Posted: Fri Dec 01, 2006 3:12 pm Post subject: |
|
|
Gernot,
I just read your paper. It's pretty interesting.
I wanted to point out that your technique is equivalent to a gather-search method of scatter. The traversal you do takes O(log n) steps and each pixel has to search 4 sub-pixels, so it's O(log n) per output element, the same as binary search. Overall it's O(m log n), where m is the number of output elements. The advantage of 2D is that n in this case is sqrt(N), where N is the number of input elements.
The name "histogram pyramid" bugs me. The histogram in this case only has 2 buckets: on or off. The pyramid levels contain the the count of "on" elements in the corresponding quadrant of the level below. I'm not sure how this is a pyramid of histograms. I'd probably call it a "Scan Pyramid".
Mark |
|
| Back to top |
|
 |
gziegler .

Joined: 05 May 2004 Posts: 21 Location: Saarbrücken
|
Posted: Fri Dec 01, 2006 3:31 pm Post subject: |
|
|
Hej Mark,
thanks ! And thanks for clarifying this. Yes, I agree on the complexity - the 2D is currently an advantage as the architecture likes this a lot more. Actually, a 3D Octree (third root of N) or similar can be thought of, as well, as long as the output order is irrelevant. Its mostly a question of what the architecture likes best in terms of access patterns.
To the HistoPyramid: Yes, you are right - but the name stems from my original idea that you can actually build lists of multiple buckets at once, up to 16 on NV40 if you see what I aim at (darn, did I give too many hints now ? ) _________________ /Gernot |
|
| Back to top |
|
 |
trahern .
Joined: 02 Nov 2005 Posts: 23 Location: Prague, CZ
|
Posted: Wed Dec 06, 2006 12:02 pm Post subject: |
|
|
Hi,
great work. The quad-tree analysator was exatly what I needed right now for my master-thesis. I have implemented it successfuly ( with some minor changes ) and it works great, however I still have one question ( maybe a little bit off-topic one ).
When you are generating the histo-pyramid, you are writting/reading to the same texture and so you are using glFinish after each pass. But is this glFinish really necessary? I tried it without it and it worked well with noticable performance gain. Did you encountered any problems when not using glFinish() ? |
|
| Back to top |
|
 |
gziegler .

Joined: 05 May 2004 Posts: 21 Location: Saarbrücken
|
Posted: Fri Dec 08, 2006 9:56 pm Post subject: |
|
|
Ah yes, it is not needed anymore, that's was interesting - maybe the driver has changed - well, I wouldn't bet my horses on it, but I keep it in mind as another optimization hack
In other news: Chris Dyken from Norway has recently completed a marching cube implementation for NV40, partially based on the idea sketch for HistoPyramid-assisted geometry generation on page 22-23 in the VMV slides - let's see if we can beat Geometry Shaders on G80 hardware  _________________ /Gernot |
|
| Back to top |
|
 |
trahern .
Joined: 02 Nov 2005 Posts: 23 Location: Prague, CZ
|
Posted: Sat Dec 09, 2006 9:41 am Post subject: |
|
|
Well I have just encountered some problems with the read/write to the same texture. Its rather strange ( as it is probably expected when performing read/write on the same tex ). It can be summarized in the following steps:
I. Copy some data to the base level of the histo pyramid
- this is done by a simple copy shader and the histo pyramid texture is the output texture.
II. Perform reduction
- sets the histo pyramid texture as the input texture and leave it as the output texture
- ERROR: - at the first step of the reduction, the data on the base level are not updated yet so they are invalid
The above conditions are not complete as the problem occures only at same places ( but each time I run the simulation ). It can be easily solved by unbinding / rebinding the framebuffer with histo-texture before the reduction takes place.. |
|
| Back to top |
|
 |
gziegler .

Joined: 05 May 2004 Posts: 21 Location: Saarbrücken
|
Posted: Thu Dec 14, 2006 1:29 am Post subject: |
|
|
Interesting ... well, this is using some undefined behaviour anyways, so we can be glad it works at all
In other news, again: I have just received permission by Chris Dyken to provide the link to his result videos (and a summarizing presentation), see
http://heim.ifi.uio.no/~erikd/histo/
Here a teaser image from his work:
Marching cubes on an implicit 3D surface for 64x64x64, at 25 fps on a GeForce ... na nah ? ... 6800, of _course_  _________________ /Gernot |
|
| Back to top |
|
 |
KeenanCrane G

Joined: 08 May 2005 Posts: 67 Location: Urbana, IL
|
Posted: Thu Dec 14, 2006 3:42 am Post subject: |
|
|
Neat! This is really clever.
It also seems pretty simple to code -- I may try a comparison with geometry shaders unless you've done that already?
-Keenan |
|
| Back to top |
|
 |
cdyken .
Joined: 14 Dec 2006 Posts: 17 Location: SINTEF ICT, Norway
|
Posted: Thu Dec 14, 2006 8:30 am Post subject: |
|
|
| The G80 don't like my motherboard (CK804-based) under Linux, so I haven't been able to try any GS-stuff myself. So I would be really interested in a comparison to GS! |
|
| Back to top |
|
 |
|
|
You cannot post new topics in this forum You cannot reply to topics in this forum You cannot edit your posts in this forum You cannot delete your posts in this forum You cannot vote in polls in this forum
|
Powered by phpBB © 2001, 2005 phpBB Group
|