Tag Archives: radeon

OpenGL Geometry Instancing


This article has been updated with new demos and new GI technique. Read the complete article here: OpenGL Geometry Instancing: GeForce GTX 480 vs Radeon HD 5870.


[French] Voici une petite démo qui utilise les techniques d’instancing (instancing simple, pseudo-instancing et geometry instancing(ou GI)) pour effectuer le rendu d’un anneau composé de 10000 petites sphères.
La démo est livrée en 5 versions:

  • chaque sphère est composée de 1800 triangles (18 millions de triangles pour l’anneau entier)
  • chaque sphère est composée de 800 triangles (8 millions de triangles pour l’anneau entier)
  • chaque sphère est composée de 200 triangles (2 millions de triangles pour l’anneau entier)
  • chaque sphère est composée de 72 triangles (720000 triangles pour l’anneau entier)
  • chaque sphère est composée de 18 triangles (180000 triangles pour l’anneau entier)

J’ai ajouté au dernier moment un extra: une version avec 20000 instances de 5000 triangles chacune soit 100 millions de polygones (fichier Demo_Instancing_100MTriangles_20kInstances.exe).
[/French] [English] This demo uses instancing techniques (simple instancing, pseudo-instancing and geometry instancing(or GI)) to render a ring made of 10,000 small spheres. The demo is delivered in 5 versions:

  • each sphere is made of 1,800 triangles (18 millions triangles for the whole ring)
  • each sphere is made of 800 triangles (8 millions triangles for the whole ring)
  • each sphere is made of 200 triangles (2 millions triangles for the whole ring)
  • each sphere is made of 72 triangles (720,000 triangles for the whole ring)
  • each sphere is made of 18 triangles (180,000 triangles for the whole ring)

I added in the last moment a bonus: a 20,000 instances version, each instance made of 5,000 triangles. We get the monstruous count of 100 millions triangles (file Demo_Instancing_100MTriangles_20kInstances.exe).
[/English]





DOWNLOAD

[French] Il y a plusieurs techniques d’instancing qui sont utilisées et chaque technique est accessible avec une des touches F1 à F6.

  • F1: instancing simple avec camera frustum culling: il y a une seule source de géométrie (un mesh) et elle est rendu pour chaque instance. Le calcul de la matrice de transformation est fait sur le CPU ainsi que le test de clipping avec la camera. Le rendu OpenGL utilise la fonction glDrawElements().
  • F2: instancing simple SANS camera frustum culling: il y a une seule source de géométrie (un mesh) et elle est rendu pour chaque instance. Le calcul de la matrice de transformation est fait sur le CPU mais il n’y a plus de test de clipping avec la camera. Le rendu OpenGL utilise la fonction glDrawElements().
  • F3: pseudo-instancing lent: il y a une seule source de géométrie (un mesh) et elle est rendu pour chaque instance. Le calcul de la matrice de transformation est maintenant effectué sur le GPU. Le passage des paramètres pour chaque instance se fait avec des variables uniformes. Il n’y a pas de test de clipping avec la camera. Le rendu OpenGL utilise la fonction glDrawElements().
  • F4: pseudo-instancing rapide: il y a une seule source de géométrie (un mesh) et elle est rendu pour chaque instance. Le calcul de la matrice de transformation est maintenant effectué sur le GPU. Le passage des paramètres pour chaque instance se fait avec des attributs de vertex persistants (comme les coordonnées de textures ou la couleur). C’est cette technique qui
    a été mise en avant par NVIDIA avec son whitepaper: GLSL Pseudo-Instancing. Il n’y a pas de test de clipping avec la camera. Le rendu OpenGL utilise la fonction glDrawElements().
  • F5: Geometry Instancing: c’est le vrai instancing hardware. Il y a une seule source de géométrie (un mesh) et le rendu se fait par lots (ou batchs) de 400 instances par draw call. Le rendu complet de l’anneau ne nécessite que 25 draw-calls au lieu de 10000. Le calcul de la matrice de transformation est effectué sur le GPU. Le passage des paramètres pour chaque batch se fait avec des tableaux de variables uniformes. Il n’y a pas de test de clipping avec la camera. Le rendu OpenGL utilise la fonction glDrawElementsInstancedEXT(). Actuellement, seules les cartes NVIDIA GeForce 8 (et sup.) supportent cette fonction.
  • F6: Geometry Instancing avec attributs de vertex persistants: c’est le geometry instancing hardware couplé avec le passage des paramètres par les attributs de vertex persistants. Mais le nombre d’attributs de vertex persistants est très limité. Au maximum j’ai reussi à rendre 4 instances par draw-call. Mais étrangement, 2 instances par draw-call donne de meilleurs résultats. Dans ce cas, le rendu complet de l’anneau nécessite que 5000 draw-calls au lieu des 10000. Le calcul de la matrice de transformation est effectué sur le GPU. Il n’y a pas de test de clipping avec la camera. Le rendu OpenGL utilise la fonction glDrawElementsInstancedEXT(). Actuellement, seules les cartes NVIDIA GeForce 8 (et sup.) supportent cette fonction.
[/French] [English] Several instancing techniques are used and you can select them with F1 to F6 keys.

  • F1: simple instancing with camera frustum culling: there is one source for geometry (a mesh) and it’s rendered for each instance. The tranformation matrix calculation is done on the CPU as well as the camera frustum test. OpenGL rendering uses the glDrawElements() function.
  • F2: simple instancing without camera frustum culling: there is one source for geometry (a mesh) and it’s rendered for each instance. The tranformation matrix calculation is done on the CPU but there is no longer camera frustum test. OpenGL rendering uses the glDrawElements() function.
  • F3: slow pseudo-instancing: there is one source for geometry (a mesh) and it’s rendered for each instance. Now the tranformation matrix calculation is done on the GPU and per-instance data are passed via uniform variables. There is no camera frustum test. OpenGL rendering uses the glDrawElements() function.
  • F4: pseudo-instancing: there is one source for geometry (a mesh) and it’s rendered for each instance. The tranformation matrix calculation is done on the GPU and per-instance data are passed via persistent vertex attributes (like texture coordinates or color). This technique has been shown by NVIDIA in the following whitepaper: GLSL Pseudo-Instancing. There is no camera frustum test. OpenGL rendering uses the glDrawElements() function.
  • F5: geometry instancing: it’s the real hardware instancing. There is one source for geometry (a mesh) and rendering is done by batchs of 400 instances per draw-call. The whole rendering of the ring requires 25 draw-calls instead of 10,000. The tranformation matrix calculation is done on the GPU and per-batch data is passed via uniform arrays. There is no camera frustum test. OpenGL rendering uses the glDrawElementsInstancedEXT() function. Currently, only NVIDIA GeForce 8 (and higher) support this function.
  • F6: geometry instancing with persistant vertex attributes: it’s the hardware instancing coupled with the transmission of parameters is done via the persistent vertex attributes. But the number of persistent vertex attributes is very limited. The best I did is to render 4 instances per draw-call. But oddly, I got the best results with 2 instances per draw-call. In that case, the rendering of whole ring requires 5000 draw-calls. The tranformation matrix calculation is done on the GPU and per-batch data is passed via uniform arrays. There is no camera frustum test. OpenGL rendering uses the glDrawElementsInstancedEXT() function. Currently, only NVIDIA GeForce 8 (and higher) support this function.

Ok now, let’s see some results with a NVIDIA GeForce 8800 GTX and an ATI Radeon HD 3870. Both cards have been tested with an AMD 64 3800+.
[/English]

18 millions triangles – 1800 tri/instance

NVIDIA GeForce 8800 GTX – Forceware 169.38 XP32

  • F1: 223MTris/sec – 13FPS
  • F2: 223MTris/sec – 13FPS
  • F3: 223MTris/sec – 13FPS
  • F4: 223MTris/sec – 13FPS
  • F5: 223MTris/sec – 13FPS
  • F6: 171MTris/sec – 10FPS

ATI Radeon HD 3870 – Catalyst 8.2 XP32

  • F1: 429MTris/sec – 25FPS
  • F2: 463MTris/sec – 27FPS
  • F3: 446MTris/sec – 26FPS
  • F4: 274MTris/sec – 16FPS
  • F5: mode not available
  • F6: mode not available

8 millions de triangles – 800 tri/instance

NVIDIA GeForce 8800 GTX – Forceware 169.38 XP32

  • F1: 190MTri/sec – 25FPS
  • F2: 190MTri/sec – 25FPS
  • F3: 205MTri/sec – 27FPS
  • F4: 213MTri/sec – 28FPS
  • F5: 205MTri/sec – 27FPS
  • F6: 152MTri/sec – 20FPS

ATI Radeon HD 3870 – Catalyst 8.2 XP32

  • F1: 251MTris/sec – 33FPS
  • F2: 236MTris/sec – 31FPS
  • F3: 297MTris/sec – 39FPS
  • F4: 251MTris/sec – 33FPS
  • F5: mode not available
  • F6: mode not available

2 millions de triangles – 200 tri/instance

NVIDIA GeForce 8800 GTX – Forceware 169.38 XP32

  • F1: 47MTri/sec – 25FPS
  • F2: 47MTri/sec – 25FPS
  • F3: 57MTri/sec – 30FPS
  • F4: 131MTri/sec – 69FPS
  • F5: 167MTri/sec – 88FPS
  • F6: 148MTri/sec – 78FPS

ATI Radeon HD 3870 – Catalyst 8.2 XP32

  • F1: 47MTris/sec – 25FPS
  • F2: 59MTris/sec – 31FPS
  • F3: 74MTris/sec – 39FPS
  • F4: 112MTris/sec – 59FPS
  • F5: mode not available
  • F6: mode not available

720,000 triangles – 72 tri/instance

NVIDIA GeForce 8800 GTX – Forceware 169.38 XP32

  • F1: 17MTri/sec – 25FPS
  • F2: 17MTri/sec – 25FPS
  • F3: 20MTri/sec – 30FPS
  • F4: 47MTri/sec – 69FPS
  • F5: 60MTri/sec – 88FPS
  • F6: 53MTri/sec – 78FPS

ATI Radeon HD 3870 – Catalyst 8.2 XP32

  • F1: 17MTris/sec – 25FPS
  • F2: 21MTris/sec – 31FPS
  • F3: 26MTris/sec – 39FPS
  • F4: 40MTris/sec – 59FPS
  • F5: mode not available
  • F6: mode not available

180000 triangles – 18 tri/instance

NVIDIA GeForce 8800 GTX – Forceware 169.38 XP32

  • F1: 4MTri/sec – 25FPS
  • F2: 4MTri/sec – 25FPS
  • F3: 5MTri/sec – 30FPS
  • F4: 11MTri/sec – 69FPS
  • F5: 15MTri/sec – 89FPS
  • F6: 13MTri/sec – 79FPS

ATI Radeon HD 3870 – Catalyst 8.2 XP32

  • F1: 4MTris/sec – 25FPS
  • F2: 5MTris/sec – 31FPS
  • F3: 6MTris/sec – 39FPS
  • F4: 10MTris/sec – 59FPS
  • F5: mode not available
  • F6: mode not available
[French] Analyse rapide des resultats:

  • nous comprenons maintenant pourquoi NVIDIA a appellé “Pseudo-Instancing” la technique utilisant les attributs persistants de vertex (key F4). La fonction glDrawElements() d’OpenGL est extremement rapide et optimisée et les attrubuts persistants de vertex nécessitent moins de traitement que les variables uniformes pour être passés au vertex shader. Les deux couplés ensemble donnent ce boost de performance.
  • le bénéfice du vrai hardware geometry instancing est principalement visible losqu’il y a peu de triangles par instance.
  • lorsqu’il y a beacoup de triangles par instance (1800), l’impémentation matérielle de glDrawElements() semble être plus efficace (près de deux fois!) sur le GPU RV670 que sur le G80.

Conclusion

Au vu des résultats, le hardware geometry instancing n’est pas la kill-feature que j’attendais. Je trouve cela très curieux car la différence entre 10000 render-calls avec glDrawElements et 25 render-calls avec glDrawElementsInstancedEXT n’est pas très importante. On dirait que la gestion de l’instancing (variable gl_InstanceID dans le vertex shader) fait perdre beaucoup de temps. Je trouve aussi dommage qu’ATI n’ait pas encore pris le temps d’implémenter le geomtry instancing dans les pilotes Catalyst. Je serais très curieux de tester le GI hardware avec un RV670.
[/French] [English] Quick results analysis:

  • we now understand why NVIDIA has called the technique using persistent vertex attributes “Pseudo-Instancing” (key F4). OpenGL glDrawElements() function is extremly fastand persistent vertex attributes require less overhead than uniforms to be passed to vertex shader. Both coupled together give this performance boost.
  • benefit of real hardware geometry instancing is mostly visible with few triangles per instance.
  • when there are many triangles per instance (1,800), the hardware implementation of glDrawElements() seems to be more efficient (twice!) on RV670 GPU than on G80.

Conclusion

From the results, hardware geometry instancing isn’t the kill-feature I expected. I find that very weird since the différence between 10000 render-calls with glDrawElements and 25 render-calls with glDrawElementsInstancedEXT is not verx important. Seems the instancing management (gl_InstanceID variable in the vertex shader) is a GPU-cycle eater!What a pity ATI hasn’t implemented yet geometry instancing in the Catalyst drivers. I’d be very curious to test hardware GI with a RV670.
[/English]

Les nouveaux Catalyst 7.11 à la sauce “Bug-Inside”

ATI vient de nous livrer les nouveaux Catalyst 7.11 pour nos belles Radeon. Mais on dirait que ça commence à être une habitude chez les petits gars d’ATI de nous pondre des pilotes bogués surtout pour les nouvelles cartes! Souvenez-vous des Catalyst 7.9 qui enfin corrigeaient un gros bug au niveau des shadow-maps et ce bug n’était visible que pour les Radeon 2k. Bien maintenant c’est la même chose avec les Cat7.11: ils sont bogués pour les Radeon 3k au niveau OpenGL: impossible de mettre plus d’une lumière dynamique dans les shaders GLSL! C’est quand même un sacré bug! Bon pour le moment je n’ai testé que sous WinXP donc peut etre que sous Vista c’est mieux.

A part ce bug (il y en a surement d’autres mais j’ai pas fait assez de tests pour le savoir), les Cat7.11 sont les premiers pilotes qui supportent les Radeon HD 3870. Le numéro interne des Cat7.11 est les 8.432.0.0.

Le téléchargement des Cat7.11 se passe ici:

WinXP 32-bit: [DOWNLOAD]
Vista 32-bit: [DOWNLOAD]

Catalyst 7.9 and Radeon 2K Shadow Mapping Bug

I found this bug while I was coding a new small soft shadows demo for GPU Caps Viewer. Soft shadows are built on shadow mapping and my OpenGL shadow mapping code works perfectly on all Geforce 6/7/8 and Radeon 1k but not on Radeon 2K (2400/2600/2900). Why ? Because of the shadow mapping comparison function that had a serious bug! To be short, the comparison function was supposed to return a boolean value (if shadow returns 0, else returns 1) and before Catalyst 7.9, this function returned, for Radeon 2K, the depth buffer value (as if the comparison function was disabled). But this bug is now a memory since Catalyst 7.9 has fixed it.

I guess we can say thanks to Quake Wars, that has been released few days ago and that is an OpenGL game. For this game (that is really nice), ATI has fixed all major OpenGL bugs.

R600 is VTF-capable

“All of the fetch and filtering capabilities are available to each thread type, making the samplers completely agnostic about what’s using them.”

This line from Beyond3D article on R600 means that vertex, geometry and pixel shaders can access to texture samplers. So Vertex Texture Fetching is now available with Radeon 2k series :thumbup: What’s more, the R600 can handle very large texture up to 8192×8192 just like the G80.

Uniform Arrays in GLSL

A new version of the Soft Shadows Benchmark is available but this time using uniform arrays to pass the blurring kernel to the pixel shader. On nVidia boards, there is a little increase of speed (1 or 2 fps). On my X700… black screen… Houston, we’ve got a problem… This is with Catalyst 6.6. Okay I try the very latest Catalyst, the 6.7. Bad idea, it’s worse! Both versions (with and without uniform arrays) do not work anymore with C6.7. Back to C6.6. That really sucks! :thumbdown:

But I’ve just received a feedback telling me that the uniform arrays version works fine on an ATI X1600 Pro with C6.5. :thumbup:

Okay, there is certainly a problem with the X*** series and uniform arrays.

ATI and Depth Map Filtering

I’ve just found in the super paper of ATI, called “ATI OpenGL Programming and Optimization Guide” that all ATI GPUs from the R300 (Radeon 9700) to the latest R580 (Radeon X1900) only support NEAREST (and the mipmap version) filtering for depth map. That explains the previous results. So if you want a nVidia-like depth map filtering, you have to code the filtering yourself in the pixel shader. Okay, this answer suits me!

NPOT Textures

It’s nice to come back to code!

I’m currently working on a new and simple framework for my OpenGL experimentations before implementing the algorithms in the oZone3D Engine . RaptorGL is a little bit too heavy for simple tests so for the moment I drop it. This new framework I called XPGL (eXPerimental Graphics Library), allows me to quickly test the new algos I’m working on. Every time I have to code a little but fully operational 3D demo in c++/opengl, I spend lot of time for a small result. In these moments, I say to myself that Hyperion is a very cool tool.

Okay, let’s see a weird behavour of radeon gpu. At the moment, my graphics controller is a Radeon X700. With the latest catalyst drivers (6.6), this graphics board should be an OpenGL 2.0 compliant CG. A little check to the GL_VERSION tells me the X700 is GL2 compliant. Then the X700 should handle non power of two texture since this feature is part of the OpenGL 2.0 core. But the GL_ARB_texture_non_power_of_two string is not found in the GL_EXTENSIONS. Maybe ATI does not mention the extensions that are part of the core. Anyways, I loaded a 600×445 npot-texture on a mesh plane and the X700 seems to support this texture. But with a ridiculous fps of 1… Software codepath? I think so! So I decided to load the same texture with power of two dims (512×512) and the fps is become decent again. With my gf6600gt (with the forceware 91.31) I never noticed this effect/bug because the GL2-support is better and nVidia gpus correctly support non power of two texture. You can download the demo with the npot and pot texture (the one mapped onto the mesh plane) hereafter and do the test for yourself. Feel free to drop me a feedback if you wish.

NPOT_Demo.zip (659k)

But keep in mind that graphics hardware is optimized for POT textures. Try to use POT textures in order to maximize your chances to see your demo running everywhere.

ATI X1900XTX and VTF

I’ve just received an email from an user saying that he was’nt able to run the demos of the Vertex Displacement Mapping Tutorial on his brand new Radeon X1900XTX. VTF or Vertex Texture Fetching is a cool feature of high-end graphics chipsets and it’s part of Shader Model 3.0. The X1900 series is based on the R500 chipset (R580) that is a SM3.0 complient GPU. But in OpenGL side and especially in GLSL, VTF is not supported. The OpenGL query done with GL_MAX_VERTEX_TEXTURE_IMAGE_UNITS_ARB always returns 0. That means that no texture units are available in the vertex shader.

ATI confirms this fact in one of its whitepapers shipped with the ATI SDK (ATI OpenGL Programming and Optimization Guide.pdf). At the page 11, we can read this: [i]”All ATI graphics HW have a few items that deserve special consideration when using GLSL. The first major item of note is the absence of vertex texture units. This means that vertex texturing is never available, and all shaders attempting to use texture functions in the vertex shader will fail to link.”[/i]. I know, this is a rude reality. The R580 GPU is really powerful and it’s a pity that ATI does not support VTF in his chipsets. I don’t know how the R580 behaves in D3D side but I can suppose the GPU has the same limitations. VTF is currently supported by Geforce chipset from 6600 to 7900. Conclusion: if you wish to play with VTF, use a nVidia board.

Maybe, all these problems will be solved with the SM4.0. I hope! :winkhappy: