Posted 5 days ago
Bloom Filter

The bloom filter makes bright objects glow and bleed through other objects positioned between them and the camera. It is a common post-processing effect used all the time in video games and animated movies. The demo supports a couple of configuration options that control the intensity and behavior of the filter. Here are some screenshots with different settings:

Bloom filter Off
Bloom filter On, default settings
Bloom filter On, intensity increased

I particularly like the glow effect that this brings to the specular reflections on the water surface, although to really appreciate that you need to run the demo and see it in motion.

Cascaded Shadow Maps

I should really write a post about basic shadow mapping before going into the details of Cascaded Shadow Maps, so for now I’ll just focus on the problem they try to solve. One of the problems with shadow mapping is rendering high resolution shadows, especially for shadows that are rendered close to the camera. Generally, basic shadow mapping provides two ways in which we can improve the resolution of the shadows we render:

1. Increase the resolution of the shadow map textures. This one is obvious, but comes at a high performance (and memory) cost.
2. Reduce the distance at which we can render shadows. But this is not ideal, of course.

One compromise solution is to notice that, as usual with 3D computer graphics, it is far more important to render nearby objects in high quality than distant ones. Cascaded Shadow Maps allow us to use different levels of detail for shadows that are rendered at different distances from the camera. Instead of having a single shadow map for all the shadows, we split the viewing frustum into slices and render shadows in each slice to a different shadow map. There are two immediate benefits of this technique:

1. We have flexibility to define the resolution of the shadow maps for each level of the cascade, allowing us, for example, to increase the resolution of the levels closest to the camera and maybe reduce those that are further away.
2. Each level only records shadows in a slice of the viewing frustum, which increases shadow resolution even if we keep, for each shadow map level, the same texture resolution we used with the original shadow map implementation.

This approach also has some issues:

1. We need to render multiple shadow maps, which can be a serious performance hit depending on the resolutions of the shadow maps involved. This is why we usually lower the resolution of the shadow maps as distance from the camera increases.
2. As we move closer to or further from shadowed objects we can see the changes in shadow quality pop in. Of course, we can control this by avoiding drastic quality changes between consecutive levels in the cascade.

Here is an example that illustrates the second issue (in this case I have lowered the resolution of the 2nd and 3rd cascade levels to 50% and 25% respectively so the effect was more obvious). The screenshots show the rendering of the shadows at different distances. We can see how the shadows in the close-up shot are very sharp, and as the distance increases they become blurrier due to the use of a lower resolution shadow map:

CSM level 0 (4096×4096)
CSM level 1 (2048×2048)
CSM level 2 (1024×1024)

The demo supports up to 4 shadow map levels, although the default configuration is to use 3.
The resolution of each level can be configured separately too. In the default configuration I lowered the shadow resolution of the second and third levels to 75% and 50% respectively. If we configure the demo to run on a single level (with 100% texture resolution), we are back to the original shadow map implementation, so it is easy to experiment with both techniques. I intend to cover the details behind shadow mapping and the implementation of the bloom filter in more detail in a future post, so again, stay tuned for more!
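In the meantime, here is a rough sketch of how the frustum split distances for a cascade might be computed. This is not the demo's actual code: it uses the common "practical split scheme" that blends logarithmic and uniform splits, and the function and parameter names are invented for illustration:

/* Hypothetical sketch, not taken from the demo: compute the far plane
 * distance of each cascade level by blending a logarithmic and a
 * uniform split of the [near, far] range (assumes near > 0).
 * lambda in [0,1] biases the result towards the logarithmic (1.0)
 * or the uniform (0.0) scheme. */
#include <math.h>

void
compute_cascade_splits(float near, float far, int num_levels,
                       float lambda, float *splits)
{
   for (int i = 1; i <= num_levels; i++) {
      float s = (float) i / num_levels;
      float log_split = near * powf(far / near, s);
      float lin_split = near + (far - near) * s;
      splits[i - 1] = lambda * log_split + (1.0f - lambda) * lin_split;
   }
}

Each level then renders its shadow map with a light-space projection that tightly covers its slice of the frustum.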
Posted 6 days ago
The most visible change this week is a fix to Mesa for texture upload performance. A user had reported that selection rectangles in LXDE's file manager were really slow. I brought up sysprof, and it showed that we were doing uncached reads from the GPU in a situation that should have been entirely write-combined writes.

The bug was that when the size of the texture wasn't aligned to utiles, we were loading the previous contents into the temporary buffer before writing the new texture data in and uploading, even if the full texture was being updated. I've fixed it to check for when the full texture is being uploaded, and not do the initial download. This bug was getting hit on almost any Cairo vector graphics operation with its Xlib backend, so hopefully this helps a lot of people's desktops.

I also worked on a cleanup of the simulator mode. I use the closed source simulator regularly as part of my work -- it's fairly accurate to hardware behavior, and allows you to trace what's happening when things go wrong, all with the driver code running like "normal" for apps on my x86 desktop. However, my simulator support is a little invasive to the driver, replacing vc4 ioctl calls with alternative ioctls to the i915 driver. I started on a series to make vc4_simulator.c take plain VC4 ioctls and translate them, so that the simulator code is entirely contained. The last step I think I have is figuring out at which level to put the simulator's copying in and out of window system framebuffers.

Last, I got DEQP's GLES2 tests up and running on Raspberry Pi. These are approximately equivalent to the official conformance tests. The initial results were quite good -- 18630/19098 (97.5%) passing when run in the piglit framework. I found a couple of fixes to be made in glClear() support, one of which will affect all gallium drivers. Of the remainder, there are a few tests that are failing due to test bugs (we expose extensions that allow features the tests don't expect), but most are failing in register allocation. For the register allocation failures, I see a relatively quick fix that should reduce register pressure in loops.
Posted 6 days ago
For the last 3 years, I've been working in the OpenStack Telemetry team at eNovance, and then at Red Hat. Our mission is to maintain the OpenStack Telemetry stack, both upstream and downstream (i.e. inside Red Hat products). Besides the technical challenges, the organization of the team has always played a major role in our accomplishments. Here, I'd like to share some of this hindsight with you, faithful readers.

Meet the team

The team I work in changed a bit during those 3 years, but the core components have always been the same: a few software engineers, a QE engineer, a product owner, and an engineering manager. That means the team size has always been between 6 and 8 people. I cannot emphasize enough how important team size is. Not having more than 8 people in a team fits with the two pizzas rule from Jeff Bezos, which turned out to be key in our team composition. The group dynamic in teams no bigger than this is excellent. It offers the possibility to know and connect with everyone – each team member has only up to 7 people to talk to on a daily basis, which means only 28 communication axes between people (n people have n(n-1)/2 pairwise links, and 8×7/2 = 28). Having a team of e.g. 16 people means 16×15/2 = 120 different links in your team. Double your team size, and you multiply your communication overhead by 4. My experience shows that the fewer communication axes you have in a team, the less overhead you will have and the swifter your team will be.

All team members being remote workers, it is even more challenging to build relationships and bond. We had the opportunity to get to know each other during the OpenStack summit twice a year, and doing regular video-conferences via Google Hangout or BlueJeans really helped.

The atmosphere you set up in your team will also forge the outcome of your team's work. Run your team with trust, peace and humor (remember I'm on the team 🤣) and awesome things will happen. Run your team with fear, pressure, and finger-pointing, and nothing good will happen.

There's little chance that when a team is built, everyone will be on the same level. We were no exception: we had more and less experienced engineers. But the most experienced engineers took the time needed to invest in and mentor the less experienced ones. That also helped to build trust and communication links between members of the team. And over the long run, everyone gets more efficient: the less experienced engineers get better, and the more experienced ones can delegate a lot of stuff to their fellows. Then they can chill or work on bigger stuff. Win-win. It's actually no different from the way you should run an open source team, as I already claimed in a previous article on FOSS projects management.

Practicing agility

I might be bad at practicing agility: contrary to many people, I don't see agility as a set of processes. I see it as a state of mind, as a team organization based on empowerment. No more, no less. And each time I meet people and explain that our team is "agile", they start shivering, explaining how they hate sprints, daily stand-ups, scrum, and planning poker, and that this is all a waste of time and energy. Well, it turns out that you can be agile without all of that.

Planning poker

In our team, we tried at first to run 2-week sprints and used planning poker to schedule our user stories from our product backlog (= todo list). It never worked as expected. First, most people felt they were wasting their time because they already knew exactly what they were supposed to do.
If they had any doubt, they would just have gone and talked to the product owner or another fellow engineer. Secondly, some stories were really specialized and only one of the team members was able to understand them in detail and evaluate them. So most of the time, a lot of the team members playing planning poker would just vote a random number based on the length of the explanation of the storyteller. For example, if an engineer said "I just need to change that flag in the configuration file", then everyone would vote 1. If they started rambling for 5 minutes about "how the configuration option is easy to switch, but there might be other things to change at the same time, and things to check for impact bigger than expected, and code refactoring to do", then most people would just announce a score of 13 on that story. Just because the person talked for minutes straight and everything sounded complicated and out of their scope. That meant that the poker score had no meaning to us. We never managed to have a number of points that we knew we could accomplish during a sprint (the team velocity, as they call it). The only benefit that we identified from planning poker, in our case, is that it forces people to sit down and communicate about a user story. Though it turned out that making people communicate was not a problem we needed to solve in our team, so we decided to stop doing that. But it can be a pretty good tool to get people talking to each other.

Therefore, the 2-week sprint never made much sense, as we were unable to schedule our work reliably. Furthermore, doing most of our daily job in open source communities, we were unable to schedule anything. When sending patches to an upstream project, you have no clue when they will be reviewed. What you know for sure is that in order to maximize your code merge throughput with this high latency of code review, you need to parallelize your patch submissions a lot. So as soon as you receive some feedback from your reviewers, you need to (almost) drop everything, rework your code and resubmit it. There's no need to explain why this absolutely does not work with a sprint approach. Most of the scrum framework rests on the assumption that you own your workflow from top to bottom, which is far from true when working in open source communities.

Daily stand-up meetings

We used to run a daily stand-up meeting every day, then every other day. Doing that remotely kills the stand-up part, obviously, so there is less guarantee the meeting will be short. Considering all team members are working remotely in different time zones, with some freedom to organize their schedule, it was very difficult to synchronize those meetings. With members spread from the US to Eastern Europe, the meeting was in the middle of the afternoon for me. I found it frustrating to have to stop my activities in the middle of every afternoon to chat with my team. We all know what context switching costs us humans. So we drifted from our 10-minute daily meeting to a one-hour weekly meeting with the whole team. It's way easier to synchronize for a large chunk of time once a week and to have this high-throughput communication channel.

Our (own) agile framework

Drifting from the original scrum implementation, we ended up running our own agility framework. It turned out to have similarities with kanban – you don't always have to invent new things! Our main tool is a Trello board that we share with the whole team.
It consists of different columns, where we put cards representing small user stories or simple to-do items. Each column is the state of the current card, and we move cards left to right:

Ideas: where we put things we'd like to do or dig into, but with no urgency. It might lead to new, smaller ideas in the "To Do" column.
To Do: where we put real things we need to do. We might run a grooming session with our product manager if we need help prioritizing things, but it's usually not necessary.
Epic: here we create a few bigger cards that regroup several To Do items. We don't move them around, we just archive them when they are fully implemented. There are only 5-6 big cards here at most, which are the long term goals we work on.
Doing: where we move cards from To Do when we start doing them. At this stage, we also add the people working on the task to the card, so we see a little face for everyone involved.
Under review: 90% of our job being done upstream, we usually move cards that are done and waiting for feedback from the community to this column. When the patches are approved and the card is complete, we move the card to Done. If a patch needs further improvement, we move the card back to Doing and work on it, and then move it back to Under review when resubmitted.
On hold / blocked: some of the tasks we work on might be blocked by external factors. We move cards there to keep track of them.
Done during week #XX: we create a new list every Monday to stack our done cards by week. This is just easier to display, and it allows us to see the cards that we complete each week. We archive lists older than a month, from time to time. It gives great visual feedback on what has been accomplished and merged every week.

We started to automate some of our Trello workflow in a tool called Trelloha. For example, it allows us to track upstream patches sent through Gerrit or GitHub and tick the checkbox items in any card when those are merged. We actually don't put much effort into our Trello board. It's just a slightly organized chaos, as are upstream projects. We use it as a lightweight system for taking notes, organizing our thoughts and letting others know what we're doing and why we're doing it. That's where Trello is wonderful, because using it has very low friction: creating, updating and moving cards is a one-click operation.

One bias of most engineers is to overthink and over-engineer their workflow, trying to rationalize it. Most of the time, they end up automating, which means building processes and bureaucracy. It just slows things down and builds frustration in everyone. Just embrace chaos and spend time on what matters.

Most of the things we do are linked to external Launchpad bugs, Gerrit reviews or GitHub issues. That means the cards in Trello carry very little information, as everything happens outside, in the wild Internet of open source communities. This is very important, as we need to avoid any kind of retention of knowledge and information from contributors outside the company. This also makes sure that our internal way of running things does not leak outside and (badly) influence outside communities.

Retrospectives

We also run a retrospective every 2 weeks, which might be the only thing we kept from the scrum practice. It's actually a good opportunity for us to share our feelings, concerns or jokes. We used to do it using the six thinking hats method, but it slowly faded away.
In the end, we now use a different Trello board with these columns:

Good 😄
Hopes and Wishes 🎁
Puzzles and Challenges 🌊
To improve 😡
Action Items 🤘

All teammates fill the board with the cards they want, and everyone is free to add themselves to any card. We then run through each card and let the people who added their name to it talk about it. The "Action Items" column is usually filled as we speak and discover we should do things. We can then move cards created there to our regular board, in the To Do column.

Central communication

Sure, people have different roles in a team, but we dislike bottlenecks and single points of failure. Therefore, we are using an internal mailing list where we ask people to send their requests and messages. If people send things related to our team's job to one of us personally, we just forward it or Cc the list when replying, so everyone is aware of what one might be talking about with people external to the team. This is very important, as it emphasizes that no team member should be considered special. Nobody owns more information and knowledge than the others, and anybody can jump into a conversation if they have valuable knowledge to share. The same applies to our internal IRC channel.

We also make sure that we discuss only company-specific things on this list or on our internal IRC channel. Everything that can be public and is related to upstream is discussed on external communication media (IRC, upstream mailing lists, etc.). This is very important to make sure that we are not blocking anybody outside Red Hat from joining us and contributing to the projects or ideas we work on. We also want to make sure that people working in our company are no more special than other contributors.

Improvement

We're pretty happy with our setup right now, and the team has been running pretty smoothly for a few months. We're still trying to improve, and having a general sense of trust among team members makes sure we can openly speak about whatever problems we might have. Feel free to share your feedback and your own experience of running your own teams in the comment section.
Posted 11 days ago
Lots of individuals and companies have made substantial contributions to Apitrace. But maintenance has always rested with me, the original author. I would have preferred to share that responsibility with a wider team, but things haven't turned out that way. For several reasons, I suppose:

There are many people that care about one section of the functionality (one API, one OS), but few care about all of them.
For all existing and potential contributors, including me, Apitrace is merely a means to an end (test/debug graphics drivers or applications), it's not an end in itself. The other stuff always gets the top priority.
There are more polished tools for newer generation APIs like Vulkan, Metal, Direct3D 12. These newer APIs are much leaner than legacy APIs, which eliminates a lot of design constraints. And some of these tools have large teams behind them.
Last but not least, I failed to nurture such a community. I always kept close control, partly to avoid things becoming a hodgepodge, partly from fear of breakage, but I can't shake the feeling that if I had been more relaxed things might have turned out differently.

Apitrace has always been something I worked on in my spare time, or whenever I had an itch to scratch. That is still true, with the exception that after having a kid I have scarcely any free time left. Furthermore, the future is not bright: I believe Apitrace will have a long life in graphics driver test automation, and perhaps whenever somebody needs to debug an old OpenGL application, but I doubt it will flourish beyond that. And this fact weighs in whenever I need to decide whether to spend some time on Apitrace vs everything else.

The end result is that I haven't been a responsive maintainer for some time (taking a long time to merge patches, provide feedback, resolve issues, etc), and I'm afraid that will continue for the foreseeable future. I don't feel any obligation to do more (after all, the license does say the software is provided as is), but I do want to set the right expectations to avoid frustrating users/contributors who might otherwise expect timely feedback, hence this post.
Posted 12 days ago
In this post I’ll discuss how I set up and render the terrain mesh in the OpenGL terrain rendering demo. Most of the relevant code for this is in the ter-terrain.cpp file.

Setting up a grid of vertices

Unless you know how to use a 3D modeling program properly, a reasonable way to create a decent mesh for a terrain consists of using a grid of vertices and elevating them according to a height map image. In order to create the grid we only need to decide how many rows and columns we want. This, in the end, determines the number of polygons and the resolution of the terrain. We need to map these vertices to world coordinates too. We do that by defining a tile size, which is the distance between consecutive vertices in world units. Larger tile sizes increase the size of the terrain but lower the resolution by creating larger polygons. The image below shows an 8×6 grid that defines 35 tiles. Each tile is rendered using 2 triangles:

8×6 terrain grid

The next step is to elevate these vertices so we don’t end up with a boring flat surface. We do this by sampling the height map image for each vertex in the grid. A height map is a gray scale image where the values of the pixels represent altitudes at different positions. The closer the color is to white, the more elevated it is.

Heightmap image

Adding more vertices to the grid increases the number of sampling points from the height map and reduces the sampling distances, leading to a smoother and more precise representation of the height map in the resulting terrain.

Sampling the heightmap to compute vertex heights

Of course, we still need to map the height map samples (gray scale colors) to altitudes in world units. In the demo I do this by normalizing the color values to [-1,+1] and then applying a scale factor to compute the altitude values in world space. By playing with the scaling factor we can make our terrain look more or less abrupt.

Altitude scale=6.0
Altitude scale=12.0

For reference, the height map sampling is implemented in ter_terrain_set_heights_from_texture().
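As a rough illustration of the idea (this is not the demo's actual code; the function name and the simple nearest-pixel sampling are invented for the sketch), the per-vertex sampling could look like this:

/* Hypothetical sketch: sample a grayscale height map (W x H bytes)
 * for a vertex at (col, row) of a grid_w x grid_h grid (both >= 2),
 * normalize the color to [-1,+1] and scale it to world units. */
float
sample_height(const unsigned char *pixels, int img_w, int img_h,
              int col, int row, int grid_w, int grid_h, float scale)
{
   /* Map the grid vertex to a pixel in the image. */
   int px = col * (img_w - 1) / (grid_w - 1);
   int py = row * (img_h - 1) / (grid_h - 1);
   unsigned char gray = pixels[py * img_w + px];

   /* Normalize [0,255] to [-1,+1], then apply the altitude scale. */
   float h = (gray / 255.0f) * 2.0f - 1.0f;
   return h * scale;
}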
Creating the mesh

At this point we know the full position (x, y, z) in world coordinates of all the vertices in our grid. The next step is to build the actual triangle mesh that we will use to render the terrain and the normal vectors for each triangle. This process is described below and is implemented in the ter_terrain_build_mesh() function.

Computing normals

In order to get nice lighting on our terrain we need to compute normals for each vertex in the mesh. A simple way to achieve this would be to compute the normal for each face (triangle) and use that normal for each vertex in the triangle. This works, but it has 3 problems:

1. Every vertex of each triangle has the same exact normal, which leads to a rather flat result.
2. Adjacent triangles with different orientations showcase abrupt changes in the normal value, leading to significantly different lighting across the surfaces that highlight the individual triangles in the mesh.
3. Because each vertex in the mesh can have a different normal value for each triangle it participates in, we need to replicate the vertices when we render, which is not optimal.

Alternatively, we can compute the normals for each vertex considering the heights of its neighboring vertices. This solves all the problems mentioned above and leads to much better results thanks to the interpolation of the normal vectors across the triangles, which leads to smooth lighting reflection transitions:

Flat normals
Smooth normals

The implementation for this is in the function calculate_normal(), which takes the column and row indices of the vertex in the grid and calculates the Y coordinate by sampling the heights of the 4 nearby vertices in the grid.
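A minimal sketch of the idea, using central differences over the four neighbors; this is an assumption about one common way to do it, not the demo's actual calculate_normal() implementation, and all names are invented:

#include <math.h>

/* hl/hr are the heights of the left/right grid neighbors, hd/hu of
 * the down/up neighbors, tile_size is the grid spacing. The larger
 * the height differences relative to the tile size, the more the
 * normal tilts away from straight up. */
static void
vertex_normal(float hl, float hr, float hd, float hu, float tile_size,
              float n[3])
{
   float x = hl - hr;
   float y = 2.0f * tile_size;
   float z = hd - hu;
   float len = sqrtf(x * x + y * y + z * z);
   n[0] = x / len;
   n[1] = y / len;
   n[2] = z / len;
}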
Preparing the draw call

Now that we know the positions of all the vertices and their normal vectors, we have all the information that we need to render the terrain. We still have to decide how exactly we want to render all the polygons. The simplest way to render the terrain in a single draw call is to set up a vertex buffer with data for each triangle in the mesh (including position and normal information) and use GL_TRIANGLES for the primitive of the draw call. This, however, is not the best option from the point of view of performance. Because the terrain will typically contain a large number of vertices and most of them participate in multiple triangles, we end up uploading a large amount of vertex data to the GPU and processing a lot of vertices in the draw call. The result is large memory requirements and suboptimal performance. For reference, the terrain I used in the demo from my original post used a 251×251 grid. This grid represents 250×250 tiles, each one rendered as two triangles (6 vertices/tile), so we end up with 250x250x6=375,000 vertices. For each of these vertices we need to upload 24 bytes of vertex data with the position and normal, so we end up with a GPU buffer that is almost 9 MB large.

One obvious way to reduce this is to render the terrain using triangle strips. The problem with this is that, in theory, we can’t render the terrain with just one strip; we would need one strip (and so, one draw call) per tile column or one strip per tile row. Fortunately, we can use degenerate triangles to link the separate strips for each column into a single draw call. With this we trim down the number of vertices to 126,000 and the size of the buffer to a bit below 3 MB. This alone produced a 15%-20% performance increase in the demo. We can do better though. A lot of the vertices in the terrain mesh participate in various triangles across the large triangle strip in the draw call, so we can reduce memory requirements by using an index buffer to render the strip. If we do this, we trim things down to 63,000 vertices and ~1.5 MB. This added another 4%-5% performance bonus over the original implementation.

Clipping

So far we have been rendering the full mesh of the terrain in each frame, and we do this by uploading the vertex data to the GPU just once (for example in the first frame). However, depending on where the camera is located and where it is looking, just a fraction of the terrain may be visible. Although the GPU will discard all the geometry and fragments that fall outside the viewport, it still has to process each vertex in the vertex shader stage before it can clip non-visible triangles. Because the number of triangles in the terrain is large, this is suboptimal, and to address it we want to do CPU-side clipping before we render. Doing CPU-side clipping comes with some additional complexities though: it requires that we compute the visible region of the terrain and upload new vertex data to the GPU in each frame while preventing GPU stalls.

In the demo, we implement the clipping by computing a quad sub-region of the terrain that includes the visible area that we need to render. Once we know the sub-region that we want to render, we compute the new indices of the vertices that participate in the region so we can render it using a single triangle strip. Finally, we upload the new index data to the index buffer for use in the follow-up draw call.

Avoiding GPU stalls

Although all the above is correct, as described it actually leads to much worse performance in general. The reason for this is that our uploads of vertex data in each frame lead to frequent GPU stalls. This happens in two scenarios:

1. In the same frame, because we need to upload different vertex data for the rendering of the terrain for the shadow map and the scene (the shadow map renders the terrain from the point of view of the light, so the visible region of the terrain is different). This creates stalls because the rendering of the terrain for the shadow map might not have completed before we attempt to upload new data to the index buffer in order to render the terrain for the scene.
2. Between different frames, because the GPU might not be completely done rendering the previous frame (and thus still needs the index buffer data available) before we start preparing the next frame and attempt to upload new terrain index data for it.

In the case of the Intel Mesa driver, these GPU stalls can be easily identified by using the environment variable INTEL_DEBUG=perf. When using this, the driver will detect these situations and produce warnings informing about the stalls, the buffers affected and the regions of the buffers that generate the stall, such as:

Stalling on glBufferSubData(0, 503992) (492kb) to a busy (0-1007984) buffer object. Use glMapBufferRange() to avoid this.

The solution to this problem that I implemented (other than trying to put as much work as possible between read/write accesses to the index buffer) comes in two forms:

1. Circular buffers. In this case, we allocate a larger buffer than we need so that each subsequent upload of new index data happens in a separate sub-region of the allocated buffer. I set up the demo so that each circular buffer is large enough to hold the index data required for all updates of the index buffer happening in each frame (the shadow map and the scene).
2. Multi-buffering. We allocate more than one circular buffer. When we don’t have enough free space at the end of the current buffer to upload the new index buffer data, we upload it to a different circular buffer instead. When we run out of buffers we circle back to the first one (which at this point will hopefully be free to be re-used again).

So why not just use a single, very large circular buffer? Mostly because there are limits to the size of the buffers that the GPU may be able to handle correctly (or efficiently). Also, why not have many smaller independent buffers instead of circular buffers? That would work just fine, but using fewer, larger buffers reduces the number of objects we need to bind/unbind and is better for preventing memory fragmentation, so that’s a plus.
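To make the combined scheme concrete, here is a hypothetical sketch; the structure and names are invented, and the demo's actual implementation surely differs in detail:

#include <GL/gl.h>

/* Hypothetical circular index buffer: we append uploads until the
 * buffer is full, then move on to the next one. */
typedef struct {
   GLuint bo;          /* GL buffer object holding index data */
   unsigned size;      /* total size in bytes */
   unsigned offset;    /* next free byte in this buffer */
} circular_buffer;

/* Upload new index data, advancing within the current buffer or
 * wrapping around to the next one when the data doesn't fit
 * (assumes bytes <= size of each buffer). Returns the byte offset
 * to pass to the follow-up glDrawElements() call. */
unsigned
upload_indices(circular_buffer *bufs, int num_bufs, int *cur,
               const void *data, unsigned bytes)
{
   if (bufs[*cur].offset + bytes > bufs[*cur].size) {
      *cur = (*cur + 1) % num_bufs;   /* hopefully idle again by now */
      bufs[*cur].offset = 0;
   }
   glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, bufs[*cur].bo);
   glBufferSubData(GL_ELEMENT_ARRAY_BUFFER,
                   bufs[*cur].offset, bytes, data);
   unsigned draw_offset = bufs[*cur].offset;
   bufs[*cur].offset += bytes;
   return draw_offset;
}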
Final touches

We are almost done. At this point we only need to add a texture to the terrain surface, add a slight fog effect for distant pixels to create a more realistic look, add a skybox (it is important to choose the color of the fog so it matches the color of the sky!) and tweak the lighting parameters to get a nice result:

Final rendering

I hope to cover some of these aspects in future posts, so stay tuned for more!
Posted 13 days ago
Last week I started work on making hello_fft work with the vc4 driver loaded. hello_fft is a demo program for doing FFTs using the QPUs (the shader core in vc4). Instead of drawing primitives into a framebuffer like a GL pipeline does, though, it uses the User QPU pipeline, which just hands a uniform stream to a QPU shader instance.

Originally I didn't build a QPU user shader ioctl for vc4 because there's no good way to use this pipeline of the VC4. The hardware isn't quite capable enough to be exposed as OpenCL or GL compute shaders (some research had been done into this, and the designers' conclusion was that the memory access support wasn't quite good enough to be useful). That leaves you with writing VC4 shaders in raw assembly, which I don't recommend.

The other problem for vc4 exposing user shaders is that, since the GPU sits directly on main memory with no MMU in between, GPU shader programs could access anywhere in system memory. For 3D, there's no need (in GL 2.1) to do general write back to system memory, so the kernel reads your shader code and rejects shaders if they try to do it. It also makes sure that you bounds-check all reads through the uniforms or texture sampler. For QPU user shaders, though, the expected mode of using them is to have the VPM DMA units do general loads and stores, so we'd need new validation support.

If it was just this one demo, we might be willing to lose support for it in the transition to the open driver. However, some modes of accelerated video decode also use QPU shaders at some stage, so we have to have some sort of solution.

My plan is basically to not do validation and have root-only execution of QPU shaders for now. The firmware communication driver captures hello_fft's request to have the firmware pass QPU shaders to the V3D hardware, and it redirects it into VC4. VC4 then maintains the little queue of requests coming in, powers up the hardware, feeds them in, and collects the interrupts from the QPU shaders for when they're done (each job request is required to "mov interrupt, 1" at the end of its last shader to be run). Dom is now off experimenting in what it will take to redirect video decode's QPU usage onto this code.

The other little project last week was fixing Processing's performance on vc4. It's one of the few apps that is ported to the closed driver, so it's a likely thing to be compared on. Unfortunately, Processing was clearing its framebuffers really inefficiently. For non-tiled hardware it was mostly OK, with just a double-clear of the depth buffer, but for vc4 its repeated glClear()s (rather than a single glClear with all the buffers to be cleared set) were triggering flushes and reloads of the scene, and in one case a render of a full screen quad (and a full screen load before doing so) rather than a fast clear.

The solution was to improve the tracking of what buffers are cleared and whether any primitives have been drawn yet, so that I can coalesce sets of repeated or partial clears together while only updating the colors to be cleared. There's always a tension in this sort of optimization: should the GL driver be clever to work around apps behaving badly, or should it push that off to the app developer (more like Vulkan does)? In this case, tiled renderer behavior is sufficiently different from non-tiled renderers, and enough apps will hit this path, that it's well worth it.
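A hypothetical sketch of what such clear coalescing can look like in a driver; none of these names come from the actual vc4 code, and flush_scene() is an invented placeholder:

#include <stdbool.h>
#include <string.h>
#include <GL/gl.h>

struct clear_state {
   unsigned pending_buffers;   /* accumulated GL_*_BUFFER_BIT mask */
   float color[4];
   double depth;
   int stencil;
   bool primitives_drawn;      /* any draw call since the last clear? */
};

static void flush_scene(struct clear_state *s);   /* invented placeholder */

static void
driver_clear(struct clear_state *s, unsigned buffers,
             const float color[4], double depth, int stencil)
{
   /* If nothing has been drawn since the last clear, merge this clear
    * into the pending one instead of flushing the scene. */
   if (s->primitives_drawn) {
      flush_scene(s);
      s->pending_buffers = 0;
      s->primitives_drawn = false;
   }
   s->pending_buffers |= buffers;

   /* Only the most recent clear values matter for each buffer. */
   if (buffers & GL_COLOR_BUFFER_BIT)
      memcpy(s->color, color, sizeof(s->color));
   if (buffers & GL_DEPTH_BUFFER_BIT)
      s->depth = depth;
   if (buffers & GL_STENCIL_BUFFER_BIT)
      s->stencil = stencil;
}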
For Processing, I saw an improvement of about 10% on the demo I was looking at.

Sadly, Processing won't be a good comparison for open vs closed drivers even after these fixes. For the closed driver Processing uses EGL, which says that depth buffers become undefined on eglSwapBuffers(). In its default mode, though, Processing uses GLX, which says that the depth buffer is retained on glXSwapBuffers(). To avoid this extra overhead in its GLX mode, you'd need to use glDiscardFramebufferEXT() right before the swap. Unfortunately, this isn't connected in gallium yet, but it shouldn't be hard to do once we have an app that we can test.
Posted 18 days ago
I procrastinated rather badly on this one, so instead of this report landing around the previous kernel release, the v4.8 release is already out of the door. Read on for my slightly more terse catch-up report. Since I’m this late, I figured that instead of the usual comprehensive list I’ll do something new and just list some of the work that landed in 4.8, but with a bit more focus on the impact and why things have been done.

Midlayers, Be Gone!

The first thing I want to highlight is the driver de-midlayering. In the linux kernel community the mid-layer mistake or helper library design pattern (see the linked article from LWN) is a set of rules for designing subsystems and common support code for drivers. The underlying rule is that the driver itself must be in control of everything, like allocating memory and handling all requests. Common code is only shared in helper library functions, which the driver can call if they are suitable. The reason for that is that there is always some hardware which needs special treatment, and when you have a special case and there’s a midlayer, it will get in the way.

Due to the shared history with BSD kernels, DRM originally had a full-blown midlayer, but over time this has been fixed. For example, kernel modesetting was designed from the start with the helper library pattern. The last holdout is the device structure itself, and for the Intel driver this is now fixed.

This has two main benefits: First, we can get rid of a lot of pointer dereferencing in the compiled binaries. With the midlayer, DRM allocated a struct drm_device, and the Intel driver allocated its own, separate structure. Both are connected with pointers, and every time control transferred from driver-private functions to shared code those pointers had to be walked. With a helper approach the driver allocates the DRM device structure embedded into its own device structure. That way the pointer dereferencing just becomes a fixed adjustment offset of the original pointer. And fixed offsets can be baked into each access of individual member fields for free, resulting in a big reduction of compiled code.

The other benefit is that the Intel driver is now in full control of the driver load and unload sequence. The DRM midlayer functions for loading had a few driver callbacks, but for historical reasons at the wrong spots. And fixing that is impossible without rewriting the load code for all the drivers. Without the midlayer we can have as many steps in the load sequence as we want, and where we want them. The most important fix here is that the driver will now be initialized completely before any part of it is registered and visible to userspace (through the /dev node, sysfs or anywhere else).
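The embedding pattern itself is simple. Here is a hypothetical sketch of the idea with invented, generic structure names; the real drm_device and i915 structures are of course far more involved:

#include <stddef.h>

struct drm_device {
   /* ...common DRM state... */
   int dev_number;
};

struct my_driver_device {
   struct drm_device base;   /* embedded, not connected by a pointer */
   /* ...driver-private state... */
   int num_engines;
};

/* Converting the shared structure back to the driver's one is now a
 * compile-time constant offset instead of a pointer dereference. */
#define to_my_driver(dev) \
   ((struct my_driver_device *) \
    ((char *)(dev) - offsetof(struct my_driver_device, base)))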
Thundering Herds

GPUs process rendering asynchronously, and sometimes the CPU needs to wait for them. For this purpose there’s a wait queue in the driver. Userspace processes block on it until the interrupt handler wakes them up. The trouble is that thus far there was just one wait queue per engine, which means that every time the GPU completed something, all waiters had to be woken up. Then they checked whether the work they needed to wait for had completed, and if not, they blocked on the wait queue again until the next batch job completed. That’s all rather inefficient. On top of that, there’s only one per-engine knob to enable interrupts, which means that even if there was only one waiting process, it was woken for every completed job. And GPUs can have a lot of jobs in-flight.

In summary, waiting for the GPU worked more like a frantic herd trampling all over things instead of something orderly. To fix this, the request and completion tracking was entirely revamped, to make sure that the driver has a much better understanding of what’s going on. On top of that, there’s now also an efficient search structure of all currently waiting processes. With that, the interrupt handler can quickly check whether the just-completed GPU job is of interest, and if so, which exact process should be woken up.

But this wasn’t just done to make the driver more efficient. Better tracking of pending and completed GPU requests is an important foundation for implementing proper GPU scheduling on top of. And it’s also needed to interface the completion tracking with other drivers, to finally fix tearing on multi-GPU machines. Having a thundering herd in your own backyard is unsightly, letting it loose on your neighbours is downright bad! A lot of this follow-up work already landed for the 4.9 kernel, hence I will talk more about this in a future installment of this series.
Posted 19 days ago
No more “which is now the index of this modem…?”

DBus object path and index

When modems are detected by ModemManager and exposed in DBus, they are assigned a unique DBus object path, with a common prefix and a unique index number, e.g.:

/org/freedesktop/ModemManager1/Modem/0

This path is the one used by the mmcli command line tool to operate on a modem, so users can identify the device by the full path or just by the index, e.g. these two calls are totally equivalent:

$ mmcli -m /org/freedesktop/ModemManager1/Modem/0
$ mmcli -m 0

This logic looks good, except for the fact that there isn’t a fixed DBus object path for each modem detected: i.e. the index given to a device is the next one available, and if the device is power cycled or unplugged and replugged, a different index will be given to it.

EquipmentIdentifier

Systems like NetworkManager handle this index change gracefully, just by assuming that the exposed device isn’t the same one as the one exposed earlier with a different index. If settings need to be applied to a specific device, they will be stored associated with the EquipmentIdentifier property of the modem, which is the same across reboots (i.e. the IMEI for GSM/UMTS/LTE devices).

User-provided names

The 1.8 stable release of ModemManager will come with support for user-provided names assigned to devices. A use case of this new feature is, for example, those custom systems where the user would like to assign a name to a device based on the USB port to which it is connected (e.g. assuming the USB hardware layout doesn’t change across reboots). The user can specify the names (UID, unique IDs) just by tagging in udev the physical device that owns all ports of a modem with the new ID_MM_PHYSDEV_UID property. These tags need to be applied before the ID_MM_CANDIDATE properties, and therefore the rules file should be named before the 80-mm-candidate.rules one, for example like this:

$ cat /lib/udev/rules.d/78-mm-naming.rules
ACTION!="add|change|move", GOTO="mm_naming_rules_end"
DEVPATH=="/devices/pci0000:00/0000:00:1d.0/usb4/4-1/4-1.5/4-1.5.5",ENV{ID_MM_PHYSDEV_UID}="USB1"
DEVPATH=="/devices/pci0000:00/0000:00:1d.0/usb4/4-1/4-1.5/4-1.5.2",ENV{ID_MM_PHYSDEV_UID}="USB2"
DEVPATH=="/devices/pci0000:00/0000:00:1d.0/usb4/4-1/4-1.5/4-1.5.3",ENV{ID_MM_PHYSDEV_UID}="USB3"
DEVPATH=="/devices/pci0000:00/0000:00:1d.0/usb4/4-1/4-1.5/4-1.5.4",ENV{ID_MM_PHYSDEV_UID}="USB4"
LABEL="mm_naming_rules_end"

The value of the new ID_MM_PHYSDEV_UID property will be used in the Device property exposed in the DBus object, and can also be used directly in mmcli calls instead of the path or index, e.g.:

$ mmcli -m USB4
...
-------------------------
System | device: 'USB4'
       | drivers: 'qmi_wwan, qcserial'
       | plugin: 'Sierra'
       | primary port: 'cdc-wdm2'
...

Given that the same property value will always be set for the modem in a specific device path, these user-provided names can unequivocally identify a specific modem even when the device is power-cycled, unplugged and replugged, or even when the whole system is rebooted. Binding the property to the device path is just an example of what could be done. There is no restriction on the logic used to apply the ID_MM_PHYSDEV_UID property, so users may also choose other, different approaches. This support is already in ModemManager git master and, as already said, will be included in the stable 1.8 release, whenever that is.

TL;DR?
ModemManager now supports assigning unique names to devices, names that stay the same even across full system reboots.
Posted 20 days ago
I have been quite busy with various things in the last few weeks, but I have finally found some time to clean up and upload the code of the OpenGL terrain render demo to GitHub. Since this was intended as a programming exercise, I have not tried to be very elegant or correct during the implementation, so expect things like error handling to be a bit rough around the edges, but otherwise I think the code should be easy enough to follow.

Notice that I have only tested this on Intel GPUs. I know it works on NVIDIA too (thanks to Samuel and Chema for testing this), but there are a couple of rendering artifacts there, specifically at the edges of the skybox and some “pillars” showing up in the distance sometimes, probably because I am rendering one too many “rows” of the terrain and I end up rendering garbage. I may fix these some day.

The code I uploaded to the repository includes a few new features too:

Model variants, which are basically color variations of the same model
A couple of additional models (a new tree and plant) and a different rock type
Collision detection, which makes navigating the terrain more pleasant

Here is a new screenshot:

In future posts I will talk a bit about some of the implementation details, so having the source code around will be useful. Enjoy!
Posted 20 days ago
systemd.conf 2016 is Over Now!

A few days ago systemd.conf 2016 ended, our second conference of this kind. I personally enjoyed this conference a lot: the talks, the atmosphere, the audience, the organization, the location, they all were excellent!

I'd like to take the opportunity to thank everybody involved. In particular I'd like to thank Chris, Daniel, Sandra and Henrike for organizing the conference; your work was stellar! I'd also like to thank our sponsors, without whom the conference couldn't take place like this, of course. In particular I'd like to thank our gold sponsor, Red Hat, our organizing sponsor Kinvolk, as well as our silver sponsors CoreOS and Facebook. I'd also like to thank our bronze sponsors Collabora, OpenSUSE, Pantheon, Pengutronix, our supporting sponsor Codethink and last but not least our media sponsor Linux Magazin. Thank you all!

I'd also like to thank the Video Operation Center ("VOC") for their amazing work on live-streaming the conference and making all talks available on YouTube. It's amazing how efficient the VOC is, it's simply stunning! Thank you guys!

In case you missed this year's iteration of the conference, please have a look at our YouTube Channel. You'll find all of this year's talks there, as well as the ones from last year. (For example, my welcome talk is available here.) Enjoy!

We hope to see you again next year, for systemd.conf 2017 in Berlin!