@pavel
I cannot program GPUs and do not desire a mythological protagonistic role :D Take my boost tho
@pavel No, it's
RGRGRG
GBGBGB
You lose meaningful data if you ignore half of green pixels.
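To make the layout above concrete, here is a minimal sketch in plain Python. It assumes the RGGB arrangement shown above; the function name `debayer_2x2` and the tiny test frame are made up for illustration. It collapses each 2x2 cell into one RGB pixel and averages both green samples instead of dropping one. This is only the naive half-resolution approach, not a recipe for good picture quality.

```python
def debayer_2x2(raw, width, height):
    """Collapse each 2x2 RGGB cell into one RGB pixel (half resolution).

    raw is a flat, row-major list of sensor values; width and height
    are assumed to be even.
    """
    out = []
    for y in range(0, height, 2):
        row = []
        for x in range(0, width, 2):
            r = raw[y * width + x]              # even row, even column
            g1 = raw[y * width + x + 1]         # even row, odd column
            g2 = raw[(y + 1) * width + x]       # odd row, even column
            b = raw[(y + 1) * width + x + 1]    # odd row, odd column
            row.append((r, (g1 + g2) / 2, b))   # keep both greens
        out.append(row)
    return out


# Hypothetical 4x2 frame: two RGGB cells side by side.
frame = [10, 20, 30, 40,
         60, 90, 80, 70]
print(debayer_2x2(frame, 4, 2))
# -> [[(10, 40.0, 90), (30, 60.0, 70)]]
```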
I see no reason why it couldn't be done. Just take care not to introduce needless copies in your processing path. dmabufs are your friends.
@pavel Since I assume you're going to want to pass the rendered image into some kind of video encoder, you may want to make sure that you match stride and alignment requirements with your target buffer so etnaviv will be able to perform linear rendering rather than de-tile it afterwards (though IIRC it's currently gated behind ETNA_MESA_DEBUG).
@dos @pavel
adding to that, what data type is the image data (float, int, ???) and what data type is expected to come out?
instead of trying to outsource to the GPU, have you considered SIMD? (I assume librem5 and pinephone support NEON)
if the GPU is better suited, another question is whether there's support for compute shaders on the respective GPUs (what is the supported OpenGL version, assuming there is no Vulkan support on these devices)
@tizilogic @pavel It's either 8-bit int, or 10-bit int stored as 16-bit.
GC7000L supports compute shaders, but etnaviv isn't there yet.
Naive debayering is easy, but for good picture quality you need much more than that.
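For the "10-bit int stored as 16-bit" case mentioned above, a rough sketch of what unpacking could look like on the CPU side. Two things are assumptions here and not stated in the thread: that the words are little-endian, and that the sample sits in the low 10 bits. The function name `unpack_raw10_in_16` is hypothetical.

```python
import struct


def unpack_raw10_in_16(buf):
    """Turn a bytes object of 16-bit words into floats in [0, 1].

    Assumes little-endian words with the 10-bit sample in the low bits.
    """
    count = len(buf) // 2
    words = struct.unpack("<%dH" % count, buf[:count * 2])
    return [(w & 0x3FF) / 1023.0 for w in words]  # mask to 10 bits, normalize


# Three made-up samples: black, mid-grey, full scale.
sample = struct.pack("<3H", 0, 512, 1023)
values = unpack_raw10_in_16(sample)
print(values)
```

In a shader the same normalization would happen after sampling; the masking only matters if the upper six bits can carry garbage.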
@pavel do you have a single frame of raw pixel data? What is the target API (OpenGL, -ES, Vulkan)?
@pavel It would be great to have some actual frame data from the camera sensor, or some test data, that I can load into a texture and write a shader to do the conversion. With OpenGL-ES (which is what you have) the trick is to load the pixels into an RG texture that is twice as wide and half as high as the original frame, so that "upstairs"/"downstairs" neighbor pixels in consecutive rows are of the same primary color; this avoids issues with arithmetic and texel addressing precision.
This is an OpenGL ES 2.0 solution.
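One way to read the "twice as wide, half as high" trick, sketched in plain Python. This only models the geometry of the layout, not the RG channel packing, and the helper names `unzip_coords` and `bayer_color` are hypothetical. The point it tries to show is that reinterpreting the same row-major buffer with double the width puts each even source row next to the following odd row, so vertical neighbors in the new layout always share a color.

```python
def unzip_coords(x, y, width):
    """Map pixel (x, y) of a width-wide frame to the double-width layout."""
    return x + (y % 2) * width, y // 2


def bayer_color(x, y):
    """Color of pixel (x, y) in an RGGB mosaic."""
    return [["R", "G"], ["G", "B"]][y % 2][x % 2]


W, H = 4, 4  # tiny made-up frame size
# Double-width view: each of its rows holds one even and one odd source row.
view = [[None] * (2 * W) for _ in range(H // 2)]
for y in range(H):
    for x in range(W):
        ux, uy = unzip_coords(x, y, W)
        view[uy][ux] = bayer_color(x, y)

for row in view:
    print("".join(row))
# -> RGRGGBGB
# -> RGRGGBGB
```

Because `uy * 2 * W + ux` equals `y * W + x`, no data has to move: under this reading the buffer is simply uploaded with a different width and height.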
https://github.com/rasmus25/debayer-rpi
There's also support for a software isp in libcamera. I think I've seen some mentions of GPU backed debayering too.
@pavel I'm confused. V4L lets you stream to a CMA dmabuf which should be importable as GL_TEXTURE_EXTERNAL_OES, right? Or am I missing something?
@pavel On 9f076a5, I'm getting 88MB/s with one green channel, 82MB/s with two and 105MB/s with nothing but static gl_FragColor. The three copies it does could be eliminated and I believe texelFetch could make it slightly faster on the GPU side too.
@pavel Megapixels is not an example of how to do things in the most performant way :) OpenGL operates in a VRAM-centric model, it's very copy-heavy. We don't need to copy things around, as our GPUs operate on the exact same memory CPUs do.
See GL_OES_EGL_image_external and https://docs.kernel.org/userspace-api/media/v4l/dmabuf.html
@pavel After eliminating glReadPixels and having the output buffer mmaped instead: "18.9 MB in 0.08s = 244.4 MB/s"
After putting glTexImage2D out of the loop to emulate zero-copy import from V4L as well:
"18.9 MB in 0.05s = 400.1 MB/s"
@pavel Not only did you have copies into and out of the GLES context there, but these copies were sequential, and your benchmark waited until things were copied before proceeding with the next frame, so it was pretty much useless in assessing GPU performance. In practice, GStreamer can happily encode the previous frame while the GPU is busy with the current one, all while the CSI controller is already receiving the next one.
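A back-of-the-envelope model of why that matters, as a Python sketch. The stage times are invented for illustration and are not measurements from the thread; the function names are hypothetical too. When stages run one after another, a frame costs the sum of all stage times; when they overlap on different frames, throughput is bounded by the slowest stage only.

```python
def sequential_fps(stage_seconds):
    """Frames per second when every stage waits for the previous one."""
    return 1.0 / sum(stage_seconds)


def pipelined_fps(stage_seconds):
    """Frames per second when stages overlap on different frames."""
    return 1.0 / max(stage_seconds)


# Made-up per-frame stage times in seconds: capture, GPU, encode.
stages = [0.030, 0.010, 0.025]
print("sequential: %.1f fps" % sequential_fps(stages))
print("pipelined:  %.1f fps" % pipelined_fps(stages))
```

With these made-up numbers the pipelined case is limited by capture alone, which is why a benchmark that serializes everything says little about the GPU.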
@pavel Also, it gets faster when you increase the buffer size, because rendering is so fast you're mostly measuring API overhead 😁
With full 13MP frames: 315.1 MB in 0.62s = 511.3 MB/s
@pavel are you limited to OpenGL ES 2.0 or can you use a more modern version? ES-2.0 is very bare bones in its image format and shader capabilities, and efficiently converting 10 bpp will be a PITA due to the lack of a texelFetch function.
Anyway, spent the day finding a nice polynomial to linearize the sensor values (LUTs should be avoided if possible, memory access has latency and costs energy, if you can calculate in a few instr. prefer that).
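To illustrate the "calculate instead of look up" idea, a small Python sketch of evaluating a polynomial with Horner's scheme. The coefficients below are placeholders chosen so the curve maps 0 to 0 and 1 to 1; they are not the fitted polynomial from this message, which isn't shown in the thread.

```python
def horner(coeffs, x):
    """Evaluate a polynomial, highest-order coefficient first."""
    acc = 0.0
    for c in coeffs:
        acc = acc * x + c  # one multiply-add per coefficient
    return acc


# Placeholder cubic, NOT the real fit: 0.25*x^3 + 0.25*x^2 + 0.5*x
COEFFS = [0.25, 0.25, 0.5, 0.0]

for v in (0.0, 0.5, 1.0):
    print(v, horner(COEFFS, v))
# -> 0.0 0.0
# -> 0.5 0.34375
# -> 1.0 1.0
```

A cubic costs a few multiply-adds per sample and touches no memory, which is the trade-off being argued for over a LUT.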
@pavel @datenwolf Current Mesa can do a bunch of GLES3 stuff already, including texelFetch, once you force it with MESA_GLES_VERSION_OVERRIDE.
I still have some issues with the linearization LUT. But if you want to get the basic gist of how I approach the whole de-Bayering, here's the code.
@pavel I left the memcpy line commented out for a reason - with it uncommented, the result is exactly the same as with glReadPixels (which is effectively a memcpy on steroids). The point is to pass that buffer to the encoder directly, so it can read the data straight from the output buffer without waiting for memcpy to conclude.
I've also verified that the approach is sound by having the shader output different values each frame and accessing it via hexdump_pixels inside the loop. Still fast ;)
@pavel That said, rendering to a linear buffer can be slower, that's expected. The question is whether gains from passing buffers around for free are higher, which for an actual "record video from a camera" use case will almost certainly be true (and which has very different performance characteristics from reading images from files - you can't directly attach a file as a texture).
@pavel > I can't easily connect gstreamer to that
Why not? I quickly hacked up passing dma-bufs to GStreamer, and even though I'm glFinishing and busy-waiting on a frame to get encoded sequentially, it still manages to encode a 526x390 h264 stream in real time on L5.
@pavel Plugged it into V4L2 - with a caveat that for now I fed the GPU full-res 13MP frames to meet stride alignment requirement (the shader output is still 526x390). It says it does 240 frames in 10.55s. I wonder if it's really slightly too slow, or just bad timing from our camera stack :)
@pavel Seems it's the latter, as the result's exactly the same with 1052x780 camera frames and 263x195 video 😁
@pavel https://paste.debian.net/1384224/
It's ugly, hardcodes everything, lies on frame timing, occasionally segfaults. Most of it is copied straight from an LLM, I just massaged the pieces to work together. Not the kind of code I'd like to sign off on :) But it's a working example, so have fun with it.
@pavel The first thing to do to improve it (after cleaning it up) would be to actually make use of the buffer pool. Dequeue the buffer, attach it as a texture, kick off rendering, get a fence and pass it with the output buffer to GStreamer without waiting on rendering to finish, then queue it back asynchronously once rendering is done. This should allow for much more complex shaders than this sequential code does.
@pavel BTW. The fact that I could stream full-res frames and bin them down in the shader in real time is interesting news, as this may open up the possibility of using phase detection autofocus.
@pavel Good question. Not sure what license would be appropriate to put on something that's mostly an output of a model trained on code on all sorts of licenses anyway...
But given that it's just a bit of glue code between three APIs put together as an example, consider it to be under MIT-0 😜
@pavel (the parts that I added at least, there are parts of your code in there still)
@pavel There's plenty of low-hanging fruit in there. Higher frame rates and 10-bit output are also likely just a debugging session or two away 😜
@pavel LDLIBS = -lEGL -lGLESv2 -lm -ldrm -I/usr/include/libdrm -lgbm -lgstvideo-1.0 -lgstapp-1.0 -lgstallocators-1.0 -lgstreamer-1.0 -lgobject-2.0 -lglib-2.0 -I/usr/include/gstreamer-1.0 -I/usr/include/glib-2.0 -I/usr/lib/aarch64-linux-gnu/glib-2.0/include
@pavel There's a question whether it will be worth the elevated power consumption though. I've also stumbled upon csi erroring out with "Rx fifo overflow", requiring a reboot to recover, which I haven't seen at lower resolutions, but I haven't looked closer.
@pavel Toggling the killswitch makes it appear though.
IIRC PDAF was also usable at half-res.
RAW10 is just a matter of setting up clocks for higher bandwidth and more lanes. Switching data format is then just a single register away.
@pavel When I lie to GStreamer and tell it that its input is in YUY2, it gets faster - perhaps even fast enough to encode at 1052x780. That's another opportunity for improvement.
(and there's nothing magic about fences, it's just a simple synchronization primitive 😛)
@pavel Pretty sure it will just work fine once it's rewritten cleanly and does such arcane magic as releasing the buffers at the right time etc. :)
@pavel Yes, of course.
BTW. Turns out that streaming to YouTube instead of a local file is just a matter of using rtmpsink instead of filesink 😁
@pavel I'm playing with GStreamer now (which is new for me) and it seems like most of this code could be replaced with GStreamer elements, and the rest should neatly plug in as custom elements 😂
@pavel just FYI I wrote it for testing on desktop. It will require a few adjustments for mobile regarding setup of context and framebuffer.
Regarding the other code: addressing specific pixels with just the normalized-coordinate texture functions of GLSL-ES-1 is... hard. OpenGL(ES) doesn't put texel centers on coordinate values 0 and 1, but the outer edges. So you'll have to solve a fencepost problem and determine the fractional numbers that hit the texel centers for a given texture size.
@pavel the problem is that the precision mobile GPUs use for the addressing calculations from normalized coordinates often isn't sufficient for this to even work reliably. And some GPUs take some extra leeway. For addressing texels directly you need texelFetch, which is available only with ES-3. Hence my row unzipper trick to sidestep that problem.
@pavel another (minor) problem with the other code is that it uses a quad for drawing. This is bad: it causes all the blocks/tiles along the diagonal to be touched twice, since the GPU will split it into 2 triangles and the edge that isn't parallel and aligned with the processing blocks/tiles must be calculated in full for both faces. Mobile GPUs especially suffer a lot from that.
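The fencepost arithmetic, sketched in Python for clarity (in a real shader this would be GLSL; the helper names are made up). With edges at 0 and 1, the center of texel `i` in a texture `size` texels wide sits at `(i + 0.5) / size`.

```python
def texel_center(i, size):
    """Normalized coordinate of the center of texel i (edges are at 0 and 1)."""
    return (i + 0.5) / size


def texel_index(coord, size):
    """Texel that a normalized coordinate falls into (nearest filtering)."""
    return min(int(coord * size), size - 1)


print([texel_center(i, 4) for i in range(4)])
# -> [0.125, 0.375, 0.625, 0.875]
```

This round-trips cleanly in double precision on a CPU; whether it does so with the reduced precision of a given GPU is a separate question.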
When doing full screen stuff always just draw a single triangle that covers the whole viewport.
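A common choice for such a triangle is the clip-space vertices (-1,-1), (3,-1), (-1,3); that specific set is an assumption here, not something taken from the code discussed above. The Python sketch below just checks, with a plain edge-function test, that the whole [-1, 1] viewport square lies inside it.

```python
# Clip-space vertices of one oversized triangle (counter-clockwise).
TRIANGLE = [(-1.0, -1.0), (3.0, -1.0), (-1.0, 3.0)]


def edge(a, b, p):
    """Signed area test: >= 0 when p is on or left of the edge a -> b."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def inside(tri, p):
    """True when p lies inside or on the border of a CCW triangle."""
    a, b, c = tri
    return edge(a, b, p) >= 0 and edge(b, c, p) >= 0 and edge(c, a, p) >= 0


corners = [(-1.0, -1.0), (1.0, -1.0), (-1.0, 1.0), (1.0, 1.0)]
print(all(inside(TRIANGLE, c) for c in corners))
# -> True
```

The parts of the triangle outside the viewport are clipped, so no diagonal runs through the visible area.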
@pavel You've got a dma-buf handle, an already mapped buffer and even GStreamer with all its sinks available, so... however you want? Pretty much anything will be able to consume it easily.
@pavel Not sure what you mean. GStreamer is internally multi-threaded, but its API is thread-safe and there's only one thread in this code. Of course any kind of production-quality code will use some mainloop and enqueue buffers based on callbacks rather than a while(!processed){} loop, but it's not exactly rocket science.
@pavel That one line is the only thing that runs from another thread and it's neither scary nor requires any locking 😁
But there are several other smelly things in this code and lots of missing error handling, so I'd rather start with that when looking for suspects.
@pavel For GTK: either https://docs.gtk.org/gdk4/class.DmabufTextureBuilder.html or https://gstreamer.freedesktop.org/documentation/gtk4/index.html
For SDL with GL: just import it the same way V4L buffers are imported.
Frankly, it's flexible enough that your choice of toolkit should only depend on other factors.
@pavel Passing the right buffer size to gst_dmabuf_allocator_alloc helps it to not crash and not have garbage at the end of the frame 😂
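One common way a buffer size ends up wrong is forgetting row padding. Whether that was the cause here isn't stated, so treat the following Python sketch as a generic illustration: the 4 bytes per pixel and the 64-byte stride alignment are assumed values, and the helper names are hypothetical.

```python
def align_up(value, alignment):
    """Round value up to the next multiple of alignment."""
    return (value + alignment - 1) // alignment * alignment


def buffer_size(width, height, bytes_per_pixel, stride_alignment):
    """Return (stride, total size in bytes) for a padded linear buffer."""
    stride = align_up(width * bytes_per_pixel, stride_alignment)
    return stride, stride * height


# 526x390 output as in the thread; 4 bytes/pixel and 64-byte alignment are guesses.
stride, size = buffer_size(526, 390, 4, 64)
print(stride, size)
# -> 2112 823680
```

With these assumed values the unpadded size would be 526 * 4 * 390 = 820560 bytes, so passing that instead would leave the last rows outside the declared buffer.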