social.kernel.org

Pavel Machek

Edited 3 months ago

edit: Thanks to all the heroes who chimed in. In the meantime, I got help from an entity that shall not be named, and currently have something fast enough. And I have great human experts on the line, with patches to test. Thanks again! :-)

Can you program GPUs and do you want to become a HERO? #linuxphone
community needs your help.

We are trying to record video, and have most pieces working, but one is
missing: fast enough debayering. That means about 23MB/sec on #librem5.

Debayering is not hard; camera images have subpixels split on two
lines, which need to be corrected. They also use different color
representation, but that's fixable by some table lookup and two matrix
multiplies.

Librem 5 has Vivante GPU, 4 in-order CPU cores and 3GB RAM. My feeling
is that it should be fast enough for that. If task is for some reason
impossible, that would be good to know, too.

Image data looks like this

RGRGRG...
xBxBxB...
.........
.........

Task is to turn that into usual rgbrgb.... format. rgb = RGB * color
matrix, with table lookups for better quality. I can fix that once I
get an example.
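
For readers following along, here is a minimal CPU-side sketch of the naive conversion described above. It is only a reference for checking results, not the GPU code being asked for, and it would be far too slow for video. It assumes 8-bit samples, even width and height, and one output pixel per 2x2 cell (so the output is half the input resolution in each direction). The name `debayer_naive` is made up for this example, and the color matrix and table lookups are left out.

```python
def debayer_naive(raw: bytes, width: int, height: int) -> bytes:
    """Turn an RG/xB Bayer frame into packed rgb, one output pixel per 2x2 cell.

    The 'x' sample (the second green) is ignored, as in the layout above.
    """
    if width % 2 or height % 2:
        raise ValueError("width and height must be even")
    if len(raw) != width * height:
        raise ValueError("raw buffer size does not match width * height")
    out = bytearray()
    for y in range(0, height, 2):
        top = y * width          # row holding R G R G ...
        bottom = top + width     # row holding x B x B ...
        for x in range(0, width, 2):
            r = raw[top + x]
            g = raw[top + x + 1]
            b = raw[bottom + x + 1]
            out += bytes((r, g, b))
    return bytes(out)
```

A 4x2 input such as `bytes([10, 20, 11, 21, 0, 30, 0, 31])` yields two rgb pixels. A shader version would do the same three fetches per output pixel.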

I'm looking for example code (#pinephone would work, too), reasons it
can not be done... and boosts if you have friends that can program
GPUs. #gpu #opensource

Ozzelot

ozzelot@mstdn.social

3 months ago

Reply to @pavel

@pavel
I cannot program GPUs and do not desire a mythological protagonistic role :D Take my boost tho

Sebastian Krzyszkowiak

dos@librem.one

3 months ago

Reply to @pavel

@pavel No, it's

RGRGRG
GBGBGB

You lose meaningful data if you ignore half of green pixels.

I see no reason why it couldn't be done. Just take care not to introduce needless copies in your processing path. dmabufs are your friends.

Sebastian Krzyszkowiak

dos@librem.one

3 months ago

Reply to @dos@librem.one

@pavel Since I assume you're going to want to pass the rendered image into some kind of video encoder, you may want to make sure that you match stride and alignment requirements with your target buffer so etnaviv will be able to perform linear rendering rather than de-tile it afterwards (though IIRC it's currently gated behind ETNA_MESA_DEBUG).

tizilogic

tizilogic@mastodon.gamedev.place

3 months ago

Reply to @dos@librem.one

@dos @pavel
adding to that, what data type is the image data (float, int, ???) and what data type is expected to come out?

instead of trying to outsource to the GPU, have you considered SIMD? (I assume librem5 and pinephone support NEON)

if the GPU is better suited, another question is whether there's support for compute shaders on the respective GPUs (what is the supported OpenGL version, assuming there is no Vulkan support on these devices)

Sebastian Krzyszkowiak

dos@librem.one

3 months ago

Reply to @tizilogic@mastodon.gamedev.place

@tizilogic @pavel It's either 8-bit int, or 10-bit int stored as 16-bit.

GC7000L supports compute shaders, but etnaviv isn't there yet.

Naive debayering is easy, but for good picture quality you need much more than that.

datenwolf

datenwolf@chaos.social

3 months ago

Reply to @pavel

@pavel do you have a single frame of raw pixel data? What is the target API (OpenGL, -ES, Vulkan)?

Pavel Machek

pavel

3 months ago

Reply to @dos@librem.one

@dos Let's keep the example simple :-). Yes, g = (G1+G2)/2 is superior, and there are advanced debayer algorithms. I know them. Examples are at https://gitlab.com/tui/debayer-gpu/ . There's just one small problem: It takes a minute and I need it to take 10 seconds.
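
A small sketch of the "use both greens" variant mentioned here, for a single RGGB cell. The function name `cell_to_rgb` and the round-to-nearest choice are assumptions for illustration; the code in the linked repository may do this differently.

```python
def cell_to_rgb(r: int, g1: int, g2: int, b: int) -> tuple[int, int, int]:
    """Combine one RGGB 2x2 cell into a single rgb pixel, averaging both greens."""
    g = (g1 + g2 + 1) // 2   # the +1 rounds to nearest instead of always rounding down
    return r, g, b
```

The sum of two 8-bit greens needs 9 bits, so the addition has to happen in something wider than u8 before dividing.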

Pavel Machek

pavel

3 months ago

Reply to @dos@librem.one

@dos That's problem for future Pavel :-). Right now, I'm storing frames on ramdisk, as "RGB3" basically.

Pavel Machek

pavel

3 months ago

Reply to @tizilogic@mastodon.gamedev.place

@tizilogic @dos I tried simd, https://gitlab.com/tui/tui/-/blob/master/ucam/bayer2rgb.rs?ref_type=heads , it did not have good enough performance. (I could not do 512x384 at 23fps).

GL versions are:
Vendor: etnaviv
Renderer: Vivante GC7000 rev 6214
OpenGL Version: OpenGL ES 2.0 Mesa 21.2.6
GLSL Version: OpenGL ES GLSL ES 1.0.16

Doing input in u8, with output in u8 and internal computation in u16 fixed point should be "good enough". Doing everything in u16 would be even better. Floats are okay, too.
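
To make the "u8 in, fixed point inside, u8 out" idea concrete, here is a hedged sketch of a 3x3 color matrix applied in 8.8 fixed point. The matrix values are invented for the example and are not from any real sensor. Python integers do not overflow, so this does not model a strict u16 accumulator; a real implementation would have to check that intermediate values fit.

```python
def apply_color_matrix(rgb, matrix_q8):
    """Multiply one u8 rgb pixel by a 3x3 matrix stored as 8.8 fixed point.

    matrix_q8 holds integers where 256 means 1.0. Results are clamped to 0..255.
    """
    out = []
    for row in matrix_q8:
        acc = sum(coef * chan for coef, chan in zip(row, rgb))
        acc = (acc + 128) >> 8              # round, then drop the 8 fraction bits
        out.append(max(0, min(255, acc)))   # clamp back into the u8 range
    return tuple(out)


IDENTITY_Q8 = [[256, 0, 0], [0, 256, 0], [0, 0, 256]]
# Hypothetical matrix for illustration only: red = 1.5*r - 0.5*g.
EXAMPLE_Q8 = [[384, -128, 0], [0, 256, 0], [0, 0, 256]]
```

With `EXAMPLE_Q8`, the pixel `(100, 40, 7)` becomes `(130, 40, 7)`, and a saturated red input is clamped rather than wrapped.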

Pavel Machek

pavel

3 months ago

Reply to @dos@librem.one

@dos @tizilogic I know. And I have "the rest" prototyped here: https://gitlab.com/tui/debayer-gpu/-/blob/master/isp.frg?ref_type=heads But I feel I need fast-enough naive debayering first, so that I can improve that.

Pavel Machek

pavel

3 months ago

Reply to @datenwolf@chaos.social

@datenwolf Example of frame is here: https://gitlab.com/tui/tui/-/blob/master/ucam/bayer2rgb.rs?ref_type=heads (I also have frame generator and real frames captured from libobscura).

Anything that works on Librem 5 is fine, bonus points if I can understand it. Robot generated code using -lEGL -lGLESv2 -lm ... and that builds and does something. Librem 5 reports:

Vendor: etnaviv
Renderer: Vivante GC7000 rev 6214
OpenGL Version: OpenGL ES 2.0 Mesa 21.2.6
GLSL Version: OpenGL ES GLSL ES 1.0.16

datenwolf

datenwolf@chaos.social

3 months ago

Reply to @pavel

@pavel It would be great to have some actual frame data from the camera sensor, or some test data, that I can load into a texture and write a shader to do the conversion. With OpenGL-ES (which is what you have) the trick is to load the pixels into a RG texture that is twice as wide and half as high as the original frame, so that "upstairs"/"downstairs" neighbor pixels in consecutive row are of the same primitive color; this avoids issues with arithmetic and texel addressing precision.
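
One reading of the "twice as wide, half as high" trick, sketched on the CPU: the bytes in memory stay exactly the same, only the row length used to address them changes, so each pair of Bayer rows becomes one long row. The helper names are hypothetical, and how samples are packed into texture channels is not modelled here.

```python
def unzipped_coords(x: int, y: int, width: int) -> tuple[int, int]:
    """Map sample (x, y) of a width-wide Bayer frame to its position when the
    same bytes are viewed as a frame twice as wide and half as high."""
    return x + (y % 2) * width, y // 2


def sample(raw: bytes, x: int, y: int, row_len: int) -> int:
    """Read one sample from a row-major buffer with the given row length."""
    return raw[y * row_len + x]
```

In the reinterpreted view, the sample directly above or below any position comes from two rows away in the original frame, so it has the same color.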

Pavel Machek

pavel

3 months ago

Reply to @datenwolf@chaos.social

@datenwolf Sorry. Example frame is here: https://gitlab.com/tui/debayer-gpu/-/blob/master/test.png?ref_type=heads (You probably want to run pngtopnm it, so that your code only works with uncompressed data).

Alternatively, I started file format for this. https://gitlab.com/tui/tui/-/tree/master/4cc?ref_type=heads dirgen.sh can generate example frames using gstreamer. You get raw data after 128 bytes header.

Pavel Machek

pavel

3 months ago

Reply to @datenwolf@chaos.social

@datenwolf This is probably easiest to use (after pngtopnm): https://gitlab.com/tui/debayer-gpu/-/blob/master/test.png?ref_type=heads .

Pavel Machek

pavel

3 months ago

Reply to @dos@librem.one

@dos As for copies... Yes, I'm currently doing more copies than needed. I measured Librem 5 at about 2GB/sec memory bandwidth, and stream is about 30MB/sec. At 1Mpix/24fps resolution, gstreamer should be able to encode it in real time.

There's a huge problem with v4l, which gives uncached memory buffers to userspace. That means one whole CPU core is dedicated to copying that to "normal" memory. If that is ever solved, yes, other optimizations are possible. Currently, this means it is not even possible to copy anything bigger than 1Mpix out of the v4l.

memcpy_io

robertfoss@mastodon.social

3 months ago

Reply to @pavel

Edited 3 months ago

@pavel

This is an OpenGL ES 2.0 solution.
https://github.com/rasmus25/debayer-rpi

There's also support for a software isp in libcamera. I think I've seen some mentions of GPU backed debayering too.

Pavel Machek

pavel

3 months ago

Reply to @robertfoss@mastodon.social

@robertfoss I know about that one, see gitlab.com:tui/debayer-gpu.git . I could not get that to anywhere near the required performance.

Pavel Machek

pavel

3 months ago

Reply to @dos@librem.one

@dos It seems that ignoring half of green pixels is the right thing to do at the moment: https://gitlab.com/tui/debayer-gpu/-/tree/master/bwtest?ref_type=heads "Normal" debayer is 40% too slow. (That's better than 5 times too slow, but still not good enough). If you can get it to 24 loops in a second, you'll become a hero :-).

Sebastian Krzyszkowiak

dos@librem.one

3 months ago

Reply to @pavel

@pavel I'm confused. V4L lets you stream to a CMA dmabuf which should be importable as GL_TEXTURE_EXTERNAL_OES, right? Or am I missing something?

Pavel Machek

pavel

3 months ago

Reply to @dos@librem.one

@dos If you have example of that, that would be welcome :-). That's not how megapixels work, at least.

Sebastian Krzyszkowiak

dos@librem.one

3 months ago

Reply to @pavel

@pavel On 9f076a5, I'm getting 88MB/s with one green channel, 82MB/s with two and 105MB/s with nothing but static gl_FragColor. The three copies it does could be eliminated and I believe texelFetch could make it slightly faster on the GPU side too.

Sebastian Krzyszkowiak

dos@librem.one

3 months ago

Reply to @pavel

@pavel Megapixels is not an example of how to do things in the most performant way :) OpenGL operates in a VRAM-centric model, it's very copy-heavy. We don't need to copy things around, as our GPUs operate on the exact same memory CPUs do.

See GL_OES_EGL_image_external and https://docs.kernel.org/userspace-api/media/v4l/dmabuf.html

Pavel Machek

pavel

3 months ago

Reply to @dos@librem.one

@dos Sorry, hero, that's dark magic beyond my understanding. I see the words but don't understand the sentences. :-(

I'd need working example here. I got surprisingly far vibecoding this, but even robots have their limits.

Pavel Machek

pavel

3 months ago

Reply to @dos@librem.one

Yep, recent bwtest shows that (extremely simple) debayer is feasible, and possibly more. So far I integrated debayer-gpu + gstreamer, and I'm meeting the deadlines.

Sebastian Krzyszkowiak

dos@librem.one

3 months ago

Reply to @pavel

@pavel After eliminating glReadPixels and having the output buffer mmaped instead: "18.9 MB in 0.08s = 244.4 MB/s"

After putting glTexImage2D out of the loop to emulate zero-copy import from V4L as well:
"18.9 MB in 0.05s = 400.1 MB/s"

https://dosowisko.net/stuff/bwtest.patch

Sebastian Krzyszkowiak

dos@librem.one

3 months ago

Reply to @dos@librem.one

@pavel Not only did you have copies into and out of the GLES context there, but these copies were sequential - and your benchmark waited until things were copied before proceeding with the next frame, so it was pretty much useless in assessing GPU performance. In practice, GStreamer can happily encode the previous frame while the GPU is busy with the current one, all while the CSI controller is already receiving the next one.

Sebastian Krzyszkowiak

dos@librem.one

3 months ago

Reply to @dos@librem.one

@pavel Also, it gets faster when you increase the buffer size, because rendering is so fast you're mostly measuring API overhead 😁

With full 13MP frames: 315.1 MB in 0.62s = 511.3 MB/s

datenwolf

datenwolf@chaos.social

3 months ago

Reply to @pavel

@pavel are you limited to OpenGL ES 2.0 or can you use a more modern version? ES-2.0 is very bare bones in its image format and shader capabilities and efficiently converting 10 bpp will be a PITA, due to lack of texelFetch function.
Anyway, spent the day finding a nice polynomial to linearize the sensor values (LUTs should be avoided if possible, memory access has latency and costs energy, if you can calculate in a few instr. prefer that).
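
As an illustration of "calculate instead of look up", a polynomial can be evaluated with Horner's scheme, which costs one multiply-add per coefficient and no memory access. The coefficients below are invented for the example and are not fitted to any real sensor; the polynomial from the post is not shown in the thread.

```python
def linearize_poly(v: float, coeffs) -> float:
    """Evaluate c0 + c1*v + c2*v^2 + ... using Horner's scheme."""
    acc = 0.0
    for c in reversed(coeffs):
        acc = acc * v + c
    return acc


# Hypothetical coefficients, NOT from a real sensor: 0.5*v + 0.5*v^2
COEFFS = (0.0, 0.5, 0.5)
```

In a fragment shader the same loop would be unrolled into a short chain of multiply-adds on the normalized sample value.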

Pavel Machek

pavel

3 months ago

Reply to @datenwolf@chaos.social

@datenwolf So linearization should be doable with functions, too. Small trouble is that it is sensor-dependent but ... we have enough mathematical tools to deal with that. Actual functions can be seen here: https://blog.brixit.nl/fixing-the-megapixels-sensor-linearization/

Pavel Machek

pavel

3 months ago

Reply to @datenwolf@chaos.social

@datenwolf I believe I'm limited to OpenGL ES 2.0. Presumably hardware can do more but our current drivers can not, so we are stuck there.

On the other hand... #librem5 main sensor can not do 10bpp at the moment, due to missing drivers. So maybe we can focus on 8bpp, first. Probably inefficient conversion is "good enough" too, as GPU is a bit overpowered for this job.

Sebastian Krzyszkowiak

dos@librem.one

3 months ago

Reply to @pavel

@pavel @datenwolf Current Mesa can do bunch of GLES3 stuff already, including texelFetch, once you force it with MESA_GLES_VERSION_OVERRIDE.

Pavel Machek

pavel

3 months ago

Reply to @dos@librem.one

@dos Thanks. That really looks like black magic on the first look. On second look, maybe it is not that bad. Let me take another look tomorrow.

datenwolf

datenwolf@chaos.social

3 months ago

Reply to @pavel

@pavel

I still have some issues with the linearization LUT. But if you want to get the basic gist of how I approach the whole de-Bayering, here's the code.

https://git.datenwolf.net/glsldebayer/tree/?h=dev_dw

Pavel Machek

pavel

3 months ago

Reply to @dos@librem.one

@dos Thanks for a patch. And yes, it makes the loop faster.. if you don't actually use the data. When used for loading/saving 720 images from the ramdisk, speed went from ~16 sec to ~21 sec.

Sebastian Krzyszkowiak

dos@librem.one

3 months ago

Reply to @pavel

@pavel I left the memcpy line commented out for a reason - with it uncommented, the result is exactly the same as with glReadPixels (which is effectively a memcpy on steroids). The point is to pass that buffer to the encoder directly, so it can read the data straight from the output buffer without waiting for memcpy to conclude.

I've also verified that the approach is sound by having the shader output different values each frame and accessing it via hexdump_pixels inside the loop. Still fast ;)

Sebastian Krzyszkowiak

dos@librem.one

3 months ago

Reply to @dos@librem.one

@pavel That said, rendering to a linear buffer can be slower, that's expected. The question is whether gains from passing buffers around for free are higher, which for an actual "record video from a camera" use case will almost certainly be true (and which has very different performance characteristics from reading images from files - you can't directly attach a file as a texture).

Pavel Machek

pavel

3 months ago

Reply to @dos@librem.one

@dos But you only hexdumped first few pixels, right?

Is that buffer uncached or something?

I pushed current code to https://gitlab.com/tui/debayer-gpu .

Yes, with memcpy(), I'm getting same results as before. If I get rid of the memcpy(), and attempt to fwrite() the buffer directly, things actually slow down.

I can't easily connect gstreamer to that, I'm going through ramdisk for now. I'm using time ./ocam.py debayer for testing -- https://gitlab.com/tui/tui/-/blob/master/ucam/ocam.py?ref_type=heads

Sebastian Krzyszkowiak

dos@librem.one

3 months ago

Reply to @pavel

@pavel > I can't easily connect gstreamer to that

Why not? I quickly hacked up passing dma-bufs to GStreamer and even though I'm glFinishing and busy-waiting on a frame to get encoded sequentially it still manages to encode a 526x390 h264 stream in real time on L5.

Sebastian Krzyszkowiak

dos@librem.one

3 months ago

Reply to @dos@librem.one

@pavel Plugged it into V4L2 - with a caveat that for now I fed the GPU full-res 13MP frames to meet stride alignment requirement (the shader output is still 526x390). It says it does 240 frames in 10.55s. I wonder if it's really slightly too slow, or just bad timing from our camera stack :)

Pavel Machek

pavel

3 months ago

Reply to @dos@librem.one

@dos Camera is 23.5 FPS, IIRC. Do you have it under version control somewhere? This is a bit of achievement :-).

Sebastian Krzyszkowiak

dos@librem.one

3 months ago

Reply to @dos@librem.one

@pavel Seems it's the latter, as the result's exactly the same with 1052x780 camera frames and 263x195 video 😁

Pavel Machek

pavel

3 months ago

Reply to @dos@librem.one

@dos I am not brave enough to debug gstreamer + openGL problems in the same process. You are either lucky or WIZARD :-).

Pavel Machek

pavel

3 months ago

Reply to @dos@librem.one

@dos If you want to make sure, just point camera at the clock :-). gstreamer should get timing information at the input, so I'd expect dropped frames (not wrong speed) if things go wrong.

Sebastian Krzyszkowiak

dos@librem.one

3 months ago

Reply to @pavel

@pavel https://paste.debian.net/1384224/

It's ugly, hardcodes everything, lies on frame timing, occasionally segfaults. Most of it is copied straight from LLM, I just massaged the pieces to work together. Not the kind of code I'd like to sign off on :) But it's a working example, so have fun with it.

Sebastian Krzyszkowiak

dos@librem.one

3 months ago

Reply to @dos@librem.one

@pavel The first thing to do to improve it (after cleaning it up) would be to actually make use of the buffer pool. Dequeue the buffer, attach it as a texture, kick off rendering, get a fence and pass it with the output buffer to GStreamer without waiting on rendering to finish, then queue it back asynchronously once rendering is done. This should allow for much more complex shaders than this sequential code does.

Pavel Machek

pavel

3 months ago

Reply to @dos@librem.one

@dos Fences; that must be some kind of dark magic.

This code seems too good to be true. So, just to be sure, and in case you disappear tomorrow, can I add /* Copyright 2025 Sebastian Krzyszkowiak, GPLv2 */ and act according to that?

Sebastian Krzyszkowiak

dos@librem.one

3 months ago

Reply to @dos@librem.one

@pavel BTW. The fact that I could stream full-res frames and bin them down in the shader in real time is interesting news, as this may open up the possibility of using phase detection autofocus.

Pavel Machek

pavel

3 months ago

Reply to @dos@librem.one

@dos Exactly. That's a bit of big deal. That's why I'm trying to make sure this code does not go away. I had phase-detection auto-focus working at one point, but decided it is unusable as I did not see a way to scale down images quickly enough.

Plus it also adds possibility of zooming.

Sebastian Krzyszkowiak

dos@librem.one

3 months ago

Reply to @pavel

@pavel Good question. Not sure what license would be appropriate to put on something that's mostly an output of a model trained on code on all sorts of licenses anyway...

But given that it's just a bit of glue code between three APIs put together as an example, consider it to be under MIT-0 😜

Sebastian Krzyszkowiak

dos@librem.one

3 months ago

Reply to @dos@librem.one

@pavel (the parts that I added at least, there are parts of your code in there still)

Sebastian Krzyszkowiak

dos@librem.one

3 months ago

Reply to @pavel

@pavel There's plenty of low-hanging fruit in there. Higher frame rates and 10-bit output are also likely a debugging session or two away 😜

Pavel Machek

pavel

3 months ago

Reply to @dos@librem.one

@dos Thank you! I'll take closer look tomorrow or over the weekend. In the meantime, would you have Makefile or build command that goes with it?

Sebastian Krzyszkowiak

dos@librem.one

3 months ago

Reply to @pavel

@pavel LDLIBS = -lEGL -lGLESv2 -lm -ldrm -I/usr/include/libdrm -lgbm -lgstvideo-1.0 -lgstapp-1.0 -lgstallocators-1.0 -lgstreamer-1.0 -lgobject-2.0 -lglib-2.0 -I/usr/include/gstreamer-1.0 -I/usr/include/glib-2.0 -I/usr/lib/aarch64-linux-gnu/glib-2.0/include

Pavel Machek

pavel

3 months ago

Reply to @dos@librem.one

@dos :-) Hopefully. I'll believe things when I see them running locally.

BTW there's one more important thing this can probably do: take full-resolution photos while recording video.

Sebastian Krzyszkowiak

dos@librem.one

3 months ago

Reply to @pavel

@pavel There's a question whether it will be worth elevated power consumption though. I've also stumbled upon csi erroring out with "Rx fifo overflow" requiring a reboot to recover that I haven't seen at lower resolutions, but haven't looked closer.

Pavel Machek

pavel

3 months ago

Reply to @dos@librem.one

@dos Yes, there's more work to be done in the kernel; sometimes camera does not work after reboot, bayer-10 modes are not supported, ... :-(. And yes, it will take more power, but with phase-detection AF, it should be significantly better camera.

Sebastian Krzyszkowiak

dos@librem.one

3 months ago

Reply to @pavel

@pavel Toggling the killswitch makes it appear though.

IIRC PDAF was also usable at half-res.

RAW10 is just a matter of setting up clocks for higher bandwidth and more lanes. Switching data format is then just a single register away.

Sebastian Krzyszkowiak

dos@librem.one

3 months ago

Reply to @dos@librem.one

@pavel When I lie to GStreamer and tell it that its input is in YUY2, it gets faster - perhaps even fast enough to encode at 1052x780. That's another opportunity for improvement.

(and there's nothing magic about fences, it's just a simple synchronization primitive 😛)

Pavel Machek

pavel

3 months ago

Reply to @dos@librem.one

@dos Thanks, I got it to work. I'm putting it into tui repository... and will probably need to reindent it.

For me, there's about 50% CPU usage, so there's still some room.

Yes, YUY2 will be faster; it will also have lower color resolution.

And agreed, there's nothing magic about fences. There's nothing magic about riding horse w/o reins and nothing magic about flying 737, either :-).

Pavel Machek

pavel

3 months ago

Reply to @dos@librem.one

@dos Yeah, I played a bit. Nice. But it segfaults occasionally, and may segfault more when I switch to matroskamux. So I guess the crash may be gstreamer-related? :-). There's also some kind of noise in the bottom right corner, maybe that's related, too.

Sebastian Krzyszkowiak

dos@librem.one

3 months ago

Reply to @pavel

@pavel Pretty sure it will just work fine once it's rewritten cleanly and does such arcane magic as releasing the buffers at the right time etc. :)

Pavel Machek

pavel

3 months ago

Reply to @dos@librem.one

@dos Okay, I pushed code to https://gitlab.com/tui/tui/-/tree/master/icam?ref_type=heads . Debugging this may be a bit "fun".

Do I guess correctly that shaders can do arbitrary resolutions, such as 800x600?

I like the v4l+shaders integration. I'm not sure if I like the v4l+shaders+gstreamer integration.

Sebastian Krzyszkowiak

dos@librem.one

3 months ago

Reply to @pavel

@pavel Yes, of course.

BTW. Turns out that streaming to YouTube instead of a local file is just a matter of using rtmpsink instead of filesink 😁

Sebastian Krzyszkowiak

dos@librem.one

3 months ago

Reply to @dos@librem.one

@pavel I'm playing with GStreamer now (which is new for me) and it seems like most of this code could be replaced with GStreamer elements, and the rest should neatly plug in as custom elements 😂

Pavel Machek

pavel

3 months ago

Reply to @dos@librem.one

@dos I don't believe gstreamer can handle complex cameras. But yes, eventually this code should disappear into libraries somewhere.

Pavel Machek

pavel

3 months ago

Reply to @datenwolf@chaos.social

@datenwolf Thanks a lot. I took a look and got it to build but not run so far. I got distracted by.. other camera code and real life. https://gitlab.com/tui/tui/-/tree/master/icam?ref_type=heads (We had blackout today).

datenwolf

datenwolf@chaos.social

3 months ago

Reply to @pavel

@pavel just FYI I wrote it for testing on desktop. It will require a few adjustments for mobile regarding setup of context and framebuffer.

Regarding the other code: Addressing specific pixels just with normalized coordinate texture functions of GLSL-ES-1 is... hard. OpenGL(ES) doesn't put texel centers but outer edges on coordinate values 0 and 1. So you'll have to solve a fencepost problem and determine the fractional numbers that hit the texel centers for a given texture size.
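
The fencepost arithmetic works out as follows: with the outer edges at 0 and 1, the center of texel i along an axis of N texels sits at (i + 0.5) / N. A small sketch, with a hypothetical helper name:

```python
def texel_center(i: int, size: int) -> float:
    """Normalized coordinate that hits the center of texel i on an axis of `size` texels."""
    if not 0 <= i < size:
        raise ValueError("texel index out of range")
    return (i + 0.5) / size
```

For a 4-texel axis the centers are 0.125, 0.375, 0.625 and 0.875. This is exact in Python floats for small power-of-two sizes, which is not something a reduced-precision shader is guaranteed to reproduce.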

datenwolf

datenwolf@chaos.social

3 months ago

Reply to @datenwolf@chaos.social

Edited 3 months ago

@pavel problem is, that the precision used by mobile in doing the addressing calculations from normalized coordinate often isn't sufficient for this to even work reliably. And some GPUs take some extra leeway. For direct addressing the texels you need texelFetch which is available only with ES-3. Hence my row unzipper trick to sidestep that problem.

datenwolf

datenwolf@chaos.social

3 months ago

Reply to @datenwolf@chaos.social

@pavel another (minor) problem with the other code is, that it uses a quad for drawing. This is bad: it causes all the blocks/tiles along the diagonal to be touched twice, since the GPU will split it into 2 triangles and the edge that isn't parallel and aligned with the processing blocks/tiles must be calculated in full for both faces. Especially mobile GPUs suffer a lot from that.
When doing full screen stuff always just draw a single triangle that covers the whole viewport.
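
The usual way to do this is a triangle with clip-space corners at (-1, -1), (3, -1) and (-1, 3); the parts outside the viewport are clipped away. The sketch below only checks the geometry on the CPU, confirming that this triangle contains the whole [-1, 1] x [-1, 1] square. The names are made up for the example.

```python
# Clip-space vertices of one triangle that covers the whole viewport.
FULLSCREEN_TRIANGLE = [(-1.0, -1.0), (3.0, -1.0), (-1.0, 3.0)]


def _edge(a, b, p):
    """Signed area test: >= 0 when p lies on or to the left of the edge a->b."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def covers(triangle, p) -> bool:
    """True if point p is inside or on the border of a counter-clockwise triangle."""
    a, b, c = triangle
    return _edge(a, b, p) >= 0 and _edge(b, c, p) >= 0 and _edge(c, a, p) >= 0
```

All four viewport corners pass the test, with (1, 1) landing exactly on the long edge, so no tile inside the viewport is crossed by a diagonal.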

Pavel Machek

pavel

3 months ago

Reply to @dos@librem.one

@dos Do you have some ideas how to do viewfinder easily?

Pavel Machek

pavel

3 months ago

Reply to @dos@librem.one

@dos That gstreamer code in C is scary. Multiple threads, no locking, what could go wrong?

Sebastian Krzyszkowiak

dos@librem.one

3 months ago

Reply to @pavel

@pavel You've got a dma-buf handle, already mapped buffer and even GStreamer with all its sinks available, so... however you want? Pretty much anything will be able to consume it easily.

Sebastian Krzyszkowiak

dos@librem.one

3 months ago

Reply to @pavel

@pavel Not sure what you mean. GStreamer is internally multi-threaded, but its API is thread-safe and there's only one thread in this code. Of course any kind of production-quality code will use some mainloop and enqueue buffers based on callbacks rather than while(!processed){} loop, but it's not exactly rocket science.

Pavel Machek

pavel

3 months ago

Reply to @dos@librem.one

@dos There are at least two threads in this code: main one, and whatever runs "on_buffer_released". I don't yet know what causes the segfaults, but I suspect gstreamer.

Pavel Machek

pavel

3 months ago

Reply to @dos@librem.one

@dos I don't have much experience with GUI programming, so I was looking for suggestions. I've got a dma-buf handle but would not know what to do with it in gtk, and maybe SDL is a better match. Or perhaps stick to the original plan and do the user interface ("take picture" button etc) in another process.

Sebastian Krzyszkowiak

dos@librem.one

3 months ago

Reply to @pavel

@pavel That one line is the only thing that runs from another thread and it's neither scary nor requires any locking 😁

But there are several other smelly things in this code and lots of missing error handling, so I'd rather start with that when looking for suspects.

Sebastian Krzyszkowiak

dos@librem.one

3 months ago

Reply to @pavel

@pavel For GTK: either https://docs.gtk.org/gdk4/class.DmabufTextureBuilder.html or https://gstreamer.freedesktop.org/documentation/gtk4/index.html

For SDL with GL: just import it the same way V4L buffers are imported.

Frankly, it's flexible enough that your choice of toolkit should only depend on other factors.

Sebastian Krzyszkowiak

dos@librem.one

3 months ago

Reply to @dos@librem.one

@pavel Passing the right buffer size to gst_dmabuf_allocator_alloc helps it to not crash and not have garbage at the end of the frame 😂

Pavel Machek

pavel

3 months ago

Reply to @dos@librem.one

@dos Heh. Is my guess correct that it should be stride * HO, not WO * HO * 4? :-)

Pavel Machek

pavel

3 months ago

Reply to @dos@librem.one

@dos Yep, that helps. Thanks! :-)
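
A sketch of why `stride * HO` and `WO * HO * 4` can differ: rows are often padded so that each one starts on an aligned address. The 64-byte alignment used below is purely an assumption for the example; in real code the stride should be taken from whatever allocated the buffer rather than computed.

```python
def align_up(value: int, alignment: int) -> int:
    """Round value up to the next multiple of alignment."""
    return (value + alignment - 1) // alignment * alignment


def buffer_size(width: int, height: int, bytes_per_pixel: int, stride_alignment: int) -> int:
    """Size of a padded image buffer: aligned row stride times number of rows."""
    stride = align_up(width * bytes_per_pixel, stride_alignment)
    return stride * height
```

For the 526x390 output with 4 bytes per pixel and the assumed 64-byte alignment, the stride is 2112 bytes instead of 2104, so the buffer is 823680 bytes rather than 820560. Passing the smaller number leaves the last rows outside the declared size.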

Pavel Machek

pavel

3 months ago

Reply to @datenwolf@chaos.social

@datenwolf I'm still hitting uncached memory... so things are 10x slower than they should be. Is there a way to get data from dmabuf to CPU without dealing with that? With 500x400 resolution I can make the deadlines, but it hurts at higher resolutions and is just wrong.

datenwolf

datenwolf@chaos.social

3 months ago

Reply to @pavel

@pavel Which side of the processing chain?

Camera → GPU

before de-Bayering. Or

GPU → CPU

after de-Bayer?

Pavel Machek

pavel

3 months ago

Reply to @datenwolf@chaos.social

@datenwolf gpu -> cpu after debayer. Same problem exists with camera -> cpu, but that's less critical.

datenwolf

datenwolf@chaos.social

3 months ago

Reply to @pavel

@pavel my first instinct would be to put memory management in the hands of GL (instead of tying an external mmap into an image object) by means of pixel buffer objects, mapping those into VA space and then providing those as destination buffers for V4L and source buffers for readout.

Unfortunately my vacation ended last weekend, so I'm shorter on time than last week.

Pavel Machek

pavel

3 months ago

Reply to @datenwolf@chaos.social

@datenwolf Ok, time to ask the lists:

Hi!

It seems that DMA-BUFs are always uncached on arm64... which is a
problem.

I'm trying to get useful camera support on Librem 5, and that includes
recording videos (and taking photos).

memcpy() from normal memory is about 2msec/1MB. Unfortunately, for
DMA-BUFs it is 20msec/1MB, and that basically means I can't easily do
760p video recording. Plus, copying full-resolution photo buffer takes
more than 200msec!

There's a possibility to do some processing on GPU, and it's implemented here:

https://gitlab.com/tui/tui/-/tree/master/icam?ref_type=heads

but that hits the same problem in the end -- data is in DMA-BUF,
uncached, and takes way too long to copy out.

And that's ... wrong. DMA ended seconds ago, complete cache flush
would be way cheaper than copying single frame out, and I still have
to deal with uncached frames.

So I have two questions:

1) Is my analysis correct that, no matter how I get frame from v4l and
process it on GPU, I'll have to copy it from uncached memory in the
end?

2) Does anyone have patches / ideas / roadmap how to solve that? It
makes GPU unusable for computing, and camera basically unusable for
video.

Best regards,
Pavel
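
The numbers in the mail can be put side by side with a little arithmetic. The sketch assumes a full-resolution frame of about 13 million one-byte samples (the thread speaks of "13MP" frames) and takes 1 MB as 1,000,000 bytes; both are simplifications for illustration.

```python
def copy_ms(num_bytes: int, ms_per_mb: float) -> float:
    """Estimated memcpy time in milliseconds, taking 1 MB as 1,000,000 bytes."""
    return num_bytes / 1_000_000 * ms_per_mb


def frame_budget_ms(fps: float) -> float:
    """Time available per frame at the given frame rate."""
    return 1000.0 / fps


FULL_RES_BYTES = 13_000_000   # assumption: ~13 MP at one byte per sample
```

Under these assumptions a full-resolution copy takes about 260 ms from an uncached DMA-BUF against about 26 ms from normal memory, while the budget per frame at 23.5 fps is roughly 42.6 ms.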

Pavel Machek

pavel

2 months ago

Reply to @datenwolf@chaos.social

@datenwolf So I did some asking, and it looks like it is a cache coherency problem on ARM CPUs and there's no easy way around it.

I got your code to run on both Librem 5 and a notebook, thanks. Could you add a copyright notice and some kind of license, preferably GPLv2+ compatible?

datenwolf

datenwolf@chaos.social

2 months ago

Reply to @pavel

@pavel
License is zlib/libpng, seems the most fitting for this kind of thing. Added the notice as file and SPDX headers.

Pavel Machek

pavel

2 months ago

Reply to @datenwolf@chaos.social

@datenwolf Thanks! I added ui overlay to the code and pushed it to https://gitlab.com/tui/debayer-gpu/-/tree/master/glsl?ref_type=heads ... Would you be willing to hack a bit more on it? Flow should be camera -> debayer -> fbo -> downscale and combine with ui overlay -> display. Copying data from fbo is slow, so it needs to be done in a separate thread, in parallel with GPU computations, but I can handle that, once I get memory mapping (I believe).