social.kernel.org

Harry (Hyeonggon) Yoo

hyeyoo

1 year ago

until yesterday I didn't know that my laptop has 2 NUMA nodes, but why?

vbabka

1 year ago

Reply to @hyeyoo

@hyeyoo weird indeed, does it have CXL already? :D

Harry (Hyeonggon) Yoo

hyeyoo

1 year ago

Reply to @vbabka

@vbabka that would have been much slower 🤣
hmm it makes no sense because it has 8GB and 32GB DIMMs and node 0 has 12GB ❓

Maybe the board designer knows why

lkundrak@octodon.social

1 year ago

Reply to @hyeyoo

@hyeyoo but i thought you were the expert :(

Harry (Hyeonggon) Yoo

hyeyoo

1 year ago

Reply to @lkundrak@octodon.social

Edited 1 year ago

@lkundrak
expert? me?
@vbabka @ljs @sj are the experts.
I am just a dumb/curious undergraduate with neither a Ph.D. nor a B.S. (yet) XD

The main benefit of the NUMA architecture is that it distributes memory traffic across several memory buses instead of a single global bus, because the global bus can become a bottleneck as the number of CPUs and the memory capacity grow.

A set of CPUs and the memory near those CPUs is called a NUMA node. If a CPU wants to access memory that is not in its local node, it reads the data from a remote node via the interconnect (instead of the local, faster bus).

Because local (to the CPU) and remote NUMA nodes have different access latency and bandwidth, the OS tries to use the local node's memory first (ofc that depends on the NUMA memory policy of the task/VMA).

But a laptop is too cheap and small a system for a single bus to be a bottleneck, so I don't get why the hardware designer decided to adopt a NUMA architecture.

And it's really strange that different ranges of physical memory from a single DIMM belong to different NUMA nodes. Do they really have different performance characteristics?

ljs

1 year ago

Reply to @hyeyoo

@hyeyoo @lkundrak @sj @vbabka bro I'm not I'm an imposter, you're the real one. I don't even work in kernel mm (atm anyway)

Plus age/qualifications don't matter, you got talent which can't be taught. I have an undergrad in civ eng + taught myself :)

I'd say the main benefit of NUMA isn't avoiding a bottleneck, but rather accounting for the different times memory accesses take, thus allowing the kernel to stop you doing something stupid.

I always picture the literal physical setup of a 2 socket system where there's RAM attached to each socket and a slow interconnect between the two, you don't want to be using that interconnect!

I guess you could say you are trying to avoid the 'global bus' if this == the interconnect.

You find that by default most x86 just has NUMA turned on anyway even in laptop situations, I mean my desktop does too, just with a single node.

I sort of feel like we should have CONFIG_NUMA turned on by default, as it would simplify the code, and just say ok there's 1 node, and all the various mem policy stuff won't make any difference.

One thing that bugged me on my arm64 laptop is that it puts everything in ZONE_DMA because zones don't matter there. But still kind of... ugly

ljs

1 year ago

Reply to @ljs

@hyeyoo @lkundrak @sj @vbabka oh wait wtf you have 2 numa nodes on your laptop? LOL.

Do you have shared video ram or something?

Harry (Hyeonggon) Yoo

hyeyoo

1 year ago

Reply to @ljs

@ljs @lkundrak @sj @vbabka

no, I have only an 8GB DIMM and a 32GB DIMM, and node 0 has 12GB of memory,
node 1 has 23GB of memory lol

ljs

1 year ago

Reply to @hyeyoo

@hyeyoo @lkundrak @sj @vbabka ok what the fuck

ljs

1 year ago

Reply to @ljs

@hyeyoo @lkundrak @sj @vbabka what cores live in each? What does numactl --hardware say?

Harry (Hyeonggon) Yoo

hyeyoo

1 year ago

Reply to @ljs

@ljs @lkundrak @sj @vbabka

node 1 is cpuless, and
$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11
node 0 size: 11937 MB
node 0 free: 6770 MB
node 1 cpus:
node 1 size: 23954 MB
node 1 free: 23854 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10

I was like "wtf did I turn on fake numa?" but no.

$ cat /proc/cmdline
BOOT_IMAGE=(hd1,gpt3)/vmlinuz-6.6.0-rc4+ root=UUID=96f9e501-caa5-4c39-bc11-5d104517f08d ro rootflags=subvol=root loglevel=8
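
For anyone who wants to check the same things on their own machine, here is a minimal Python sketch that parses `numactl --hardware` output like the one pasted above. The function name `parse_numactl_hardware` is made up for illustration, and the parser only assumes the line layout shown in this post. The distance values are relative figures supplied by the firmware (local is conventionally 10), so the ratio printed at the end is what the firmware claims, not a measurement.

```python
import re

# Output copied from the post above; whitespace in the distance table
# is not significant for this parser.
NUMACTL_OUTPUT = """\
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11
node 0 size: 11937 MB
node 0 free: 6770 MB
node 1 cpus:
node 1 size: 23954 MB
node 1 free: 23854 MB
node distances:
node 0 1
0: 10 20
1: 20 10
"""


def parse_numactl_hardware(text):
    """Return (nodes, distances) parsed from `numactl --hardware` text."""
    nodes = {}
    distances = {}
    in_distances = False
    for line in text.splitlines():
        line = line.strip()
        if line == "node distances:":
            in_distances = True
            continue
        if in_distances:
            # Rows look like "0: 10 20"; the "node 0 1" header is skipped.
            m = re.match(r"(\d+):\s+(.*)", line)
            if m:
                distances[int(m.group(1))] = [int(d) for d in m.group(2).split()]
            continue
        m = re.match(r"node (\d+) cpus:(.*)", line)
        if m:
            node = nodes.setdefault(int(m.group(1)), {})
            node["cpus"] = [int(c) for c in m.group(2).split()]
            continue
        m = re.match(r"node (\d+) (size|free): (\d+) MB", line)
        if m:
            node = nodes.setdefault(int(m.group(1)), {})
            node[m.group(2) + "_mb"] = int(m.group(3))
    return nodes, distances


nodes, distances = parse_numactl_hardware(NUMACTL_OUTPUT)
cpuless = [n for n, info in nodes.items() if not info["cpus"]]
total_mb = sum(info["size_mb"] for info in nodes.values())
relative = distances[0][1] / distances[0][0]

print("cpuless nodes:", cpuless)
print("total memory (MB):", total_mb)
print("node 0 -> node 1 distance relative to local:", relative)
```

On the output above this reports node 1 as the only cpuless node and a remote distance of twice the local one.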

Anisse

Aissen@social.treehouse.systems

1 year ago

Reply to @hyeyoo

@hyeyoo
The topology usually comes from the firmware, via the ACPI SRAT/SLIT tables. Tell us more about:
Your CPU
Your BIOS vendor (and ideally the NUMA-related config...)
Your DIMM topology (speed?)
@lkundrak @ljs @sj @vbabka
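
What the kernel ended up with after reading those firmware tables can also be inspected without numactl. A rough Python sketch, assuming a Linux system that exposes `/sys/devices/system/node/nodeN/cpulist` and `/sys/devices/system/node/nodeN/distance`; the function name `read_numa_topology` and the `root` parameter are hypothetical and exist only so the code can be pointed at a test directory:

```python
from pathlib import Path


def _read(path):
    """Return the stripped file contents, or "" if the file is unreadable."""
    try:
        return path.read_text().strip()
    except OSError:
        return ""


def read_numa_topology(root="/sys/devices/system/node"):
    """Return {node_id: {"cpulist": str, "distance": [int, ...]}} from sysfs."""
    topology = {}
    base = Path(root)
    if not base.is_dir():
        return topology  # no NUMA information exposed, or not Linux
    for node_dir in base.glob("node[0-9]*"):
        node_id = int(node_dir.name[len("node"):])
        cpulist = _read(node_dir / "cpulist")
        distance = [int(d) for d in _read(node_dir / "distance").split()]
        topology[node_id] = {"cpulist": cpulist, "distance": distance}
    return topology


if __name__ == "__main__":
    for node_id, info in sorted(read_numa_topology().items()):
        cpus = info["cpulist"] or "(none, cpuless node)"
        print(f"node {node_id}: cpus {cpus}, distances {info['distance']}")
```

An empty `cpulist` is how a cpuless node like the one in this thread would show up.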

ljs

1 year ago

Reply to @Aissen@social.treehouse.systems

@Aissen @hyeyoo @lkundrak @sj @vbabka

^ this

The only cpuless structure I'd heard of before was some powerpc madness.

I guess it could be some 2-tier thing where node 1 is just slower for EVERYTHING.

You can have weirdness actually if there's a dimm size mismatch because node 0 might be dual channel (whole of 1 dimm, portion of another I guess?) and the rest is single channel?

The fact that numactl --hardware suggests node 1 is twice as slow hints at this

Harry (Hyeonggon) Yoo

hyeyoo

1 year ago

Reply to @Aissen@social.treehouse.systems

@Aissen @lkundrak @ljs @sj @vbabka

It's a Lenovo ThinkBook 15 G4 ABA,
AMD Ryzen 5 5625U with Radeon Graphics.

hmm 'bios vendor' is a bit unclear but looks like written by Lenovo itself?

# dmidecode -t memory
# dmidecode 3.4
Getting SMBIOS data from sysfs.
SMBIOS 3.3.0 present.

Handle 0x0022, DMI type 16, 23 bytes
Physical Memory Array
Location: System Board Or Motherboard
Use: System Memory
Error Correction Type: None
Maximum Capacity: 64 GB
Error Information Handle: 0x0025
Number Of Devices: 2

Handle 0x0023, DMI type 17, 92 bytes
Memory Device
Array Handle: 0x0022
Error Information Handle: 0x0026
Total Width: 64 bits
Data Width: 64 bits
Size: 32 GB
Form Factor: SODIMM
Set: None
Locator: DIMM 0
Bank Locator: P0 CHANNEL A
Type: DDR4
Type Detail: Synchronous Unbuffered (Unregistered)
Speed: 3200 MT/s
Manufacturer: Unknown
Serial Number: 0CCD0E1E
Asset Tag: Not Specified
Part Number: KD4BGSA80-32N220A
Rank: 2
Configured Memory Speed: 3200 MT/s
Minimum Voltage: 1.2 V
Maximum Voltage: 1.2 V
Configured Voltage: 1.2 V
Memory Technology: DRAM
Memory Operating Mode Capability: Volatile memory
Firmware Version: Unknown
Module Manufacturer ID: Bank 9, Hex 0x98
Module Product ID: Unknown
Memory Subsystem Controller Manufacturer ID: Unknown
Memory Subsystem Controller Product ID: Unknown
Non-Volatile Size: None
Volatile Size: 32 GB
Cache Size: None
Logical Size: None

Handle 0x0024, DMI type 17, 92 bytes
Memory Device
Array Handle: 0x0022
Error Information Handle: 0x0027
Total Width: 64 bits
Data Width: 64 bits
Size: 8 GB
Form Factor: Row Of Chips
Set: None
Locator: DIMM 0
Bank Locator: P0 CHANNEL B
Type: DDR4
Type Detail: Synchronous Unbuffered (Unregistered)
Speed: 3200 MT/s
Manufacturer: Hynix
Serial Number: 00000000
Asset Tag: Not Specified
Part Number: HMAA1GS6CJR6N-XN
Rank: 1
Configured Memory Speed: 3200 MT/s
Minimum Voltage: 1.2 V
Maximum Voltage: 1.2 V
Configured Voltage: 1.2 V
Memory Technology: DRAM
Memory Operating Mode Capability: Volatile memory
Firmware Version: Unknown
Module Manufacturer ID: Bank 1, Hex 0xAD
Module Product ID: Unknown
Memory Subsystem Controller Manufacturer ID: Unknown
Memory Subsystem Controller Product ID: Unknown
Non-Volatile Size: None
Volatile Size: 8 GB
Cache Size: None
Logical Size: None

lkundrak@octodon.social

1 year ago

Reply to @hyeyoo

@hyeyoo @ljs @sj @vbabka
> "wtf did I turn on fake numa?"
did you buy your numa on vaclavak in prague?

Petr Tesarik

ptesarik@fosstodon.org

1 year ago

Reply to @ljs

@ljs @lkundrak @hyeyoo @sj @vbabka
@Aissen Bah, if you only saw cpuless nodes on IBM POWER, you're lucky. I saw them on Itanium machines… 😭

ljs

1 year ago

Reply to @ptesarik@fosstodon.org

@ptesarik @lkundrak @hyeyoo @sj @vbabka @Aissen I feel for you!

Petr Tesarik

ptesarik@fosstodon.org

1 year ago

Reply to @hyeyoo

@hyeyoo @lkundrak @ljs @sj @vbabka @Aissen I would try to report this to Lenovo. I cannot imagine how this setup could be useful, so I assume the NUMA tables are bogus and copied from another system by mistake.

Anisse

Aissen@social.treehouse.systems

1 year ago

Reply to @hyeyoo

Edited 1 year ago

@hyeyoo
My initial thought was a BIOS bug, especially since it creates a 12GB and a 24GB node. The CPU seems to have all 6 cores with similar access to the two memory controllers. The fact that one DIMM has an Unknown vendor would tend to support this; BIOSes like to hardcode things per-vendor for some reason. Also, they had years and years of QA on Intel machines, and tend to botch things up on AMD, even today.

Make sure your BIOS is up-to-date, otherwise, report it.

@lkundrak @ljs @sj @vbabka

ljs

1 year ago

Reply to @ptesarik@fosstodon.org

@ptesarik @hyeyoo @lkundrak @sj @vbabka @Aissen I wonder whether it's because the mismatched DIMMs cause a portion of the available RAM to be accessible only slowly though?

Some is dual-channel, some is single-channel

Anisse

Aissen@social.treehouse.systems

1 year ago

Reply to @ljs

@ljs Oh yeah, totally plausible. But why the 12/24 split ?
Also note that it's an APU, so some memory might be dedicated to the GPU.

vbabka

1 year ago

Reply to @Aissen@social.treehouse.systems

@Aissen @ljs maybe 4GB is dedicated because it's 36GB total reported by numactl vs 40GB physically?

vbabka

1 year ago

Reply to @ptesarik@fosstodon.org

@ptesarik @ljs @lkundrak @hyeyoo @sj @Aissen all Itanium machines were cpuless.
/me hides

Petr Tesarik

ptesarik@fosstodon.org

1 year ago

Reply to @vbabka

@vbabka @lkundrak @hyeyoo @ljs @sj @Aissen 🤣

Petr Tesarik

ptesarik@fosstodon.org

1 year ago

Reply to @ptesarik@fosstodon.org

Edited 1 year ago

@vbabka @lkundrak @hyeyoo @ljs @sj @Aissen For the record, I was talking about SGI Altix 3000, which was built of “bricks”. The C-brick contained four Itanium 2 CPUs and 32 GB RAM (that should be enough for everybody), but if it wasn't enough for you, you could buy an M-brick, which contained only memory, no CPU.

Harry (Hyeonggon) Yoo

hyeyoo

1 year ago

Reply to @ptesarik@fosstodon.org

Edited 1 year ago

@ptesarik @lkundrak @ljs @sj @vbabka @Aissen

I got a reply from the Lenovo HW team. For a reason that I don't understand, he said that part of the 32GB DIMM runs in dual channel together with the 8GB DIMM, and the remaining part runs in single channel and is thus a separate NUMA node.

```
[...]
the summary is that the performance for the two nodes is different. Node 0 is dual channel interleaved whereas Node 1 is single channel - so this is working as designed.
[...]
```

https://forums.lenovo.com/t5/ThinkBook-Laptops/BIOS-problem-Bogus-NUMA-configuration-on-thinkbook/m-p/5260517?page=1#6212698

Petr Tesarik

ptesarik@fosstodon.org

1 year ago

Reply to @hyeyoo

@hyeyoo @lkundrak @ljs @sj @vbabka @Aissen Oh! Thank you for the update. Yes, dual-channel RAM access is generally faster than single-channel.

Anisse

Aissen@social.treehouse.systems

1 year ago

Reply to @hyeyoo

@hyeyoo @ptesarik @lkundrak @ljs @sj @vbabka I still don't understand. Where does the 4GB GPU carveout come from ? I can understand the 8GB of overlap for dual-channel, but not the additional 4GB.

vbabka

1 year ago

Reply to @Aissen@social.treehouse.systems

@Aissen @hyeyoo @ptesarik @lkundrak @ljs @sj dual channel is not like raid1, but like raid0: the capacities add up. So the 8GB dimm together with 8GB from the 32GB dimm gives you a 16GB node; of that, 4GB is used by the GPU and 12GB is available and reported as node 0.
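
The arithmetic in this thread can be written out as a small Python sketch. The 4GB GPU carve-out is a guess made in the thread, not something Lenovo confirmed, and the variable names are only for illustration. The reported sizes are the ones from the `numactl --hardware` output earlier in the conversation.

```python
# DIMM sizes from the dmidecode output, in GB.
dimm_small_gb = 8
dimm_large_gb = 32

# Assumption from the thread: the APU reserves 4GB for graphics.
gpu_carveout_gb = 4

# The dual-channel region pairs the whole small DIMM with an equal
# slice of the large one, and the two capacities add up.
dual_channel_gb = 2 * min(dimm_small_gb, dimm_large_gb)

# Node 0 is the dual-channel region minus the assumed GPU carve-out.
node0_gb = dual_channel_gb - gpu_carveout_gb

# Node 1 is whatever is left of the large DIMM, single channel only.
node1_gb = dimm_large_gb - min(dimm_small_gb, dimm_large_gb)

# Sizes reported by numactl --hardware, in MB.
reported_mb = {0: 11937, 1: 23954}

print(f"dual-channel region: {dual_channel_gb} GB")
print(f"node 0 expected: {node0_gb} GB, reported: {reported_mb[0]} MB")
print(f"node 1 expected: {node1_gb} GB, reported: {reported_mb[1]} MB")
print(f"visible total: {node0_gb + node1_gb} GB "
      f"of {dimm_small_gb + dimm_large_gb} GB installed")
```

The expected 12GB and 24GB line up with the reported node sizes, with the small remainder presumably being memory reserved by the firmware and kernel.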