[en] Debugging Nintendo Switch Linux power management – battery desync edition

When Fail0verflow revealed an unpatchable 0-day in the Tegra X1 BootROM, they released a Switch Linux port as a PoC payload. This Linux port works quite well, but it had some quirks, and one of the most significant bugs was the “battery desync” bug.

If you play a game on a Switch that has previously booted Linux, the Switch suddenly shuts itself off when the battery level reaches around 46%. Even when you are not playing a game, it shuts itself off at around 26%. The only way to clear this bug is to power cycle the full system.

This symptom is very similar to a common failure in generic battery-powered devices: when the system draws a surge of power that the battery is incapable of supplying, the system shuts down.

By combining this knowledge with the Device Tree from Switch Linux, I deduced:

  1. Nintendo Switch has these chips in its power system: BQ24193, MAX17050 and MAX77620.
  2. A full system power cycle turns these chips off and on; they are not power cycled otherwise.
  3. The “battery desync” bug may be caused by wrong values in registers of those chips, which Linux sets but Horizon OS does not reset.
  4. By comparing register values in those chips, I should be able to get some clue about how to fix the “desync bug”.



NX-MemXPlorer – Nintendo Switch Low-Level Explorer

So, I knew that reading the register values of the power system chips would help my research into the “desync bug”. No tool for that existed when I did my research, so I wrote one.

I wrote NX-MemXPlorer by hacking the Memloader source code. I poured my previous written-in-a-hurry interactive debug shell code into Memloader and wrote additional helper functions for dumping the BQ24193 and MAX17050.

With this tool, I was able to diff the register values between Horizon OS and Linux.

Dumping BQ24193 and analysis

I dumped the BQ24193 after booting Horizon and after booting Linux, then ran diff on the two dumps.

$ diff batt-horizon.log batt-linux.log 
2,3c2,3
< 0x00 = 0x02
< 0x01 = 0x10
---
> 0x00 = 0x32
> 0x01 = 0x15
5c5
< 0x03 = 0x00
---
> 0x03 = 0x31
7c7
< 0x05 = 0x82
---
> 0x05 = 0x8a
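
To see which bits actually changed instead of eyeballing hex values, the four differing registers can be XORed. This is only a minimal sketch: `reg_pair`, `changed_bits` and `report_changes` are names made up for illustration and are not part of NX-MemXPlorer, and the values are copied by hand from the diff above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* One register captured under both operating systems (hypothetical type). */
struct reg_pair {
    uint8_t addr; /* register address */
    uint8_t hos;  /* value dumped after booting Horizon OS */
    uint8_t lnx;  /* value dumped after booting Linux */
};

/* Return a mask with a 1 in every bit position where the two values differ. */
uint8_t changed_bits(uint8_t hos, uint8_t lnx)
{
    return (uint8_t)(hos ^ lnx);
}

/* Print every register whose value differs and return how many differ. */
size_t report_changes(const struct reg_pair *regs, size_t count)
{
    size_t differing = 0;

    assert(regs != NULL);
    for (size_t i = 0; i < count; i++) {
        uint8_t mask = changed_bits(regs[i].hos, regs[i].lnx);

        if (mask == 0)
            continue;
        printf("reg 0x%02x: 0x%02x -> 0x%02x (changed bits 0x%02x)\n",
               regs[i].addr, regs[i].hos, regs[i].lnx, mask);
        differing++;
    }
    return differing;
}

/* The four BQ24193 registers from the diff above. */
const struct reg_pair bq24193_diff[] = {
    { 0x00, 0x02, 0x32 },
    { 0x01, 0x10, 0x15 },
    { 0x03, 0x00, 0x31 },
    { 0x05, 0x82, 0x8a },
};
```

Calling `report_changes(bq24193_diff, 4)` prints one line per register with its XOR mask. What each bit means still has to be looked up in the charger datasheet.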

Hmm, it does not look meaningfully different. Let’s dump another one: the MAX17050.

Dumping MAX17050

I also dumped the MAX17050 after booting Horizon and after booting Linux.

$ diff fuel-horizon-after-linux-after-horizon.log fuel-horizon.log 
6c6
< 0x03 = 0x6563
---
> 0x03 = 0xff00
8,18c8,18
< 0x05 = 0x2571
< 0x06 = 0x6400
< 0x07 = 0x66a3
< 0x08 = 0x2084
< 0x09 = 0xd1b8
< 0x0a = 0x0017
< 0x0b = 0x00dc
< 0x0d = 0x6480
< 0x0e = 0x6480
< 0x0f = 0x25a1
< 0x10 = 0x2571
---
> 0x05 = 0x258d
> 0x06 = 0x63f3
> 0x07 = 0x6707
> 0x08 = 0x211e
> 0x09 = 0xd194
> 0x0a = 0x000b
> 0x0b = 0xffba
> 0x0d = 0x64be
> 0x0e = 0x646d
> 0x0f = 0x25bc
> 0x10 = 0x2592
22c22
< 0x16 = 0x2063
---
> 0x16 = 0x2118
25c25
< 0x19 = 0xd0f6
---
> 0x19 = 0xd101
31c31
< 0x1f = 0x2573
---
> 0x1f = 0x258a
37c37
< 0x27 = 0x75fa
---
> 0x27 = 0x74b2
53c53
< 0x3e = 0xe220
---
> 0x3e = 0xc139
58c58
< 0x4d = 0x122f
---
> 0x4d = 0x124b
107,108c107,108
< 0xfb = 0xd16e
< 0xff = 0x6498
---
> 0xfb = 0xd19f
> 0xff = 0x64eb

It does not look meaningful at all except 0x03: the battery level interrupt. However, restoring 0x03 to 0xff00 didn’t fix the battery bug.
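
A minimal sketch of how the 0x03 word can be read. It assumes that 0x03 holds a state-of-charge alert threshold with the maximum in the upper byte and the minimum in the lower byte, both in percent. That layout is an assumption to be checked against the MAX17050 datasheet, and the function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Decoded view of the 16-bit threshold word (ASSUMED layout, see above). */
struct soc_alert {
    uint8_t max_percent; /* upper byte: alert when charge rises above this */
    uint8_t min_percent; /* lower byte: alert when charge falls below this */
};

/* Split the raw register word into its two assumed threshold bytes. */
struct soc_alert decode_soc_alert(uint16_t raw)
{
    struct soc_alert a;

    a.max_percent = (uint8_t)(raw >> 8);
    a.min_percent = (uint8_t)(raw & 0xff);
    return a;
}

/* With max = 0xff and min = 0x00 the window is as wide as the bytes allow. */
int soc_alert_wide_open(uint16_t raw)
{
    return raw == 0xff00;
}

/* Print both thresholds for one dump, e.g. "after Linux". */
void print_soc_alert(const char *label, uint16_t raw)
{
    struct soc_alert a = decode_soc_alert(raw);

    assert(label != NULL);
    printf("%s: max %u%%, min %u%%%s\n", label,
           (unsigned)a.max_percent, (unsigned)a.min_percent,
           soc_alert_wide_open(raw) ? " (window wide open)" : "");
}
```

Under that assumption, 0x6563 from the after-Linux dump is a narrow window (0x65 and 0x63, so 101 and 99) around a full battery, while Horizon's 0xff00 is the widest window possible. That reading fits the observation that restoring the register changed nothing about the shutdown.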

The only remaining one is the MAX77620, but here comes the dragon: the NDA.

MAX77620 register analysis

The MAX77620 is a PMIC which has lots of other features like GPIO and RTC. The problem is that the Maxim Integrated website does not contain any information about this chip. Only the Jetson TX1 dev board and other devices like the Nintendo Switch let us know that the MAX77620 exists.

Also, the Linux MAX77620 driver was submitted by Nvidia. Worst of all, according to a post in the Nvidia developer forum, detailed information is only available under NDA.

This means I can only reference the Linux driver code when exploring this chip. Another problem: the analysis target is a PMIC, which can fry the whole system if I do a wrong read or write. Undocumented killer POKE and PEEK. Not so good.

Under these limitations, I had an idea that would shed some light on the undocumented register map: observing which registers the driver writes.

To monitor which registers the Linux MAX77620 driver accesses, I patched the Linux I2C core with this patch.

diff --git a/drivers/i2c/i2c-core-base.c b/drivers/i2c/i2c-core-base.c
index 1ba40bb2b966..626ca9a8b11e 100644
--- a/drivers/i2c/i2c-core-base.c
+++ b/drivers/i2c/i2c-core-base.c
@@ -1971,6 +1971,17 @@ int i2c_transfer_buffer_flags(const struct i2c_client *client, char *buf,
                .len = count,
                .buf = buf,
        };
+       int i;
+       size_t pos;
+       char foo[512];
+
+       if (strcmp(client->name, "max77620") == 0 || strcmp(client->name, "max77621") == 0) {
+               pos = scnprintf(foo, sizeof(foo), "I2C XFER: device = %s, flags = 0x%02x, data = [", client->name, msg.flags);
+               for (i = 0; i < msg.len; i++)
+                       pos += scnprintf(foo + pos, sizeof(foo) - pos, "0x%02x, ", msg.buf[i]);
+               printk("%s]\n", foo);
+               dump_stack();
+       }
 
        ret = i2c_transfer(client->adapter, &msg, 1);
 

After that, I grepped the kernel log and extracted the I2C XFER lines from it.

This is the final state of the MAX77620 registers that Linux changes.

[    1.544424] I2C XFER: device = max77620, flags = 0x00, data = [0x56, 0x22, ] // Differ on pure HOS - 0x23
[    1.939605] I2C XFER: device = max77620, flags = 0x00, data = [0x1d, 0x70, ] // Differ on pure HOS - 0x40
[    2.138520] I2C XFER: device = max77620, flags = 0x00, data = [0x1e, 0x70, ] // Differ on pure HOS - 0x40
[    2.337222] I2C XFER: device = max77620, flags = 0x00, data = [0x1f, 0x70, ] // Differ on pure HOS - 0x40
[    2.536109] I2C XFER: device = max77620, flags = 0x00, data = [0x20, 0x70, ] // Differ on pure HOS - 0x40
[    2.741083] I2C XFER: device = max77620, flags = 0x00, data = [0x25, 0xca, ] // Differ on pure HOS - 0x0a
[    2.945738] I2C XFER: device = max77620, flags = 0x00, data = [0x29, 0xee, ] // Differ on pure HOS - 0x2e
[    3.348048] I2C XFER: device = max77620, flags = 0x00, data = [0x2b, 0xc4, ] // Differ on pure HOS - 0x10
[    3.741160] I2C XFER: device = max77620, flags = 0x00, data = [0x2d, 0xd4, ] // Differ on pure HOS - 0x2e
[    4.134812] I2C XFER: device = max77620, flags = 0x00, data = [0x2f, 0xea, ] // Differ on pure HOS - 0x28
[    4.323549] I2C XFER: device = max77620, flags = 0x00, data = [0x31, 0xc5, ] // Differ on pure HOS - 0x05
[    4.522174] I2C XFER: device = max77620, flags = 0x00, data = [0x33, 0xc5, ] // Differ on pure HOS - 0x05
[    4.721309] I2C XFER: device = max77620, flags = 0x00, data = [0x00, 0xff, ] // Differ on pure HOS - 0x92 * DIFFER FROM HOS AFTER LINUX - 0xFF
[    5.310380] I2C XFER: device = max77620, flags = 0x00, data = [0x0d, 0xe7, ] // Differ on pure HOS - 0x75
[    5.924899] I2C XFER: device = max77620, flags = 0x00, data = [0x0e, 0x08, ] // Differ on pure HOS - 0x00
[    6.276275] I2C XFER: device = max77620, flags = 0x00, data = [0x27, 0xd4, ] // ?? Why 0xf2 in memxplorer?
[    7.397757] I2C XFER: device = max77620, flags = 0x00, data = [0x3c, 0x09, ] // Differ on pure HOS - 0x02
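
Reducing the raw kernel log to a "final value per register" table like the one above can be sketched as follows. This is an illustration, not the script used for this post: `reg_state`, `record_write` and `print_final_state` are made-up names, and the parser assumes every interesting line is a two-byte write in exactly the format the patch prints.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define REG_COUNT 256

/* Last value written to each register, and whether it was written at all. */
struct reg_state {
    uint8_t value[REG_COUNT];
    bool written[REG_COUNT];
};

/*
 * Parse one kernel log line. Only two-byte writes to max77620 are accepted:
 * the first data byte is taken as the register address, the second as the
 * value. Returns true and updates *state when the line matched.
 */
bool record_write(struct reg_state *state, const char *line)
{
    static const char prefix[] =
        "I2C XFER: device = max77620, flags = 0x00, data = ";
    const char *p;
    unsigned int reg, val;
    int end = 0;

    assert(state != NULL && line != NULL);
    p = strstr(line, prefix);
    if (p == NULL)
        return false;
    p += strlen(prefix);
    /* %n is only reached if the closing bracket matched as well. */
    if (sscanf(p, "[%x, %x, ]%n", &reg, &val, &end) != 2 || end == 0)
        return false;
    if (reg >= REG_COUNT || val > 0xff)
        return false;
    state->value[reg] = (uint8_t)val; /* later writes overwrite earlier ones */
    state->written[reg] = true;
    return true;
}

/* Print the final value of every register that was written at least once. */
void print_final_state(const struct reg_state *state)
{
    assert(state != NULL);
    for (unsigned int reg = 0; reg < REG_COUNT; reg++)
        if (state->written[reg])
            printf("0x%02x = 0x%02x\n", reg, state->value[reg]);
}
```

Feeding every log line to `record_write` in order and then calling `print_final_state` yields a dump in the same `0xNN = 0xNN` shape as the NX-MemXPlorer logs, so the two can be compared directly.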

The noticeable part of this dump is register 0x00. Horizon OS wants 0x92 in this register, but Linux sets it to 0xFF.

Let’s overwrite it with 0x92 and run “Korokmark2018” (Zelda BotW) to drain the battery.

It worked. Now that I knew I could fix “battery desync” by writing 0x92 to register 0x00 of the MAX77620, I tried to find out why.

According to the source code, that register is named CNFGGLBL1, and the names of the bits in that register are quite meaningful. LBRSTEN? Maybe an acronym for Low Battery ReSeT ENable? Maybe those bits are related to a low-voltage emergency shutdown. Also, Horizon OS does not touch those bits.
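
To make the comparison between 0x92 and 0xFF concrete, here is a small decoding sketch. The bit positions are an assumption based on my reading of the CNFGGLBL1 definitions in the Linux header `include/linux/mfd/max77620.h` and should be verified against your kernel tree. The macro and function names below are made up for this example, and the real meaning of each field stays a guess without the datasheet.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* ASSUMED bit layout of CNFGGLBL1 (register 0x00); verify before relying on it. */
#define CNFGGLBL1_LBDAC_EN 0x80u /* bit 7 */
#define CNFGGLBL1_MPPLD    0x40u /* bit 6 */
#define CNFGGLBL1_LBHYST   0x30u /* bits 5:4 */
#define CNFGGLBL1_LBDAC    0x0eu /* bits 3:1 */
#define CNFGGLBL1_LBRSTEN  0x01u /* bit 0 */

/* Field values extracted from one raw register byte. */
struct cnfgglbl1 {
    unsigned lbdac_en;
    unsigned mppld;
    unsigned lbhyst;
    unsigned lbdac;
    unsigned lbrsten;
};

/* Split the raw byte into the assumed fields. */
struct cnfgglbl1 decode_cnfgglbl1(uint8_t raw)
{
    struct cnfgglbl1 f;

    f.lbdac_en = (raw & CNFGGLBL1_LBDAC_EN) != 0;
    f.mppld    = (raw & CNFGGLBL1_MPPLD) != 0;
    f.lbhyst   = (raw & CNFGGLBL1_LBHYST) >> 4;
    f.lbdac    = (raw & CNFGGLBL1_LBDAC) >> 1;
    f.lbrsten  = (raw & CNFGGLBL1_LBRSTEN) != 0;
    return f;
}

/* Print all fields of one value, e.g. the Horizon OS or the Linux one. */
void print_cnfgglbl1(const char *label, uint8_t raw)
{
    struct cnfgglbl1 f = decode_cnfgglbl1(raw);

    assert(label != NULL);
    printf("%s (0x%02x): LBDAC_EN=%u MPPLD=%u LBHYST=%u LBDAC=%u LBRSTEN=%u\n",
           label, (unsigned)raw, f.lbdac_en, f.mppld, f.lbhyst, f.lbdac, f.lbrsten);
}
```

Under that assumed layout, 0x92 leaves LBRSTEN clear with LBDAC at 1, while 0xFF sets LBRSTEN and pushes LBDAC to its maximum of 7. If LBDAC really is a low-battery threshold, that would fit a reset firing long before the battery is empty.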

This leads to the conclusion: the “battery desync” name tag was a red herring. It wasn’t related to a battery fuel level desync at all.

The root cause of “battery desync”

Then, why does Linux set it to 0xFF? Let’s follow the stack trace.

[    4.721309] I2C XFER: device = max77620, flags = 0x00, data = [0x00, 0xff, ]
[    4.728382] CPU: 2 PID: 1 Comm: swapper/0 Not tainted 4.17.0-00059-g87bd3c37c1cd-dirty #16
[    4.736641] Hardware name: Nintendo Switch (DT)
[    4.741165] Call trace:
[    4.743614]  dump_backtrace+0x0/0x1b0
[    4.747274]  show_stack+0x14/0x20
[    4.750589]  dump_stack+0x9c/0xbc
[    4.753903]  i2c_transfer_buffer_flags+0x120/0x178
[    4.758691]  regmap_i2c_write+0x1c/0x50
[    4.762526]  _regmap_raw_write_impl+0x5f4/0x750
[    4.767053]  _regmap_bus_raw_write+0x60/0x78
[    4.771321]  _regmap_write+0x58/0xa8
[    4.774893]  _regmap_update_bits+0xf0/0x108
[    4.779073]  regmap_update_bits_base+0x60/0x90
[    4.783513]  regmap_irq_update_bits.isra.1+0x44/0x50
[    4.788472]  regmap_add_irq_chip+0x4c4/0x8a8
[    4.792739]  devm_regmap_add_irq_chip+0x8c/0x100
[    4.797354]  max77620_gpio_probe+0x10c/0x1a8
[    4.801620]  platform_drv_probe+0x58/0xb8
[    4.805628]  driver_probe_device+0x298/0x468
[    4.809895]  __device_attach_driver+0x88/0x140
[    4.814335]  bus_for_each_drv+0x78/0xc8
[    4.818168]  __device_attach+0xd4/0x150
[    4.822001]  device_initial_probe+0x10/0x18
[    4.826180]  bus_probe_device+0x90/0x98
[    4.830012]  device_add+0x3ec/0x5f8
[    4.833497]  platform_device_add+0x110/0x278
[    4.837766]  mfd_add_device+0x2a8/0x2f8
[    4.841598]  mfd_add_devices+0xac/0x148
[    4.845431]  devm_mfd_add_devices+0x78/0xd8
[    4.849610]  max77620_probe+0x49c/0x6d8
[    4.853444]  i2c_device_probe+0x264/0x2c8
[    4.857450]  driver_probe_device+0x298/0x468
[    4.861717]  __driver_attach+0x114/0x118
[    4.865637]  bus_for_each_dev+0x70/0xc0
[    4.869469]  driver_attach+0x20/0x28
[    4.873042]  bus_add_driver+0x248/0x278
[    4.876872]  driver_register+0x60/0xf8
[    4.880617]  i2c_register_driver+0x44/0xa0
[    4.884713]  max77620_driver_init+0x18/0x20
[    4.888891]  do_one_initcall+0x70/0x144
[    4.892724]  kernel_init_freeable+0x180/0x21c
[    4.897078]  kernel_init+0x10/0x104
[    4.900565]  ret_from_fork+0x10/0x18

It is quite weird. Why does the GPIO driver touch register 0x00? Let’s look inside.

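Huh, the interrupt masking code is writing 0xFF into 0x00. Why? Because Linux’s regmap IRQ code didn’t support chips whose interrupts cannot be masked: the GPIO IRQ chip defines no mask register, so the mask address is left at 0x00 and the “mask everything” write lands there.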
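
The failure mode can be modelled in a few lines of userspace C. This is a toy model, not the kernel's regmap code: `fake_pmic`, `fake_irq_chip` and both `add_irq_chip_*` functions are hypothetical, and the only thing they mirror is the behaviour described above, where an unset mask address is indistinguishable from register 0x00.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy register file standing in for the PMIC (hypothetical). */
struct fake_pmic {
    uint8_t regs[256];
};

/* Reduced stand-in for an IRQ chip description (hypothetical). */
struct fake_irq_chip {
    uint8_t mask_base;    /* mask register address; 0 when never filled in */
    uint8_t default_mask; /* bits to set so every interrupt starts masked */
};

/* Read-modify-write: replace the bits selected by `mask` with those of `val`. */
void update_bits(struct fake_pmic *pmic, uint8_t reg, uint8_t mask, uint8_t val)
{
    assert(pmic != NULL);
    pmic->regs[reg] = (uint8_t)((pmic->regs[reg] & ~mask) | (val & mask));
}

/* Models the buggy behaviour: the mask write happens unconditionally. */
void add_irq_chip_unguarded(struct fake_pmic *pmic, const struct fake_irq_chip *chip)
{
    assert(chip != NULL);
    update_bits(pmic, chip->mask_base, chip->default_mask, chip->default_mask);
}

/* One possible duct-tape guard: treat mask_base == 0 as "no mask register". */
void add_irq_chip_guarded(struct fake_pmic *pmic, const struct fake_irq_chip *chip)
{
    assert(chip != NULL);
    if (chip->mask_base == 0)
        return; /* skip the write instead of clobbering register 0x00 */
    update_bits(pmic, chip->mask_base, chip->default_mask, chip->default_mask);
}
```

Starting from `regs[0x00] = 0x92` with a chip of `{ 0x00, 0xff }`, the unguarded variant leaves 0xFF in register 0x00 while the guarded one keeps 0x92. The guard has an obvious weakness, though: a chip whose mask register really is at address 0x00 would never get its interrupts masked.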

Also, Nvidia’s original patch for the MAX77620 GPIO driver touched 0x00 via Device Tree until v4 of that patch.

Additionally, Nvidia knew about this problem and tried to solve it with a patch that has a flaw (think about the case where the interrupt mask really is located at 0x00). Of course, this patch wasn’t mainlined, and the bug finally ended up screwing over Switch Linux users.

So, there is no clean solution for this bug, but I can patch it with duct-tape code until Linux gets proper support for it.

Also, CTCaer’s Hekate IPL includes a battery fix which writes 0x92 to MAX77620 register 0x00. My Linux and U-Boot patches prevent Linux from setting the wrong value in CNFGGLBL1, and CTCaer’s Hekate provides the fix for people who are already affected.

p.s.

Nvidia, f**k you for poor quality patches.