I’ve been working a lot with Bluetooth, and my library has recently been giving me occasional errors. When calling connect() on an L2CAP socket (the only kind of socket available for Bluetooth Low Energy; it is packet oriented with retries), I occasionally and at random get ENOSYS.
That means “Function not implemented” or “System call not implemented”. Very strange. Even stranger, the error was not returned by connect() itself but by the later getsockopt() call. The connect is asynchronous, so connect() returns straight away with either EINPROGRESS (if all is well) or some other error. Connection errors (e.g. ETIMEDOUT, ECONNREFUSED, or if you’re unlucky EIO or ENOMEM) are then collected when you come back later and pick them up with getsockopt().
My first thought was that my program had bugs (pointer-related memory errors) and I was somehow corrupting my system calls. strace revealed that my system calls were exactly as expected and identical from one run to the next, so it (maybe) wasn’t my fault.
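For anyone unfamiliar with the pattern, here is a minimal sketch of an asynchronous connect followed by picking up the deferred error. It uses a plain TCP socket on localhost as a stand-in, since an L2CAP socket needs real hardware; the names finish_connect and demo_loopback are mine and purely illustrative.

```python
import errno
import select
import socket


def finish_connect(sock, address, timeout=5.0):
    """Start a non-blocking connect and return the deferred errno (0 = success)."""
    sock.setblocking(False)
    rc = sock.connect_ex(address)  # returns an errno instead of raising
    if rc not in (0, errno.EINPROGRESS):
        return rc  # the connect failed immediately
    # The socket becomes writable once the attempt has finished, either way.
    _, writable, _ = select.select([], [sock], [], timeout)
    if not writable:
        return errno.ETIMEDOUT  # our own timeout, not one reported by the kernel
    # The real outcome of the connect is parked in SO_ERROR.
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)


def demo_loopback():
    """Connect to a throwaway TCP listener on localhost and return the errno."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
        listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        listener.listen(1)
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as client:
            return finish_connect(client, listener.getsockname())
```

In my case it was the value coming back from that final getsockopt() that was ENOSYS.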
I then came across https://www.raspberrypi.org/forums/viewtopic.php?f=63&t=145144&p=962612, which seemed to imply that this was a problem on the RPi 3, and not entirely unique to me. I was beginning to really suspect that it wasn’t my fault and something else was causing it. I then started (unfairly, as it transpires) cursing the kernel devs responsible in my head, because that should never happen: either the syscall is implemented or it is not. There’s no dynamic behaviour there.
Anyway, it’s clearly a problem with Bluetooth, so it was time to break out btmon, a BlueZ tool which gets copies of all HCI packets, parses them and pretty-prints them. And I was getting this:
HCI Event: LE Meta Event (0x3e) plen 12    [hci0] 461.313621
    LE Read Remote Used Features (0x04)
      Status: Connection Failed to be Established (0x3e)
      Handle: 64
      Features: 0x1f 0x00 0x00 0x00 0x00 0x00 0x00 0x00
        LE Encryption
        Connection Parameter Request Procedure
        Extended Reject Indication
        Slave-initiated Features Exchange
        LE Ping
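To make the layout of that event concrete, here is a rough sketch of decoding it by hand. The raw bytes are my reconstruction from the dump above, not an actual capture, and the function name is made up. The field order (subevent, status, handle, features) follows the HCI event format for this subevent.

```python
import struct

LE_META_EVENT = 0x3E                   # HCI event code shared by all LE events
READ_REMOTE_FEATURES_COMPLETE = 0x04   # LE subevent code
# LE feature bits 0..4, named as btmon prints them
FEATURE_NAMES = [
    "LE Encryption",
    "Connection Parameter Request Procedure",
    "Extended Reject Indication",
    "Slave-initiated Features Exchange",
    "LE Ping",
]


def parse_read_remote_features(packet):
    """Decode an LE Read Remote Used Features Complete event."""
    event_code, plen = struct.unpack_from("<BB", packet, 0)
    if event_code != LE_META_EVENT:
        raise ValueError("not an LE Meta Event")
    params = packet[2:2 + plen]
    # subevent (1 byte), status (1), handle (2, little-endian), features (8)
    subevent, status, handle, features = struct.unpack("<BBHQ", params)
    if subevent != READ_REMOTE_FEATURES_COMPLETE:
        raise ValueError("unexpected LE subevent")
    names = [n for bit, n in enumerate(FEATURE_NAMES) if features & (1 << bit)]
    return {"event": event_code, "status": status,
            "handle": handle, "features": names}


# Bytes reconstructed from the dump above (an assumption, not a capture).
packet = bytes.fromhex("3e0c043e4000" "1f00000000000000")
event = parse_read_remote_features(packet)
print(hex(event["event"]), hex(event["status"]), event["handle"])
# prints: 0x3e 0x3e 64
```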
It had me going a while because you’ll see that the error code (0x3e) is, by irritating coincidence, the same number as the code indicating it’s a BLE-related event. To cut a rather long and winding story short, I eventually ended up digging into the kernel sources to find where Bluetooth errors got translated into system call errors. And I found this:
The rather handily named “bt_to_errno()” function. Now 0x3e was missing from the list. Checking with the Bluetooth 4 spec, we eventually find the list of error codes in Table 1.1 in Volume 2, Part D, Section 1.1. And 0x3e corresponds to “Connection Failed to be Established”. There’s no real explanation, and this code seems to mean “something went wrong”.
As I mentioned, that code was missing from bt_to_errno(). I’m guessing that’s because it’s rare in the wild, and possibly no hardware until recently ever actually returned it. I’m generally in favour of the idea of never writing code to handle a condition you’ve never seen, since it’s awfully hard to test.
And flipping to the end of the function, you can see that if a status code arrives which nothing has been written to handle, the function returns ENOSYS. And you know, that’s kind of sensible. The list of errno values is not very rich, and there isn’t really anything more suitable.
Of course, now that we know this happens and seems to correspond to a sporadic error from the hardware, I think the correct choice is to return EAGAIN, which is more or less “try again: it might work, fail again or fail with a new error”. I’ll see if the kernel Bluetooth people agree.
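The shape of that translation is easy to sketch. What follows is not the kernel’s table: the handful of pairings are assumptions chosen for illustration, and only the ENOSYS fallback is taken from what bt_to_errno() does.

```python
import errno

# Illustrative subset only; these status-to-errno pairings are assumptions,
# not a copy of the kernel's bt_to_errno() table.
_STATUS_TO_ERRNO = {
    0x00: 0,                # Success
    0x05: errno.EACCES,     # Authentication Failure
    0x07: errno.ENOMEM,     # Memory Capacity Exceeded
    0x08: errno.ETIMEDOUT,  # Connection Timeout
}


def bt_to_errno_sketch(status):
    """Translate an HCI status code, falling back to ENOSYS when unmapped."""
    return _STATUS_TO_ERRNO.get(status, errno.ENOSYS)


# 0x3e (Connection Failed to be Established) has no entry, so it falls through.
print(errno.errorcode[bt_to_errno_sketch(0x3E)])
# prints: ENOSYS
```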
Edit: they don’t: EAGAIN and EINPROGRESS are the same error code! Time to figure out a better code.
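In the meantime, a caller can apply the “try again” idea in user space. This is only a sketch of a possible workaround, assuming that ENOSYS from this path really does mean the sporadic 0x3e failure. The attempt callable is hypothetical and stands for whatever performs one connect and returns its errno.

```python
import errno
import time


def connect_with_retry(attempt, retries=3, delay=0.0):
    """Call attempt() until it returns something other than ENOSYS.

    attempt is a hypothetical callable that performs one connect and
    returns an errno value (0 on success).
    """
    result = errno.ENOSYS
    for i in range(retries):
        result = attempt()
        if result != errno.ENOSYS:
            return result  # success, or a genuine error worth reporting
        if i < retries - 1:
            time.sleep(delay)  # short pause before the next try
    return result  # still ENOSYS after every attempt
```

Any other error is passed straight through, so a genuine ETIMEDOUT or ECONNREFUSED is never retried behind the caller’s back.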