Espruino Wifi Socket recovery

Posted on
  • Hi,

    Using an Espruino Wifi with the 'EspruinoWifi' module running Firmware 1.99.

    The question may expand into other issues but directly my question is around an un-recoverable socket disconnection. I am using the 'Net' module and connecting to a simple Tcp Server. All works well until the socket appears to be closed (either from native code or the module itself but NOT from my code)

    This might be a transient socket disconnection for whatever reason but the main problem is that I can't reconnect. Subscribing to the socket.on('close') event works and I have tried a simple reconnect via a timeout function. This function runs and the callback passes an empty 'err' object and a complete socket as it does first time round but no data is received and the 'close' event is triggered again instantly.

    My code looks like this

    function connect() {
      console.log("Attempting connection");
      client.connect(MY_OPTIONS, function(socket, err) {
        if (err) {
          console.log("Its all gone bad and here's why: "+err);
          return;
        }
        console.log("Connected to My Server ");
    
        //my socket.on('data', .... function lives here
    
        socket.on('close', function() {
          console.log("socket closed");
          setTimeout(function() {
            connect();
          }, 5000);
        });
      });
    }
    

    In summary, it works great until the socket closed event gets called. The 2nd connect runs with no error and return a socket object but no data is received and the close event fires immediately.

    Any thoughts on how to recover from the socket disconnection or maybe I can do something to prevent it. Thanks

  • I'm not the socket specialist, but I see some discrepancy between the code pattern as posted and as found on Espruino internet sockets documentation page.

    From your code I read that it is a client... and the events are on the client and not the socket directly, and the client is the result of a .connect() applied on the retrieved module object. To me it also matters what the server is doing... Always - unconditionally - going into a reconnect may not be the final solution. It may work to talk directly to the socket, but it may as well interfere how Espruino socket client handles the life cycle of the socket. I like a higher level interface such as the client, but the client may be to simple because it hides complexity / flexible options. Anyway, take a look at the client.

  • Thanks @allObjects. I have rewritten the socket code many times to try and make this work but the result is always similar. I have refactored things now to be more inline with the guide (although the API reference shows a similar pattern to the one above). Here is the exact code of my running application

    function connectNmea() {
      var client = require("net").connect(NMEA_OPTIONS, function(socket, err) {
        if (err) {
          console.log("Connection error: "+err);
          setTimeout(connectNmea, 5000);
          return;
        }
        client.on('end', function() {
          console.log("Nmea socket closed");
          EventBus.clear('gpsData');
          setTimeout(connectNmea, 5000);
        });
        EventBus.subscribe('gpsData', function(data) {
          client.write(data);
        });
      });
    }
    

    The EventBus is a simple implementation that queues messages in a simple pub / sub pattern. The EventBus.subscribe('gpsData',... code is simply pulling data out of a queue from the Serial Port. The server is a separate application that I cannot alter but I have stress tested it and it will handle hundreds of concurrent connections without difficulty.

    Debugging the state of the client when the 'end' event is called it looks like

    =Socket: { "type": 0,
      "#onconnect": function (socket,err) { ... },
      "opt": {
        "host": "192.168.1.12",
        "port": 10110 },
      "conn": false,
      "#onend": function (data) { ... },
      "dSnd": "",
      "cls": true, "endd": true }
    

    and then 5 seconds later in the reconnection it appears to be properly initialized

    =Socket: { "type": 0,
      "#onconnect": function (socket,err) { ... },
      "opt": {
        "host": "192.168.1.12",
        "port": 10110 },
      "sckt": 2, "conn": true }
    

    but will immediately be closed again and won't sent data. Ignoring the 'end' and leaving the code still running will report (obviously!)"function called from system Uncaught Error: This socket is closed"

    I agree that blind socket reconnects don't always make sense but the link is still up the server is still running and the code doesn't call for a disconnect. Happy to do any amount of debugging but very keen to make this work. Thanks

  • Push out the 5 secs - make it say 30 or 60 seconds.

    I would guess that you are running out of sockets. You could tell by, on connect, print out the socket number. These should get reused - but I think you are re- connecting so quickly that they don't get a chance to get reused.

  • Hi,

    I will extend the timeout (I am fairly sure I have tried this already). I have also checked the socket number and it's the same on both occasions (when initially opening the connection and subsequently on the reconnect). I would have thought the library would have returned an error in the event of being starved of socket resources ?

    My feeling is that the socket 'end' event is erroneous in some way;

    My next attempt will be to build a small desktop 'proxy' so that I have more control over the other end of the link. I am also trying to capture some Wireshark logs but they are never easy to debug when the event is intermittent and hard to force.

    Thanks for your thoughts

  • Hi,

    I should have consulted my notes from a year ago, this problem is linked with an issues I raised here previously (http://forum.espruino.com/conversations/­304601/#comment13692804 ) about sockets being closed erroneously during a 'send'. I did some work on this and had comms with @Gordon but then had to move on to other projects.

    The problem will occur very quickly if you open one socket and start sending data in a loop whilst you receive data on another socket.

    The EspruinoWifi code for sending via the AT commands of the ESP8266 has a final else statement that catches anything other than "OK", "SEND OK" , "Recv" or "Busy s..." and in that event nulls that socket.

    if (d=="OK") {
            at.register('> ', function() {
              at.unregister('> ');
              at.write(data);
              return "";
            });
            return cb;
          } else if (d=="Recv "+data.length+" bytes" || d=="busy s...") {
            // all good, we expect this
            // Not sure why we get "busy s..." in this case (2 sends one after the other) but it all seems ok.
            return cb;
          } else if (d=="SEND OK") {
            // we're ready for more data now
            if (socks[sckt]=="WaitClose") netCallbacks.close(sckt);
            socks[sckt]=true;
          } else {  // <- this clause here shouldn't fire but it does !!
            socks[sckt]=undefined; // uh-oh. Error.
            at.unregister('> ');
          }
    

    It all looks OK as the IPDHandler should deal with anything else BUT the reality is that 'd'(the data provided in the at.cmd callback) often contains incoming data meant for the other socket. It looks like data is being 'lost' within the Espruino stack rather than being lost packets from the ESP8266.

    In my case the incoming and outgoing data will use checksums so some corruption is OK and to simply keep the link up and running I have amended the code as follows

    } else {
              console.log("Send error, received :" + d);
              if (d){
                var ok = d.indexOf("SEND OK")
                if (ok !== -1) {
                  //ipdHandler(d.substring(0, ok));
                  if (socks[sckt]=="WaitClose") netCallbacks.close(sckt);
                  socks[sckt]=true;
                  return;
                }     
                if(d.indexOf("> ") !== -1){
                  at.unregister('> ');
                }
              }
    
              return cb;
              //socks[sckt]=undefined; // uh-oh. Error.
              //at.unregister('> ');
            }
    

    This is a pretty ugly hack but it proves the issue and I have had the link running all night when it previously wouldn't last a few minutes. I tried passing the data back to the ipdHandler (commented line above) but it doesn't 'repair' the incoming data.

    One important observation is that the erroneous buffer always contains the 'end' of an incoming 'line' meant for the other socket. It looks very much like we are missing the start of the IPD sentence coming from the ESP2866. I will try and debug at the AT module level but hopefully it gives @gordon something to go on.

  • Hi, I've run out of time for today, but I'll take a look at this tomorrow.

  • Thanks for all your work looking into this - that helps a great deal!

    I just came up with a test for this where the WiFi opens two sockets to a server - one of them blasts out data, and then the WiFi tries to write to the other. I can get it to fail within a few seconds, and I'm working on a fix now.

  • Hi,

    I have spent more time on it and there is definitely some issues around the parsing inside the ser.on("data" function in the AT module. You will no doubt come up with a better fix than me, my brief attempts felt kludgy at best.

    It does feel like the parsing is slightly fragile although the constraints of the environment and the language don't make it easy. Thanks for looking at it, I think it might be the cause of a few issues for others at times so well worthwhile.

    Thanks, Steve

  • Ok, should now be fixed on cutting edge builds.

    It seems there were two issues:

    • If you got two separate data packets in a single callback it only handled the first and then passed the second on as a normal unhandled line
    • The socket write function waited for "> ", but then after it got it, it accidentally cleared the entire contents of buffer after that (which could have contained the start of +IPD)

    edit: There is still some strangeness (which has been a longstanding thing with Espruino) where if you write a bunch of stuff to USB and you write a lot of stuff to Serial, occasionally you end up with a duplicated character. It means that if you turn on wifi.debug you can still hit reliability issues even though it'd be rock solid without it :(

  • OK, sounds great.

    Haven't loaded a cutting edge build, exactly which firmware is used on the Espruino Wifi ??

    Thanks

    Steve

  • Hi,

    Looking at this fix (haven't tested it yet) it looks very simple but doesn't feel quite right. Are you sure it will handle the more standard situation with the "> " arriving in the next data after ("OK").

    Maybe there is something in the underlying native code that needs updating as well ? Do I need a new firmware to go along with the changes in AT.js and EspruinoWifi.js ?

    Thanks, Steve

  • You'd normally get the builds from http://www.espruino.com/binaries/travis/­master/, and it's the _wifi one. Looking at it, it seems like the build VM on travis actually failed for the WiFi this time (it happens sometimes even when there is no software fault). I'll push some more changes in a few minutes which will kick off the build again and it should work fine.

    You shouldn't need a new firmware... I think the handling is ok - it does AT+CIPSEND then gets OK. It returns cb - the same function - so gets called back with > then it writes data and hopefully gets Recv ... and then SEND OK at which point it returns undefined which lets it carry on with other stuff

  • Hi @Gordon,

    Sorry that this is dragging on, for me adding the new files (AT.js and EspruinoWifi.js) results in not being able to do simple sends. No stack trace obviously but it looks like the AT+CIPSEND gets to the "OK" clause in EspruinoWifi.js and the callback is returned and assigned to the lineCallback in AT.js.

    At this stage (after the lineCallback and before if (handled&&dataCount) return cb("");) the variable line equates to "OK\r\n> ". This is then substring'd with line = line.substr(i+2); to "> " and the final statement of the loop i = line.indexOf("\r\n"); will return -1 and the loop exits.

    I think this leaves the ESP8266 module waiting for data and therefore not sending anymore and we have effectively discarded the incoming "> ". If I add a check in the end of the loop as follows then it works but obviously this has no place in the AT module

    if (line=="> " && lineCallback){
       //debugger;
       lineCallback(line);
    }
    

    I am trying to work out what I have missed but the files are straight cut and paste from your recent commits. Thanks for your ongoing help, this module is in a very important part of a project I am working on and we are very keen to get this working.

    Steve

  • Hi Steve,

    Ok sorry about this - I assumed that the ESP8266 was actually just sending "> \r\n" - its seems that in fact that only happens when it's then trying to return some +IPD data - so if you try this when you're pumping data into it as I was it'll work great. If you run it normally it fails.

    Give me a half hour or so and I'll get a new build.

    On new builds, AT and EspruinoWiFi libs are actually built in, so you shouldn't have to worry about updating them. A new firmware should have them included.

  • If you try a build now, it should be fixed.

  • Awesome ! The simple fixes are the best and I really appreciate your support.

    Not wanting to to drag this out but I am sure you want to make this 'networking' as robust as possible. I see at least one 'edge' case that would result in a tight infinite loop right now (which I assume locks pretty hard on this platform). It's an easy fix I think and I am happy to do it my end but probably best you maintain your own code base.

    The handlers loop in the ser.on("data" function is now a while loop which checks against the string returned from the handler. There could be cases (like the current IPDHandler in the wifi code if it's not complete) where the same line is returned back to the function which would loop indefinitely I think.

    Probably just a simple precondition vs result check with a break would solve it and if the handler didn't consume anything then it must be OK to continue ?

    Thanks again, Steve

  • Thanks - great spot! You could still ctrl-c out of that with Espruino, but it's definitely not good news!

    Just fixed :)

  • @Gordon, same fix as me, thanks for the continued support. This has been a massive step forward and the board has been running all night with one socket broadcasting UDP data from the serial port and another one receiving an incoming TCP stream. I think this will help a number of other libraries (MQTT etc) to appear way more stable. Nice work !

    The only other small one I had to catch was quite subtle so I will report it. Occasionally the AT module only receives single bytes at a time and at the end of one of these periods there is sometimes a character missing. A typical sequence looks like this

    "\r\nRecv 31 bytes\r\n\r\nSEND OK\r\n"
    "\r"
    "\n"
    "O"
    "K"
    "\r"
    "\n"
    ">"
    " "
    "\r\nRecv 52 bytes\r\n\r\nSEND OK\r\n"

    and everything is fine but sometimes the last \n is missing and the sequence looks like

    "\r\nRecv 31 bytes\r\n\r\nSEND OK\r\n"
    "\r"
    "\n"
    "O"
    "K"
    "\r" <-- missing \n
    "OK\r\n> "

    The logic of the parser then breaks as everything is split on \r\n. My (hacky) solution was to check for the situation within the AT module as follows, but it seems to be hiding an underlying issue.

    ser.on("data", function cb(d) {
        //console.log(JSON.stringify(d));
        if (line=='\r' && d.charAt(0) != '\n') line = "\r\n";
    

    I have seen other situations with a UART that will miss just the last character when receiving data. I have tried running the Wifi serial at everything from 115200 all the way through to 921600 with a similar result.

    Posting more for your information than anything else as I can fix it in my implementation.

    Regards,

    Steve

  • Hi Steve, thanks! Any chance you could put a USB-TTL UART in there on the RX line so we can see if that missing \n is caused by a real Espruino Serial issue, or an ESP8266 bug? How often does it happen?

    Annoyingly I had another bug recently reported because the AT module used just to split on \r and not \r\n - it got changed maybe 2 weeks ago. Actual issue is here: https://github.com/espruino/Espruino/iss­ues/1512

    Unfortunately I think your fix may well fail in the case that you get socket data that contains just a \r?

    In the current builds, presumably that error just causes a socket timeout which resets your connection? It's not then a big deal to trap and reconnect - which you'd have to do because WiFi could be flaky?

    Best solution could be to revert that AT command lib change and to just tweak the offending library to use a 'handler' rather than a 'linehandler'

  • Hi, I will dig out a USB-TTL converter and try my other UART code but I am I right in thinking that there is no exposed Rx line to the ESP8266 on an Espruino Wifi board ?

    The code wasn't really a proposed fix, I think it helps explain and debug the problem more than anything. I wouldn't want you to bounce around with the AT handler, this is a genuine exception created elsewhere. For sure I can catch the error and close / reopen the socket and it's not stopping me making progress.

    One interesting observation is that I have written a small conditional debug log for when this happens and in several hours of running I have had around 100 of these situations and every one of them has happened during the OK response to an AT+CIPSEND. Don't spend too long on it !

    Regards, Steve

  • am I right in thinking that there is no exposed Rx line to the ESP8266 on an Espruino Wifi board ?

    There isn't but the pin spacing on the ESP8266 isn't too bad - in fact there are little holes in the pads IIRC, so if you flip the board over you can just poke a bit of solid core wire in there.

    One interesting observation is that I have written a small conditional debug log for when this happens and in several hours of running I have had around 100 of these situations and every one of them has happened during the OK response to an AT+CIPSEND.

    Wow, that's a lot. Something I'd definitely want to get fixed.

    The fact it always happens in the same place makes me think it could be ESP8266 - in which case reverting the changes to the AT handler really might be the best bet

  • Post a reply
    • Bold
    • Italics
    • Link
    • Image
    • List
    • Quote
    • code
    • Preview
About

Espruino Wifi Socket recovery

Posted by Avatar for SteveHayles @SteveHayles

Actions