Parsing a string with inline C

Posted on
  • In a piece of code I am working on I am trying to count the number of newline characters '\n' (i.e. the number of lines) in a string read from storage via something like the following:

    var c = E.compiledC(`
    // int sum(int, int)
    int sum(int len, unsigned char *data){
      int s = 0;
      while (len--)
        if (*(data++)=='\n') s++;
      return s;
    }
    `);
    
    function readFile(fn) {
      var fb = require("Storage").read(fn);
      var nlines = c.sum(fb.length, E.getAddressOf(E.toString(fb),true));
    //  for (var i=0; i<fb.length; ++i) if (fb[i]=="\n") nlines++;
      return nlines;
    }
    

    Unfortunately my understanding of how the string returned from the require("Storage").read() call is stored in memory is clearly flawed since I am not at all getting the results I am expecting.
    I am reading in a text file that I uploaded to storage via the web ide; the string is read in correctly, as confirmed by outputting it via console.log or using any of the other string functions on it. However, the internal representation of the chunk of memory is clearly not simple ASCII.
    Or is the memory returned by read() simply not contiguous?
    Thanks for any advise,
    Marko

  • Well, the string in fb is actually just a special variable type that references external flash memory. The flash memory isn't memory-mapped by the hardware so it's magically all handled in software.

    However E.toString should fix that by converting it to a big flat string - assuming there's enough RAM.

    To try and debug this, I'd:

    • output the value returned by E.getAddressOf
    • maybe use trace(fb) and so on to see what the variable really is

    But as far as I can see what you have there looks good

  • Thanks, Gordon.
    I did check the address returned by E.getAddressOf, it seems that, as you write, E.toString moves it into RAM, but it still does not seem contiguous (by examining it via peek8).
    At any rate, I just tried to speed up the line counting, but I will stick to doing it in JS since I do not want to incur the cost of copying the whole file to RAM.
    Thanks again for your quick reply. The Espruino ecosystem is an impressive piece of software!
    Marko

  • I can't help you with the inline C but if your string is null terminated you do not need the len variable at all. Just use

    while (*data)
    

    and the if statement could be avoided by writing

    s = s + (*(data++)=='\n');
    

    This exploits that the boolean expression returns 0 or 1.
    Whether this is faster is to be benchmarked as it will depend on compiler and how the code is actually compiled.

  • Thanks. The C indeed wasn't optimized, but since I don't want to copy the whole file into RAM I probably won't pursue this path going forward anyways...

  • @user113695 it is contiguous. I tested it myself. I think your mistake is coming from this part:

    var nlines = c.sum(fb.length, E.getAddressOf(E.toString(fb),true));
    

    Try this instead:

    var flatstring = E.toString(fb);
    var nlines = c.sum(fb.length, E.getAddressOf(flatstring,true));
    

    Not exactly sure why your version doesn't work, will look into it.

  • I think its related to how the c.sum line is converted into E.nativeCall , and the way the arguments are processed, they are expecting resolved values, so the arguments are not eval'ed or anything. Thats all.

  • @d3nd3-o0, it all depends what resolved means... For Espruino JS, the arguments are 'resolved enough' - so to speak - because any code (JS interpreter implementation) - knows to handle a String object: 'talk to it' thru the its method, since it is object-oriented and one of the main goals i to hide the implementation. If the C code does the same as, for example the String method implementations of String.charAt(pos) and String.indexOf(needleString,[startPos]) and (under the cover/indirectly E.toString(aString)) do, it would be just fine. After all, the method/property accessor String.length can do it without reading the String into memory where C code similar as suggested could do the job. An indication that String in JS are obviously looked at differently - have a different implementation - than in standard C - otherwise the length would not have to be passed: a reference to an object in an object-oriented context is all it needs to handle the job.

    From what I get out of this 'exercise' is that an implementation for counting line feeds in JS may be almost as fast as doing it in C because the meat is not really in parsing the string but actually accessing the. bytes of the string. (@Gorden) I hope I can assume that something along the lines of

    var str = "xoxo" // as: = require("Storage").read("FileName");
      , cnt = 0, start=-1, pos; // count occurrences of "x"
    while ((pos = str.indexOf("x",start+1)) > start) { cnt++; start = pos; }
    

    does / could (?) work without String.indexOf() having to read the whole string in memory. On the other hand I know that streaming implementations are demanding and it may well be that it is not worth - or possible - to spend so much code and temp memory for it since the majority of string that are processed aren't that long that they would need a 'streaming' treatment.

    The statement 'almost equally as fast as C' ignores the parsing and interpreting of the JS source which Is though significant and the efficiency depends on how String.indexOf() with start pos is implemented for a String 'living' in the storage.

  • If you don't want to load the string into RAM then you're pretty limited in what you can do with inline code, because you're reliant on Espruino to do the loading for you.

    @allObjects suggestion of str.indexOf is great and I think would be a pretty good way to count newlines. A marginally faster version might be:

    var n=0,i=0;while(i=s.indexOf("\n",i)+1)n++;
    

    Another option is to use regex:

    var s = "hello\nthere\nworld\nlots\nof\nnewlines";
    s.match(/\n/g).length
    

    That'll be fast however it's going to allocate memory for each newline, so I wouldn't recommend if if you've got a file with more than a few hundred newlines.

  • In theory you could avoid some of the memory overhead of the array returned by match() by instead using replace to generate a new string that contains only the newlines:

    s.replace(/[^\n]/g, "").length
    

    Unfortunately, there seems to be a bug in replace() that prevents this from working perfectly.

  • @Gordon, nice play with the +1 on the -1 when not found for ending the loop.

    Since there is obviously a need for operating on potentially 'large' strings in storage, I could see a new method, like require("Storage").do(fn,fnct,optStartPos,optEndPosOrLength), where function can be JS function - plain or compiled - or inline C function. The function accepts byte and - when second argument provided -the byte's position. Function is called one more time with undefined when on 'end of file' with byte position equal to file/string length. The function is also called just once when out of bound with undefined and -1 as byte position. The function returning something else than undefined breaks 'the loop' and .do() returns that value.

    .do() is kind of modeled after forEach / map / reduce / filter. For something similar to reduced, the 'accumulator' is though still missing in the function's argument list as is the decision whether .do() 's optEndPos or optLength should be used. .do(...)needs some more thoughts... ;-)

  • By the way, those regex bugs are now fixed (in cutting edge builds, of 2v07 when released)

    I could see a new method, like require("Storage").do

    Because of the way Espruino works with Storage, data in storage is actually memory-mapped. If you do a read or readArrayBuffer then you're actually accessing the data directly from flash.

    As a result, you can actually do what you want right now using forEach/map/reduce/etc without any extra functions:

    var s = require("Storage");
    // for every element
    new Uint8Array(s.readArrayBuffer("z")).forEach(print)
    // count newlines
    new Uint8Array(s.readArrayBuffer("z")).reduce((a,b)=>a+(b==10),0)
    

    If you want to iterate over just part of the data, you can use the Uint8Array constructor to choose an offset and length :)

  • ...with s the Storage module and "z" the file in the storage... excellent.

  • Yes - sorry, forgot to add that...

  • Post a reply
    • Bold
    • Italics
    • Link
    • Image
    • List
    • Quote
    • code
    • Preview
About

Parsing a string with inline C

Posted by Avatar for user113695 @user113695

Actions