performance tips for computations on an ESP32 #7459
Posted at 2021-11-08 by @fanoush
For the MDBT42Q code, you may try to put "compiled" in the methods that do the math and see if it helps. Then you know how much this would help the ESP32 too, but this feature is not there (?) for the ESP32. Maybe it wouldn't be that hard to add, though.
BTW, there is no "integer arithmetic" in JavaScript: everything is done in double-precision floating point, in software. However, the "compiled" code may figure out that you are using only integers and do integer math.
As for the Raspberry Pi Pico, it has hand-optimized floating-point code in ROM.

Posted at 2021-11-08 by Andreas_Rozek
Indeed, I forgot about "compilation" when optimizing my code for the ESP32 (well, as you said, it's not supported there anyway).
While there is indeed no explicit integer arithmetic in JavaScript, there are still tricks to tell JavaScript that integer arithmetic is preferred (these are used by asm.js, for example). And perhaps there is some other hidden (or less documented) "trick" to speed things up, e.g., using single-precision FP rather than double-precision.
The performance of the RasPi Pico is nevertheless impressive (particularly when compared with its price)!

Posted at 2021-11-08 by @fanoush
It would still be interesting to test on the MDBT42Q how much it helps.
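A minimal sketch of the suggestion, based on the directive syntax described on the Espruino Compilation page: the string "compiled" as the first statement of a function asks the Espruino Web IDE to compile that function to native code on upload (on supported boards; per this thread, not the ESP32). `sumOfSquares` is a made-up example. Which JavaScript subset the compiler accepts is limited, so plain `var` declarations and simple loops are used here as an assumption of what is safest.

```javascript
// Hypothetical hot function marked for Espruino's compiler.
// In any other JavaScript engine, "compiled"; is just an unused string
// expression, so this file also runs unchanged in Node.js.
function sumOfSquares(n) {
  "compiled";
  var total = 0; // initialized at declaration, so its type is obvious
  for (var i = 0; i < n; i++) {
    total = total + i * i;
  }
  return total;
}

console.log(sumOfSquares(10)); // 285
```

Timing the same function with and without the directive on the MDBT42Q is what would show how much compilation could be worth on the ESP32.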
This may work in the "compiled" case, but not in the normal JS interpreter (?)
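The asm.js-style hints mentioned above rely on standard JavaScript semantics: bitwise operators convert their operands to 32-bit signed integers, so `x | 0` truncates toward zero and `>> 2` can stand in for a floor division by 4. The sketch only demonstrates those semantics; whether Espruino's interpreter gains anything from them is doubtful, and whether the compiler exploits them is an assumption worth measuring. `toInt` and `averageOfFour` are illustrative names.

```javascript
// `x | 0` converts x to a 32-bit signed integer, truncating toward zero.
function toInt(x) {
  return x | 0;
}

// Average of four integers without a floating-point division:
// `>> 2` divides by 4 and rounds toward negative infinity (like Math.floor).
function averageOfFour(a, b, c, d) {
  return ((a | 0) + (b | 0) + (c | 0) + (d | 0)) >> 2;
}

console.log(toInt(7.9), toInt(-7.9));       // 7 -7
console.log(averageOfFour(10, 11, 12, 14)); // 11
console.log(averageOfFour(-1, -1, -1, -2)); // -2, same as Math.floor(-5 / 4)
```

These tricks are only correct while all intermediate values fit into 32 bits, which is easily the case for 0..255 values on a small LED matrix.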
This would need a different build with -DUSE_FLOATS.

Posted at 2021-11-08 by Andreas_Rozek
I briefly looked through the build instructions for Espruino under macOS, and it looked like too much effort for the limited time I have, so I'll have to stick with double-precision floats.
Using "compiled" failed with the message
which I do not understand right now.

Posted at 2021-11-08 by Andreas_Rozek
By the way: I just added some timing measurements to my code and found out that the MDBT42Q is almost twice as fast as the ESP32! However, a speed-up by a factor of 3 would still be desirable...
Addendum: I also noticed a continuous slow-down while the program was running (regardless of the platform).

Posted at 2021-11-08 by @fanoush
Hard to tell without seeing the code; see "Caveats" and "Doesn't Work" at the bottom of the Compile page.

Posted at 2021-11-08 by Andreas_Rozek
I looked into the mentioned sections but did not find anything applicable. Just to stop being abstract: here is my code for the MDBT42Q
It runs even without an attached Neopixel matrix display, but having one is far more impressive...

Posted at 2021-11-08 by @fanoush
I don't see anything either in the
Maybe you can try to assign values when variables are declared, so it can better guess the type. Also, assigning Math.round and other frequently called methods to variables may speed it up.

Posted at 2021-11-08 by @fanoush
I've put the compiler output here: https://gist.github.com/fanoush/8d7f38ab2cab05676f5aad677ec89203
@gfwilliams maybe you can look into it? There are tons of suspicious
A slightly edited source is here: https://gist.github.com/fanoush/6fe1e02951113293a6959a7e31cfdad9

Posted at 2021-11-08 by Andreas_Rozek
Thank you very much for your effort! I did not imagine that my question could cause so much trouble...

Posted at 2021-11-08 by Andreas_Rozek
Just to let you know: binding

Posted at 2021-11-08 by @fanoush
Well, the code looks simple. Anyway, I just tried the simple example from the Compile page and there are tons of
It looks like that one is caused by one of the for loops not having a first (initialization) section.
EDIT: after removing

Posted at 2021-11-08 by @fanoush
After fixing Temperature I get numbers like
So it is like a 2x speedup? Not that much.

Posted at 2021-11-08 by Andreas_Rozek
Just to let you know:
I'll post my results here as soon as I have some...

Posted at 2021-11-08 by Andreas_Rozek
Well, that speedup would already be sufficient for me; I could live with that refresh rate.

Posted at 2021-11-08 by Andreas_Rozek
Which device did you use for testing? Both my MDBT42Q and my original Espruino run out of memory with compiled code (and minification does not help)... and my Espruino Pico hangs after the code has been transmitted. Espruino does not like me today (perhaps I should not have mentioned the RasPi Pico?)

Posted at 2021-11-08 by @fanoush
Oh, you're right, I used an nRF52840. The final flat-string binary with the compiled code is 7860 bytes long, and the atob("xxx") string is 11 kB, so it may momentarily need 20 kB if loaded into RAM. When uploaded to flash it could be better. It is true that compiled JavaScript is quite bloated. InlineC (https://www.espruino.com/InlineC) would be much smaller (and faster).

Posted at 2021-11-08 by Andreas_Rozek
Breaking news (although I cannot believe it): uploaded to flash on an original Espruino, the compiled code takes approx. 195 ms per frame (no typo!). Let me see if this can be true.

Posted at 2021-11-08 by @fanoush
Nice, if you put

Posted at 2021-11-08 by Andreas_Rozek
Damn... it's not my day... it seemed to work once, but now it no longer does. I'm currently working with the original Espruino:
Something seems to be terribly wrong right now...
Amendment: my code works fine (albeit slowly), even after uploading into flash, as long as I do not compile. As soon as "compiled" is added, the Espruino hangs, no longer connects properly via Web Serial, and I have to reflash the firmware in order to "unbrick" it.
2nd amendment:
3rd amendment: by "compiling" only the actual computations and moving everything else into an uncompiled function, I was able to avoid the hang-ups, and the code still runs fast enough, but

Posted at 2021-11-08 by Andreas_Rozek
Ok, I have to give up for today. My impression is that "compilation" of functions has some important (and perhaps undocumented) limitations I do not know yet.
@gfwilliams should we limit compiled functions to computations only? Can they access local variables in their lexical context? Or should we pass any such variables as invocation parameters? Can compiled functions handle

Posted at 2021-11-09 by Andreas_Rozek
Something is going terribly wrong with "compiled" code: I just restructured my source into several functions which I can individually "compile" or not, and I did not start the program automatically, giving me the chance to simply "reset" the board after a hang-up. Additionally, this allows me to invoke individual functions after flashing (or reset).
But now the board hangs even after calling an uncompiled function, without ever having called a compiled function before! Just for the record: the program runs fine (albeit slowly) without compilation!
Amendment: meanwhile, I found a combination of optimizations (e.g.,
While the performance gains from compilation are impressive, I still can't really recommend going that way, as it seems completely unclear how one can get compiled code running: it may work or it may not, but a recipe cannot be given...
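Andreas's 3rd amendment describes the workaround that avoided the hang-ups: compile only the actual computations and keep everything else in an uncompiled function. A sketch of that structure, assuming (as his own questions to @gfwilliams suggest, but as nobody in the thread confirms) that it is safest to hand data to a compiled function as arguments rather than have it reach into its lexical context. `coolDown` and `frame` are hypothetical names, not taken from the original program.

```javascript
// Compute kernel: receives everything it needs as arguments and touches
// no outer-scope variables. Only this function carries "compiled".
function coolDown(map, length, amount) {
  "compiled";
  for (var i = 0; i < length; i++) {
    var v = map[i] - amount;
    map[i] = v < 0 ? 0 : v; // explicit clamp: a Uint8Array would wrap around
  }
}

// Uncompiled glue: owns the state; on the device it would also push the
// finished frame to the LEDs after the kernel has run.
function frame(state) {
  coolDown(state.map, state.map.length, 2);
  return state.map;
}

const state = { map: new Uint8Array([0, 1, 5, 200]) };
frame(state); // state.map now holds 0, 0, 3, 198
```

Keeping the kernel free of outer references also makes it easy to toggle "compiled" on and off per function, which is how Andreas narrowed the problem down.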

Posted at 2021-11-09 by Andreas_Rozek
This is my final code (for an original Espruino, with a 16x16 Neopixel LED matrix wired to pin B15), just in case you want to play with it:

Posted at 2021-11-09 by @MaBecker
Use fill(0) to initialize the TemperatureMap:
let TemperatureMap = new Uint8ClampedArray(DisplaySize).fill(0);

Posted at 2021-11-09 by @MaBecker
What about writing your own library for this? https://github.com/espruino/Espruino/blob/master/libs/README.md

Posted at 2021-11-09 by Andreas_Rozek
Indeed, I should do so next time, but it won't make any difference with respect to performance and stability.

Posted at 2021-11-09 by Andreas_Rozek
Well, I chose Espruino in order to avoid having to code in C/C++!

Posted at 2021-11-09 by @MaBecker
So check out this way to compile the code: http://forum.espruino.com/comments/14787576/

Posted at 2021-11-09 by @fanoush
It is reasonable to expect such JavaScript to work with the compiler; that is precisely what it is there for. So it looks like there are some bugs. So only that
As for speed, I would avoid accessing global variables so much: for anything inside a for loop, I would cache the global reference in a local variable. But I am not sure; I will try some tests. The compiler is pretty simple, there are basically no optimizations. Still, it should also work as is.
BTW, you cannot reuse the output of the compiler between nRF52 and STM32, but I guess you did not do that (i.e., pull saved code from MDBT42Q storage and upload it to the original Espruino).

Posted at 2021-11-09 by @fanoush
It is tricky to access JavaScript globals from inside InlineC and call methods on them.

Posted at 2021-11-09 by Andreas_Rozek
I always send my code directly to the attached device, so I never touch (nor even see) the compiled code. And, indeed, trying to compile
This sounds as if something gets overwritten during upload.

Posted at 2021-11-09 by Andreas_Rozek
In the meantime, I got
I tried the same trick with
Amendment: I tried over two dozen variants for
My current impression is that success is nothing but good luck and depends on side effects of compilation (such as overwriting memory areas because of invalid length calculations or similar).
Amendment #2: wow, indeed, success is pure luck! After all my experiments I tried to upload my working example again, and it failed because I had also changed
Which means: simply adding
Hopefully, this information may help Gordon et al. find the reason for this weird compiler bug!
Technical details:

Posted at 2021-11-09 by @fanoush
For me it does not crash; however, I am turning the Neopixel output off, as I don't have one. With your code I started with
and ended with
Caching globals in local variables gives a speedup, and the native code produced is also smaller.
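A before/after sketch of the change described here: every outer-scope reference (the array, its size, and `Math.round`) is read once into a local variable, initialized at declaration, instead of being looked up on every loop iteration. `DisplaySize` and `TemperatureMap` echo names from the thread, but the smoothing formula and both function names are made up for illustration; the actual gain on Espruino has to be measured on the device.

```javascript
const DisplaySize = 16 * 16;
const TemperatureMap = new Uint8ClampedArray(DisplaySize);

// Looks up TemperatureMap, DisplaySize and Math.round on every iteration.
function smoothSlow() {
  for (let i = 1; i < DisplaySize - 1; i++) {
    TemperatureMap[i] = Math.round(
      (TemperatureMap[i - 1] + TemperatureMap[i] + TemperatureMap[i + 1]) / 3
    );
  }
}

// Same computation, but each outer reference is copied into a local once.
function smoothFast() {
  const map = TemperatureMap;
  const size = DisplaySize;
  const round = Math.round; // Math.round does not depend on `this`
  for (let i = 1; i < size - 1; i++) {
    map[i] = round((map[i - 1] + map[i] + map[i + 1]) / 3);
  }
}
```

Both functions update the array in place in the same order, so they produce identical results; only the number of lookups per iteration differs.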

Posted at 2021-11-09 by Andreas_Rozek
Which device are you using? Your code still crashes for me, even without the Neopixel-specific calls (which you may, and perhaps should, leave enabled, as they do not expect any feedback but effectively just send pulses to pin B15).
Amendment: moving outer variables into the three main functions (as you suggested) already hangs my code, too.
Personally, I'd recommend "diffing" the code generated for the working and the non-working versions of my source for the original Espruino and looking at the changes, as something really weird must be going on...

Posted at 2021-11-09 by @fanoush
It is an nRF52840 dongle; I enabled Neopixel in the build and in the source, and it still works
output

Posted at 2021-11-09 by Andreas_Rozek
...but the original Espruino is an STM32F103RCT6, so the compiler bug may be controller-specific.

Posted at 2021-11-09 by @fanoush
I tried that too: I changed the top of https://github.com/gfwilliams/EspruinoCompiler/blob/master/src/utils.js to compile for Cortex-M3 for my board, and the result looks the same. I can try on an nRF52832 with 64K RAM, but I don't have an original Espruino board (with 48K RAM?) or any other Cortex-M3 board with Espruino, and I also don't have Neopixels.

Posted at 2021-11-09 by Andreas_Rozek
I just wanted to try on my MDBT42Q, but I'm afraid I burned it in the hurry... so I cannot confirm your observations right now.

Posted at 2021-11-09 by @fanoush
BTW, with cached globals like that, even the non-compiled version is about 2x faster than before; I just changed the "compiled" strings and

Posted at 2021-11-09 by Andreas_Rozek
This is indeed an interesting outcome (although I had to learn that variables declared outside of functions are called "globals" here). Thank you very much for your effort!

Posted at 2021-11-09 by @fanoush
Sorry, that's probably wrong; I mean outer scope, i.e., outside of the native method. In this case it is the global scope.
https://www.espruino.com/Compilation#performance-notes
EDIT: there is also a generic performance page, and it is mentioned there too.

Posted at 2021-11-09 by Andreas_Rozek
Don't worry, perhaps I've just been too narrow-minded here...

Posted at 2021-11-10 by @gfwilliams
Hi, sorry for the slow replies here. I've had my head down trying to get all the new Bangles shipped, and I likely won't have much time during the next week or two until they're all out. When things calm down I'll try and look into the compiler bugs, though. It is definitely only a subset of JS that is compiled, but it should be pretty easy to extend it to sort out some of the issues you are seeing.

Posted at 2021-11-10 by Andreas_Rozek
Hello Gordon,
I assumed something like that and did not expect you to answer too soon. Just take the time you need to ship all the Bangles (I'm waiting for two of them myself), then take a breath or two, so that your family does not forget what you look like, and then, perhaps, look at the bugs we found.
With greetings from Germany,
Andreas Rozek

Posted at 2022-05-23 by maze1980
Just looked at the code, and I think there are still options for improvement: all the clamping and rounding for TemperatureMap values can be omitted:
One day I'll test this code; I think the effect will be nice.

Posted at 2022-05-24 by Andreas_Rozek
Well, that's indeed an interesting finding, but I doubt that it would have a significant effect on performance:
outputs
Anyway, thanks for your effort!

Posted at 2022-05-24 by maze1980
Well, optimizing for speed is often bad for readability. I wouldn't add "+0.5" to the code just to save the time needed for this extra operation.
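maze1980's point rests on how typed arrays store values. In standard JavaScript (the ECMAScript ToUint8Clamp conversion), assigning a number to a Uint8ClampedArray element clamps it to 0..255 and rounds to the nearest integer, with exact .5 ties going to the even neighbour, so an explicit Math.min/Math.max/Math.round before the store changes the result only on ties. The sketch assumes Espruino follows the standard here, which is worth checking on the device. The same standard also zero-fills every newly constructed typed array, so `.fill(0)` mainly documents intent.

```javascript
const map = new Uint8ClampedArray(5);
const startsZeroed = map.every((v) => v === 0); // true in standard JS

map[0] = -12.7; // below range: clamped to 0
map[1] = 300;   // above range: clamped to 255
map[2] = 2.5;   // tie: rounds to even, 2 (Math.round(2.5) gives 3)
map[3] = 3.5;   // tie: rounds to even, 4 (same as Math.round(3.5))
map[4] = 7.6;   // no tie: nearest integer, 8

// The explicit version that the clamped store makes (almost) redundant.
function clampRound(x) {
  return Math.min(255, Math.max(0, Math.round(x)));
}

console.log(Array.from(map)); // [ 0, 255, 2, 4, 8 ]
console.log(clampRound(2.5)); // 3
```

For a visual effect like thermal diffusion, the one-step difference on exact ties is unlikely to matter, while dropping `clampRound` saves three function calls per stored value.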
Since prepareDisplay is one of the slowest functions: ColorMap has only 8 different values and Temperature has 208 different values, so it might be feasible to pre-compute things and avoid all division and multiplication operations with Temperature in this function, if there's enough RAM available on the ESP32. Anyway, that's much harder than just removing rounding and clamping for TemperatureMap values. Removing 262 function calls will result in a measurable improvement, even for the compiled version.
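A sketch of the pre-computation maze1980 suggests, trading RAM for arithmetic. The thread does not preserve prepareDisplay, so the formula `round(ColorMap[c] * t / MaxTemperature)` and the ColorMap contents below are assumptions; only the sizes (8 colour values, 208 temperature values) come from the post. With one byte per entry, the table takes 8 * 208 = 1664 bytes.

```javascript
// Hypothetical inputs: 8 colour intensities and 208 temperature levels.
const ColorMap = new Uint8Array([0, 32, 64, 96, 128, 160, 192, 255]);
const TemperatureLevels = 208;
const MaxTemperature = TemperatureLevels - 1;

// Built once at start-up: one byte per (colour, temperature) pair.
const ScaledColor = new Uint8Array(ColorMap.length * TemperatureLevels);
for (let c = 0; c < ColorMap.length; c++) {
  const rowStart = c * TemperatureLevels;
  for (let t = 0; t < TemperatureLevels; t++) {
    ScaledColor[rowStart + t] = Math.round((ColorMap[c] * t) / MaxTemperature);
  }
}

// Per frame: one index calculation and one table read replace the
// division and the Math.round call.
function scaledColor(c, t) {
  return ScaledColor[c * TemperatureLevels + t];
}

console.log(ScaledColor.length);  // 1664
console.log(scaledColor(7, 207)); // 255
```

If RAM is tight (as it was on the original Espruino in this thread), the table could be limited to the colour entries actually used.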
Posted at 2021-11-08 by Andreas_Rozek
Hello,
Are there any tips on how I may speed up numeric computations (particularly basic integer arithmetic) on an ESP32 running Espruino?
My current project basically mimics a cellular automaton modeling something like thermal diffusion and visualizes the result on a Neopixel matrix display. The numerics are limited to basic integer arithmetic but involve a lot of Math.round/min/max calls. Colors and brightnesses are then looked up in tables built using Uint8ClampedArray.
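For readers who want something concrete to profile, here is a minimal sketch of the kind of update step described above, assuming a 16x16 grid and a simple average of each cell with its orthogonal neighbours (the actual rules of the program are not shown in the thread). All state lives in Uint8ClampedArray buffers, and the clamped store does the rounding instead of per-cell Math.round/min/max calls; all names are hypothetical.

```javascript
// Hypothetical 16x16 diffusion step: each cell becomes the average of itself
// and its existing orthogonal neighbours (fewer at the edges and corners).
const W = 16;
const H = 16;

function diffuse(src, dst) {
  for (let y = 0; y < H; y++) {
    for (let x = 0; x < W; x++) {
      const i = y * W + x;
      let sum = src[i];
      let count = 1;
      if (x > 0)     { sum += src[i - 1]; count++; }
      if (x < W - 1) { sum += src[i + 1]; count++; }
      if (y > 0)     { sum += src[i - W]; count++; }
      if (y < H - 1) { sum += src[i + W]; count++; }
      dst[i] = sum / count; // the clamped store rounds (ties to even) and clamps
    }
  }
}

let current = new Uint8ClampedArray(W * H);
let next = new Uint8ClampedArray(W * H);
current[8 * W + 8] = 250; // a single hot spot

diffuse(current, next);
[current, next] = [next, current]; // swap buffers instead of allocating new ones
```

After one step the hot spot and its four neighbours each hold 50. Swapping two preallocated buffers avoids creating garbage every frame, which may also be relevant to the gradual slow-down reported later in the thread.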
The program runs, but turns out to be surprisingly slow, especially when compared with a similar implementation for a Raspberry Pi Pico running Kaluma (and the latter has not yet been optimized in any way!). By the way: running the same code on an Espruino MDBT42Q gives performance comparable to the ESP32 (which surprised me, as I expected the ESP32 to run faster).
Are there any tips what I could do to speed the program up?
(if my description sounds too abstract, I could also share my code, approx. 120 lines)
Thanks in advance for any help!
With greetings from Germany,
Andreas Rozek