
Today’s tutorial is about performance optimization, a topic that concerns all developers. Some of these tips will be obvious — others, not so much. In every case, optimizing code should be viewed from a time vs. benefit perspective. If a certain optimization might yield a 2% performance boost on older devices, but implementing it requires 50 hours of additional coding, then it’s illogical to do so. However, if 10 hours of coding will likely yield noticeable performance improvements across a wide array of devices, then the task is absolutely worthwhile.

For new projects, adopting as many of these performance practices as possible is highly recommended, as they ultimately result in a faster, cleaner app, and thus a better user experience across all devices.

“Time-Critical” Routines

Most of the performance tricks presented in this tutorial pertain primarily to “time-critical” routines: points in your app where a lot is happening or where sluggish performance could adversely affect the user experience. Examples include the gameplay stage of an action game or the wait while a new scene loads. If the user notices skipped frames or must wait longer than is deemed “acceptable,” it reflects poorly on the app.


1. Localize, Localize

No matter how many times this is mentioned, it’s worth emphasizing again. While avoiding global variables and functions isn’t always possible “across the board,” minimal usage is the best practice. Access of local variables and functions is simply faster, especially in time-critical routines.

Non-Local — Discouraged

CCX = display.contentCenterX  --global variable
for i = 1,100 do
   local image = display.newImage( "myImage.png" )
   image.x = CCX
end

Local — Recommended

local CCX = display.contentCenterX  --local variable
for i = 1,100 do
   local image = display.newImage( "myImage.png" )
   image.x = CCX
end

This also applies to core Lua libraries like the math library. In time-critical routines, you should always localize library functions.

Non-Local — Discouraged

local function foo( x )
   for i = 1,100 do
      x = x + math.sin(i)
   end
   return x
end

“External” Local — Recommended

local sin = math.sin  --local reference to math.sin
local function foo(x)
   for i = 1,100 do
      x = x + sin(i)
   end
   return x
end

Lastly, remember that functions should always be localized if possible. Of course, doing so will require proper scoping. If you’re new to Lua, please refer to Understanding Scope for Beginners.

Non-Local — Discouraged

function func1()
   func2( "myValue" )
end

function func2( y )
   print( y )
end

func1()

Local — Recommended

--"func2" properly scoped above "func1" 
local function func2( y )
   print( y )
end

local function func1()
   func2( "myValue" ) 
end

func1()
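
To see how much localizing helps in your own project, you can time both variants. The following is a minimal sketch rather than a rigorous benchmark: the helper names (timeIt, sumGlobal, sumLocal) are hypothetical, and it assumes os.clock() is an adequate timer (in Corona, system.getTimer() could be swapped in). Timings will differ between the simulator and real devices.

```lua
--Hypothetical micro-benchmark; all names here are illustrative only
local clock = os.clock

local function timeIt( fn, iterations )
   local start = clock()
   local result = fn( iterations )
   return clock() - start, result
end

local function sumGlobal( n )
   local x = 0
   for i = 1,n do
      x = x + math.sin( i )  --looks up "math", then "sin", on every pass
   end
   return x
end

local sin = math.sin  --looked up once, outside the loop
local function sumLocal( n )
   local x = 0
   for i = 1,n do
      x = x + sin( i )
   end
   return x
end

local tGlobal, rGlobal = timeIt( sumGlobal, 1000000 )
local tLocal, rLocal = timeIt( sumLocal, 1000000 )
print( "global lookup:", tGlobal, "local reference:", tLocal )
```

Both functions compute the same sum, so any difference in the printed times comes from the lookups alone.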

2. Avoid Functions as Arguments for Other Functions

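In loops or time-critical code, avoid defining a function inline as an argument to another function. An inline definition is evaluated again on every pass through the loop, typically creating a new function value each time, whereas a function defined once as a local outside the loop is simply passed by reference. Examine these two cases: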

Defined as an Argument of Another Function — Discouraged

local func1 = function(a,b,func) 
   return func(a+b) 
end

for i = 1,100 do
   local x = func1( 1, 2, function(a) return a*2 end )
   print( x )
end

Localized — Recommended

local func1 = function( a, b, func )
   return func( a+b )
end
local func2 = function( c )
   return c*2
end

for i = 1,100 do
   local x = func1( 1, 2, func2 )
   print( x )
end

3. Avoid “table.insert()”

Let’s compare four methods that all achieve the same thing: the common act of inserting values into a table. Of the four, the Lua table.insert function is a mediocre performer and should be avoided.

table.insert() — Discouraged

local a = {}
local table_insert = table.insert

for i = 1,100 do
   table_insert( a, i )
end

Loop Index Method — Recommended

local a = {}

for i = 1,100 do
   a[i] = i
end

Table Size Method — Acceptable

local a = {}

for i = 1,100 do
   a[#a+1] = i
end

Counter Method — Recommended

local a = {}
local index = 1

for i = 1,100 do
   a[index] = i
   index = index+1
end
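
If the table already holds values, the counter can be seeded from the table's length once, before the loop, so that the length operator is not evaluated on every pass. This is a small sketch under that assumption; the variable names are hypothetical.

```lua
local a = { "x", "y", "z" }  --table that already has three entries
local count = #a             --measure the length once, outside the loop

for i = 1,100 do
   count = count + 1         --keep the counter in step with the table
   a[count] = i
end

print( #a, a[4], a[count] )
```

The key detail is that the counter is incremented on each pass; without that, every write would land in the same slot.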

4. Minimize use of “unpack()”

The Lua unpack() function is not a great performer either. Fortunately, it’s usually possible to write a simple, faster loop to accomplish the same thing.

Lua “unpack()” method — Discouraged

local a = { 100, 200, 300, 400 }

for i = 1,100 do
   print( unpack(a) )
end

Loop Method — Recommended

local a = { 100, 200, 300, 400 }

for i = 1,100 do
   print( a[1],a[2],a[3],a[4] )
end

The caveat is that you must know the length of the table to retrieve all of its values in the “Loop Method” case. Thus, unpack() still has its uses — in a table of unknown length, for example — but it should still be avoided in time-critical routines.
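
When the table's contents do not change while the loop runs, another option is to call unpack() once, before the time-critical loop, and reuse the resulting locals. This is only a sketch of that idea, not something measured in this tutorial; the first line is an assumption that keeps the snippet working on Lua versions where unpack lives in the table library.

```lua
local unpack = table.unpack or unpack  --Lua 5.2+ moved unpack into "table"
local a = { 100, 200, 300, 400 }

--unpack once, outside the loop
local a1, a2, a3, a4 = unpack( a )

local total = 0
for i = 1,100 do
   total = total + a1 + a2 + a3 + a4  --no function call inside the loop
end
print( total )
```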


5. Cache Table Item Access

Caching table items, especially inside of loops, can boost performance slightly and might be considered in time-critical code.

Non-Cached — Acceptable

for i = 1,100 do
   for n = 1,100 do
      a[n].x = a[n].x + 1
      print( a[n].x )
   end
end

Cached — Recommended

for i = 1,100 do
   for n = 1,100 do
      local y = a[n]
      y.x = y.x + 1
      print( y.x )
   end
end
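
The two snippets above assume that a table named "a" already exists. A self-contained version of the cached form might look like the following sketch, where the table contents are made up for illustration and the print() call is moved out of the loop to keep the output short.

```lua
--build a hypothetical table of 100 items, each with an "x" field
local a = {}
for n = 1,100 do
   a[n] = { x = 0 }
end

for i = 1,100 do
   for n = 1,100 do
      local item = a[n]      --one table lookup per pass
      item.x = item.x + 1    --"item" is reused instead of indexing a[n] again
   end
end

print( a[1].x, a[100].x )    --both are 100
```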

6. Avoid “ipairs()”

When iterating through an array-style table, the overhead of the Lua ipairs() function does not justify its use in time-critical code, especially when you can accomplish the same thing using a basic Lua construct, the numeric for loop.

ipairs() — Discouraged

local t1 = {}
local t2 = {}
local t3 = {}
local t4 = {}
local a = { t1, t2, t3, t4 }

for i,v in ipairs( a ) do
   print( i,v )
end

Lua Construct — Recommended

local t1 = {}
local t2 = {}
local t3 = {}
local t4 = {}
local a = { t1, t2, t3, t4 }

for i = 1,#a do
   print( i,a[i] )
end
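
If the loop body needs both the index and the value, as ipairs() provides, the numeric loop can supply both by caching the length and the current item in locals. The sketch below uses made-up table contents and assumes the table is a plain sequence with no nil "holes", since ipairs() stops at the first nil while the length operator is not guaranteed to.

```lua
local a = { "red", "green", "blue" }
local visited = {}

local n = #a              --length measured once
for i = 1,n do
   local v = a[i]         --same value ipairs() would hand back as "v"
   visited[i] = v
   print( i, v )
end
```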

7. Math Performance Comparisons

Various mathematical functions and processes are faster than others and should be favored.

Avoid “math.fmod()” for Positive Numbers

--math.fmod method (discouraged)
local fmod = math.fmod
for i = 1,100 do
   if ( fmod( i,30 ) < 1 ) then
      local x = 1
   end
end

--modulus operator method (recommended)
for i = 1,100 do
   if ( ( i%30 ) < 1 ) then
      local x = 1
   end
end
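
The “positive numbers” qualifier matters because the two are not interchangeable for negative operands: math.fmod() keeps the sign of the dividend, while the % operator keeps the sign of the divisor. A quick check in plain Lua:

```lua
local fmod = math.fmod

local p, q = 7, -7
print( fmod( p, 3 ), p % 3 )  --1 and 1: identical for positive operands
print( fmod( q, 3 ), q % 3 )  ---1 and 2: the results differ for a negative dividend
```

So swap in the % operator only where the operands are known to be non-negative, or where the different result for negative values is what you want.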

Multiplication is Faster Than Division

x * 0.5 ; x * 0.125  --recommended
x/2 ; x/8            --discouraged

Multiplication is Faster Than Exponentiation

x * x * x  --recommended
x^3        --discouraged

8. Conserve Texture Memory

Texture memory is too often ignored until it reaches “critical mass,” at which point it’s difficult and time-consuming to make the required changes to art assets.

With regard to texture memory, 8-bit and 24-bit PNG images (32-bit with the “alpha” channel) all unpack into 32-bit images. Each is a rectangular array of pixels with effectively 4 color channels per image: red, green, blue, and alpha (RGB+A).

In OpenGL, textures (meaning both single images and image sheets) also obey the Power of 2 (PoT) rule. This means that each dimension of a texture rounds up to the next highest Power of 2 (2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, …) when determining the texture memory it will occupy. Thus, an image sized 320×480 and another sized 280×400 will both consume 512×512 of texture memory. Note that the rounding occurs independently on the horizontal and the vertical, so the effective size is not always “square”: an image sized 920×40 will round up to 1024×64 in required texture memory, not 1024×1024.

While this might seem innocuous at first glance, let’s calculate the actual memory consumption. Not only must the PoT size be considered, but also the 4 color channels. This means that each pixel in the texture array requires 4 bytes of memory, and it adds up faster than you think.

Image(sheet) sized 350×500:

512×512 (pixels) × 4 (bytes) = 1,048,576 bytes = 1 MB

Image(sheet) sized 514×1024:

1024×1024 (pixels) × 4 (bytes) = 4,194,304 bytes = 4 MB
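
The same arithmetic can be wrapped in a small helper for checking art assets before they go into a project. This is a sketch in plain Lua; the function names are hypothetical, and it assumes the rounding rule and the 4 bytes per pixel described above.

```lua
--round a single dimension up to the next power of 2
local function nextPowerOfTwo( n )
   local p = 1
   while p < n do
      p = p * 2
   end
   return p
end

--estimated texture memory in bytes: PoT width × PoT height × 4 channels
local function textureBytes( width, height )
   return nextPowerOfTwo( width ) * nextPowerOfTwo( height ) * 4
end

print( textureBytes( 350, 500 ) )   --1048576 (512×512×4)
print( textureBytes( 514, 1024 ) )  --4194304 (1024×1024×4)
print( textureBytes( 920, 40 ) )    --262144 (1024×64×4)
```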

Notice that the next PoT up requires four times the texture memory! This becomes even more concerning when you consider development for both “normal” devices and Retina/HD devices. If the display size between the two varieties doubles in pixels, as with the iPad vs. the Retina iPad, all of your images will likewise need to double in size to keep them crisp and sharp. However, doubling their size requires more than double the texture memory — 4× in the above example — and generally speaking, the Retina/HD devices do not contain 4× the texture memory compared to their predecessors!

Before you begin to panic, realize that texture memory can usually be managed without an excessive amount of effort or (gasp!) reworking an entire set of artwork. Just remember these tips:

  1. Always “unload” textures (remove them from the display stage) when they’re no longer needed.
  2. If you have a background texture that needs to be 525×600 on the screen, you might be able to create it as a proportional 448×512 image to constrain it to the 512 PoT range. Then, in code, scale it up slightly by setting its desired width and height. If it’s just a small increase in size, the minor loss of clarity likely won’t be perceived by the user, especially on a small phone screen.
  3. Re-use textures if possible, and apply tinting with the setFillColor() API. For example, if you have a “red apple” and a “green apple”, you might be able to create the apple as a greyscale image and apply a red and green tint accordingly.
  4. If you’re using image sheets, consider using a tool like TexturePacker to pack your images into the smallest PoT configuration possible.

9. Pre-create Physics Bodies

If you intend to use a considerable number of physics bodies in your scenario, it might be wise to pre-create them in non-time-critical code. Those that won’t be used immediately can be set to inactive and placed somewhere off screen or in an invisible display group, returning to active state when needed.

That being said, creating some physics bodies during time-critical code is fine — just avoid creating 10-20 physics bodies during one game cycle, as it will likely result in a noticeable frame rate skip.

Also, it should be noted that this is a “balancing act” to some degree. Pre-creating and deactivating 200 physics bodies will remove them from the Box2D world, but not from Corona’s memory, so it might not benefit performance to take this practice to such an extreme.
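
The pre-creation idea can be organized as a simple object pool. The sketch below is plain Lua and deliberately leaves out any Corona calls: newPool, acquire, and release are hypothetical names, and the "active" flag merely stands in for whatever a real project would do to create, deactivate, and reactivate a physics body.

```lua
--createFn builds one object; "size" objects are built up front
local function newPool( createFn, size )
   local free = {}
   for i = 1,size do
      local obj = createFn()
      obj.active = false       --placeholder for "deactivate and hide"
      free[i] = obj
   end
   local count = size

   local pool = {}

   --hand out a pre-created object, or nil if the pool is empty
   function pool.acquire()
      if count == 0 then return nil end
      local obj = free[count]
      free[count] = nil
      count = count - 1
      obj.active = true        --placeholder for "activate and show"
      return obj
   end

   --return an object to the pool for later reuse
   function pool.release( obj )
      obj.active = false
      count = count + 1
      free[count] = obj
   end

   return pool
end

--build 20 objects during non-time-critical code
local pool = newPool( function() return { x = -1000, y = -1000 } end, 20 )

--later, in time-critical code, no creation is needed
local bullet = pool.acquire()
bullet.x, bullet.y = 160, 240
pool.release( bullet )
```

Keeping the pool size modest follows the “balancing act” noted above, since every pooled object still occupies memory while it waits.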


10. Utilize Audio “Best Practices”

Sound effects for an app should always be pre-loaded in non-time-critical code, for example, before a scene or level begins. Additionally, you should compress/sample sounds to the smallest acceptable quality. 11 kHz mono (not stereo) is considered “acceptable” in most cases, as the user will likely be listening through the phone/tablet speaker or a bundled set of earbuds. Also, simple, cross-platform formats like WAV do not tax the CPU heavily.

If desired, sound effects can be organized in a table as follows, for easy reference and eventual disposal when they’re no longer needed.

--load these sounds during NON-time-critical code
local soundTable = {
   mySound1 = audio.loadSound( "a.wav" ),
   mySound2 = audio.loadSound( "b.wav" ),
   mySound3 = audio.loadSound( "c.wav" ),
   mySound4 = audio.loadSound( "d.wav" ),
   mySound5 = audio.loadSound( "e.wav" ),
   mySound6 = audio.loadSound( "f.wav" ),
   mySound7 = audio.loadSound( "g.wav" ),
   mySound8 = audio.loadSound( "h.wav" ),
}

With this structure, playback is as simple as:

local mySound = audio.play( soundTable["mySound1"] )

As always, do not forget to clean up your sounds by disposing them when they’re no longer needed, and clearing the reference from the table:

local ST = soundTable
for s,v in pairs(ST) do
   audio.dispose( ST[s] ) ; ST[s] = nil
end

That’s it for today’s tutorial. As a developer, you’ll find that performance optimization requires endless diligence, and you should always adhere to “best practices.” Hopefully these tips provide a small amount of insight into boosting the performance of your app. As always, please respond with your questions and comments below.



14 Responses to “Performance Optimizations”

  1. Nevin Flanagan

    I feel like I need to reinforce the “time-critical” point made near the beginning. Some of these enhancements, such as avoiding table.insert or ipairs, provide better performance, but come with their own costs, such as a reduction in readability or simplicity. They are, precisely, optimizations, code constructs which show advantages primarily when performance is of the utmost concern.

    In particular, if you have to perform several operations on each element in an array, make sure to set a local reference to it inside the loop, or the repeated array lookups will penalize performance more than the function call:

    for i=1, #t do
    local item = t[i]
    action(item)
    item:method()
    end

      • Brent Sorrentino

        Hi Jack,
        “pairs” isn’t great, but it’s slightly better than “ipairs”… and with “pairs”, in many cases it’s just the easiest way to iterate over a dictionary table. You should be fine using it, just not inside time-critical (game-loop) code, inside a huge loop.

  2. Jon Simantov

    Realistically, when will we start to see the performance impact of changing a single extra table access into a local?

    Will it be noticeably faster in a for loop of 100 iterations? 1,000? 10,000?

  3. Chris Leyton

    Might have been a good idea to put a timeStamp on to show the value of these improvements.

    I cannot stress enough the importance of localising math.* functions, however. I was building a virtual dStick class that uses a fair bit of trig, and was getting 5fps at times on device; localising the relevant math.* functions brought this right back to a solid 30fps.

  4. Thomas Vanden Abeele

    Regarding point 8:
    Could you please verify the following for me: are textures always rounded up to be square in proportion (width = height), or are width and height rounded up to the next power of 2 separately?

    In other words, is a 380 x 700 pixel texture rounded up to 512 x 1024 px or to 1024 x 1024 px? You seem to be saying the latter, but I always thought it was the former.

    Thanks,
    Thomas

    • Brent

      Hi Thomas,
      It’s the first example that you state (each dimension rounds up to the next PoT independently). I clarified the section on that. Thanks for bringing it to my attention.

  5. Christopher

    One thing to point out: in #3, I wouldn’t recommend using #a, as it doesn’t perform well in “critical” loops. If possible, assign the total count to a local variable and use that inside the loop instead.

    Rather than this:
    local a = {}

    for i = 1,100 do
    a[#a+1] = i
    end

    Use this:
    local a = {}
    local aSize = #a

    for i = 1,100 do
    a[aSize+1] = i
    end

    • open768

      @christopher aSize doesn’t change in your example, so you are effectively writing a[1]=i. I think you meant to auto-increment aSize.

      local a = {}
      local aSize = 0

      for i = 1,100 do
      aSize=aSize+1 --autoincrement
      a[aSize] = i
      end

    • Aurélien Defossez

      Well, your last code only writes one value over and over again, in the same place.
      You set aSize to 0 and then set a[0] to i, from 1 to 100, but aSize is never updated.

      Modifying the loop line to this will do the trick. If we don’t need aSize to be consistent that is.
      for i = 1,100 do
      a[aSize+i] = i
      end

      If we need consistency :
      for i = 1,100 do
      aSize = aSize + 1
      a[aSize] = i
      end

  6. Rais

    Very good post on the “Optimization topic”, please write about “Prevention Of Memory Leakage Tips” as well.

  7. Rémi Papillié

    Hi,

    Is there a way to profile games in Corona in order to find the hotspots? Most of these optimizations are great, but will provide a great speed boost only if applied to actual bottlenecks…

    Thanks,
    Rémi

  8. open768

    just ran a little test on the simulator

    local iCount=0
    local iRounds=5000000

    function test1(picounter)
       return picounter +1
    end

    function test2()
       iCount = icount +1
    end

    local cClass={}
    cClass.test1=test1
    cClass.test2=test2

    print ("running "..iRounds.." iterations of each test")
    t1=system.getTimer()
    print( "start: "..t1 )
    iCount=0
    for i=1,iRounds do
       icount=test1(iCount)
    end

    t2=system.getTimer()-t1
    t1=t2
    print( "end – simple function:"..t2)
    iCount=0
    for i=1,iRounds do
       test2()
    end
    t2=system.getTimer()-t1
    t1=t2
    print( "end – global variable:"..t2)

    iCount=0
    for i=1,iRounds do
       icount=cClass.test1(iCount)
    end
    t2=system.getTimer()-t1
    t1=t2
    print( "end – table fn:"..t2)

    iCount=0
    for i=1,iRounds do
       cClass.test2()
    end
    t2=system.getTimer()-t1
    t1=t2
    print( "end – table fn – global:"..t2)

    running 5000000 iterations of each test
    start: 11
    end – simple function:2687
    end – global variable:2355
    end – table fn:4577
    end – table fn – global:4313

    which shows that on the simulator using global variables is faster – but at the expense of coupling, and using tables is about twice as slow – but again at the expense of not being able to encapsulate.

    So it’s a tradeoff: if it’s a really time-critical beast, go dirty; otherwise maintainability is cheaper in the long run.

