How we scaled Ticketbis

A startup should be designed to grow fast, and so should its IT architecture. This year has been very challenging here at Ticketbis: the engineering team has focused on scaling out the platform to support our rapid business expansion while keeping it flexible.

Today, our system offers more than half a million tickets across more than 40 countries and in 34 languages.

This post is not about how we evolved from a single server to the current architecture, so let's go straight to what we ended up with:

Scaled architecture

image by Xabi Larrakoetxea

It turns out that 90% of our requests are just “reads” made before users pick the ticket they want to buy, so there is no need to hit the database up to that point. Caching was the obvious answer, and Redis and Memcached were our candidates. Both are fast in-memory key-value stores, but Redis fit our needs better because of its simplicity and its advanced features, such as data structures (hashes, lists and so on).

We use a Redis master-slave setup: every “read-only” server has its own slave instance, which is accessed by a lightweight Grails application. Whenever a transactional operation is required, users are redirected to the “secure” application (Grails + MySQL).

How does caching work?

A set of Python processes run as daemons, watching for any change to user-sensitive data so they can denormalize it and populate the Redis master. These processes also take care of cache invalidation, keeping the data fresh and removing past events.

Some of the event-related data expires automatically because we set its TTL to the event date.
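As a rough illustration of the expiry idea, here is a minimal sketch, not our production code. It assumes a redis-py style client (version 3.5 or later, for `hset(..., mapping=...)`), and the `event:<id>` key layout, the field names and the function names are all hypothetical.

```python
from datetime import datetime, timezone


def build_event_entry(event_id, fields, event_date):
    """Return (key, mapping, expire_at) for a denormalized event hash.

    expire_at is the event date as a Unix timestamp, so Redis can drop
    the key once the event has passed.
    """
    if event_date.tzinfo is None:
        raise ValueError("event_date must be timezone-aware")
    key = "event:%s" % event_id  # hypothetical key layout
    # Redis hashes store flat strings, so every value is stringified.
    mapping = {name: str(value) for name, value in fields.items()}
    return key, mapping, int(event_date.timestamp())


def store_event(client, event_id, fields, event_date):
    """Write the hash and its expiry together.

    `client` is assumed to be a redis-py client connected to the master.
    """
    key, mapping, expire_at = build_event_entry(event_id, fields, event_date)
    pipe = client.pipeline()
    pipe.hset(key, mapping=mapping)
    pipe.expireat(key, expire_at)
    pipe.execute()
```

With a setup like this, the daemons only need to invalidate entries explicitly when data changes before the event; anything tied to the event date disappears on its own.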

Every changed entity is also queued in a Redis list so that other processes can consume it (e.g. the process that feeds the ElasticSearch cluster).

How we manage our queues

Redis queues

  • (1) A Lua script recovers all failed entities from the processing queue and pushes them back onto the original one.
def recover_unprocessed_items(self, incr_key, full_key):
    # Atomically move up to 500 items left in the processing queue (KEYS[2])
    # back to the original queue (KEYS[1]); returns the number of items moved.
    lua_script = """
        local cont = 0
        while cont < 500 do
            local element = redis.call('RPOPLPUSH', KEYS[2], KEYS[1])
            if not element then break end
            cont = cont + 1
        end
        return cont
    """

    recover_unprocessed = self.redis_con.register_script(lua_script)
    return recover_unprocessed(keys=[incr_key, full_key + '-processing'], args=[])
  • (2) Any changed entity is pushed to the queue (LPUSH).
  • (3) The consumer process starts. Using RPOPLPUSH to move each item from the queue to the processing queue keeps the queue reliable: if the process fails, the script recovers the unprocessed entities as described in step 1.
  • (4) Successfully processed items are removed from the processing queue.
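The four steps can be sketched end to end as follows. This is a simplified illustration under a few assumptions: entities are serialized as JSON, the processing list is named by appending `-processing` to the queue name, and the function names are hypothetical. The small `InMemoryLists` class only exists so the sketch runs without a server; its methods mirror the `lpush`, `rpoplpush`, `lrem` and `llen` methods of a redis-py client, which could be passed in instead. Unlike the Lua script in step 1, the `recover` function here is not atomic.

```python
import json
from collections import defaultdict, deque


class InMemoryLists:
    """Stand-in for the Redis list commands used below (illustration only)."""

    def __init__(self):
        self._lists = defaultdict(deque)

    def lpush(self, name, *values):
        for value in values:
            self._lists[name].appendleft(value)
        return len(self._lists[name])

    def rpoplpush(self, src, dst):
        if not self._lists[src]:
            return None
        value = self._lists[src].pop()
        self._lists[dst].appendleft(value)
        return value

    def lrem(self, name, count, value):
        # Only the count=1 case (remove the first match) is modelled here.
        try:
            self._lists[name].remove(value)
            return 1
        except ValueError:
            return 0

    def llen(self, name):
        return len(self._lists[name])


def enqueue(client, queue, entity):
    """Step 2: push a changed entity onto the queue."""
    client.lpush(queue, json.dumps(entity, sort_keys=True))


def consume_one(client, queue, handler):
    """Steps 3 and 4: move one item to the processing list, handle it,
    then remove it. Returns False when the queue is empty."""
    processing = queue + "-processing"
    raw = client.rpoplpush(queue, processing)
    if raw is None:
        return False
    handler(json.loads(raw))  # if this raises, the item stays in `processing`
    client.lrem(processing, 1, raw)
    return True


def recover(client, queue, limit=500):
    """Step 1: move items left in the processing list back to the queue."""
    processing = queue + "-processing"
    moved = 0
    while moved < limit and client.rpoplpush(processing, queue) is not None:
        moved += 1
    return moved
```

The point of the pattern is that an item is never held only in the consumer's memory: it sits either in the queue or in the processing list until the handler has finished with it.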

This architecture lets us scale horizontally quickly by adding more “read-only” servers to the balancer or more servers to the ElasticSearch cluster (which supports 34 languages right now). Adding support for a new language or site is a one-click action from our back office.

In upcoming posts, we will go into detail about ElasticSearch indexing and other parts of the process. If anyone wants more info, raise your hand and we will be pleased to go deeper ;)