Bringing confidence to our API’s uptime through Elasticsearch

Bartek Ciszkowski
G Adventures Technology
9 min read · May 14, 2018


Four years ago, we at G Adventures had a vision to fix the plumbing between our systems: create a single point of entry for all of our data, via an API connected to many sources. As the vision became reality, we were introduced to a variety of issues around reliability, consistency, and scale. This is part of that story.

This vision took the form of an API: one available to, and designed for, both internal systems and external partners. A single API for any client. We set a goal to place ourselves on the bleeding edge of APIs in the travel industry, and we continue to aim for that. Ambitious, eh?

A primary challenge we knew we’d face was scaling our API ecosystem to handle hundreds, if not thousands, of requests per second. Each new application, whether from a partner or built internally, would add significant load to our API, ensuring the fun would never end.

Today, we’ve gained significant traction with our partners (and internal systems!), but with any success comes … challenges *ominous music*

First, let’s zip through the events that got us to our problem.

Let’s step back and look at how one would approach building an API. You decide you want to expose some database of information. You’ll use your favourite language and framework (we love Python & Django Rest Framework!), connect things up to that database, and voilà! You have an API. Brilliant!

Now let’s add more data from another source. Let’s just connect to that second database and fetch it directly. Hmm, that data is coming from a database used heavily by our website…

Ah! Downtime has hit us. The website was seeing an influx of traffic, and the API resource it was backing was unavailable for some time. Let’s remember: we, as an API product team, do not own this data. It’s managed by the web team (although we do work with them heavily!)

Didn’t you hear? We’re an API-first business now (let’s ignore the buzzword for now). Let’s continue to make sure all our systems speak to the API, rather than building their own implementations to speak to a source system.

We’ve now added a connection to our Salesforce CRM platform, so we can offer Traveller data in our applications.

Captain. We’ve lost our requests.

Well, some degradation in service is bound to happen, right?

And as the story continues, more and more systems began sourcing data into our API. Now, on a regular basis, we’d see some form of downtime, even if for a few minutes. Sentry, our choice for error reporting, became quite noisy!

One principle we must follow is that this is not a problem for the underlying systems supplying data to the API. The website was not built to be an API. Its focus is on optimizing the delivery of content to a desktop or mobile user. Its core responsibilities significantly differ from the API’s.

As our scope of data increased, so did our API customer base. We kept growing, again, both internally, and with our partners. The pressure was intensifying!

Receiving thousands of requests per second, with each request routed to one of dozens of source systems … tracking any issue at this scale became unmanageable for our small team.

Downtime is money. Your team members are affected by trying to identify issues (in this case, over and over again, in systems they had little knowledge of). And of course, if a customer can’t book a trip, that’s a huge loss!

We were unable to build confidence in identifying each issue linked to downtime. Each team had its own core responsibilities, and it was simply not fair to take them away from those. It made sense to solve this issue at a single point: within the API.

But how?

Not much confidence in consistency, here.

Looking at the above image of our problem, what is your first thought on resolving the issue?

It’s a rite of passage for any developer: when you’re thrown a software problem, at some point you will cache. But this problem goes beyond caching.

We definitely don’t want every source system to implement an API object cache. Nor do we want to attempt to solve this by throwing more hardware at each system. Repeating the same fix in every system breaks one of the first things you learn in Computer Science: Don’t Repeat Yourself!

It’s fairly clear to the even mildly observant eye what the common entry point is for all this data. The API Gateway. A single entry point to all source systems. If a request is made to the API, we can capture that response, return it to the client, and then cache it. Simple, no?

Yes! We can be quite naive here. Let’s add a short-lived cache that places any requested object into Fastly for 5 minutes. This feels comfortable. It’s too short for anyone to notice stale data (like a price), and it could alleviate some level of our failed-request issues.

Solutions should be simple, clear, and accessible for the team. But … this just feels too naive. First, Fastly runs on hundreds of nodes, and even with shielding enabled, there’s no guarantee a newly-cached resource will always be a HIT.

Second, this still doesn’t solve a problem if a system goes down beyond 5 minutes. If we can’t even fetch the object from the source system, we won’t be able to cache it. Let’s think beyond this. Let’s define a new principle.

No single request to the API will hit a source system. Rather, it will always hit a persistent cache managed by API services.

A persistent, invalidated-when-necessary cache for an API serving over two million objects? Yeah, that sounds pretty darn cool.

We began to review our options for persistent storage. We love Postgres and use it heavily. However, the store that spoke to us was Elasticsearch. It spoke to us through these key traits: it’s distributed, has built-in search, and acts as a general, versioned document store.

So, we spun up an Elasticsearch cluster. Which, by the way, is ridiculously simple. We used Ansible to do so, and it’s essentially running these tasks:

  • Ensure Java is installed
  • Download the Elasticsearch distribution
  • Unarchive the Elasticsearch binary into some desired root
  • Write, or use, the default Elasticsearch configuration
  • Run Elasticsearch
  • Repeat on each box, adjusting the config in minor ways as per the documentation. Voilà, a cluster!
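As a rough sketch, that playbook boils down to tasks like these (the version number, paths, and config template here are illustrative, not our actual playbook):

```yaml
- name: Ensure Java is installed
  apt:
    name: openjdk-8-jre-headless
    state: present

- name: Download the Elasticsearch distribution
  get_url:
    url: https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-6.2.4.tar.gz
    dest: /tmp/elasticsearch.tar.gz

- name: Unarchive Elasticsearch into the desired root
  unarchive:
    src: /tmp/elasticsearch.tar.gz
    dest: /opt/elasticsearch
    remote_src: yes

- name: Write the node configuration
  template:
    src: elasticsearch.yml.j2
    dest: /opt/elasticsearch/config/elasticsearch.yml

- name: Run Elasticsearch as a daemon
  command: /opt/elasticsearch/bin/elasticsearch -d
```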

Then, to test things out, we bulk loaded an entire resource set into Elasticsearch. The first set we focused on was our places resource, which contains just under 9 million objects. This resource is essentially all the Places which can be referenced by various objects in our API (hotels, restaurants, activities, …). What we saw right away:

Loading the entire places dataset into Elasticsearch, and reading from it rather than the source system, improved response times significantly.
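A bulk load like this can be sketched with the official elasticsearch-py client. The helper below builds the actions that `elasticsearch.helpers.bulk` expects; the index name and document shape are our assumptions for illustration:

```python
def place_actions(places, index="places"):
    """Yield one bulk-index action per place document, in the shape
    expected by elasticsearch.helpers.bulk()."""
    for place in places:
        yield {
            "_index": index,      # illustrative index name
            "_id": place["id"],   # reuse the API object's id
            "_source": place,     # the document body itself
        }

# Streaming the whole resource set into the cluster then looks like:
#   from elasticsearch import Elasticsearch
#   from elasticsearch.helpers import bulk
#   bulk(Elasticsearch(["http://localhost:9200"]), place_actions(all_places))
```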

Elasticsearch natively supports a wide range of search functions on its documents. We were able to build a simple Python service which could translate a query string in the URL into an Elasticsearch query.

Let’s talk about these search functions some more. First, here’s how our API looks today, with Elasticsearch as part of the picture.

This Sieve app, named after the kitchen tool, is the glue between our API and Elasticsearch. It’s a Python 3 app, leveraging aiohttp to handle some serious load. To bind to Elasticsearch, we simply transform query string parameters into Elasticsearch queries, and query the Elasticsearch API within the Sieve app, on demand.
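The translation step might look something like this. The rules below (full-text matching on `name`, exact term filters on everything else) are a hypothetical sketch, not Sieve’s actual logic:

```python
def to_es_query(params):
    """Translate flat query-string parameters into an Elasticsearch
    bool query: `name` gets full-text matching, everything else
    becomes an exact term filter."""
    must, filters = [], []
    for field, value in params.items():
        if field == "name":
            must.append({"match": {"name": value}})
        else:
            filters.append({"term": {field: value}})
    return {"query": {"bool": {"must": must, "filter": filters}}}
```

So `?name=lima&country=PE` becomes a bool query with one match clause and one term filter, which can be handed straight to the Elasticsearch search API.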

This was good. We could see this becoming a single, consistent search layer not only for this places resource but, well, everything. No reason for every source system to put in place its own search functionality. What a relief that’d be for those product owners!

There was one obvious problem before we could route API requests to this new functionality. You can guess what it is, right? Resources change all the time. So, let’s talk about cache invalidation.

Prior to this, we’d already implemented a webhook delivery system (for the record, we call it major tom). major tom acts like what you may think a webhook system would: when a source system modifies data in some way, it sends a simple ping to major tom, telling it to notify subscribers of the change. What major tom receives looks something like this:

{
  "resource": "tour_dossiers",
  "language": "en",
  "data": {
    "id": "24270"
  }
}

major tom then sends out an event to every subscriber of tour_dossiers webhooks, doing some extra work to add timestamps and an event type. The final result major tom produces looks like this:

[
  {
    "event_type": "tour_dossiers.updated",
    "resource": "tour_dossiers",
    "created": "2018-05-07T05:34:38Z",
    "language": "en",
    "data": {
      "id": "24270",
      "href": "https://rest.gadventures.com/departures/24270/"
    }
  }
]
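That enrichment step can be sketched as a small transformation. The fields mirror the payloads above, but the helper name and the href rule are our assumptions, not major tom’s actual code:

```python
from datetime import datetime, timezone

API_ROOT = "https://rest.gadventures.com"

def enrich_event(ping, event_type="updated"):
    """Turn a source system's ping into the event major tom delivers,
    stamping on an event type, a timestamp, and a canonical href."""
    resource, obj_id = ping["resource"], ping["data"]["id"]
    return {
        "event_type": f"{resource}.{event_type}",
        "resource": resource,
        "created": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "language": ping["language"],
        "data": {
            "id": obj_id,
            "href": f"{API_ROOT}/{resource}/{obj_id}/",  # assumed href rule
        },
    }
```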

On average, we receive 175,000 object changes from our systems each day. Data changes all. the. time! A data change can be as clear-cut as a customer adding their middle name onto their profile. Or, it can be a collection of many events, e.g. a customer booking a tour, which modifies that customer, a booking, services, departure availability, etc …

We were pretty confident about webhooks, with major tom delivering millions of events. And with that in mind, an interesting pipeline emerged.
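The pipeline boils down to: receive a webhook event, refetch the fresh object from its source, and reindex it (or delete it) in the persistent store. A minimal sketch, assuming an elasticsearch-py-style client and a `fetch_object` callable, both of which are illustrative:

```python
def apply_event(es_client, fetch_object, event):
    """Keep the Elasticsearch cache in sync with one webhook event:
    deletions remove the cached document, everything else refetches
    the fresh object from the source system and reindexes it."""
    resource = event["resource"]
    obj_id = event["data"]["id"]
    if event["event_type"].endswith(".deleted"):
        es_client.delete(index=resource, id=obj_id)
        return "deleted"
    document = fetch_object(resource, obj_id)  # the one place a source system is hit
    es_client.index(index=resource, id=obj_id, body=document)
    return "indexed"
```

Note how this preserves the core principle: a source system is only touched when it tells us something changed, never on a client request.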

Woah! Did we just implement cache invalidation? On that persistent Elastic store? Dang, that felt too easy — it wasn’t an obvious answer at the time! We just make it look easy now.

Yes, we’ve now gained the confidence to continue following our new core principle, again:

No single request to the API will hit a source system. Rather, it will always hit a persistent cache managed by API services.

Now that we had this real-time invalidation in place, we enabled our API to hit our Elastic Store (via Sieve) for every request to our places resource.

Since that day, we’ve been moving more of our 50+ resources to this pattern. Initial migrations came from demand, and then we moved resources that traditionally struggled to return within reliable time frames.

We’ve run into our share of problems. Although we had confidence in major tom to deliver events, we identified that source systems would not always fire the appropriate — or any — webhooks when an object changed. Ultimately, these minor issues have helped us fix underlying problems, and improved consistency and trust in the data viewed in the G API.

We continue to expand the responsibilities of our Sieve app, offering functionality like GraphQL on top of Elasticsearch, but we’ll save those details for another day.

Curious about more detail? Feel free to email me, and we can discuss. Thank you for reading.

This article is part of a series introducing the G API to the world.

Keep following us on Medium and Twitter for more snippets of the Tech world according to G!

Want to help G Adventures change people’s lives and travel the world? Check out all our jobs and apply today.


I manage Platform Systems at G Adventures. We provide the ecosystem that ensures success for developers, ultimately leading to a great customer experience.