For the past six years, we have been asking ourselves one fundamental question about our business.

"How do we get data from the transport providers we want to integrate?"
After following the old-fashioned route of kindly asking each partner for access to its API, we faced the fact that all the money in the world would not suffice to finance this approach.
Therefore, we had to make a decision - either we wait for the market to open up for us, or we open up the market ourselves, since our own crawling framework enables us to use the best interface every B2C company offers: its own website.
We realized that we actually did not need any help from our partners for the integration: no maintained API, no complex security measures, no laborious documentation and no expensive technical support that the partner had to provide us with.
There are three perspectives I want to explain in order to prove my point: technical, market-specific and legal, followed by an outlook and the key learnings from six years on the quest for gathering high-quality travel data.
1. The technical perspective
A crawler is best described as a program that simulates a user's behavior on a website, following the same steps a user takes in the browser: entering search parameters (e.g. destination, date, etc.), requesting results by clicking the search button and then scanning through them.
That is both the simple and the genius system of crawlers: everything a website shows to its users and customers, the crawler can also read, extract and store.
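To make this concrete, here is a minimal sketch of such a crawl in Python, using the widely available requests and BeautifulSoup libraries; the URL, parameter names and CSS classes are hypothetical stand-ins for illustration, not an actual provider's site.

```python
# Minimal sketch of a crawler simulating a user's search on a hypothetical site.
import requests
from bs4 import BeautifulSoup

# Search parameters a user would normally type into the website's form.
params = {"from": "Berlin", "to": "Munich", "date": "2016-06-01"}

# Request the result page, just as the browser would after clicking "search".
response = requests.get("https://www.example-provider.com/search",
                        params=params, timeout=10)
response.raise_for_status()

# Scan through the results the same way a user would read them on screen.
soup = BeautifulSoup(response.text, "html.parser")
for row in soup.select(".result-row"):                    # hypothetical CSS class
    departure = row.select_one(".departure").get_text(strip=True)
    price = row.select_one(".price").get_text(strip=True)
    print(departure, price)
```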
The extent and quality of code libraries and open source frameworks exploded over the past decade.
This especially concerns technologies that embody a fundamental principle of the open-source movement: free and easily accessible data – which is exactly what crawlers provide.
This development affects all elements of a crawler.
There are analysis tools that allow us to request a search result without even loading the entire page; the downloader can decide, while reading the data on a website, whether it is worth storing; and advanced parsers filter the stored content almost in real time, turning unreadable markup into structured, re-usable data such as XML or JSON.
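As a hedged illustration of that downloader-and-parser stage, the sketch below turns a hypothetical result page into JSON-ready records; the field names and the "worth storing" rule are assumptions made for the example, not our production pipeline.

```python
# Sketch of the parser stage: turn raw result HTML into structured JSON records.
import json
from bs4 import BeautifulSoup

def parse_results(html: str) -> list:
    """Extract connections from a hypothetical result page."""
    soup = BeautifulSoup(html, "html.parser")
    connections = []
    for row in soup.select(".result-row"):                # hypothetical markup
        # Decide on the fly whether a record is worth storing - here we simply
        # skip sold-out connections (an assumed rule, purely for illustration).
        if "sold-out" in row.get("class", []):
            continue
        connections.append({
            "departure": row.select_one(".departure").get_text(strip=True),
            "arrival": row.select_one(".arrival").get_text(strip=True),
            "price": row.select_one(".price").get_text(strip=True),
        })
    return connections

# The structured output can then be stored and re-used, e.g. as JSON:
# print(json.dumps(parse_results(html), indent=2))
```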
Since most of the basic libraries are open source, our developers have the opportunity to adapt them to our special needs.
For example, we taught our crawlers to adapt themselves to certain changes on the websites (for instance when a provider changes its layout) and to notify us if that fails.
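A simple way to picture this self-adaptation is a fallback list of selectors plus a notification hook, as in the sketch below; the selector names and the notify helper are hypothetical placeholders.

```python
# Sketch: try the selectors of layouts seen so far and alert us when all fail.
from bs4 import BeautifulSoup

def notify_developers(message: str) -> None:
    # Placeholder: in practice this could send an e-mail or a chat alert.
    print("ALERT:", message)

# Price selectors for page layouts we have encountered (hypothetical examples).
PRICE_SELECTORS = [".price", ".fare-amount", "span[data-role='price']"]

def extract_price(html: str):
    soup = BeautifulSoup(html, "html.parser")
    for selector in PRICE_SELECTORS:
        node = soup.select_one(selector)
        if node:                       # the crawler "adapts" by falling back
            return node.get_text(strip=True)
    notify_developers("price selector broken - the site layout probably changed")
    return None
```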
In general, a change to a website is a problem for a crawler - in fact, it is very similar to a change to an official API, which also happens from time to time.
In preparation for this article, I reviewed the changes we had to make for APIs over the past years and compared them to the number of adaptations we had to make to our crawlers due to structural changes of websites.
In total, the ratio is only 1:2, and it has been decreasing over the years thanks to the tolerance to change (e.g. in design) that we taught our crawlers (see above).
Another interesting fact: if we compare downtimes, the ratio is almost the other way around, which means that - in our sample - websites are more available than regular APIs.
Of course, there are technical protection mechanisms websites can use to fend off crawlers.
This is a subtopic that has to be addressed, although in our experience (probably owing to our business model) it has not been an obstacle we had to overcome with further technical countermeasures.
In most cases, our experience has been that many partners love to share their data with us, as we are a distribution channel for them.
Hence, it is standard for us to crawl websites with the partner's prior consent. However, there are server-based intrusion detection and protection mechanisms.
The most common are blocking the requesting server's IP addresses, rendering results in JavaScript, cloaking parts of the content or presenting Captchas to the user - but all these hurdles can be overcome with state-of-the-art technology, cloud-based servers and free-of-charge frameworks.
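For JavaScript-rendered results in particular, a headless browser is a common free-of-charge answer; a minimal sketch with Selenium (and a hypothetical provider URL) could look like this - and, as said, we only do this with the partner's consent.

```python
# Sketch: render a JavaScript-heavy result page with a headless browser.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

options = Options()
options.add_argument("--headless")        # no visible browser window needed
driver = webdriver.Chrome(options=options)
try:
    # Hypothetical provider URL whose result list is built in JavaScript.
    driver.get("https://www.example-provider.com/search?from=Berlin&to=Munich")
    soup = BeautifulSoup(driver.page_source, "html.parser")
    print(len(soup.select(".result-row")), "results rendered")
finally:
    driver.quit()
```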
2. Market-specific concerns
This brings us directly to travel-market-specific problems and concerns. As mentioned above, with a good sales pitch, using your partner's website as an API is a win-win strategy for both parties.
This has several very simple reasons: many players – even big ones – do not possess an efficient API, apart from some GDS-based interfaces that are extremely expensive and limited.
Although fromAtoB depends on the quality of its partners' data, I understand why some providers choose not to set up their own API.
You suddenly have another interface, apart from your own website, that you have to maintain and protect via complex security measures.
Documentation for third parties needs to be crafted and these third parties need to be supported with technical manpower.
These are substantial investments you have to pre-finance before you know whether there is a business case. Our answer: if you do not have an API, we do not need one.
We use the best maintained interface any company with e-commerce ambitions has: the website (or in some cases the mobile app).
In the past few years, we even switched back several times from using an existing API to crawling the websites, exactly for the reasons mentioned above.
In our cases: the API documentation was in Spanish; the partner twice forgot to tell us that they had changed the API keys; the partner's backend changed and the API went down (but the website did not).
For all these reasons, we are now also offering providers we are crawling the opportunity to use our API so that they can redistribute their data to third parties (affiliates for instance).
"We don’t want you to crawl our website, as we fear that you will kill our servers with too many requests."
That is a concern we hear from many partners. Yes, and that was once true back in the 90s, when distributed crawlers sometimes had the same harmful effect as DDoS attacks.
But with modern “minimally invasive” crawlers you can specifically request the content you are looking for, without loading images, style sheets, videos, ads, etc.
Self-learning caching logic reduces this issue even more, especially for crawlers with high data demand, as similar requests can be answered from our own previously stored database without requesting the partner's website again.
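A minimal version of such a cache, keyed by the search parameters and with an assumed time-to-live, could look like the following sketch; the fetch callback stands in for the actual crawl.

```python
# Sketch: answer repeated, identical searches from a local cache instead of
# requesting the partner's website again (the TTL value is an assumption).
import time

CACHE_TTL_SECONDS = 300     # assumed freshness window for fares
_cache = {}                 # (origin, destination, date) -> (timestamp, results)

def cached_search(origin, destination, date, fetch):
    """Return recent results from the cache; crawl only when they are stale."""
    key = (origin, destination, date)
    now = time.time()
    if key in _cache and now - _cache[key][0] < CACHE_TTL_SECONDS:
        return _cache[key][1]                   # served from our own database
    results = fetch(origin, destination, date)  # only now touch the website
    _cache[key] = (now, results)
    return results
```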
This is also the solution for another travel-specific concern: the so-called “scan-to-book” ratio (also: look-to-book), the pure horror of every meta-search engine and website administrator - at least back in the day.
If we probe the causes of this problem, it is not, as one would expect, the cost of server capacity or DDoS concerns (as mentioned above).
No. It is, or at least was, in most cases caused by the GDSs and their business model, in which a travel company had to pay for requesting and using its own data.
But since many companies nowadays host their own data or use purely performance-based third-party data services, this issue is slowly fading.
3. Legal implications
Several European countries have long been waiting for a precedent-setting judgment from a national supreme court on how to deal with crawlers, one that finally clarifies long-unanswered questions such as whether the search results on websites are copyrighted and how far a virtual householder's rights extend.
Finally, the Bundesgerichtshof (German supreme court) settled these questions in a case between Ryanair and Beins Travel Group (Cheaptickets.de), ruling that booking and search portals do not violate competition law when crawling data from a website.
The court goes even a step further by allowing booking portals to charge customers a commission.
The court follows the portal's argument, which justifies the fee as a service for searching and processing the data.
However, I would like to emphasize once again what I have already mentioned above: the website operator's prior consent is, in most cases, easy to obtain.
4. Outlook
Gathering data via crawlers is actually just the first step; covering the entire booking process on a larger scale via crawlers and bots comes next.
In the flight industry, there are already several online travel agencies that are not only crawling and extracting flight data from an airline's website, but are also completing the entire booking process via a crawler (or rather a bot) by simulating the payment process on the website.
For several reasons, this is an easier process in the flight industry (e.g. via payment with a virtual credit card) than, for example, in the railway or even the public transportation sector.
Still, in general the approach and the solution stay the same and will definitely be feasible within the next few years.
Key learnings and implications for the travel industry
Now, six years later, there are three very simple but essential learnings:
- Crawlers make data fully accessible via websites: APIs are from the '90s – crawlers are today's interfaces!
- Accordingly, the GDSs' monopoly as a single source of travel data will fade
- Even booking and payment will be handled via intelligent bots
What does this mean for the travel industry?
- Does somebody who occasionally needs travel data have to build a crawler for every single website? No - there are specialized providers, such as Travelfusion or ourselves, offering that data to third parties through a web service
- Meta-search engines using advanced crawler technology for specific sectors such as aviation, hospitality, shuttles or bus services will advance further, but they are only the first phase in the evolution of travel technology
- Interlinking these sectors via easily accessible data and overcoming the remaining hurdles to booking all-in-one through bots will trigger the next evolutionary phase: intermodal journey planners.
NB: This is an analysis by Veit Blumschein, CEO of FromAtoB.