Around the World in 80 Years

Problem statement

My wife is an avid traveler. To help her find her next adventure, I used some data science magic to build an application that locates top tourist attractions based on the desired experience. For example, if she wants "forest" to be the theme of her next trip, she types the word "forest" into the application, and it returns top locations that are famous for forests. The search term can be any word or combination of words, such as "intellectual and artistic", "exotic adventure", or "family vacation cruise".

Strategy

First, I find a collection of locations to serve as candidates. Then I apply an NLP (Natural Language Processing) technique to the description of each location and calculate its vector representation. Finally, I compare these vectors to the vector representation of the input (such as "forest") using cosine similarity. The higher the similarity, the closer the location is to the desired experience.

Data

I obtained a list of top tourist locations around the world. For each entry in the list, I used the Wikipedia API to download its text description and the Google API for images. I stored each location's name and description in a pandas DataFrame, shown below.

There are 1001 rows ("location"), and the description ("details") of each location is 200 words long on average. Because the corpus (the collection of all texts of interest) is small, I decided to use a pre-trained model instead of training my own.

Algorithm

I used Google's pre-trained word2vec model for vectorization. It was trained on roughly 100 billion words from Google News articles. The model contains about 3 million words and phrases, such as "tree" and "Old_Testament", and each one is represented by a 300-dimensional vector. For example, running model['tree'] produces

array([ 0.484375 , 0.12255859, -0.15722656, 0.03466797,... ..., -0.04296875, 0.01916504], dtype=float32)
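For readers who want to reproduce this lookup, here is a minimal sketch of one way to load the vectors. It assumes the gensim library and the commonly distributed file name GoogleNews-vectors-negative300.bin; the post does not say which loader or path was actually used, and the helper names load_google_news_vectors and lookup are made up for illustration.

```python
def load_google_news_vectors(path="GoogleNews-vectors-negative300.bin"):
    """Load pre-trained word2vec vectors from a binary file (path is an assumption)."""
    # Imported lazily so the rest of this sketch works without gensim installed.
    from gensim.models import KeyedVectors

    # binary=True because these vectors are distributed in word2vec's binary format.
    return KeyedVectors.load_word2vec_format(path, binary=True)


def lookup(model, word):
    """Return the vector for `word`, or None if the word is not in the vocabulary."""
    return model[word] if word in model else None
```

With the file on disk, model = load_google_news_vectors() followed by lookup(model, "tree") should return a 300-dimensional array like the one above, and lookup returns None for a word the model does not know.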

As pointed out above, I have two groups of text data to vectorize: (i) the "details" of each location and (ii) the desired experience. For (i), the input to the model is always a collection of words, so I average the vectors of all words for which the model returns one. (Note that some words do not have a vector representation.) For (ii), the input is sometimes a single word such as "forest" and sometimes a collection of words such as "family vacation cruise". This gives an averaged vector A from (i) and a vector B from (ii), and I calculate the cosine similarity between the two.
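The averaging and comparison steps can be sketched as follows. This is an illustration rather than the project's actual code: the three-dimensional toy_model is invented so the example runs without the real 300-dimensional model, and plain Python lists are used instead of NumPy arrays to keep it dependency-free.

```python
import math


def average_vector(words, model):
    """Average the vectors of all in-vocabulary words; None if no word is known."""
    vectors = [model[w] for w in words if w in model]  # skip words without a vector
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)  # undefined if either vector is all zeros


# Toy 3-dimensional "embeddings", made up for illustration (not real word2vec values).
toy_model = {
    "forest": [1.0, 0.0, 0.0],
    "tree": [0.8, 0.6, 0.0],
    "museum": [0.0, 0.0, 1.0],
}

# (i) location details: "a", "with" and "every" have no vector and are skipped.
A = average_vector("a forest with every tree".split(), toy_model)
# (ii) desired experience.
B = average_vector(["forest"], toy_model)
similarity = cosine_similarity(A, B)
```

Here A is the average of the "forest" and "tree" vectors, and the similarity comes out at about 0.95 because both words point in nearly the same direction as the query.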

Example

For example, the cosine similarity between "classy" and "Borghese Gallery Italy" is 0.2822; at the other end of the spectrum, the similarity between "classy" and "Lapland Ethnic Region Finland" is -0.0182. By this measure, the former location is "classier" than the latter.

Web application

I created a web application based on the algorithm above. The user enters a desired experience, and the app returns the top locations ranked by the cosine similarity between the vectors. The following is a screenshot of a sample search.
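The ranking step behind such a search might look like the sketch below. The function name top_locations, the parameter k, and the two-dimensional vectors are hypothetical stand-ins; the real app would use the averaged 300-dimensional description vectors, and the post does not show how its Flask handler is written.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_locations(query_vector, location_vectors, k=3):
    """Return the k (name, score) pairs with the highest cosine similarity."""
    scored = [
        (name, cosine_similarity(query_vector, vector))
        for name, vector in location_vectors.items()
    ]
    # Highest similarity first, then keep only the first k entries.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hand-made 2-d vectors standing in for averaged description vectors (illustrative only).
location_vectors = {
    "Location A": [0.9, 0.1],
    "Location B": [0.1, 0.9],
    "Location C": [0.7, 0.7],
}
results = top_locations([1.0, 0.0], location_vectors, k=2)
```

A web handler could call a function like this with the vector of the user's search term and render the returned names and scores.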

Summary

The project made substantial use of the following skills:

APIs, Natural Language Processing (NLP), Flask

The following could be explored in the future:

More detailed descriptions of each location, including its culture, history, and economy
Better visuals such as the use of D3.js
A rating system that allows the user to provide feedback on the results