Applying clustering techniques on recipes and ingredients

In the last two weeks at the Metis Data Science Bootcamp we dived into NoSQL databases and natural language processing (week 7), unsupervised learning algorithms (weeks 7-8), and dimensionality reduction, topic modeling and similarity (week 8), all while working on an individual project. For mine, I chose to apply unsupervised learning techniques to cluster text data.

First, I used the Pearson Kitchen Manager API to cluster almost 500 recipes from around the world based on their ingredients. After applying several algorithms, I extracted the top keywords for some of the clusters and displayed them as word clouds.
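To give a feeling for what this step involves, here is a minimal sketch in plain Python. It is not the code I ran on the real data: the recipes are invented toy examples, the binary ingredient vectors and the tiny k-means loop are assumptions chosen for illustration, and the starting centroids are fixed by hand so the result is reproducible.

```python
# Hypothetical toy data: each recipe is reduced to a set of ingredient names.
recipes = {
    "margherita": {"flour", "yeast", "tomato", "mozzarella", "basil"},
    "focaccia": {"flour", "yeast", "olive oil", "salt"},
    "white bread": {"flour", "yeast", "salt", "water"},
    "green curry": {"coconut milk", "chili", "lime", "basil", "chicken"},
    "tom kha": {"coconut milk", "lime", "chili", "chicken", "galangal"},
    "laksa": {"coconut milk", "chili", "noodles", "lime"},
}

# One dimension per ingredient, in alphabetical order.
vocab = sorted(set().union(*recipes.values()))


def vectorize(ingredients):
    """Binary bag-of-ingredients vector for one recipe."""
    return [1.0 if word in ingredients else 0.0 for word in vocab]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, init_centroids, n_iter=20):
    """Plain k-means: assign each point to its nearest centroid, then re-average."""
    centroids = [list(c) for c in init_centroids]
    labels = [0] * len(points)
    for _ in range(n_iter):
        labels = [
            min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k]))
            for p in points
        ]
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def top_keywords(centroid, n=3):
    """Ingredients with the largest centroid weight (ties broken alphabetically)."""
    ranked = sorted(zip(vocab, centroid), key=lambda pair: (-pair[1], pair[0]))
    return [word for word, _ in ranked[:n]]


names = list(recipes)
X = [vectorize(recipes[name]) for name in names]

# Assumption: seed the two clusters with one bread-like and one curry-like recipe.
labels, centroids = kmeans(X, init_centroids=[X[0], X[3]])

for k, centroid in enumerate(centroids):
    members = [name for name, lab in zip(names, labels) if lab == k]
    print(k, members, top_keywords(centroid))
```

The top keywords of each centroid are the kind of terms that would end up largest in a word cloud for that cluster.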

Second, I switched to the Yummly API to cluster the ingredients of 17,000 Italian recipes from 1,000 blogs and cooking websites. This time, the basic idea was to work only with ingredients that are neither too rare nor too common, so I selected 71 ingredients with the help of a variance threshold. Then I built a relationship matrix between these ingredients using different scoring methods (Jaccard similarity, joint probability score, etc.). After trying several combinations of clustering algorithms and input matrices, I put one of the results in this d3 visualization:

As expected, items like (yeast, flour, milk) or (mozzarella, tomato sauce, pizza doughs) are grouped together, and the clusters themselves can help when compiling a shopping list.
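The ingredient pipeline can be sketched along the same lines. Again, this is an illustration under stated assumptions rather than the code behind the visualization: the recipes are made up, the variance cutoff of 0.2 and the similarity cutoff of 0.4 are arbitrary, and the grouping step simply links any two ingredients whose Jaccard similarity is above the cutoff (a simple single-linkage style grouping, which is only one of many possible choices).

```python
from itertools import combinations

# Hypothetical toy recipes; the real data came from the Yummly API.
recipes = [
    {"flour", "yeast", "milk", "salt"},
    {"flour", "yeast", "milk", "salt", "sugar"},
    {"flour", "yeast", "salt", "saffron"},
    {"mozzarella", "tomato sauce", "pizza dough", "salt"},
    {"mozzarella", "tomato sauce", "pizza dough", "salt"},
    {"mozzarella", "tomato sauce", "salt", "milk"},
]


def occurrences(ingredient):
    """Indices of the recipes that contain the ingredient."""
    return {i for i, recipe in enumerate(recipes) if ingredient in recipe}


def variance(ingredient):
    """Variance of the 0/1 'is in the recipe' indicator: p * (1 - p)."""
    p = len(occurrences(ingredient)) / len(recipes)
    return p * (1 - p)


def jaccard(a, b):
    """Recipes containing both ingredients divided by recipes containing either."""
    ra, rb = occurrences(a), occurrences(b)
    union = ra | rb
    return len(ra & rb) / len(union) if union else 0.0


def cluster(items, similarity, threshold):
    """Group items that are connected by at least one similarity >= threshold."""
    parent = {item: item for item in items}

    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x

    for (a, b), score in similarity.items():
        if score >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for item in items:
        groups.setdefault(find(item), []).append(item)
    return sorted(sorted(group) for group in groups.values())


# Step 1: drop ingredients that are too rare or too common (low variance).
MIN_VARIANCE = 0.2  # assumed cutoff, for illustration only
all_ingredients = sorted(set().union(*recipes))
selected = [ing for ing in all_ingredients if variance(ing) >= MIN_VARIANCE]

# Step 2: pairwise relationship scores between the selected ingredients.
similarity = {(a, b): jaccard(a, b) for a, b in combinations(selected, 2)}

# Step 3: group ingredients that tend to appear in the same recipes.
clusters = cluster(selected, similarity, threshold=0.4)
print(selected)
print(clusters)
```

In this toy example salt is dropped because it appears in every recipe, sugar and saffron are dropped because they appear only once, and the remaining ingredients split into a baking group and a pizza group.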

In the end, I learned that tuning a clustering algorithm is more an art than a science: you need to explore unconventional approaches and combine different techniques based on your intuition and the specific question you are trying to answer.
