Data Analysis and Visualization for Yelp



Motivation

The United States is a culturally diverse country: people from all over the world come here to work, study, and live. Of the many facets of that diversity, food is one of the most interesting to explore because it is so deeply involved in daily life. On a large scale, we are interested in how the number of restaurants is distributed across states. At a finer scale, the distributions of restaurants’ prices, ratings, review counts, and types give us a rough picture of our own food choices. Going further, we hope to uncover connections between restaurants based on their types.


These questions led us to Yelp, a popular website that lists restaurant information. Our data comes from Yelp’s API and its website, and both our analysis and our visualizations are presented in the following sections.

Distribution of the Number of Restaurants in the USA


Mapping the number of restaurants reveals large spatial differences. Coastal areas tend to have more restaurants, while inland states have fewer. For the analysis that follows, we chose three cities to represent the western, central, and eastern United States: San Francisco, Albuquerque, and Detroit.


Data Collection and Processing


Data Type and Description:


Variable          Type         Description
Area              String       The area the restaurant belongs to. Example: Detroit-Riverside
Claimed Status    Bool         Whether the restaurant has been claimed by an owner
Health Inspect    Float        Health inspection grade
Id                String       A unique id for each restaurant
Latitude          Float        Geographic coordinate
Longitude         Float        Geographic coordinate
Price             Categorical  $: Inexpensive; $$: Moderate; $$$: Pricey; $$$$: Higher end
Rating            Float        The rating of the restaurant, from 0 to 5
Related Business  String       The id of another restaurant that Yelp recommends as one other customers also viewed
Review            Float        Number of reviews the restaurant has, which indicates its popularity
Tag               String       Label for the restaurant. A label can be a food kind, such as dessert, or a regional cuisine, such as “Chinese”. Each restaurant can have multiple labels.
Title             String       The name of the restaurant
Url               String       The URL of the restaurant

Strategy:

  1. Web scraping

    At first, we tried to use the Yelp API. Unfortunately, it returns at most 20 restaurant records per search call, which is far from enough for analysis. To collect enough data, we instead scraped the website page by page.

    In total, we scraped over 20,000 pages. During the process, we found that using a city name as the search term is not a wise choice, because at most 1,000 records are displayed for any search term, which severely limits the amount of data available for analysis. Our solution is to split each city into multiple sub-areas; for instance, Detroit is split into Downtown Detroit, Detroit Riverside, and so on. Within a small area, the number of restaurants does not exceed the upper bound on search records, so we can get almost all restaurants in a city by combining the records scraped from each sub-area. This method has a downside: we cannot use it to scrape very large cities such as NYC, since there is no guarantee that even a small area contains fewer than 1,000 restaurants.
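
The sub-area strategy can be sketched roughly as follows. This is a minimal illustration, not our actual scraper: `fetch_page` is a hypothetical callable standing in for whatever code downloads and parses one result page, `PAGE_SIZE` is an assumed page length, and removing duplicates by restaurant id is an added assumption for restaurants that appear in more than one sub-area.

```python
from typing import Callable, Dict, Iterable, List

PAGE_SIZE = 10       # assumed number of results per page (hypothetical)
MAX_RESULTS = 1000   # cap on records shown for one search term, as noted above

Record = Dict[str, str]
# Hypothetical fetcher: (search area, result offset) -> records on that page
FetchPage = Callable[[str, int], List[Record]]


def scrape_area(area: str, fetch_page: FetchPage) -> List[Record]:
    """Walk the result pages of one sub-area until they run out or hit the cap."""
    records: List[Record] = []
    for offset in range(0, MAX_RESULTS, PAGE_SIZE):
        page = fetch_page(area, offset)
        if not page:          # an empty page means there are no more results
            break
        records.extend(page)
    return records


def scrape_city(areas: Iterable[str], fetch_page: FetchPage) -> List[Record]:
    """Combine the records of all sub-areas, keeping one record per id."""
    by_id: Dict[str, Record] = {}
    for area in areas:
        for record in scrape_area(area, fetch_page):
            # Assumption: every record carries an "id" field (see the Id variable)
            by_id.setdefault(record["id"], record)
    return list(by_id.values())
```

A call such as `scrape_city(["Downtown Detroit", "Detroit Riverside"], fetch_page)` would then return the combined records for the city. If a sub-area returns exactly `MAX_RESULTS` records, it has probably hit the cap and should be split further.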

  2. Data Processing

    After getting the data from the web pages, we first reformat some of the fields. For instance, we treat the number of reviews as a numeric value and price as a factor. We also drop records with missing values. Finally, we stack all the variables into a single data frame.

    For convenience, we group the raw tags into 9 new categories. The categories and the raw tags assigned to each are listed below.
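
As an illustration of these cleaning steps, here is a small sketch in plain Python. It assumes each scraped record is a dictionary of strings; the lowercase field names, the comma-stripping of review counts, and the integer codes 1 to 4 standing in for the price factor are choices made for this example only, and malformed values are treated the same way as missing ones.

```python
from typing import Dict, Iterable, List, Optional

PRICE_LEVELS = ["$", "$$", "$$$", "$$$$"]        # ordered price categories
REQUIRED = ("id", "rating", "review", "price")   # assumed field names


def clean_record(raw: Dict[str, Optional[str]]) -> Optional[Dict[str, object]]:
    """Return a typed record, or None if a field is missing or malformed."""
    values = {key: (raw.get(key) or "").strip() for key in REQUIRED}
    if not all(values.values()) or values["price"] not in PRICE_LEVELS:
        return None
    try:
        return {
            "id": values["id"],
            "rating": float(values["rating"]),
            # Review counts become numeric; thousands separators are removed
            "review": float(values["review"].replace(",", "")),
            # Price becomes an ordered level code: 1 for "$" up to 4 for "$$$$"
            "price": PRICE_LEVELS.index(values["price"]) + 1,
        }
    except ValueError:
        return None


def clean_all(raw_records: Iterable[Dict[str, Optional[str]]]) -> List[Dict[str, object]]:
    """Clean every record and drop the ones that could not be cleaned."""
    cleaned = (clean_record(raw) for raw in raw_records)
    return [record for record in cleaned if record is not None]
```

The list returned by `clean_all` has one dictionary per restaurant with identical keys, so it can be loaded directly into a data frame.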


Tags:

Categories               Raw Tags
Chinese                  Chinese, Cantonese, Szechuan, Shanghainese, DimSum, etc.
Alcohol                  WhiskeyBars, ChampagneBars, CocktailBars, Beer, etc.
JapaneseKorean           Japanese, Korean, SushiBars, Izakaya, Teppanyaki, etc.
American                 American(New), American(Traditional), FastFood, ChickenWings, etc.
South American(Mexican)  Mexican, Tacos, Tex-Mex, LatinAmerican, Salvadoran, etc.
Southeast Asian          Thai, Laotian, Vietnamese, Malaysian, Singaporean, etc.
Indian                   Indian, Bangladeshi, Himalayan/Nepalese, etc.
European                 Italian, French, Greek, Belgian, etc.
Dessert                  IceCream&FrozenYogurt, JuiceBars&Smoothies, Cupcakes, etc.
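
The grouping can be expressed as a simple lookup, sketched below. The dictionary contains only the raw tags named in the table (the tags covered by "etc" are not reproduced), the category spellings are normalized for the example, and ignoring tags that are not in the lookup is an assumption of this sketch rather than a documented rule.

```python
from typing import Dict, Iterable, List

# Category -> raw tags, limited to the examples listed in the table
TAG_GROUPS: Dict[str, List[str]] = {
    "Chinese": ["Chinese", "Cantonese", "Szechuan", "Shanghainese", "DimSum"],
    "Alcohol": ["WhiskeyBars", "ChampagneBars", "CocktailBars", "Beer"],
    "JapaneseKorean": ["Japanese", "Korean", "SushiBars", "Izakaya", "Teppanyaki"],
    "American": ["American(New)", "American(Traditional)", "FastFood", "ChickenWings"],
    "South American(Mexican)": ["Mexican", "Tacos", "Tex-Mex", "LatinAmerican", "Salvadoran"],
    "Southeast Asian": ["Thai", "Laotian", "Vietnamese", "Malaysian", "Singaporean"],
    "Indian": ["Indian", "Bangladeshi", "Himalayan/Nepalese"],
    "European": ["Italian", "French", "Greek", "Belgian"],
    "Dessert": ["IceCream&FrozenYogurt", "JuiceBars&Smoothies", "Cupcakes"],
}

# Inverted lookup: raw tag -> category
RAW_TO_CATEGORY: Dict[str, str] = {
    raw: category for category, raws in TAG_GROUPS.items() for raw in raws
}


def categorize(tags: Iterable[str]) -> List[str]:
    """Map a restaurant's raw tags to a sorted list of distinct categories."""
    return sorted({RAW_TO_CATEGORY[tag] for tag in tags if tag in RAW_TO_CATEGORY})
```

Because a restaurant can carry several tags, `categorize` may return more than one category for a single restaurant.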

Data Exploration

  1. Descriptive Statistics
  2. Network Analysis

Future Improvement