Business Exploration and Interactive Visualization for Yelp



Motivation

Yelp is a popular website where users can search for a business and read customer reviews and basic information about it. It helps individuals decide where to eat and shop. Because Yelp only lists information business by business, it is very hard for users to go through the listings of all businesses and form a high-level view. As a result, it offers little help to people who want to grow their business or improve their services. We therefore collect this information from Yelp and provide advanced analytics on top of it.


In our analysis, we choose restaurants as the business entity. We want to give restaurant owners an overview of how to choose a site to start or expand a business, how to set the right price, and how to improve their business through attributes such as wifi and parking.

Distribution of the Number of Restaurants in the USA


The map of restaurant counts reveals large spatial differences. Coastal areas tend to have more restaurants, while inland states have fewer. For the analysis demo that follows, we choose three cities: San Francisco, Albuquerque and Detroit.


Our data come from Yelp’s API and its website. Both our analysis and our visualizations are presented in the following sections.

Data Collection and Processing


Data Type and Description:


Variable            Description
Area                String. The area the restaurant belongs to. Example: Detroit-Riverside
Claimed Status      Bool. Whether the restaurant has been claimed by an owner
Health Inspect      Float. Health inspection grade
Id                  String. A unique id for each restaurant
Latitude            Float. Geographic coordinate
Longitude           Float. Geographic coordinate
Price               Categorical. $: Inexpensive; $$: Moderate; $$$: Pricey; $$$$: Higher end
Rating              Float. The rating of the restaurant, range 0-5
Related Business    String. The id of another restaurant that Yelp recommends as also viewed by other customers
Review              Float. Number of reviews the restaurant has; indicates the popularity of the restaurant
Tag                 String. Label for the restaurant. A label can describe the kind of food, such as dessert, or the regional cuisine, such as “Chinese”. Each restaurant can have multiple labels
Title               String. The name of the restaurant
Url                 String. The url link of the restaurant

Strategy:

  1. Web scraping

    At first, we tried to use the Yelp API. Unfortunately, the Yelp API returns at most 20 restaurant records per search call, which is far from enough for analysis. To get enough data, we turned to scraping the website page by page instead.

    In total, we scraped over 20000 pages. During the search process, we found that using “City” as the search term is not a wise choice, because at most 1000 records are displayed for any search term, which severely limits the amount of data we can get. Our solution is to split a city into multiple sub-areas; for instance, Detroit is split into Downtown Detroit, Detroit Riverside, etc. In a small area, the number of restaurants will not exceed the upper bound on search records, so we can get almost all restaurants in a city by adding up the records scraped from each small area. The downside of this method is that we cannot use it to scrape large cities like NYC, since there is no guarantee that even a small area has fewer than 1000 records.
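
    The sub-area strategy can be sketched roughly as follows. This is an illustration only, not our actual scraper: fetch_page is a hypothetical callable standing in for the code that downloads and parses one results page, and de-duplicating on the restaurant id is an assumption made in case neighbouring sub-areas overlap. Only the 1000-record cap comes from the description above.

```python
from typing import Callable, Dict, Iterable, List, Tuple

RESULT_CAP = 1000  # upper limit of records displayed for any one search term


def scrape_city(
    sub_areas: Iterable[str],
    fetch_page: Callable[[str, int], List[Dict]],
    result_cap: int = RESULT_CAP,
) -> Tuple[Dict[str, Dict], List[str]]:
    """Collect the restaurants of a city by searching each sub-area separately.

    fetch_page(area, offset) is hypothetical: it should return the parsed
    records of one results page, or an empty list when there are no more.
    Returns the records keyed by id and the sub-areas that reached the cap.
    """
    restaurants: Dict[str, Dict] = {}
    capped_areas: List[str] = []
    for area in sub_areas:
        offset = 0
        while offset < result_cap:
            page = fetch_page(area, offset)
            if not page:
                break
            for record in page:
                # Assumption: sub-areas may overlap, so keep the first copy per id.
                restaurants.setdefault(record["id"], {**record, "area": area})
            offset += len(page)
        if offset >= result_cap:
            # The area may hold more restaurants than the search will display.
            capped_areas.append(area)
    return restaurants, capped_areas
```

    A sub-area that ends up in capped_areas is a hint that it should be split further, which is the situation we would expect in a city like NYC.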

  2. Data Processing

    After getting the data from the web pages, we first reformat some of the results. For instance, we treat the number of reviews as a numeric value and price as a factor. We also drop records with missing values. Finally, we stack all the variables into a dataframe.
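
    A minimal sketch of this cleaning step is shown below, using only the Python standard library. The raw field format (for example "128 reviews"), the encoding of price as an ordered level from 1 to 4, and the choice to also drop records that fail to parse are assumptions made for illustration; the resulting list of dictionaries could then be loaded into a dataframe.

```python
from typing import Dict, List, Optional

# Ordered price levels, taken from the variable table.
PRICE_LEVELS = ["$", "$$", "$$$", "$$$$"]
REQUIRED = ("id", "title", "price", "rating", "review")


def clean_record(raw: Dict[str, Optional[str]]) -> Optional[Dict]:
    """Return a typed copy of one scraped record, or None if it is unusable."""
    # Ignore records with a missing value.
    if any(raw.get(key) in (None, "") for key in REQUIRED):
        return None
    if raw["price"] not in PRICE_LEVELS:
        return None
    try:
        return {
            "id": raw["id"],
            "title": raw["title"],
            # Price as a factor: 1 for "$" up to 4 for "$$$$" (assumed encoding).
            "price": PRICE_LEVELS.index(raw["price"]) + 1,
            "rating": float(raw["rating"]),
            # Assumption: the scraped text looks like "128 reviews".
            "review": float(raw["review"].split()[0]),
        }
    except (ValueError, IndexError):
        # Assumption: values that cannot be parsed are treated as missing.
        return None


def clean_records(raw_records: List[Dict]) -> List[Dict]:
    """Clean every record and keep only the usable ones."""
    cleaned = (clean_record(raw) for raw in raw_records)
    return [row for row in cleaned if row is not None]
```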

    For convenience, we group our raw tag set into 9 new categories. The raw tags and our grouping are listed below.


Tags:

Categories                 Raw Tags
Chinese                    Chinese, Cantonese, Szechuan, Shanghainese, DimSum, etc
Alcohol                    WhiskeyBars, ChampagneBars, CocktailBars, Beer, etc
JapaneseKorean             Japanese, Korean, SushiBars, Izakaya, Teppanyaki, etc
American                   American(New), American(Traditional), FastFood, ChickenWings, etc
South American(Mexican)    Mexican, Tacos, Tex-Mex, LatinAmerican, Salvadoran, etc
Southeast Asian            Thai, Laotian, Vietnamese, Malaysian, Singaporean, etc
Indian                     Indian, Bangladeshi, Himalayan/Nepalese, etc
Europe                     Italian, French, Greek, Belgian, etc
Dessert                    IceCream&FrozenYogurt, JuiceBars&Smoothies, Cupcakes, etc
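
One way to apply such a grouping is a lookup from raw tag to category, sketched below. The mapping contains only the raw tags listed in the table; the tags hidden behind "etc" are not filled in, and the function name categories_for is hypothetical. Because a restaurant can carry several tags, it can fall into more than one category.

```python
from typing import Dict, Iterable, List, Set

# Only the raw tags named in the table; the "etc" entries are left out.
CATEGORY_TAGS: Dict[str, List[str]] = {
    "Chinese": ["Chinese", "Cantonese", "Szechuan", "Shanghainese", "DimSum"],
    "Alcohol": ["WhiskeyBars", "ChampagneBars", "CocktailBars", "Beer"],
    "JapaneseKorean": ["Japanese", "Korean", "SushiBars", "Izakaya", "Teppanyaki"],
    "American": ["American(New)", "American(Traditional)", "FastFood", "ChickenWings"],
    "South American(Mexican)": ["Mexican", "Tacos", "Tex-Mex", "LatinAmerican", "Salvadoran"],
    "Southeast Asian": ["Thai", "Laotian", "Vietnamese", "Malaysian", "Singaporean"],
    "Indian": ["Indian", "Bangladeshi", "Himalayan/Nepalese"],
    "Europe": ["Italian", "French", "Greek", "Belgian"],
    "Dessert": ["IceCream&FrozenYogurt", "JuiceBars&Smoothies", "Cupcakes"],
}

# Invert the table so each raw tag points at its category.
TAG_TO_CATEGORY: Dict[str, str] = {
    tag: category for category, tags in CATEGORY_TAGS.items() for tag in tags
}


def categories_for(tags: Iterable[str]) -> Set[str]:
    """Return the categories of a restaurant; unlisted tags are skipped."""
    return {TAG_TO_CATEGORY[tag] for tag in tags if tag in TAG_TO_CATEGORY}
```

For example, a restaurant tagged Cantonese and CocktailBars would be assigned to both the Chinese and the Alcohol categories.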

Data Exploration

  1. Descriptive Statistics
  2. Network Analysis

Future Improvement