Yelp Dataset

The Yelp academic dataset consists of a sample of data from the Phoenix, AZ area. It includes information about 11,537 businesses, 8,282 checkins, 43,873 users and 229,907 reviews. The goal of this post is to use this data to study the day of the week on which people post reviews. Are some days preferred over others, or are all days equally likely? If there is a preferred day on which most reviews are posted, how statistically significant is that finding? Does the preferred day change with business category, and is that change statistically significant? In the second part of the post I ask the same questions of the checkins: is there a preferred day for checkins, is it statistically significant, and does it vary with category?

Methodology

First, I create a MySQL database with four tables containing the business, review, user and checkin data. I use Python to read the JSON-formatted data and load it into the database. The analysis requires counting the reviews posted on each day of the week and estimating the standard deviation associated with each count. Each count is normalized by the total number of reviews (or checkins for the second part) and multiplied by 100 to express it as a percentage. The standard deviation is estimated using the jackknife resampling method.
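The counting and normalization step can be sketched in a few lines of Python. This is a minimal illustration rather than the code used for the post: it assumes the review dates have already been pulled out of the database as ISO-formatted strings (`YYYY-MM-DD`), and the function name `day_of_week_percent` and the sample dates are made up for the example.

```python
from collections import Counter
from datetime import date

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]

def day_of_week_percent(date_strings):
    """Return the percentage of records that fall on each day of the week."""
    # date.weekday() returns 0 for Monday through 6 for Sunday.
    counts = Counter(date.fromisoformat(d).weekday() for d in date_strings)
    total = sum(counts.values())
    # Normalize by the total count and multiply by 100 to get percent.
    return {DAYS[i]: 100.0 * counts.get(i, 0) / total for i in range(7)}

# Made-up dates for illustration only (not taken from the Yelp data).
sample_dates = ["2012-01-02", "2012-01-03", "2012-01-09", "2012-01-07"]
percent = day_of_week_percent(sample_dates)
```

The seven percentages always sum to 100, so a day that receives exactly its "fair share" sits at 100/7, or about 14.3 percent.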

Jackknife Sampling Error Estimates

The jackknife method is a way to estimate the standard error (the standard deviation of the estimated mean) of a sample. First, the review data is divided into M=8 blocks. For each block m, we exclude that block from the full data and estimate the mean \bar{\rho}_m from the remaining data. After repeating this for all M blocks, the standard error can be calculated as

\delta\rho = \sqrt{\frac{M-1}{M} \sum_{m=1}^{M} (\bar{\rho}_m - \bar{\rho})^2}
Here \rho is the normalized average count of reviews (or checkins).
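The leave-one-block-out procedure can be sketched with NumPy as follows. This is an illustrative sketch, not the code behind the post: the function name `jackknife_error` is hypothetical, the input is assumed to be an array of weekday indices (0 for Monday through 6 for Sunday), the blocks are taken to be contiguous and roughly equal in size, and \bar{\rho} is taken to be the mean of the M leave-one-out estimates, which the text does not specify.

```python
import numpy as np

def jackknife_error(weekdays, day, n_blocks=8):
    """Jackknife error on the percentage of records that fall on `day`."""
    weekdays = np.asarray(weekdays)
    # Split the record indices into n_blocks contiguous blocks.
    blocks = np.array_split(np.arange(len(weekdays)), n_blocks)
    leave_one_out = []
    for block in blocks:
        # Drop one block and recompute the percentage on the rest.
        keep = np.ones(len(weekdays), dtype=bool)
        keep[block] = False
        leave_one_out.append(100.0 * np.mean(weekdays[keep] == day))
    leave_one_out = np.array(leave_one_out)
    # Assumption: rho-bar is the mean of the leave-one-out estimates.
    center = leave_one_out.mean()
    m = n_blocks
    return np.sqrt((m - 1) / m * np.sum((leave_one_out - center) ** 2))

# Synthetic, uniformly random weekdays for illustration only.
rng = np.random.default_rng(0)
fake_weekdays = rng.integers(0, 7, size=8000)
err = jackknife_error(fake_weekdays, day=0)
```

When every block holds a single record, this expression reduces to the familiar standard error of the mean, which is a convenient sanity check on the implementation.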

Selecting the Categories

The business data has more than 500 unique categories, and each business is typically tagged with more than one category. Dividing the sample by every category would leave a small sample size for each. Moreover, because each business can carry multiple tags, most categories carry redundant (degenerate) information. To circumvent these issues, the top 14 categories are selected out of the 500+, and each business is represented by the one category among its tags that occurs most frequently in the dataset. E.g., if a business is tagged with both 'salons' and 'local business', it is represented by 'local business' only, since 'local business' is a more frequent category than 'salons' in the dataset.
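One way to read this selection rule is sketched below in Python. It is an interpretation, not the post's actual code: the function name `assign_primary_category` and the toy business IDs are hypothetical, and the sketch assumes that a business with no tag among the top categories is simply dropped, which the text does not state.

```python
from collections import Counter

def assign_primary_category(business_categories, top_n=14):
    """Map each business to its most frequent tag among the top_n categories."""
    # Count how many businesses carry each category tag.
    freq = Counter(cat for cats in business_categories.values() for cat in cats)
    top = {cat for cat, _ in freq.most_common(top_n)}
    primary = {}
    for business_id, cats in business_categories.items():
        candidates = [c for c in cats if c in top]
        # Assumption: businesses with no top-category tag are left out.
        if candidates:
            primary[business_id] = max(candidates, key=lambda c: freq[c])
    return primary

# Toy data for illustration only (not taken from the Yelp data).
toy = {
    "b1": ["Salons", "Local Business"],
    "b2": ["Local Business", "Pets"],
    "b3": ["Salons"],
    "b4": ["Local Business"],
    "b5": ["Dentists"],
}
primary = assign_primary_category(toy, top_n=2)
```

In the toy data 'Local Business' appears three times and 'Salons' twice, so `b1` is assigned 'Local Business' even though it also carries the 'Salons' tag, mirroring the example in the text.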

Results: Reviews

The normalized counts of reviews posted on each day are shown in the figure below (two panels), both for the full data and for each category separately. The full dataset shows that people post the most reviews on Monday, followed by Tuesday and Wednesday. Fewer reviews are posted on Friday and Saturday. The errors on the estimates are small owing to the large size of the review dataset, so that Monday is more popular than Wednesday by at least 3 standard deviations. The dashed line shows the case in which every day of the week receives an equal share of the reviews (100/7, or about 14.3 percent). One measure of a day's significance is how far it lies from this "equally likely" line. E.g., Monday, the most popular day to post reviews, is more than 4 standard deviations above the equally likely case.
The figure also shows definite trends across the 14 categories. E.g., the 'Automotive', 'Food', 'Home Services', 'Local Services' and 'Pets' categories show that Wednesday (not Monday) is the most popular day to post reviews. Compared to the "equally likely" case, this is about 2-3 standard deviations higher. Most other categories, e.g., 'Active Life', 'Arts and Entertainment', 'Restaurants' and 'Nightlife', have Monday as the most popular day to post reviews.
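The comparison with the "equally likely" line amounts to dividing a day's excess over 100/7 percent by its jackknife error. A minimal Python sketch follows; the function name `sigma_above_equal_share` is hypothetical and the numbers passed to it are placeholders chosen for illustration, not values measured from the Yelp data.

```python
# Share each day would receive if all seven days were equally likely (percent).
EQUAL_SHARE = 100.0 / 7.0

def sigma_above_equal_share(percent, error):
    """Number of standard deviations a day's share lies above the equal share."""
    return (percent - EQUAL_SHARE) / error

# Placeholder inputs for illustration only (not results from the post).
n_sigma = sigma_above_equal_share(percent=15.0, error=0.25)
```

A positive value means the day receives more than its equal share; a negative value means it receives less.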

Results: Checkins

With the checkins dataset, one can ask the same questions, namely which day is the most popular for checkins and how this varies across categories. The checkins data is roughly a factor of 30 smaller than the review data, and hence most of the results from the checkins data are not statistically significant. However, the figure below (two panels) shows that Saturday and Friday are the two most popular days for checkins, both for the full dataset and for most of the 14 categories. Notable exceptions are 'Health and Medical', 'Hotels and Travel' and 'Local Services'. As mentioned above, none of these trends is statistically significant because the checkins dataset is small; more checkin data is needed for further study.

Summary

The Yelp dataset suggests that Monday is the most popular day to post reviews. This finding is statistically significant both with respect to the "equally likely" case and compared to the next most popular day, Wednesday. There is also some interesting variation with category, most of which is statistically significant. On the other hand, people typically like to check in on Saturday, although, because of the small size of the checkin data, no trend in the checkin data is statistically significant.