The Yelp academic dataset consists of a sample of data
from the Phoenix, AZ area. It includes information about 11,537 businesses,
8,282 checkins, 43,873 users and 229,907 reviews. The goal of this
post is to use this data to study the day of the week on which people
post reviews. Are some days preferred over others, or are all
days equally likely? If there is a preferred day when most
reviews are posted, how statistically significant is that
finding? Does the preferred day change with business category, and is
that change statistically significant? In the second part of the post
I ask the same questions of the checkins: is there a
preferred day for checkins, is it statistically significant, and
does it vary with category?
First, I create a MySQL database with four tables
containing the business, review, user and checkin data. I use Python to read
in the JSON-formatted data and load it into the SQL database. The
analysis requires counting the reviews posted on each day of the
week and estimating the standard deviation of each
count. The count is normalized by the total number
of reviews (or checkins, for the second part) and multiplied by
100 to express it as a percentage. The standard deviation
is calculated using the jackknife resampling method.
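The counting and error-estimation steps described above can be sketched in Python roughly as follows. This is a minimal sketch, not the code used for the post: the function names and the `n_groups` parameter are hypothetical, the dates are assumed to be 'YYYY-MM-DD' strings, and since the post does not say which jackknife variant was used, a delete-one-group jackknife is assumed here.

```python
from datetime import date

import numpy as np


def weekday_of(date_string):
    """Return 0 (Monday) .. 6 (Sunday) for an assumed 'YYYY-MM-DD' string."""
    return date.fromisoformat(date_string).weekday()


def weekday_percent(weekdays):
    """Percentage of events falling on each day of the week (sums to 100)."""
    counts = np.bincount(weekdays, minlength=7)
    return 100.0 * counts / counts.sum()


def jackknife_weekday_percent(weekdays, n_groups=50, seed=0):
    """Weekday percentages with delete-one-group jackknife errors.

    Assumption: events are split at random into n_groups groups and the
    percentages are recomputed with each group left out in turn.
    """
    weekdays = np.asarray(weekdays)
    rng = np.random.default_rng(seed)
    # Random, nearly equal-sized group label for every event.
    group = rng.permutation(len(weekdays)) % n_groups

    full = weekday_percent(weekdays)
    left_out = np.array(
        [weekday_percent(weekdays[group != g]) for g in range(n_groups)]
    )
    # Standard jackknife variance: (K - 1) / K * sum of squared deviations.
    deviations = left_out - left_out.mean(axis=0)
    variance = (n_groups - 1) / n_groups * (deviations ** 2).sum(axis=0)
    return full, np.sqrt(variance)
```

Given a list of review dates, `jackknife_weekday_percent([weekday_of(d) for d in dates])` would return the seven percentages and their estimated standard deviations. The same function applies to checkins once each checkin is reduced to a day-of-week index.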
The business data has more than 500 unique categories, with each business tagged by more than one category. Splitting the sample by every category would leave only a small sample for each one. Also, because a business can carry multiple tags, most categories carry redundant (degenerate) information. To circumvent these issues, the top 14 categories are selected out of the 500+, and each business is represented by the single one of its categories that occurs most frequently in the dataset. E.g., if a business is tagged with both 'salons' and 'local business', it is represented by 'local business' only, since 'local business' is a more frequent category than 'salons' in the dataset.
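One way to read the selection rule above is sketched below. This is an interpretation, not the post's actual code: the function name and the input layout (a dict from business id to a list of category names) are hypothetical, and dropping businesses that have no tag among the top categories is an assumption.

```python
from collections import Counter


def assign_primary_category(business_categories, n_top=14):
    """Map each business to its most frequent category among the top n_top.

    business_categories: dict of business id -> list of category names
    (a hypothetical layout chosen for this sketch).
    """
    # Count how many businesses carry each category tag.
    frequency = Counter(
        category
        for categories in business_categories.values()
        for category in set(categories)
    )
    top = {category for category, _ in frequency.most_common(n_top)}

    primary = {}
    for business_id, categories in business_categories.items():
        candidates = [c for c in categories if c in top]
        if candidates:  # assumption: businesses with no top category are dropped
            primary[business_id] = max(candidates, key=lambda c: frequency[c])
    return primary
```

With this rule every business contributes to exactly one category, so the per-category samples do not overlap.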
The normalized counts of reviews posted on each day are shown in
the figure below (two panels), both for the full data and for each category
separately. The full dataset shows that people post the most reviews on
Monday, followed by Tuesday and Wednesday. Fewer
reviews are posted on Friday and Saturday. The errors on the
estimates are small owing to the large size of the review dataset, so
Monday is more popular than Wednesday by at least 3
standard deviations. The dashed line shows the case in which every day of
the week receives an equal share of the reviews (100/7, or about 14.3%). A measure of the
significance of a day is how far it lies, in units of its standard deviation, from this "equally likely"
line. E.g., Monday, the most popular day to post reviews, is
more than 4 standard deviations above the equally likely
case.
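The comparison with the "equally likely" line amounts to a simple ratio, sketched here. The function name is hypothetical, and the inputs are assumed to be the per-day percentages and their jackknife standard deviations.

```python
import numpy as np


def sigma_above_uniform(percent, sigma, n_days=7):
    """Distance of each day's share from the equal share, in standard deviations.

    percent: share of reviews per day, in percent
    sigma:   estimated standard deviation of each share, in percent
    """
    percent = np.asarray(percent, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    equal_share = 100.0 / n_days  # about 14.3% for a seven-day week
    return (percent - equal_share) / sigma
```

A positive value means the day receives more than its equal share, and a negative value means it receives less.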
The figure also shows clear trends across the 14
categories. E.g., the 'Automotive', 'Food', 'Home Services',
'Local Services' and 'Pets' categories show that Wednesday (not
Monday) is
the most popular day to post reviews. Compared to the "equally
likely" case, this is about 2-3 standard deviations higher.
Most other categories, e.g., 'Active Life', 'Arts and Entertainment', 'Restaurants' and
'Nightlife', have Monday as the most popular day to post
reviews.
With the checkins dataset, one can ask the same questions: what is the most popular day for checkins, and how does it vary with category? The checkins data is nearly a factor of 30 smaller than the review data, and hence most of the results from the checkins data are not statistically significant. Even so, the figure below (two panels) shows that Saturday and Friday are the two most popular days for checkins, both for the full dataset and for most of the 14 categories. Notable exceptions are 'Health and Medical', 'Hotels and Travel' and 'Local Services'. As mentioned earlier, none of these claims is statistically significant because the checkins dataset is small; more checkin data is needed for further study.
The Yelp dataset suggests that Monday is the most popular day to post reviews. This finding is statistically significant both with respect to the "equally likely" case and compared to the next most popular day, Wednesday. There is also some interesting variation with category, most of which is statistically significant. On the other hand, people typically like to check in on Saturday, although, because of the small size of the checkin data, no trend in the checkin data is statistically significant.