Introduction to Analyzing NYT Restaurant Reviews

Posted on Wed 03 October 2018 in Data Science

As a person who lives in New York City and is interested in food, I generally enjoy reading the restaurant reviews published by the New York Times. Roughly every week there is a new review of a restaurant in NYC that includes a written assessment and, most importantly for our purposes, a summary rating of the restaurant on a scale of 1 to 4 stars. Unlike other star rating systems, the lowest rating of one star doesn't mean the restaurant is bad. Most restaurants that merit a review in the New York Times are, happily, generally good spots, so even a restaurant with just one star is probably a fine place to eat. Some restaurants do receive no stars; instead, they are rated as "Fair," "Satisfactory," or similar, but these lower grades are rare and usually reserved for not-very-good restaurants that are noteworthy for some reason beyond the food. Perhaps the best example of this is the Times Square restaurant (since closed) run by the TV chef Guy Fieri: when its review came with a rating of "Poor," the piece went viral.

The star rating system for reviews is the source of some controversy among readers of the column, the main complaint being that the star ratings are essentially useless. This is perhaps especially true for the reviews written by the current dining critic, Pete Wells, who is accused of simply giving two stars to nearly every restaurant he reviews. In a profile in the New Yorker, Wells himself admitted that he found it easier to write two-star reviews.

I've sometimes felt this way myself, and decided to do a little investigation. In this project, I'll analyze restaurant reviews written by Wells since 2013 to try to understand if he gives the two-star rating too often. More specifically, I'll try to build a model that can read a review written by Wells and predict the star rating he will give the restaurant.

This post will be the first of a short series discussing this project. The full code for this project will be available at my GitHub Page, which will be updated with each new post in the series. For this first post, I'll briefly discuss the data collection portion of this project. I'll explain how I used Python to scrape the New York Times website to collect the raw HTML of all the reviews written by Wells. The goal of this post is not to explain every line of code in the data collection script, but rather to explain some of the choices made in this part of the process. As such, only snippets of the code will be discussed here, so head over to GitHub to see the full details.

Using the NYTimes API

The New York Times provides an API that allows one to search the archives of the newspaper. The full text of articles is not available through this API, but we can retrieve the URL of any individual article. (Other information, including the first paragraph of the article, keywords, etc., is also available through this API, but won't be needed here.) The function get_urls in review_fetcher.py does the work of querying the API, ultimately returning a list of URLs of restaurant reviews that we access in the next section.

The documentation for the API explains how to construct queries as a URL that, upon access, returns the queried data in JSON format. Using this documentation, we can construct the following query to find the articles that we're interested in:

query_url = 'https://api.nytimes.com/svc/search/v2/articlesearch.json' + \
                '?api_key=' + config.NYT_API_KEY + '&begin_date=20130101' + '&end_date=20181003' + \
                '&fl=web_url' + \
                '&fq=byline:("Pete Wells") AND type_of_material:("Review") AND news_desk:("Dining","Food")'

In this particular project, we're searching for reviews written by Pete Wells after January 1, 2013 that fall under the Dining and Food sections of the paper. Specifying fl=web_url limits the data returned to the URL for each article. If you use this code yourself, you'll need to register for your own API key and use that value in place of NYT_API_KEY.
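Before moving on, it may help to see the shape of what the API sends back. The matching documents are nested in the JSON response under response.docs, each carrying the web_url field we asked for. The helper below is my own illustration, not a function from the project script, and the pagination comment assumes the API's documented behavior of returning ten documents per page via a page query parameter:

```python
def extract_urls(response_json):
    """Collect the web_url of each document in one page of Article Search results."""
    docs = response_json.get('response', {}).get('docs', [])
    return [doc['web_url'] for doc in docs if 'web_url' in doc]

# One page of results could then be fetched with the requests package:
# returned_url_list = extract_urls(requests.get(query_url + '&page=0').json())

# A sample response, trimmed to the fields we care about here:
sample = {'response': {'docs': [
    {'web_url': 'https://www.nytimes.com/2013/02/13/dining/reviews/example.html'}]}}
urls = extract_urls(sample)
```

Looping over the page parameter and concatenating the results from each page yields the full returned_url_list.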

Using this URL and the requests package, we can easily query the API to obtain a list of URLs called returned_url_list. Unfortunately, there are some errors and miscategorizations in the articles returned by the API, in that it returns some URLs for articles that are not restaurant reviews. Luckily, many of these can be weeded out simply by inspecting the URLs:

bad_words = ["(blog)", "(interactive)", "(wine-school)", "(insider)", "(hungry-city)", "(best)",
             "(/books/)", "(slideshow)", "(obituaries)", "(recipes)", "(reader-center)", "(technology)"]
final_url_list = []
for url in returned_url_list:
    if not re.search("|".join(bad_words), url):
        final_url_list.append(url)

This uses a simple regular-expression search to filter out some of the articles that aren't true restaurant reviews, and saves what's left to final_url_list. The exceptions above were originally found in the next part of the project, which parses the HTML for each article to extract the review text and other features, as URLs associated with pages that couldn't be parsed. Many of the exceptions are self-explanatory, or will be understandable to readers of the New York Times. For example, cookbook reviews written by Wells are excluded by removing URLs containing "books", while articles from the "Hungry City" column are reviews of (generally cheaper) restaurants that aren't rated using the star system, and therefore aren't relevant to our analysis.
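To make the filter concrete, here is the same pattern applied to a few illustrative URLs. The URLs below are invented for this example, not real article links:

```python
import re

bad_words = ["(blog)", "(interactive)", "(wine-school)", "(insider)", "(hungry-city)", "(best)",
             "(/books/)", "(slideshow)", "(obituaries)", "(recipes)", "(reader-center)", "(technology)"]
pattern = "|".join(bad_words)  # "(blog)|(interactive)|..." matches any one bad word

urls = [
    "https://www.nytimes.com/2018/01/01/dining/some-restaurant-review.html",  # kept
    "https://www.nytimes.com/2018/01/01/books/some-cookbook-review.html",     # dropped: "/books/"
    "https://www.nytimes.com/2018/01/01/dining/hungry-city-example.html",     # dropped: "hungry-city"
]
kept = [url for url in urls if not re.search(pattern, url)]
```

Only the first URL survives the filter, since the other two contain one of the excluded substrings.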

Unfortunately, this still leaves some articles in final_url_list that aren't true restaurant reviews with star ratings. These articles can only be identified by inspecting the HTML of the page itself. We consider this problem in the next step, as we download the raw HTML for each review for later processing.

Downloading the Reviews

The next step in the process is to actually download the HTML for each article in final_url_list, saving that data for future processing. This work is done by the get_reviews function defined in review_fetcher.py. This function is relatively straightforward - it simply downloads the HTML for each URL in final_url_list and saves it to disk. However, there are two special aspects of this function that are worth discussing.

The first concerns the additional filtering mentioned above - there are still some non-restaurant review articles that need to be removed from the list, but that cannot be excluded based on their URL. To deal with these cases, we define the following function:

def is_misclassified(bs):
    if len(bs.find_all('meta', {'content': re.compile('Critic.*Notebook')})) > 0:
        return True
    if re.search(r'<p.*?>\s*[Cc]ritic.*[Nn]otebook\s*</p>', str(bs)):
        return True
    if len(bs.find_all('meta', {'content': 'hungry-city'})) > 0:
        return True
    return False

This function takes in the HTML of a page, parsed as a BeautifulSoup object using the BeautifulSoup web-scraping library. It then searches the page for two indications that the article is not a restaurant review with a star rating: that it is a "Critic's Notebook" column, or that it is a "Hungry City" column. This is most easily done by using BeautifulSoup to search for HTML metadata tags that describe the type of content.

Unfortunately, the New York Times has not consistently labeled all content with accurate metadata, so this search does not find every article that needs to be excluded. In those cases, we instead search the raw HTML directly to see if the page contains the words "Critic's Notebook". This somewhat hacky solution - parsing pages using multiple methods to deal with inconsistencies in formatting and data tagging - will unfortunately be very common in the next part of the project, when we extract the review data from the raw HTML for further analysis.

The other special aspect of this function is some simple error handling to deal with problems that may arise while attempting to download the different pages. The main issue I encountered came from making too many requests to the New York Times for articles in a short time span. When this occurred, making a request for a page returned a standard error page from the New York Times website, instead of the article itself. Since this request still returns the HTML for some web page, just not containing the content we want, we need a simple function to check that the content returned from any request is actually an article and not an error:

def find_server_error(bs):
    result = bs.find_all('meta', {'content': '500 - Server Error'})
    return len(result) > 0

Luckily, this is easily accomplished by looking for HTML metadata tags indicating an error. Any URL that returns such an error page is put back on a list of URLs to try again later. After a first pass through the URL list, any URLs that have not been successfully accessed are tried again. Any URLs that still can't be accessed after multiple tries are printed to the terminal to be fetched by hand, although in my experience every URL was successfully accessed within three passes.
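The retry logic described above can be sketched as follows. This is a simplified, hypothetical version of the loop - the fetch and is_error parameters stand in for the real request and find_server_error steps - rather than the exact code from the script:

```python
def fetch_with_retries(urls, fetch, is_error, max_passes=3):
    """Try every URL, putting failures back on a list for the next pass."""
    pending = list(urls)
    pages = {}
    for _ in range(max_passes):
        failed = []
        for url in pending:
            page = fetch(url)
            if is_error(page):
                failed.append(url)   # got an error page; retry on the next pass
            else:
                pages[url] = page
        pending = failed
        if not pending:
            break
    for url in pending:              # give up: these must be fetched by hand
        print("Could not fetch:", url)
    return pages
```

Passing the functions in as parameters keeps the retry policy separate from the details of making requests and detecting error pages, which also makes the loop easy to test without touching the network.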

Next Time: Processing the Reviews

If you'd like to follow along with the next parts of this project, you can download this script from GitHub and run it yourself. This should create a directory called reviews in the same folder from which you ran the script, which will include the HTML files associated with 278 reviews. There are some caveats to this, however:

  • You will need to sign up for an API key with the New York Times. This API key is stored in a separate file which is not included in the repository for security reasons.
  • Since this script only needs to run successfully once, I haven't put very much time into making it more robust. It handles the errors that I encountered while using it, but you might have other issues that I didn't anticipate. If you have ideas to improve this script to handle other possible issues, you could make a pull request on GitHub or contact me using this form.
  • Relatedly, this script worked for me on the day this was posted, but the API or the articles themselves may change after that. When you try to fetch the data yourself, it might all be different.

After running the review_fetcher script, we have the raw data for our project, the HTML for all restaurant reviews (and a few other things) written by Wells over the last 5 years. In the next part of the project, we'll see how to parse these files to extract the relevant information (review text, number of stars, restaurant price, etc.) from these reviews into a tidy data set for use in the actual analysis.