Data Cleaning For NYT Restaurant Reviews

Posted on Wed 10 October 2018 in Data Science

This is the second post in my ongoing series analyzing the restaurant reviews in the New York Times. In the previous post, we described how to use an API provided by the New York Times to download copies of that paper's recent restaurant reviews. In this post, we'll discuss how to parse these reviews into a tidy data set that we'll later analyze to predict star ratings from review text.

The full code for this project is available on my GitHub page. This post discusses the clean_data.py script. In order to use this script, you'll have to have successfully run the review_fetcher.py script discussed in the previous post. In particular, you should have a folder named reviews in the directory from which you run the clean_data.py script that contains the HTML files of the restaurant reviews.

Extracting Review Data

Each file in the reviews directory is the raw HTML of a New York Times restaurant review. A simple inspection of these files shows that there's a lot of extraneous information in each one - lots of metadata about the article, various JavaScript elements, advertisements, links to other articles, etc. In order to extract only the information we actually need for the analysis, we use the BeautifulSoup library to parse and search these HTML files, along with the standard Python library for dealing with regular expressions.

from bs4 import BeautifulSoup
import re

This is the same library we used in the previous post to filter out some articles that weren't actually reviews. This time, we'll use this library to extract the text of the review and the star rating assigned to the restaurant for each review, so that later we can build a model that will predict the star rating from the review text. In addition to those features, we'll also extract two other features - the price of the restaurant, on a scale from $ (cheapest) to $$$$ (most expensive), and the number of dishes at the restaurant recommended by the reviewer.

For each feature, we define a function that extracts the relevant information from the raw HTML file parsed into a BeautifulSoup object. For example, the following function is used to extract the full text of the review:

def find_review(bs):
    # All reviews have the main text contained in paragraph elements of a
    # particular class
    tag_searches = [('p', re.compile('css-xhhu0i e2kc3sl0')),
                    ('p', re.compile('story-body-text story-content')),
                    ('p', re.compile('css-1i0edl6'))]
    for (tag, regex) in tag_searches:
        result = bs.find_all(tag, {'class': regex})
        if len(result) > 0:
            review_text = ''
            for p in result:
                review_text += p.get_text()
            review_text = re.sub(r'\s+', ' ', review_text)
            return(review_text)
    # Return EMPTY if review text cannot be found
    return("EMPTY")

In order to get the review text, we find all paragraph HTML elements that are of the class css-xhhu0i e2kc3sl0, story-body-text story-content, or css-1i0edl6. These classes were found by inspecting the HTML files by hand to see how the New York Times marks off the main story content in their HTML. After finding all such paragraphs, we go through them one-by-one to extract the actual text, combine it into one large string, and then strip extraneous whitespace. In the case that no appropriately marked paragraphs are found, we return "EMPTY" to denote that the review text could not be found.
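To see this search in action, here's a small, self-contained example run on a hypothetical HTML fragment. The fragment and its text are made up; only the class name, story-body-text story-content, comes from the list above.

```python
from bs4 import BeautifulSoup
import re

# Hypothetical fragment mimicking the older NYT paragraph markup
html = ('<p class="story-body-text story-content">The dining room is\n'
        'small.</p>'
        '<p class="story-body-text story-content">The menu is not.</p>')
bs = BeautifulSoup(html, 'html.parser')

# Match the full class string with a regex, as find_review does
result = bs.find_all('p', {'class': re.compile('story-body-text story-content')})

# Concatenate the paragraph text and collapse runs of whitespace
review_text = ''
for p in result:
    review_text += p.get_text()
review_text = re.sub(r'\s+', ' ', review_text)
```

Note that BeautifulSoup treats class as a multi-valued attribute, so a regex spanning two class names is matched against the space-joined class string, which is exactly what these searches rely on.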

The multiple searches that we must undertake are a preview of the main difficulty in this part of the project - the formatting of the reviews is not consistent through time, and so different reviews require different methods to find the relevant data. This is evident in the next function, which we use to extract the star ratings from the reviews.

def find_stars(bs):
    # Newer reviews have the rating set off from the story in a special HTML
    # tag.  Find those first
    tag_searches = [('span', re.compile('ReviewFooter-stars')),
                    ('div', re.compile('ReviewFooter-rating')),
                    ('li', re.compile('critic-star-rating')),
                    ('li', re.compile('critic-word-rating'))]
    for (tag, regex) in tag_searches:
        result = bs.find_all(tag, {'class': regex})
        if len(result) > 0:
            text = result[0].get_text()
            stars = re.sub(r'\s+', ' ', text).strip()
            if stars in ['Satisfactory', 'Fair', 'Poor']:
                return(stars)
            else:
                return(str(len(stars)))

    # Older stories have the rating just sitting in a plain paragraph - search
    # separately for them
    direct_search = re.search(r'<p.*?>\s*★+\s*</p>', str(bs))
    if direct_search:
        just_stars = re.search('★+', direct_search.group()).group()
        return(str(len(just_stars)))
    if re.search(r'<p.*?>\s*[Ss]atisfactory\s*</p>', str(bs)):
        return('Satisfactory')
    if re.search(r'<p.*?>\s*[Ff]air\s*</p>', str(bs)):
        return('Fair')
    if re.search(r'<p.*?>\s*[Pp]oor\s*</p>', str(bs)):
        return('Poor')

    # Return 'NA' if a rating can't be found
    return('NA')

In newer articles, the star rating of the review is set off in some kind of marked HTML element. In this case, we can use the BeautifulSoup library as before to search for these special tags and extract the information from them. Even still, the special HTML element that contains this information is not consistent over time, and so it took repeated inspections of the raw HTML files in order to identify the correct tags for which to search.

This method of setting off the review rating isn't present in older reviews, however. For these cases, we instead search the raw and unparsed HTML string for the review directly, looking for the stars themselves (or the text rating) and counting them up.
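We can sanity-check this fallback on a tiny hand-written fragment. The fragment and its class name are invented; the regular expressions are the ones from find_stars above.

```python
import re

# A made-up older-style fragment with the stars in a bare paragraph
old_html = '<p class="story-body"> ★★★ </p>'

# Find a paragraph containing only stars, then count the stars themselves
direct_search = re.search(r'<p.*?>\s*★+\s*</p>', old_html)
just_stars = re.search('★+', direct_search.group()).group()
print(str(len(just_stars)))  # prints 3
```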

With either search method, this function ultimately returns one of the strings 'Poor', 'Fair', 'Satisfactory', '1', '2', '3', or '4', giving the star rating of the review. It also may return 'NA' if none of the above searches for a star rating return a result. This is necessary because even after the filtering described in the previous post, many of the articles we obtained in that post using the NYT API are still not proper restaurant reviews, and therefore have no star rating to extract.

The other features we extract from the reviews, the number of recommended dishes and the price of the restaurant, are handled similarly. We won't rewrite them here - instead, check out the full code on GitHub. These functions read the HTML and return an integer, representing either the number of recommended dishes or the number of dollar signs for the price category of the restaurant. Both return 0 if they cannot find the information in the HTML.
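As an illustration, a price extractor along these lines might look like the sketch below. This is a hedged stand-in, not the actual function from the repository, and the markup pattern (a run of dollar signs sitting alone inside a tag) is an assumption about the page structure.

```python
import re

def find_price(raw_html):
    """Return the number of dollar signs in the price category, or 0."""
    # Look for a run of 1-4 dollar signs alone between two tags
    # (an assumed pattern; the real markup varies across review eras)
    match = re.search(r'>\s*(\${1,4})\s*<', raw_html)
    return len(match.group(1)) if match else 0

print(find_price('<span class="price"> $$$ </span>'))  # prints 3
print(find_price('<p>No price information here</p>'))  # prints 0
```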

When run, the clean_data.py script reads the HTML data from the reviews folder created in the last post. It processes each review to attempt to find a star rating. If one is found then that rating, the review text, the number of recommended dishes, and the price of the restaurant are all saved, along with some identifying information, into a dictionary. The list of all of these dictionaries is saved to a JSON file named cleaned_reviews.json in a data folder created by the script. The URLs associated to articles for which a rating cannot be found are written to a text file, unprocessed_urls.txt, in that folder for further inspection.
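Schematically, the per-review bookkeeping looks something like the following. The field names match the dataframe columns shown later in this post, but the loop and the stand-in parse results are a simplified sketch of the real script, not its actual code.

```python
import json

def make_record(review_id, url, rating, text, rec_dishes, price):
    """Bundle one review's extracted features into a dictionary."""
    return {'id': review_id, 'review_url': url, 'rating': rating,
            'review_text': text, 'rec_dishes': rec_dishes, 'price': price}

# Stand-in parse results: (url, rating, text, rec_dishes, price)
parsed = [('https://www.nytimes.com/review-one', '2', 'The outside...', 5, 3),
          ('https://www.nytimes.com/not-a-review', 'NA', 'EMPTY', 0, 0)]

records, unprocessed_urls = [], []
for i, (url, rating, text, dishes, price) in enumerate(parsed, start=1):
    if rating == 'NA':
        # No star rating found - set the URL aside for manual inspection
        unprocessed_urls.append(url)
    else:
        records.append(make_record(i, url, rating, text, dishes, price))

cleaned_json = json.dumps(records)  # written to data/cleaned_reviews.json
```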

Checking for Accuracy and Completeness

After running this script, we can do a few simple checks to see whether or not our methods are working correctly.

First, we can judge how well the get_rating function works by inspecting the unprocessed_urls.txt file created by the above script. It contains 27 different URLs, and although I haven't exhaustively checked each one, randomly looking at a few shows that these articles do not appear to be restaurant reviews with stars. Some are misclassified articles (there are a few obituaries of chefs, a column about Thanksgiving etiquette, etc.), and some are reviews of restaurants outside of New York City, which therefore don't receive star ratings.

For the remaining 251 articles that did have an associated star rating, we can check that we were able to successfully find the review text, number of recommended dishes, and prices for each. To make this simpler, we can read the cleaned_reviews.json file into a pandas dataframe (this will be the data structure we use during the analysis phase of the project). We can then easily search the dataframe for records with errors - those with 'EMPTY' for the review text or 0 for the price or recommended dishes.

First, we can simply import and inspect the data to see that everything seems to be working correctly.

import pandas as pd
with open('./data/cleaned_reviews.json', 'r') as infile:
    data = pd.read_json(infile, orient='records')
data.head()
id price rating rec_dishes review_text review_url
0 1 3 2 2 The outside makes no grand statements. A glas... https://www.nytimes.com/2014/03/12/dining/rest...
1 2 3 2 7 Reading the opening lines of Hearth’s last re... https://www.nytimes.com/2013/10/30/dining/revi...
2 3 4 Satisfactory 7 Restaurant critics are supposed to be imparti... https://www.nytimes.com/2014/06/25/dining/rest...
3 4 3 2 11 We were halfway through appetizers at Cherche... https://www.nytimes.com/2014/10/01/dining/rest...
4 5 2 2 5 Before telling you how impressed I am by the ... https://www.nytimes.com/2015/05/06/dining/rest...

Next, we can search for reviews that are missing one or more features.

print(len(data[data['review_text'] == 'EMPTY']))
print(len(data[data['rec_dishes'] == 0]))
print(len(data[data['price'] == 0]))
0
0
0

In each case there are no records with errors, and so it seems that our script is able to successfully process every review. That doesn't mean it's necessarily accurate, however. To check this, we can randomly choose a few reviews and inspect them manually, comparing the information the script extracts with what we can see for ourselves. If we take record 123 from the above frame, for example, we have the following data:

data.iloc[123]
id                                                           130
price                                                          3
rating                                                         1
rec_dishes                                                     9
review_text     In some corners of the food media, the openin...
review_url     https://www.nytimes.com/2013/06/26/dining/revi...
Name: 123, dtype: object

If we follow that URL to the original review, then we see that the rating, price, and number of recommended dishes are all accurate. Checking a few more reviews in this way is enough to convince us that the cleaning script is working as intended.

Improvements and Next Steps

There are two comments to be made about this portion of the project:

  • Similar to review_fetcher.py, this script is not very robust to changes. In particular, the New York Times may change the formatting and tagging of its reviews at any time, and the parsing functions as written above may no longer work if this occurs. It should be relatively easy, however, to modify these functions to search for the new tagged elements in that case.
  • This method leaves out some other information that may be helpful in analyzing the reviews. We've extracted the number of recommended dishes and price from each review, but most reviews also contain a summary of the noise level, service, alcohol served, and other features of the restaurant. We could write similar functions to extract this information as well, but these parts of the review are usually short descriptions and not easily compressed into a simple number.

These possible issues noted, running the clean_data.py script produces a clean dataset that contains the review text, rating, price, and number of recommended dishes from 251 restaurant reviews written by Pete Wells. In the next post, we'll begin analyzing that data set to try to understand how Wells assigns star ratings to restaurants.