For my master’s thesis, I am looking at neighborhoods and how to define, classify, and describe them.
First off, what is a neighborhood? Though there are conflicting definitions of what exactly constitutes a neighborhood, most would agree that it is a geographically localized, somewhat homogeneous community within a larger city. If one had to name neighborhoods in New York City, it would be easy to rattle off names such as Upper East Side, Soho, East Village, Chinatown, Midtown, etc. And when we picture these neighborhoods in our minds, each one often has a distinct feel or vibe that makes it easily identifiable.
Can you guess these neighborhoods? *Images taken from wikipedia.org
How do we have such a clear picture of what these neighborhoods are like? When thinking about what sort of characteristics differentiate one neighborhood from another, we think of the kinds of places (restaurants, shops, stores, etc.), the kinds of people (tourists, bankers, affluent people, young people, a certain ethnicity), and the kinds of activities that take place (working, shopping, partying, sightseeing) within the neighborhood. For instance, we would expect to see a lot more offices, working, and tourists in Midtown, but maybe more boutiques, shopping, and artists in Soho. Of course, there are many neighborhoods and characteristics of neighborhoods that are harder to guess off the top of our heads (NoLIta versus TriBeCa or NoHo?).
Another issue with neighborhoods is their fuzzy boundaries and ever-changing characteristics. People and places are not stationary, and over time, the characteristics and perhaps the boundaries of a neighborhood may change (gentrification, anyone?), or new neighborhoods may emerge while old ones are absorbed. For a recent example, see the shrinking of Little Italy. Who defines what the boundaries of a neighborhood are, anyway? These are often arbitrarily drawn by city officials relying on outdated information or on natural boundaries, such as a river, that may no longer be there or do not represent cultural boundaries. Even worse, real estate agents have used these shifting boundaries to falsely stretch the borders of more coveted neighborhoods or to come up with new neighborhoods out of the blue to repackage and market a place. Hence the aforementioned NoLIta, TriBeCa, NoHo, and now DUMBO, BoCoCa, BoHo, FiDi, and whatever else they have managed to come up with. Clearly, some names have stuck while others have faded away, which suggests that in some cases these names actually fulfilled a need and put a name to a newly formed, distinct neighborhood.
Surely there’s a better way? How can we systematically find neighborhoods, quantify their characteristics, and observe their changing or shifting states? With an eye to the characteristics I’ve mentioned that define neighborhoods, there are now many popular social media websites that give some indication of these characteristics. For my study, I focus on Foursquare check-in data, specifically a data set collected by the Cambridge NetOS group from the Twitter API (check-ins from Foursquare forwarded to public Twitter accounts) from May 27th, 2010 to November 2nd, 2010. This data set contains a list of places, a list of users, and a list of each time a user has checked into a place.
The characteristics that I chose to focus on were places, time, and tourist/local. Let me describe each in more detail, and how I collected these values.
Places – Foursquare has a given list of place categories that define all the places that people check in to. These include bars, Mexican restaurants, shoe stores, and many more. Categories are placed into a hierarchical tree, so that Mexican restaurants are under Restaurants and shoe stores are under Shops. For a full list of categories, visit here. Thus, every place has an associated place category tag.
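To make the hierarchy concrete, here is a minimal sketch of how a place's category could be rolled up to its top-level parent. The `PARENT` mapping and the `top_level_category` helper are hypothetical names for illustration; only the two parent/child pairs come from the examples above, and the real category tree is much larger.

```python
# Hypothetical child -> parent mapping. Only the two examples from the text
# are included; a category whose parent is None is a top-level category.
PARENT = {
    "Mexican Restaurant": "Restaurants",
    "Shoe Store": "Shops",
    "Restaurants": None,
    "Shops": None,
}


def top_level_category(category, parent=PARENT):
    """Walk up the tree until reaching a category that has no parent."""
    while parent.get(category) is not None:
        category = parent[category]
    return category


print(top_level_category("Mexican Restaurant"))
print(top_level_category("Shoe Store"))
```

Rolling categories up like this would let places be clustered either at the fine-grained level or at the broader parent level.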
Time – Here, I try to answer the question: at what time of day are places busiest? By counting the volume of check-ins for every hour of the day at a place, I can pick out when it is most active. Then, by assigning chunks of hours in a day to categories such as Morning, Afternoon, Evening, and more, I can classify every place by the time category in which it is busiest.
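The time tagging can be sketched roughly as follows. This is one possible reading of the procedure, not the exact code from the study: the hour ranges in `TIME_BUCKETS` are assumptions (the actual boundaries are not given here), and `busiest_bucket` is a hypothetical helper name.

```python
from collections import Counter
from datetime import datetime

# Hypothetical hour ranges (start inclusive, end exclusive). The real
# boundaries used in the study are not stated, so these are placeholders.
TIME_BUCKETS = {
    "Overnight": range(0, 6),
    "Morning": range(6, 12),
    "Afternoon": range(12, 18),
    "Evening": range(18, 22),
    "Late Evening": range(22, 24),
}


def bucket_for_hour(hour):
    """Map an hour of the day (0-23) to its time category."""
    for name, hours in TIME_BUCKETS.items():
        if hour in hours:
            return name
    raise ValueError(f"hour out of range: {hour}")


def busiest_bucket(checkin_times):
    """Return the time category with the most check-ins for one place."""
    if not checkin_times:
        return None
    # Count check-ins per hour first, then sum the hours in each category.
    hourly = Counter(t.hour for t in checkin_times)
    totals = Counter()
    for hour, count in hourly.items():
        totals[bucket_for_hour(hour)] += count
    return max(totals, key=totals.get)


times = [
    datetime(2010, 6, 1, 23, 15),
    datetime(2010, 6, 2, 22, 40),
    datetime(2010, 6, 3, 9, 0),
]
print(busiest_bucket(times))
```

Because the categories here span different numbers of hours, a real implementation might want to normalize by category length; this sketch simply compares raw totals.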
Tourist/Local – To determine whether a place is touristy or local, I first must determine whether a user is a local or tourist. By counting the percentage of check-ins a user has in or around a city, I can make an educated guess of whether a user is a local of that city. From here, a place can be considered local or touristy based on the proportion of locals or tourists that visit this place.
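A rough sketch of this two-step tagging is below. The tuple layout `(user_id, place_id, city)` and both 50% thresholds are assumptions for illustration; the actual cutoffs and the way "in or around a city" is measured are not specified here.

```python
from collections import Counter, defaultdict

# Hypothetical cutoffs, chosen only for illustration.
LOCAL_USER_SHARE = 0.5   # share of a user's check-ins that must be in the city
LOCAL_PLACE_SHARE = 0.5  # share of a place's visitors that must be locals


def local_users(checkins, city):
    """Return users whose share of check-ins in `city` meets the cutoff.

    `checkins` is a list of (user_id, place_id, city) tuples.
    """
    total = Counter()
    in_city = Counter()
    for user, _place, checkin_city in checkins:
        total[user] += 1
        if checkin_city == city:
            in_city[user] += 1
    return {u for u in total if in_city[u] / total[u] >= LOCAL_USER_SHARE}


def classify_places(checkins, city):
    """Tag each place in `city` as 'local' or 'touristy' by its visitors."""
    locals_ = local_users(checkins, city)
    visitors = defaultdict(set)
    for user, place, checkin_city in checkins:
        if checkin_city == city:
            visitors[place].add(user)
    return {
        place: "local"
        if len(users & locals_) / len(users) >= LOCAL_PLACE_SHARE
        else "touristy"
        for place, users in visitors.items()
    }


checkins = [
    ("ann", "cafe", "NYC"), ("ann", "deli", "NYC"), ("ann", "cafe", "NYC"),
    ("bob", "statue", "NYC"), ("bob", "pub", "London"), ("bob", "tube", "London"),
    ("cat", "statue", "NYC"), ("cat", "louvre", "Paris"), ("cat", "metro", "Paris"),
]
print(classify_places(checkins, "NYC"))
```

Note that this sketch counts distinct visitors per place rather than individual check-ins, which is one of several reasonable ways to read "proportion of locals or tourists".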
From these tags associated with every place, I can cluster places based on their geographic location and the various characteristics just mentioned. The clustering method I use is called OPTICS, a density-based hierarchical clustering algorithm that requires neither the number of clusters as input nor that every point be assigned to a cluster. This makes sense for our look at neighborhoods, since we do not know the number of neighborhoods in advance, and neighborhoods are small. It would make little sense to define a “shopping” cluster that is the entire size of Manhattan, even though there are shops throughout the city. In this case, we are interested in highly dense pockets for each characteristic. Because we have such a large number of characteristics, it would be difficult to perform OPTICS for each category while manually setting the algorithm's inputs to achieve a reasonable-looking set of clusters. By using an automatic clustering algorithm that is fine-tuned to each city, we greatly reduce the time it takes to cluster. Here I will share some preliminary results:
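As an aside on the clustering step itself: the two properties that matter here (no preset number of clusters, and points allowed to stay unclustered) can be illustrated with DBSCAN, a simpler density-based relative of OPTICS. To be clear, this is not OPTICS and not the code used in the study; it is a minimal sketch that assumes places are already given as planar (x, y) coordinates, with `eps` and `min_pts` as hand-picked, hypothetical values.

```python
from math import hypot

NOISE = -1  # label for points that belong to no cluster


def region_query(points, i, eps):
    """Indices of all points within eps of points[i], including i itself."""
    xi, yi = points[i]
    return [j for j, (xj, yj) in enumerate(points)
            if hypot(xi - xj, yi - yj) <= eps]


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id, or NOISE if it is in no dense area."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        labels[i] = cluster_id
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # core point: keep growing the cluster
        cluster_id += 1
    return labels


places = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
print(dbscan(places, eps=1.5, min_pts=3))
```

In this toy input, the two tight groups become two clusters and the isolated point is left as noise. OPTICS builds on the same density idea but orders points by reachability, so clusters of varying density can be extracted without fixing a single `eps`.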
In this way, we can characterize the areas of a city by the clusters that are present. And, by overlapping clusters, we can find the areas of intersection that have homogeneous qualities across many characteristics, leading us to neighborhoods!
We see a lot of overlap in the clusters, leading us to the possibility that we could define neighborhood boundaries using this method. We also see that the nightlife spots are indeed busy in the late evening, as we would expect, but not all late evening clusters have dense nightlife spots. Thus, some characteristics of different neighborhoods emerge – this is the feel or vibe that we want to quantify. However, this is only an example from just looking at two characteristics. With more characteristics overlapping, we can do an even better job of finding neighborhoods and characterizing them based on place, time, and local/tourist.
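The overlap check can be sketched with plain set intersection. This assumes each cluster has already been reduced to the set of grid cells it covers, which is a hypothetical representation chosen for simplicity; the cluster labels and cell coordinates below are made up for illustration and are not results from the study.

```python
from itertools import product


def overlaps(layer_a, layer_b, min_cells=1):
    """Find pairs of clusters from two characteristics that share grid cells.

    Each layer maps a cluster label to the set of (row, col) cells it covers.
    Returns {(label_a, label_b): shared_cells} for every overlapping pair.
    """
    result = {}
    for (label_a, cells_a), (label_b, cells_b) in product(
        layer_a.items(), layer_b.items()
    ):
        shared = cells_a & cells_b
        if len(shared) >= min_cells:
            result[(label_a, label_b)] = shared
    return result


# Made-up example data: one nightlife cluster, two late-evening clusters.
nightlife = {"nightlife-0": {(1, 1), (1, 2), (2, 2)}}
late_evening = {
    "late-0": {(1, 2), (2, 2), (3, 3)},
    "late-1": {(9, 9)},
}
print(overlaps(nightlife, late_evening))
```

Here only one of the two late-evening clusters intersects the nightlife cluster, which mirrors the observation that not every late-evening cluster is a nightlife area. Adding more characteristics would mean intersecting the shared cells with further layers.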
This work is still in progress, and a web application for anyone to explore the various clusters is forthcoming. Earlier I touched upon many more characteristics that can define a neighborhood but that I have not delved into, including demographics such as occupation, ethnicity, and age. Activities, such as working or shopping, were also not explicitly studied, though they may be inferred from the characteristics of place and time. In the future, these characteristics and more could be added to improve this study. Newer data sets could also be used to observe changes over time and determine whether neighborhoods have moved, grown, shrunk, appeared, or disappeared. Adding more cities (I am looking at New York City and London at the moment) is always good, though we are limited to large cities that have enough check-in volume to analyze. An interesting related task would be to find neighborhoods that are similar across cities, as this post attempts. Comparisons of cities are also possible, as I happened upon striking differences between the check-in activity of New York City and London. Last, as more social media websites incorporate location-tagging, we could replicate this analysis on their data and possibly broaden our user demographic.