Using Data Science for house hunting in Montreal
I happen to live in Montreal, in my condo on the edge of McGill Ghetto. Close to Saint Laurent Boulevard or the Maine as locals call it, with all it's attractions — bars, restaurants, night clubs, drunken students. And once upon a time, on a particular lively night, listening to the sounds of McGill frosh students drunkenly heading home after hard night of studying. I thought, that it might be a good idea to move into my own house, a little bit further away from the action.
It was not my first rodeo, buying a real estate in Montreal, but first time buying a house. So, I decided to do a little bit of research, before trusting my money to a real estate agent. I quickly realized that I can't afford a house anywhere close to the subway station on the Island, but I could possible afford a duplex or a triplex, where tenants would be covering part of my mortgage.
The solution to this problem depends not only on the price of the house, but also on the rent or potential rent that the tenants could be paying.
So, being a visual person with background in research, I wanted to see a visual map of how much things cost around the island, and how much revenue I could get. In the States, and even in Ontario there are services like Zillow that can show some of the information, but for Montreal I couldn't find anything, apart from the realtor association APCIQ. Maybe my preference of using English language is to blame.
So, after a few weeks of studying realtor.ca and kijiji, I wrote a python script to scrape information from them, using some resources I found on github: https://github.com/Froren/realtorca. Also, city of Montreal have an open data web site, that helps to fill-out some blanks.
After the data is collected by webscrappers it is processed in R, using tidy-verse, Simple Features for R. I found excellent resources on how to process geospatial information in R: Geocomputation with R, I used ggplot2 to make graphs and thematic maps for map making.
Now I have more then a year worth of data to study.
I preprocess the data by converting it into simple-features format first, and then changing the geographic coordinate reference system (longitude and latitude) to North American projection for Quebec and Ontario
library(tidyverse) library(sf) property<-read_csv("....") %>% st_as_sf(coords=c("lng","lat"), crs=4326) %>% st_transform(crs=32188)
First I wanted to evaluate how much I could get for my condo. I need to define my neighborhood and find all the condos for sale around me.
neighbourhood<-geojson_sf("quartierreferencehabitation.geojson") %>% st_transform(32188) %>% filter(nom_qr %in% c("Saint-Louis", "Milton-Parc")) %>% summarize() %>% st_buffer(dist=0)
Selecting condos for sale.
neighbors <- st_join(property, neighbourhood, left=F)
Using a basemap from openstreetmap.
osm_neighbourhood<-read_osm(st_bbox(neighbourhood%>%st_transform(4326)), ext=1.5, type="esri")
Drawing results using tmap package.
library(tmap) library(tmaptools) tm_shape(osm_neighbourhood) + tm_rgb(alpha=0.7)+ tm_shape(neighbourhood) + tm_borders(col='red',alpha=0.8) + tm_shape(neighbors) + tm_symbols(shape=3,size=0.2,alpha=0.8) + tm_shape(ref_home) + tm_symbols(col='red',shape=4,size=0.5,alpha=0.8)+ tm_compass(position=c("right", "bottom"))+ tm_scale_bar(position=c("right", "bottom"))
Neighbourhood condo prices
Now I can show the prices, and see how the depend on condo surface area and if there is a parking lot. And If i use a simple linear regression I can get the first approximation of what my condo might be worth.
More formally I can use linear model to predict price and confidence intervals
model_price_lm <- lm(mprice ~ parking:area_interior , data=neighbors_) ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 41861.30 22421.28 1.867 0.0628 . ## parkingFALSE:area_interior 436.65 23.56 18.530 <2e-16 *** ## parkingTRUE:area_interior 511.95 19.40 26.393 <2e-16 ***
So, in my neighborhood every square foot in a condo without parking adds 437$ to the base price of 42k$, and with parking it is 512$ per square foot. And now I can make a prediction of the price: 443k$ with confidence interval [422k$, 465k$]
However, if I look at the difference between what my model predicts for all the condos in the neighborhood and the prices, I can see that error depends on the predicted value:
Therefore violating one of the conditions where simple linear regression can be used. This kind of behaviour is called overdispersion, and there are several ways of dealing with it. In particular, I found in the literature that I should be using a generalized linear model with inverse Gaussian distribution for errors and logarithmic link function.
Generalized linear model
The estimate using generalized linear model is following:
model_price_glm <- glm(mprice ~ parking:area_interior , data=neighbors_, family=inverse.gaussian(link="log"))
Which gives prediction 436k$ [422k$, 452k$]
Note that I am ignoring number of rooms, floor of the building and the location of the condo for simplicity. It is possible to plug them all in into the regression, but it will increase number of parameters and make modelling results more difficult to interpret. Also, many parameters are correlated, for example bigger apartments tend to have more rooms and there a more of them with parking.
Now, for the sake of simplicity of comparing different properties, I could estimate price per square foot, and how it is affected by different factors.
Again, using generalized linear model with inverse Gaussian distribution and log link:
price per square foot
It's easy to make sense of the regression results:
print(exp(model_psqft$coeff)) ## (Intercept) parkingTRUE bedrooms2 bedrooms3 bedrooms4 ## 501.7826165 1.1215192 0.9769839 0.9818974 0.8349424
So, the square foot is worth 501$, parking adds 12%, two bedrooms reduce price by 2.4%, three bedrooms by 1.2%, four bedrooms 17% (given the same total price).
The predicted price of my condo is: 431k$ [414k$, 449k$]
Longitudinal condo price model
All my previous models are showing results based on the condos on the market during the last year, without trying to account for the price change. It would have been interesting, how the price change with time. I have no idea how prices should behave, there is no reason to think that there is a steady linear trend, considering seasonal rise and fall in prices, so first, I could just smooth the data using loess function.
If I pile all the data together:
But if I try to separate by number of bedrooms, the results are kind of random, since the data
might be too sparse.
So, it seems that I would rather want to have an overall smooth variation in price, while taking into account some features of the condos: i.e there is actually no reason to think that two bedroom condos are gaining in value slower then three bedroom ones. But there is variation of the proportion of different appartments with time, which would bias the results.
So, I am going to use generalized additive models where I can model overall change of price using a smooth function, while taking into account difference between different kinds of condos.
Longitudinal condo price model:GAM model
# price model with time model_psqft_t <- gam(price_sqft ~ bedrooms + parking + s(start_date, k=24) , data=neighbors_, bs="cr",method='REML', family=inverse.gaussian(link="log"))
It still looks like the prices are going up.
Using this model, the prediction of the price is 468k$ [435k$, 503k$]
How long would it take to sell
Another important question — how long would it take to sell? For this one can use survival analysis Technically, it looks like some types of condos sell faster then others, but the difference is not big. It looks like half of the condos disappear from the market within 60 days :
Plex price estimate
Similarly, when I am looking at the potential plex I would like to know how much houses cost in the neighborhood. Let's say within 2km radius of the plex I was interested at some point:
The price distribution is
Here i can see that the seller is asking slightly more then what is the average for neighborhood, but
at the same time the variability is quite high. For plexes many more parameters are important then
for condos, like the size of the backyard, which year the building was built and how much
existing tennants are paying.
Using similar GLM model as for condos, the estimate for the price is the following: 567к$ [522k$, 616k$]
To estimate the rentals prices in the neighborhood I can find all the appartments listed on Kijiji during last year close by.
The price distribution gives me idea how much I could be potentially getting from the tenants. Of course there might be existing tenants already, so it would show me if what they are paying is close to what's currently on the market.
Average over neighborhood
Remember, my original question was to see the map of the prices in Montreal. The simplest would be
to calculate median rental prices per neighborhood and show it on the map, like following:
rent_by_quartier<-aggregate( kijiji_geo_p%>%filter(bedrooms==2) %>% dplyr::select(price), mtl_p,median, join = st_contains)
Since I am not actually looking everywhere on the island, here is the central part. Blue cross is where I go for work.
This map looks interesting, but it seem unrealistic to ussume that there are going to be sharp borders on the edges of neighborhoods. So, I would prefer to use a method that allows for smooth spatial change in prices. I can actually again use generalized additive models, as for the time course estimate, but with spatial coordinates.
Rental prices spatial gam model
model_rent_geo_whole<-gam(price~bedrooms+s(x,y,k=100), data=rent,bs="cr",method='REML', family=inverse.gaussian(link="log"))
Rental prices in the central area, which is more interesting for me.
Plexes price spatial model
In a same fashion, I can model distribution of the prices per square foot for triplexes with 3br main apartment and parking.
Surface area for a triplex with 3br and parking
Now that I have spatial price distribution, I can also model surface area distribution. This,
technically can be done using data from the city website. But for this example I am using only property that was on the market
Triplex Profitability (rent per year/triplex total price)
This way I can roughly estimate profitability of triplexes in different parts of town. By calculating a total price and dividing by the potential income of two two-bedroom apartments rented for the year. Of course this is very rough estimate, since I am assuming that all triplexes will have two 4 1/2 apartments for rent.
Plex Longitudinal price model: Plateau, Ahuntsic, Rosemont, Villeray
Finally, using the same idea that was used for tracking condo price during the year, I can track plexes prices in the boroughs that were interesting for me.
I did this research to study the distribution of prices in Montreal and to familiarize myself with geospatial modelling in R. I didn't have access to the actual sale prices, so the results should be taken with a grain of salt.
Source code and data
The complete source of scripts used for this publication is publicly available on github: (https://github.com/vfonov/re_mtl), version of this article rendered using rmarkdown is available at http://www.ilmarin.info/re_mtl/stats_eng.html
Interactive map of prices distribution
Results are also shown in an interactive dashboard on (http://www.ilmarin.info/re_mtl/)