Using Data Science for house hunting in Montreal

    Introduction


    I happen to live in Montreal, in my condo on the edge of McGill Ghetto. Close to Saint Laurent Boulevard or the Maine as locals call it, with all it's attractions — bars, restaurants, night clubs, drunken students. And once upon a time, on a particular lively night, listening to the sounds of McGill frosh students drunkenly heading home after hard night of studying. I thought, that it might be a good idea to move into my own house, a little bit further away from the action.


    Image


    It was not my first rodeo, buying a real estate in Montreal, but first time buying a house. So, I decided to do a little bit of research, before trusting my money to a real estate agent. I quickly realized that I can't afford a house anywhere close to the subway station on the Island, but I could possible afford a duplex or a triplex, where tenants would be covering part of my mortgage.
    The solution to this problem depends not only on the price of the house, but also on the rent or potential rent that the tenants could be paying.


    So, being a visual person with background in research, I wanted to see a visual map of how much things cost around the island, and how much revenue I could get. In the States, and even in Ontario there are services like Zillow that can show some of the information, but for Montreal I couldn't find anything, apart from the realtor association APCIQ. Maybe my preference of using English language is to blame.


    So, after a few weeks of studying realtor.ca and kijiji, I wrote a python script to scrape information from them, using some resources I found on github: https://github.com/Froren/realtorca. Also, city of Montreal have an open data web site, that helps to fill-out some blanks.


    After the data is collected by webscrappers it is processed in R, using tidy-verse, Simple Features for R. I found excellent resources on how to process geospatial information in R: Geocomputation with R, I used ggplot2 to make graphs and thematic maps for map making.


    Now I have more then a year worth of data to study.


    Data pre-processing


    I preprocess the data by converting it into simple-features format first, and then changing the geographic coordinate reference system (longitude and latitude) to North American projection for Quebec and Ontario


    library(tidyverse)
    library(sf)
    
    property<-read_csv("....") %>% 
     st_as_sf(coords=c("lng","lat"), crs=4326) %>% 
     st_transform(crs=32188)

    Condo price


    First I wanted to evaluate how much I could get for my condo. I need to define my neighborhood and find all the condos for sale around me.


    Neighborhood map


    neighbourhood<-geojson_sf("quartierreferencehabitation.geojson") %>%
     st_transform(32188) %>% 
     filter(nom_qr %in% c("Saint-Louis", "Milton-Parc")) %>% 
     summarize() %>% 
     st_buffer(dist=0)

    Selecting condos for sale.


    neighbors <- st_join(property, neighbourhood, left=F)

    Using a basemap from openstreetmap.


    osm_neighbourhood<-read_osm(st_bbox(neighbourhood%>%st_transform(4326)), ext=1.5, type="esri")

    Drawing results using tmap package.


    library(tmap)
    library(tmaptools)
    
    tm_shape(osm_neighbourhood) + tm_rgb(alpha=0.7)+
      tm_shape(neighbourhood) + tm_borders(col='red',alpha=0.8)  + 
      tm_shape(neighbors) + tm_symbols(shape=3,size=0.2,alpha=0.8) +
      tm_shape(ref_home) + tm_symbols(col='red',shape=4,size=0.5,alpha=0.8)+
      tm_compass(position=c("right", "bottom"))+
      tm_scale_bar(position=c("right", "bottom"))

    image


    Neighbourhood condo prices


    Now I can show the prices, and see how the depend on condo surface area and if there is a parking lot. And If i use a simple linear regression I can get the first approximation of what my condo might be worth.


    image


    Linear model


    More formally I can use linear model to predict price and confidence intervals


    model_price_lm <- lm(mprice ~ parking:area_interior , data=neighbors_)
    
    ## 
    ## Coefficients:
    ##                            Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)                41861.30   22421.28   1.867   0.0628 .  
    ## parkingFALSE:area_interior   436.65      23.56  18.530   <2e-16 ***
    ## parkingTRUE:area_interior    511.95      19.40  26.393   <2e-16 ***

    So, in my neighborhood every square foot in a condo without parking adds 437$ to the base price of 42k$, and with parking it is 512$ per square foot. And now I can make a prediction of the price: 443k$ with confidence interval [422k$, 465k$]


    However, if I look at the difference between what my model predicts for all the condos in the neighborhood and the prices, I can see that error depends on the predicted value:


    image


    Therefore violating one of the conditions where simple linear regression can be used. This kind of behaviour is called overdispersion, and there are several ways of dealing with it. In particular, I found in the literature that I should be using a generalized linear model with inverse Gaussian distribution for errors and logarithmic link function.


    Generalized linear model


    image


    The estimate using generalized linear model is following:


    model_price_glm <- glm(mprice ~ parking:area_interior , data=neighbors_, 
                           family=inverse.gaussian(link="log"))
    

    Which gives prediction 436k$ [422k$, 452k$]


    Note that I am ignoring number of rooms, floor of the building and the location of the condo for simplicity. It is possible to plug them all in into the regression, but it will increase number of parameters and make modelling results more difficult to interpret. Also, many parameters are correlated, for example bigger apartments tend to have more rooms and there a more of them with parking.


    Now, for the sake of simplicity of comparing different properties, I could estimate price per square foot, and how it is affected by different factors.


    Again, using generalized linear model with inverse Gaussian distribution and log link:


    price per square foot


    image


    It's easy to make sense of the regression results:


    print(exp(model_psqft$coeff))
    
    ## (Intercept) parkingTRUE   bedrooms2   bedrooms3   bedrooms4 
    ## 501.7826165   1.1215192   0.9769839   0.9818974   0.8349424

    So, the square foot is worth 501$, parking adds 12%, two bedrooms reduce price by 2.4%, three bedrooms by 1.2%, four bedrooms 17% (given the same total price).


    The predicted price of my condo is: 431k$ [414k$, 449k$]


    Longitudinal condo price model


    All my previous models are showing results based on the condos on the market during the last year, without trying to account for the price change. It would have been interesting, how the price change with time. I have no idea how prices should behave, there is no reason to think that there is a steady linear trend, considering seasonal rise and fall in prices, so first, I could just smooth the data using loess function.


    Loess smoothing


    If I pile all the data together:
    image


    But if I try to separate by number of bedrooms, the results are kind of random, since the data
    might be too sparse.
    image


    So, it seems that I would rather want to have an overall smooth variation in price, while taking into account some features of the condos: i.e there is actually no reason to think that two bedroom condos are gaining in value slower then three bedroom ones. But there is variation of the proportion of different appartments with time, which would bias the results.


    So, I am going to use generalized additive models where I can model overall change of price using a smooth function, while taking into account difference between different kinds of condos.


    Longitudinal condo price model:GAM model


    # price model with time
    model_psqft_t <- gam(price_sqft ~ bedrooms + parking + s(start_date, k=24) ,
              data=neighbors_, bs="cr",method='REML',
              family=inverse.gaussian(link="log"))

    It still looks like the prices are going up.
    image


    Using this model, the prediction of the price is 468k$ [435k$, 503k$]


    How long would it take to sell


    Another important question — how long would it take to sell? For this one can use survival analysis Technically, it looks like some types of condos sell faster then others, but the difference is not big. It looks like half of the condos disappear from the market within 60 days :


    image


    Plex price estimate


    Similarly, when I am looking at the potential plex I would like to know how much houses cost in the neighborhood. Let's say within 2km radius of the plex I was interested at some point:


    image


    The price distribution is


    image


    Here i can see that the seller is asking slightly more then what is the average for neighborhood, but
    at the same time the variability is quite high. For plexes many more parameters are important then
    for condos, like the size of the backyard, which year the building was built and how much
    existing tennants are paying.


    Using similar GLM model as for condos, the estimate for the price is the following: 567к$ [522k$, 616k$]


    To estimate the rentals prices in the neighborhood I can find all the appartments listed on Kijiji during last year close by.


    image


    The price distribution gives me idea how much I could be potentially getting from the tenants. Of course there might be existing tenants already, so it would show me if what they are paying is close to what's currently on the market.


    image


    Spatial prices


    Average over neighborhood


    Remember, my original question was to see the map of the prices in Montreal. The simplest would be
    to calculate median rental prices per neighborhood and show it on the map, like following:


    rent_by_quartier<-aggregate( kijiji_geo_p%>%filter(bedrooms==2) %>% 
     dplyr::select(price), mtl_p,median, join = st_contains) 

    image


    Since I am not actually looking everywhere on the island, here is the central part. Blue cross is where I go for work.


    image


    This map looks interesting, but it seem unrealistic to ussume that there are going to be sharp borders on the edges of neighborhoods. So, I would prefer to use a method that allows for smooth spatial change in prices. I can actually again use generalized additive models, as for the time course estimate, but with spatial coordinates.


    Rental prices spatial gam model


    model_rent_geo_whole<-gam(price~bedrooms+s(x,y,k=100),
            data=rent,bs="cr",method='REML',
            family=inverse.gaussian(link="log"))

    image


    Rental prices in the central area, which is more interesting for me.


    image


    Plexes price spatial model


    In a same fashion, I can model distribution of the prices per square foot for triplexes with 3br main apartment and parking.


    image


    Surface area for a triplex with 3br and parking


    Now that I have spatial price distribution, I can also model surface area distribution. This,
    technically can be done using data from the city website. But for this example I am using only property that was on the market


    image


    Triplex Profitability (rent per year/triplex total price)


    This way I can roughly estimate profitability of triplexes in different parts of town. By calculating a total price and dividing by the potential income of two two-bedroom apartments rented for the year. Of course this is very rough estimate, since I am assuming that all triplexes will have two 4 1/2 apartments for rent.


    image


    Plex Longitudinal price model: Plateau, Ahuntsic, Rosemont, Villeray


    Finally, using the same idea that was used for tracking condo price during the year, I can track plexes prices in the boroughs that were interesting for me.


    image


    Conclusions


    I did this research to study the distribution of prices in Montreal and to familiarize myself with geospatial modelling in R. I didn't have access to the actual sale prices, so the results should be taken with a grain of salt.


    Source code and data


    The complete source of scripts used for this publication is publicly available on github: (https://github.com/vfonov/re_mtl), version of this article rendered using rmarkdown is available at http://www.ilmarin.info/re_mtl/stats_eng.html


    Interactive map of prices distribution


    Results are also shown in an interactive dashboard on (http://www.ilmarin.info/re_mtl/)

    Similar posts

    AdBlock has stolen the banner, but banners are not teeth — they will be back

    More
    Ads

    Comments 0

    Only users with full accounts can post comments. Log in, please.