AirBnB Hong Kong Analysis

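#loading packages and data
library(tidyverse)  # read_csv(), dplyr verbs, ggplot2
library(janitor)    # clean_names()
library(skimr)      # skim()
listings <- read_csv("http://data.insideairbnb.com/china/hk/hong-kong/2020-06-15/data/listings.csv") %>% 
    clean_names()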

# How many variables/columns? How many rows/observations?
# Which variables are numbers?
# Which are categorical or factor variables (numeric or character variables that have a fixed and known set of possible values)?
# What are the correlations between variables? Does each scatterplot support a linear relationship between variables? Do any of the correlations appear to be conditional on the value of a categorical variable?

glimpse(listings)
## Rows: 11,187
## Columns: 106
## $ id                                           <dbl> 69074, 75083, 103760, 13…
## $ listing_url                                  <chr> "https://www.airbnb.com/…
## $ scrape_id                                    <dbl> 2.02e+13, 2.02e+13, 2.02…
## $ last_scraped                                 <date> 2020-06-17, 2020-06-17,…
## $ name                                         <chr> "Beautiful oasis of plan…
## $ summary                                      <chr> "An ideal Hong location …
## $ space                                        <chr> "Filled with plants and …
## $ description                                  <chr> "An ideal Hong location …
## $ experiences_offered                          <chr> "none", "none", "none", …
## $ neighborhood_overview                        <chr> "In the upper part of tr…
## $ notes                                        <chr> NA, "Once you arrive in …
## $ transit                                      <chr> "Buses pass often along …
## $ access                                       <chr> "All access, except one …
## $ interaction                                  <chr> "If a guest is staying t…
## $ house_rules                                  <chr> "Everything to make your…
## $ thumbnail_url                                <lgl> NA, NA, NA, NA, NA, NA, …
## $ medium_url                                   <lgl> NA, NA, NA, NA, NA, NA, …
## $ picture_url                                  <chr> "https://a0.muscache.com…
## $ xl_picture_url                               <lgl> NA, NA, NA, NA, NA, NA, …
## $ host_id                                      <dbl> 160139, 304876, 304876, …
## $ host_url                                     <chr> "https://www.airbnb.com/…
## $ host_name                                    <chr> "Amy", "Brend", "Brend",…
## $ host_since                                   <date> 2010-07-07, 2010-11-30,…
## $ host_location                                <chr> "Hong Kong", "Hong Kong"…
## $ host_about                                   <chr> "I've been with AirBnB n…
## $ host_response_time                           <chr> "within a few hours", "w…
## $ host_response_rate                           <chr> "86%", "100%", "100%", "…
## $ host_acceptance_rate                         <chr> "60%", "99%", "99%", "99…
## $ host_is_superhost                            <lgl> TRUE, FALSE, FALSE, FALS…
## $ host_thumbnail_url                           <chr> "https://a0.muscache.com…
## $ host_picture_url                             <chr> "https://a0.muscache.com…
## $ host_neighbourhood                           <chr> "Sheung Wan", "Sheung Wa…
## $ host_listings_count                          <dbl> 2, 12, 12, 12, 1, 12, 12…
## $ host_total_listings_count                    <dbl> 2, 12, 12, 12, 1, 12, 12…
## $ host_verifications                           <chr> "['email', 'phone', 'rev…
## $ host_has_profile_pic                         <lgl> TRUE, TRUE, TRUE, TRUE, …
## $ host_identity_verified                       <lgl> TRUE, FALSE, FALSE, FALS…
## $ street                                       <chr> "Sheung Wan, Hong Kong",…
## $ neighbourhood                                <chr> "Central & Western Distr…
## $ neighbourhood_cleansed                       <chr> "Central & Western", "Ce…
## $ neighbourhood_group_cleansed                 <lgl> NA, NA, NA, NA, NA, NA, …
## $ city                                         <chr> "Sheung Wan", "Sheung Wa…
## $ state                                        <chr> NA, NA, NA, NA, "Hong Ko…
## $ zipcode                                      <chr> NA, NA, NA, NA, NA, NA, …
## $ market                                       <chr> "Hong Kong", "Hong Kong"…
## $ smart_location                               <chr> "Sheung Wan, Hong Kong",…
## $ country_code                                 <chr> "HK", "HK", "HK", "HK", …
## $ country                                      <chr> "Hong Kong", "Hong Kong"…
## $ latitude                                     <dbl> 22.3, 22.3, 22.3, 22.3, …
## $ longitude                                    <dbl> 114, 114, 114, 114, 114,…
## $ is_location_exact                            <lgl> TRUE, TRUE, TRUE, FALSE,…
## $ property_type                                <chr> "Apartment", "Apartment"…
## $ room_type                                    <chr> "Entire home/apt", "Enti…
## $ accommodates                                 <dbl> 3, 3, 6, 6, 2, 6, 6, 2, …
## $ bathrooms                                    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, …
## $ bedrooms                                     <dbl> 1, 0, 2, 2, 1, 2, 2, 1, …
## $ beds                                         <dbl> 2, 2, 3, 3, 1, 3, 3, 1, …
## $ bed_type                                     <chr> "Real Bed", "Real Bed", …
## $ amenities                                    <chr> "{\"Cable TV\",Internet,…
## $ square_feet                                  <dbl> NA, NA, NA, NA, NA, NA, …
## $ price                                        <chr> "$1,395.00", "$783.00", …
## $ weekly_price                                 <chr> NA, NA, NA, NA, NA, NA, …
## $ monthly_price                                <chr> "$29,451.00", NA, NA, NA…
## $ security_deposit                             <chr> "$2,325.00", "$775.00", …
## $ cleaning_fee                                 <chr> "$310.00", "$271.00", "$…
## $ guests_included                              <dbl> 2, 2, 2, 3, 1, 2, 2, 1, …
## $ extra_people                                 <chr> "$155.00", "$155.00", "$…
## $ minimum_nights                               <dbl> 3, 14, 2, 2, 2, 2, 2, 1,…
## $ maximum_nights                               <dbl> 365, 365, 365, 365, 60, …
## $ minimum_minimum_nights                       <dbl> 3, 14, 2, 2, 2, 2, 2, 1,…
## $ maximum_minimum_nights                       <dbl> 4, 14, 2, 2, 2, 2, 2, 1,…
## $ minimum_maximum_nights                       <dbl> 365, 365, 365, 365, 60, …
## $ maximum_maximum_nights                       <dbl> 365, 365, 365, 365, 60, …
## $ minimum_nights_avg_ntm                       <dbl> 3.1, 14.0, 2.0, 2.0, 2.0…
## $ maximum_nights_avg_ntm                       <dbl> 365, 365, 365, 365, 60, …
## $ calendar_updated                             <chr> "2 months ago", "7 weeks…
## $ has_availability                             <lgl> TRUE, TRUE, TRUE, TRUE, …
## $ availability_30                              <dbl> 0, 0, 0, 14, 0, 8, 9, 30…
## $ availability_60                              <dbl> 23, 0, 0, 44, 15, 33, 39…
## $ availability_90                              <dbl> 53, 14, 0, 74, 45, 63, 6…
## $ availability_365                             <dbl> 143, 193, 0, 345, 135, 3…
## $ calendar_last_scraped                        <date> 2020-06-17, 2020-06-17,…
## $ number_of_reviews                            <dbl> 134, 229, 271, 305, 27, …
## $ number_of_reviews_ltm                        <dbl> 4, 1, 13, 48, 0, 16, 11,…
## $ first_review                                 <date> 2011-02-14, 2011-03-05,…
## $ last_review                                  <date> 2020-03-24, 2020-04-18,…
## $ review_scores_rating                         <dbl> 97, 89, 89, 93, 97, 86, …
## $ review_scores_accuracy                       <dbl> 10, 8, 9, 10, 10, 9, 9, …
## $ review_scores_cleanliness                    <dbl> 9, 9, 9, 10, 9, 9, 9, 10…
## $ review_scores_checkin                        <dbl> 10, 9, 10, 10, 10, 9, 10…
## $ review_scores_communication                  <dbl> 10, 9, 10, 10, 10, 9, 10…
## $ review_scores_location                       <dbl> 10, 10, 10, 10, 10, 10, …
## $ review_scores_value                          <dbl> 9, 9, 9, 9, 10, 9, 9, 10…
## $ requires_license                             <lgl> FALSE, FALSE, FALSE, FAL…
## $ license                                      <lgl> NA, NA, NA, NA, NA, NA, …
## $ jurisdiction_names                           <lgl> NA, NA, NA, NA, NA, NA, …
## $ instant_bookable                             <lgl> FALSE, FALSE, FALSE, FAL…
## $ is_business_travel_ready                     <lgl> FALSE, FALSE, FALSE, FAL…
## $ cancellation_policy                          <chr> "strict_14_with_grace_pe…
## $ require_guest_profile_picture                <lgl> FALSE, FALSE, FALSE, FAL…
## $ require_guest_phone_verification             <lgl> FALSE, FALSE, FALSE, FAL…
## $ calculated_host_listings_count               <dbl> 1, 13, 13, 13, 1, 13, 13…
## $ calculated_host_listings_count_entire_homes  <dbl> 1, 9, 9, 9, 1, 9, 9, 0, …
## $ calculated_host_listings_count_private_rooms <dbl> 0, 4, 4, 4, 0, 4, 4, 1, …
## $ calculated_host_listings_count_shared_rooms  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, …
## $ reviews_per_month                            <dbl> 1.18, 2.02, 2.47, 2.81, …
skim(listings) 
Data summary
Name listings
Number of rows 11187
Number of columns 106
_______________________
Column type frequency:
character 46
Date 5
logical 16
numeric 39
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
listing_url 0 1.00 34 37 0 11187 0
name 8 1.00 1 250 0 10899 0
summary 756 0.93 1 1000 0 7990 0
space 4528 0.60 1 1000 0 4887 0
description 521 0.95 1 1000 0 8950 0
experiences_offered 0 1.00 4 4 0 1 0
neighborhood_overview 5879 0.47 1 1000 0 3570 0
notes 6862 0.39 1 1000 0 2407 0
transit 5598 0.50 1 1000 0 3665 0
access 6790 0.39 1 1000 0 2868 0
interaction 6119 0.45 1 1000 0 2979 0
house_rules 6217 0.44 2 1000 0 3169 0
picture_url 0 1.00 81 146 0 10607 0
host_url 0 1.00 39 43 0 4874 0
host_name 12 1.00 1 33 0 2846 0
host_location 38 1.00 2 133 0 429 0
host_about 4315 0.61 1 3850 0 2456 5
host_response_time 12 1.00 3 18 0 5 0
host_response_rate 12 1.00 2 4 0 59 0
host_acceptance_rate 12 1.00 2 4 0 74 0
host_thumbnail_url 12 1.00 55 106 0 4851 0
host_picture_url 12 1.00 57 109 0 4851 0
host_neighbourhood 2525 0.77 2 26 0 163 0
host_verifications 0 1.00 2 156 0 265 0
street 0 1.00 13 82 0 686 0
neighbourhood 1284 0.89 4 26 0 56 0
neighbourhood_cleansed 0 1.00 5 17 0 18 0
city 772 0.93 1 50 0 343 0
state 370 0.97 1 31 0 177 0
zipcode 10464 0.06 1 20 0 121 0
market 9 1.00 6 22 0 12 0
smart_location 0 1.00 9 61 0 385 0
country_code 0 1.00 2 2 0 3 0
country 0 1.00 5 14 0 3 0
property_type 0 1.00 3 22 0 41 0
room_type 0 1.00 10 15 0 4 0
bed_type 0 1.00 5 13 0 5 0
amenities 0 1.00 2 1126 0 8558 0
price 0 1.00 5 10 0 374 0
weekly_price 10601 0.05 6 10 0 268 0
monthly_price 10480 0.06 7 11 0 316 0
security_deposit 5677 0.49 5 10 0 231 0
cleaning_fee 5055 0.55 5 9 0 259 0
extra_people 0 1.00 5 9 0 184 0
calendar_updated 0 1.00 5 13 0 78 0
cancellation_policy 0 1.00 6 27 0 6 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
last_scraped 0 1.00 2020-06-15 2020-06-19 2020-06-17 4
host_since 12 1.00 2009-08-17 2020-06-10 2015-12-27 2355
calendar_last_scraped 0 1.00 2020-06-15 2020-06-19 2020-06-17 4
first_review 4155 0.63 2011-02-14 2020-06-15 2018-02-19 1986
last_review 4155 0.63 2013-01-02 2020-06-17 2019-06-23 1365

Variable type: logical

skim_variable n_missing complete_rate mean count
thumbnail_url 11187 0 NaN :
medium_url 11187 0 NaN :
xl_picture_url 11187 0 NaN :
host_is_superhost 12 1 0.13 FAL: 9669, TRU: 1506
host_has_profile_pic 12 1 1.00 TRU: 11141, FAL: 34
host_identity_verified 12 1 0.27 FAL: 8179, TRU: 2996
neighbourhood_group_cleansed 11187 0 NaN :
is_location_exact 0 1 0.69 TRU: 7698, FAL: 3489
has_availability 0 1 1.00 TRU: 11187
requires_license 0 1 0.00 FAL: 11187
license 11187 0 NaN :
jurisdiction_names 11187 0 NaN :
instant_bookable 0 1 0.42 FAL: 6485, TRU: 4702
is_business_travel_ready 0 1 0.00 FAL: 11187
require_guest_profile_picture 0 1 0.01 FAL: 11102, TRU: 85
require_guest_phone_verification 0 1 0.01 FAL: 11086, TRU: 101

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1.00 2.50e+07 1.17e+07 6.91e+04 1.63e+07 2.63e+07 3.47e+07 4.38e+07 ▃▅▆▇▇
scrape_id 0 1.00 2.02e+13 0.00e+00 2.02e+13 2.02e+13 2.02e+13 2.02e+13 2.02e+13 ▁▁▇▁▁
host_id 0 1.00 8.84e+07 8.74e+07 3.22e+04 1.69e+07 5.25e+07 1.39e+08 3.49e+08 ▇▃▂▂▁
host_listings_count 12 1.00 4.85e+01 1.05e+02 0.00e+00 1.00e+00 5.00e+00 2.20e+01 3.86e+02 ▇▁▁▁▁
host_total_listings_count 12 1.00 4.85e+01 1.05e+02 0.00e+00 1.00e+00 5.00e+00 2.20e+01 3.86e+02 ▇▁▁▁▁
latitude 0 1.00 2.23e+01 5.00e-02 2.22e+01 2.23e+01 2.23e+01 2.23e+01 2.26e+01 ▁▇▁▁▁
longitude 0 1.00 1.14e+02 4.00e-02 1.14e+02 1.14e+02 1.14e+02 1.14e+02 1.14e+02 ▁▁▃▇▁
accommodates 0 1.00 2.82e+00 2.18e+00 1.00e+00 2.00e+00 2.00e+00 3.00e+00 1.60e+01 ▇▁▁▁▁
bathrooms 17 1.00 1.16e+00 5.70e-01 0.00e+00 1.00e+00 1.00e+00 1.00e+00 1.10e+01 ▇▁▁▁▁
bedrooms 38 1.00 1.09e+00 8.50e-01 0.00e+00 1.00e+00 1.00e+00 1.00e+00 1.10e+01 ▇▁▁▁▁
beds 69 0.99 1.68e+00 1.44e+00 0.00e+00 1.00e+00 1.00e+00 2.00e+00 2.00e+01 ▇▁▁▁▁
square_feet 11146 0.00 3.99e+02 6.19e+02 0.00e+00 0.00e+00 1.40e+02 6.00e+02 3.20e+03 ▇▂▁▁▁
guests_included 0 1.00 1.39e+00 1.06e+00 1.00e+00 1.00e+00 1.00e+00 1.00e+00 1.60e+01 ▇▁▁▁▁
minimum_nights 0 1.00 9.76e+00 2.83e+01 1.00e+00 1.00e+00 2.00e+00 7.00e+00 1.10e+03 ▇▁▁▁▁
maximum_nights 0 1.00 3.86e+05 2.87e+07 1.00e+00 3.65e+02 1.12e+03 1.12e+03 2.15e+09 ▇▁▁▁▁
minimum_minimum_nights 0 1.00 9.61e+00 2.80e+01 1.00e+00 1.00e+00 2.00e+00 7.00e+00 1.10e+03 ▇▁▁▁▁
maximum_minimum_nights 0 1.00 1.00e+01 2.91e+01 1.00e+00 1.00e+00 2.00e+00 7.00e+00 1.10e+03 ▇▁▁▁▁
minimum_maximum_nights 0 1.00 3.86e+05 2.87e+07 1.00e+00 3.65e+02 1.12e+03 1.12e+03 2.15e+09 ▇▁▁▁▁
maximum_maximum_nights 0 1.00 3.86e+05 2.87e+07 1.00e+00 3.65e+02 1.12e+03 1.12e+03 2.15e+09 ▇▁▁▁▁
minimum_nights_avg_ntm 0 1.00 9.79e+00 2.82e+01 1.00e+00 1.00e+00 2.00e+00 7.00e+00 1.10e+03 ▇▁▁▁▁
maximum_nights_avg_ntm 0 1.00 3.86e+05 2.87e+07 1.00e+00 3.65e+02 1.12e+03 1.12e+03 2.15e+09 ▇▁▁▁▁
availability_30 0 1.00 1.55e+01 1.40e+01 0.00e+00 0.00e+00 2.00e+01 3.00e+01 3.00e+01 ▇▁▁▁▇
availability_60 0 1.00 3.28e+01 2.79e+01 0.00e+00 0.00e+00 4.70e+01 6.00e+01 6.00e+01 ▆▁▁▁▇
availability_90 0 1.00 5.06e+01 4.17e+01 0.00e+00 0.00e+00 7.60e+01 9.00e+01 9.00e+01 ▆▁▁▁▇
availability_365 0 1.00 1.68e+02 1.57e+02 0.00e+00 0.00e+00 1.08e+02 3.64e+02 3.65e+02 ▇▂▂▁▇
number_of_reviews 0 1.00 1.77e+01 4.12e+01 0.00e+00 0.00e+00 2.00e+00 1.40e+01 7.57e+02 ▇▁▁▁▁
number_of_reviews_ltm 0 1.00 2.68e+00 7.55e+00 0.00e+00 0.00e+00 0.00e+00 1.00e+00 1.38e+02 ▇▁▁▁▁
review_scores_rating 4355 0.61 9.09e+01 1.12e+01 2.00e+01 8.70e+01 9.40e+01 9.90e+01 1.00e+02 ▁▁▁▂▇
review_scores_accuracy 4357 0.61 9.34e+00 1.12e+00 2.00e+00 9.00e+00 1.00e+01 1.00e+01 1.00e+01 ▁▁▁▁▇
review_scores_cleanliness 4357 0.61 9.09e+00 1.20e+00 2.00e+00 9.00e+00 9.00e+00 1.00e+01 1.00e+01 ▁▁▁▂▇
review_scores_checkin 4356 0.61 9.50e+00 1.04e+00 2.00e+00 9.00e+00 1.00e+01 1.00e+01 1.00e+01 ▁▁▁▁▇
review_scores_communication 4357 0.61 9.51e+00 1.03e+00 2.00e+00 9.00e+00 1.00e+01 1.00e+01 1.00e+01 ▁▁▁▁▇
review_scores_location 4358 0.61 9.61e+00 8.50e-01 2.00e+00 9.00e+00 1.00e+01 1.00e+01 1.00e+01 ▁▁▁▁▇
review_scores_value 4358 0.61 9.13e+00 1.13e+00 2.00e+00 9.00e+00 9.00e+00 1.00e+01 1.00e+01 ▁▁▁▂▇
calculated_host_listings_count 0 1.00 4.57e+01 1.03e+02 1.00e+00 1.00e+00 4.00e+00 1.90e+01 3.89e+02 ▇▁▁▁▁
calculated_host_listings_count_entire_homes 0 1.00 7.80e+00 1.90e+01 0.00e+00 0.00e+00 1.00e+00 4.00e+00 1.08e+02 ▇▁▁▁▁
calculated_host_listings_count_private_rooms 0 1.00 3.29e+01 8.22e+01 0.00e+00 0.00e+00 1.00e+00 1.10e+01 3.39e+02 ▇▁▁▁▁
calculated_host_listings_count_shared_rooms 0 1.00 4.54e+00 1.57e+01 0.00e+00 0.00e+00 0.00e+00 0.00e+00 8.20e+01 ▇▁▁▁▁
reviews_per_month 4155 0.63 8.40e-01 1.18e+00 1.00e-02 1.20e-01 3.50e-01 1.03e+00 1.32e+01 ▇▁▁▁▁
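
The questions at the top of this section (how many rows and columns, which variables are numeric, and how the numeric variables correlate) can also be answered programmatically. The following is a minimal sketch on a made-up toy tibble; `toy` and its values are assumptions for illustration only, and the same calls would be applied to `listings`.

```r
library(dplyr)

# Toy stand-in for `listings`; the values are invented for illustration.
toy <- tibble(
  price_chr    = c("$1,395.00", "$783.00", "$845.00", "$1,046.00"),
  room_type    = c("Entire home/apt", "Entire home/apt", "Private room", "Private room"),
  accommodates = c(3, 3, 6, 6),
  bedrooms     = c(1, 0, 2, 2),
  superhost    = c(TRUE, FALSE, FALSE, FALSE)
)

# Number of rows (observations) and columns (variables)
dim(toy)

# Number of columns of each type, using the first class of each column
type_counts <- table(vapply(toy, function(x) class(x)[1], character(1)))
type_counts

# Pairwise correlations between the numeric columns only;
# "pairwise.complete.obs" ignores NAs pair by pair
cor_matrix <- toy %>%
  select(where(is.numeric)) %>%
  cor(use = "pairwise.complete.obs")
cor_matrix
```

Whether a correlation depends on a categorical variable could then be checked by repeating the `cor()` call within groups, for example after `group_by(room_type)`.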

Cleaning the data

Here we select the variables we will use in our analysis. We drop most free-text variables, such as summary and space, because turning free text into quantitative features is error-prone; description and neighborhood_overview are kept in the selection below.

#we never modify the raw data; we work on a copy instead
hong_kong_listings <- listings %>% 
  select(id, 
         host_id,
         host_since,
         host_is_superhost,
         host_listings_count,
         neighbourhood_cleansed,
         #latitude, does not give extra info as all values are very similar
         #longitude, does not give extra info as all values are very similar
         property_type,
         room_type,
         accommodates,
         bathrooms,
         bedrooms,
         beds,
         bed_type,
         #amenities, just one long string
         #square_feet, we noticed that a lot of values are missing so excluded this variable
         price,
         #weekly_price, a lot of NAs
         #monthly_price,a lot of NAs
         security_deposit,
         cleaning_fee,
         guests_included,
         extra_people,
         minimum_nights,
         maximum_nights,
         number_of_reviews,
         reviews_per_month,
         number_of_reviews_ltm,
         review_scores_rating,
         review_scores_accuracy,
         review_scores_cleanliness,
         review_scores_checkin,
         review_scores_communication,
         review_scores_location,
         review_scores_value,
         listing_url,
         city,
         description,
         neighborhood_overview,
         #is_business_travel_ready, constant (all FALSE)
         cancellation_policy) %>% 
#Converting characters to "doubles" and factors where appropriate
  mutate(neighbourhood_cleansed=factor(neighbourhood_cleansed),
         room_type=as.factor(room_type),
         price=parse_number(price),
         security_deposit=parse_number(security_deposit),
         cleaning_fee=parse_number(cleaning_fee),
         extra_people=parse_number(extra_people),
         cancellation_policy=as.factor(cancellation_policy),
         bed_type=as.factor(bed_type),
         city=as.factor(city))
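
To illustrate the currency conversion in the chunk above, here is a small sketch of what `parse_number()` does to price strings. The vector `raw_prices` is a made-up example in the same format as the price column, not data taken from the listings.

```r
library(readr)

# Made-up price strings in the same "$1,395.00" format as the raw data
raw_prices <- c("$1,395.00", "$783.00", NA, "$29,451.00")

# parse_number() drops the currency symbol and the thousands separator
# and returns a double vector; missing values stay NA
parsed <- parse_number(raw_prices)
parsed

# Once the column is numeric, summaries such as the median become available
median(parsed, na.rm = TRUE)
```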

#Inspecting data frame to make sure all the variables are correctly attributed
glimpse(hong_kong_listings) 
## Rows: 11,187
## Columns: 35
## $ id                          <dbl> 69074, 75083, 103760, 132773, 133390, 163…
## $ host_id                     <dbl> 160139, 304876, 304876, 304876, 654642, 3…
## $ host_since                  <date> 2010-07-07, 2010-11-30, 2010-11-30, 2010…
## $ host_is_superhost           <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, …
## $ host_listings_count         <dbl> 2, 12, 12, 12, 1, 12, 12, 1, 1, 1, 1, 8, …
## $ neighbourhood_cleansed      <fct> Central & Western, Central & Western, Cen…
## $ property_type               <chr> "Apartment", "Apartment", "Apartment", "A…
## $ room_type                   <fct> Entire home/apt, Entire home/apt, Entire …
## $ accommodates                <dbl> 3, 3, 6, 6, 2, 6, 6, 2, 2, 8, 4, 4, 6, 3,…
## $ bathrooms                   <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 2, 1,…
## $ bedrooms                    <dbl> 1, 0, 2, 2, 1, 2, 2, 1, 1, 1, 2, 1, 3, 2,…
## $ beds                        <dbl> 2, 2, 3, 3, 1, 3, 3, 1, 1, 7, 3, 1, 3, 2,…
## $ bed_type                    <fct> Real Bed, Real Bed, Real Bed, Real Bed, R…
## $ price                       <dbl> 1395, 783, 845, 1046, 930, 690, 767, 698,…
## $ security_deposit            <dbl> 2325, 775, 775, 775, 1163, 775, 775, NA, …
## $ cleaning_fee                <dbl> 310, 271, 271, 302, NA, 302, 302, NA, NA,…
## $ guests_included             <dbl> 2, 2, 2, 3, 1, 2, 2, 1, 1, 1, 4, 2, 2, 2,…
## $ extra_people                <dbl> 155, 155, 194, 225, 0, 194, 194, 0, 0, 0,…
## $ minimum_nights              <dbl> 3, 14, 2, 2, 2, 2, 2, 1, 1, 10, 4, 2, 4, …
## $ maximum_nights              <dbl> 365, 365, 365, 365, 60, 365, 365, 365, 60…
## $ number_of_reviews           <dbl> 134, 229, 271, 305, 27, 222, 225, 17, 163…
## $ reviews_per_month           <dbl> 1.18, 2.02, 2.47, 2.81, 0.25, 2.07, 2.09,…
## $ number_of_reviews_ltm       <dbl> 4, 1, 13, 48, 0, 16, 11, 0, 12, 0, 9, 2, …
## $ review_scores_rating        <dbl> 97, 89, 89, 93, 97, 86, 86, 100, 98, NA, …
## $ review_scores_accuracy      <dbl> 10, 8, 9, 10, 10, 9, 9, 10, 10, NA, 10, 9…
## $ review_scores_cleanliness   <dbl> 9, 9, 9, 10, 9, 9, 9, 10, 10, NA, 9, 9, 7…
## $ review_scores_checkin       <dbl> 10, 9, 10, 10, 10, 9, 10, 10, 10, NA, 10,…
## $ review_scores_communication <dbl> 10, 9, 10, 10, 10, 9, 10, 10, 10, NA, 10,…
## $ review_scores_location      <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 9, NA, 10…
## $ review_scores_value         <dbl> 9, 9, 9, 9, 10, 9, 9, 10, 10, NA, 9, 9, 8…
## $ listing_url                 <chr> "https://www.airbnb.com/rooms/69074", "ht…
## $ city                        <fct> Sheung Wan, Sheung Wan, Central, Hong Kon…
## $ description                 <chr> "An ideal Hong location any visitor--hip …
## $ neighborhood_overview       <chr> "In the upper part of trendy, hip Sheung …
## $ cancellation_policy         <fct> strict_14_with_grace_period, strict_14_wi…
skim(hong_kong_listings)
Table 1: Data summary
Name hong_kong_listings
Number of rows 11187
Number of columns 35
_______________________
Column type frequency:
character 4
Date 1
factor 5
logical 1
numeric 24
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
property_type 0 1.00 3 22 0 41 0
listing_url 0 1.00 34 37 0 11187 0
description 521 0.95 1 1000 0 8950 0
neighborhood_overview 5879 0.47 1 1000 0 3570 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
host_since 12 1 2009-08-17 2020-06-10 2015-12-27 2355

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
neighbourhood_cleansed 0 1.00 FALSE 18 Yau: 4165, Cen: 2378, Wan: 2029, Kow: 471
room_type 0 1.00 FALSE 4 Pri: 5376, Ent: 4940, Sha: 615, Hot: 256
bed_type 0 1.00 FALSE 5 Rea: 11124, Pul: 28, Fut: 19, Air: 8
city 772 0.93 FALSE 343 Hon: 8040, Hon: 406, She: 238, 香港: 178
cancellation_policy 0 1.00 FALSE 6 str: 5435, fle: 4015, mod: 1687, sup: 29

Variable type: logical

skim_variable n_missing complete_rate mean count
host_is_superhost 12 1 0.13 FAL: 9669, TRU: 1506

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1.00 2.50e+07 1.17e+07 69074.00 1.63e+07 2.63e+07 3.47e+07 4.38e+07 ▃▅▆▇▇
host_id 0 1.00 8.84e+07 8.74e+07 32172.00 1.69e+07 5.25e+07 1.39e+08 3.49e+08 ▇▃▂▂▁
host_listings_count 12 1.00 4.85e+01 1.05e+02 0.00 1.00e+00 5.00e+00 2.20e+01 3.86e+02 ▇▁▁▁▁
accommodates 0 1.00 2.82e+00 2.18e+00 1.00 2.00e+00 2.00e+00 3.00e+00 1.60e+01 ▇▁▁▁▁
bathrooms 17 1.00 1.16e+00 5.70e-01 0.00 1.00e+00 1.00e+00 1.00e+00 1.10e+01 ▇▁▁▁▁
bedrooms 38 1.00 1.09e+00 8.50e-01 0.00 1.00e+00 1.00e+00 1.00e+00 1.10e+01 ▇▁▁▁▁
beds 69 0.99 1.68e+00 1.44e+00 0.00 1.00e+00 1.00e+00 2.00e+00 2.00e+01 ▇▁▁▁▁
price 0 1.00 7.42e+02 1.89e+03 0.00 2.95e+02 4.81e+02 7.98e+02 7.80e+04 ▇▁▁▁▁
security_deposit 5677 0.49 1.56e+03 3.75e+03 0.00 0.00e+00 8.00e+02 1.50e+03 3.96e+04 ▇▁▁▁▁
cleaning_fee 5055 0.55 1.77e+02 2.33e+02 0.00 3.90e+01 1.39e+02 2.50e+02 4.80e+03 ▇▁▁▁▁
guests_included 0 1.00 1.39e+00 1.06e+00 1.00 1.00e+00 1.00e+00 1.00e+00 1.60e+01 ▇▁▁▁▁
extra_people 0 1.00 5.62e+01 1.47e+02 0.00 0.00e+00 0.00e+00 5.00e+01 2.34e+03 ▇▁▁▁▁
minimum_nights 0 1.00 9.76e+00 2.83e+01 1.00 1.00e+00 2.00e+00 7.00e+00 1.10e+03 ▇▁▁▁▁
maximum_nights 0 1.00 3.86e+05 2.87e+07 1.00 3.65e+02 1.12e+03 1.12e+03 2.15e+09 ▇▁▁▁▁
number_of_reviews 0 1.00 1.77e+01 4.12e+01 0.00 0.00e+00 2.00e+00 1.40e+01 7.57e+02 ▇▁▁▁▁
reviews_per_month 4155 0.63 8.40e-01 1.18e+00 0.01 1.20e-01 3.50e-01 1.03e+00 1.32e+01 ▇▁▁▁▁
number_of_reviews_ltm 0 1.00 2.68e+00 7.55e+00 0.00 0.00e+00 0.00e+00 1.00e+00 1.38e+02 ▇▁▁▁▁
review_scores_rating 4355 0.61 9.09e+01 1.12e+01 20.00 8.70e+01 9.40e+01 9.90e+01 1.00e+02 ▁▁▁▂▇
review_scores_accuracy 4357 0.61 9.34e+00 1.12e+00 2.00 9.00e+00 1.00e+01 1.00e+01 1.00e+01 ▁▁▁▁▇
review_scores_cleanliness 4357 0.61 9.09e+00 1.20e+00 2.00 9.00e+00 9.00e+00 1.00e+01 1.00e+01 ▁▁▁▂▇
review_scores_checkin 4356 0.61 9.50e+00 1.04e+00 2.00 9.00e+00 1.00e+01 1.00e+01 1.00e+01 ▁▁▁▁▇
review_scores_communication 4357 0.61 9.51e+00 1.03e+00 2.00 9.00e+00 1.00e+01 1.00e+01 1.00e+01 ▁▁▁▁▇
review_scores_location 4358 0.61 9.61e+00 8.50e-01 2.00 9.00e+00 1.00e+01 1.00e+01 1.00e+01 ▁▁▁▁▇
review_scores_value 4358 0.61 9.13e+00 1.13e+00 2.00 9.00e+00 9.00e+00 1.00e+01 1.00e+01 ▁▁▁▂▇

Here is a description of some of the key variables in our dataset hong_kong_listings:

  • price: cost per night
  • cleaning_fee: cleaning fee
  • extra_people: charge per additional guest beyond guests_included
  • property_type: type of accommodation (House, Apartment, etc.)
  • room_type:
      • Entire home/apt (guests have the entire place to themselves)
      • Private room (guests have a private room to sleep in; all other rooms are shared)
      • Shared room (guests sleep in a room shared with others)
      • Hotel room (the fourth level present in the data)
  • number_of_reviews: total number of reviews for the listing
  • review_scores_rating: average review score (0 - 100)
  • neighbourhood*: neighbourhood variables in the raw data; only neighbourhood_cleansed (18 districts) is kept in hong_kong_listings

Handling missing values

We handle NAs with mutate again, and we also filter on minimum/maximum nights and accommodates. We assume that an NA in cleaning_fee or security_deposit means 0; that is, a missing value indicates that the listing charges no cleaning fee or no security deposit. We also expect that, in such cases, these costs are reflected in the nightly price.
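
As a minimal sketch of the NA-to-zero assumption described above, the following uses `tidyr::replace_na()` on a made-up tibble. `toy_fees` and its values are hypothetical; treating a missing fee as 0 is our assumption, not something the data guarantees.

```r
library(dplyr)
library(tidyr)

# Made-up fees with missing values, for illustration only
toy_fees <- tibble(
  id               = 1:4,
  cleaning_fee     = c(310, NA, 271, NA),
  security_deposit = c(2325, 775, NA, NA)
)

# Share of missing values before the replacement
mean(is.na(toy_fees$cleaning_fee))

# Replace NA with 0 in both fee columns in a single step
toy_fees_filled <- toy_fees %>%
  mutate(across(c(cleaning_fee, security_deposit), ~ replace_na(.x, 0)))
toy_fees_filled
```

Checking the share of missing values first is useful because, when it is large, the zero assumption affects many listings at once.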

Summary of property types

Property_type_summary <- hong_kong_listings%>%
  group_by(property_type)%>%
  summarise(count = n())%>%
  mutate(property_proportion = count/sum(count))%>%
  arrange(desc(count))

ggplot(data = Property_type_summary) +
  geom_col(aes(y = count, x = property_type)) +
  coord_flip()

Property_type_top10 <- Property_type_summary%>%
  head(n=10) %>% 
  ggplot() +
  geom_col(aes(y = reorder(property_type, count), x = count), fill = "#00B81F") +
  theme_bw() +
  labs(y = "Property type",
       x = "", 
       title = "The most popular property types on Airbnb \n in Hong Kong")
Property_type_top10

The four most common Airbnb property types in Hong Kong are apartment, condominium, serviced apartment, and hostel; they account for 67.5%, 9.01%, 4.43%, and 3.57% of all listings, respectively.
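
Proportions like these can be read off a count table. The sketch below uses `dplyr::count()` as a shorter equivalent of the `group_by()`/`summarise()`/`arrange()` chain, applied to a made-up vector of property types; `toy_types` and its values are assumptions for illustration and do not reproduce the real proportions.

```r
library(dplyr)

# Made-up property types, for illustration only
toy_types <- tibble(
  property_type = c(rep("Apartment", 6), rep("Condominium", 2), "Hostel", "Loft")
)

# count() groups, tallies, and sorts in descending order in one call
toy_summary <- toy_types %>%
  count(property_type, sort = TRUE, name = "count") %>%
  mutate(property_proportion = count / sum(count),
         property_percent    = round(100 * property_proportion, 1))
toy_summary
```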

Summary of minimum nights

Minimum_nights_summary <- hong_kong_listings%>%
  group_by(minimum_nights)%>%
  summarise(count = n())%>%
  mutate(frequency = count/sum(count))

Minimum_nights_top5 <- Minimum_nights_summary%>%
  arrange(desc(count))%>%
  head(n=5) %>% 
  ggplot() +
  geom_col(aes(y = reorder(minimum_nights, count), x = count), fill = "darkorange") +
  theme_bw() +
  labs(y = "Minimum nights",
       x = "", 
       title = "Top 5 minimum nights in Hong Kong")

Minimum_nights_top5

The five most common values of minimum nights are, in order, 1, 2, 29, 3, and 28 nights. The values that stand out are 29 and 28 nights. Since Hong Kong is a major business hub, we think that these listings are intended for people who are in Hong Kong for business rather than tourism. They need a longer-term stay, so the Airbnb acts like a rented flat that requires them to stay for roughly a month. The benefit of renting an Airbnb for that period is the ease of administration: it is often impossible to find an apartment for a few weeks without the administrative hassle of exchanging documents, credit checks, and the like.

Filter and mutate the dataset

Based on the observations and summaries above, we filter and mutate our dataset to keep only accommodations that are suitable for 2 guests who want to stay for 4 nights. We restrict accommodates to the range 2 to 9, since we believe it is reasonable for wealthier clients to book a place that accommodates up to 9 people. In addition, we create a new variable called prop_type_simplified that keeps the 4 most common property types and groups the rest as Other. We also assume that each NA value in cleaning_fee and security_deposit is 0, meaning that there is no cleaning fee and no security deposit.

#Filter dataset for two guests and 4 nights
#Clean dataset for cleaning_fee, security_deposit, property_type, minimum_nights and accommodates
hong_kong_listings_cleaned <- hong_kong_listings %>%
  mutate(cleaning_fee = case_when(      #considering cleaning_fee as 0 if displayed as NA
    is.na(cleaning_fee) ~ 0, 
    TRUE ~ cleaning_fee),
    security_deposit = case_when(      #considering security_deposit as 0 if displayed as NA
    is.na(security_deposit) ~ 0, 
    TRUE ~ security_deposit),
    prop_type_simplified = case_when(   #regrouping of property_types: put all less popular property types into "Other"
    property_type %in% c("Apartment",
                         "Hostel",
                         "Condominium",
                         "Serviced apartment")~ property_type , 
    TRUE ~ "Other"),
    prop_type_simplified=as.factor(prop_type_simplified)) %>% #creating factors
  filter(minimum_nights<=4, maximum_nights>=4, accommodates>=2 , accommodates<=9) #filtering dataframe
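
The chunk above hard-codes the four most common property types. As an alternative sketch, the top types could be derived from the data before regrouping and filtering. The tibble `toy_listings`, the helper names `n_keep` and `top_types`, and the choice of keeping 2 types are all assumptions made to keep the example small.

```r
library(dplyr)

# Made-up listings, for illustration only
toy_listings <- tibble(
  property_type  = c("Apartment", "Apartment", "Apartment", "Condominium",
                     "Condominium", "Hostel", "Loft", "Villa"),
  minimum_nights = c(1, 2, 29, 3, 4, 1, 2, 1),
  maximum_nights = c(365, 1125, 1125, 3, 60, 30, 365, 365),
  accommodates   = c(2, 4, 2, 2, 10, 1, 3, 2)
)

# Derive the most common property types instead of typing them by hand
n_keep <- 2
top_types <- toy_listings %>%
  count(property_type, sort = TRUE) %>%
  slice_head(n = n_keep) %>%
  pull(property_type)

toy_cleaned <- toy_listings %>%
  # Keep the top types and group everything else as "Other"
  mutate(prop_type_simplified = factor(
    if_else(property_type %in% top_types, property_type, "Other")
  )) %>%
  # A 4-night stay must fall between minimum_nights and maximum_nights,
  # and the listing must accommodate between 2 and 9 guests
  filter(minimum_nights <= 4, maximum_nights >= 4,
         accommodates >= 2, accommodates <= 9)
toy_cleaned
```

Deriving `top_types` from the data keeps the grouping correct if the ranking of property types changes in a newer scrape.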

#Visually inspecting cleaned data set
glimpse(hong_kong_listings_cleaned)
## Rows: 6,437
## Columns: 36
## $ id                          <dbl> 69074, 103760, 132773, 133390, 163664, 16…
## $ host_id                     <dbl> 160139, 304876, 304876, 654642, 304876, 3…
## $ host_since                  <date> 2010-07-07, 2010-11-30, 2010-11-30, 2011…
## $ host_is_superhost           <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, …
## $ host_listings_count         <dbl> 2, 12, 12, 1, 12, 12, 1, 1, 1, 8, 8, 1, 1…
## $ neighbourhood_cleansed      <fct> Central & Western, Central & Western, Cen…
## $ property_type               <chr> "Apartment", "Apartment", "Apartment", "A…
## $ room_type                   <fct> Entire home/apt, Entire home/apt, Entire …
## $ accommodates                <dbl> 3, 6, 6, 2, 6, 6, 2, 2, 4, 4, 6, 3, 4, 3,…
## $ bathrooms                   <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1,…
## $ bedrooms                    <dbl> 1, 2, 2, 1, 2, 2, 1, 1, 2, 1, 3, 2, 1, 1,…
## $ beds                        <dbl> 2, 3, 3, 1, 3, 3, 1, 1, 3, 1, 3, 2, 2, 1,…
## $ bed_type                    <fct> Real Bed, Real Bed, Real Bed, Real Bed, R…
## $ price                       <dbl> 1395, 845, 1046, 930, 690, 767, 698, 643,…
## $ security_deposit            <dbl> 2325, 775, 775, 1163, 775, 775, 0, 0, 193…
## $ cleaning_fee                <dbl> 310, 271, 302, 0, 302, 302, 0, 0, 504, 31…
## $ guests_included             <dbl> 2, 2, 3, 1, 2, 2, 1, 1, 4, 2, 2, 2, 4, 1,…
## $ extra_people                <dbl> 155, 194, 225, 0, 194, 194, 0, 0, 0, 155,…
## $ minimum_nights              <dbl> 3, 2, 2, 2, 2, 2, 1, 1, 4, 2, 4, 2, 1, 1,…
## $ maximum_nights              <dbl> 365, 365, 365, 60, 365, 365, 365, 60, 112…
## $ number_of_reviews           <dbl> 134, 271, 305, 27, 222, 225, 17, 163, 240…
## $ reviews_per_month           <dbl> 1.18, 2.47, 2.81, 0.25, 2.07, 2.09, 0.16,…
## $ number_of_reviews_ltm       <dbl> 4, 13, 48, 0, 16, 11, 0, 12, 9, 2, 49, 0,…
## $ review_scores_rating        <dbl> 97, 89, 93, 97, 86, 86, 100, 98, 95, 93, …
## $ review_scores_accuracy      <dbl> 10, 9, 10, 10, 9, 9, 10, 10, 10, 9, 9, 9,…
## $ review_scores_cleanliness   <dbl> 9, 9, 10, 9, 9, 9, 10, 10, 9, 9, 7, 9, 9,…
## $ review_scores_checkin       <dbl> 10, 10, 10, 10, 9, 10, 10, 10, 10, 9, 9, …
## $ review_scores_communication <dbl> 10, 10, 10, 10, 9, 10, 10, 10, 10, 10, 9,…
## $ review_scores_location      <dbl> 10, 10, 10, 10, 10, 10, 10, 9, 10, 9, 9, …
## $ review_scores_value         <dbl> 9, 9, 9, 10, 9, 9, 10, 10, 9, 9, 8, 9, 9,…
## $ listing_url                 <chr> "https://www.airbnb.com/rooms/69074", "ht…
## $ city                        <fct> Sheung Wan, Central, Hong Kong Island, Ce…
## $ description                 <chr> "An ideal Hong location any visitor--hip …
## $ neighborhood_overview       <chr> "In the upper part of trendy, hip Sheung …
## $ cancellation_policy         <fct> strict_14_with_grace_period, strict_14_wi…
## $ prop_type_simplified        <fct> Apartment, Apartment, Apartment, Apartmen…
skim(hong_kong_listings_cleaned)
Table 2: Data summary
Name hong_kong_listings_cleane…
Number of rows 6437
Number of columns 36
_______________________
Column type frequency:
character 4
Date 1
factor 6
logical 1
numeric 24
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
property_type 0 1.00 3 22 0 36 0
listing_url 0 1.00 34 37 0 6437 0
description 292 0.95 1 1000 0 5212 0
neighborhood_overview 2931 0.54 1 1000 0 2374 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
host_since 10 1 2009-10-07 2020-06-09 2015-12-28 1935

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
neighbourhood_cleansed 0 1.00 FALSE 18 Yau: 2747, Cen: 1372, Wan: 888, Isl: 268
room_type 0 1.00 FALSE 4 Ent: 3157, Pri: 2922, Hot: 188, Sha: 170
bed_type 0 1.00 FALSE 5 Rea: 6404, Pul: 16, Fut: 8, Air: 6
city 468 0.93 FALSE 269 Hon: 4538, She: 196, 香港: 97, Hon: 83
cancellation_policy 0 1.00 FALSE 6 str: 3575, fle: 1777, mod: 1048, sup: 21
prop_type_simplified 0 1.00 FALSE 5 Apa: 4129, Oth: 1251, Con: 541, Ser: 277

Variable type: logical

skim_variable n_missing complete_rate mean count
host_is_superhost 10 1 0.13 FAL: 5594, TRU: 833

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1.00 2.40e+07 1.17e+07 69074.00 1.49e+07 2.46e+07 3.38e+07 4.38e+07 ▅▆▇▇▇
host_id 0 1.00 9.35e+07 9.02e+07 44242.00 2.37e+07 5.25e+07 1.49e+08 3.49e+08 ▇▂▂▂▁
host_listings_count 10 1.00 1.19e+01 2.49e+01 0.00 1.00e+00 3.00e+00 1.20e+01 3.86e+02 ▇▁▁▁▁
accommodates 0 1.00 3.07e+00 1.58e+00 2.00 2.00e+00 2.00e+00 4.00e+00 9.00e+00 ▇▂▁▁▁
bathrooms 4 1.00 1.13e+00 4.30e-01 0.00 1.00e+00 1.00e+00 1.00e+00 8.00e+00 ▇▁▁▁▁
bedrooms 14 1.00 1.13e+00 7.40e-01 0.00 1.00e+00 1.00e+00 1.00e+00 1.00e+01 ▇▁▁▁▁
beds 26 1.00 1.77e+00 1.22e+00 0.00 1.00e+00 1.00e+00 2.00e+00 1.40e+01 ▇▂▁▁▁
price 0 1.00 8.15e+02 1.67e+03 0.00 3.72e+02 5.50e+02 8.53e+02 6.67e+04 ▇▁▁▁▁
security_deposit 0 1.00 5.64e+02 1.89e+03 0.00 0.00e+00 0.00e+00 7.84e+02 3.80e+04 ▇▁▁▁▁
cleaning_fee 0 1.00 1.01e+02 1.88e+02 0.00 0.00e+00 0.00e+00 1.60e+02 4.69e+03 ▇▁▁▁▁
guests_included 0 1.00 1.48e+00 1.03e+00 1.00 1.00e+00 1.00e+00 2.00e+00 1.60e+01 ▇▁▁▁▁
extra_people 0 1.00 6.87e+01 1.58e+02 0.00 0.00e+00 0.00e+00 1.00e+02 2.30e+03 ▇▁▁▁▁
minimum_nights 0 1.00 1.55e+00 8.50e-01 1.00 1.00e+00 1.00e+00 2.00e+00 4.00e+00 ▇▂▁▂▁
maximum_nights 0 1.00 3.36e+05 2.68e+07 4.00 3.60e+02 1.12e+03 1.12e+03 2.15e+09 ▇▁▁▁▁
number_of_reviews 0 1.00 2.32e+01 4.61e+01 0.00 1.00e+00 5.00e+00 2.30e+01 7.57e+02 ▇▁▁▁▁
reviews_per_month 1541 0.76 9.10e-01 1.22e+00 0.01 1.40e-01 4.10e-01 1.16e+00 1.32e+01 ▇▁▁▁▁
number_of_reviews_ltm 0 1.00 3.55e+00 8.71e+00 0.00 0.00e+00 0.00e+00 2.00e+00 1.38e+02 ▇▁▁▁▁
review_scores_rating 1654 0.74 9.10e+01 1.06e+01 20.00 8.70e+01 9.30e+01 9.80e+01 1.00e+02 ▁▁▁▂▇
review_scores_accuracy 1656 0.74 9.33e+00 1.09e+00 2.00 9.00e+00 1.00e+01 1.00e+01 1.00e+01 ▁▁▁▁▇
review_scores_cleanliness 1656 0.74 9.13e+00 1.13e+00 2.00 9.00e+00 9.00e+00 1.00e+01 1.00e+01 ▁▁▁▂▇
review_scores_checkin 1655 0.74 9.50e+00 1.01e+00 2.00 9.00e+00 1.00e+01 1.00e+01 1.00e+01 ▁▁▁▁▇
review_scores_communication 1656 0.74 9.52e+00 9.80e-01 2.00 9.00e+00 1.00e+01 1.00e+01 1.00e+01 ▁▁▁▁▇
review_scores_location 1658 0.74 9.61e+00 8.40e-01 2.00 9.00e+00 1.00e+01 1.00e+01 1.00e+01 ▁▁▁▁▇
review_scores_value 1658 0.74 9.14e+00 1.08e+00 2.00 9.00e+00 9.00e+00 1.00e+01 1.00e+01 ▁▁▁▂▇

Calculate total price for 4 nights and data transformation

To end the pre-processing section, we calculate price_4_nights, the target variable for our regression: the total price of a 4-night stay for two people at each listing. Because some listings have a price_4_nights equal to 0, log(price_4_nights) would be negative infinity for them, which hinders further analysis. We therefore add 1 to those zero values and leave all other values unchanged, so that the log of the transformed value is log(1) = 0 rather than -Inf. Only the zero-priced listings are altered by this adjustment; every other observation keeps its original value.

hong_kong_listings_total_price<-hong_kong_listings_cleaned %>%
  # price_4_nights calculation
  mutate(price_4_nights=price*4+
           cleaning_fee+
           if_else(guests_included==1, extra_people*4,0),
         # Add 1 to price_4_nights that are equal to 0
         price_4_nights_transformed = price_4_nights +
           if_else(price_4_nights==0, 1,0),
         log_price_4_nights = log(price_4_nights),
         log_price_4_nights_transformed = log(price_4_nights_transformed))
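
As a quick check of the formula above, the sketch below recomputes price_4_nights in base R for three made-up listings and shows why the zero case needs the +1 adjustment before taking logs. The `toy` data frame and its values are hypothetical and are not taken from the dataset.

```r
# Hypothetical listings; the values are illustrative only
toy <- data.frame(
  price           = c(500, 300, 0),
  cleaning_fee    = c(100, 0, 0),
  guests_included = c(2, 1, 1),
  extra_people    = c(50, 80, 0)
)

# 4 nights for two people: the extra-guest fee applies only
# when the base price covers a single guest
toy$price_4_nights <- toy$price * 4 + toy$cleaning_fee +
  ifelse(toy$guests_included == 1, toy$extra_people * 4, 0)

# Add 1 only where the total is 0, so that the log is 0 instead of -Inf
toy$price_4_nights_transformed <- toy$price_4_nights +
  ifelse(toy$price_4_nights == 0, 1, 0)

toy$log_price_4_nights             <- log(toy$price_4_nights)
toy$log_price_4_nights_transformed <- log(toy$price_4_nights_transformed)

print(toy)
```

The third toy listing has a total of 0, so its untransformed log is -Inf while the transformed log is 0.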

New variables: neighbourhood_simplified and rating_group

Using our knowledge of the city, we create a new categorical variable, neighbourhood_simplified, that groups the 18 neighbourhoods geographically into 5 zones. We also create a new categorical variable, rating_group, that divides the properties into 3 categories: review_scores_rating below 90 (Under 90), 90 or above (Over 90), and no rating (No Rating).

hong_kong_listings_neighbourhood_simplified <- hong_kong_listings_total_price %>% 
  mutate(
neighbourhood_simplified = case_when(
      neighbourhood_cleansed=="Central & Western"~"zone_1",
      neighbourhood_cleansed=="Eastern"~"zone_1",
      neighbourhood_cleansed=="Islands"~"zone_2",
      neighbourhood_cleansed=="Kowloon City"~"zone_3",
      neighbourhood_cleansed=="Kwai Tsing"~"zone_4",
      neighbourhood_cleansed=="Kwun Tong"~"zone_3",
      neighbourhood_cleansed=="North"~"zone_4",
      neighbourhood_cleansed=="Sai Kung"~"zone_5",
      neighbourhood_cleansed=="Sha Tin"~"zone_4",
      neighbourhood_cleansed=="Sham Shui Po"~"zone_3",
      neighbourhood_cleansed=="Southern"~"zone_1",
      neighbourhood_cleansed=="Tai Po"~"zone_4",
      neighbourhood_cleansed=="Tsuen Wan"~"zone_4",
      neighbourhood_cleansed=="Tuen Mun"~"zone_4",
      neighbourhood_cleansed=="Wan Chai"~"zone_1",
      neighbourhood_cleansed=="Wong Tai Sin"~"zone_3",
      neighbourhood_cleansed=="Yau Tsim Mong"~"zone_3",
      neighbourhood_cleansed=="Yuen Long"~"zone_4"),
rating_group= case_when(
  review_scores_rating <90 ~ "Under 90",
  is.na(review_scores_rating)~"No Rating",
  TRUE ~ "Over 90"))
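
The same grouping can also be expressed as a lookup table. The sketch below is an alternative written in base R, not the code used above; `zone_lookup`, `to_zone` and `to_rating_group` are hypothetical names, and the zone assignments are copied from the case_when() call.

```r
# Named vector: neighbourhood -> zone (assignments copied from the case_when() above)
zone_lookup <- c(
  "Central & Western" = "zone_1", "Eastern" = "zone_1",
  "Southern" = "zone_1", "Wan Chai" = "zone_1",
  "Islands" = "zone_2",
  "Kowloon City" = "zone_3", "Kwun Tong" = "zone_3", "Sham Shui Po" = "zone_3",
  "Wong Tai Sin" = "zone_3", "Yau Tsim Mong" = "zone_3",
  "Kwai Tsing" = "zone_4", "North" = "zone_4", "Sha Tin" = "zone_4",
  "Tai Po" = "zone_4", "Tsuen Wan" = "zone_4", "Tuen Mun" = "zone_4",
  "Yuen Long" = "zone_4",
  "Sai Kung" = "zone_5"
)

# Look up the zone for each neighbourhood (works for factors and characters)
to_zone <- function(neighbourhood) {
  unname(zone_lookup[as.character(neighbourhood)])
}

# Missing ratings are handled first, then the 90-point cut-off
to_rating_group <- function(rating) {
  ifelse(is.na(rating), "No Rating",
         ifelse(rating < 90, "Under 90", "Over 90"))
}

to_zone(c("Sai Kung", "Wan Chai", "Yuen Long"))
to_rating_group(c(85, 90, NA, 97))
```

Note that a rating of exactly 90 falls into the Over 90 group, as it does in the case_when() version.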

skim(hong_kong_listings_neighbourhood_simplified)
Table 3: Data summary
Name hong_kong_listings_neighb…
Number of rows 6437
Number of columns 42
_______________________
Column type frequency:
character 6
Date 1
factor 6
logical 1
numeric 28
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
property_type 0 1.00 3 22 0 36 0
listing_url 0 1.00 34 37 0 6437 0
description 292 0.95 1 1000 0 5212 0
neighborhood_overview 2931 0.54 1 1000 0 2374 0
neighbourhood_simplified 0 1.00 6 6 0 5 0
rating_group 0 1.00 7 9 0 3 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
host_since 10 1 2009-10-07 2020-06-09 2015-12-28 1935

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
neighbourhood_cleansed 0 1.00 FALSE 18 Yau: 2747, Cen: 1372, Wan: 888, Isl: 268
room_type 0 1.00 FALSE 4 Ent: 3157, Pri: 2922, Hot: 188, Sha: 170
bed_type 0 1.00 FALSE 5 Rea: 6404, Pul: 16, Fut: 8, Air: 6
city 468 0.93 FALSE 269 Hon: 4538, She: 196, 香港: 97, Hon: 83
cancellation_policy 0 1.00 FALSE 6 str: 3575, fle: 1777, mod: 1048, sup: 21
prop_type_simplified 0 1.00 FALSE 5 Apa: 4129, Oth: 1251, Con: 541, Ser: 277

Variable type: logical

skim_variable n_missing complete_rate mean count
host_is_superhost 10 1 0.13 FAL: 5594, TRU: 833

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1.00 2.40e+07 1.17e+07 69074.00 1.49e+07 2.46e+07 3.38e+07 4.38e+07 ▅▆▇▇▇
host_id 0 1.00 9.35e+07 9.02e+07 44242.00 2.37e+07 5.25e+07 1.49e+08 3.49e+08 ▇▂▂▂▁
host_listings_count 10 1.00 1.19e+01 2.49e+01 0.00 1.00e+00 3.00e+00 1.20e+01 3.86e+02 ▇▁▁▁▁
accommodates 0 1.00 3.07e+00 1.58e+00 2.00 2.00e+00 2.00e+00 4.00e+00 9.00e+00 ▇▂▁▁▁
bathrooms 4 1.00 1.13e+00 4.30e-01 0.00 1.00e+00 1.00e+00 1.00e+00 8.00e+00 ▇▁▁▁▁
bedrooms 14 1.00 1.13e+00 7.40e-01 0.00 1.00e+00 1.00e+00 1.00e+00 1.00e+01 ▇▁▁▁▁
beds 26 1.00 1.77e+00 1.22e+00 0.00 1.00e+00 1.00e+00 2.00e+00 1.40e+01 ▇▂▁▁▁
price 0 1.00 8.15e+02 1.67e+03 0.00 3.72e+02 5.50e+02 8.53e+02 6.67e+04 ▇▁▁▁▁
security_deposit 0 1.00 5.64e+02 1.89e+03 0.00 0.00e+00 0.00e+00 7.84e+02 3.80e+04 ▇▁▁▁▁
cleaning_fee 0 1.00 1.01e+02 1.88e+02 0.00 0.00e+00 0.00e+00 1.60e+02 4.69e+03 ▇▁▁▁▁
guests_included 0 1.00 1.48e+00 1.03e+00 1.00 1.00e+00 1.00e+00 2.00e+00 1.60e+01 ▇▁▁▁▁
extra_people 0 1.00 6.87e+01 1.58e+02 0.00 0.00e+00 0.00e+00 1.00e+02 2.30e+03 ▇▁▁▁▁
minimum_nights 0 1.00 1.55e+00 8.50e-01 1.00 1.00e+00 1.00e+00 2.00e+00 4.00e+00 ▇▂▁▂▁
maximum_nights 0 1.00 3.36e+05 2.68e+07 4.00 3.60e+02 1.12e+03 1.12e+03 2.15e+09 ▇▁▁▁▁
number_of_reviews 0 1.00 2.32e+01 4.61e+01 0.00 1.00e+00 5.00e+00 2.30e+01 7.57e+02 ▇▁▁▁▁
reviews_per_month 1541 0.76 9.10e-01 1.22e+00 0.01 1.40e-01 4.10e-01 1.16e+00 1.32e+01 ▇▁▁▁▁
number_of_reviews_ltm 0 1.00 3.55e+00 8.71e+00 0.00 0.00e+00 0.00e+00 2.00e+00 1.38e+02 ▇▁▁▁▁
review_scores_rating 1654 0.74 9.10e+01 1.06e+01 20.00 8.70e+01 9.30e+01 9.80e+01 1.00e+02 ▁▁▁▂▇
review_scores_accuracy 1656 0.74 9.33e+00 1.09e+00 2.00 9.00e+00 1.00e+01 1.00e+01 1.00e+01 ▁▁▁▁▇
review_scores_cleanliness 1656 0.74 9.13e+00 1.13e+00 2.00 9.00e+00 9.00e+00 1.00e+01 1.00e+01 ▁▁▁▂▇
review_scores_checkin 1655 0.74 9.50e+00 1.01e+00 2.00 9.00e+00 1.00e+01 1.00e+01 1.00e+01 ▁▁▁▁▇
review_scores_communication 1656 0.74 9.52e+00 9.80e-01 2.00 9.00e+00 1.00e+01 1.00e+01 1.00e+01 ▁▁▁▁▇
review_scores_location 1658 0.74 9.61e+00 8.40e-01 2.00 9.00e+00 1.00e+01 1.00e+01 1.00e+01 ▁▁▁▁▇
review_scores_value 1658 0.74 9.14e+00 1.08e+00 2.00 9.00e+00 9.00e+00 1.00e+01 1.00e+01 ▁▁▁▂▇
price_4_nights 0 1.00 3.47e+03 6.70e+03 0.00 1.61e+03 2.39e+03 3.79e+03 2.67e+05 ▇▁▁▁▁
price_4_nights_transformed 0 1.00 3.47e+03 6.70e+03 1.00 1.61e+03 2.39e+03 3.79e+03 2.67e+05 ▇▁▁▁▁
log_price_4_nights 0 1.00 -Inf NaN -Inf 7.39e+00 7.78e+00 8.24e+00 1.25e+01 ▁▂▇▁▁
log_price_4_nights_transformed 0 1.00 7.84e+00 6.80e-01 0.00 7.39e+00 7.78e+00 8.24e+00 1.25e+01 ▁▁▅▇▁

Overview of our cleansed data

How many variables (columns)? How many observations (rows)?

The original dataset, listings, had 11187 observations with 106 variables. After cleaning the data, removing variables and observations with a lot of NAs, and using our own judgement to remove insignificant variables, we end up with a final dataset, hong_kong_listings_neighbourhood_simplified, with 6437 observations and 42 variables. This dataset is used for our regression models.

Which variables are numbers?

The original dataset, listings, had 39 numeric variables whereas our final cleaned dataset had 28 numeric variables. Some examples of numeric variables in the dataset are id, accommodates, bedrooms, beds, price and price_4_nights.

Which are categorical or factor variables (numeric or character variables that have a fixed and known set of possible values)?

The original dataset, listings, had 46 categorical or factor variables whereas our final cleaned dataset had 12 categorical and factor variables. Some examples of factor and categorical variables in the dataset are the variables neighbourhood_cleansed, room_type, bed_type.
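
Counts like these can be read off the skim() output, or computed directly. The sketch below shows one way to tally column types in base R on a small hypothetical data frame (`toy` and its columns are made up); with the real data one would pass hong_kong_listings_neighbourhood_simplified instead.

```r
# Hypothetical data frame with a mix of column types
toy <- data.frame(
  id        = c(1, 2, 3),
  price     = c(500, 300, 800),
  room_type = factor(c("Entire home/apt", "Private room", "Private room")),
  name      = c("a", "b", "c"),
  superhost = c(TRUE, FALSE, NA),
  stringsAsFactors = FALSE
)

# First class of every column, then a frequency table of those classes
col_types   <- vapply(toy, function(x) class(x)[1], character(1))
type_counts <- table(col_types)
print(type_counts)

# Numeric columns, and columns stored as factor or character
n_numeric     <- sum(vapply(toy, is.numeric, logical(1)))
n_categorical <- sum(vapply(toy, function(x) is.factor(x) || is.character(x), logical(1)))
```

Counting character columns as categorical is an assumption: free-text columns such as description are stored as character but do not have a fixed set of values.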

Exploratory Data Analysis

Summary statistics and favstats

Now that we have cleaned our data set for our specific target (4 nights, 2 people), we will conduct an exploratory data analysis.

#summary to check for NA's and general statistics
summary(hong_kong_listings_neighbourhood_simplified)
##        id              host_id           host_since         host_is_superhost
##  Min.   :   69074   Min.   :4.42e+04   Min.   :2009-10-07   Mode :logical    
##  1st Qu.:14921794   1st Qu.:2.37e+07   1st Qu.:2014-11-16   FALSE:5594       
##  Median :24554597   Median :5.25e+07   Median :2015-12-28   TRUE :833        
##  Mean   :24021748   Mean   :9.35e+07   Mean   :2016-02-29   NA's :10         
##  3rd Qu.:33810314   3rd Qu.:1.49e+08   3rd Qu.:2017-09-07                    
##  Max.   :43751721   Max.   :3.49e+08   Max.   :2020-06-09                    
##                                        NA's   :10                            
##  host_listings_count       neighbourhood_cleansed property_type     
##  Min.   :  0         Yau Tsim Mong    :2747       Length:6437       
##  1st Qu.:  1         Central & Western:1372       Class :character  
##  Median :  3         Wan Chai         : 888       Mode  :character  
##  Mean   : 12         Islands          : 268                         
##  3rd Qu.: 12         Kowloon City     : 257                         
##  Max.   :386         North            : 156                         
##  NA's   :10          (Other)          : 749                         
##            room_type     accommodates    bathrooms       bedrooms    
##  Entire home/apt:3157   Min.   :2.00   Min.   :0.00   Min.   : 0.00  
##  Hotel room     : 188   1st Qu.:2.00   1st Qu.:1.00   1st Qu.: 1.00  
##  Private room   :2922   Median :2.00   Median :1.00   Median : 1.00  
##  Shared room    : 170   Mean   :3.07   Mean   :1.13   Mean   : 1.13  
##                         3rd Qu.:4.00   3rd Qu.:1.00   3rd Qu.: 1.00  
##                         Max.   :9.00   Max.   :8.00   Max.   :10.00  
##                                        NA's   :4      NA's   :14     
##       beds                bed_type        price       security_deposit
##  Min.   : 0.00   Airbed       :   6   Min.   :    0   Min.   :    0   
##  1st Qu.: 1.00   Couch        :   3   1st Qu.:  372   1st Qu.:    0   
##  Median : 1.00   Futon        :   8   Median :  550   Median :    0   
##  Mean   : 1.77   Pull-out Sofa:  16   Mean   :  815   Mean   :  564   
##  3rd Qu.: 2.00   Real Bed     :6404   3rd Qu.:  853   3rd Qu.:  784   
##  Max.   :14.00                        Max.   :66667   Max.   :38000   
##  NA's   :26                                                           
##   cleaning_fee  guests_included  extra_people  minimum_nights
##  Min.   :   0   Min.   : 1.00   Min.   :   0   Min.   :1.00  
##  1st Qu.:   0   1st Qu.: 1.00   1st Qu.:   0   1st Qu.:1.00  
##  Median :   0   Median : 1.00   Median :   0   Median :1.00  
##  Mean   : 101   Mean   : 1.48   Mean   :  69   Mean   :1.55  
##  3rd Qu.: 160   3rd Qu.: 2.00   3rd Qu.: 100   3rd Qu.:2.00  
##  Max.   :4689   Max.   :16.00   Max.   :2300   Max.   :4.00  
##                                                              
##  maximum_nights     number_of_reviews reviews_per_month number_of_reviews_ltm
##  Min.   :4.00e+00   Min.   :  0       Min.   : 0        Min.   :  0.0        
##  1st Qu.:3.60e+02   1st Qu.:  1       1st Qu.: 0        1st Qu.:  0.0        
##  Median :1.12e+03   Median :  5       Median : 0        Median :  0.0        
##  Mean   :3.36e+05   Mean   : 23       Mean   : 1        Mean   :  3.5        
##  3rd Qu.:1.12e+03   3rd Qu.: 23       3rd Qu.: 1        3rd Qu.:  2.0        
##  Max.   :2.15e+09   Max.   :757       Max.   :13        Max.   :138.0        
##                                       NA's   :1541                           
##  review_scores_rating review_scores_accuracy review_scores_cleanliness
##  Min.   : 20          Min.   : 2             Min.   : 2               
##  1st Qu.: 87          1st Qu.: 9             1st Qu.: 9               
##  Median : 93          Median :10             Median : 9               
##  Mean   : 91          Mean   : 9             Mean   : 9               
##  3rd Qu.: 98          3rd Qu.:10             3rd Qu.:10               
##  Max.   :100          Max.   :10             Max.   :10               
##  NA's   :1654         NA's   :1656           NA's   :1656             
##  review_scores_checkin review_scores_communication review_scores_location
##  Min.   : 2            Min.   : 2                  Min.   : 2            
##  1st Qu.: 9            1st Qu.: 9                  1st Qu.: 9            
##  Median :10            Median :10                  Median :10            
##  Mean   :10            Mean   :10                  Mean   :10            
##  3rd Qu.:10            3rd Qu.:10                  3rd Qu.:10            
##  Max.   :10            Max.   :10                  Max.   :10            
##  NA's   :1655          NA's   :1656                NA's   :1658          
##  review_scores_value listing_url                      city     
##  Min.   : 2          Length:6437        Hong Kong       :4538  
##  1st Qu.: 9          Class :character   Shenzhen        : 196  
##  Median : 9          Mode  :character   香港            :  97  
##  Mean   : 9                             Hong Kong Island:  83  
##  3rd Qu.:10                             Kowloon         :  80  
##  Max.   :10                             (Other)         : 975  
##  NA's   :1658                           NA's            : 468  
##  description        neighborhood_overview                  cancellation_policy
##  Length:6437        Length:6437           flexible                   :1777    
##  Class :character   Class :character      moderate                   :1048    
##  Mode  :character   Mode  :character      strict                     :   2    
##                                           strict_14_with_grace_period:3575    
##                                           super_strict_30            :  14    
##                                           super_strict_60            :  21    
##                                                                               
##          prop_type_simplified price_4_nights   price_4_nights_transformed
##  Apartment         :4129      Min.   :     0   Min.   :     1            
##  Condominium       : 541      1st Qu.:  1612   1st Qu.:  1612            
##  Hostel            : 239      Median :  2388   Median :  2388            
##  Other             :1251      Mean   :  3469   Mean   :  3469            
##  Serviced apartment: 277      3rd Qu.:  3792   3rd Qu.:  3792            
##                               Max.   :266668   Max.   :266668            
##                                                                          
##  log_price_4_nights log_price_4_nights_transformed neighbourhood_simplified
##  Min.   : -Inf      Min.   : 0.00                  Length:6437             
##  1st Qu.: 7.39      1st Qu.: 7.39                  Class :character        
##  Median : 7.78      Median : 7.78                  Mode  :character        
##  Mean   : -Inf      Mean   : 7.84                                          
##  3rd Qu.: 8.24      3rd Qu.: 8.24                                          
##  Max.   :12.49      Max.   :12.49                                          
##                                                                            
##  rating_group      
##  Length:6437       
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 
#running favstats on some interesting variable combinations
favstats(price_4_nights_transformed~accommodates,
         data=hong_kong_listings_neighbourhood_simplified) 
##   accommodates  min   Q1 median   Q3    max mean    sd    n missing
## 1            2   50 1396   1924 2951 266668 2770  5802 3600       0
## 2            3    1 1736   2472 3632  43744 3224  4001  894       0
## 3            4    1 2071   3070 4250  43772 3926  4251 1020       0
## 4            5  836 2550   3662 5210  43772 5058  5670  287       0
## 5            6  596 3018   3914 5446 232008 6276 17509  323       0
## 6            7 1488 3860   4514 5814  43772 6091  6501   82       0
## 7            8  312 3812   4558 5952  74807 6578  8747  201       0
## 8            9  312 4692   5328 6479  13836 5819  2662   30       0
favstats(price_4_nights_transformed~neighbourhood_cleansed,
         data=hong_kong_listings_neighbourhood_simplified)
##    neighbourhood_cleansed  min   Q1 median   Q3    max  mean    sd    n missing
## 1       Central & Western  404 2388   3308 4328  35184  3789  2456 1372       0
## 2                 Eastern  900 1566   2212 3685  19996  3063  2526  152       0
## 3                 Islands    1 2107   2792 3908  34536  3544  3658  268       0
## 4            Kowloon City  868 1704   3272 4300  43772  5120  8610  257       0
## 5              Kwai Tsing  868 1800   2420 3936   8000  2910  1776   21       0
## 6               Kwun Tong 1024 1536   2108 3302 198000  9603 35150   31       0
## 7                   North  312 1070   1448 2333  47212  2649  5337  156       0
## 8                Sai Kung  808 1949   2646 4107  59988  4164  6745   84       0
## 9                 Sha Tin  836 1444   2450 2986   5508  2415   976   56       0
## 10           Sham Shui Po  312 1242   2200 3462  15500  2691  2136   99       0
## 11               Southern    1 2577   3842 7370 232008 10433 30227   64       0
## 12                 Tai Po 1041 1412   2668 3150  10512  3030  2421   27       0
## 13              Tsuen Wan 1268 2364   3162 4827  43772  7092 11815   30       0
## 14               Tuen Mun 1056 1611   1844 2684   7656  2344  1422   24       0
## 15               Wan Chai  588 2016   3054 4284 266668  3783  9130  888       0
## 16           Wong Tai Sin  808 2186   2656 3008   4000  2663   946   10       0
## 17          Yau Tsim Mong   50 1396   1860 2869  74807  2956  4945 2747       0
## 18              Yuen Long  312 1412   1774 2342  34564  2354  2983  151       0
favstats(price_4_nights_transformed~host_is_superhost,
         data=hong_kong_listings_neighbourhood_simplified)
##   host_is_superhost min   Q1 median   Q3    max mean   sd    n missing
## 1             FALSE   1 1580   2388 3766 266668 3424 6466 5594       0
## 2              TRUE 712 1828   2572 3893 198000 3786 8110  833       0
favstats(price_4_nights_transformed~prop_type_simplified,
         data=hong_kong_listings_neighbourhood_simplified)
##   prop_type_simplified min   Q1 median   Q3    max mean    sd    n missing
## 1            Apartment   1 1800   2784 4000  43772 3385  3359 4129       0
## 2          Condominium  50 1370   2062 3532 266668 3712 12528  541       0
## 3               Hostel 312 1288   1612 2144  11996 1836  1100  239       0
## 4                Other 312 1450   1996 2932 232008 3610 10212 1251       0
## 5   Serviced apartment 528 1434   1860 3692  43772 5030  9578  277       0
favstats(price_4_nights_transformed~minimum_nights,
         data=hong_kong_listings_neighbourhood_simplified)
##   minimum_nights min   Q1 median   Q3    max mean   sd    n missing
## 1              1   1 1452   2020 3332 266668 3397 8056 4121       0
## 2              2  50 2016   3004 4162  43772 3583 3458 1326       0
## 3              3   1 2138   3102 4300  29996 3545 2277  726       0
## 4              4 684 2524   3341 4400  23996 3823 2487  264       0

Data visualization

Building upon the above summary and favstats investigations, we visualize our data by using ggplot2.

#Distribution of Airbnb property types in Hong Kong 
ggplot(hong_kong_listings_neighbourhood_simplified, 
       aes(y=(prop_type_simplified),
           fill = neighbourhood_simplified))+
  geom_bar()+
  facet_wrap(~neighbourhood_simplified)+
  labs(title = "Distribution of Airbnb Property Types \n in Different Geographic Zones ",
       # property type is mapped to y, so the counts are on the x-axis
       x = "Number of Properties",
       y = "Property type") +   
  theme_bw() +
  theme(title = element_text(size = 15, face = "bold"),
        axis.text.x = element_text(size = 10, angle=30),
        axis.text.y = element_text(size = 10), legend.position = "none")

# Density plot of ratings by zones
ggplot(hong_kong_listings_neighbourhood_simplified, aes(x=review_scores_rating, fill=neighbourhood_simplified, alpha = 0.1))+
  geom_density()+
  scale_alpha(guide = "none") +
  labs(title = "Density plot of ratings by Different \n Geographic Zones",
       x = "Ratings",
       y = "Density") +  theme_bw()+

  theme(title = element_text(size = 15, face = "bold"),
        axis.text.x = element_text(size = 8),
        axis.text.y = element_text(size = 8),
        legend.text = element_text(size=8),
        legend.position = "bottom")   

# Distribution of average cleaning fee and security deposit by property type
cleaning_security <- hong_kong_listings_neighbourhood_simplified %>%
  group_by(prop_type_simplified) %>%
  summarise(mean_cleaning_fee = mean(cleaning_fee),
            mean_security_deposit = mean(security_deposit))

cleaning_security <- pivot_longer(cleaning_security,
                                  cols = 2:3, names_to = "Type", values_to = "value")

ggplot(cleaning_security,aes(x=prop_type_simplified, y = value, fill = Type))+
  geom_col(position = "dodge")+
  labs(title = "Distribution of Average Cleaning Fee and \n Security Deposit by Property Type",
       x = "Property Type",
       y = "Dollars") +
  theme_bw()+
  theme(title = element_text(size = 15, face = "bold"),
        axis.text.x = element_text(size = 10),
        axis.text.y = element_text(size = 10),
        legend.text = element_text(size=10))

# Boxplot of log(price_4_nights) by zones
ggplot(hong_kong_listings_neighbourhood_simplified,
       aes(x=neighbourhood_simplified, y = log_price_4_nights_transformed,
           fill = neighbourhood_simplified, alpha =0.5))+
  geom_boxplot()+  
  labs(title = "Boxplot of Total Price for 4 nights \n by zones",
       subtitle = "Zone 3 has the lowest median total price",
       x = "Zones",
       y = "Log (Price for 4 Nights)") +  
  theme_bw()+

  theme(title = element_text(size = 15, face = "bold"),
        axis.text.x = element_text(size = 10),
        axis.text.y = element_text(size = 10),
        legend.position = "none")

Correlation matrix

#Producing scatterplot-correlation matrix between important variables in the dataset
ggp <- hong_kong_listings_neighbourhood_simplified %>% 
      select(c(price_4_nights, 
               neighbourhood_simplified, 
               accommodates, 
               bathrooms, 
               beds, 
               security_deposit, 
               cleaning_fee, 
               number_of_reviews, 
               review_scores_rating)) %>% 
  ggpairs(cardinality_threshold = NULL)

print(ggp, progress = FALSE)

What are the correlations between variables? Does each scatterplot support a linear relationship between variables? Do any of the correlations appear to be conditional on the value of a categorical variable?

Above, we check the correlations between the numeric variables in the dataset. As expected, price_4_nights and price are highly correlated at 0.997, since price_4_nights is calculated from price. The variables accommodates and beds also have a strong relationship, with a correlation of 0.758, and reviews_per_month and number_of_reviews_ltm are highly correlated at 0.826. Furthermore, review_scores_rating has a strong relationship with each of the other rating categories, such as review_scores_accuracy and review_scores_cleanliness, with correlation coefficients greater than 0.7. This is particularly useful when we select variables for our regression analysis, because including review_scores_rating alone should suffice.
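
To illustrate how a correlation can be checked overall and then conditionally on a categorical variable, here is a minimal base R sketch on simulated data. The `toy` data frame, the two zones and the slopes are assumptions chosen for illustration, not results from the Hong Kong listings.

```r
set.seed(1)
n <- 200
toy <- data.frame(
  zone = rep(c("zone_1", "zone_3"), each = n / 2),
  beds = sample(1:6, n, replace = TRUE)
)

# accommodates is simulated to rise with beds, more steeply in zone_1
toy$accommodates <- ifelse(toy$zone == "zone_1", 2 * toy$beds, toy$beds) + rnorm(n)

# A couple of missing values, to mimic the NAs in the real data
toy$beds[c(5, 150)] <- NA

# Overall correlation, using only complete pairs
overall <- cor(toy$beds, toy$accommodates, use = "pairwise.complete.obs")

# Correlation within each level of the categorical variable
by_zone <- sapply(split(toy, toy$zone), function(d) {
  cor(d$beds, d$accommodates, use = "pairwise.complete.obs")
})

print(overall)
print(by_zone)
```

If the within-group correlations differ noticeably from each other or from the overall value, the relationship may be conditional on the grouping variable.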

Mapping

library(leaflet)
leaflet(data = filter(listings, minimum_nights <= 4)) %>% 
  addProviderTiles("OpenStreetMap.Mapnik") %>% 
  addCircleMarkers(lng = ~longitude, 
                   lat = ~latitude, 
                   radius = 1, 
                   fillColor = "blue", 
                   fillOpacity = 0.4, 
                   popup = ~listing_url,
                   label = ~property_type)

Regression Analysis

Price_4_nights vs. Log (price_4_nights)

# histogram of price_4_nights
ggplot(hong_kong_listings_total_price, aes (x = price_4_nights))+
  geom_histogram()+
  xlim(c(0,20000))+
  labs(title = "Histogram of Total Prices for 4 Nights",
       x = "Total Prices for 4 Nights",
       y = "Count")+
  # apply theme_bw() first; adding it last would reset the text sizes set in theme()
  theme_bw()+
  theme(title = element_text(size=15),
        axis.text.x = element_text(size=10),
        axis.text.y = element_text(size=10))

# histogram of log(price_4_nights)
ggplot(hong_kong_listings_total_price, aes (x = log_price_4_nights))+
  geom_histogram()+
  labs(title = "Histogram of Log (Prices for 4 Nights)",
       x = "Log Prices for 4 Nights",
       y = "Count")+
  # apply theme_bw() first; adding it last would reset the text sizes set in theme()
  theme_bw()+
  theme(title = element_text(size=15),
        axis.text.x = element_text(size=10),
        axis.text.y = element_text(size=10))

We use log(price_4_nights) because the histograms show that its distribution is roughly normal, whereas the distribution of price_4_nights is strongly right-skewed. If we used the untransformed price_4_nights in the regression analysis, the relationship with the explanatory variables might not be linear and the residual variance might not be constant.
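
The effect of the log transformation on a right-skewed variable can be seen in a small simulation. The sketch below draws hypothetical prices from a log-normal distribution (the parameters are assumptions, not estimates from the data) and compares the gap between mean and median before and after taking logs.

```r
set.seed(42)

# Simulated right-skewed prices; meanlog and sdlog are illustrative assumptions
sim_price <- rlnorm(10000, meanlog = 7.8, sdlog = 0.7)

# In a right-skewed distribution the mean sits well above the median
raw_gap <- mean(sim_price) - median(sim_price)

# After the log transformation the mean and median nearly coincide
log_gap <- mean(log(sim_price)) - median(log(sim_price))

print(c(raw_gap = raw_gap, log_gap = log_gap))
```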

Model 1

# explanatory variables: prop_type_simplified, number_of_reviews, review_scores_rating
model1 <- lm(log_price_4_nights_transformed ~ 
               prop_type_simplified + 
               number_of_reviews + 
               review_scores_rating, 
             data = hong_kong_listings_total_price)
msummary(model1)
##                                         Estimate Std. Error t value Pr(>|t|)
## (Intercept)                             7.284387   0.078870   92.36  < 2e-16
## prop_type_simplifiedCondominium        -0.142477   0.033567   -4.24  2.2e-05
## prop_type_simplifiedHostel             -0.449322   0.048289   -9.30  < 2e-16
## prop_type_simplifiedOther              -0.173379   0.023501   -7.38  1.9e-13
## prop_type_simplifiedServiced apartment -0.172016   0.048864   -3.52  0.00044
## number_of_reviews                      -0.000868   0.000176   -4.92  8.9e-07
## review_scores_rating                    0.007045   0.000854    8.25  < 2e-16
##                                           
## (Intercept)                            ***
## prop_type_simplifiedCondominium        ***
## prop_type_simplifiedHostel             ***
## prop_type_simplifiedOther              ***
## prop_type_simplifiedServiced apartment ***
## number_of_reviews                      ***
## review_scores_rating                   ***
## 
## Residual standard error: 0.621 on 4776 degrees of freedom
##   (1654 observations deleted due to missingness)
## Multiple R-squared:  0.0502, Adjusted R-squared:  0.049 
## F-statistic: 42.1 on 6 and 4776 DF,  p-value: <2e-16
car::vif(model1)
##                      GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.02  4            1.00
## number_of_reviews    1.01  1            1.00
## review_scores_rating 1.02  1            1.01
plot(model1)

Interpretations of Regression Output

Since the dependent variable is the logarithm of price_4_nights, we exponentiate each coefficient and subtract 1 to obtain the proportional change in price_4_nights associated with a one-unit increase in the X variable, holding everything else equal. Property type is a categorical variable, so the regression uses Apartment as the baseline, which is not shown in the regression output.

The coefficient for number_of_reviews is -0.000868, so the unit increase in price_4_nights will be (e^-0.000868 -1). That is, for every increase of 1 in the review rating score, the price_4_nights will decrease by 0.000868.

The coefficient for review_scores_rating is 0.007045, so the proportional change in price_4_nights is (e^0.007045 - 1) ≈ 0.00707. That is, for every increase of 1 in the review rating score, everything else equal, price_4_nights increases by about 0.71%.

  • if the property type is condominium, everything else equal, price_4_nights changes by (e^-0.142477 - 1) = -0.133, that is, it is about 13.3% lower than for an apartment.
  • if the property type is hostel, everything else equal, price_4_nights changes by (e^-0.449322 - 1) = -0.362, that is, it is about 36.2% lower than for an apartment.
  • if the property type is other, everything else equal, price_4_nights changes by (e^-0.173379 - 1) = -0.159, that is, it is about 15.9% lower than for an apartment.
  • if the property type is serviced apartment, everything else equal, price_4_nights changes by (e^-0.172016 - 1) = -0.158, that is, it is about 15.8% lower than for an apartment.
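The conversions above can be reproduced with a short sketch. The coefficient values are copied from the model1 output; the helper name pct_change is our own and purely illustrative.

```r
# Coefficients copied from the model1 output above (log-price scale).
coefs <- c(
  Condominium          = -0.142477,
  Hostel               = -0.449322,
  Other                = -0.173379,
  `Serviced apartment` = -0.172016,
  number_of_reviews    = -0.000868,
  review_scores_rating =  0.007045
)

# In a log-linear model, a one-unit change in X multiplies Y by exp(beta),
# so the percentage change in Y is 100 * (exp(beta) - 1).
pct_change <- function(beta) 100 * (exp(beta) - 1)

round(pct_change(coefs), 2)
```

With the fitted object available, the same helper could be applied directly as pct_change(coef(model1)[-1]), which skips the intercept.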

Interpretation of the above plots

  • first plot (Fitted vs Residual):
    • detects several types of violations in the linear regression assumptions
      • Whether linearity holds. This is indicated by the mean residual value for every fitted value region being close to 0: the closer the red line is to the dashed line, the better the linearity assumption holds.
      • Whether homoskedasticity holds. The spread of residuals should be approximately the same across the x-axis.
      • Whether there are outliers. This is indicated by some ‘extreme’ residuals that are far from the rest.
  • In the second plot (Normal Q-Q Plot):
    • The Q-Q plot, or quantile-quantile plot, is a graphical tool to help us assess if a set of data plausibly came from some theoretical distribution such as a Normal or exponential.
      • A Q-Q plot is a scatterplot created by plotting two sets of quantiles against one another. If both sets of quantiles came from the same distribution, we should see the points forming a line that’s roughly straight
  • In the third plot (Scale Location):
    • If the red line is approximately horizontal, then the average magnitude of the standardized residuals isn’t changing much as a function of the fitted values.
    • If the spread around the red line doesn’t vary with the fitted values, then the variability of magnitudes doesn’t vary much as a function of the fitted values.
  • Fourth plot (Residuals vs Leverage):
    • This can help detect outliers in a linear regression model:
      • We’re looking at how the spread of standardized residuals changes as the leverage, or sensitivity of the fitted value ŷ_i to a change in y_i, increases. Firstly, this can also be used to detect heteroskedasticity and non-linearity. The spread of standardized residuals shouldn’t change as a function of leverage: here it appears to decrease, indicating heteroskedasticity.
      • Second, points with high leverage may be influential: that is, deleting them would change the model a lot. For this we can look at Cook’s distance, which measures the effect of deleting a point on the combined parameter vector. Cook’s distance contours are drawn as the dotted red lines here, and points outside the dotted lines have high influence. In this case there are no points outside the dotted lines.
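The four diagnostic plots described above, and the leverage and Cook’s distance values behind the fourth one, can also be inspected numerically. The following is a minimal sketch on the built-in mtcars data rather than the Airbnb listings; the 4/n cut-off for Cook’s distance is a common rule of thumb that we assume here for illustration, not a threshold used in this analysis.

```r
# Illustrative model on a built-in data set (not the Airbnb data).
fit <- lm(log(mpg) ~ wt + hp, data = mtcars)

# Draw the four default diagnostic plots in a 2 x 2 grid.
op <- par(mfrow = c(2, 2))
plot(fit)
par(op)

# Numeric counterparts of the Residuals vs Leverage plot.
lev  <- hatvalues(fit)        # leverage of each observation
cook <- cooks.distance(fit)   # influence of each observation

# Flag observations using the assumed 4/n rule of thumb.
influential <- names(cook)[cook > 4 / nobs(fit)]
influential
```

Replacing fit with model1 would give the same quantities for the regression discussed above.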

Model 2

# explanatory variables in model1 plus room_type
model2 <- lm(log_price_4_nights_transformed ~ 
               prop_type_simplified + 
               number_of_reviews + 
               review_scores_rating + 
               room_type, 
             data = hong_kong_listings_total_price)

msummary(model2)
##                                         Estimate Std. Error t value Pr(>|t|)
## (Intercept)                             7.806235   0.075081  103.97  < 2e-16
## prop_type_simplifiedCondominium        -0.126730   0.030977   -4.09  4.4e-05
## prop_type_simplifiedHostel             -0.218213   0.046868   -4.66  3.3e-06
## prop_type_simplifiedOther               0.005852   0.023389    0.25    0.802
## prop_type_simplifiedServiced apartment -0.020805   0.045874   -0.45    0.650
## number_of_reviews                      -0.000450   0.000164   -2.75    0.006
## review_scores_rating                    0.003369   0.000799    4.22  2.5e-05
## room_typeHotel room                    -0.245381   0.051155   -4.80  1.7e-06
## room_typePrivate room                  -0.533044   0.018560  -28.72  < 2e-16
## room_typeShared room                   -0.249947   0.056373   -4.43  9.5e-06
##                                           
## (Intercept)                            ***
## prop_type_simplifiedCondominium        ***
## prop_type_simplifiedHostel             ***
## prop_type_simplifiedOther                 
## prop_type_simplifiedServiced apartment    
## number_of_reviews                      ** 
## review_scores_rating                   ***
## room_typeHotel room                    ***
## room_typePrivate room                  ***
## room_typeShared room                   ***
## 
## Residual standard error: 0.573 on 4773 degrees of freedom
##   (1654 observations deleted due to missingness)
## Multiple R-squared:  0.192,  Adjusted R-squared:  0.191 
## F-statistic:  126 on 9 and 4773 DF,  p-value: <2e-16
car::vif(model2)
##                      GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.29  4            1.03
## number_of_reviews    1.02  1            1.01
## review_scores_rating 1.05  1            1.02
## room_type            1.34  3            1.05
plot(model2)

room_type is a significant predictor of price_4_nights: as shown in the model 2 summary above, the t-values for all three room type coefficients have absolute values greater than 2 (each measured against the baseline room type, which is not shown in the output).
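Individual t-values test each room type level against the baseline separately. A complementary check is a partial F-test of the whole factor, comparing the model with and without it. The sketch below uses the built-in iris data as a stand-in, so the variable names are not from this analysis.

```r
# Illustrative example on a built-in data set (not the Airbnb data).
small <- lm(log(Sepal.Length) ~ Sepal.Width, data = iris)
big   <- update(small, . ~ . + Species)   # adds a 3-level factor

# Partial F-test: do the Species dummies jointly improve the fit?
f_test <- anova(small, big)
f_test
```

For the models here the analogous call would be anova(model1, model2). This comparison requires both models to be fit on the same rows; both summaries report 1654 observations deleted due to missingness, which is consistent with that but worth confirming.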

Further Regression Analysis

Model 3

When performing regression analysis, we removed variables that are perfectly collinear with others (property_type, which duplicates prop_type_simplified). In addition, after checking the Variance Inflation Factors, we found that neighbourhood_cleansed and city have very high collinearity, so we removed both variables from our regression analysis and used neighbourhood_simplified instead. Although the GVIFs of reviews_per_month, room_type and review_scores_rating are all between 5 and 10, we decided to keep these variables for our next model because they could be influential on our total price.

model3 <- lm(log_price_4_nights_transformed ~ . 
             -log_price_4_nights 
             - price_4_nights 
             - price_4_nights_transformed 
             - listing_url 
             - id 
             - host_id 
             - description 
             - neighborhood_overview 
             - city 
             - property_type 
             - neighbourhood_cleansed,
             data = hong_kong_listings_neighbourhood_simplified)
msummary(model3)
##                                                 Estimate Std. Error t value
## (Intercept)                                     7.36e+00   3.31e-01   22.26
## host_since                                     -1.40e-05   1.06e-05   -1.32
## host_is_superhostTRUE                           1.12e-01   1.92e-02    5.84
## host_listings_count                            -1.27e-03   3.94e-04   -3.23
## room_typeHotel room                            -2.71e-02   4.08e-02   -0.66
## room_typePrivate room                          -2.04e-01   1.99e-02  -10.25
## room_typeShared room                           -3.03e-01   6.05e-02   -5.01
## accommodates                                    7.50e-02   8.27e-03    9.07
## bathrooms                                       5.07e-03   2.31e-02    0.22
## bedrooms                                        1.98e-02   1.29e-02    1.54
## beds                                            1.66e-02   1.02e-02    1.63
## bed_typeFuton                                   2.28e-01   3.17e-01    0.72
## bed_typePull-out Sofa                           1.70e-01   2.72e-01    0.63
## bed_typeReal Bed                                2.57e-01   2.46e-01    1.04
## price                                           3.48e-04   6.65e-06   52.39
## security_deposit                                1.31e-05   3.78e-06    3.45
## cleaning_fee                                    2.94e-04   3.73e-05    7.89
## guests_included                                -2.07e-02   7.72e-03   -2.68
## extra_people                                    4.45e-04   4.79e-05    9.29
## minimum_nights                                  3.46e-02   9.69e-03    3.58
## maximum_nights                                 -2.98e-06   3.37e-06   -0.88
## number_of_reviews                               9.66e-05   2.39e-04    0.40
## reviews_per_month                              -2.97e-02   1.39e-02   -2.14
## number_of_reviews_ltm                          -4.97e-04   1.25e-03   -0.40
## review_scores_rating                            5.75e-03   1.91e-03    3.01
## review_scores_accuracy                         -1.09e-02   1.34e-02   -0.82
## review_scores_cleanliness                       2.21e-02   1.10e-02    2.02
## review_scores_checkin                           2.75e-02   1.18e-02    2.33
## review_scores_communication                    -2.70e-02   1.34e-02   -2.02
## review_scores_location                         -1.35e-02   1.27e-02   -1.07
## review_scores_value                            -5.33e-02   1.27e-02   -4.20
## cancellation_policymoderate                     3.51e-03   2.31e-02    0.15
## cancellation_policystrict                      -1.08e-01   3.50e-01   -0.31
## cancellation_policystrict_14_with_grace_period  2.05e-02   2.04e-02    1.01
## prop_type_simplifiedCondominium                -3.29e-02   2.69e-02   -1.23
## prop_type_simplifiedHostel                     -1.10e-01   4.83e-02   -2.29
## prop_type_simplifiedOther                      -1.30e-02   2.03e-02   -0.64
## prop_type_simplifiedServiced apartment          2.84e-02   3.86e-02    0.74
## neighbourhood_simplifiedzone_2                 -1.07e-01   3.32e-02   -3.22
## neighbourhood_simplifiedzone_3                 -1.27e-01   1.93e-02   -6.55
## neighbourhood_simplifiedzone_4                 -3.01e-01   3.24e-02   -9.30
## neighbourhood_simplifiedzone_5                 -9.94e-02   5.53e-02   -1.80
## rating_groupUnder 90                           -1.15e-02   2.37e-02   -0.48
##                                                Pr(>|t|)    
## (Intercept)                                     < 2e-16 ***
## host_since                                      0.18630    
## host_is_superhostTRUE                           5.9e-09 ***
## host_listings_count                             0.00125 ** 
## room_typeHotel room                             0.50628    
## room_typePrivate room                           < 2e-16 ***
## room_typeShared room                            5.9e-07 ***
## accommodates                                    < 2e-16 ***
## bathrooms                                       0.82648    
## bedrooms                                        0.12324    
## beds                                            0.10233    
## bed_typeFuton                                   0.47180    
## bed_typePull-out Sofa                           0.53199    
## bed_typeReal Bed                                0.29711    
## price                                           < 2e-16 ***
## security_deposit                                0.00057 ***
## cleaning_fee                                    4.6e-15 ***
## guests_included                                 0.00741 ** 
## extra_people                                    < 2e-16 ***
## minimum_nights                                  0.00036 ***
## maximum_nights                                  0.37687    
## number_of_reviews                               0.68665    
## reviews_per_month                               0.03258 *  
## number_of_reviews_ltm                           0.69063    
## review_scores_rating                            0.00268 ** 
## review_scores_accuracy                          0.41263    
## review_scores_cleanliness                       0.04341 *  
## review_scores_checkin                           0.01990 *  
## review_scores_communication                     0.04385 *  
## review_scores_location                          0.28604    
## review_scores_value                             2.8e-05 ***
## cancellation_policymoderate                     0.87915    
## cancellation_policystrict                       0.75836    
## cancellation_policystrict_14_with_grace_period  0.31482    
## prop_type_simplifiedCondominium                 0.22027    
## prop_type_simplifiedHostel                      0.02230 *  
## prop_type_simplifiedOther                       0.52338    
## prop_type_simplifiedServiced apartment          0.46200    
## neighbourhood_simplifiedzone_2                  0.00132 ** 
## neighbourhood_simplifiedzone_3                  6.7e-11 ***
## neighbourhood_simplifiedzone_4                  < 2e-16 ***
## neighbourhood_simplifiedzone_5                  0.07247 .  
## rating_groupUnder 90                            0.62822    
## 
## Residual standard error: 0.346 on 2598 degrees of freedom
##   (3796 observations deleted due to missingness)
## Multiple R-squared:  0.714,  Adjusted R-squared:  0.709 
## F-statistic:  154 on 42 and 2598 DF,  p-value: <2e-16
car::vif(model3)
##                             GVIF Df GVIF^(1/(2*Df))
## host_since                  1.41  1            1.19
## host_is_superhost           1.25  1            1.12
## host_listings_count         1.61  1            1.27
## room_type                   6.08  3            1.35
## accommodates                4.27  1            2.07
## bathrooms                   1.85  1            1.36
## bedrooms                    1.79  1            1.34
## beds                        3.92  1            1.98
## bed_type                    1.05  3            1.01
## price                       1.14  1            1.07
## security_deposit            1.18  1            1.09
## cleaning_fee                1.40  1            1.18
## guests_included             1.62  1            1.27
## extra_people                1.65  1            1.29
## minimum_nights              1.54  1            1.24
## maximum_nights              1.01  1            1.01
## number_of_reviews           3.60  1            1.90
## reviews_per_month           7.07  1            2.66
## number_of_reviews_ltm       3.72  1            1.93
## review_scores_rating        6.77  1            2.60
## review_scores_accuracy      3.49  1            1.87
## review_scores_cleanliness   2.63  1            1.62
## review_scores_checkin       2.30  1            1.52
## review_scores_communication 2.66  1            1.63
## review_scores_location      1.70  1            1.30
## review_scores_value         3.13  1            1.77
## cancellation_policy         1.35  3            1.05
## prop_type_simplified        2.09  4            1.10
## neighbourhood_simplified    2.49  4            1.12
## rating_group                2.43  1            1.56
plot(model3)
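For terms with more than one degree of freedom, such as room_type, car::vif reports a generalized VIF together with the dimension-adjusted value GVIF^(1/(2*Df)). A commonly cited suggestion is to square the adjusted value before comparing it with the usual VIF rules of thumb; we treat that as an assumption here. The sketch below recomputes the adjusted column for the three terms with the largest GVIF, using numbers copied from the output above.

```r
# GVIF and Df copied from the car::vif(model3) output above.
gvif <- c(room_type = 6.08, reviews_per_month = 7.07, review_scores_rating = 6.77)
df   <- c(room_type = 3,    reviews_per_month = 1,    review_scores_rating = 1)

# Adjusted value, as reported in the last column of the output above.
adj <- gvif^(1 / (2 * df))

# Squaring it gives a number on the same scale as an ordinary VIF.
round(cbind(GVIF = gvif, Df = df, adjusted = adj, adjusted_sq = adj^2), 2)
```

On this scale room_type looks much less collinear than its raw GVIF of 6.08 suggests, while the two single-degree-of-freedom terms are unchanged.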

Model 4

For our model 4, we refined the model by removing variables whose coefficients have absolute t-values less than 2 in model 3 (host_since, bathrooms, bedrooms, beds, bed_type, maximum_nights, number_of_reviews_ltm, number_of_reviews, review_scores_accuracy, review_scores_location and cancellation_policy). Note that the formula below also omits reviews_per_month, and that it retains rating_group.
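Dropping terms by individual t-values is awkward for factors, because each level gets its own t-value. One alternative is a single F-test per term, which base R provides through drop1. The following is a sketch on the built-in mtcars data, not the procedure used for the models in this report.

```r
# Illustrative model on a built-in data set (not the Airbnb data).
mt  <- transform(mtcars, cyl = factor(cyl))
fit <- lm(log(mpg) ~ wt + hp + qsec + cyl, data = mt)

# One F-test per term; the 3-level factor cyl gets a single 2-df test.
tests <- drop1(fit, test = "F")
tests

# Candidate for removal: the term with the largest p-value.
p_vals <- tests[["Pr(>F)"]][-1]        # first row is "<none>"
names(p_vals) <- rownames(tests)[-1]
names(which.max(p_vals))
```

Repeating this step after each refit gives a simple backward elimination in which a factor is kept or dropped as a whole.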

model4 <- lm(log_price_4_nights_transformed ~ 
               host_is_superhost +  
               host_listings_count + 
               room_type +
               accommodates + 
               price + 
               security_deposit + 
               cleaning_fee + 
               guests_included + 
               extra_people + 
               minimum_nights +
               review_scores_rating +
               review_scores_cleanliness + 
               review_scores_checkin +
               review_scores_communication +
               review_scores_value + 
               prop_type_simplified + 
               neighbourhood_simplified+
               rating_group,
             data = hong_kong_listings_neighbourhood_simplified)

msummary(model4)
##                                         Estimate Std. Error t value Pr(>|t|)
## (Intercept)                             7.48e+00   8.65e-02   86.57  < 2e-16
## host_is_superhostTRUE                   1.08e-01   1.70e-02    6.37  2.0e-10
## host_listings_count                    -1.32e-03   3.21e-04   -4.12  3.8e-05
## room_typeHotel room                    -1.27e-02   3.78e-02   -0.33  0.73801
## room_typePrivate room                  -2.21e-01   1.63e-02  -13.55  < 2e-16
## room_typeShared room                   -4.01e-01   4.62e-02   -8.69  < 2e-16
## accommodates                            1.05e-01   4.66e-03   22.62  < 2e-16
## price                                   2.19e-04   4.43e-06   49.45  < 2e-16
## security_deposit                        1.38e-05   3.51e-06    3.93  8.5e-05
## cleaning_fee                            3.88e-04   3.72e-05   10.43  < 2e-16
## guests_included                        -2.61e-02   6.51e-03   -4.00  6.4e-05
## extra_people                            5.45e-04   4.10e-05   13.31  < 2e-16
## minimum_nights                          3.63e-02   8.10e-03    4.48  7.7e-06
## review_scores_rating                    1.51e-03   1.44e-03    1.05  0.29372
## review_scores_cleanliness               3.37e-02   8.75e-03    3.84  0.00012
## review_scores_checkin                   8.18e-03   1.03e-02    0.79  0.42676
## review_scores_communication            -5.56e-03   1.07e-02   -0.52  0.60256
## review_scores_value                    -5.57e-02   9.82e-03   -5.68  1.5e-08
## prop_type_simplifiedCondominium        -7.05e-02   2.24e-02   -3.15  0.00163
## prop_type_simplifiedHostel             -1.19e-01   3.35e-02   -3.54  0.00040
## prop_type_simplifiedOther               2.28e-02   1.72e-02    1.33  0.18433
## prop_type_simplifiedServiced apartment  7.79e-02   3.38e-02    2.31  0.02112
## neighbourhood_simplifiedzone_2         -1.06e-01   2.91e-02   -3.65  0.00027
## neighbourhood_simplifiedzone_3         -1.54e-01   1.61e-02   -9.58  < 2e-16
## neighbourhood_simplifiedzone_4         -3.49e-01   2.83e-02  -12.32  < 2e-16
## neighbourhood_simplifiedzone_5          1.95e-02   5.24e-02    0.37  0.71024
## rating_groupUnder 90                   -3.48e-02   1.92e-02   -1.82  0.06946
##                                           
## (Intercept)                            ***
## host_is_superhostTRUE                  ***
## host_listings_count                    ***
## room_typeHotel room                       
## room_typePrivate room                  ***
## room_typeShared room                   ***
## accommodates                           ***
## price                                  ***
## security_deposit                       ***
## cleaning_fee                           ***
## guests_included                        ***
## extra_people                           ***
## minimum_nights                         ***
## review_scores_rating                      
## review_scores_cleanliness              ***
## review_scores_checkin                     
## review_scores_communication               
## review_scores_value                    ***
## prop_type_simplifiedCondominium        ** 
## prop_type_simplifiedHostel             ***
## prop_type_simplifiedOther                 
## prop_type_simplifiedServiced apartment *  
## neighbourhood_simplifiedzone_2         ***
## neighbourhood_simplifiedzone_3         ***
## neighbourhood_simplifiedzone_4         ***
## neighbourhood_simplifiedzone_5            
## rating_groupUnder 90                   .  
## 
## Residual standard error: 0.405 on 4741 degrees of freedom
##   (1669 observations deleted due to missingness)
## Multiple R-squared:  0.599,  Adjusted R-squared:  0.596 
## F-statistic:  272 on 26 and 4741 DF,  p-value: <2e-16
car::vif(model4)
##                             GVIF Df GVIF^(1/(2*Df))
## host_is_superhost           1.12  1            1.06
## host_listings_count         1.35  1            1.16
## room_type                   2.80  3            1.19
## accommodates                1.66  1            1.29
## price                       1.05  1            1.02
## security_deposit            1.16  1            1.08
## cleaning_fee                1.42  1            1.19
## guests_included             1.52  1            1.23
## extra_people                1.34  1            1.16
## minimum_nights              1.39  1            1.18
## review_scores_rating        6.66  1            2.58
## review_scores_cleanliness   2.82  1            1.68
## review_scores_checkin       3.06  1            1.75
## review_scores_communication 3.16  1            1.78
## review_scores_value         3.22  1            1.79
## prop_type_simplified        1.57  4            1.06
## neighbourhood_simplified    1.87  4            1.08
## rating_group                2.27  1            1.51
plot(model4)

Model 5

For our model 5, we removed the variables from model 4 that are insignificant (absolute t-values less than 2): review_scores_checkin and review_scores_communication. Note that the formula below also drops room_type and adds number_of_reviews back in.

model5 <- lm(log_price_4_nights_transformed ~
               host_is_superhost +  
               host_listings_count + 
               accommodates + 
               price + 
               security_deposit + 
               cleaning_fee + 
               guests_included +
               extra_people +
               minimum_nights + 
               number_of_reviews + 
               review_scores_rating +
               review_scores_cleanliness + 
               review_scores_value + 
               prop_type_simplified + 
               neighbourhood_simplified+
               rating_group,
             data = hong_kong_listings_neighbourhood_simplified)

msummary(model5)
##                                         Estimate Std. Error t value Pr(>|t|)
## (Intercept)                             7.35e+00   8.37e-02   87.82  < 2e-16
## host_is_superhostTRUE                   9.40e-02   1.76e-02    5.35  9.4e-08
## host_listings_count                    -2.32e-03   3.21e-04   -7.21  6.4e-13
## accommodates                            1.15e-01   4.40e-03   26.09  < 2e-16
## price                                   2.23e-04   4.52e-06   49.31  < 2e-16
## security_deposit                        1.73e-05   3.59e-06    4.82  1.5e-06
## cleaning_fee                            5.02e-04   3.71e-05   13.54  < 2e-16
## guests_included                        -8.97e-03   6.51e-03   -1.38  0.16812
## extra_people                            4.16e-04   3.92e-05   10.62  < 2e-16
## minimum_nights                          5.40e-02   8.17e-03    6.61  4.2e-11
## number_of_reviews                      -5.98e-04   1.24e-04   -4.83  1.4e-06
## review_scores_rating                    2.04e-03   1.36e-03    1.50  0.13325
## review_scores_cleanliness               3.26e-02   8.95e-03    3.64  0.00028
## review_scores_value                    -5.78e-02   9.82e-03   -5.89  4.2e-09
## prop_type_simplifiedCondominium        -6.31e-02   2.29e-02   -2.75  0.00589
## prop_type_simplifiedHostel             -1.63e-01   3.30e-02   -4.94  8.2e-07
## prop_type_simplifiedOther              -2.16e-03   1.69e-02   -0.13  0.89792
## prop_type_simplifiedServiced apartment  9.34e-02   3.43e-02    2.72  0.00646
## neighbourhood_simplifiedzone_2         -1.10e-01   2.97e-02   -3.69  0.00022
## neighbourhood_simplifiedzone_3         -1.85e-01   1.60e-02  -11.58  < 2e-16
## neighbourhood_simplifiedzone_4         -3.49e-01   2.90e-02  -12.07  < 2e-16
## neighbourhood_simplifiedzone_5         -4.75e-02   5.34e-02   -0.89  0.37352
## rating_groupUnder 90                   -4.84e-02   1.95e-02   -2.48  0.01316
##                                           
## (Intercept)                            ***
## host_is_superhostTRUE                  ***
## host_listings_count                    ***
## accommodates                           ***
## price                                  ***
## security_deposit                       ***
## cleaning_fee                           ***
## guests_included                           
## extra_people                           ***
## minimum_nights                         ***
## number_of_reviews                      ***
## review_scores_rating                      
## review_scores_cleanliness              ***
## review_scores_value                    ***
## prop_type_simplifiedCondominium        ** 
## prop_type_simplifiedHostel             ***
## prop_type_simplifiedOther                 
## prop_type_simplifiedServiced apartment ** 
## neighbourhood_simplifiedzone_2         ***
## neighbourhood_simplifiedzone_3         ***
## neighbourhood_simplifiedzone_4         ***
## neighbourhood_simplifiedzone_5            
## rating_groupUnder 90                   *  
## 
## Residual standard error: 0.415 on 4745 degrees of freedom
##   (1669 observations deleted due to missingness)
## Multiple R-squared:  0.579,  Adjusted R-squared:  0.577 
## F-statistic:  296 on 22 and 4745 DF,  p-value: <2e-16
car::vif(model5)
##                           GVIF Df GVIF^(1/(2*Df))
## host_is_superhost         1.15  1            1.07
## host_listings_count       1.29  1            1.14
## accommodates              1.41  1            1.19
## price                     1.04  1            1.02
## security_deposit          1.16  1            1.08
## cleaning_fee              1.35  1            1.16
## guests_included           1.45  1            1.20
## extra_people              1.17  1            1.08
## minimum_nights            1.35  1            1.16
## number_of_reviews         1.11  1            1.05
## review_scores_rating      5.69  1            2.39
## review_scores_cleanliness 2.81  1            1.68
## review_scores_value       3.07  1            1.75
## prop_type_simplified      1.35  4            1.04
## neighbourhood_simplified  1.72  4            1.07
## rating_group              2.24  1            1.50
plot(model5)

Model 6

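Lastly, we removed guests_included, which has an absolute t-value less than 2 in model 5; the formula below also drops rating_group. Model 6 is our final regression model, as every variable in it has at least one significant coefficient (the Other property type and zone_5 levels are not individually significant, but they belong to factors whose other levels are).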

model6 <- lm(log_price_4_nights_transformed ~  
               host_is_superhost +  
               host_listings_count + 
               accommodates + 
               price + 
               security_deposit + 
               cleaning_fee + 
               extra_people +
               minimum_nights + 
               number_of_reviews + 
               review_scores_rating +
               review_scores_cleanliness + 
               review_scores_value + 
               prop_type_simplified +
               neighbourhood_simplified,
             data = hong_kong_listings_neighbourhood_simplified)

msummary(model6)
##                                         Estimate Std. Error t value Pr(>|t|)
## (Intercept)                             7.21e+00   6.26e-02  115.19  < 2e-16
## host_is_superhostTRUE                   9.81e-02   1.75e-02    5.62  2.1e-08
## host_listings_count                    -2.37e-03   3.19e-04   -7.42  1.4e-13
## accommodates                            1.12e-01   4.04e-03   27.80  < 2e-16
## price                                   2.23e-04   4.52e-06   49.33  < 2e-16
## security_deposit                        1.71e-05   3.58e-06    4.77  1.9e-06
## cleaning_fee                            4.96e-04   3.64e-05   13.60  < 2e-16
## extra_people                            4.06e-04   3.86e-05   10.51  < 2e-16
## minimum_nights                          5.45e-02   8.17e-03    6.66  3.0e-11
## number_of_reviews                      -6.10e-04   1.23e-04   -4.94  7.9e-07
## review_scores_rating                    3.77e-03   1.17e-03    3.21  0.00134
## review_scores_cleanliness               3.21e-02   8.95e-03    3.58  0.00034
## review_scores_value                    -6.10e-02   9.75e-03   -6.26  4.3e-10
## prop_type_simplifiedCondominium        -6.28e-02   2.29e-02   -2.74  0.00615
## prop_type_simplifiedHostel             -1.60e-01   3.30e-02   -4.86  1.2e-06
## prop_type_simplifiedOther              -1.49e-03   1.69e-02   -0.09  0.92964
## prop_type_simplifiedServiced apartment  9.24e-02   3.43e-02    2.69  0.00707
## neighbourhood_simplifiedzone_2         -1.13e-01   2.97e-02   -3.82  0.00014
## neighbourhood_simplifiedzone_3         -1.92e-01   1.58e-02  -12.15  < 2e-16
## neighbourhood_simplifiedzone_4         -3.52e-01   2.90e-02  -12.14  < 2e-16
## neighbourhood_simplifiedzone_5         -4.86e-02   5.34e-02   -0.91  0.36266
##                                           
## (Intercept)                            ***
## host_is_superhostTRUE                  ***
## host_listings_count                    ***
## accommodates                           ***
## price                                  ***
## security_deposit                       ***
## cleaning_fee                           ***
## extra_people                           ***
## minimum_nights                         ***
## number_of_reviews                      ***
## review_scores_rating                   ** 
## review_scores_cleanliness              ***
## review_scores_value                    ***
## prop_type_simplifiedCondominium        ** 
## prop_type_simplifiedHostel             ***
## prop_type_simplifiedOther                 
## prop_type_simplifiedServiced apartment ** 
## neighbourhood_simplifiedzone_2         ***
## neighbourhood_simplifiedzone_3         ***
## neighbourhood_simplifiedzone_4         ***
## neighbourhood_simplifiedzone_5            
## 
## Residual standard error: 0.415 on 4747 degrees of freedom
##   (1669 observations deleted due to missingness)
## Multiple R-squared:  0.578,  Adjusted R-squared:  0.576 
## F-statistic:  325 on 20 and 4747 DF,  p-value: <2e-16
car::vif(model6)
##                           GVIF Df GVIF^(1/(2*Df))
## host_is_superhost         1.13  1            1.06
## host_listings_count       1.28  1            1.13
## accommodates              1.19  1            1.09
## price                     1.04  1            1.02
## security_deposit          1.15  1            1.07
## cleaning_fee              1.30  1            1.14
## extra_people              1.14  1            1.07
## minimum_nights            1.35  1            1.16
## number_of_reviews         1.10  1            1.05
## review_scores_rating      4.23  1            2.06
## review_scores_cleanliness 2.81  1            1.68
## review_scores_value       3.02  1            1.74
## prop_type_simplified      1.34  4            1.04
## neighbourhood_simplified  1.67  4            1.07
plot(model6)
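When comparing fits, it helps to keep the number of rows each model used next to its fit statistics. The summaries above report different amounts of missingness: 1654 observations deleted for models 1 and 2, 3796 for model 3, and 1669 for models 4 to 6. The sketch below collects these quantities for a list of models, using the built-in mtcars data for illustration.

```r
# Illustrative models on a built-in data set (not the Airbnb data).
models <- list(
  m1 = lm(log(mpg) ~ wt, data = mtcars),
  m2 = lm(log(mpg) ~ wt + hp, data = mtcars),
  m3 = lm(log(mpg) ~ wt + hp + qsec, data = mtcars)
)

# One row per model: rows used, adjusted R-squared, residual standard error.
overview <- data.frame(
  n      = sapply(models, nobs),
  adj_r2 = sapply(models, function(m) summary(m)$adj.r.squared),
  sigma  = sapply(models, sigma)
)
round(overview, 3)
```

Passing list(model1, model2, model3, model4, model5, model6) instead would show directly that model 3 was estimated on a much smaller sample than the others.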

Models Overview

huxtable::huxreg(model1,
                 model2,
                 model3,
                 model4,
                 model5,
                 model6)
                                                (1)         (2)         (3)         (4)         (5)         (6)
(Intercept)                                     7.284 ***   7.806 ***   7.365 ***   7.485 ***   7.349 ***   7.208 ***
                                                (0.079)     (0.075)     (0.331)     (0.086)     (0.084)     (0.063)
prop_type_simplifiedCondominium                 -0.142 ***  -0.127 ***  -0.033      -0.070 **   -0.063 **   -0.063 **
                                                (0.034)     (0.031)     (0.027)     (0.022)     (0.023)     (0.023)
prop_type_simplifiedHostel                      -0.449 ***  -0.218 ***  -0.110 *    -0.119 ***  -0.163 ***  -0.160 ***
                                                (0.048)     (0.047)     (0.048)     (0.033)     (0.033)     (0.033)
prop_type_simplifiedOther                       -0.173 ***  0.006       -0.013      0.023       -0.002      -0.001
                                                (0.024)     (0.023)     (0.020)     (0.017)     (0.017)     (0.017)
prop_type_simplifiedServiced apartment          -0.172 ***  -0.021      0.028       0.078 *     0.093 **    0.092 **
                                                (0.049)     (0.046)     (0.039)     (0.034)     (0.034)     (0.034)
number_of_reviews                               -0.001 ***  -0.000 **   0.000                   -0.001 ***  -0.001 ***
                                                (0.000)     (0.000)     (0.000)                 (0.000)     (0.000)
review_scores_rating                            0.007 ***   0.003 ***   0.006 **    0.002       0.002       0.004 **
                                                (0.001)     (0.001)     (0.002)     (0.001)     (0.001)     (0.001)
room_typeHotel room                                         -0.245 ***  -0.027      -0.013
                                                            (0.051)     (0.041)     (0.038)
room_typePrivate room                                       -0.533 ***  -0.204 ***  -0.221 ***
                                                            (0.019)     (0.020)     (0.016)
room_typeShared room                                        -0.250 ***  -0.303 ***  -0.401 ***
                                                            (0.056)     (0.060)     (0.046)
host_since                                                              -0.000
                                                                        (0.000)
host_is_superhostTRUE                                                   0.112 ***   0.108 ***   0.094 ***   0.098 ***
                                                                        (0.019)     (0.017)     (0.018)     (0.017)
host_listings_count                                                     -0.001 **   -0.001 ***  -0.002 ***  -0.002 ***
                                                                        (0.000)     (0.000)     (0.000)     (0.000)
accommodates                                                            0.075 ***   0.105 ***   0.115 ***   0.112 ***
                                                                        (0.008)     (0.005)     (0.004)     (0.004)
bathrooms                                                               0.005
                                                                        (0.023)
bedrooms                                                                0.020
                                                                        (0.013)
beds                                                                    0.017
                                                                        (0.010)
bed_typeFuton                                                           0.228
                                                                        (0.317)
bed_typePull-out Sofa                                                   0.170
                                                                        (0.272)
bed_typeReal Bed                                                        0.257
                                                                        (0.246)
price                                                                   0.000 ***   0.000 ***   0.000 ***   0.000 ***
                                                                        (0.000)     (0.000)     (0.000)     (0.000)
security_deposit                                                        0.000 ***   0.000 ***   0.000 ***   0.000 ***
                                                                        (0.000)     (0.000)     (0.000)     (0.000)
cleaning_fee                                                            0.000 ***   0.000 ***   0.001 ***   0.000 ***
                                                                        (0.000)     (0.000)     (0.000)     (0.000)
guests_included                                                         -0.021 **   -0.026 ***  -0.009
                                                                        (0.008)     (0.007)     (0.007)
extra_people                                                            0.000 ***   0.001 ***   0.000 ***   0.000 ***
                                                                        (0.000)     (0.000)     (0.000)     (0.000)
minimum_nights                                                          0.035 ***   0.036 ***   0.054 ***   0.054 ***
                                                                        (0.010)     (0.008)     (0.008)     (0.008)
maximum_nights                                                          -0.000
                                                                        (0.000)
reviews_per_month                                                       -0.030 *
                                                                        (0.014)
number_of_reviews_ltm                                                   -0.000
                                                                        (0.001)
review_scores_accuracy                                                  -0.011
                                                                        (0.013)
review_scores_cleanliness                                               0.022 *     0.034 ***   0.033 ***   0.032 ***
                                                                        (0.011)     (0.009)     (0.009)     (0.009)
review_scores_checkin                                                   0.028 *     0.008
                                                                        (0.012)     (0.010)
review_scores_communication                                             -0.027 *    -0.006
                                                                        (0.013)     (0.011)
review_scores_location                                                  -0.014
                                                                        (0.013)
review_scores_value                                                     -0.053 ***  -0.056 ***  -0.058 ***  -0.061 ***
                                                                        (0.013)     (0.010)     (0.010)     (0.010)
cancellation_policymoderate                                             0.004
                                                                        (0.023)
cancellation_policystrict                                               -0.108
                                                                        (0.350)
cancellation_policystrict_14_with_grace_period                          0.021
                                                                        (0.020)
neighbourhood_simplifiedzone_2                                          -0.107 **   -0.106 ***  -0.110 ***  -0.113 ***
                                                                        (0.033)     (0.029)     (0.030)     (0.030)
neighbourhood_simplifiedzone_3                                          -0.127 ***  -0.154 ***  -0.185 ***  -0.192 ***
                                                                        (0.019)     (0.016)     (0.016)     (0.016)
neighbourhood_simplifiedzone_4                                          -0.301 ***  -0.349 ***  -0.349 ***  -0.352 ***
                                                                        (0.032)     (0.028)     (0.029)     (0.029)
neighbourhood_simplifiedzone_5                                          -0.099      0.019       -0.048      -0.049
                                                                        (0.055)     (0.052)     (0.053)     (0.053)
rating_groupUnder 90                                                    -0.011      -0.035      -0.048 *
                                                                        (0.024)     (0.019)     (0.020)
N                                               4783        4783        2641        4768        4768        4768
R2                                              0.050       0.192       0.714       0.599       0.579       0.578
logLik                                          -4508.062   -4121.225   -919.227    -2444.830   -2559.903   -2563.900
AIC                                             9032.124    8264.449    1926.455    4945.660    5167.805    5171.800
*** p < 0.001; ** p < 0.01; * p < 0.05. Standard errors in parentheses; a blank cell means the term is not in that model.

Interpretations based on these models

Bathrooms, bedrooms, beds and accommodates

The numbers of bathrooms, bedrooms and beds are not significant explanatory variables for the price of an Airbnb for 4 nights: in model 3 their t-values (coefficient divided by standard error) are all below 1.96 in absolute value. We therefore removed these three variables from the following models. However, the size of the listing (accommodates) does have explanatory power in predicting the total price for 4 nights.

Hence model 6, which according to the models overview is model 5 without guests_included and rating_group, is our preferred model so far. Its variables explain around 58% of the variation in log(price_4_nights). Unsurprisingly, two of the strongest price drivers are the number of guests the listing accommodates and whether the host is a superhost. This matches everyday experience: hotel prices are often quoted per person, so the price of a room rises with each extra guest staying in it.
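The 1.96 screening rule used above can be sketched as follows. This is a minimal illustration on R's built-in mtcars data, which stands in for the listings data; the formula and the object names fit, coefs and weak_terms are assumptions made for the example, not objects from this analysis.

```r
# Illustrative only: mtcars stands in for the listings data.
fit <- lm(log(mpg) ~ wt + hp + qsec + drat, data = mtcars)

# Coefficient table: estimate, standard error, t value, p value.
coefs <- coef(summary(fit))

# Flag terms whose |t| falls below the large-sample 5% cutoff of 1.96.
t_values <- coefs[, "t value"]
weak_terms <- setdiff(names(t_values)[abs(t_values) < 1.96], "(Intercept)")

print(round(coefs, 3))
print(weak_terms)
```

With only a few observations the exact cutoff would come from qt(0.975, df.residual(fit)) rather than 1.96; with thousands of listings the two are practically identical.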

Superhost

Based on our final regression model (model6), Superhosts do command a pricing premium after controlling for the other variables: host_is_superhost is significant and has a coefficient of 0.098 in the models overview when regressing against log(price_4_nights). Because the outcome is in logs, being a superhost raises price_4_nights by about e^0.098 - 1 = 0.103, i.e. roughly 10%, compared with a host who is not a superhost. This makes economic sense, because superhost status works much like a brand name, and strong brands typically have higher pricing power.
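The conversion from a coefficient on log(price_4_nights) to a percentage effect can be written as a one-line helper. The function name pct_effect is hypothetical, and the input is the rounded host_is_superhostTRUE value from the model 6 column of the models overview.

```r
# Convert a coefficient from a log-outcome regression into a percentage effect.
pct_effect <- function(beta) (exp(beta) - 1) * 100

# Rounded model 6 coefficient for host_is_superhostTRUE.
superhost_premium <- pct_effect(0.098)
print(round(superhost_premium, 1))  # about 10.3 (percent)
```

For small coefficients, 100 * beta is a close approximation, which is why a coefficient near 0.10 reads as a premium of roughly 10%.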

Cancellation Policy

In model 3, cancellation policy is not a significant explanatory variable: every level of cancellation_policy has a t-value below 1.96 in absolute value. To test its significance again, we added cancellation policy to our final model. The added variable is neither significant nor does it add explanatory power, so we conclude that it is best left out. A less complex model with the same explanatory power is preferable to a more complex one.
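One way to check a multi-level factor such as cancellation_policy as a whole, rather than level by level, is a partial F-test between the model without and with it. The sketch below is an assumption about how such a check could be run, not the code used in this analysis: it uses the built-in mtcars data, with factor(gear) standing in for the added factor, and the names base_fit, full_fit and comparison are illustrative.

```r
# Illustrative only: mtcars stands in for the listings data, and
# factor(gear) plays the role of a multi-level factor such as cancellation_policy.
base_fit <- lm(log(mpg) ~ wt + hp, data = mtcars)
full_fit <- update(base_fit, . ~ . + factor(gear))

# Partial F-test: do all levels of the added factor jointly improve the fit?
comparison <- anova(base_fit, full_fit)
print(comparison)

# Adjusted R-squared and AIC both penalise the extra parameters.
print(c(base = summary(base_fit)$adj.r.squared,
        full = summary(full_fit)$adj.r.squared))
print(AIC(base_fit, full_fit))
```

Both models must be fitted on the same rows for anova() to be valid, so listings with a missing value in the added variable would need to be dropped before fitting either model.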

Number of host listings

Since our Hong Kong dataset does not include information on whether hosts advertise the exact locations of their listings, we chose to explore the relationship between the number of host listings and price_4_nights instead. In model 6, the coefficient for host_listings_count is -0.00225 when regressing against log(price_4_nights). Therefore, each additional listing held by the host is associated with a decrease in price_4_nights of about 0.22%. One possible explanation is that a host with more listings pays less attention to the pricing of each individual listing, which leads to a slight price decrease.

#Prediction for price_4_nights in Hong Kong

#Filtering for properties that satisfy the conditions, have a private room, at least 10 reviews and an average rating over 90
hong_kong_listings_predict <- hong_kong_listings_neighbourhood_simplified %>% 
#Since all room_types besides shared room have private rooms, we only have to filter out room types that are shared rooms 
  filter(room_type != "Shared room", number_of_reviews >= 10, rating_group == "Over 90")


#log prediction + transformation
prediction <- exp(predict(model6, newdata= hong_kong_listings_predict, interval = "confidence"))
prediction %>% 
  summary()
##       fit             lwr             upr       
##  Min.   : 1196   Min.   : 1100   Min.   : 1300  
##  1st Qu.: 1880   1st Qu.: 1799   1st Qu.: 1961  
##  Median : 2427   Median : 2315   Median : 2542  
##  Mean   : 2937   Mean   : 2785   Mean   : 3101  
##  3rd Qu.: 3356   3rd Qu.: 3222   3rd Qu.: 3499  
##  Max.   :43440   Max.   :37933   Max.   :51163
plot(model6$residuals)
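
The back-transformation above applies exp() to the fitted value and to both interval bounds. The sketch below illustrates this on simulated data (an assumption made for the example; sim, log_fit and new_listing are not objects from this analysis) and contrasts interval = "confidence", which describes the expected log price, with interval = "prediction", which describes a single new listing.

```r
set.seed(1)

# Simulated stand-in data: log(price) is linear in accommodates.
n <- 200
sim <- data.frame(accommodates = sample(1:8, n, replace = TRUE))
sim$price_4_nights <- exp(7 + 0.11 * sim$accommodates + rnorm(n, sd = 0.4))

log_fit <- lm(log(price_4_nights) ~ accommodates, data = sim)
new_listing <- data.frame(accommodates = 2)

# Interval for the expected log price, back-transformed to the price scale.
conf_int <- exp(predict(log_fit, newdata = new_listing, interval = "confidence"))

# Interval for one new listing; wider, because it adds the residual noise.
pred_int <- exp(predict(log_fit, newdata = new_listing, interval = "prediction"))

print(rbind(confidence = conf_int[1, ], prediction = pred_int[1, ]))
```

Both intervals share the same fit value; only the bounds differ, and the back-transformed bounds are no longer symmetric around the fit.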

#non log
# here we look at the model without the log -> small differences
model_predict <- lm(price_4_nights ~  
               host_is_superhost +  
               host_listings_count + 
               accommodates + 
               price + 
               security_deposit + 
               cleaning_fee + 
               extra_people +
               minimum_nights + 
               number_of_reviews + 
               review_scores_rating +
               review_scores_cleanliness + 
               review_scores_value + 
               prop_type_simplified +
               neighbourhood_simplified,
             data = hong_kong_listings_predict)

confint(model_predict, level = 0.95)
##                                            2.5 %   97.5 %
## (Intercept)                            -4.66e+02 420.4591
## host_is_superhostTRUE                  -4.99e+01   5.0839
## host_listings_count                    -1.43e+00   0.6577
## accommodates                           -2.72e+01 -11.6086
## price                                   3.96e+00   3.9912
## security_deposit                        2.91e-03   0.0162
## cleaning_fee                            9.60e-01   1.0924
## extra_people                            7.36e-01   0.9229
## minimum_nights                         -1.93e+01  12.9198
## number_of_reviews                      -3.84e-01  -0.0280
## review_scores_rating                   -4.35e+00   7.5335
## review_scores_cleanliness              -3.45e+01  19.9229
## review_scores_value                    -2.27e+01  31.7656
## prop_type_simplifiedCondominium        -3.96e+01  46.5905
## prop_type_simplifiedHostel             -2.92e+01 105.5375
## prop_type_simplifiedOther               5.67e+00  69.1511
## prop_type_simplifiedServiced apartment -1.78e+02 -38.2587
## neighbourhood_simplifiedzone_2         -8.30e+01  12.1307
## neighbourhood_simplifiedzone_3         -5.11e+01   9.5439
## neighbourhood_simplifiedzone_4         -3.00e+01  86.0904
## neighbourhood_simplifiedzone_5          2.14e+01 199.6289
predict(model_predict, newdata = hong_kong_listings_predict, interval = "confidence") %>% 
  summary()
##       fit             lwr             upr       
##  Min.   :  331   Min.   :  298   Min.   :  364  
##  1st Qu.: 1744   1st Qu.: 1698   1st Qu.: 1791  
##  Median : 2557   Median : 2517   Median : 2600  
##  Mean   : 3223   Mean   : 3175   Mean   : 3271  
##  3rd Qu.: 3905   3rd Qu.: 3858   3rd Qu.: 3958  
##  Max.   :43497   Max.   :43325   Max.   :43670
plot(model_predict$residuals)

Conclusion

Model effectiveness and limitations

Our final regression model (model6) includes the following 14 explanatory variables:

  • host_is_superhost
  • host_listings_count
  • accommodates
  • price
  • security_deposit
  • cleaning_fee
  • extra_people
  • minimum_nights
  • number_of_reviews
  • review_scores_rating
  • review_scores_cleanliness
  • review_scores_value
  • prop_type_simplified
  • neighbourhood_simplified

This model has an adjusted R-squared of about 0.58, meaning that the variables above explain roughly 58% of the variability of price_4_nights on the log scale. It is worth noting that this is lower than the adjusted R-squared of 0.709 in model 3, from which we removed the insignificant variables. Part of the gap arises because adding more variables to a model increases its ability to account for variation; in addition, the models overview shows that model 3 was fitted on fewer observations (N = 2641 versus 4768), so the two figures are not computed on the same sample. An efficient regression model should contain only significant variables and should not be overly complex, so we believe our final model is a strong one given the current dataset.
However, many more factors that could affect price_4_nights are not reflected in this dataset and analysis. For example, macroeconomic factors can affect the pricing of Airbnb listings, especially under unusual circumstances that restrict travel, as is the case now. In addition, total prices could vary greatly across seasons because of holidays and vacations. We should also take into account pricing by competitors such as Booking.com and Expedia. None of these variables is incorporated in the model, and all are worth exploring in future analysis.
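
The link between R-squared and adjusted R-squared discussed in this section can be checked by hand. In the sketch below the helper adj_r2 is hypothetical; R2 and N are read off the models overview, while p is our own hand count of the non-intercept rows in each column (42 for model 3, 20 for model 6) and should be treated as an assumption.

```r
# Adjusted R-squared from R-squared, sample size n and number of
# estimated slope coefficients p (intercept excluded).
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

# R2 and N come from the models overview; p is a hand count (assumption).
model3_adj <- adj_r2(0.714, n = 2641, p = 42)
model6_adj <- adj_r2(0.578, n = 4768, p = 20)

print(round(c(model3 = model3_adj, model6 = model6_adj), 3))
```

With several thousand observations the penalty for 20 to 40 coefficients is small, so adjusted and unadjusted R-squared differ only in the third decimal place.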

Takeaways

This exercise allowed us to apply all our knowledge of R and beyond. We were able to apply the statistical knowledge we gained in this course to a real-life problem. We learned how to use real data to understand our surroundings and act accordingly. If we are traveling on a budget (as most students do), we now know which variables, or in this case qualities, to drop from our search filter to find the cheapest accommodation.

We want to thank Prof. Kostis and his army of TAs, who were always supportive in this new environment (we are not only talking about COVID here ;))