Airbnb Hong Kong Analysis
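# Load packages and data
library(tidyverse) # read_csv(), dplyr verbs, ggplot2
library(janitor)   # clean_names()
library(skimr)     # skim()
listings <- read_csv("http://data.insideairbnb.com/china/hk/hong-kong/2020-06-15/data/listings.csv") %>%
  clean_names()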
# How many variables/columns? How many rows/observations?
# Which variables are numbers?
# Which are categorical or factor variables (numeric or character variables that have a fixed and known set of possible values)?
# What are the correlations between variables? Does each scatterplot support a linear relationship between variables? Do any of the correlations appear to be conditional on the value of a categorical variable?
glimpse(listings)
## Rows: 11,187
## Columns: 106
## $ id <dbl> 69074, 75083, 103760, 13…
## $ listing_url <chr> "https://www.airbnb.com/…
## $ scrape_id <dbl> 2.02e+13, 2.02e+13, 2.02…
## $ last_scraped <date> 2020-06-17, 2020-06-17,…
## $ name <chr> "Beautiful oasis of plan…
## $ summary <chr> "An ideal Hong location …
## $ space <chr> "Filled with plants and …
## $ description <chr> "An ideal Hong location …
## $ experiences_offered <chr> "none", "none", "none", …
## $ neighborhood_overview <chr> "In the upper part of tr…
## $ notes <chr> NA, "Once you arrive in …
## $ transit <chr> "Buses pass often along …
## $ access <chr> "All access, except one …
## $ interaction <chr> "If a guest is staying t…
## $ house_rules <chr> "Everything to make your…
## $ thumbnail_url <lgl> NA, NA, NA, NA, NA, NA, …
## $ medium_url <lgl> NA, NA, NA, NA, NA, NA, …
## $ picture_url <chr> "https://a0.muscache.com…
## $ xl_picture_url <lgl> NA, NA, NA, NA, NA, NA, …
## $ host_id <dbl> 160139, 304876, 304876, …
## $ host_url <chr> "https://www.airbnb.com/…
## $ host_name <chr> "Amy", "Brend", "Brend",…
## $ host_since <date> 2010-07-07, 2010-11-30,…
## $ host_location <chr> "Hong Kong", "Hong Kong"…
## $ host_about <chr> "I've been with AirBnB n…
## $ host_response_time <chr> "within a few hours", "w…
## $ host_response_rate <chr> "86%", "100%", "100%", "…
## $ host_acceptance_rate <chr> "60%", "99%", "99%", "99…
## $ host_is_superhost <lgl> TRUE, FALSE, FALSE, FALS…
## $ host_thumbnail_url <chr> "https://a0.muscache.com…
## $ host_picture_url <chr> "https://a0.muscache.com…
## $ host_neighbourhood <chr> "Sheung Wan", "Sheung Wa…
## $ host_listings_count <dbl> 2, 12, 12, 12, 1, 12, 12…
## $ host_total_listings_count <dbl> 2, 12, 12, 12, 1, 12, 12…
## $ host_verifications <chr> "['email', 'phone', 'rev…
## $ host_has_profile_pic <lgl> TRUE, TRUE, TRUE, TRUE, …
## $ host_identity_verified <lgl> TRUE, FALSE, FALSE, FALS…
## $ street <chr> "Sheung Wan, Hong Kong",…
## $ neighbourhood <chr> "Central & Western Distr…
## $ neighbourhood_cleansed <chr> "Central & Western", "Ce…
## $ neighbourhood_group_cleansed <lgl> NA, NA, NA, NA, NA, NA, …
## $ city <chr> "Sheung Wan", "Sheung Wa…
## $ state <chr> NA, NA, NA, NA, "Hong Ko…
## $ zipcode <chr> NA, NA, NA, NA, NA, NA, …
## $ market <chr> "Hong Kong", "Hong Kong"…
## $ smart_location <chr> "Sheung Wan, Hong Kong",…
## $ country_code <chr> "HK", "HK", "HK", "HK", …
## $ country <chr> "Hong Kong", "Hong Kong"…
## $ latitude <dbl> 22.3, 22.3, 22.3, 22.3, …
## $ longitude <dbl> 114, 114, 114, 114, 114,…
## $ is_location_exact <lgl> TRUE, TRUE, TRUE, FALSE,…
## $ property_type <chr> "Apartment", "Apartment"…
## $ room_type <chr> "Entire home/apt", "Enti…
## $ accommodates <dbl> 3, 3, 6, 6, 2, 6, 6, 2, …
## $ bathrooms <dbl> 1, 1, 1, 1, 1, 1, 1, 1, …
## $ bedrooms <dbl> 1, 0, 2, 2, 1, 2, 2, 1, …
## $ beds <dbl> 2, 2, 3, 3, 1, 3, 3, 1, …
## $ bed_type <chr> "Real Bed", "Real Bed", …
## $ amenities <chr> "{\"Cable TV\",Internet,…
## $ square_feet <dbl> NA, NA, NA, NA, NA, NA, …
## $ price <chr> "$1,395.00", "$783.00", …
## $ weekly_price <chr> NA, NA, NA, NA, NA, NA, …
## $ monthly_price <chr> "$29,451.00", NA, NA, NA…
## $ security_deposit <chr> "$2,325.00", "$775.00", …
## $ cleaning_fee <chr> "$310.00", "$271.00", "$…
## $ guests_included <dbl> 2, 2, 2, 3, 1, 2, 2, 1, …
## $ extra_people <chr> "$155.00", "$155.00", "$…
## $ minimum_nights <dbl> 3, 14, 2, 2, 2, 2, 2, 1,…
## $ maximum_nights <dbl> 365, 365, 365, 365, 60, …
## $ minimum_minimum_nights <dbl> 3, 14, 2, 2, 2, 2, 2, 1,…
## $ maximum_minimum_nights <dbl> 4, 14, 2, 2, 2, 2, 2, 1,…
## $ minimum_maximum_nights <dbl> 365, 365, 365, 365, 60, …
## $ maximum_maximum_nights <dbl> 365, 365, 365, 365, 60, …
## $ minimum_nights_avg_ntm <dbl> 3.1, 14.0, 2.0, 2.0, 2.0…
## $ maximum_nights_avg_ntm <dbl> 365, 365, 365, 365, 60, …
## $ calendar_updated <chr> "2 months ago", "7 weeks…
## $ has_availability <lgl> TRUE, TRUE, TRUE, TRUE, …
## $ availability_30 <dbl> 0, 0, 0, 14, 0, 8, 9, 30…
## $ availability_60 <dbl> 23, 0, 0, 44, 15, 33, 39…
## $ availability_90 <dbl> 53, 14, 0, 74, 45, 63, 6…
## $ availability_365 <dbl> 143, 193, 0, 345, 135, 3…
## $ calendar_last_scraped <date> 2020-06-17, 2020-06-17,…
## $ number_of_reviews <dbl> 134, 229, 271, 305, 27, …
## $ number_of_reviews_ltm <dbl> 4, 1, 13, 48, 0, 16, 11,…
## $ first_review <date> 2011-02-14, 2011-03-05,…
## $ last_review <date> 2020-03-24, 2020-04-18,…
## $ review_scores_rating <dbl> 97, 89, 89, 93, 97, 86, …
## $ review_scores_accuracy <dbl> 10, 8, 9, 10, 10, 9, 9, …
## $ review_scores_cleanliness <dbl> 9, 9, 9, 10, 9, 9, 9, 10…
## $ review_scores_checkin <dbl> 10, 9, 10, 10, 10, 9, 10…
## $ review_scores_communication <dbl> 10, 9, 10, 10, 10, 9, 10…
## $ review_scores_location <dbl> 10, 10, 10, 10, 10, 10, …
## $ review_scores_value <dbl> 9, 9, 9, 9, 10, 9, 9, 10…
## $ requires_license <lgl> FALSE, FALSE, FALSE, FAL…
## $ license <lgl> NA, NA, NA, NA, NA, NA, …
## $ jurisdiction_names <lgl> NA, NA, NA, NA, NA, NA, …
## $ instant_bookable <lgl> FALSE, FALSE, FALSE, FAL…
## $ is_business_travel_ready <lgl> FALSE, FALSE, FALSE, FAL…
## $ cancellation_policy <chr> "strict_14_with_grace_pe…
## $ require_guest_profile_picture <lgl> FALSE, FALSE, FALSE, FAL…
## $ require_guest_phone_verification <lgl> FALSE, FALSE, FALSE, FAL…
## $ calculated_host_listings_count <dbl> 1, 13, 13, 13, 1, 13, 13…
## $ calculated_host_listings_count_entire_homes <dbl> 1, 9, 9, 9, 1, 9, 9, 0, …
## $ calculated_host_listings_count_private_rooms <dbl> 0, 4, 4, 4, 0, 4, 4, 1, …
## $ calculated_host_listings_count_shared_rooms <dbl> 0, 0, 0, 0, 0, 0, 0, 0, …
## $ reviews_per_month <dbl> 1.18, 2.02, 2.47, 2.81, …
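The first questions above (how many rows and columns, and which columns are numeric) can also be answered programmatically. The following is only a minimal sketch on a made-up stand-in tibble, `toy_listings`, whose values are borrowed from the first rows printed above; the names `toy_listings`, `col_classes` and `type_counts` are illustrative and not part of the original analysis.

```r
library(tibble)

# Toy stand-in for `listings` (two illustrative rows, four columns)
toy_listings <- tibble(
  id = c(69074, 75083),
  price = c("$1,395.00", "$783.00"),
  room_type = c("Entire home/apt", "Entire home/apt"),
  host_is_superhost = c(TRUE, FALSE)
)

# First class of every column, then a tally per class
col_classes <- vapply(toy_listings, function(col) class(col)[1], character(1))
type_counts <- table(col_classes)

dim(toy_listings) # number of rows, number of columns
type_counts       # how many columns of each type
```

Running the same two steps on the real `listings` object should agree with the column type frequencies that `skim()` reports.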
skim(listings)
| Name | listings |
| Number of rows | 11187 |
| Number of columns | 106 |
| _______________________ | |
| Column type frequency: | |
| character | 46 |
| Date | 5 |
| logical | 16 |
| numeric | 39 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| listing_url | 0 | 1.00 | 34 | 37 | 0 | 11187 | 0 |
| name | 8 | 1.00 | 1 | 250 | 0 | 10899 | 0 |
| summary | 756 | 0.93 | 1 | 1000 | 0 | 7990 | 0 |
| space | 4528 | 0.60 | 1 | 1000 | 0 | 4887 | 0 |
| description | 521 | 0.95 | 1 | 1000 | 0 | 8950 | 0 |
| experiences_offered | 0 | 1.00 | 4 | 4 | 0 | 1 | 0 |
| neighborhood_overview | 5879 | 0.47 | 1 | 1000 | 0 | 3570 | 0 |
| notes | 6862 | 0.39 | 1 | 1000 | 0 | 2407 | 0 |
| transit | 5598 | 0.50 | 1 | 1000 | 0 | 3665 | 0 |
| access | 6790 | 0.39 | 1 | 1000 | 0 | 2868 | 0 |
| interaction | 6119 | 0.45 | 1 | 1000 | 0 | 2979 | 0 |
| house_rules | 6217 | 0.44 | 2 | 1000 | 0 | 3169 | 0 |
| picture_url | 0 | 1.00 | 81 | 146 | 0 | 10607 | 0 |
| host_url | 0 | 1.00 | 39 | 43 | 0 | 4874 | 0 |
| host_name | 12 | 1.00 | 1 | 33 | 0 | 2846 | 0 |
| host_location | 38 | 1.00 | 2 | 133 | 0 | 429 | 0 |
| host_about | 4315 | 0.61 | 1 | 3850 | 0 | 2456 | 5 |
| host_response_time | 12 | 1.00 | 3 | 18 | 0 | 5 | 0 |
| host_response_rate | 12 | 1.00 | 2 | 4 | 0 | 59 | 0 |
| host_acceptance_rate | 12 | 1.00 | 2 | 4 | 0 | 74 | 0 |
| host_thumbnail_url | 12 | 1.00 | 55 | 106 | 0 | 4851 | 0 |
| host_picture_url | 12 | 1.00 | 57 | 109 | 0 | 4851 | 0 |
| host_neighbourhood | 2525 | 0.77 | 2 | 26 | 0 | 163 | 0 |
| host_verifications | 0 | 1.00 | 2 | 156 | 0 | 265 | 0 |
| street | 0 | 1.00 | 13 | 82 | 0 | 686 | 0 |
| neighbourhood | 1284 | 0.89 | 4 | 26 | 0 | 56 | 0 |
| neighbourhood_cleansed | 0 | 1.00 | 5 | 17 | 0 | 18 | 0 |
| city | 772 | 0.93 | 1 | 50 | 0 | 343 | 0 |
| state | 370 | 0.97 | 1 | 31 | 0 | 177 | 0 |
| zipcode | 10464 | 0.06 | 1 | 20 | 0 | 121 | 0 |
| market | 9 | 1.00 | 6 | 22 | 0 | 12 | 0 |
| smart_location | 0 | 1.00 | 9 | 61 | 0 | 385 | 0 |
| country_code | 0 | 1.00 | 2 | 2 | 0 | 3 | 0 |
| country | 0 | 1.00 | 5 | 14 | 0 | 3 | 0 |
| property_type | 0 | 1.00 | 3 | 22 | 0 | 41 | 0 |
| room_type | 0 | 1.00 | 10 | 15 | 0 | 4 | 0 |
| bed_type | 0 | 1.00 | 5 | 13 | 0 | 5 | 0 |
| amenities | 0 | 1.00 | 2 | 1126 | 0 | 8558 | 0 |
| price | 0 | 1.00 | 5 | 10 | 0 | 374 | 0 |
| weekly_price | 10601 | 0.05 | 6 | 10 | 0 | 268 | 0 |
| monthly_price | 10480 | 0.06 | 7 | 11 | 0 | 316 | 0 |
| security_deposit | 5677 | 0.49 | 5 | 10 | 0 | 231 | 0 |
| cleaning_fee | 5055 | 0.55 | 5 | 9 | 0 | 259 | 0 |
| extra_people | 0 | 1.00 | 5 | 9 | 0 | 184 | 0 |
| calendar_updated | 0 | 1.00 | 5 | 13 | 0 | 78 | 0 |
| cancellation_policy | 0 | 1.00 | 6 | 27 | 0 | 6 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| last_scraped | 0 | 1.00 | 2020-06-15 | 2020-06-19 | 2020-06-17 | 4 |
| host_since | 12 | 1.00 | 2009-08-17 | 2020-06-10 | 2015-12-27 | 2355 |
| calendar_last_scraped | 0 | 1.00 | 2020-06-15 | 2020-06-19 | 2020-06-17 | 4 |
| first_review | 4155 | 0.63 | 2011-02-14 | 2020-06-15 | 2018-02-19 | 1986 |
| last_review | 4155 | 0.63 | 2013-01-02 | 2020-06-17 | 2019-06-23 | 1365 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| thumbnail_url | 11187 | 0 | NaN | : |
| medium_url | 11187 | 0 | NaN | : |
| xl_picture_url | 11187 | 0 | NaN | : |
| host_is_superhost | 12 | 1 | 0.13 | FAL: 9669, TRU: 1506 |
| host_has_profile_pic | 12 | 1 | 1.00 | TRU: 11141, FAL: 34 |
| host_identity_verified | 12 | 1 | 0.27 | FAL: 8179, TRU: 2996 |
| neighbourhood_group_cleansed | 11187 | 0 | NaN | : |
| is_location_exact | 0 | 1 | 0.69 | TRU: 7698, FAL: 3489 |
| has_availability | 0 | 1 | 1.00 | TRU: 11187 |
| requires_license | 0 | 1 | 0.00 | FAL: 11187 |
| license | 11187 | 0 | NaN | : |
| jurisdiction_names | 11187 | 0 | NaN | : |
| instant_bookable | 0 | 1 | 0.42 | FAL: 6485, TRU: 4702 |
| is_business_travel_ready | 0 | 1 | 0.00 | FAL: 11187 |
| require_guest_profile_picture | 0 | 1 | 0.01 | FAL: 11102, TRU: 85 |
| require_guest_phone_verification | 0 | 1 | 0.01 | FAL: 11086, TRU: 101 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1.00 | 2.50e+07 | 1.17e+07 | 6.91e+04 | 1.63e+07 | 2.63e+07 | 3.47e+07 | 4.38e+07 | ▃▅▆▇▇ |
| scrape_id | 0 | 1.00 | 2.02e+13 | 0.00e+00 | 2.02e+13 | 2.02e+13 | 2.02e+13 | 2.02e+13 | 2.02e+13 | ▁▁▇▁▁ |
| host_id | 0 | 1.00 | 8.84e+07 | 8.74e+07 | 3.22e+04 | 1.69e+07 | 5.25e+07 | 1.39e+08 | 3.49e+08 | ▇▃▂▂▁ |
| host_listings_count | 12 | 1.00 | 4.85e+01 | 1.05e+02 | 0.00e+00 | 1.00e+00 | 5.00e+00 | 2.20e+01 | 3.86e+02 | ▇▁▁▁▁ |
| host_total_listings_count | 12 | 1.00 | 4.85e+01 | 1.05e+02 | 0.00e+00 | 1.00e+00 | 5.00e+00 | 2.20e+01 | 3.86e+02 | ▇▁▁▁▁ |
| latitude | 0 | 1.00 | 2.23e+01 | 5.00e-02 | 2.22e+01 | 2.23e+01 | 2.23e+01 | 2.23e+01 | 2.26e+01 | ▁▇▁▁▁ |
| longitude | 0 | 1.00 | 1.14e+02 | 4.00e-02 | 1.14e+02 | 1.14e+02 | 1.14e+02 | 1.14e+02 | 1.14e+02 | ▁▁▃▇▁ |
| accommodates | 0 | 1.00 | 2.82e+00 | 2.18e+00 | 1.00e+00 | 2.00e+00 | 2.00e+00 | 3.00e+00 | 1.60e+01 | ▇▁▁▁▁ |
| bathrooms | 17 | 1.00 | 1.16e+00 | 5.70e-01 | 0.00e+00 | 1.00e+00 | 1.00e+00 | 1.00e+00 | 1.10e+01 | ▇▁▁▁▁ |
| bedrooms | 38 | 1.00 | 1.09e+00 | 8.50e-01 | 0.00e+00 | 1.00e+00 | 1.00e+00 | 1.00e+00 | 1.10e+01 | ▇▁▁▁▁ |
| beds | 69 | 0.99 | 1.68e+00 | 1.44e+00 | 0.00e+00 | 1.00e+00 | 1.00e+00 | 2.00e+00 | 2.00e+01 | ▇▁▁▁▁ |
| square_feet | 11146 | 0.00 | 3.99e+02 | 6.19e+02 | 0.00e+00 | 0.00e+00 | 1.40e+02 | 6.00e+02 | 3.20e+03 | ▇▂▁▁▁ |
| guests_included | 0 | 1.00 | 1.39e+00 | 1.06e+00 | 1.00e+00 | 1.00e+00 | 1.00e+00 | 1.00e+00 | 1.60e+01 | ▇▁▁▁▁ |
| minimum_nights | 0 | 1.00 | 9.76e+00 | 2.83e+01 | 1.00e+00 | 1.00e+00 | 2.00e+00 | 7.00e+00 | 1.10e+03 | ▇▁▁▁▁ |
| maximum_nights | 0 | 1.00 | 3.86e+05 | 2.87e+07 | 1.00e+00 | 3.65e+02 | 1.12e+03 | 1.12e+03 | 2.15e+09 | ▇▁▁▁▁ |
| minimum_minimum_nights | 0 | 1.00 | 9.61e+00 | 2.80e+01 | 1.00e+00 | 1.00e+00 | 2.00e+00 | 7.00e+00 | 1.10e+03 | ▇▁▁▁▁ |
| maximum_minimum_nights | 0 | 1.00 | 1.00e+01 | 2.91e+01 | 1.00e+00 | 1.00e+00 | 2.00e+00 | 7.00e+00 | 1.10e+03 | ▇▁▁▁▁ |
| minimum_maximum_nights | 0 | 1.00 | 3.86e+05 | 2.87e+07 | 1.00e+00 | 3.65e+02 | 1.12e+03 | 1.12e+03 | 2.15e+09 | ▇▁▁▁▁ |
| maximum_maximum_nights | 0 | 1.00 | 3.86e+05 | 2.87e+07 | 1.00e+00 | 3.65e+02 | 1.12e+03 | 1.12e+03 | 2.15e+09 | ▇▁▁▁▁ |
| minimum_nights_avg_ntm | 0 | 1.00 | 9.79e+00 | 2.82e+01 | 1.00e+00 | 1.00e+00 | 2.00e+00 | 7.00e+00 | 1.10e+03 | ▇▁▁▁▁ |
| maximum_nights_avg_ntm | 0 | 1.00 | 3.86e+05 | 2.87e+07 | 1.00e+00 | 3.65e+02 | 1.12e+03 | 1.12e+03 | 2.15e+09 | ▇▁▁▁▁ |
| availability_30 | 0 | 1.00 | 1.55e+01 | 1.40e+01 | 0.00e+00 | 0.00e+00 | 2.00e+01 | 3.00e+01 | 3.00e+01 | ▇▁▁▁▇ |
| availability_60 | 0 | 1.00 | 3.28e+01 | 2.79e+01 | 0.00e+00 | 0.00e+00 | 4.70e+01 | 6.00e+01 | 6.00e+01 | ▆▁▁▁▇ |
| availability_90 | 0 | 1.00 | 5.06e+01 | 4.17e+01 | 0.00e+00 | 0.00e+00 | 7.60e+01 | 9.00e+01 | 9.00e+01 | ▆▁▁▁▇ |
| availability_365 | 0 | 1.00 | 1.68e+02 | 1.57e+02 | 0.00e+00 | 0.00e+00 | 1.08e+02 | 3.64e+02 | 3.65e+02 | ▇▂▂▁▇ |
| number_of_reviews | 0 | 1.00 | 1.77e+01 | 4.12e+01 | 0.00e+00 | 0.00e+00 | 2.00e+00 | 1.40e+01 | 7.57e+02 | ▇▁▁▁▁ |
| number_of_reviews_ltm | 0 | 1.00 | 2.68e+00 | 7.55e+00 | 0.00e+00 | 0.00e+00 | 0.00e+00 | 1.00e+00 | 1.38e+02 | ▇▁▁▁▁ |
| review_scores_rating | 4355 | 0.61 | 9.09e+01 | 1.12e+01 | 2.00e+01 | 8.70e+01 | 9.40e+01 | 9.90e+01 | 1.00e+02 | ▁▁▁▂▇ |
| review_scores_accuracy | 4357 | 0.61 | 9.34e+00 | 1.12e+00 | 2.00e+00 | 9.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | ▁▁▁▁▇ |
| review_scores_cleanliness | 4357 | 0.61 | 9.09e+00 | 1.20e+00 | 2.00e+00 | 9.00e+00 | 9.00e+00 | 1.00e+01 | 1.00e+01 | ▁▁▁▂▇ |
| review_scores_checkin | 4356 | 0.61 | 9.50e+00 | 1.04e+00 | 2.00e+00 | 9.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | ▁▁▁▁▇ |
| review_scores_communication | 4357 | 0.61 | 9.51e+00 | 1.03e+00 | 2.00e+00 | 9.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | ▁▁▁▁▇ |
| review_scores_location | 4358 | 0.61 | 9.61e+00 | 8.50e-01 | 2.00e+00 | 9.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | ▁▁▁▁▇ |
| review_scores_value | 4358 | 0.61 | 9.13e+00 | 1.13e+00 | 2.00e+00 | 9.00e+00 | 9.00e+00 | 1.00e+01 | 1.00e+01 | ▁▁▁▂▇ |
| calculated_host_listings_count | 0 | 1.00 | 4.57e+01 | 1.03e+02 | 1.00e+00 | 1.00e+00 | 4.00e+00 | 1.90e+01 | 3.89e+02 | ▇▁▁▁▁ |
| calculated_host_listings_count_entire_homes | 0 | 1.00 | 7.80e+00 | 1.90e+01 | 0.00e+00 | 0.00e+00 | 1.00e+00 | 4.00e+00 | 1.08e+02 | ▇▁▁▁▁ |
| calculated_host_listings_count_private_rooms | 0 | 1.00 | 3.29e+01 | 8.22e+01 | 0.00e+00 | 0.00e+00 | 1.00e+00 | 1.10e+01 | 3.39e+02 | ▇▁▁▁▁ |
| calculated_host_listings_count_shared_rooms | 0 | 1.00 | 4.54e+00 | 1.57e+01 | 0.00e+00 | 0.00e+00 | 0.00e+00 | 0.00e+00 | 8.20e+01 | ▇▁▁▁▁ |
| reviews_per_month | 4155 | 0.63 | 8.40e-01 | 1.18e+00 | 1.00e-02 | 1.20e-01 | 3.50e-01 | 1.03e+00 | 1.32e+01 | ▇▁▁▁▁ |
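The question about correlations between variables can be explored with a correlation matrix. Below is a minimal sketch, assuming a small toy tibble whose values are borrowed from the first rows shown above, with prices already converted to numbers and one rating deliberately set to NA. Because the review scores have many missing values, the sketch uses `use = "pairwise.complete.obs"`; that choice is an assumption of this example, not something the original analysis states.

```r
library(dplyr)
library(tibble)

# Toy data: a handful of numeric columns, one missing rating
toy <- tibble(
  price = c(1395, 783, 845, 1046, 930, 690),
  accommodates = c(3, 3, 6, 6, 2, 6),
  bedrooms = c(1, 0, 2, 2, 1, 2),
  review_scores_rating = c(97, 89, 89, 93, 97, NA)
)

# Pairwise Pearson correlations, ignoring NAs pair by pair
cor_matrix <- toy %>%
  select(price, accommodates, bedrooms, review_scores_rating) %>%
  cor(use = "pairwise.complete.obs")

round(cor_matrix, 2)
```

A correlation matrix only summarises linear association, so each pair still deserves a scatterplot before a linear relationship is assumed.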
Cleaning the data
Here we select the specific variables to perform our analysis on. We drop most qualitative free-text variables, such as summary and space, because turning free text into quantitative data is error-prone. The description and neighborhood_overview columns are still carried along in the selection below.
# Build a working copy; the raw listings data is never modified
hong_kong_listings <- listings %>%
  select(id,
         host_id,
         host_since,
         host_is_superhost,
         host_listings_count,
         neighbourhood_cleansed,
         #latitude, dropped: values are all very similar, so it adds little information
         #longitude, dropped: values are all very similar, so it adds little information
         property_type,
         room_type,
         accommodates,
         bathrooms,
         bedrooms,
         beds,
         bed_type,
         #amenities, dropped: just one long string
         #square_feet, dropped: almost all values are missing
         price,
         #weekly_price, dropped: a lot of NAs
         #monthly_price, dropped: a lot of NAs
         security_deposit,
         cleaning_fee,
         guests_included,
         extra_people,
         minimum_nights,
         maximum_nights,
         number_of_reviews,
         reviews_per_month,
         number_of_reviews_ltm,
         review_scores_rating,
         review_scores_accuracy,
         review_scores_cleanliness,
         review_scores_checkin,
         review_scores_communication,
         review_scores_location,
         review_scores_value,
         listing_url,
         city,
         description,
         neighborhood_overview,
         #is_business_travel_ready, dropped: always FALSE
         cancellation_policy) %>%
  # Convert characters to doubles and factors where appropriate
  mutate(neighbourhood_cleansed = factor(neighbourhood_cleansed),
         room_type = as.factor(room_type),
         price = parse_number(price),
         security_deposit = parse_number(security_deposit),
         cleaning_fee = parse_number(cleaning_fee),
         extra_people = parse_number(extra_people),
         cancellation_policy = as.factor(cancellation_policy),
         bed_type = as.factor(bed_type),
         city = as.factor(city))
#Inspecting data frame to make sure all the variables are correctly attributed
glimpse(hong_kong_listings)
## Rows: 11,187
## Columns: 35
## $ id <dbl> 69074, 75083, 103760, 132773, 133390, 163…
## $ host_id <dbl> 160139, 304876, 304876, 304876, 654642, 3…
## $ host_since <date> 2010-07-07, 2010-11-30, 2010-11-30, 2010…
## $ host_is_superhost <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, …
## $ host_listings_count <dbl> 2, 12, 12, 12, 1, 12, 12, 1, 1, 1, 1, 8, …
## $ neighbourhood_cleansed <fct> Central & Western, Central & Western, Cen…
## $ property_type <chr> "Apartment", "Apartment", "Apartment", "A…
## $ room_type <fct> Entire home/apt, Entire home/apt, Entire …
## $ accommodates <dbl> 3, 3, 6, 6, 2, 6, 6, 2, 2, 8, 4, 4, 6, 3,…
## $ bathrooms <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 2, 1,…
## $ bedrooms <dbl> 1, 0, 2, 2, 1, 2, 2, 1, 1, 1, 2, 1, 3, 2,…
## $ beds <dbl> 2, 2, 3, 3, 1, 3, 3, 1, 1, 7, 3, 1, 3, 2,…
## $ bed_type <fct> Real Bed, Real Bed, Real Bed, Real Bed, R…
## $ price <dbl> 1395, 783, 845, 1046, 930, 690, 767, 698,…
## $ security_deposit <dbl> 2325, 775, 775, 775, 1163, 775, 775, NA, …
## $ cleaning_fee <dbl> 310, 271, 271, 302, NA, 302, 302, NA, NA,…
## $ guests_included <dbl> 2, 2, 2, 3, 1, 2, 2, 1, 1, 1, 4, 2, 2, 2,…
## $ extra_people <dbl> 155, 155, 194, 225, 0, 194, 194, 0, 0, 0,…
## $ minimum_nights <dbl> 3, 14, 2, 2, 2, 2, 2, 1, 1, 10, 4, 2, 4, …
## $ maximum_nights <dbl> 365, 365, 365, 365, 60, 365, 365, 365, 60…
## $ number_of_reviews <dbl> 134, 229, 271, 305, 27, 222, 225, 17, 163…
## $ reviews_per_month <dbl> 1.18, 2.02, 2.47, 2.81, 0.25, 2.07, 2.09,…
## $ number_of_reviews_ltm <dbl> 4, 1, 13, 48, 0, 16, 11, 0, 12, 0, 9, 2, …
## $ review_scores_rating <dbl> 97, 89, 89, 93, 97, 86, 86, 100, 98, NA, …
## $ review_scores_accuracy <dbl> 10, 8, 9, 10, 10, 9, 9, 10, 10, NA, 10, 9…
## $ review_scores_cleanliness <dbl> 9, 9, 9, 10, 9, 9, 9, 10, 10, NA, 9, 9, 7…
## $ review_scores_checkin <dbl> 10, 9, 10, 10, 10, 9, 10, 10, 10, NA, 10,…
## $ review_scores_communication <dbl> 10, 9, 10, 10, 10, 9, 10, 10, 10, NA, 10,…
## $ review_scores_location <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 9, NA, 10…
## $ review_scores_value <dbl> 9, 9, 9, 9, 10, 9, 9, 10, 10, NA, 9, 9, 8…
## $ listing_url <chr> "https://www.airbnb.com/rooms/69074", "ht…
## $ city <fct> Sheung Wan, Sheung Wan, Central, Hong Kon…
## $ description <chr> "An ideal Hong location any visitor--hip …
## $ neighborhood_overview <chr> "In the upper part of trendy, hip Sheung …
## $ cancellation_policy <fct> strict_14_with_grace_period, strict_14_wi…
skim(hong_kong_listings)
| Name | hong_kong_listings |
| Number of rows | 11187 |
| Number of columns | 35 |
| _______________________ | |
| Column type frequency: | |
| character | 4 |
| Date | 1 |
| factor | 5 |
| logical | 1 |
| numeric | 24 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| property_type | 0 | 1.00 | 3 | 22 | 0 | 41 | 0 |
| listing_url | 0 | 1.00 | 34 | 37 | 0 | 11187 | 0 |
| description | 521 | 0.95 | 1 | 1000 | 0 | 8950 | 0 |
| neighborhood_overview | 5879 | 0.47 | 1 | 1000 | 0 | 3570 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| host_since | 12 | 1 | 2009-08-17 | 2020-06-10 | 2015-12-27 | 2355 |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| neighbourhood_cleansed | 0 | 1.00 | FALSE | 18 | Yau: 4165, Cen: 2378, Wan: 2029, Kow: 471 |
| room_type | 0 | 1.00 | FALSE | 4 | Pri: 5376, Ent: 4940, Sha: 615, Hot: 256 |
| bed_type | 0 | 1.00 | FALSE | 5 | Rea: 11124, Pul: 28, Fut: 19, Air: 8 |
| city | 772 | 0.93 | FALSE | 343 | Hon: 8040, Hon: 406, She: 238, 香港: 178 |
| cancellation_policy | 0 | 1.00 | FALSE | 6 | str: 5435, fle: 4015, mod: 1687, sup: 29 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| host_is_superhost | 12 | 1 | 0.13 | FAL: 9669, TRU: 1506 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1.00 | 2.50e+07 | 1.17e+07 | 69074.00 | 1.63e+07 | 2.63e+07 | 3.47e+07 | 4.38e+07 | ▃▅▆▇▇ |
| host_id | 0 | 1.00 | 8.84e+07 | 8.74e+07 | 32172.00 | 1.69e+07 | 5.25e+07 | 1.39e+08 | 3.49e+08 | ▇▃▂▂▁ |
| host_listings_count | 12 | 1.00 | 4.85e+01 | 1.05e+02 | 0.00 | 1.00e+00 | 5.00e+00 | 2.20e+01 | 3.86e+02 | ▇▁▁▁▁ |
| accommodates | 0 | 1.00 | 2.82e+00 | 2.18e+00 | 1.00 | 2.00e+00 | 2.00e+00 | 3.00e+00 | 1.60e+01 | ▇▁▁▁▁ |
| bathrooms | 17 | 1.00 | 1.16e+00 | 5.70e-01 | 0.00 | 1.00e+00 | 1.00e+00 | 1.00e+00 | 1.10e+01 | ▇▁▁▁▁ |
| bedrooms | 38 | 1.00 | 1.09e+00 | 8.50e-01 | 0.00 | 1.00e+00 | 1.00e+00 | 1.00e+00 | 1.10e+01 | ▇▁▁▁▁ |
| beds | 69 | 0.99 | 1.68e+00 | 1.44e+00 | 0.00 | 1.00e+00 | 1.00e+00 | 2.00e+00 | 2.00e+01 | ▇▁▁▁▁ |
| price | 0 | 1.00 | 7.42e+02 | 1.89e+03 | 0.00 | 2.95e+02 | 4.81e+02 | 7.98e+02 | 7.80e+04 | ▇▁▁▁▁ |
| security_deposit | 5677 | 0.49 | 1.56e+03 | 3.75e+03 | 0.00 | 0.00e+00 | 8.00e+02 | 1.50e+03 | 3.96e+04 | ▇▁▁▁▁ |
| cleaning_fee | 5055 | 0.55 | 1.77e+02 | 2.33e+02 | 0.00 | 3.90e+01 | 1.39e+02 | 2.50e+02 | 4.80e+03 | ▇▁▁▁▁ |
| guests_included | 0 | 1.00 | 1.39e+00 | 1.06e+00 | 1.00 | 1.00e+00 | 1.00e+00 | 1.00e+00 | 1.60e+01 | ▇▁▁▁▁ |
| extra_people | 0 | 1.00 | 5.62e+01 | 1.47e+02 | 0.00 | 0.00e+00 | 0.00e+00 | 5.00e+01 | 2.34e+03 | ▇▁▁▁▁ |
| minimum_nights | 0 | 1.00 | 9.76e+00 | 2.83e+01 | 1.00 | 1.00e+00 | 2.00e+00 | 7.00e+00 | 1.10e+03 | ▇▁▁▁▁ |
| maximum_nights | 0 | 1.00 | 3.86e+05 | 2.87e+07 | 1.00 | 3.65e+02 | 1.12e+03 | 1.12e+03 | 2.15e+09 | ▇▁▁▁▁ |
| number_of_reviews | 0 | 1.00 | 1.77e+01 | 4.12e+01 | 0.00 | 0.00e+00 | 2.00e+00 | 1.40e+01 | 7.57e+02 | ▇▁▁▁▁ |
| reviews_per_month | 4155 | 0.63 | 8.40e-01 | 1.18e+00 | 0.01 | 1.20e-01 | 3.50e-01 | 1.03e+00 | 1.32e+01 | ▇▁▁▁▁ |
| number_of_reviews_ltm | 0 | 1.00 | 2.68e+00 | 7.55e+00 | 0.00 | 0.00e+00 | 0.00e+00 | 1.00e+00 | 1.38e+02 | ▇▁▁▁▁ |
| review_scores_rating | 4355 | 0.61 | 9.09e+01 | 1.12e+01 | 20.00 | 8.70e+01 | 9.40e+01 | 9.90e+01 | 1.00e+02 | ▁▁▁▂▇ |
| review_scores_accuracy | 4357 | 0.61 | 9.34e+00 | 1.12e+00 | 2.00 | 9.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | ▁▁▁▁▇ |
| review_scores_cleanliness | 4357 | 0.61 | 9.09e+00 | 1.20e+00 | 2.00 | 9.00e+00 | 9.00e+00 | 1.00e+01 | 1.00e+01 | ▁▁▁▂▇ |
| review_scores_checkin | 4356 | 0.61 | 9.50e+00 | 1.04e+00 | 2.00 | 9.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | ▁▁▁▁▇ |
| review_scores_communication | 4357 | 0.61 | 9.51e+00 | 1.03e+00 | 2.00 | 9.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | ▁▁▁▁▇ |
| review_scores_location | 4358 | 0.61 | 9.61e+00 | 8.50e-01 | 2.00 | 9.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | ▁▁▁▁▇ |
| review_scores_value | 4358 | 0.61 | 9.13e+00 | 1.13e+00 | 2.00 | 9.00e+00 | 9.00e+00 | 1.00e+01 | 1.00e+01 | ▁▁▁▂▇ |
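The price-like columns arrive as strings such as "$1,395.00", which is why they are passed through `parse_number()` before any arithmetic. A minimal sketch of what that conversion does, using two price strings from the glimpse output above plus a missing value (the vector names are illustrative):

```r
library(readr)

# Two raw price strings from the listings data, plus a missing value
raw_prices <- c("$1,395.00", "$783.00", NA)

# parse_number() drops the currency symbol and the grouping comma
parsed_prices <- parse_number(raw_prices)
parsed_prices
```

Missing values stay missing after parsing, which is why `security_deposit` and `cleaning_fee` still show thousands of NAs in the summary above.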
Here is a description of some of the key variables in our dataset hong_kong_listings:
- price: cost per night
- cleaning_fee: cleaning fee
- extra_people: charge for having more than 1 person
- property_type: type of accommodation (House, Apartment, etc.)
- room_type:
  - Entire home/apt (guests have entire place to themselves)
  - Private room (guests have private room to sleep, all other rooms shared)
  - Shared room (guests sleep in room shared with others)
- number_of_reviews: total number of reviews for the listing
- review_scores_rating: average review score (0 - 100)
- neighbourhood*: three variables on a few major neighbourhoods
Handling missing values
We remove NAs using mutate once again, and we also filter on minimum/maximum nights and accommodates.
We assume that NAs in cleaning fee and security deposit mean 0: if a value is NA, we take it that there is no cleaning fee or no security deposit. We also think this is then reflected in the daily price.
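The NA-to-zero assumption can be expressed compactly with `dplyr::coalesce()`, which returns the first non-missing value among its arguments. This is a sketch on a toy tibble with a few fee values taken from the output above; it is an alternative spelling of the idea, not the code this analysis itself uses (that relies on `case_when()`).

```r
library(dplyr)
library(tibble)

# Toy fees with missing entries
toy_fees <- tibble(
  cleaning_fee = c(310, NA, 271),
  security_deposit = c(2325, 775, NA)
)

# Replace every NA with 0, i.e. "no fee" / "no deposit"
toy_filled <- toy_fees %>%
  mutate(cleaning_fee = coalesce(cleaning_fee, 0),
         security_deposit = coalesce(security_deposit, 0))

toy_filled
```

Whether NA really means "no fee" cannot be verified from the data alone, so this remains a modelling assumption.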
Summary of property types
# Count listings per property type and compute each type's share
Property_type_summary <- hong_kong_listings %>%
  group_by(property_type) %>%
  summarise(count = n()) %>%
  mutate(property_proportion = count / sum(count)) %>%
  arrange(desc(count))
# Quick overview of all property types
ggplot(data = Property_type_summary) +
  geom_col(aes(y = count, x = property_type)) +
  coord_flip()

# Bar chart of the 10 most frequent property types
Property_type_top10 <- Property_type_summary %>%
  head(n = 10) %>%
  ggplot() +
  geom_col(aes(y = reorder(property_type, count), x = count), fill = "#00B81F") +
  theme_bw() +
  labs(y = "Property type",
       x = "",
       title = "The most popular property types on Airbnb \n in Hong Kong")
Property_type_top10

The four most common Airbnb property types in Hong Kong are apartment, condominium, serviced apartment, and hostel, which account for 67.5%, 9.01%, 4.43%, and 3.57% of all listings, respectively.
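Shares like these come from dividing each group's count by the total number of listings. As a minimal sketch, the same count-and-proportion summary can be written with `dplyr::count()`; the toy data below is invented purely to make the arithmetic easy to follow and does not reproduce the real shares.

```r
library(dplyr)
library(tibble)

# Invented toy data: 6 apartments, 3 condominiums, 1 hostel
toy_types <- tibble(
  property_type = c(rep("Apartment", 6), rep("Condominium", 3), "Hostel")
)

# count() groups, tallies and sorts in one step
toy_summary <- toy_types %>%
  count(property_type, sort = TRUE, name = "count") %>%
  mutate(property_proportion = count / sum(count))

toy_summary
```

Here `count(..., sort = TRUE)` stands in for the `group_by()`, `summarise()` and `arrange(desc())` sequence.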
Summary of minimum nights
# Count listings per minimum_nights value and compute relative frequency
Minimum_nights_summary <- hong_kong_listings %>%
  group_by(minimum_nights) %>%
  summarise(count = n()) %>%
  mutate(frequency = count / sum(count))
# Bar chart of the 5 most frequent minimum_nights values
Minimum_nights_top5 <- Minimum_nights_summary %>%
  arrange(desc(count)) %>%
  head(n = 5) %>%
  ggplot() +
  geom_col(aes(y = reorder(minimum_nights, count), x = count), fill = "darkorange") +
  theme_bw() +
  labs(y = "Minimum nights",
       x = "",
       title = "Top 5 minimum nights in Hong Kong")
Minimum_nights_top5

The five most common values of minimum nights are, in order, 1, 2, 29, 3, and 28 nights. The values that stand out are 29 and 28 nights. Since Hong Kong is a metropolis, we think these listings are aimed at people who are in Hong Kong for business rather than tourism. They need a longer stay, so the Airbnb works like a rented flat with a minimum stay of about a month. The benefit of renting an Airbnb for that period is the ease of administration: it is often impossible to find an apartment for a few weeks without the hassle of exchanging documents, credit checks, and the like.
Filter and mutate the dataset
Based on the observations and summaries above, we filter and mutate our dataset to keep only accommodations that are suitable for 2 guests who want to spend 4 nights in the Airbnb. We restrict accommodates to the range 2 to 9, since we believe it is reasonable for wealthy clients to book a place that accommodates up to 9 people. In addition, we create a new variable called prop_type_simplified that keeps the 4 most common property types and groups the rest as Other. We also assume that each NA value in cleaning_fee and security_deposit is 0, meaning that there is no cleaning fee and no security deposit.
# Filter dataset for two guests and 4 nights
# Clean dataset for cleaning_fee, security_deposit, property_type, minimum_nights and accommodates
hong_kong_listings_cleaned <- hong_kong_listings %>%
  mutate(cleaning_fee = case_when( # treat cleaning_fee as 0 if it is NA
           is.na(cleaning_fee) ~ 0,
           TRUE ~ cleaning_fee),
         security_deposit = case_when( # treat security_deposit as 0 if it is NA
           is.na(security_deposit) ~ 0,
           TRUE ~ security_deposit),
         prop_type_simplified = case_when( # regroup property_type: all less popular types become "Other"
           property_type %in% c("Apartment",
                                "Hostel",
                                "Condominium",
                                "Serviced apartment") ~ property_type,
           TRUE ~ "Other"),
         prop_type_simplified = as.factor(prop_type_simplified)) %>% # convert to factor
  filter(minimum_nights <= 4, maximum_nights >= 4, accommodates >= 2, accommodates <= 9) # bookable for 4 nights, 2 to 9 guests
# Visually inspecting cleaned data set
glimpse(hong_kong_listings_cleaned)
## Rows: 6,437
## Columns: 36
## $ id <dbl> 69074, 103760, 132773, 133390, 163664, 16…
## $ host_id <dbl> 160139, 304876, 304876, 654642, 304876, 3…
## $ host_since <date> 2010-07-07, 2010-11-30, 2010-11-30, 2011…
## $ host_is_superhost <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, …
## $ host_listings_count <dbl> 2, 12, 12, 1, 12, 12, 1, 1, 1, 8, 8, 1, 1…
## $ neighbourhood_cleansed <fct> Central & Western, Central & Western, Cen…
## $ property_type <chr> "Apartment", "Apartment", "Apartment", "A…
## $ room_type <fct> Entire home/apt, Entire home/apt, Entire …
## $ accommodates <dbl> 3, 6, 6, 2, 6, 6, 2, 2, 4, 4, 6, 3, 4, 3,…
## $ bathrooms <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1,…
## $ bedrooms <dbl> 1, 2, 2, 1, 2, 2, 1, 1, 2, 1, 3, 2, 1, 1,…
## $ beds <dbl> 2, 3, 3, 1, 3, 3, 1, 1, 3, 1, 3, 2, 2, 1,…
## $ bed_type <fct> Real Bed, Real Bed, Real Bed, Real Bed, R…
## $ price <dbl> 1395, 845, 1046, 930, 690, 767, 698, 643,…
## $ security_deposit <dbl> 2325, 775, 775, 1163, 775, 775, 0, 0, 193…
## $ cleaning_fee <dbl> 310, 271, 302, 0, 302, 302, 0, 0, 504, 31…
## $ guests_included <dbl> 2, 2, 3, 1, 2, 2, 1, 1, 4, 2, 2, 2, 4, 1,…
## $ extra_people <dbl> 155, 194, 225, 0, 194, 194, 0, 0, 0, 155,…
## $ minimum_nights <dbl> 3, 2, 2, 2, 2, 2, 1, 1, 4, 2, 4, 2, 1, 1,…
## $ maximum_nights <dbl> 365, 365, 365, 60, 365, 365, 365, 60, 112…
## $ number_of_reviews <dbl> 134, 271, 305, 27, 222, 225, 17, 163, 240…
## $ reviews_per_month <dbl> 1.18, 2.47, 2.81, 0.25, 2.07, 2.09, 0.16,…
## $ number_of_reviews_ltm <dbl> 4, 13, 48, 0, 16, 11, 0, 12, 9, 2, 49, 0,…
## $ review_scores_rating <dbl> 97, 89, 93, 97, 86, 86, 100, 98, 95, 93, …
## $ review_scores_accuracy <dbl> 10, 9, 10, 10, 9, 9, 10, 10, 10, 9, 9, 9,…
## $ review_scores_cleanliness <dbl> 9, 9, 10, 9, 9, 9, 10, 10, 9, 9, 7, 9, 9,…
## $ review_scores_checkin <dbl> 10, 10, 10, 10, 9, 10, 10, 10, 10, 9, 9, …
## $ review_scores_communication <dbl> 10, 10, 10, 10, 9, 10, 10, 10, 10, 10, 9,…
## $ review_scores_location <dbl> 10, 10, 10, 10, 10, 10, 10, 9, 10, 9, 9, …
## $ review_scores_value <dbl> 9, 9, 9, 10, 9, 9, 10, 10, 9, 9, 8, 9, 9,…
## $ listing_url <chr> "https://www.airbnb.com/rooms/69074", "ht…
## $ city <fct> Sheung Wan, Central, Hong Kong Island, Ce…
## $ description <chr> "An ideal Hong location any visitor--hip …
## $ neighborhood_overview <chr> "In the upper part of trendy, hip Sheung …
## $ cancellation_policy <fct> strict_14_with_grace_period, strict_14_wi…
## $ prop_type_simplified <fct> Apartment, Apartment, Apartment, Apartmen…
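The regrouping into prop_type_simplified hard-codes the four most common property types. As a hedged alternative sketch, `forcats::fct_lump_n()` can pick the most frequent levels from the data itself and lump the rest into "Other". The toy vector below is invented for illustration, and unlike the hard-coded list this approach would keep different levels if the frequencies changed.

```r
library(forcats)

# Invented toy property types with clearly different frequencies
property_type <- c(rep("Apartment", 5), rep("Condominium", 4),
                   rep("Serviced apartment", 3), rep("Hostel", 2),
                   "House", "Loft")

# Keep the 4 most frequent types, lump everything else into "Other"
prop_type_simplified <- fct_lump_n(property_type, n = 4, other_level = "Other")

table(prop_type_simplified)
```

The hard-coded `case_when()` version has the advantage that the categories stay fixed even after filtering changes the counts.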
skim(hong_kong_listings_cleaned)
| Name | hong_kong_listings_cleane… |
| Number of rows | 6437 |
| Number of columns | 36 |
| _______________________ | |
| Column type frequency: | |
| character | 4 |
| Date | 1 |
| factor | 6 |
| logical | 1 |
| numeric | 24 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| property_type | 0 | 1.00 | 3 | 22 | 0 | 36 | 0 |
| listing_url | 0 | 1.00 | 34 | 37 | 0 | 6437 | 0 |
| description | 292 | 0.95 | 1 | 1000 | 0 | 5212 | 0 |
| neighborhood_overview | 2931 | 0.54 | 1 | 1000 | 0 | 2374 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| host_since | 10 | 1 | 2009-10-07 | 2020-06-09 | 2015-12-28 | 1935 |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| neighbourhood_cleansed | 0 | 1.00 | FALSE | 18 | Yau: 2747, Cen: 1372, Wan: 888, Isl: 268 |
| room_type | 0 | 1.00 | FALSE | 4 | Ent: 3157, Pri: 2922, Hot: 188, Sha: 170 |
| bed_type | 0 | 1.00 | FALSE | 5 | Rea: 6404, Pul: 16, Fut: 8, Air: 6 |
| city | 468 | 0.93 | FALSE | 269 | Hon: 4538, She: 196, 香港: 97, Hon: 83 |
| cancellation_policy | 0 | 1.00 | FALSE | 6 | str: 3575, fle: 1777, mod: 1048, sup: 21 |
| prop_type_simplified | 0 | 1.00 | FALSE | 5 | Apa: 4129, Oth: 1251, Con: 541, Ser: 277 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| host_is_superhost | 10 | 1 | 0.13 | FAL: 5594, TRU: 833 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1.00 | 2.40e+07 | 1.17e+07 | 69074.00 | 1.49e+07 | 2.46e+07 | 3.38e+07 | 4.38e+07 | ▅▆▇▇▇ |
| host_id | 0 | 1.00 | 9.35e+07 | 9.02e+07 | 44242.00 | 2.37e+07 | 5.25e+07 | 1.49e+08 | 3.49e+08 | ▇▂▂▂▁ |
| host_listings_count | 10 | 1.00 | 1.19e+01 | 2.49e+01 | 0.00 | 1.00e+00 | 3.00e+00 | 1.20e+01 | 3.86e+02 | ▇▁▁▁▁ |
| accommodates | 0 | 1.00 | 3.07e+00 | 1.58e+00 | 2.00 | 2.00e+00 | 2.00e+00 | 4.00e+00 | 9.00e+00 | ▇▂▁▁▁ |
| bathrooms | 4 | 1.00 | 1.13e+00 | 4.30e-01 | 0.00 | 1.00e+00 | 1.00e+00 | 1.00e+00 | 8.00e+00 | ▇▁▁▁▁ |
| bedrooms | 14 | 1.00 | 1.13e+00 | 7.40e-01 | 0.00 | 1.00e+00 | 1.00e+00 | 1.00e+00 | 1.00e+01 | ▇▁▁▁▁ |
| beds | 26 | 1.00 | 1.77e+00 | 1.22e+00 | 0.00 | 1.00e+00 | 1.00e+00 | 2.00e+00 | 1.40e+01 | ▇▂▁▁▁ |
| price | 0 | 1.00 | 8.15e+02 | 1.67e+03 | 0.00 | 3.72e+02 | 5.50e+02 | 8.53e+02 | 6.67e+04 | ▇▁▁▁▁ |
| security_deposit | 0 | 1.00 | 5.64e+02 | 1.89e+03 | 0.00 | 0.00e+00 | 0.00e+00 | 7.84e+02 | 3.80e+04 | ▇▁▁▁▁ |
| cleaning_fee | 0 | 1.00 | 1.01e+02 | 1.88e+02 | 0.00 | 0.00e+00 | 0.00e+00 | 1.60e+02 | 4.69e+03 | ▇▁▁▁▁ |
| guests_included | 0 | 1.00 | 1.48e+00 | 1.03e+00 | 1.00 | 1.00e+00 | 1.00e+00 | 2.00e+00 | 1.60e+01 | ▇▁▁▁▁ |
| extra_people | 0 | 1.00 | 6.87e+01 | 1.58e+02 | 0.00 | 0.00e+00 | 0.00e+00 | 1.00e+02 | 2.30e+03 | ▇▁▁▁▁ |
| minimum_nights | 0 | 1.00 | 1.55e+00 | 8.50e-01 | 1.00 | 1.00e+00 | 1.00e+00 | 2.00e+00 | 4.00e+00 | ▇▂▁▂▁ |
| maximum_nights | 0 | 1.00 | 3.36e+05 | 2.68e+07 | 4.00 | 3.60e+02 | 1.12e+03 | 1.12e+03 | 2.15e+09 | ▇▁▁▁▁ |
| number_of_reviews | 0 | 1.00 | 2.32e+01 | 4.61e+01 | 0.00 | 1.00e+00 | 5.00e+00 | 2.30e+01 | 7.57e+02 | ▇▁▁▁▁ |
| reviews_per_month | 1541 | 0.76 | 9.10e-01 | 1.22e+00 | 0.01 | 1.40e-01 | 4.10e-01 | 1.16e+00 | 1.32e+01 | ▇▁▁▁▁ |
| number_of_reviews_ltm | 0 | 1.00 | 3.55e+00 | 8.71e+00 | 0.00 | 0.00e+00 | 0.00e+00 | 2.00e+00 | 1.38e+02 | ▇▁▁▁▁ |
| review_scores_rating | 1654 | 0.74 | 9.10e+01 | 1.06e+01 | 20.00 | 8.70e+01 | 9.30e+01 | 9.80e+01 | 1.00e+02 | ▁▁▁▂▇ |
| review_scores_accuracy | 1656 | 0.74 | 9.33e+00 | 1.09e+00 | 2.00 | 9.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | ▁▁▁▁▇ |
| review_scores_cleanliness | 1656 | 0.74 | 9.13e+00 | 1.13e+00 | 2.00 | 9.00e+00 | 9.00e+00 | 1.00e+01 | 1.00e+01 | ▁▁▁▂▇ |
| review_scores_checkin | 1655 | 0.74 | 9.50e+00 | 1.01e+00 | 2.00 | 9.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | ▁▁▁▁▇ |
| review_scores_communication | 1656 | 0.74 | 9.52e+00 | 9.80e-01 | 2.00 | 9.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | ▁▁▁▁▇ |
| review_scores_location | 1658 | 0.74 | 9.61e+00 | 8.40e-01 | 2.00 | 9.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | ▁▁▁▁▇ |
| review_scores_value | 1658 | 0.74 | 9.14e+00 | 1.08e+00 | 2.00 | 9.00e+00 | 9.00e+00 | 1.00e+01 | 1.00e+01 | ▁▁▁▂▇ |
Calculate total price for 4 nights and data transformation
To end the pre-processing section, we calculate price_4_nights, our target variable for regression. It is the total price of a 4-night stay for two people at each listing.
Because price_4_nights equals 0 for some listings, log(price_4_nights) would be negative infinity for them, which hinders further analysis. We therefore add 1 to those zero values and leave all other values unchanged. Since log(1) = 0, these listings get a finite log price of 0.
hong_kong_listings_total_price <- hong_kong_listings_cleaned %>%
  mutate(
    # price_4_nights: 4 nights, the cleaning fee, and the extra-guest charge
    # when only one guest is included in the base price
    price_4_nights = price * 4 +
      cleaning_fee +
      if_else(guests_included == 1, extra_people * 4, 0),
    # Add 1 to price_4_nights values that are equal to 0
    price_4_nights_transformed = price_4_nights +
      if_else(price_4_nights == 0, 1, 0),
    log_price_4_nights = log(price_4_nights),
    log_price_4_nights_transformed = log(price_4_nights_transformed))
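The effect of the zero adjustment can be checked on a handful of made-up totals. The following is a minimal sketch in base R; the vector toy_totals is hypothetical and is not taken from the listings data.

```r
# Hypothetical 4-night totals, including one zero (not from the listings data)
toy_totals <- c(0, 1612, 2388, 3792)

# Taking the log directly turns the zero into -Inf
log_raw <- log(toy_totals)

# Replace zeros with 1 before taking the log; other values are unchanged
toy_adjusted <- ifelse(toy_totals == 0, 1, toy_totals)
log_adjusted <- log(toy_adjusted)

is.finite(log_raw)       # the first element is not finite
is.finite(log_adjusted)  # every element is finite
```

Note that an adjusted log price of 0 sits far below the rest of the distribution, so these listings may still show up as outliers in later plots and residual diagnostics.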
New variables: neighbourhood_simplified and rating_group
Using our knowledge of the city, we create a new categorical variable, neighbourhood_simplified, that groups the 18 neighbourhoods geographically into 5 zones. We also create a new categorical variable, rating_group, that divides the properties into 3 categories: review_scores_rating below 90 ("Under 90"), 90 or above ("Over 90"), and no rating ("No Rating").
hong_kong_listings_neighbourhood_simplified <- hong_kong_listings_total_price %>%
mutate(
neighbourhood_simplified = case_when(
neighbourhood_cleansed=="Central & Western"~"zone_1",
neighbourhood_cleansed=="Eastern"~"zone_1",
neighbourhood_cleansed=="Islands"~"zone_2",
neighbourhood_cleansed=="Kowloon City"~"zone_3",
neighbourhood_cleansed=="Kwai Tsing"~"zone_4",
neighbourhood_cleansed=="Kwun Tong"~"zone_3",
neighbourhood_cleansed=="North"~"zone_4",
neighbourhood_cleansed=="Sai Kung"~"zone_5",
neighbourhood_cleansed=="Sha Tin"~"zone_4",
neighbourhood_cleansed=="Sham Shui Po"~"zone_3",
neighbourhood_cleansed=="Southern"~"zone_1",
neighbourhood_cleansed=="Tai Po"~"zone_4",
neighbourhood_cleansed=="Tsuen Wan"~"zone_4",
neighbourhood_cleansed=="Tuen Mun"~"zone_4",
neighbourhood_cleansed=="Wan Chai"~"zone_1",
neighbourhood_cleansed=="Wong Tai Sin"~"zone_3",
neighbourhood_cleansed=="Yau Tsim Mong"~"zone_3",
neighbourhood_cleansed=="Yuen Long"~"zone_4"),
rating_group= case_when(
review_scores_rating <90 ~ "Under 90",
is.na(review_scores_rating)~"No Rating",
TRUE ~ "Over 90"))
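The order of the conditions in rating_group decides where boundary and missing values end up. A minimal sketch on a hypothetical four-row tibble (toy_ratings is an illustration only and not part of the analysis) shows that a rating of exactly 90 falls into "Over 90" and that a missing rating is caught by the is.na() condition:

```r
library(dplyr)

# Hypothetical ratings covering each branch: below 90, exactly 90, above 90, missing
toy_ratings <- tibble(review_scores_rating = c(85, 90, 97, NA))

toy_ratings <- toy_ratings %>%
  mutate(rating_group = case_when(
    review_scores_rating < 90   ~ "Under 90",
    is.na(review_scores_rating) ~ "No Rating",
    TRUE                        ~ "Over 90"))

toy_ratings$rating_group
```

For the missing value, the comparison with 90 evaluates to NA, which case_when treats as not matched, so the row moves on to the is.na() condition.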
skim(hong_kong_listings_neighbourhood_simplified)
| Name | hong_kong_listings_neighb… |
| Number of rows | 6437 |
| Number of columns | 42 |
| _______________________ | |
| Column type frequency: | |
| character | 6 |
| Date | 1 |
| factor | 6 |
| logical | 1 |
| numeric | 28 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| property_type | 0 | 1.00 | 3 | 22 | 0 | 36 | 0 |
| listing_url | 0 | 1.00 | 34 | 37 | 0 | 6437 | 0 |
| description | 292 | 0.95 | 1 | 1000 | 0 | 5212 | 0 |
| neighborhood_overview | 2931 | 0.54 | 1 | 1000 | 0 | 2374 | 0 |
| neighbourhood_simplified | 0 | 1.00 | 6 | 6 | 0 | 5 | 0 |
| rating_group | 0 | 1.00 | 7 | 9 | 0 | 3 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| host_since | 10 | 1 | 2009-10-07 | 2020-06-09 | 2015-12-28 | 1935 |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| neighbourhood_cleansed | 0 | 1.00 | FALSE | 18 | Yau: 2747, Cen: 1372, Wan: 888, Isl: 268 |
| room_type | 0 | 1.00 | FALSE | 4 | Ent: 3157, Pri: 2922, Hot: 188, Sha: 170 |
| bed_type | 0 | 1.00 | FALSE | 5 | Rea: 6404, Pul: 16, Fut: 8, Air: 6 |
| city | 468 | 0.93 | FALSE | 269 | Hon: 4538, She: 196, 香港: 97, Hon: 83 |
| cancellation_policy | 0 | 1.00 | FALSE | 6 | str: 3575, fle: 1777, mod: 1048, sup: 21 |
| prop_type_simplified | 0 | 1.00 | FALSE | 5 | Apa: 4129, Oth: 1251, Con: 541, Ser: 277 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| host_is_superhost | 10 | 1 | 0.13 | FAL: 5594, TRU: 833 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1.00 | 2.40e+07 | 1.17e+07 | 69074.00 | 1.49e+07 | 2.46e+07 | 3.38e+07 | 4.38e+07 | ▅▆▇▇▇ |
| host_id | 0 | 1.00 | 9.35e+07 | 9.02e+07 | 44242.00 | 2.37e+07 | 5.25e+07 | 1.49e+08 | 3.49e+08 | ▇▂▂▂▁ |
| host_listings_count | 10 | 1.00 | 1.19e+01 | 2.49e+01 | 0.00 | 1.00e+00 | 3.00e+00 | 1.20e+01 | 3.86e+02 | ▇▁▁▁▁ |
| accommodates | 0 | 1.00 | 3.07e+00 | 1.58e+00 | 2.00 | 2.00e+00 | 2.00e+00 | 4.00e+00 | 9.00e+00 | ▇▂▁▁▁ |
| bathrooms | 4 | 1.00 | 1.13e+00 | 4.30e-01 | 0.00 | 1.00e+00 | 1.00e+00 | 1.00e+00 | 8.00e+00 | ▇▁▁▁▁ |
| bedrooms | 14 | 1.00 | 1.13e+00 | 7.40e-01 | 0.00 | 1.00e+00 | 1.00e+00 | 1.00e+00 | 1.00e+01 | ▇▁▁▁▁ |
| beds | 26 | 1.00 | 1.77e+00 | 1.22e+00 | 0.00 | 1.00e+00 | 1.00e+00 | 2.00e+00 | 1.40e+01 | ▇▂▁▁▁ |
| price | 0 | 1.00 | 8.15e+02 | 1.67e+03 | 0.00 | 3.72e+02 | 5.50e+02 | 8.53e+02 | 6.67e+04 | ▇▁▁▁▁ |
| security_deposit | 0 | 1.00 | 5.64e+02 | 1.89e+03 | 0.00 | 0.00e+00 | 0.00e+00 | 7.84e+02 | 3.80e+04 | ▇▁▁▁▁ |
| cleaning_fee | 0 | 1.00 | 1.01e+02 | 1.88e+02 | 0.00 | 0.00e+00 | 0.00e+00 | 1.60e+02 | 4.69e+03 | ▇▁▁▁▁ |
| guests_included | 0 | 1.00 | 1.48e+00 | 1.03e+00 | 1.00 | 1.00e+00 | 1.00e+00 | 2.00e+00 | 1.60e+01 | ▇▁▁▁▁ |
| extra_people | 0 | 1.00 | 6.87e+01 | 1.58e+02 | 0.00 | 0.00e+00 | 0.00e+00 | 1.00e+02 | 2.30e+03 | ▇▁▁▁▁ |
| minimum_nights | 0 | 1.00 | 1.55e+00 | 8.50e-01 | 1.00 | 1.00e+00 | 1.00e+00 | 2.00e+00 | 4.00e+00 | ▇▂▁▂▁ |
| maximum_nights | 0 | 1.00 | 3.36e+05 | 2.68e+07 | 4.00 | 3.60e+02 | 1.12e+03 | 1.12e+03 | 2.15e+09 | ▇▁▁▁▁ |
| number_of_reviews | 0 | 1.00 | 2.32e+01 | 4.61e+01 | 0.00 | 1.00e+00 | 5.00e+00 | 2.30e+01 | 7.57e+02 | ▇▁▁▁▁ |
| reviews_per_month | 1541 | 0.76 | 9.10e-01 | 1.22e+00 | 0.01 | 1.40e-01 | 4.10e-01 | 1.16e+00 | 1.32e+01 | ▇▁▁▁▁ |
| number_of_reviews_ltm | 0 | 1.00 | 3.55e+00 | 8.71e+00 | 0.00 | 0.00e+00 | 0.00e+00 | 2.00e+00 | 1.38e+02 | ▇▁▁▁▁ |
| review_scores_rating | 1654 | 0.74 | 9.10e+01 | 1.06e+01 | 20.00 | 8.70e+01 | 9.30e+01 | 9.80e+01 | 1.00e+02 | ▁▁▁▂▇ |
| review_scores_accuracy | 1656 | 0.74 | 9.33e+00 | 1.09e+00 | 2.00 | 9.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | ▁▁▁▁▇ |
| review_scores_cleanliness | 1656 | 0.74 | 9.13e+00 | 1.13e+00 | 2.00 | 9.00e+00 | 9.00e+00 | 1.00e+01 | 1.00e+01 | ▁▁▁▂▇ |
| review_scores_checkin | 1655 | 0.74 | 9.50e+00 | 1.01e+00 | 2.00 | 9.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | ▁▁▁▁▇ |
| review_scores_communication | 1656 | 0.74 | 9.52e+00 | 9.80e-01 | 2.00 | 9.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | ▁▁▁▁▇ |
| review_scores_location | 1658 | 0.74 | 9.61e+00 | 8.40e-01 | 2.00 | 9.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | ▁▁▁▁▇ |
| review_scores_value | 1658 | 0.74 | 9.14e+00 | 1.08e+00 | 2.00 | 9.00e+00 | 9.00e+00 | 1.00e+01 | 1.00e+01 | ▁▁▁▂▇ |
| price_4_nights | 0 | 1.00 | 3.47e+03 | 6.70e+03 | 0.00 | 1.61e+03 | 2.39e+03 | 3.79e+03 | 2.67e+05 | ▇▁▁▁▁ |
| price_4_nights_transformed | 0 | 1.00 | 3.47e+03 | 6.70e+03 | 1.00 | 1.61e+03 | 2.39e+03 | 3.79e+03 | 2.67e+05 | ▇▁▁▁▁ |
| log_price_4_nights | 0 | 1.00 | -Inf | NaN | -Inf | 7.39e+00 | 7.78e+00 | 8.24e+00 | 1.25e+01 | ▁▂▇▁▁ |
| log_price_4_nights_transformed | 0 | 1.00 | 7.84e+00 | 6.80e-01 | 0.00 | 7.39e+00 | 7.78e+00 | 8.24e+00 | 1.25e+01 | ▁▁▅▇▁ |
Overview of our cleansed data
How many variables (columns)? How many observations (rows)?
The original dataset, listings, had 11187 observations with 106 variables. After cleaning the data, removing variables and observations with a lot of NAs, and using our own judgement to remove insignificant variables, we end up with a final dataset, hong_kong_listings_neighbourhood_simplified, with 6437 observations and 42 variables. This dataset is used for our regression models.
Which variables are numbers?
The original dataset, listings, had 39 numeric variables, whereas our final cleaned dataset has 28 numeric variables. Examples of numeric variables in the dataset are id, accommodates, bedrooms, beds, price and price_4_nights.
Which are categorical or factor variables? - numeric or character variables with variables that have a fixed and known set of possible values?
The original dataset, listings, had 46 categorical or factor variables whereas our final cleaned dataset had 12 categorical and factor variables. Some examples of factor and categorical variables in the dataset are the variables neighbourhood_cleansed, room_type, bed_type.
Exploratory Data Analysis
Summary statistics and favstats
Now that we have cleaned our data for our specific target (4 nights, 2 people), we conduct an exploratory data analysis.
#summary to check for NA's and general statistics
summary(hong_kong_listings_neighbourhood_simplified)
## id host_id host_since host_is_superhost
## Min. : 69074 Min. :4.42e+04 Min. :2009-10-07 Mode :logical
## 1st Qu.:14921794 1st Qu.:2.37e+07 1st Qu.:2014-11-16 FALSE:5594
## Median :24554597 Median :5.25e+07 Median :2015-12-28 TRUE :833
## Mean :24021748 Mean :9.35e+07 Mean :2016-02-29 NA's :10
## 3rd Qu.:33810314 3rd Qu.:1.49e+08 3rd Qu.:2017-09-07
## Max. :43751721 Max. :3.49e+08 Max. :2020-06-09
## NA's :10
## host_listings_count neighbourhood_cleansed property_type
## Min. : 0 Yau Tsim Mong :2747 Length:6437
## 1st Qu.: 1 Central & Western:1372 Class :character
## Median : 3 Wan Chai : 888 Mode :character
## Mean : 12 Islands : 268
## 3rd Qu.: 12 Kowloon City : 257
## Max. :386 North : 156
## NA's :10 (Other) : 749
## room_type accommodates bathrooms bedrooms
## Entire home/apt:3157 Min. :2.00 Min. :0.00 Min. : 0.00
## Hotel room : 188 1st Qu.:2.00 1st Qu.:1.00 1st Qu.: 1.00
## Private room :2922 Median :2.00 Median :1.00 Median : 1.00
## Shared room : 170 Mean :3.07 Mean :1.13 Mean : 1.13
## 3rd Qu.:4.00 3rd Qu.:1.00 3rd Qu.: 1.00
## Max. :9.00 Max. :8.00 Max. :10.00
## NA's :4 NA's :14
## beds bed_type price security_deposit
## Min. : 0.00 Airbed : 6 Min. : 0 Min. : 0
## 1st Qu.: 1.00 Couch : 3 1st Qu.: 372 1st Qu.: 0
## Median : 1.00 Futon : 8 Median : 550 Median : 0
## Mean : 1.77 Pull-out Sofa: 16 Mean : 815 Mean : 564
## 3rd Qu.: 2.00 Real Bed :6404 3rd Qu.: 853 3rd Qu.: 784
## Max. :14.00 Max. :66667 Max. :38000
## NA's :26
## cleaning_fee guests_included extra_people minimum_nights
## Min. : 0 Min. : 1.00 Min. : 0 Min. :1.00
## 1st Qu.: 0 1st Qu.: 1.00 1st Qu.: 0 1st Qu.:1.00
## Median : 0 Median : 1.00 Median : 0 Median :1.00
## Mean : 101 Mean : 1.48 Mean : 69 Mean :1.55
## 3rd Qu.: 160 3rd Qu.: 2.00 3rd Qu.: 100 3rd Qu.:2.00
## Max. :4689 Max. :16.00 Max. :2300 Max. :4.00
##
## maximum_nights number_of_reviews reviews_per_month number_of_reviews_ltm
## Min. :4.00e+00 Min. : 0 Min. : 0 Min. : 0.0
## 1st Qu.:3.60e+02 1st Qu.: 1 1st Qu.: 0 1st Qu.: 0.0
## Median :1.12e+03 Median : 5 Median : 0 Median : 0.0
## Mean :3.36e+05 Mean : 23 Mean : 1 Mean : 3.5
## 3rd Qu.:1.12e+03 3rd Qu.: 23 3rd Qu.: 1 3rd Qu.: 2.0
## Max. :2.15e+09 Max. :757 Max. :13 Max. :138.0
## NA's :1541
## review_scores_rating review_scores_accuracy review_scores_cleanliness
## Min. : 20 Min. : 2 Min. : 2
## 1st Qu.: 87 1st Qu.: 9 1st Qu.: 9
## Median : 93 Median :10 Median : 9
## Mean : 91 Mean : 9 Mean : 9
## 3rd Qu.: 98 3rd Qu.:10 3rd Qu.:10
## Max. :100 Max. :10 Max. :10
## NA's :1654 NA's :1656 NA's :1656
## review_scores_checkin review_scores_communication review_scores_location
## Min. : 2 Min. : 2 Min. : 2
## 1st Qu.: 9 1st Qu.: 9 1st Qu.: 9
## Median :10 Median :10 Median :10
## Mean :10 Mean :10 Mean :10
## 3rd Qu.:10 3rd Qu.:10 3rd Qu.:10
## Max. :10 Max. :10 Max. :10
## NA's :1655 NA's :1656 NA's :1658
## review_scores_value listing_url city
## Min. : 2 Length:6437 Hong Kong :4538
## 1st Qu.: 9 Class :character Shenzhen : 196
## Median : 9 Mode :character 香港 : 97
## Mean : 9 Hong Kong Island: 83
## 3rd Qu.:10 Kowloon : 80
## Max. :10 (Other) : 975
## NA's :1658 NA's : 468
## description neighborhood_overview cancellation_policy
## Length:6437 Length:6437 flexible :1777
## Class :character Class :character moderate :1048
## Mode :character Mode :character strict : 2
## strict_14_with_grace_period:3575
## super_strict_30 : 14
## super_strict_60 : 21
##
## prop_type_simplified price_4_nights price_4_nights_transformed
## Apartment :4129 Min. : 0 Min. : 1
## Condominium : 541 1st Qu.: 1612 1st Qu.: 1612
## Hostel : 239 Median : 2388 Median : 2388
## Other :1251 Mean : 3469 Mean : 3469
## Serviced apartment: 277 3rd Qu.: 3792 3rd Qu.: 3792
## Max. :266668 Max. :266668
##
## log_price_4_nights log_price_4_nights_transformed neighbourhood_simplified
## Min. : -Inf Min. : 0.00 Length:6437
## 1st Qu.: 7.39 1st Qu.: 7.39 Class :character
## Median : 7.78 Median : 7.78 Mode :character
## Mean : -Inf Mean : 7.84
## 3rd Qu.: 8.24 3rd Qu.: 8.24
## Max. :12.49 Max. :12.49
##
## rating_group
## Length:6437
## Class :character
## Mode :character
##
##
##
##
#running favstats on some interesting variable combinations
favstats(price_4_nights_transformed~accommodates,
data=hong_kong_listings_neighbourhood_simplified)
## accommodates min Q1 median Q3 max mean sd n missing
## 1 2 50 1396 1924 2951 266668 2770 5802 3600 0
## 2 3 1 1736 2472 3632 43744 3224 4001 894 0
## 3 4 1 2071 3070 4250 43772 3926 4251 1020 0
## 4 5 836 2550 3662 5210 43772 5058 5670 287 0
## 5 6 596 3018 3914 5446 232008 6276 17509 323 0
## 6 7 1488 3860 4514 5814 43772 6091 6501 82 0
## 7 8 312 3812 4558 5952 74807 6578 8747 201 0
## 8 9 312 4692 5328 6479 13836 5819 2662 30 0
favstats(price_4_nights_transformed~neighbourhood_cleansed,
data=hong_kong_listings_neighbourhood_simplified)
## neighbourhood_cleansed min Q1 median Q3 max mean sd n missing
## 1 Central & Western 404 2388 3308 4328 35184 3789 2456 1372 0
## 2 Eastern 900 1566 2212 3685 19996 3063 2526 152 0
## 3 Islands 1 2107 2792 3908 34536 3544 3658 268 0
## 4 Kowloon City 868 1704 3272 4300 43772 5120 8610 257 0
## 5 Kwai Tsing 868 1800 2420 3936 8000 2910 1776 21 0
## 6 Kwun Tong 1024 1536 2108 3302 198000 9603 35150 31 0
## 7 North 312 1070 1448 2333 47212 2649 5337 156 0
## 8 Sai Kung 808 1949 2646 4107 59988 4164 6745 84 0
## 9 Sha Tin 836 1444 2450 2986 5508 2415 976 56 0
## 10 Sham Shui Po 312 1242 2200 3462 15500 2691 2136 99 0
## 11 Southern 1 2577 3842 7370 232008 10433 30227 64 0
## 12 Tai Po 1041 1412 2668 3150 10512 3030 2421 27 0
## 13 Tsuen Wan 1268 2364 3162 4827 43772 7092 11815 30 0
## 14 Tuen Mun 1056 1611 1844 2684 7656 2344 1422 24 0
## 15 Wan Chai 588 2016 3054 4284 266668 3783 9130 888 0
## 16 Wong Tai Sin 808 2186 2656 3008 4000 2663 946 10 0
## 17 Yau Tsim Mong 50 1396 1860 2869 74807 2956 4945 2747 0
## 18 Yuen Long 312 1412 1774 2342 34564 2354 2983 151 0
favstats(price_4_nights_transformed~host_is_superhost,
data=hong_kong_listings_neighbourhood_simplified)
## host_is_superhost min Q1 median Q3 max mean sd n missing
## 1 FALSE 1 1580 2388 3766 266668 3424 6466 5594 0
## 2 TRUE 712 1828 2572 3893 198000 3786 8110 833 0
favstats(price_4_nights_transformed~prop_type_simplified,
data=hong_kong_listings_neighbourhood_simplified)
## prop_type_simplified min Q1 median Q3 max mean sd n missing
## 1 Apartment 1 1800 2784 4000 43772 3385 3359 4129 0
## 2 Condominium 50 1370 2062 3532 266668 3712 12528 541 0
## 3 Hostel 312 1288 1612 2144 11996 1836 1100 239 0
## 4 Other 312 1450 1996 2932 232008 3610 10212 1251 0
## 5 Serviced apartment 528 1434 1860 3692 43772 5030 9578 277 0
favstats(price_4_nights_transformed~minimum_nights,
data=hong_kong_listings_neighbourhood_simplified)
## minimum_nights min Q1 median Q3 max mean sd n missing
## 1 1 1 1452 2020 3332 266668 3397 8056 4121 0
## 2 2 50 2016 3004 4162 43772 3583 3458 1326 0
## 3 3 1 2138 3102 4300 29996 3545 2277 726 0
## 4 4 684 2524 3341 4400 23996 3823 2487 264 0
Data visualization
Building upon the above summary and favstats investigations, we visualize our data by using ggplot2.
#Distribution of Airbnb property types in Hong Kong
ggplot(hong_kong_listings_neighbourhood_simplified,
aes(y=(prop_type_simplified),
fill = neighbourhood_simplified))+
geom_bar()+
facet_wrap(~neighbourhood_simplified)+
labs(title = "Distribution of Airbnb Property Types \n in Different Geographic Zones",
x = "Number of Properties",
y = "Property Type") +
theme_bw() +
theme(title = element_text(size = 15, face = "bold"),
axis.text.x = element_text(size = 10, angle=30),
axis.text.y = element_text(size = 10), legend.position = "none")

# Density plot of ratings by zones
ggplot(hong_kong_listings_neighbourhood_simplified, aes(x=review_scores_rating, fill=neighbourhood_simplified, alpha = 0.1))+
geom_density()+
scale_alpha(guide = "none") +
labs(title = "Density plot of ratings by Different \n Geographic Zones",
x = "Ratings",
y = "Density") + theme_bw()+
theme(title = element_text(size = 15, face = "bold"),
axis.text.x = element_text(size = 8),
axis.text.y = element_text(size = 8),
legend.text = element_text(size=8),
legend.position = "bottom")

# Distribution of average cleaning fee and security deposit by property type
cleaning_security <- hong_kong_listings_neighbourhood_simplified %>%
group_by(prop_type_simplified) %>%
summarise(mean_cleaning_fee = mean(cleaning_fee),
mean_security_deposit = mean(security_deposit))
cleaning_security <- pivot_longer(cleaning_security,
cols = 2:3, names_to = "Type", values_to = "value")
ggplot(cleaning_security,aes(x=prop_type_simplified, y = value, fill = Type))+
geom_col(position = "dodge")+
labs(title = "Distribution of Average Cleaning Fee and \n Security Deposit by Property Type",
x = "Property Type",
y = "Dollars") +
theme_bw()+
theme(title = element_text(size = 15, face = "bold"),
axis.text.x = element_text(size = 10),
axis.text.y = element_text(size = 10),
legend.text = element_text(size=10))

# Boxplot of log(prices_4_night) by zones
ggplot(hong_kong_listings_neighbourhood_simplified,
aes(x=neighbourhood_simplified, y = log_price_4_nights_transformed,
fill = neighbourhood_simplified, alpha =0.5))+
geom_boxplot()+
labs(title = "Boxplot of Total Price for 4 nights \n by zones",
subtitle = "Zone 3 has the lowest median total price",
x = "Zones",
y = "Log (Price for 4 Nights)") +
theme_bw()+
theme(title = element_text(size = 15, face = "bold"),
axis.text.x = element_text(size = 10),
axis.text.y = element_text(size = 10),
legend.position = "none")

Correlation matrix
#Producing scatterplot-correlation matrix between important variables in the dataset
ggp <- hong_kong_listings_neighbourhood_simplified %>%
select(c(price_4_nights,
neighbourhood_simplified,
accommodates,
bathrooms,
beds,
security_deposit,
cleaning_fee,
number_of_reviews,
review_scores_rating)) %>%
ggpairs(cardinality_threshold = NULL)
print(ggp, progress = F)

What are the correlations between variables? Does each scatterplot support a linear relationship between variables? Do any of the correlations appear to be conditional on the value of a categorical variable?
Above, we check the correlations between the numeric variables in the dataset. As expected, price_4_nights and price are highly correlated at 0.997, since price_4_nights is calculated from price. The variables accommodates and beds also have a strong relationship, with a correlation of 0.758, and reviews_per_month and number_of_reviews_ltm are highly correlated at 0.826. Furthermore, review_scores_rating has a strong relationship with each of the other rating categories, such as review_scores_accuracy and review_scores_cleanliness, with correlation coefficients greater than 0.7. This is particularly useful when we select variables for our regression analysis, because it suggests that including review_scores_rating alone would suffice.
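Correlations like these can also be read off numerically rather than from the plot. The following is a minimal sketch in base R on a hypothetical data frame (toy and its values are made up for illustration); because the review scores contain missing values, it uses pairwise-complete observations.

```r
# Hypothetical listings: beds roughly tracks accommodates, rating has gaps
toy <- data.frame(
  accommodates = c(2, 2, 3, 4, 4, 6, 8),
  beds         = c(1, 1, 2, 2, 3, 3, 5),
  rating       = c(97, NA, 89, 93, NA, 86, 100)
)

# Pairwise-complete correlations skip the missing ratings instead of returning NA
cor_matrix <- cor(toy, use = "pairwise.complete.obs")
round(cor_matrix, 2)
```

Assuming the same approach is applied to the numeric columns of hong_kong_listings_neighbourhood_simplified, the resulting matrix would give the pairwise coefficients discussed above in one table.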
Mapping
library(leaflet)
leaflet(data = filter(listings, minimum_nights <= 4)) %>%
addProviderTiles("OpenStreetMap.Mapnik") %>%
addCircleMarkers(lng = ~longitude,
lat = ~latitude,
radius = 1,
fillColor = "blue",
fillOpacity = 0.4,
popup = ~listing_url,
label = ~property_type)
Regression Analysis
Price_4_nights vs. Log (price_4_nights)
# histogram of price_4_nights
ggplot(hong_kong_listings_total_price, aes (x = price_4_nights))+
geom_histogram()+
xlim(c(0,20000))+
labs(title = "Histogram of Total Prices for 4 Nights",
x = "Total Prices for 4 Nights",
y = "Count")+
# apply theme_bw() first so that it does not reset the text sizes set below
theme_bw()+
theme(title = element_text(size=15),
axis.text.x = element_text(size=10),
axis.text.y=element_text(size=10))

# histogram of log(price_4_nights)
ggplot(hong_kong_listings_total_price, aes (x = log_price_4_nights))+
geom_histogram()+
labs(title = "Histogram of Log (Prices for 4 Nights)",
x = "Log Prices for 4 Nights",
y = "Count")+
# apply theme_bw() first so that it does not reset the text sizes set below
theme_bw()+
theme(title = element_text(size=15),
axis.text.x = element_text(size=10),
axis.text.y=element_text(size=10))

We use log(price_4_nights) because the histograms show that its distribution is roughly normal, whereas the distribution of price_4_nights is strongly right-skewed. If we used the untransformed price_4_nights in the regression analysis, the relationship might not be linear and the residual variance might not be constant.
Model 1
# explanatory variables: prop_type_simplified, number_of_reviews, review_scores_rating
model1 <- lm(log_price_4_nights_transformed ~
prop_type_simplified +
number_of_reviews +
review_scores_rating,
data = hong_kong_listings_total_price)
msummary(model1)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.284387 0.078870 92.36 < 2e-16
## prop_type_simplifiedCondominium -0.142477 0.033567 -4.24 2.2e-05
## prop_type_simplifiedHostel -0.449322 0.048289 -9.30 < 2e-16
## prop_type_simplifiedOther -0.173379 0.023501 -7.38 1.9e-13
## prop_type_simplifiedServiced apartment -0.172016 0.048864 -3.52 0.00044
## number_of_reviews -0.000868 0.000176 -4.92 8.9e-07
## review_scores_rating 0.007045 0.000854 8.25 < 2e-16
##
## (Intercept) ***
## prop_type_simplifiedCondominium ***
## prop_type_simplifiedHostel ***
## prop_type_simplifiedOther ***
## prop_type_simplifiedServiced apartment ***
## number_of_reviews ***
## review_scores_rating ***
##
## Residual standard error: 0.621 on 4776 degrees of freedom
## (1654 observations deleted due to missingness)
## Multiple R-squared: 0.0502, Adjusted R-squared: 0.049
## F-statistic: 42.1 on 6 and 4776 DF, p-value: <2e-16
car::vif(model1)
## GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.02 4 1.00
## number_of_reviews 1.01 1 1.00
## review_scores_rating 1.02 1 1.01
plot(model1)




Interpretations of Regression Output
Since the dependent variable is the logarithm of price_4_nights, we exponentiate each coefficient and subtract 1 to obtain the proportional change in price_4_nights for a one-unit increase in the corresponding X variable, everything else equal. Property type is a categorical variable, so the regression uses Apartment as the baseline, which is not shown in the regression output.
The coefficient for number_of_reviews is -0.000868, so the proportional change in price_4_nights is (e^-0.000868 - 1) = -0.00087. That is, for every additional review, price_4_nights decreases by about 0.087%.
The coefficient for review_scores_rating is 0.007045, so the proportional change in price_4_nights is (e^0.007045 - 1) = 0.00707. That is, for every increase of 1 in the review rating score, price_4_nights increases by about 0.71%.
For property type, everything else equal and compared to an apartment:
- if the property type is condominium, price_4_nights changes by (e^-0.142477 - 1) = -0.133, a decrease of about 13.3%.
- if the property type is hostel, price_4_nights changes by (e^-0.449322 - 1) = -0.362, a decrease of about 36.2%.
- if the property type is other, price_4_nights changes by (e^-0.173379 - 1) = -0.159, a decrease of about 15.9%.
- if the property type is serviced apartment, price_4_nights changes by (e^-0.172016 - 1) = -0.158, a decrease of about 15.8%.
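The conversion from log-scale coefficients to percentage changes can be reproduced in a few lines. This is a minimal sketch in base R; the coefficient values are copied by hand from the model1 summary above, and the names coefs and pct_change are chosen here for illustration.

```r
# Coefficients copied from the model1 summary above
coefs <- c(
  number_of_reviews    = -0.000868,
  review_scores_rating =  0.007045,
  Condominium          = -0.142477,
  Hostel               = -0.449322,
  Other                = -0.173379,
  `Serviced apartment` = -0.172016
)

# Percentage change in price_4_nights for a one-unit increase in each predictor
# (for the property types: relative to the Apartment baseline)
pct_change <- 100 * (exp(coefs) - 1)
round(pct_change, 2)
```

With the fitted model still in the session, the same calculation could be applied directly to coef(model1) instead of retyping the estimates.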
Interpretation of the above plots
- First plot (Residuals vs Fitted):
  - This plot detects several types of violations of the linear regression assumptions.
  - Whether linearity holds. This is indicated by the mean residual value for every fitted value region being close to 0: the closer the red line is to the dashed line, the better.
  - Whether homoskedasticity holds. The spread of residuals should be approximately the same across the x-axis.
  - Whether there are outliers. This is indicated by some ‘extreme’ residuals that are far from the rest.
- Second plot (Normal Q-Q):
  - The Q-Q plot, or quantile-quantile plot, is a graphical tool to help us assess if a set of data plausibly came from some theoretical distribution such as a Normal or exponential.
  - A Q-Q plot is a scatterplot created by plotting two sets of quantiles against one another. If both sets of quantiles came from the same distribution, we should see the points forming a line that’s roughly straight.
- Third plot (Scale-Location):
  - If the red line is approximately horizontal, then the average magnitude of the standardized residuals isn’t changing much as a function of the fitted values.
  - If the spread around the red line doesn’t vary with the fitted values, then the variability of magnitudes doesn’t vary much as a function of the fitted values.
- Fourth plot (Residuals vs Leverage):
  - This can help detect outliers in a linear regression model.
  - We’re looking at how the spread of standardized residuals changes as the leverage, or sensitivity of the fitted ŷ_i to a change in y_i, increases. First, this can also be used to detect heteroskedasticity and non-linearity. The spread of standardized residuals shouldn’t change as a function of leverage: here it appears to decrease, indicating heteroskedasticity.
  - Second, points with high leverage may be influential: that is, deleting them would change the model a lot. For this we can look at Cook’s distance, which measures the effect of deleting a point on the combined parameter vector. Cook’s distance is shown by the dotted red line here, and points outside the dotted line have high influence. In this case there are no points outside the dotted line.
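The quantities behind the fourth plot can also be inspected numerically. The following is a minimal sketch on R's built-in mtcars data, used here only as a stand-in for the Airbnb model; toy_model and its formula are assumptions for illustration.

```r
# Toy model on the built-in mtcars data, standing in for model1
toy_model <- lm(log(mpg) ~ wt + hp, data = mtcars)

# Leverage (diagonal of the hat matrix) and Cook's distance for every observation
leverage <- hatvalues(toy_model)
cooks_d  <- cooks.distance(toy_model)

# Observations with the largest Cook's distance
head(sort(cooks_d, decreasing = TRUE), 3)

# Number of points beyond the 0.5 contour that plot() draws by default
sum(cooks_d > 0.5)
```

Applying hatvalues() and cooks.distance() to model1 in the same way would list the listings closest to the dotted contour instead of reading them off the plot.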
Model 2
# explanatory variables in model1 plus room_type
model2 <- lm(log_price_4_nights_transformed ~
prop_type_simplified +
number_of_reviews +
review_scores_rating +
room_type,
data = hong_kong_listings_total_price)
msummary(model2)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.806235 0.075081 103.97 < 2e-16
## prop_type_simplifiedCondominium -0.126730 0.030977 -4.09 4.4e-05
## prop_type_simplifiedHostel -0.218213 0.046868 -4.66 3.3e-06
## prop_type_simplifiedOther 0.005852 0.023389 0.25 0.802
## prop_type_simplifiedServiced apartment -0.020805 0.045874 -0.45 0.650
## number_of_reviews -0.000450 0.000164 -2.75 0.006
## review_scores_rating 0.003369 0.000799 4.22 2.5e-05
## room_typeHotel room -0.245381 0.051155 -4.80 1.7e-06
## room_typePrivate room -0.533044 0.018560 -28.72 < 2e-16
## room_typeShared room -0.249947 0.056373 -4.43 9.5e-06
##
## (Intercept) ***
## prop_type_simplifiedCondominium ***
## prop_type_simplifiedHostel ***
## prop_type_simplifiedOther
## prop_type_simplifiedServiced apartment
## number_of_reviews **
## review_scores_rating ***
## room_typeHotel room ***
## room_typePrivate room ***
## room_typeShared room ***
##
## Residual standard error: 0.573 on 4773 degrees of freedom
## (1654 observations deleted due to missingness)
## Multiple R-squared: 0.192, Adjusted R-squared: 0.191
## F-statistic: 126 on 9 and 4773 DF, p-value: <2e-16
car::vif(model2)
## GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.29 4 1.03
## number_of_reviews 1.02 1 1.01
## review_scores_rating 1.05 1 1.02
## room_type 1.34 3 1.05
plot(model2)




room_type is a significant predictor of price_4_nights: as shown in the model 2 summary above, the t-values for all three room_type levels have absolute values greater than 2.
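Each t-value compares one level of a categorical variable with the baseline level. To judge the variable as a whole, one option is an F-test between nested models. The sketch below uses simulated data; the variable names, effect sizes and sample size are assumptions chosen only to illustrate the call.

```r
# Simulated listings (hypothetical values, not the Hong Kong data)
set.seed(1)
n <- 300
room_type <- factor(sample(c("Entire home/apt", "Private room", "Shared room"),
                           n, replace = TRUE))
rating <- rnorm(n, mean = 90, sd = 5)
shift  <- c("Entire home/apt" = 0, "Private room" = -0.5, "Shared room" = -0.8)
log_price <- 7 + 0.005 * rating +
  unname(shift[as.character(room_type)]) + rnorm(n, sd = 0.4)

small <- lm(log_price ~ rating)              # without the factor
large <- lm(log_price ~ rating + room_type)  # with the factor

# F-test of the null hypothesis that all room_type coefficients are zero
f_test <- anova(small, large)
print(f_test)
p_value <- f_test[["Pr(>F)"]][2]
```

A small `p_value` indicates that the factor improves the fit jointly, regardless of which individual levels clear the t-value threshold.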
Further Regression Analysis
Model 3
When performing regression analysis, we removed variables that are perfectly collinear with others (property_type, which duplicates prop_type_simplified). In addition, after checking the Variance Inflation Factors, we found that neighbourhood_cleansed and city have very high collinearity, so we removed both variables from our regression analysis and used neighbourhood_simplified instead. Although the VIFs of reviews_per_month, room_type and review_scores_rating are all larger than 5, they are smaller than 10, so we decided to keep these variables for our next model because they could be influential on our total price.
model3 <- lm(log_price_4_nights_transformed ~ .
-log_price_4_nights
- price_4_nights
- price_4_nights_transformed
- listing_url
- id
- host_id
- description
- neighborhood_overview
- city
- property_type
- neighbourhood_cleansed,
data = hong_kong_listings_neighbourhood_simplified)
msummary(model3)
## Estimate Std. Error t value
## (Intercept) 7.36e+00 3.31e-01 22.26
## host_since -1.40e-05 1.06e-05 -1.32
## host_is_superhostTRUE 1.12e-01 1.92e-02 5.84
## host_listings_count -1.27e-03 3.94e-04 -3.23
## room_typeHotel room -2.71e-02 4.08e-02 -0.66
## room_typePrivate room -2.04e-01 1.99e-02 -10.25
## room_typeShared room -3.03e-01 6.05e-02 -5.01
## accommodates 7.50e-02 8.27e-03 9.07
## bathrooms 5.07e-03 2.31e-02 0.22
## bedrooms 1.98e-02 1.29e-02 1.54
## beds 1.66e-02 1.02e-02 1.63
## bed_typeFuton 2.28e-01 3.17e-01 0.72
## bed_typePull-out Sofa 1.70e-01 2.72e-01 0.63
## bed_typeReal Bed 2.57e-01 2.46e-01 1.04
## price 3.48e-04 6.65e-06 52.39
## security_deposit 1.31e-05 3.78e-06 3.45
## cleaning_fee 2.94e-04 3.73e-05 7.89
## guests_included -2.07e-02 7.72e-03 -2.68
## extra_people 4.45e-04 4.79e-05 9.29
## minimum_nights 3.46e-02 9.69e-03 3.58
## maximum_nights -2.98e-06 3.37e-06 -0.88
## number_of_reviews 9.66e-05 2.39e-04 0.40
## reviews_per_month -2.97e-02 1.39e-02 -2.14
## number_of_reviews_ltm -4.97e-04 1.25e-03 -0.40
## review_scores_rating 5.75e-03 1.91e-03 3.01
## review_scores_accuracy -1.09e-02 1.34e-02 -0.82
## review_scores_cleanliness 2.21e-02 1.10e-02 2.02
## review_scores_checkin 2.75e-02 1.18e-02 2.33
## review_scores_communication -2.70e-02 1.34e-02 -2.02
## review_scores_location -1.35e-02 1.27e-02 -1.07
## review_scores_value -5.33e-02 1.27e-02 -4.20
## cancellation_policymoderate 3.51e-03 2.31e-02 0.15
## cancellation_policystrict -1.08e-01 3.50e-01 -0.31
## cancellation_policystrict_14_with_grace_period 2.05e-02 2.04e-02 1.01
## prop_type_simplifiedCondominium -3.29e-02 2.69e-02 -1.23
## prop_type_simplifiedHostel -1.10e-01 4.83e-02 -2.29
## prop_type_simplifiedOther -1.30e-02 2.03e-02 -0.64
## prop_type_simplifiedServiced apartment 2.84e-02 3.86e-02 0.74
## neighbourhood_simplifiedzone_2 -1.07e-01 3.32e-02 -3.22
## neighbourhood_simplifiedzone_3 -1.27e-01 1.93e-02 -6.55
## neighbourhood_simplifiedzone_4 -3.01e-01 3.24e-02 -9.30
## neighbourhood_simplifiedzone_5 -9.94e-02 5.53e-02 -1.80
## rating_groupUnder 90 -1.15e-02 2.37e-02 -0.48
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## host_since 0.18630
## host_is_superhostTRUE 5.9e-09 ***
## host_listings_count 0.00125 **
## room_typeHotel room 0.50628
## room_typePrivate room < 2e-16 ***
## room_typeShared room 5.9e-07 ***
## accommodates < 2e-16 ***
## bathrooms 0.82648
## bedrooms 0.12324
## beds 0.10233
## bed_typeFuton 0.47180
## bed_typePull-out Sofa 0.53199
## bed_typeReal Bed 0.29711
## price < 2e-16 ***
## security_deposit 0.00057 ***
## cleaning_fee 4.6e-15 ***
## guests_included 0.00741 **
## extra_people < 2e-16 ***
## minimum_nights 0.00036 ***
## maximum_nights 0.37687
## number_of_reviews 0.68665
## reviews_per_month 0.03258 *
## number_of_reviews_ltm 0.69063
## review_scores_rating 0.00268 **
## review_scores_accuracy 0.41263
## review_scores_cleanliness 0.04341 *
## review_scores_checkin 0.01990 *
## review_scores_communication 0.04385 *
## review_scores_location 0.28604
## review_scores_value 2.8e-05 ***
## cancellation_policymoderate 0.87915
## cancellation_policystrict 0.75836
## cancellation_policystrict_14_with_grace_period 0.31482
## prop_type_simplifiedCondominium 0.22027
## prop_type_simplifiedHostel 0.02230 *
## prop_type_simplifiedOther 0.52338
## prop_type_simplifiedServiced apartment 0.46200
## neighbourhood_simplifiedzone_2 0.00132 **
## neighbourhood_simplifiedzone_3 6.7e-11 ***
## neighbourhood_simplifiedzone_4 < 2e-16 ***
## neighbourhood_simplifiedzone_5 0.07247 .
## rating_groupUnder 90 0.62822
##
## Residual standard error: 0.346 on 2598 degrees of freedom
## (3796 observations deleted due to missingness)
## Multiple R-squared: 0.714, Adjusted R-squared: 0.709
## F-statistic: 154 on 42 and 2598 DF, p-value: <2e-16
car::vif(model3)
## GVIF Df GVIF^(1/(2*Df))
## host_since 1.41 1 1.19
## host_is_superhost 1.25 1 1.12
## host_listings_count 1.61 1 1.27
## room_type 6.08 3 1.35
## accommodates 4.27 1 2.07
## bathrooms 1.85 1 1.36
## bedrooms 1.79 1 1.34
## beds 3.92 1 1.98
## bed_type 1.05 3 1.01
## price 1.14 1 1.07
## security_deposit 1.18 1 1.09
## cleaning_fee 1.40 1 1.18
## guests_included 1.62 1 1.27
## extra_people 1.65 1 1.29
## minimum_nights 1.54 1 1.24
## maximum_nights 1.01 1 1.01
## number_of_reviews 3.60 1 1.90
## reviews_per_month 7.07 1 2.66
## number_of_reviews_ltm 3.72 1 1.93
## review_scores_rating 6.77 1 2.60
## review_scores_accuracy 3.49 1 1.87
## review_scores_cleanliness 2.63 1 1.62
## review_scores_checkin 2.30 1 1.52
## review_scores_communication 2.66 1 1.63
## review_scores_location 1.70 1 1.30
## review_scores_value 3.13 1 1.77
## cancellation_policy 1.35 3 1.05
## prop_type_simplified 2.09 4 1.10
## neighbourhood_simplified 2.49 4 1.12
## rating_group 2.43 1 1.56
plot(model3)
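For a numeric predictor with one degree of freedom, the GVIF reported above is the ordinary variance inflation factor, 1 / (1 - R²), where R² comes from regressing that predictor on all the others. The sketch below reproduces this by hand in base R on simulated data; `vif_by_hand` and the variables `x1`, `x2`, `x3` are hypothetical names introduced for illustration.

```r
# Simulated predictors: x2 is built from x1, x3 is independent of both
set.seed(7)
n  <- 500
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.3)
x3 <- rnorm(n)
y  <- 1 + x1 + x2 + x3 + rnorm(n)
d  <- data.frame(y, x1, x2, x3)

# VIF of each predictor: regress it on the remaining predictors
vif_by_hand <- function(data, predictors) {
  sapply(predictors, function(v) {
    others <- setdiff(predictors, v)
    r2 <- summary(lm(reformulate(others, response = v), data = data))$r.squared
    1 / (1 - r2)
  })
}

vifs <- vif_by_hand(d, c("x1", "x2", "x3"))
print(round(vifs, 2))
```

In this sketch the two correlated predictors get large values while the independent one stays close to 1, which is the pattern the 5 and 10 thresholds are meant to catch.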




Model 4
For model 4, we removed the variables from model 3 whose t-values are below 2 in absolute value (host_since, bathrooms, bedrooms, beds, bed_type, maximum_nights, number_of_reviews_ltm, number_of_reviews, review_scores_accuracy, review_scores_location and cancellation_policy) to refine our model. Note that the formula below also leaves out reviews_per_month and keeps rating_group.
model4 <- lm(log_price_4_nights_transformed ~
host_is_superhost +
host_listings_count +
room_type +
accommodates +
price +
security_deposit +
cleaning_fee +
guests_included +
extra_people +
minimum_nights +
review_scores_rating +
review_scores_cleanliness +
review_scores_checkin +
review_scores_communication +
review_scores_value +
prop_type_simplified +
neighbourhood_simplified+
rating_group,
data = hong_kong_listings_neighbourhood_simplified)
msummary(model4)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.48e+00 8.65e-02 86.57 < 2e-16
## host_is_superhostTRUE 1.08e-01 1.70e-02 6.37 2.0e-10
## host_listings_count -1.32e-03 3.21e-04 -4.12 3.8e-05
## room_typeHotel room -1.27e-02 3.78e-02 -0.33 0.73801
## room_typePrivate room -2.21e-01 1.63e-02 -13.55 < 2e-16
## room_typeShared room -4.01e-01 4.62e-02 -8.69 < 2e-16
## accommodates 1.05e-01 4.66e-03 22.62 < 2e-16
## price 2.19e-04 4.43e-06 49.45 < 2e-16
## security_deposit 1.38e-05 3.51e-06 3.93 8.5e-05
## cleaning_fee 3.88e-04 3.72e-05 10.43 < 2e-16
## guests_included -2.61e-02 6.51e-03 -4.00 6.4e-05
## extra_people 5.45e-04 4.10e-05 13.31 < 2e-16
## minimum_nights 3.63e-02 8.10e-03 4.48 7.7e-06
## review_scores_rating 1.51e-03 1.44e-03 1.05 0.29372
## review_scores_cleanliness 3.37e-02 8.75e-03 3.84 0.00012
## review_scores_checkin 8.18e-03 1.03e-02 0.79 0.42676
## review_scores_communication -5.56e-03 1.07e-02 -0.52 0.60256
## review_scores_value -5.57e-02 9.82e-03 -5.68 1.5e-08
## prop_type_simplifiedCondominium -7.05e-02 2.24e-02 -3.15 0.00163
## prop_type_simplifiedHostel -1.19e-01 3.35e-02 -3.54 0.00040
## prop_type_simplifiedOther 2.28e-02 1.72e-02 1.33 0.18433
## prop_type_simplifiedServiced apartment 7.79e-02 3.38e-02 2.31 0.02112
## neighbourhood_simplifiedzone_2 -1.06e-01 2.91e-02 -3.65 0.00027
## neighbourhood_simplifiedzone_3 -1.54e-01 1.61e-02 -9.58 < 2e-16
## neighbourhood_simplifiedzone_4 -3.49e-01 2.83e-02 -12.32 < 2e-16
## neighbourhood_simplifiedzone_5 1.95e-02 5.24e-02 0.37 0.71024
## rating_groupUnder 90 -3.48e-02 1.92e-02 -1.82 0.06946
##
## (Intercept) ***
## host_is_superhostTRUE ***
## host_listings_count ***
## room_typeHotel room
## room_typePrivate room ***
## room_typeShared room ***
## accommodates ***
## price ***
## security_deposit ***
## cleaning_fee ***
## guests_included ***
## extra_people ***
## minimum_nights ***
## review_scores_rating
## review_scores_cleanliness ***
## review_scores_checkin
## review_scores_communication
## review_scores_value ***
## prop_type_simplifiedCondominium **
## prop_type_simplifiedHostel ***
## prop_type_simplifiedOther
## prop_type_simplifiedServiced apartment *
## neighbourhood_simplifiedzone_2 ***
## neighbourhood_simplifiedzone_3 ***
## neighbourhood_simplifiedzone_4 ***
## neighbourhood_simplifiedzone_5
## rating_groupUnder 90 .
##
## Residual standard error: 0.405 on 4741 degrees of freedom
## (1669 observations deleted due to missingness)
## Multiple R-squared: 0.599, Adjusted R-squared: 0.596
## F-statistic: 272 on 26 and 4741 DF, p-value: <2e-16
car::vif(model4)
## GVIF Df GVIF^(1/(2*Df))
## host_is_superhost 1.12 1 1.06
## host_listings_count 1.35 1 1.16
## room_type 2.80 3 1.19
## accommodates 1.66 1 1.29
## price 1.05 1 1.02
## security_deposit 1.16 1 1.08
## cleaning_fee 1.42 1 1.19
## guests_included 1.52 1 1.23
## extra_people 1.34 1 1.16
## minimum_nights 1.39 1 1.18
## review_scores_rating 6.66 1 2.58
## review_scores_cleanliness 2.82 1 1.68
## review_scores_checkin 3.06 1 1.75
## review_scores_communication 3.16 1 1.78
## review_scores_value 3.22 1 1.79
## prop_type_simplified 1.57 4 1.06
## neighbourhood_simplified 1.87 4 1.08
## rating_group 2.27 1 1.51
plot(model4)
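The pruning between models is done by reading t-values off the summary. A related approach, sketched here on simulated data, is `drop1()` with an F-test, which reports one test per term so that a multi-level factor is judged as a whole. The data frame, the variable `noise_var` and the 0.05 cut-off are assumptions for illustration, not the procedure used in this report.

```r
# Simulated listings (hypothetical): two real effects and one pure-noise column
set.seed(3)
n <- 400
d <- data.frame(
  accommodates = sample(1:6, n, replace = TRUE),
  noise_var    = rnorm(n),
  zone         = factor(sample(paste0("zone_", 1:3), n, replace = TRUE))
)
zone_effect <- c(zone_1 = 0, zone_2 = -0.1, zone_3 = -0.3)
d$log_price <- 7 + 0.1 * d$accommodates +
  unname(zone_effect[as.character(d$zone)]) + rnorm(n, sd = 0.3)

full <- lm(log_price ~ accommodates + noise_var + zone, data = d)

# One F-test per term: what is lost if that term alone is removed?
term_tests <- drop1(full, test = "F")
print(term_tests)

# Collect terms whose removal is not significant at the 5% level
p_vals <- term_tests[["Pr(>F)"]]
names(p_vals) <- rownames(term_tests)
to_drop <- names(p_vals)[!is.na(p_vals) & p_vals > 0.05]

# Refit without those terms (if there are any)
reduced <- if (length(to_drop) > 0) {
  update(full, as.formula(paste(". ~ . -", paste(to_drop, collapse = " - "))))
} else {
  full
}
print(formula(reduced))
```

Dropping several terms at once, as done here for brevity, can behave differently from removing them one at a time, because the remaining coefficients change after each removal.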




Model 5
For model 5, we removed the variables from model 4 that are insignificant (t-values below 2 in absolute value): review_scores_checkin and review_scores_communication. Note that the formula below also drops room_type and adds number_of_reviews back in.
model5 <- lm(log_price_4_nights_transformed ~
host_is_superhost +
host_listings_count +
accommodates +
price +
security_deposit +
cleaning_fee +
guests_included +
extra_people +
minimum_nights +
number_of_reviews +
review_scores_rating +
review_scores_cleanliness +
review_scores_value +
prop_type_simplified +
neighbourhood_simplified+
rating_group,
data = hong_kong_listings_neighbourhood_simplified)
msummary(model5)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.35e+00 8.37e-02 87.82 < 2e-16
## host_is_superhostTRUE 9.40e-02 1.76e-02 5.35 9.4e-08
## host_listings_count -2.32e-03 3.21e-04 -7.21 6.4e-13
## accommodates 1.15e-01 4.40e-03 26.09 < 2e-16
## price 2.23e-04 4.52e-06 49.31 < 2e-16
## security_deposit 1.73e-05 3.59e-06 4.82 1.5e-06
## cleaning_fee 5.02e-04 3.71e-05 13.54 < 2e-16
## guests_included -8.97e-03 6.51e-03 -1.38 0.16812
## extra_people 4.16e-04 3.92e-05 10.62 < 2e-16
## minimum_nights 5.40e-02 8.17e-03 6.61 4.2e-11
## number_of_reviews -5.98e-04 1.24e-04 -4.83 1.4e-06
## review_scores_rating 2.04e-03 1.36e-03 1.50 0.13325
## review_scores_cleanliness 3.26e-02 8.95e-03 3.64 0.00028
## review_scores_value -5.78e-02 9.82e-03 -5.89 4.2e-09
## prop_type_simplifiedCondominium -6.31e-02 2.29e-02 -2.75 0.00589
## prop_type_simplifiedHostel -1.63e-01 3.30e-02 -4.94 8.2e-07
## prop_type_simplifiedOther -2.16e-03 1.69e-02 -0.13 0.89792
## prop_type_simplifiedServiced apartment 9.34e-02 3.43e-02 2.72 0.00646
## neighbourhood_simplifiedzone_2 -1.10e-01 2.97e-02 -3.69 0.00022
## neighbourhood_simplifiedzone_3 -1.85e-01 1.60e-02 -11.58 < 2e-16
## neighbourhood_simplifiedzone_4 -3.49e-01 2.90e-02 -12.07 < 2e-16
## neighbourhood_simplifiedzone_5 -4.75e-02 5.34e-02 -0.89 0.37352
## rating_groupUnder 90 -4.84e-02 1.95e-02 -2.48 0.01316
##
## (Intercept) ***
## host_is_superhostTRUE ***
## host_listings_count ***
## accommodates ***
## price ***
## security_deposit ***
## cleaning_fee ***
## guests_included
## extra_people ***
## minimum_nights ***
## number_of_reviews ***
## review_scores_rating
## review_scores_cleanliness ***
## review_scores_value ***
## prop_type_simplifiedCondominium **
## prop_type_simplifiedHostel ***
## prop_type_simplifiedOther
## prop_type_simplifiedServiced apartment **
## neighbourhood_simplifiedzone_2 ***
## neighbourhood_simplifiedzone_3 ***
## neighbourhood_simplifiedzone_4 ***
## neighbourhood_simplifiedzone_5
## rating_groupUnder 90 *
##
## Residual standard error: 0.415 on 4745 degrees of freedom
## (1669 observations deleted due to missingness)
## Multiple R-squared: 0.579, Adjusted R-squared: 0.577
## F-statistic: 296 on 22 and 4745 DF, p-value: <2e-16
car::vif(model5)
## GVIF Df GVIF^(1/(2*Df))
## host_is_superhost 1.15 1 1.07
## host_listings_count 1.29 1 1.14
## accommodates 1.41 1 1.19
## price 1.04 1 1.02
## security_deposit 1.16 1 1.08
## cleaning_fee 1.35 1 1.16
## guests_included 1.45 1 1.20
## extra_people 1.17 1 1.08
## minimum_nights 1.35 1 1.16
## number_of_reviews 1.11 1 1.05
## review_scores_rating 5.69 1 2.39
## review_scores_cleanliness 2.81 1 1.68
## review_scores_value 3.07 1 1.75
## prop_type_simplified 1.35 4 1.04
## neighbourhood_simplified 1.72 4 1.07
## rating_group 2.24 1 1.50
plot(model5)




Model 6
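Lastly, we removed guests_included, which has a t-value below 2 in absolute value in model 5; the formula below also drops rating_group. Model 6 is our final regression model: every remaining variable is significant, with at least one significant level for each categorical variable.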
model6 <- lm(log_price_4_nights_transformed ~
host_is_superhost +
host_listings_count +
accommodates +
price +
security_deposit +
cleaning_fee +
extra_people +
minimum_nights +
number_of_reviews +
review_scores_rating +
review_scores_cleanliness +
review_scores_value +
prop_type_simplified +
neighbourhood_simplified,
data = hong_kong_listings_neighbourhood_simplified)
msummary(model6)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.21e+00 6.26e-02 115.19 < 2e-16
## host_is_superhostTRUE 9.81e-02 1.75e-02 5.62 2.1e-08
## host_listings_count -2.37e-03 3.19e-04 -7.42 1.4e-13
## accommodates 1.12e-01 4.04e-03 27.80 < 2e-16
## price 2.23e-04 4.52e-06 49.33 < 2e-16
## security_deposit 1.71e-05 3.58e-06 4.77 1.9e-06
## cleaning_fee 4.96e-04 3.64e-05 13.60 < 2e-16
## extra_people 4.06e-04 3.86e-05 10.51 < 2e-16
## minimum_nights 5.45e-02 8.17e-03 6.66 3.0e-11
## number_of_reviews -6.10e-04 1.23e-04 -4.94 7.9e-07
## review_scores_rating 3.77e-03 1.17e-03 3.21 0.00134
## review_scores_cleanliness 3.21e-02 8.95e-03 3.58 0.00034
## review_scores_value -6.10e-02 9.75e-03 -6.26 4.3e-10
## prop_type_simplifiedCondominium -6.28e-02 2.29e-02 -2.74 0.00615
## prop_type_simplifiedHostel -1.60e-01 3.30e-02 -4.86 1.2e-06
## prop_type_simplifiedOther -1.49e-03 1.69e-02 -0.09 0.92964
## prop_type_simplifiedServiced apartment 9.24e-02 3.43e-02 2.69 0.00707
## neighbourhood_simplifiedzone_2 -1.13e-01 2.97e-02 -3.82 0.00014
## neighbourhood_simplifiedzone_3 -1.92e-01 1.58e-02 -12.15 < 2e-16
## neighbourhood_simplifiedzone_4 -3.52e-01 2.90e-02 -12.14 < 2e-16
## neighbourhood_simplifiedzone_5 -4.86e-02 5.34e-02 -0.91 0.36266
##
## (Intercept) ***
## host_is_superhostTRUE ***
## host_listings_count ***
## accommodates ***
## price ***
## security_deposit ***
## cleaning_fee ***
## extra_people ***
## minimum_nights ***
## number_of_reviews ***
## review_scores_rating **
## review_scores_cleanliness ***
## review_scores_value ***
## prop_type_simplifiedCondominium **
## prop_type_simplifiedHostel ***
## prop_type_simplifiedOther
## prop_type_simplifiedServiced apartment **
## neighbourhood_simplifiedzone_2 ***
## neighbourhood_simplifiedzone_3 ***
## neighbourhood_simplifiedzone_4 ***
## neighbourhood_simplifiedzone_5
##
## Residual standard error: 0.415 on 4747 degrees of freedom
## (1669 observations deleted due to missingness)
## Multiple R-squared: 0.578, Adjusted R-squared: 0.576
## F-statistic: 325 on 20 and 4747 DF, p-value: <2e-16
car::vif(model6)
## GVIF Df GVIF^(1/(2*Df))
## host_is_superhost 1.13 1 1.06
## host_listings_count 1.28 1 1.13
## accommodates 1.19 1 1.09
## price 1.04 1 1.02
## security_deposit 1.15 1 1.07
## cleaning_fee 1.30 1 1.14
## extra_people 1.14 1 1.07
## minimum_nights 1.35 1 1.16
## number_of_reviews 1.10 1 1.05
## review_scores_rating 4.23 1 2.06
## review_scores_cleanliness 2.81 1 1.68
## review_scores_value 3.02 1 1.74
## prop_type_simplified 1.34 4 1.04
## neighbourhood_simplified 1.67 4 1.07
plot(model6)




Models Overview
huxtable::huxreg(model1,
model2,
model3,
model4,
model5,
model6)
| | (1) | (2) | (3) | (4) | (5) | (6) |
|---|---|---|---|---|---|---|
| (Intercept) | 7.284 *** | 7.806 *** | 7.365 *** | 7.485 *** | 7.349 *** | 7.208 *** |
| | (0.079) | (0.075) | (0.331) | (0.086) | (0.084) | (0.063) |
| prop_type_simplifiedCondominium | -0.142 *** | -0.127 *** | -0.033 | -0.070 ** | -0.063 ** | -0.063 ** |
| | (0.034) | (0.031) | (0.027) | (0.022) | (0.023) | (0.023) |
| prop_type_simplifiedHostel | -0.449 *** | -0.218 *** | -0.110 * | -0.119 *** | -0.163 *** | -0.160 *** |
| | (0.048) | (0.047) | (0.048) | (0.033) | (0.033) | (0.033) |
| prop_type_simplifiedOther | -0.173 *** | 0.006 | -0.013 | 0.023 | -0.002 | -0.001 |
| | (0.024) | (0.023) | (0.020) | (0.017) | (0.017) | (0.017) |
| prop_type_simplifiedServiced apartment | -0.172 *** | -0.021 | 0.028 | 0.078 * | 0.093 ** | 0.092 ** |
| | (0.049) | (0.046) | (0.039) | (0.034) | (0.034) | (0.034) |
| number_of_reviews | -0.001 *** | -0.000 ** | 0.000 | | -0.001 *** | -0.001 *** |
| | (0.000) | (0.000) | (0.000) | | (0.000) | (0.000) |
| review_scores_rating | 0.007 *** | 0.003 *** | 0.006 ** | 0.002 | 0.002 | 0.004 ** |
| | (0.001) | (0.001) | (0.002) | (0.001) | (0.001) | (0.001) |
| room_typeHotel room | | -0.245 *** | -0.027 | -0.013 | | |
| | | (0.051) | (0.041) | (0.038) | | |
| room_typePrivate room | | -0.533 *** | -0.204 *** | -0.221 *** | | |
| | | (0.019) | (0.020) | (0.016) | | |
| room_typeShared room | | -0.250 *** | -0.303 *** | -0.401 *** | | |
| | | (0.056) | (0.060) | (0.046) | | |
| host_since | | | -0.000 | | | |
| | | | (0.000) | | | |
| host_is_superhostTRUE | | | 0.112 *** | 0.108 *** | 0.094 *** | 0.098 *** |
| | | | (0.019) | (0.017) | (0.018) | (0.017) |
| host_listings_count | | | -0.001 ** | -0.001 *** | -0.002 *** | -0.002 *** |
| | | | (0.000) | (0.000) | (0.000) | (0.000) |
| accommodates | | | 0.075 *** | 0.105 *** | 0.115 *** | 0.112 *** |
| | | | (0.008) | (0.005) | (0.004) | (0.004) |
| bathrooms | | | 0.005 | | | |
| | | | (0.023) | | | |
| bedrooms | | | 0.020 | | | |
| | | | (0.013) | | | |
| beds | | | 0.017 | | | |
| | | | (0.010) | | | |
| bed_typeFuton | | | 0.228 | | | |
| | | | (0.317) | | | |
| bed_typePull-out Sofa | | | 0.170 | | | |
| | | | (0.272) | | | |
| bed_typeReal Bed | | | 0.257 | | | |
| | | | (0.246) | | | |
| price | | | 0.000 *** | 0.000 *** | 0.000 *** | 0.000 *** |
| | | | (0.000) | (0.000) | (0.000) | (0.000) |
| security_deposit | | | 0.000 *** | 0.000 *** | 0.000 *** | 0.000 *** |
| | | | (0.000) | (0.000) | (0.000) | (0.000) |
| cleaning_fee | | | 0.000 *** | 0.000 *** | 0.001 *** | 0.000 *** |
| | | | (0.000) | (0.000) | (0.000) | (0.000) |
| guests_included | | | -0.021 ** | -0.026 *** | -0.009 | |
| | | | (0.008) | (0.007) | (0.007) | |
| extra_people | | | 0.000 *** | 0.001 *** | 0.000 *** | 0.000 *** |
| | | | (0.000) | (0.000) | (0.000) | (0.000) |
| minimum_nights | | | 0.035 *** | 0.036 *** | 0.054 *** | 0.054 *** |
| | | | (0.010) | (0.008) | (0.008) | (0.008) |
| maximum_nights | | | -0.000 | | | |
| | | | (0.000) | | | |
| reviews_per_month | | | -0.030 * | | | |
| | | | (0.014) | | | |
| number_of_reviews_ltm | | | -0.000 | | | |
| | | | (0.001) | | | |
| review_scores_accuracy | | | -0.011 | | | |
| | | | (0.013) | | | |
| review_scores_cleanliness | | | 0.022 * | 0.034 *** | 0.033 *** | 0.032 *** |
| | | | (0.011) | (0.009) | (0.009) | (0.009) |
| review_scores_checkin | | | 0.028 * | 0.008 | | |
| | | | (0.012) | (0.010) | | |
| review_scores_communication | | | -0.027 * | -0.006 | | |
| | | | (0.013) | (0.011) | | |
| review_scores_location | | | -0.014 | | | |
| | | | (0.013) | | | |
| review_scores_value | | | -0.053 *** | -0.056 *** | -0.058 *** | -0.061 *** |
| | | | (0.013) | (0.010) | (0.010) | (0.010) |
| cancellation_policymoderate | | | 0.004 | | | |
| | | | (0.023) | | | |
| cancellation_policystrict | | | -0.108 | | | |
| | | | (0.350) | | | |
| cancellation_policystrict_14_with_grace_period | | | 0.021 | | | |
| | | | (0.020) | | | |
| neighbourhood_simplifiedzone_2 | | | -0.107 ** | -0.106 *** | -0.110 *** | -0.113 *** |
| | | | (0.033) | (0.029) | (0.030) | (0.030) |
| neighbourhood_simplifiedzone_3 | | | -0.127 *** | -0.154 *** | -0.185 *** | -0.192 *** |
| | | | (0.019) | (0.016) | (0.016) | (0.016) |
| neighbourhood_simplifiedzone_4 | | | -0.301 *** | -0.349 *** | -0.349 *** | -0.352 *** |
| | | | (0.032) | (0.028) | (0.029) | (0.029) |
| neighbourhood_simplifiedzone_5 | | | -0.099 | 0.019 | -0.048 | -0.049 |
| | | | (0.055) | (0.052) | (0.053) | (0.053) |
| rating_groupUnder 90 | | | -0.011 | -0.035 | -0.048 * | |
| | | | (0.024) | (0.019) | (0.020) | |
| N | 4783 | 4783 | 2641 | 4768 | 4768 | 4768 |
| R2 | 0.050 | 0.192 | 0.714 | 0.599 | 0.579 | 0.578 |
| logLik | -4508.062 | -4121.225 | -919.227 | -2444.830 | -2559.903 | -2563.900 |
| AIC | 9032.124 | 8264.449 | 1926.455 | 4945.660 | 5167.805 | 5171.800 |
| *** p < 0.001; ** p < 0.01; * p < 0.05. | | | | | | |
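The N row shows that the models were fitted on different numbers of observations, because `lm()` drops rows with missing values in any variable it uses. Log-likelihood and AIC are only comparable between models fitted to the same rows. The sketch below, on simulated data with hypothetical variables `x1` and `x2`, shows one way to refit on a common sample before comparing.

```r
# Simulated data in which x2 is missing for a third of the rows
set.seed(11)
n <- 600
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 1 + 0.5 * d$x1 + 0.2 * d$x2 + rnorm(n)
d$x2[sample(n, 200)] <- NA

m_small <- lm(y ~ x1, data = d)       # uses all 600 rows
m_large <- lm(y ~ x1 + x2, data = d)  # silently drops the 200 incomplete rows
print(c(small = nobs(m_small), large = nobs(m_large)))

# Refit both models on the rows that are complete for every variable
common     <- d[complete.cases(d), ]
m_small_cc <- lm(y ~ x1, data = common)
m_large_cc <- lm(y ~ x1 + x2, data = common)
print(AIC(m_small_cc, m_large_cc))
```

Applied to this report, the same idea would mean refitting models 4 to 6 on the rows that model 3 used before reading anything into the AIC column.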
Interpretations based on these models
Bathrooms, bedrooms, beds and accommodates
Bathrooms, bedrooms and number of beds are insignificant explanatory factors for the price of an Airbnb for 4 nights, because their t-values in model 3 are less than 1.96. Therefore, we removed these three variables from the following models. However, the capacity of the listing (accommodates) does have explanatory power in predicting the total price for 4 nights.
Hence model 6, which is model 5 without guests_included and rating_group, is our preferred model so far. Its variables explain around 58% of the variation in log prices. Unsurprisingly, among the strongest price drivers are the number of guests a listing accommodates and whether the host is a superhost. This matches everyday experience: hotel prices are often quoted per person, so the price of a room rises with each extra guest staying in it.
Superhost
Based on our final regression model (model6), we can see that after controlling for other variables, superhosts do command a pricing premium: host_is_superhost is a significant variable in the model, with a coefficient of 0.098 when regressing against log(price_4_nights). Being a superhost therefore increases price_4_nights by e^0.098 - 1 ≈ 0.103, or about 10.3%, compared with a host who is not a superhost. This makes economic sense, because superhost status works much like a brand name, and strong brands typically have higher pricing power.
Cancellation Policy
In our model 3, we see that cancellation policy is not a significant explanatory variable, because all of its levels have absolute t-values less than 1.96. To test its significance again, we tried including cancellation policy in our final model. However, the added variable is not significant and adds no explanatory power to our model, so we conclude that it is best to leave it out. A less “complex” model is preferable when it has the same explanatory power as the more complex one.
Number of host listings
Since our Hong Kong dataset does not include information regarding whether the hosts advertise the exact locations of their listings, we chose to explore the relationship between the number of host listings and price_4_nights. In model 6, the coefficient for host_listings_count is -0.00237 when regressing against log(price_4_nights). Therefore, each additional host listing is associated with a decrease in price_4_nights of about 0.24%. This might be because a host with more listings pays less attention to the pricing of each individual listing, which leads to a slight price decrease.
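Because the response is on the log scale, a coefficient b translates into a percentage change of 100 * (e^b - 1) in price_4_nights. The short sketch below applies this to three coefficients copied from the model 6 summary; `pct_effect` is a hypothetical helper name introduced here.

```r
# Convert a coefficient from a log-response model into a percentage change
pct_effect <- function(beta) 100 * (exp(beta) - 1)

# Estimates copied from the msummary(model6) output above
coefs <- c(host_is_superhostTRUE = 9.81e-02,
           host_listings_count   = -2.37e-03,
           accommodates          = 1.12e-01)

print(round(pct_effect(coefs), 2))
```

This gives roughly +10.3% for superhost status, -0.24% per additional host listing and +11.9% per additional guest accommodated, each holding the other variables fixed.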
#Prediction for price_4_nights in Hong Kong
#Filtering for properties that satisfy the conditions, have a private room, at least 10 reviews and an average rating over 90
hong_kong_listings_predict <- hong_kong_listings_neighbourhood_simplified %>%
#Since all room_types besides shared room have private rooms, we only have to filter out room types that are shared rooms
filter(room_type != "Shared room", number_of_reviews >= 10, rating_group == "Over 90")
#log prediction + transformation
prediction <- exp(predict(model6, newdata= hong_kong_listings_predict, interval = "confidence"))
prediction %>%
summary()
## fit lwr upr
## Min. : 1196 Min. : 1100 Min. : 1300
## 1st Qu.: 1880 1st Qu.: 1799 1st Qu.: 1961
## Median : 2427 Median : 2315 Median : 2542
## Mean : 2937 Mean : 2785 Mean : 3101
## 3rd Qu.: 3356 3rd Qu.: 3222 3rd Qu.: 3499
## Max. :43440 Max. :37933 Max. :51163
plot(model6$residuals)
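The intervals above use interval = "confidence", which describes the uncertainty of the average log price for listings with given characteristics. For the price of one individual listing, interval = "prediction" is the wider and usually more appropriate choice. The sketch below contrasts the two on simulated data; the variable names and parameter values are assumptions for illustration.

```r
# Simulated log prices driven by the number of guests (hypothetical values)
set.seed(5)
n <- 200
accommodates <- sample(1:6, n, replace = TRUE)
log_price <- 7 + 0.1 * accommodates + rnorm(n, sd = 0.4)
fit <- lm(log_price ~ accommodates)

new_listing <- data.frame(accommodates = 2)

# Both intervals are computed on the log scale and then back-transformed
conf_int <- exp(predict(fit, newdata = new_listing, interval = "confidence"))
pred_int <- exp(predict(fit, newdata = new_listing, interval = "prediction"))

print(rbind(confidence = conf_int[1, ], prediction = pred_int[1, ]))
```

Both calls return the same fitted value; only the lower and upper bounds differ, with the prediction interval additionally reflecting the residual variation of individual listings.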

#non log
# here we look at the model without the log -> small differences
model_predict <- lm(price_4_nights ~
host_is_superhost +
host_listings_count +
accommodates +
price +
security_deposit +
cleaning_fee +
extra_people +
minimum_nights +
number_of_reviews +
review_scores_rating +
review_scores_cleanliness +
review_scores_value +
prop_type_simplified +
neighbourhood_simplified,
data = hong_kong_listings_predict)
confint(model_predict, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) -4.66e+02 420.4591
## host_is_superhostTRUE -4.99e+01 5.0839
## host_listings_count -1.43e+00 0.6577
## accommodates -2.72e+01 -11.6086
## price 3.96e+00 3.9912
## security_deposit 2.91e-03 0.0162
## cleaning_fee 9.60e-01 1.0924
## extra_people 7.36e-01 0.9229
## minimum_nights -1.93e+01 12.9198
## number_of_reviews -3.84e-01 -0.0280
## review_scores_rating -4.35e+00 7.5335
## review_scores_cleanliness -3.45e+01 19.9229
## review_scores_value -2.27e+01 31.7656
## prop_type_simplifiedCondominium -3.96e+01 46.5905
## prop_type_simplifiedHostel -2.92e+01 105.5375
## prop_type_simplifiedOther 5.67e+00 69.1511
## prop_type_simplifiedServiced apartment -1.78e+02 -38.2587
## neighbourhood_simplifiedzone_2 -8.30e+01 12.1307
## neighbourhood_simplifiedzone_3 -5.11e+01 9.5439
## neighbourhood_simplifiedzone_4 -3.00e+01 86.0904
## neighbourhood_simplifiedzone_5 2.14e+01 199.6289
predict(model_predict, newdata = hong_kong_listings_predict, interval = "confidence") %>%
summary()
## fit lwr upr
## Min. : 331 Min. : 298 Min. : 364
## 1st Qu.: 1744 1st Qu.: 1698 1st Qu.: 1791
## Median : 2557 Median : 2517 Median : 2600
## Mean : 3223 Mean : 3175 Mean : 3271
## 3rd Qu.: 3905 3rd Qu.: 3858 3rd Qu.: 3958
## Max. :43497 Max. :43325 Max. :43670
plot(model_predict$residuals)



Conclusion
Model effectiveness and limitations
Our final regression model (model6) includes the following 14 explanatory variables:
- host_is_superhost
- host_listings_count
- accommodates
- price
- security_deposit
- cleaning_fee
- extra_people
- minimum_nights
- number_of_reviews
- review_scores_rating
- review_scores_cleanliness
- review_scores_value
- prop_type_simplified
- neighbourhood_simplified
This model has an adjusted R-squared of 0.576, meaning that we were able to explain about 58% of the variability of log(price_4_nights) using the above variables. However, it is worth noting that this is lower than the adjusted R-squared of 0.709 in model 3, from which we removed the insignificant variables. This is probably because adding more variables to a model increases its ability to account for variation. In addition, model 3 was fitted on only 2,641 complete observations, against 4,768 for model 6, so the two figures are not computed on the same sample. An efficient regression model should contain only variables that are significant and should not be overly complex. Therefore, we believe that our final model is a strong one based on the current dataset.
However, there are many more factors that could affect price_4_nights that are not reflected in this dataset and analysis. For example, macroeconomic factors can affect the pricing of Airbnb listings, especially under unusual circumstances that limit travel, as is the case now. In addition, total prices could vary greatly between seasons due to holidays and vacations. We should also take into account the effect of pricing by competitors like Booking.com and Expedia. None of these variables are incorporated in the model, and all are worth exploring in future analysis.
Takeaways
This exercise has allowed us to apply all our knowledge in R and beyond. We were able to apply the statistical knowledge that we gathered through this course to a real-life problem. We learned how to use real data to read our surroundings and take action accordingly. If we are traveling on a budget (as most students do), we know which variables, or in this case qualities, we need to remove from our filter to find the cheapest accommodation for our budget.
We want to thank Prof. Kostis and his army of TAs that were always supportive in this new environment (We are not only talking about COVID here ;))