How to scrape Booking.com?


In the era of big data, web scraping has become a key method of data collection. Many industries, including the hotel sector, can benefit from it, because it extracts key data from websites automatically.

This tutorial delves into using Python to scrape data from Booking.com. Such a task is crucial for anyone needing to track competition or monitor the influx of new properties in specific areas.

Requests vs. Selenium

This tutorial uses Selenium rather than a traditional HTTP library like requests because Selenium can handle dynamic content. Booking.com is JavaScript-heavy, meaning much of the page's content is rendered dynamically in the browser. Libraries like requests can fetch the page's initial HTML, but they don't execute JavaScript, so they never see content that scripts render afterwards.

Note that you’ll be fetching data from all the search results pages, meaning you also need to click the Next button. This is where Selenium comes in. It allows interactions with the web page, making it an ideal tool for this tutorial.
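
To see why a plain HTTP fetch falls short, here is a minimal sketch that uses only Python's standard library. The page source below is a made-up stand-in, not Booking.com's actual markup. The listing exists only inside a script, so a parser that doesn't run JavaScript finds no hotel cards:

```python
from html.parser import HTMLParser

# Hypothetical page source: the results container stays empty until the script runs.
RAW_HTML = """
<html><body>
  <div id="results"></div>
  <script>
    document.getElementById("results").innerHTML =
      '<div data-testid="property-card">Example Hotel</div>';
  </script>
</body></html>
"""


class CardCounter(HTMLParser):
    """Counts start tags whose data-testid attribute is "property-card"."""

    def __init__(self):
        super().__init__()
        self.cards = 0

    def handle_starttag(self, tag, attrs):
        if dict(attrs).get("data-testid") == "property-card":
            self.cards += 1


def count_cards(html):
    parser = CardCounter()
    parser.feed(html)
    parser.close()
    return parser.cards


# The card markup is only text inside <script>, so it is never parsed as an element.
static_count = count_cards(RAW_HTML)
print(static_count)
```

A real browser, which is what Selenium drives, would run the script and produce the card element.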

Depending on many factors, you may need to set up user agents, use more complex headers, and route your requests through a proxy in order to not get blocked. In many cases, using and rotating different user agents may do the trick, but in other situations, proxy servers are a lifesaver. Should you find yourself on a quest for proxies, look no further, as Oxylabs has one of the best-quality proxy servers you can find in the market.
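
As a rough sketch of user agent rotation, the snippet below cycles through a pool of user agents and builds the Chrome command-line switches for each browser session. The user agent strings and the proxy address are placeholders, and chrome_arguments is a hypothetical helper name:

```python
from itertools import cycle

# Placeholder values: swap in real user agent strings and your own proxy.
USER_AGENTS = [
    "ExampleBrowser/1.0 (placeholder user agent A)",
    "ExampleBrowser/2.0 (placeholder user agent B)",
]
PROXY = "http://proxy.example.com:8080"  # hypothetical proxy address


def chrome_arguments(user_agent, proxy=None):
    """Build the Chrome switches for one browser session."""
    args = [f"--user-agent={user_agent}"]
    if proxy:
        args.append(f"--proxy-server={proxy}")
    return args


# cycle() starts over from the first user agent once the pool is exhausted.
ua_pool = cycle(USER_AGENTS)
first_session = chrome_arguments(next(ua_pool), PROXY)
second_session = chrome_arguments(next(ua_pool))
print(first_session)
print(second_session)
```

Each string in the returned list can be passed to add_argument() on a webdriver.ChromeOptions object before the driver is created.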

Set up the environment

Open your terminal, create a new directory, and then create a virtual environment:

Shell

$ mkdir scraper
$ cd scraper
$ python3 -m venv venv

Now activate this virtual environment and install the dependencies:

Shell

$ source venv/bin/activate
$ pip install selenium webdriver-manager

The Selenium library provides the API for controlling the browser driver. The webdriver-manager library lets you install and update the Chrome driver from code, which would otherwise be a manual process.

Data points to be extracted

While following this tutorial, you’ll collect the following data points:

Name of the hotel
Rating
Review count
Address of the hotel
Price

You can easily customize this scraper to collect any other data points you may need.

Quick Overview of CSS selectors

Before you delve into the actual scraping, let's discuss CSS selectors, a crucial concept in web scraping. CSS selectors are patterns used to select elements you want to style on a web page. They can select elements based on their ID, class, attribute, or relative position in the HTML document. Compared to XPath, CSS selectors are generally preferred due to their readability, simplicity, and speed.

You can build CSS selectors by following the document structure and identifying unique attributes or combinations that point to the element you want to access. For example, div[data-type="hotel-card"] will select all the div elements with an attribute data-type with the value hotel-card.
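
To make the matching rule concrete, here is a small sketch that mimics what a selector of the form tag[attr="value"] does, using only Python's standard library. The AttributeSelector class and the sample markup are invented for illustration. In the actual scraper, Selenium evaluates the selector for you:

```python
from html.parser import HTMLParser


class AttributeSelector(HTMLParser):
    """Collects start tags that match a selector of the form tag[attr="value"]."""

    def __init__(self, tag, attr, value):
        super().__init__()
        self.tag, self.attr, self.value = tag, attr, value
        self.matches = []

    def handle_starttag(self, tag, attrs):
        # Both the tag name and the attribute value have to match.
        if tag == self.tag and dict(attrs).get(self.attr) == self.value:
            self.matches.append(dict(attrs))


# Made-up markup: only the div elements with data-type="hotel-card" should match.
SAMPLE = """
<div data-type="hotel-card" id="a"></div>
<div data-type="banner" id="b"></div>
<span data-type="hotel-card" id="c"></span>
<div data-type="hotel-card" id="d"></div>
"""

selector = AttributeSelector("div", "data-type", "hotel-card")
selector.feed(SAMPLE)
matched_ids = [match["id"] for match in selector.matches]
print(matched_ids)
```

The span is skipped because the tag name differs, and the banner div is skipped because the attribute value differs.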

To create the selectors, open the following page in Chrome:

https://www.booking.com/searchresults.html?ss=New+York&checkin=2023-12-31&checkout=2024-01-01

This URL opens the search results for hotels in New York for New Year's Eve, with check-in on December 31, 2023 and check-out on January 1, 2024.
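
If you want to search other destinations or dates, the same URL can be assembled from its three query parameters. The sketch below uses the standard library for this. build_search_url is a hypothetical helper name, and only the ss, checkin, and checkout parameters seen in the URL above are assumed:

```python
from urllib.parse import urlencode

BASE_URL = "https://www.booking.com/searchresults.html"


def build_search_url(destination, checkin, checkout):
    """Return a search URL for a destination and ISO-formatted dates."""
    # urlencode() escapes the values, turning the space in "New York" into "+".
    query = urlencode({"ss": destination, "checkin": checkin, "checkout": checkout})
    return f"{BASE_URL}?{query}"


url = build_search_url("New York", "2023-12-31", "2024-01-01")
print(url)
```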

Once the page loads, right-click the hotel name and select Inspect. It opens the developer tools, where you can build and test your CSS selector.

With the Developer Tools open, press Ctrl + F (Windows) or Cmd + F (macOS) to search the page with the CSS selector you've constructed. After spending some time on this page, you'll find that the following CSS selectors work best for the required data points:

Element              CSS Selector
name                 div[data-testid="title"]
review information   [data-testid="review-score"]
price                [data-testid="price-and-discounted-price"]
address              [data-testid="address"]

The review information contains both the rating and the review count.

Handling pagination

For pagination, we can look for the next page button. The following CSS selector matches the next page:

button[aria-label="Next page"]

With this in mind, let's move on to writing the code.

Approach to web scraping

To effectively scrape Booking.com, you'll have to perform the following steps:

1. Initialize the Selenium web driver
2. Open the Booking.com search results page
3. Determine the total number of result pages
4. Extract the desired data from each hotel listing
5. Navigate through all the result pages and repeat step 4

To get a clearer picture, let's start with the skeleton of our code:

Python

def init_driver():
    pass


def get_total_pages(driver):
    pass


def extract_hotel_info(driver):
    pass


def extract_hotel_data(hotel):
    pass


def navigate_pages(driver, total_pages):
    pass


def main():
    pass


if __name__ == "__main__":
    main()

These are the core functions that will make up your script. You'll flesh out each of these functions by following the next tutorial steps.

Building the functions

Initializing the driver

The function below sets up the Selenium web driver. ChromeDriverManager automatically downloads the driver binary that Selenium needs to interact with Chrome.

Python

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager


def init_driver():
    # Download a matching ChromeDriver binary (if needed) and start Chrome with it.
    driver_service = Service(ChromeDriverManager().install())
    return webdriver.Chrome(service=driver_service)

Getting the total number of pages

This function fetches the total number of pages in the search results.

This step is important as it determines the number of times the loop should run:

Python

from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By


def get_total_pages(driver):
    # The last item of the pagination list is expected to show the last page number.
    try:
        total_pages = int(
            driver.find_element(
                By.CSS_SELECTOR, 'div[data-testid="pagination"] li:last-child'
            ).text
        )
    except (NoSuchElementException, ValueError) as e:
        # ValueError covers a pagination label that is not a plain number.
        print("Error finding total pages: ", e)
        total_pages = 0
    return total_pages

Extracting the hotel container

First, you need to extract the container that holds one hotel listing. Once you have this, you can run a loop over each item and extract individual hotel information.

This function extracts the data of all hotels on the current page. It first targets all hotel cards using a CSS selector and then extracts the relevant data from each hotel.

Instead of using multiple try-except blocks, this code uses suppress from the contextlib module, which tells Python to silently ignore any NoSuchElementException raised inside the with block.

Python

from contextlib import suppress


def extract_hotel_info(driver):
    data = []
    all_hotels = driver.find_elements(
        By.CSS_SELECTOR, 'div[data-testid="property-card"]'
    )
    for hotel in all_hotels:
        # A card that lacks one of the expected elements is skipped entirely.
        with suppress(NoSuchElementException):
            result = extract_hotel_data(hotel)
            data.append(result)
    return data
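
To see what suppress does to the loop above, here is a small standalone sketch. MissingField is a stand-in for Selenium's NoSuchElementException, and the cards are plain dictionaries invented for illustration. When the exception is raised, the rest of the with block is skipped and the loop simply moves on to the next card:

```python
from contextlib import suppress


class MissingField(Exception):
    """Stand-in for Selenium's NoSuchElementException in this sketch."""


def read_name(card):
    if "name" not in card:
        raise MissingField("name")
    return card["name"]


# The second card has no name, so reading it raises MissingField.
cards = [{"name": "Hotel A"}, {}, {"name": "Hotel C"}]

names = []
for card in cards:
    with suppress(MissingField):
        # If read_name() raises, append() never runs for this card.
        names.append(read_name(card))

print(names)
```

The trade-off is that incomplete listings are dropped silently. If you would rather keep partial records, handle each field separately instead.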

Extracting individual hotel information

Once you have an HTML block containing the information for exactly one hotel, you can use the following function to extract the data you’re looking for.

Note that this is the most critical function, which does the job of scraping data. If you need more data points, this is where you make changes:

Python

def extract_hotel_data(hotel):
    result = {}
    result["name"] = hotel.find_element(
        By.CSS_SELECTOR, 'div[data-testid="title"]'
    ).text
    # The review block is expected to span three lines:
    # the score, a verbal rating (ignored), and the review count.
    review_score, _, review_count = hotel.find_element(
        By.CSS_SELECTOR, '[data-testid="review-score"]'
    ).text.split("\n")
    result["review_score"] = review_score
    result["review_count"] = review_count
    result["price"] = hotel.find_element(
        By.CSS_SELECTOR, '[data-testid="price-and-discounted-price"]'
    ).text
    result["address"] = hotel.find_element(
        By.CSS_SELECTOR, '[data-testid="address"]'
    ).text
    return result
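
The three-way split above fails with a ValueError if the review block doesn't contain exactly three lines. As a more tolerant alternative, the sketch below parses the text line by line. It assumes the block looks like "8.5", a verbal rating, and "1,234 reviews" on separate lines. That layout is an assumption for illustration, and parse_review_text is a hypothetical helper:

```python
import re


def parse_review_text(text):
    """Return (score, count) from a review block; missing parts become None."""
    score, count = None, None
    for line in text.splitlines():
        line = line.strip()
        if re.fullmatch(r"\d{1,2}(\.\d)?", line):
            # A line holding only a number such as "8.5" is taken as the score.
            score = float(line)
            continue
        count_match = re.fullmatch(r"([\d,]+)\s+reviews?", line)
        if count_match:
            # Drop thousands separators before converting, e.g. "1,234" -> 1234.
            count = int(count_match.group(1).replace(",", ""))
    return score, count


parsed = parse_review_text("8.5\nVery Good\n1,234 reviews")
print(parsed)
```

Lines that match neither pattern, such as the verbal rating, are ignored, so an unexpected extra line doesn't break the parsing.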

Navigating to other pages

Lastly, you also need to access other search results pages. The following code snippet achieves this:

Python

import time


def navigate_pages(driver, total_pages):
    # Dismiss the cookie banner if it is present.
    try:
        decline_cookies = driver.find_element(
            By.CSS_SELECTOR, '[id="onetrust-reject-all-handler"]'
        )
        decline_cookies.click()
    except NoSuchElementException:
        print("No cookies to decline.")

    data = []
    for _ in range(total_pages)[:2]:  # limit to first 2 pages for this example
        data.extend(extract_hotel_info(driver))
        try:
            next_page_btn = driver.find_element(
                By.CSS_SELECTOR, 'button[aria-label="Next page"]'
            )
            next_page_btn.click()
            time.sleep(5)  # wait for the next page to load
        except NoSuchElementException as e:
            print("Error finding next page button: ", e)
            break
    return data

Note that the loop is limited to two pages just for this example. You can delete the [:2] to collect the data from all the result pages (13 in this example).
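
A fixed time.sleep(5) either wastes time or is too short on a slow connection. A common alternative is to poll for a condition until it holds or a timeout expires, which is the idea behind Selenium's WebDriverWait. The sketch below shows that pattern in plain Python. wait_until is a hypothetical helper, and the condition here is a dummy counter rather than a real page check:

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Dummy condition that becomes true on the third call.
attempts = []


def ready_on_third_call():
    attempts.append(1)
    return len(attempts) >= 3


result = wait_until(ready_on_third_call, timeout=2.0, poll_interval=0.01)
print(result, len(attempts))
```

In the scraper, the condition could be a function that checks whether the hotel cards of the next page are present, so the loop continues as soon as the page is ready.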

Exporting data

If you want to export the data to a CSV, you can add a function as follows:

Python

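import csv


def export_data(data):
    # Without rows there are no column names to write.
    if not data:
        print("No data to export.")
        return
    csv_file = "property_data.csv"
    with open(csv_file, "w", newline="", encoding="utf-8") as file:
        writer = csv.DictWriter(file, fieldnames=data[0].keys())
        writer.writeheader()
        writer.writerows(data)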
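
To check the CSV logic without running the browser, you can write a couple of made-up rows to an in-memory buffer and read them back. This is only a sketch: rows_to_csv is a hypothetical helper, and the sample hotels and prices are invented:

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize a list of dicts to CSV text; an empty list yields an empty string."""
    if not rows:
        return ""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Invented sample rows; note the commas inside the values.
sample = [
    {"name": "Hotel A", "price": "US$1,200"},
    {"name": "Hotel B, Midtown", "price": "US$950"},
]

csv_text = rows_to_csv(sample)
restored = list(csv.DictReader(io.StringIO(csv_text)))
print(csv_text)
```

Values that contain commas are quoted automatically by the csv module, so they survive the round trip unchanged.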

Executing all the functions

Finally, the main() function sets up the driver, fetches the total number of pages, scrapes all hotels, and exports the scraped data into a CSV file:

Python

def main():
    url = "https://www.booking.com/searchresults.html?ss=New+York&checkin=2023-12-31&checkout=2024-01-01"
    driver = init_driver()
    driver.get(url)
    total_pages = get_total_pages(driver)
    print(f"Total pages: {total_pages}")
    data = navigate_pages(driver, total_pages)
    driver.quit()
    print(data)
    export_data(data)  # write the scraped rows to property_data.csv


if __name__ == "__main__":
    main()

The complete Booking.com scraper script looks like this:

Python

import csv
import time
from contextlib import suppress

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager


def init_driver():
    # Download a matching ChromeDriver binary (if needed) and start Chrome with it.
    driver_service = Service(ChromeDriverManager().install())
    return webdriver.Chrome(service=driver_service)


def get_total_pages(driver):
    # The last item of the pagination list is expected to show the last page number.
    try:
        total_pages = int(
            driver.find_element(
                By.CSS_SELECTOR, 'div[data-testid="pagination"] li:last-child'
            ).text
        )
    except (NoSuchElementException, ValueError) as e:
        # ValueError covers a pagination label that is not a plain number.
        print("Error finding total pages: ", e)
        total_pages = 0
    return total_pages


def extract_hotel_info(driver):
    data = []
    all_hotels = driver.find_elements(
        By.CSS_SELECTOR, 'div[data-testid="property-card"]'
    )
    for hotel in all_hotels:
        # A card that lacks one of the expected elements is skipped entirely.
        with suppress(NoSuchElementException):
            result = extract_hotel_data(hotel)
            data.append(result)
    return data


def extract_hotel_data(hotel):
    result = {}
    result["name"] = hotel.find_element(
        By.CSS_SELECTOR, 'div[data-testid="title"]'
    ).text
    # The review block is expected to span three lines:
    # the score, a verbal rating (ignored), and the review count.
    review_score, _, review_count = hotel.find_element(
        By.CSS_SELECTOR, '[data-testid="review-score"]'
    ).text.split("\n")
    result["review_score"] = review_score
    result["review_count"] = review_count
    result["price"] = hotel.find_element(
        By.CSS_SELECTOR, '[data-testid="price-and-discounted-price"]'
    ).text
    result["address"] = hotel.find_element(
        By.CSS_SELECTOR, '[data-testid="address"]'
    ).text
    return result


def navigate_pages(driver, total_pages):
    # Dismiss the cookie banner if it is present.
    try:
        decline_cookies = driver.find_element(
            By.CSS_SELECTOR, '[id="onetrust-reject-all-handler"]'
        )
        decline_cookies.click()
    except NoSuchElementException:
        print("No cookies to decline.")

    data = []
    for _ in range(total_pages)[:2]:  # limit to first 2 pages for this example
        data.extend(extract_hotel_info(driver))
        try:
            next_page_btn = driver.find_element(
                By.CSS_SELECTOR, 'button[aria-label="Next page"]'
            )
            next_page_btn.click()
            time.sleep(5)  # wait for the next page to load
        except NoSuchElementException as e:
            print("Error finding next page button: ", e)
            break
    return data


def export_data(data):
    # Without rows there are no column names to write.
    if not data:
        print("No data to export.")
        return
    csv_file = "property_data.csv"
    with open(csv_file, "w", newline="", encoding="utf-8") as file:
        writer = csv.DictWriter(file, fieldnames=data[0].keys())
        writer.writeheader()
        writer.writerows(data)


def main():
    url = "https://www.booking.com/searchresults.html?ss=New+York&checkin=2023-12-31&checkout=2024-01-01"
    driver = init_driver()
    driver.get(url)
    total_pages = get_total_pages(driver)
    print(f"Total pages: {total_pages}")
    data = navigate_pages(driver, total_pages)
    driver.quit()
    print(data)
    export_data(data)  # write the scraped rows to property_data.csv


if __name__ == "__main__":
    main()

Conclusion

And there you have it, a step-by-step guide to scraping Booking.com using Python and Selenium. Following this approach, you can extend the script to scrape other dynamic websites. Remember that it’s essential to respect the terms and conditions of the website you’re scraping. Happy coding, and enjoy the journey!
