Updated · Jan 10, 2024
Artem is a management consultant with a strong background in marketing and branding. As a valuable m... | See full bio
Updated · Nov 17, 2023
Girlie is an accomplished writer with an interest in technology and literature. With years of experi... | See full bio
In the era of big data, scraping web content has become a significant process in data collection. Many industries, including the hotel sector, can significantly benefit from web scraping, a method that fetches key data from websites.
This tutorial delves into using Python to scrape data from Booking.com. Such a task is crucial for anyone needing to track competition or monitor the influx of new properties in specific areas.
This tutorial uses Selenium rather than an HTTP request-based library such as requests because Selenium can handle dynamic content. Booking.com is JavaScript-heavy, so much of the page's content is rendered in the browser. Libraries like requests can fetch the page's initial HTML, but they don't execute JavaScript, so they never see content that is rendered dynamically.
Note that you’ll be fetching data from all the search results pages, meaning you also need to click the Next button. This is where Selenium comes in. It allows interactions with the web page, making it an ideal tool for this tutorial.
Depending on many factors, you may need to set up user agents, use more complex headers, and route your requests through a proxy in order to not get blocked. In many cases, using and rotating different user agents may do the trick, but in other situations, proxy servers are a lifesaver. Should you find yourself on a quest for proxies, look no further, as Oxylabs has one of the best-quality proxy servers you can find in the market.
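As a rough illustration of rotating user agents and routing traffic through a proxy, the sketch below builds Chrome command-line arguments for successive browser sessions. The user-agent strings and the proxy address are placeholders, and make_chrome_args is a hypothetical helper, not part of Selenium; only the --user-agent and --proxy-server Chrome switches are real.

```python
import itertools

# Placeholder user-agent strings (assumptions); replace with real browser values.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 13_0) ExampleBrowser/2.0",
]


def make_chrome_args(user_agents, proxy=None):
    """Yield one list of Chrome command-line arguments per browser session.

    Hypothetical helper: cycles through the user agents forever and adds
    the proxy switch only when a proxy address is given.
    """
    for user_agent in itertools.cycle(user_agents):
        args = [f"--user-agent={user_agent}"]
        if proxy:
            args.append(f"--proxy-server={proxy}")
        yield args


# The proxy address is a placeholder, not a working server.
arg_sets = make_chrome_args(USER_AGENTS, proxy="http://proxy.example.com:8080")
first_session = next(arg_sets)
second_session = next(arg_sets)
third_session = next(arg_sets)  # wraps around to the first user agent
```

Each string in a session's list could then be passed to add_argument() on a selenium.webdriver.ChromeOptions object, which is handed to webdriver.Chrome through its options parameter.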
Open your terminal, create a new directory, and then create a virtual environment:
Shell
$ mkdir scraper
$ cd scraper
$ python3 -m venv venv
Now activate this virtual environment and install the dependencies:
Shell
$ source venv/bin/activate
$ pip install selenium webdriver-manager
The Selenium library provides the API for controlling the browser driver. The webdriver-manager library downloads and updates the Chrome driver from your code, which would otherwise be a manual process.
While following this tutorial, you’ll collect the following data points for each hotel: name, review score, review count, price, and address.
You can easily customize this scraper to collect any other data points you may need.
Before you delve into the actual scraping, let's discuss CSS selectors, a crucial concept in web scraping. CSS selectors are patterns used to select elements you want to style on a web page. They can select elements based on their ID, class, attribute, or relative position in the HTML document. Compared to XPath, CSS selectors are generally preferred due to their readability, simplicity, and speed.
You can build CSS selectors by following the document structure and identifying unique attributes or combinations that point to the element you want to access. For example, div[data-type="hotel-card"] will select all the div elements with an attribute data-type with the value hotel-card.
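To make the matching rule concrete, here is a small sketch that mimics what div[data-type="hotel-card"] selects, using only the standard library. It is not a CSS engine: AttributeMatcher is a hypothetical class that checks just one tag name and one attribute value, and the sample HTML is invented for illustration.

```python
from html.parser import HTMLParser


class AttributeMatcher(HTMLParser):
    """Collect start tags that match a tag name and one attribute value."""

    def __init__(self, tag, attr, value):
        super().__init__()
        self.tag, self.attr, self.value = tag, attr, value
        self.matches = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag.
        if tag == self.tag and dict(attrs).get(self.attr) == self.value:
            self.matches.append(dict(attrs))


# Invented sample markup, not taken from Booking.com.
SAMPLE_HTML = """
<div data-type="hotel-card" id="a">Hotel A</div>
<div data-type="banner" id="b">Advert</div>
<span data-type="hotel-card" id="c">Not a div</span>
<div data-type="hotel-card" id="d">Hotel D</div>
"""

matcher = AttributeMatcher("div", "data-type", "hotel-card")
matcher.feed(SAMPLE_HTML)
matched_ids = [attrs["id"] for attrs in matcher.matches]
print(matched_ids)  # ['a', 'd']
```

The span is skipped because the tag name differs, and the banner div is skipped because the attribute value differs, which is the same behavior the CSS selector would show in the browser.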
To create the selectors, open the following page in Chrome:
https://www.booking.com/searchresults.html?ss=New+York&checkin=2023-12-31&checkout=2024-01-01
This URL opens the search results for hotels in New York, with check-in on New Year's Eve (December 31, 2023) and check-out on January 1, 2024.
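If you plan to search other destinations or dates, the query string can be assembled programmatically. The sketch below is a minimal example using the standard library; build_search_url is a hypothetical helper, and it assumes Booking.com keeps accepting the ss, checkin, and checkout parameters seen in the URL above.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.booking.com/searchresults.html"


def build_search_url(destination, checkin, checkout):
    """Build a search URL; dates are ISO strings such as '2023-12-31'."""
    params = {"ss": destination, "checkin": checkin, "checkout": checkout}
    # urlencode escapes the values and encodes spaces as '+'.
    return f"{BASE_URL}?{urlencode(params)}"


search_url = build_search_url("New York", "2023-12-31", "2024-01-01")
print(search_url)
```

Using urlencode avoids hand-editing the URL, which is where stray spaces and typos in the dates tend to creep in.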
Once the page loads, right-click the hotel name and select Inspect. It opens the developer tools, where you can build and test your CSS selector.
You can press Ctrl + F (Windows) or Cmd + F (macOS) on your keyboard while the Developer Tools are open to search the page with the CSS selector you've constructed. After spending some time on this page, you’ll find that the following CSS selectors work best for the required data points:
| Element | CSS Selector |
| --- | --- |
| name | div[data-testid="title"] |
| review information | [data-testid="review-score"] |
| price | [data-testid="price-and-discounted-price"] |
| address | [data-testid="address"] |
The review information contains both the rating and the review count.
For pagination, we can look for the next page button. The following CSS selector matches the next page:
button[aria-label="Next page"]
With this in mind, let's move on to writing the code.
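To effectively scrape Booking.com, you'll have to perform the following steps: set up the web driver, find the total number of search results pages, extract the hotel data from each page, move on to the next page, and export the collected data.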
To get a clearer picture, let's start with the skeleton of our code:
Python
def init_driver():
    pass


def get_total_pages(driver):
    pass


def extract_hotel_info(driver):
    pass


def extract_hotel_data(hotel):
    pass


def navigate_pages(driver, total_pages):
    pass


def main():
    pass


if __name__ == "__main__":
    main()
These are the core functions that will make up your script. You'll flesh out each of these functions by following the next tutorial steps.
The function below sets up the Selenium web driver for you. ChromeDriverManager automatically downloads the driver binary that Selenium needs to interact with Chrome.
Python
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager


def init_driver():
    # Download a matching ChromeDriver binary if needed, then start Chrome.
    driver_service = Service(ChromeDriverManager().install())
    return webdriver.Chrome(service=driver_service)
This function fetches the total number of pages in the search results.
This step is important as it determines the number of times the loop should run:
Python
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By


def get_total_pages(driver):
    try:
        # The last item in the pagination list holds the highest page number.
        total_pages = int(
            driver.find_element(
                By.CSS_SELECTOR, 'div[data-testid="pagination"] li:last-child'
            ).text
        )
    except NoSuchElementException as e:
        print("Error finding total pages: ", e)
        total_pages = 0
    return total_pages
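The int(...) conversion above raises a ValueError if the last pagination item contains anything other than digits, for example an empty string or an ellipsis. A defensive variant could look like the sketch below. parse_page_count is a hypothetical helper, and falling back to 1 (so that the page already loaded still gets scraped) is a design choice of this sketch rather than the tutorial's behavior, which falls back to 0.

```python
def parse_page_count(text, default=1):
    """Convert pagination text to an int, falling back to a default."""
    cleaned = text.strip()
    # isdecimal() is False for empty strings and for non-digit text.
    return int(cleaned) if cleaned.isdecimal() else default


print(parse_page_count("13"))   # 13
print(parse_page_count(""))     # 1
```

You would call it with the .text of the pagination element in place of the bare int(...) call.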
First, you need to extract the container that holds one hotel listing. Once you have this, you can run a loop over each item and extract individual hotel information.
This function extracts the data of all hotels on the current page. It first targets all hotel cards using a CSS selector and then extracts the relevant data from each hotel.
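Instead of using multiple try-except blocks, this code uses suppress from the contextlib module, which tells Python to silently ignore any NoSuchElementException raised inside the with block. As a result, a hotel card that is missing one of the expected elements is skipped.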
Python
from contextlib import suppress


def extract_hotel_info(driver):
    data = []
    all_hotels = driver.find_elements(
        By.CSS_SELECTOR, 'div[data-testid="property-card"]'
    )
    for hotel in all_hotels:
        # Skip a card entirely if any expected element is missing.
        with suppress(NoSuchElementException):
            result = extract_hotel_data(hotel)
            data.append(result)
    return data
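To see how suppress behaves without launching a browser, here is a minimal sketch that uses ValueError in place of NoSuchElementException. parse_all is a hypothetical function written only for this illustration.

```python
from contextlib import suppress


def parse_all(values):
    """Convert strings to ints, silently skipping the ones that fail."""
    parsed = []
    for value in values:
        with suppress(ValueError):
            number = int(value)    # raises ValueError for "n/a"
            parsed.append(number)  # never reached when the line above raises
    return parsed


print(parse_all(["1", "n/a", "3"]))  # [1, 3]
```

The with block is abandoned at the line that raises, so the append is skipped for that item and the loop continues with the next one. The same happens to a hotel card when one of its elements cannot be found.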
Once you have an HTML block containing exactly one hotel's information, you can use the following function to extract the data you’re looking for.
Note that this is the most critical function, which does the job of scraping data. If you need more data points, this is where you make changes:
Python
def extract_hotel_data(hotel):
    result = {}
    result["name"] = hotel.find_element(
        By.CSS_SELECTOR, 'div[data-testid="title"]'
    ).text
    # The review block is expected to hold three lines: score, label, count.
    review_score, _, review_count = hotel.find_element(
        By.CSS_SELECTOR, '[data-testid="review-score"]'
    ).text.split("\n")
    result["review_score"] = review_score
    result["review_count"] = review_count
    result["price"] = hotel.find_element(
        By.CSS_SELECTOR, '[data-testid="price-and-discounted-price"]'
    ).text
    result["address"] = hotel.find_element(
        By.CSS_SELECTOR, '[data-testid="address"]'
    ).text
    return result
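The three-way unpacking of the review text works only when the block contains exactly three lines; any other layout raises a ValueError, which the suppress block does not catch. A more tolerant parser might look like the sketch below. parse_review_text is a hypothetical helper, and the sample string is an assumed format (score, label, count in English), not text captured from the site.

```python
import re


def parse_review_text(text):
    """Return (score, count) from review text; a missing part becomes None."""
    # First number in the text, with an optional decimal part, is the score.
    score_match = re.search(r"\d+(?:\.\d+)?", text)
    # Digits (with optional thousands separators) before the word "review(s)".
    count_match = re.search(r"(\d[\d,]*)\s+reviews?", text)
    score = float(score_match.group()) if score_match else None
    count = int(count_match.group(1).replace(",", "")) if count_match else None
    return score, count


sample = "8.4\nVery Good\n1,234 reviews"  # assumed layout, for illustration
print(parse_review_text(sample))  # (8.4, 1234)
```

Returning numbers instead of raw strings also makes the data easier to sort and filter later, though the patterns would need adjusting for non-English pages.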
Lastly, you also need to access other search results pages. The following code snippet achieves this:
Python
import time


def navigate_pages(driver, total_pages):
    # Dismiss the cookie banner if it is shown.
    try:
        decline_cookies = driver.find_element(
            By.CSS_SELECTOR, '[id="onetrust-reject-all-handler"]'
        )
        decline_cookies.click()
    except NoSuchElementException:
        print("No cookies to decline.")
    data = []
    for _ in range(total_pages)[:2]:  # limit to first 2 pages for this example
        data.extend(extract_hotel_info(driver))
        try:
            next_page_btn = driver.find_element(
                By.CSS_SELECTOR, 'button[aria-label="Next page"]'
            )
            next_page_btn.click()
            time.sleep(5)  # wait for the next page to load
        except NoSuchElementException as e:
            print("Error finding next page button: ", e)
            break
    return data
Note that the pages are limited to 2 just for this example. You can delete the [:2] to collect the data from all 13 pages.
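A fixed time.sleep(5) is either longer than needed or too short on a slow connection. One common alternative is to poll until a condition holds. The sketch below shows the idea with the standard library only; wait_until is a hypothetical helper, and the checks iterator stands in for a real test such as "the next page's hotel cards are present".

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Poll condition() until it returns a truthy value or timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Stand-in for "has the next page loaded?": becomes true on the third check.
checks = iter([False, False, True])
outcome = wait_until(lambda: next(checks), timeout=2.0, poll_interval=0.01)
print(outcome)  # True
```

Selenium ships its own version of this pattern as WebDriverWait in selenium.webdriver.support.ui, usually combined with the helpers in selenium.webdriver.support.expected_conditions, so in the scraper itself you would likely reach for those rather than a hand-written loop.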
If you want to export the data to a CSV file, you can add a function as follows:
Python
import csv


def export_data(data):
    csv_file = "property_data.csv"
    with open(csv_file, "w", newline="", encoding="utf-8") as file:
        # Use the keys of the first record as the CSV header.
        writer = csv.DictWriter(file, fieldnames=data[0].keys())
        writer.writeheader()
        writer.writerows(data)
Finally, the main() function sets up the driver, fetches the total number of pages, scrapes all hotels, and exports the scraped data into a CSV file:
Python
def main():
    url = "https://www.booking.com/searchresults.html?ss=New+York&checkin=2023-12-31&checkout=2024-01-01"
    driver = init_driver()
    driver.get(url)
    total_pages = get_total_pages(driver)
    print(f"Total pages: {total_pages}")
    data = navigate_pages(driver, total_pages)
    driver.quit()
    print(data)
    if data:  # export_data reads data[0], so skip it when nothing was scraped
        export_data(data)


if __name__ == "__main__":
    main()
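The CSV export step can be checked without launching a browser. The sketch below writes to an in-memory buffer instead of a file; rows_to_csv is a hypothetical helper, and the sample row is invented data in the same shape that the scraper's dictionaries have.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize a list of dicts to CSV text; returns '' when there are no rows."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Invented sample record, not real scraped data.
sample_rows = [
    {
        "name": "Example Hotel",
        "review_score": "8.4",
        "review_count": "1,234 reviews",
        "price": "$250",
        "address": "Manhattan, New York",
    },
]
csv_text = rows_to_csv(sample_rows)
print(csv_text)
```

Fields that contain commas, such as the review count and the address here, are quoted automatically by the csv module, so they stay in a single column when the file is opened in a spreadsheet.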
The complete Booking.com scraper script looks like this:
Python
import csv
import time
from contextlib import suppress

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager


def init_driver():
    # Download a matching ChromeDriver binary if needed, then start Chrome.
    driver_service = Service(ChromeDriverManager().install())
    return webdriver.Chrome(service=driver_service)


def get_total_pages(driver):
    try:
        # The last item in the pagination list holds the highest page number.
        total_pages = int(
            driver.find_element(
                By.CSS_SELECTOR, 'div[data-testid="pagination"] li:last-child'
            ).text
        )
    except NoSuchElementException as e:
        print("Error finding total pages: ", e)
        total_pages = 0
    return total_pages


def extract_hotel_info(driver):
    data = []
    all_hotels = driver.find_elements(
        By.CSS_SELECTOR, 'div[data-testid="property-card"]'
    )
    for hotel in all_hotels:
        # Skip a card entirely if any expected element is missing.
        with suppress(NoSuchElementException):
            result = extract_hotel_data(hotel)
            data.append(result)
    return data


def extract_hotel_data(hotel):
    result = {}
    result["name"] = hotel.find_element(
        By.CSS_SELECTOR, 'div[data-testid="title"]'
    ).text
    # The review block is expected to hold three lines: score, label, count.
    review_score, _, review_count = hotel.find_element(
        By.CSS_SELECTOR, '[data-testid="review-score"]'
    ).text.split("\n")
    result["review_score"] = review_score
    result["review_count"] = review_count
    result["price"] = hotel.find_element(
        By.CSS_SELECTOR, '[data-testid="price-and-discounted-price"]'
    ).text
    result["address"] = hotel.find_element(
        By.CSS_SELECTOR, '[data-testid="address"]'
    ).text
    return result


def navigate_pages(driver, total_pages):
    # Dismiss the cookie banner if it is shown.
    try:
        decline_cookies = driver.find_element(
            By.CSS_SELECTOR, '[id="onetrust-reject-all-handler"]'
        )
        decline_cookies.click()
    except NoSuchElementException:
        print("No cookies to decline.")
    data = []
    for _ in range(total_pages)[:2]:  # limit to first 2 pages for this example
        data.extend(extract_hotel_info(driver))
        try:
            next_page_btn = driver.find_element(
                By.CSS_SELECTOR, 'button[aria-label="Next page"]'
            )
            next_page_btn.click()
            time.sleep(5)  # wait for the next page to load
        except NoSuchElementException as e:
            print("Error finding next page button: ", e)
            break
    return data


def export_data(data):
    csv_file = "property_data.csv"
    with open(csv_file, "w", newline="", encoding="utf-8") as file:
        # Use the keys of the first record as the CSV header.
        writer = csv.DictWriter(file, fieldnames=data[0].keys())
        writer.writeheader()
        writer.writerows(data)


def main():
    url = "https://www.booking.com/searchresults.html?ss=New+York&checkin=2023-12-31&checkout=2024-01-01"
    driver = init_driver()
    driver.get(url)
    total_pages = get_total_pages(driver)
    print(f"Total pages: {total_pages}")
    data = navigate_pages(driver, total_pages)
    driver.quit()
    print(data)
    if data:  # export_data reads data[0], so skip it when nothing was scraped
        export_data(data)


if __name__ == "__main__":
    main()
And there you have it: a step-by-step guide to scraping Booking.com using Python and Selenium. Following this approach, you can extend the script to scrape other dynamic websites. Remember that it’s essential to respect the terms and conditions of the website you’re scraping. Happy coding, and enjoy the journey!