In my pipeline, I have a step where I do some web scraping with Selenium: using Selenium to open a web page and extract information from it with web locators. Ideally, I would like to have a Python callable do this and use Airflow/Composer. However, Selenium will use a ChromeDriver to go to the web page (headless, without opening it physically) and get the information. What is the best way to do this? Do I need a virtual machine with Chrome installed? Can I do this with Airflow/Composer? CC: @ms4446
Thanks
Hi @ayushmaheshwari ,
Here is how you can integrate Selenium with Cloud Composer effectively:
Best Way to Integrate Selenium with Cloud Composer: The optimal approach is to encapsulate Selenium and all its dependencies within a Docker container. This setup includes Python, Selenium, a WebDriver (like ChromeDriver configured for headless operation), and necessary libraries. You can then utilize the KubernetesPodOperator in Airflow to execute this container as a task. This method provides excellent isolation, simplifies dependency management, and enhances scalability.
Need for a Virtual Machine with Chrome: No, you don’t need a virtual machine with Chrome installed. The Docker container approach includes everything required for headless browser automation, making a separate VM unnecessary.
Implementation Steps with Cloud Composer:
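At a high level: build a Docker image that contains Python, Selenium, and a headless Chromium/ChromeDriver, push it to a registry your project can pull from, and reference that image from a DAG task. Below is a minimal DAG sketch, assuming the image has been pushed as gcr.io/your-project-id/selenium-scraper; the exact import path and namespace depend on your Airflow provider version and Composer setup, so treat all names as illustrative:
from datetime import datetime

from airflow import DAG
# Import path may differ by provider version (older versions use
# airflow.providers.cncf.kubernetes.operators.kubernetes_pod)
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

with DAG(
    dag_id="selenium_scraper",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    scrape = KubernetesPodOperator(
        task_id="run_selenium_scraper",
        name="selenium-scraper",
        # Composer 2 typically schedules user pods in this namespace;
        # Composer 1 commonly uses "default"
        namespace="composer-user-workloads",
        image="gcr.io/your-project-id/selenium-scraper",
        get_logs=True,
    )
The container runs to completion as a pod in the environment's GKE cluster, so the scraper's browser and Python dependencies never need to be installed on the Airflow workers themselves.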
Additional Tips:
Is web scraping using Selenium secure? I mean, will it be considered a security breach if we fetch data through it? I learned that Selenium uses an HTTP connection in the script.
Hi @ms4446
I'm not sure how to create a Docker image, or what exactly you mean by building an image with the required setup. Do you mean the required dependencies for the Selenium-based script?
Could you provide some guidance or a tutorial that explains this, as I have never done it before? The same goes for pushing the image to Google Container Registry or a similar service. I would be extremely grateful.
Also @ms4446, do you think I could use a Cloud Function instead: write the whole web scraping script in a Cloud Function and trigger it through Composer/Airflow? Wouldn't that be a better option?
Thanks
Hi @ayushmaheshwari ,
A Docker image serves as a blueprint for Docker containers, encapsulating the necessary operating system, software, libraries, and configurations. Here are some steps you could take:
Create a Dockerfile: This text file contains commands to build your image. Here's an example for a Selenium project:
# Use an official Python runtime as a parent image
FROM python:3.9-slim
# Install Chromium and the matching ChromeDriver for headless browsing
RUN apt-get update && \
    apt-get install -y --no-install-recommends chromium chromium-driver && \
    rm -rf /var/lib/apt/lists/*
# Set the working directory to /app
WORKDIR /app
# Copy project files into the container
COPY . /app
# Install Python dependencies from requirements.txt
RUN pip install --no-cache-dir -r requirements.txt
# Run your script on container launch
CMD ["python", "app.py"]
Build the Docker Image: In your terminal, navigate to the Dockerfile's directory and run:
docker build -t selenium-scraper .
This creates an image tagged as selenium-scraper.
Push to Google Container Registry (GCR): After configuring Docker to authenticate with gcloud (gcloud auth configure-docker), tag and push your image:
docker tag selenium-scraper gcr.io/your-project-id/selenium-scraper
docker push gcr.io/your-project-id/selenium-scraper
Using Cloud Functions
While possible, Cloud Functions have limitations for web scraping: execution time and memory limits, and the difficulty of packaging a headless Chrome binary with a matching ChromeDriver inside the function's deployment.
However, for lightweight, short tasks that finish well within those limits, Cloud Functions triggered by Cloud Composer can be a simpler option.
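If you do go that route, one way to trigger the function from Composer is the Google provider's CloudFunctionInvokeFunctionOperator. A minimal sketch, assuming a function named scrape-page is already deployed in us-central1 (project, region, function name, and payload format are placeholders to adapt):
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.functions import CloudFunctionInvokeFunctionOperator

with DAG(
    dag_id="trigger_scraper_function",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    invoke = CloudFunctionInvokeFunctionOperator(
        task_id="invoke_scraper",
        project_id="your-project-id",
        location="us-central1",
        function_id="scrape-page",   # name of the deployed Cloud Function
        input_data={"data": "{}"},   # payload passed to the function call
    )
A SimpleHttpOperator pointed at the function's HTTPS trigger URL is another common pattern.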
Choosing the Right Approach