
Airflow DAG: Web scraping using Selenium

In my pipeline, I have a step where I do some web scraping with Selenium: it opens a web page and uses web locators to extract information from the site. Ideally, I would like a Python callable to do this and to use Airflow/Composer. However, Selenium needs a Chrome driver to visit the page (headless, without opening it physically) and pull the information. What is the best way to do this? Do I need a virtual machine with Chrome installed? Can I do it with Airflow/Composer? CC: @ms4446

Thanks


Hi @ayushmaheshwari ,

Here is how you can integrate Selenium with Cloud Composer effectively:

  1. Best Way to Integrate Selenium with Cloud Composer: The optimal approach is to encapsulate Selenium and all its dependencies within a Docker container. This setup includes Python, Selenium, a WebDriver (like ChromeDriver configured for headless operation), and necessary libraries. You can then utilize the KubernetesPodOperator in Airflow to execute this container as a task. This method provides excellent isolation, simplifies dependency management, and enhances scalability.

  2. Need for a Virtual Machine with Chrome: No, you don’t need a virtual machine with Chrome installed. The Docker container approach includes everything required for headless browser automation, making a separate VM unnecessary.

  3. Implementation Steps with Cloud Composer:

    • Create a Docker Image: Build an image with the required setup and push it to Google Container Registry or a similar service.
    • Modify Your Airflow DAG: Configure the KubernetesPodOperator to use your Docker image. Define tasks in the DAG to execute your scraping script, using Airflow variables or secrets for any sensitive or dynamic inputs (a sketch follows after this list).
    • Set Environment Variables: Utilize Airflow's capabilities to manage environment variables and secure storage of sensitive information like API keys.
    • Testing and Deployment: Always test your setup thoroughly in a staging environment before rolling out to production to ensure everything works as expected.
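
To make step 3 concrete, here is a minimal sketch of such a DAG task, assuming the image has already been pushed as gcr.io/your-project-id/selenium-scraper (the project ID, namespace, schedule, and TARGET_URL variable are placeholders, and the exact import path depends on your cncf-kubernetes provider version):

    from datetime import datetime, timedelta

    from airflow import DAG
    # Import path may differ slightly depending on the provider version installed.
    from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
        KubernetesPodOperator,
    )

    # Minimal sketch: run the Selenium scraper image as a pod from Composer.
    with DAG(
        dag_id="selenium_scraping",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        scrape = KubernetesPodOperator(
            task_id="run_selenium_scraper",
            name="selenium-scraper",
            namespace="default",  # placeholder; use the namespace your environment expects
            image="gcr.io/your-project-id/selenium-scraper",  # placeholder project ID
            # Non-sensitive, dynamic inputs can go in as environment variables;
            # keep secrets in Airflow Variables/Connections or a secrets backend.
            env_vars={"TARGET_URL": "https://example.com"},
            retries=2,
            retry_delay=timedelta(minutes=5),
            get_logs=True,
        )

The retries, retry_delay, and get_logs arguments also tie into the robustness and debugging tips below.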

Additional Tips:

  • Security: Use VPNs or proxies when dealing with sensitive or geo-restricted content. Securely handle credentials using Airflow’s mechanisms.
  • Scaling: Manage the scalability of your tasks by adjusting Airflow settings related to concurrency and resource allocation.
  • Cost Management: Keep your Docker images lean and monitor Google Cloud billing to control costs.
  • Robustness: Include error handling, retries, and timeouts in your setup to make your pipeline more resilient.
  • Compliance: Adhere to legal and technical standards such as respecting robots.txt files and using appropriate user-agent strings.
  • Debugging: Leverage local testing, detailed logging, and Cloud Logging for monitoring and troubleshooting.

Is web scraping using Selenium secure? I mean, will it be considered a security breach if we fetch data through it? I understand that Selenium uses an HTTP connection in the script.

Hi @ms4446 

I'm not sure how to create a Docker image, and what exactly do you mean by building an image with the required setup? Do you mean the required dependencies for the Selenium-based script?

Can you provide some guidance or a tutorial that explains this, as I have never done it before? Likewise for pushing the image to Google Container Registry or a similar service. I would be extremely grateful.

Also @ms4446, do you think I could use a Cloud Function instead: write the whole web scraping script in a Cloud Function and trigger it through Composer/Airflow? Isn't that a better option?

Thanks 

Hi @ayushmaheshwari ,

A Docker image serves as a blueprint for Docker containers, encapsulating the necessary operating system, software, libraries, and configurations. Here are some steps you could take:

  1. Create a Dockerfile: This text file contains the commands to build your image. Here's an example for a Selenium project (a sketch of the app.py it runs follows after these steps):

     
    # Use an official Python runtime as a parent image
    FROM python:3.9-slim-buster
    
    # Install Chromium and its matching WebDriver for headless browsing
    RUN apt-get update && \
        apt-get install -y --no-install-recommends chromium chromium-driver && \
        rm -rf /var/lib/apt/lists/*
    
    # Set the working directory to /app
    WORKDIR /app
    
    # Copy project files into the container
    COPY . /app
    
    # Install Python dependencies (including selenium) from requirements.txt
    RUN pip install --no-cache-dir -r requirements.txt
    
    # Run your scraping script on container launch
    CMD ["python", "app.py"]
    
  2. Build the Docker Image: In your terminal, navigate to the Dockerfile's directory and run:

     
    docker build -t selenium-scraper .
    

    This creates an image tagged as selenium-scraper.

  3. Push to Google Container Registry (GCR): Configure Docker to authenticate through gcloud, then tag and push your image:

     
    gcloud auth configure-docker
    docker tag selenium-scraper gcr.io/your-project-id/selenium-scraper
    docker push gcr.io/your-project-id/selenium-scraper
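
For completeness, here is a minimal sketch of what the app.py referenced in the Dockerfile's CMD might contain, assuming Chromium and chromium-driver are installed in the image as shown above; the URL, locator, and binary path are placeholders to adapt to your site and base image:

    import os

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By

    # Run Chromium headless inside the container.
    options = Options()
    options.add_argument("--headless")
    options.add_argument("--no-sandbox")
    options.add_argument("--disable-dev-shm-usage")
    # Path used by the Debian chromium package; adjust if your image differs.
    options.binary_location = "/usr/bin/chromium"

    driver = webdriver.Chrome(options=options)
    try:
        # TARGET_URL is a placeholder environment variable set by the DAG/container.
        driver.get(os.environ.get("TARGET_URL", "https://example.com"))
        # Example locator: grab the page heading; replace with your own selectors.
        heading = driver.find_element(By.TAG_NAME, "h1").text
        print(f"Scraped heading: {heading}")
    finally:
        driver.quit()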
    

Using Cloud Functions

While possible, Cloud Functions have limitations for web scraping:

  • Timeouts: Maximum 9-minute execution time.
  • Resource Limits: Less control over the environment compared to VMs or containers.
  • Concurrency and Scaling: Handling high concurrency might require additional setup.

However, for lightweight, short tasks, Cloud Functions triggered by Cloud Composer can be a simpler option:

  1. Write your scraping script as a function.
  2. Deploy it to Cloud Functions.
  3. Use Cloud Composer to schedule and trigger the function (see the sketch below).
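
As a rough illustration of step 3, one option is the Google provider's CloudFunctionInvokeFunctionOperator; the function name, region, project ID, and payload below are placeholders, and calling the function's HTTP trigger with an HTTP operator would work as well:

    from datetime import datetime

    from airflow import DAG
    from airflow.providers.google.cloud.operators.functions import (
        CloudFunctionInvokeFunctionOperator,
    )

    # Minimal sketch: schedule and invoke an existing Cloud Function from Composer.
    with DAG(
        dag_id="trigger_scraper_function",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        invoke_scraper = CloudFunctionInvokeFunctionOperator(
            task_id="invoke_scraper",
            function_id="scrape-website",  # placeholder function name
            location="us-central1",        # placeholder region
            project_id="your-project-id",  # placeholder project ID
            input_data={"data": '{"url": "https://example.com"}'},
        )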

Choosing the Right Approach

  • Docker and Cloud Composer: Offers more control and scalability, suitable for complex or long-running tasks.
  • Cloud Functions: Simpler and potentially cost-effective for smaller, well-defined tasks.