
Splitting Looker instances: the case for multiple production instances


Authored by: @maryiaborukhava and @nishantmaharish. Co-authored by: @marout

Introduction

This article provides a comprehensive guide on how to split production Looker instance content across multiple instances. This strategy is designed to improve performance and provide content isolation for Looker instances, particularly those with a large number of users and extensive content.

Why have multiple Looker instances?

From a single Looker instance to multiple instances

Our Looker setup has undergone multiple iterations. In addition to the production instance, we established a development environment to offload development and QA activities from production, which improved performance for multiple workflows. However, users of the production environment still experienced challenges as the number of LookML models, Explores, dashboards, Looks, and schedules grew. These issues included delayed responses from API endpoints, longer dashboard rendering times, slower search performance in Explores, and interface delays when interacting with buttons in Looker. As a result, we recently decided to introduce a second production environment for Looker and relocate a group of users from one tenant there.


Advantages and disadvantages of supporting multiple Looker instances

The primary benefits of utilizing multiple environments include enhanced availability and improved performance, among others:

  • Enhanced Availability. A Looker instance is a shared environment. While it's possible to isolate data access for various user groups and distribute the computational load of executing SQL queries across multiple database connections, some elements remain shared among all users of the same instance. These shared elements include the Looker query queue, the maximum number of rendering threads that can be used concurrently for generating scheduled reports, and the general capacity of the Looker server. Distributing user groups and tenants across multiple Looker production environments minimizes the risk of these groups impacting one another. In the event of a server outage, only a specific group of users is affected, ensuring uninterrupted service for others.
  • Performance Improvement. As the instance size increases, there are inherent limits to growth before performance is adversely affected. Introducing a new environment helps address delayed responses from API endpoints, dense schedule distributions, longer dashboard rendering times, and interface delays.
  • Content Optimization. Hosting a large number of dashboards, Looks, and Explores on a single Looker instance can significantly degrade search performance and make it harder to find relevant content. Additionally, the Content Validator may become inefficient if it has to run against a vast number of Looker dashboards and Looks. By distributing Looker content across multiple instances, content validation is greatly improved and search is streamlined.
  • Customized Administration for Tenants. When a Looker instance is shared among multiple tenants, tasks such as model configuration must be handled by administrators and cannot be delegated to more users because of the risk of data leakage between tenants. Providing a separate Looker instance for each tenant offers excellent tenant isolation and allows greater autonomy in managing the instance according to each tenant's unique requirements. This setup ensures that features like Looker's System Activity Explores are directly accessible to all tenants without additional filtering, and certain configuration tasks can be carried out without the bottleneck of requiring administrative approval.

The main disadvantages of maintaining multiple Looker instances include increased administrative effort and cost as well as the need to have a strategy that lets users from different tenants work together effectively:

  • Maintenance. If you are hosting Looker instances in-house, this effort includes updates, security patches, and monitoring. However, even with managed Looker instances, extra effort will still be needed to administer permissions, database connectivity, and Looker peripheral jobs that use the Looker API.
  • Cost. Supporting multiple Looker instances implies additional costs for allocating those resources.
  • Collaborative Development. Maintaining multiple Looker instances introduces complexities for users from different tenants when they need to interact with each other, particularly in developing LookML code and sharing reports, as they will be using different Looker instances. If your tenants need to work closely together, you should develop a strategy to facilitate their collaboration.

Is an additional instance necessary?

When deciding whether your organization needs additional Looker instances, it's essential to evaluate whether the advantages outweigh the disadvantages. In particular, consider adding an additional Looker instance if your current setup suffers from server performance issues such as extensive Looker queueing, frequent rush hours with many scheduled reports, reduced responsiveness, or difficulties with search and content validation functionality.
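One way to quantify these symptoms before committing to a split is to query Looker's System Activity through the API. The following is a minimal sketch, assuming API credentials in a looker.ini file and access to System Activity; the field names come from the standard history Explore and may need adjusting for your instance.

# check_load.py: a rough gauge of query volume trends via System Activity
import looker_sdk
from looker_sdk.sdk.api40 import models

sdk = looker_sdk.init40()  # reads credentials from looker.ini by default

# Daily query-run counts for the last 30 days; a steady climb alongside
# growing queue times is a signal that a split may be worthwhile
query = models.WriteQuery(
    model="system__activity",
    view="history",
    fields=["history.created_date", "history.query_run_count"],
    filters={"history.created_date": "30 days"},
    sorts=["history.created_date"],
)
print(sdk.run_inline_query(result_format="json", body=query))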

Looker instance per tenant

Problem statement

Our main objective was to migrate and separate the content for a specific group of users (belonging to a particular tenant) from the single production instance to a newly created production environment dedicated to them. The goal was to facilitate a smooth transition for users who stayed with the original instance while minimizing any disruption for those transitioning to the new production environment.

The process of splitting the Looker instance involves three key steps:

  1. Configuring the new instance
  2. Adapting existing scripts
  3. Migrating user content to the new instance

Configuring a new instance

Before starting the instance split, you must secure the additional Looker instance with Google. Once you have access to the new instance, follow these steps to ensure a smooth transition of code, reports, and users:

  • Configure the Looker instance to correctly authenticate users (for example, by configuring SAML authentication).
  • Establish the necessary database connections and ensure that they have the same names as those in the original Looker instance.
  • Create Looker projects and connect them to the same Git repositories as in the original instance. We recommend preserving the same project names; however, this is not required.
  • Replicate roles, groups, and permission and model sets as they were configured in the original instance. This process can be done manually or programmatically (see the sketch after this list).
  • Recreate group hierarchies.
  • Assign roles within groups.
  • Configure the default group for the new users on their first login.
  • Recreate user attributes if they were used in the original instance.
  • Set up SMTP for sending mail.
  • Optional: Extend user permissions to allow access to the System Activity Explores on the new instance.
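Here is a minimal sketch of the programmatic approach to replicating groups, permission and model sets, and roles, assuming source_sdk and target_sdk clients initialized as in the migration script later in this article; production use would also need de-duplication and error handling.

# replicate_access_config.py: copy groups, sets, and roles between instances
from looker_sdk.sdk.api40 import models

# Recreate groups (skipping externally managed ones and the built-in "All Users")
for group in source_sdk.all_groups():
    if not group.externally_managed and group.name != "All Users":
        target_sdk.create_group(body=models.WriteGroup(name=group.name))

# Recreate permission sets and model sets, skipping built-ins
for perm_set in source_sdk.all_permission_sets():
    if not perm_set.built_in:
        target_sdk.create_permission_set(body=models.WritePermissionSet(
            name=perm_set.name, permissions=perm_set.permissions))
for model_set in source_sdk.all_model_sets():
    if not model_set.built_in:
        target_sdk.create_model_set(body=models.WriteModelSet(
            name=model_set.name, models=model_set.models))

# Recreate roles, resolving their permission and model sets by name on the target
target_perm_sets = {p.name: p.id for p in target_sdk.all_permission_sets()}
target_model_sets = {m.name: m.id for m in target_sdk.all_model_sets()}
for role in source_sdk.all_roles():
    if role.name != "Admin":  # the built-in Admin role already exists
        target_sdk.create_role(body=models.WriteRole(
            name=role.name,
            permission_set_id=target_perm_sets[role.permission_set.name],
            model_set_id=target_model_sets[role.model_set.name]))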

Adapting existing scripts

Peripheral jobs include microservices, scripts, and UIs. For our Looker environment, we maintain a suite of peripheral jobs that use the Looker API to programmatically monitor and manage the environment. These jobs include checking server availability, testing the health of database connections, cleaning up unused content, assigning users to groups based on their accounts, and killing long-running queries. All of these jobs had to be extended to support the new Looker production instance. The scope of changes in your environment will depend on how many jobs you have and their internal structure, but this step is fairly straightforward: it mostly amounts to extending the list of Looker instance URLs to include the new instance, as in the sketch below.
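A hypothetical sketch of that wiring follows; the instance list and job functions are illustrative names, not part of any real library.

# Hypothetical multi-instance wiring for peripheral jobs
INSTANCES = [
    {"base_url": "https://prod-1.looker.com:19999", "client_id": "...", "client_secret": "..."},
    {"base_url": "https://prod-2.looker.com:19999", "client_id": "...", "client_secret": "..."},
]

def run_peripheral_jobs():
    for instance in INSTANCES:
        sdk = make_sdk(**instance)          # e.g., a wrapper around looker_sdk.init40()
        check_server_availability(sdk)      # each job now receives the SDK client
        test_connection_health(sdk)
        clean_up_unused_content(sdk)
        kill_long_running_queries(sdk)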

Migrating user content to the new instance

The final and most crucial step is migrating the Looker content for users. This task can be challenging if you aim to ensure a smooth transition for users while addressing the following requirements:

  • Migrating roles, groups, permission and model sets
  • Migrating dashboards and Looks (preserving ownership)
  • Migrating folder structure
  • Migrating schedules (preserving ownership)
  • Migrating alerts (preserving ownership)
  • Migrating boards (preserving ownership)
  • Migrating favorites
  • Disabling the migrated alerts and schedules in the original instance to avoid sending duplicate emails to users
  • Providing visibility of the migrated content to the owners and users

Available solutions

The Looker platform does not offer this functionality out of the box, and no ready-made solution on the market covers it fully. There are open-source command-line tools available for migrating content between Looker instances, for example, Looker Deployer and Gazer. However, these tools do not cover the whole spectrum of activities required for a successful instance split.

  • Gazer. This tool facilitates the transfer of content such as dashboards, Looks, folders, alerts, schedules, models, roles, and groups across instances. While Gazer offers granular control, it requires custom logic for bulk migrations. It does not support the migration of boards and favorites, nor does it ensure the preservation of content ownership. Additionally, Gazer may fail if alert or scheduled plan configurations are not as expected.
  • Looker Deployer. Built on top of Gazer, Looker Deployer supports all of Gazer's functionalities and additionally allows for the migration of boards across instances. However, it does not support the migration of favorites or the preservation of content ownership. Currently, Looker Deployer is supported by Google, though it is not officially warranted, with the latest release in July 2023.
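For reference, a basic Gazer round trip between instances looks roughly like the following (the IDs, hosts, and target folder are placeholders; consult the Gazer documentation for the full option list). The same cat/import pattern is what our script below drives programmatically.

gzr dashboard cat 1234 --host source.looker.com > Dashboard_1234.json
gzr dashboard import Dashboard_1234.json 4321 --host target.looker.com --force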

Custom solution

Instead of building a custom solution from scratch, we leveraged Gazer and enhanced it with additional functionality. We used Gazer to perform basic migration for dashboards, Looks, alerts, and scheduled plans. Subsequently, we enhanced it with additional capabilities to meet extended requirements, such as preserving content ownership, migrating boards and favorites, and providing visibility into the migrated content.

Capability                                                                Gazer   Looker Deployer   Custom solution
Migrates roles, groups, and permission and model sets                     Yes     Yes               Yes
Migrates dashboards and Looks                                             Yes     Yes               Yes
Migrates folder structure                                                 Yes     Yes               Yes
Migrates schedules                                                        Yes     Yes               Yes
Migrates alerts                                                           Yes     Yes               Yes
Migrates boards                                                           No      Yes               Yes
Migrates favorites                                                        No      No                Yes
Disables alerts and schedules in the original instance after migration    No      No                Yes
Preserves ownership (dashboards, Looks, schedules, and alerts)            No      No                Yes
Links to the migrated content in the original instance                    No      No                Yes

To maintain content ownership during migration, we first instructed users to log in to the new instance so that their accounts were created, since we use SAML for all instances. Gazer's commands for migrating dashboards and Looks run under the supplied user credentials and assign that user as the content owner. In our custom script, we therefore temporarily elevated each content owner to admin status and generated API3 credentials for them, which we passed to the Gazer command that migrates dashboards and Looks. If an owner's account didn't exist in the new Looker instance, the script fell back to the client's credentials. Let's have a look at the code snippet.

# migrate_content.py
import time
import subprocess
import tempfile
import json
from subprocess import CalledProcessError
import looker_sdk
from looker_sdk.sdk.api40 import models

source_base_url = "https://source.looker.com:19999"    # Replace with the source instance url
target_base_url = "https://target.looker.com:19999"    # Replace with the target instance url
source_client_id = "CLIENT_ID"    # Replace with the client id
source_client_secret = "CLIENT_SECRET"   # Replace with the client secret
target_client_id = "CLIENT_ID"    # Replace with the client id
target_client_secret = "CLIENT_SECRET"    # Replace with the client secret

content_id = "1234"    # Replace with the content id
content_type = "dashboard"    # Replace with the content type, it can be look or dashboard
folder_id = "4321"    # Replace with the folder id where the content should be deployed

class LookerSettings(looker_sdk.api_settings.ApiSettings):
    """
    A helper class needed to be able to initialise Looker SDK by passing client_id, client_secret instead of ini file, see https://pypi.org/project/looker-sdk/
    """
    def __init__(self, *args, **kw_args):
        self.client_id = kw_args.pop("client_id")
        self.client_secret = kw_args.pop("client_secret")
        self.base_url = kw_args.pop("base_url")
        self.timeout = kw_args.pop("timeout")
        super().__init__(*args, **kw_args)

    def read_config(self) -> looker_sdk.api_settings.SettingsConfig:
        config = super().read_config()
        config["client_id"] = self.client_id
        config["client_secret"] = self.client_secret
        config["base_url"] = self.base_url
        config["timeout"] = self.timeout
        return config

# Initialize the Looker SDKs
source_sdk = looker_sdk.init40(config_settings=LookerSettings(client_id=source_client_id, client_secret=source_client_secret, base_url=source_base_url, timeout=300))
target_sdk = looker_sdk.init40(config_settings=LookerSettings(client_id=target_client_id, client_secret=target_client_secret, base_url=target_base_url, timeout=300))

def find_owner_id_in_target(owner_id_in_source):
    """
    Finds if the owner exists in the target instance
    """
    try:
        # Fetch owner details from the source instance
        owner_details_in_source = source_sdk.user(owner_id_in_source)
        # Search for the owner in the target instance using their email
        owners_in_target = target_sdk.search_users(email=owner_details_in_source.email)
        # Filter out the embedded user
        # SDK responses are model objects, so use attribute access
        non_embedded_owners_in_target = [owner for owner in owners_in_target if owner.display_name != 'Embed User']

        if len(non_embedded_owners_in_target) == 0:
            print("Could not find owner in target instance.")
            return None
        return non_embedded_owners_in_target[0].id
    except Exception as exp:
        print(f"Error while finding if owner exists in the target instance. Owner id in source is {owner_id_in_source}. Error: {exp}")
        return None

def get_owner_details_in_target(owner_id_in_source):
    """
    Fetches the owner details in the target instance
    """
    try:
        start_time = time.time()
        # Find the owner ID in the target instance
        owner_id_in_target = find_owner_id_in_target(owner_id_in_source)
        if owner_id_in_target is None:
            return None
        
        # Add the owner to the admin group in the target instance
        target_sdk.add_group_user(
            group_id="3",    # Replace with the admin group id where the user should be added
            body=models.GroupIdForGroupUserInclusion(
                user_id=owner_id_in_target
        ))
        # Create API credentials for the owner in the target instance
        api_creds = target_sdk.create_user_credentials_api3(user_id=owner_id_in_target)
        if not api_creds:
            print(f"Could not fetch the api credentials of the owner. Owner id in target is {owner_id_in_target}")
            return None
        
        print(f"Fetching owner details in target took: {time.time() - start_time}")
        return {
            "owner_id_in_target": owner_id_in_target,
            "api_creds": api_creds
        }
    except Exception as exp:
        print(f"Error while fetching owner details in target. Owner id in source is {owner_id_in_source}. Error: {exp}")

def revoke_permission_and_delete_credentials(owner_id_in_target, api_3_credentials_id):
    """
    Revokes the permissions and deletes the credentials of the owner in the target instance.
    """
    try:
        start_time = time.time()
        target_sdk.delete_user_credentials_api3(user_id=owner_id_in_target, credentials_api3_id=api_3_credentials_id)
        target_sdk.delete_group_user(
            group_id="3",    # Replace with the admin group id where the user should be added
            user_id=owner_id_in_target
        )
        print(f"Revoking permissions and deleting credentials took: {time.time() - start_time}")
    except Exception as exp:
        print(f"Error while revoking permissions and deleting credentials. Owner id in target is {owner_id_in_target}. Error: {exp}")

def __run_cli_command(gzr_command: 'list[str]', arguments_to_scrap: 'list[int]') -> str:
    """
    Run the passed command, returning the output.
    If the command fails, raise the corresponding error while masking the
    specified arguments, to avoid exposing secrets in the log.
    """
    proc = subprocess.run(
        gzr_command, universal_newlines=True, capture_output=True, check=False
    )
    if proc.returncode != 0:
        # Mask the credentials so they are not exposed in the logs
        gzr_command_safe = gzr_command.copy()
        for arg in arguments_to_scrap:
            gzr_command_safe[arg] = "XXX"

        raise CalledProcessError(
            returncode=proc.returncode,
            cmd=gzr_command_safe,
            output=proc.stdout,
            stderr=proc.stderr,
        )
    return str(proc.stdout)

def fetch_content_from_looker(content_id, content_type, host, port, client_id, client_secret):
    """
    Fetches the content from the source instance using Gazer
    """
    gzr_command = ["gzr", content_type, "cat", content_id, "--host", host, "--port", port,
        "--client-id", client_id, "--client-secret", client_secret
    ]
    try:
        output = __run_cli_command(gzr_command, [-1, -3])
    except CalledProcessError as err:
        data = {
            "statusCode": 500,
            "message": f"Failed to fetch content using Gazer: {err}. \nStdErr: {err.stderr}, StdOut: {err.output}",
        }
        print(data)
        raise err
    # Parsing JSON
    try:
        return json.loads(output)
    except json.JSONDecodeError as err:
        data = {
            "statusCode": 500,
            "message": f"Failed to parse the output from Gazer: {err}. \nOutput: {output}",
        }
        print(data)
        raise err

def deploy_content_to_looker(content_type, content_file_path, folder_id, host, port, client_id, client_secret):
    """
    Deploys the content to the target instance using Gazer
    """
    gzr_command = ["gzr", content_type, "import", content_file_path, folder_id, "--host", host, "--port", port,
        "--client-id", client_id, "--client-secret", client_secret, "--force"
    ]
    try:
        output = __run_cli_command(gzr_command, [-2, -4])
    except CalledProcessError as err:
        data = {
            "statusCode": 500,
            "message": f"Failed to deploy content using Gazer: {err}, StdOut: {err.output}",
        }
        print(data)
        raise err
    except Exception as err:
        data = {
            "statusCode": 500,
            "message": f"Failed to deploy content using Gazer: {err}",
        }
        print(data)
        raise err
    return output

def write_content_to_file(content_type, content_id, content, tmpdirname):
    """
    Writes the content to a file
    """
    try:
        local_file_path = f"{tmpdirname}/{content_type.capitalize()}_{content_id}.json"

        # Writing a file to a temporary directory
        with open(local_file_path, "w") as file:
            file.write(json.dumps(content))

        return local_file_path
    except Exception as exp:
        print(f"Error while writing the content to a file. Content type: {content_type}, Content id: {content_id}. Error: {exp}")
        return None

with tempfile.TemporaryDirectory() as tmpdirname:
    source_host, source_port = source_base_url.replace("https://", "").split(":")
    content = fetch_content_from_looker(content_id, content_type, source_host, source_port, source_client_id, source_client_secret)
    local_file_path = write_content_to_file(content_type, content_id, content, tmpdirname)

    owner_id_in_source = content["user_id"]
    owner_details_in_target = get_owner_details_in_target(owner_id_in_source)

    if owner_details_in_target is not None:
        print('Using Owner Details')
        target_host, target_port = target_base_url.replace("https://", "").split(":")
        # Deploy the content to the target instance
        try:
            deploy_content_to_looker(content_type, local_file_path, folder_id, target_host, target_port, owner_details_in_target.get('api_creds').client_id, owner_details_in_target.get('api_creds').client_secret)
        except Exception as exp:
            print(f"Error while deploying the content to the target instance. Error: {exp}")
        finally:
            revoke_permission_and_delete_credentials(owner_details_in_target["owner_id_in_target"], owner_details_in_target["api_creds"].id)
    else:
        print('Using Client Credentials')
        target_host, target_port = target_base_url.replace("https://", "").split(":")
        # Use the client credentials to deploy the content
        try:
            deploy_content_to_looker(content_type, local_file_path, folder_id, target_host, target_port, target_client_id, target_client_secret)
        except Exception as exp:
            print(f"Error while deploying the content to the target instance. Error: {exp}")

Ensure that looker_sdk and Gazer are installed on your machine and that you have replaced the placeholder values (base URLs, client IDs and secrets, content and folder IDs) in the script. Then run it with the following command.

python3 migrate_content.py

Once the content is migrated, the credentials will be deleted and the owner will be removed from the admin group.

This approach also proved effective in preserving the ownership of migrated alerts and scheduled plans. However, there is an important consideration: the owner of an alert or scheduled plan may differ from the owner of the associated content. In such cases, the ownership of alerts and plans will default to the owner of the migrated content, as Gazer uses the same credentials provided to it when creating these alerts and plans.
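Disabling the migrated schedules and alerts in the original instance, as listed in the requirements, can also be scripted. The following is a minimal sketch, assuming the source_sdk client from the migration script; production code would iterate over all migrated content.

# disable_in_source.py: silence migrated schedules and alerts on the original instance
from looker_sdk.sdk.api40 import models

def disable_dashboard_schedules(dashboard_id: str):
    # Disable every user's scheduled plans attached to the migrated dashboard
    for plan in source_sdk.scheduled_plans_for_dashboard(dashboard_id=dashboard_id, all_users=True):
        source_sdk.update_scheduled_plan(
            scheduled_plan_id=plan.id,
            body=models.WriteScheduledPlan(enabled=False))

def disable_alert(alert_id: str):
    # Patch rather than delete, so the alert stays recoverable
    source_sdk.update_alert_field(
        alert_id=alert_id,
        body=models.AlertPatch(
            is_disabled=True,
            disabled_reason="Migrated to the new Looker instance"))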

We maintained a record of the content mappings and associated users during the migration process across both instances. Subsequently, we retrieved the details of the boards and favorites on the original instance using the Looker SDK and utilized this mapping to facilitate their migration to the new instance.
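A condensed sketch of that approach follows; dashboard_id_map, look_id_map, and user_id_map stand for the source-to-target mappings recorded during migration, and creating favorites on behalf of other users may require admin privileges.

# migrate_boards_and_favorites.py: replay boards and favorites using recorded mappings
from looker_sdk.sdk.api40 import models

def migrate_board(board_id_in_source: str):
    board = source_sdk.board(board_id=board_id_in_source)
    new_board = target_sdk.create_board(
        body=models.WriteBoard(title=board.title, description=board.description))
    for section in board.board_sections or []:
        new_section = target_sdk.create_board_section(
            body=models.WriteBoardSection(board_id=new_board.id, title=section.title))
        for item in section.board_items or []:
            # Only dashboard and Look items are handled in this sketch
            target_sdk.create_board_item(body=models.WriteBoardItem(
                board_section_id=new_section.id,
                dashboard_id=dashboard_id_map.get(item.dashboard_id),
                look_id=look_id_map.get(item.look_id)))

def migrate_dashboard_favorites(user_id_in_source: str):
    for fav in source_sdk.search_content_favorites(user_id=user_id_in_source):
        new_dashboard_id = dashboard_id_map.get(fav.dashboard_id)
        if new_dashboard_id is None:
            continue  # only dashboard favorites handled in this sketch
        # The favorite must point at the *target* dashboard's content metadata
        meta_id = target_sdk.dashboard(new_dashboard_id).content_metadata_id
        target_sdk.create_content_favorite(body=models.WriteContentFavorite(
            user_id=user_id_map[user_id_in_source],
            content_metadata_id=meta_id))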

To provide visibility to the migrated content, we made the following changes to the original content present in the old instance:

  • Programmatically added a button at the top of the dashboard with a message and a link to the migrated content
  • Updated the Look description to include a message and link to the migrated content

These enhancements minimized disruption during the migration, allowing users to navigate to the migrated content in the new instance easily. The following method adds buttons to dashboards in the original instance with a message and a link to the migrated content.

def _get_content_id_from_gazer_output(gazer_output: str | None, slug_of_content_in_source: str) -> str | None:
    """
    Extracts the content id in the target from the gazer output or using the slug
    """
    try:
        content_id_on_target = None
        if gazer_output is None:
            # If gazer output is not available, search the content using the slug as the content might have been migrated
            dashboards = target_sdk.search_dashboards(slug=slug_of_content_in_source)
            if len(dashboards) > 0:
                content_id_on_target = dashboards[0].id
            else:
                return None
        else:
            # Gazer output will have the content id in the target
            content_id_on_target = str(gazer_output.split()[-1]).strip()
        return content_id_on_target
    except Exception as exp:
        print(f'Unable to retrieve content id from gazer output or slug for the content. Exception: {exp}')
        return None

def update_content_in_source_with_its_id_in_target(gazer_output: str | None, content_id_in_source: str, slug_of_content_in_source: str):
    start_time = time.time()
    try:
        content_id_in_target = _get_content_id_from_gazer_output(gazer_output, slug_of_content_in_source)
        if content_id_in_target is None:
            print(f'Could not find the content id in the target for the content with id {content_id_in_source}')
            return
        link_in_the_new_instance = f'{target_base_url}/dashboards/{content_id_in_target}'
        request_body = models.WriteDashboardElement(
            dashboard_id=content_id_in_source,
            type="button",
            rich_content_json=f'{{"text":"This content has been migrated to the new Looker instance. Please, click here to access it in the new location.","description":"Clicking on the button will take you to the new instance","newTab":true,"alignment":"center","size":"medium","style":"FILLED","color":"#f9ab00","href":"{link_in_the_new_instance}"}}',
        )
        # Create a dashboard element (the button) in the source content (dashboard)
        dashboard_element_details = source_sdk.create_dashboard_element(body=request_body)
        # Get the dashboard layout of the source content
        dashboard_layout_of_source_content = source_sdk.dashboard_dashboard_layouts(dashboard_id=content_id_in_source, fields="dashboard_layout_components")
        # Get the dashboard layout components of the source content
        dashboard_layout_components_of_source_content = dashboard_layout_of_source_content[0].dashboard_layout_components
        # Find the layout component of the new element (we need to update it so that the button sits at the top of the dashboard)
        dashboard_layout_component_of_new_element = None
        for dashboard_layout in dashboard_layout_components_of_source_content:
            if dashboard_layout.dashboard_element_id == dashboard_element_details.id:
                dashboard_layout_component_of_new_element = dashboard_layout
                break

        if dashboard_layout_component_of_new_element is None:
            print(f'No dashboard layout component found for the new element in the source content with id {content_id_in_source}')
            return

        # Update the layout component of the new element to move it to the top of the dashboard
        source_sdk.update_dashboard_layout_component(dashboard_layout_component_id=dashboard_layout_component_of_new_element.id, body=models.WriteDashboardLayoutComponent(
            dashboard_element_id=dashboard_element_details.id,
            dashboard_layout_id=dashboard_layout_component_of_new_element.dashboard_layout_id,
            row=-10,
            column=0,
            width=22,
            height=0
        ))
    except Exception as exp:
        print(f'Error occurred while updating the content in the source with its id in the target. Exception: {exp}')
    
    print(f'Time taken to update the content in the source with its id in the target: {time.time() - start_time} seconds')

To use these methods, include them in the same file mentioned above (migrate_content.py). The updated main block should look like this:

# Rest of the code
owner_id_in_source = content["user_id"]
owner_details_in_target = get_owner_details_in_target(owner_id_in_source)
target_host, target_port = target_base_url.replace("https://", "").split(":")

gazer_output = None
if owner_details_in_target is not None:
    # Same as the if block above, but capture the Gazer output
    try:
        gazer_output = deploy_content_to_looker(content_type, local_file_path, folder_id, target_host, target_port, owner_details_in_target.get('api_creds').client_id, owner_details_in_target.get('api_creds').client_secret)
    finally:
        revoke_permission_and_delete_credentials(owner_details_in_target["owner_id_in_target"], owner_details_in_target["api_creds"].id)
else:
    # Same as the else block above, but capture the Gazer output
    gazer_output = deploy_content_to_looker(content_type, local_file_path, folder_id, target_host, target_port, target_client_id, target_client_secret)

update_content_in_source_with_its_id_in_target(gazer_output, content_id, content.get('slug'))

Then migrate the content again using the command:

python3 migrate_content.py

The method above includes the following steps:

  1. Creation of the dashboard element in the migrated dashboard in the original Looker instance.
  2. Identification of the dashboard layout component for the new dashboard element.
  3. If the layout component is located, it is updated to position the new element at the top of the dashboard.

This process generates a button that appears as shown below.

[Screenshot: the migration notice button displayed at the top of a dashboard in the original instance]

User communication

Effective communication with users is one of the most critical aspects of a successful Looker instance split. Since the process can disrupt users' workflows, keeping them informed about the migration details is essential. Here are a few crucial steps that helped us ensure a smooth transition:

  • Announce the migration in advance: Inform users well ahead of time to allow them to prepare for the upcoming changes.
  • Implement a code/content freeze period: Schedule a freeze period during the migration to prevent users from updating dashboards in the original instance while the content is being migrated to the new environment.
  • Establish a grace period: Allow users to access both the original and new instances simultaneously, giving them time to update bookmarks and verify that everything is functioning correctly in the new environment. During the grace period, it is crucial to limit user access in the original environment to view-only permissions. This ensures that content cannot be modified in the original environment, allowing any new changes to be deployed exclusively in the new environment (a sketch of this restriction follows this list).
  • Be transparent about the implications: Clearly communicate any changes, such as saved Explore queries not being transferred, changes to Looker instance URLs, and the need for Looker API users to update their instance references.
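The view-only restriction itself can be applied programmatically. A hypothetical sketch, assuming a "Viewer" role exists on the original instance and migrated_user_ids comes from your migration records:

# Restrict migrated users to view-only access on the original instance
viewer_role = source_sdk.search_roles(name="Viewer")[0]

for user_id in migrated_user_ids:
    # Replace all of the user's roles with the single view-only role
    source_sdk.set_user_roles(user_id=user_id, body=[viewer_role.id])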

Conclusion

Managing multiple production Looker instances necessitates increased maintenance and monitoring efforts from the team responsible for overseeing the Looker environment and incurs additional costs. However, this setup offers the benefit of improved performance for heavily utilized Looker instances. Notably, we have observed a more than threefold reduction in the average number of query errors following the migration, which can be attributed to content pruning during the migration process. Furthermore, the high density of scheduled jobs has significantly decreased in both instances, as the schedules were effectively distributed across multiple instances. The delays experienced when searching for dashboards and Explores have also noticeably decreased.