I am trying to calculate the cost of each BigQuery query run in my organization, and I am looking for an exact formula to calculate the cost accurately. Can anybody help me with this?
BigQuery has two compute pricing models: on-demand pricing, billed per byte of data processed, and capacity pricing, billed per slot-hour of reserved compute. To calculate the cost of a query, you need to know which pricing model applies to your project and, depending on the model, either the amount of data the query processes or the slot time it consumes.
On-demand pricing formula:
Query cost = (Data processed in TB) * price per TB (e.g., $5 per TB; check the current pricing page for your region)
Capacity pricing formula:
Query cost = (Query slots used) * (Price per slot-hour) * (Query duration in hours)
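As a rough sketch, both formulas can be expressed in Java. The $5 per TB on-demand rate and the $0.0768 per slot-hour capacity rate below are illustrative assumptions; check the current pricing page for your region and commitment:

```java
public class BigQueryCostEstimate {
    // Illustrative rates; confirm current pricing for your region and commitment
    static final double ON_DEMAND_USD_PER_TB = 5.0;
    static final double SLOT_HOUR_PRICE_USD = 0.0768;

    // On-demand: bytes processed converted to TB (10^12 bytes), times price per TB
    static double onDemandCost(long bytesProcessed) {
        return (bytesProcessed / 1e12) * ON_DEMAND_USD_PER_TB;
    }

    // Capacity: slot-milliseconds converted to slot-hours, times price per slot-hour
    static double capacityCost(long totalSlotMs) {
        return (totalSlotMs / 3_600_000.0) * SLOT_HOUR_PRICE_USD;
    }

    public static void main(String[] args) {
        System.out.printf("On-demand, 2 TB scanned: $%.4f%n", onDemandCost(2_000_000_000_000L));
        System.out.printf("Capacity, 2 slot-hours:  $%.4f%n", capacityCost(7_200_000L));
    }
}
```

Note the capacity formula is equivalent to (slots used) × (price per slot-hour) × (duration in hours), since total slot-milliseconds already folds slots and duration together.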
Here are some tips on how to optimize BigQuery costs: use `SELECT *` judiciously and select only the columns you need, since on-demand billing is based on the bytes your query reads; a dry run lets you preview those bytes before running a query.
Hi,
Can we calculate this query cost in a Java program before actual query execution, i.e., with a dry run?
If yes, how can we do that?
Yes, you can calculate the estimated cost of a BigQuery query in a Java program before actual execution by performing a dry run. A dry run in BigQuery allows you to validate the query and estimate the cost without running the query against the data.
Here’s how you can calculate the query cost using a dry run in Java:
Steps to Perform a Dry Run in Java
Set Up Your Google Cloud Project: Ensure you have the necessary credentials and project setup.
Use the BigQuery Java Client Library: The com.google.cloud:google-cloud-bigquery
library allows you to interact with BigQuery.
Prepare Your Query: Write the SQL query you want to estimate the cost for.
Perform a Dry Run: Use the dry run option to get the estimated bytes processed.
Calculate the Cost: Use the bytes processed from the dry run to estimate the cost using the pricing model.
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryException;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.JobStatistics;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.QueryParameterValue;

public class BigQueryDryRunExample {
    public static void main(String[] args) {
        // Initialize BigQuery service
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

        // SQL query
        String query = "SELECT * FROM `your-project.your-dataset.your-table` WHERE condition = @value";

        // Configure the dry run query
        QueryJobConfiguration queryConfig = QueryJobConfiguration.newBuilder(query)
                .addNamedParameter("value", QueryParameterValue.string("your-condition"))
                .setDryRun(true) // Enable dry run
                .build();

        try {
            // Create the dry-run job. Dry-run jobs complete immediately and have no
            // job ID, so do not call waitFor(); the statistics are already populated.
            Job job = bigquery.create(JobInfo.of(queryConfig));

            // Get the estimated bytes processed
            JobStatistics.QueryStatistics stats = job.getStatistics();
            long estimatedBytesProcessed = stats.getTotalBytesProcessed();

            // Calculate cost: convert bytes to TiB (2^40 bytes), times price per TB
            double costPerTB = 5.0; // Cost per TB in USD; check current pricing for your region
            double cost = (estimatedBytesProcessed / (double) (1L << 40)) * costPerTB;
            System.out.printf("Estimated cost: $%.6f%n", cost);
        } catch (BigQueryException e) {
            // An invalid query fails at dry-run time
            System.out.println("Error: " + e.getMessage());
        }
    }
}
Initialize BigQuery Service: Use BigQueryOptions.getDefaultInstance().getService() to initialize the BigQuery client.
Prepare Query Configuration: QueryJobConfiguration is used to define the query and enable the dry run.
Run Query: bigquery.create(JobInfo.of(queryConfig)) executes the query with dry run enabled.
Retrieve Statistics: Use job.getStatistics() to get JobStatistics.QueryStatistics, which contains the estimated bytes processed.
Calculate Cost: Convert bytes to terabytes and multiply by the cost per TB (e.g., $5 per TB; check the current pricing page).
Additional Considerations
QueryParameterValue helps you handle SQL parameters safely.
Dependencies
Ensure you include the Google Cloud BigQuery client library in your Maven or Gradle build file. For Maven:
<dependency>
  <groupId>com.google.cloud</groupId>
  <artifactId>google-cloud-bigquery</artifactId>
  <version>2.33.1</version>
</dependency>
Remember to replace 2.33.1 with the latest available version.
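For Gradle, the equivalent declaration would be (same Maven Central coordinates; again, replace the version with the latest release):

```groovy
// build.gradle
dependencies {
    implementation 'com.google.cloud:google-cloud-bigquery:2.33.1'
}
```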
Hi @ms4446 ,
Thanks for the quick reply.
I have the below queries, if you can answer:
1. I need the cost in euros, so I assume I need the cost per TB in euros. Can we get that through an API, so that I don't have to update it whenever the rate changes?
2. JobStatistics.QueryStatistics contains some fields/methods:
public Integer getBillingTier()
public Long getTotalBytesBilled()
public Long getTotalBytesProcessed()
public Long getEstimatedBytesProcessed()
public Long getTotalSlotMs()
public List<QueryStage> getQueryPlan()
Currently I am getting null values. What access do I need to grant my service account so that I receive actual values?
3. Also, given both getTotalBytesBilled() and getTotalBytesProcessed(), which one is advisable for calculating query cost?
Thanks in advance.
1. Cost in Euros and Currency Conversion through API
To calculate the cost in euros, you need to convert the cost from USD to EUR using the current exchange rate. Unfortunately, BigQuery does not provide cost calculations in different currencies directly through its API. You will need to use an external currency conversion API to get the current exchange rate and apply it to the cost in USD.
Here’s how you can handle it:
Fetch Exchange Rates: Use an external API like ExchangeRate-API or Open Exchange Rates to get the current USD to EUR conversion rate.
Apply Conversion: Convert the cost from USD to EUR using the fetched exchange rate.
Here's how you can integrate currency conversion:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import org.json.JSONObject;

public class CurrencyConverter {
    public static double getUsdToEurRate() throws Exception {
        // The endpoint and API-key handling below are illustrative; check your
        // exchange-rate provider's documentation for the exact URL and auth scheme.
        String apiKey = "YOUR_API_KEY"; // Replace with your API key
        URL url = new URL("https://api.exchangerate-api.com/v4/latest/USD?apiKey=" + apiKey);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");

        BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()));
        String inputLine;
        StringBuilder content = new StringBuilder();
        while ((inputLine = in.readLine()) != null) {
            content.append(inputLine);
        }
        in.close();

        JSONObject json = new JSONObject(content.toString());
        return json.getJSONObject("rates").getDouble("EUR");
    }

    public static void main(String[] args) {
        try {
            double usdToEurRate = getUsdToEurRate();
            System.out.printf("Current USD to EUR rate: %.4f\n", usdToEurRate);

            // Assume costInUsd is calculated from the dry run
            double costInUsd = 0.00488;
            double costInEur = costInUsd * usdToEurRate;
            System.out.printf("Estimated cost in EUR: %.4f\n", costInEur);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
2. Access Permissions for Job Statistics
To get the values for the JobStatistics.QueryStatistics fields, your service account needs appropriate permissions. Here are the key permissions and roles required:
Permissions:
bigquery.jobs.list
bigquery.jobs.get
bigquery.jobs.create
bigquery.readsessions.create
bigquery.datasets.get
bigquery.tables.get
Roles:
roles/bigquery.jobUser (to allow creating and listing jobs)
roles/bigquery.dataViewer (to allow reading datasets and tables)
Granting Permissions
Use the Google Cloud Console or the gcloud command-line tool to grant permissions:
Via Console:
Go to the IAM & Admin page in the Google Cloud Console.
Select your project.
Click on the service account.
Click "Edit" and then "Add another role."
Add the BigQuery Job User and BigQuery Data Viewer roles.
Via gcloud: run `gcloud projects add-iam-policy-binding PROJECT_ID --member="serviceAccount:SA_EMAIL" --role="roles/bigquery.jobUser"`, then repeat with `--role="roles/bigquery.dataViewer"`, replacing PROJECT_ID and SA_EMAIL with your project ID and service account email.
3. TotalBytesBilled vs TotalBytesProcessed
getTotalBytesProcessed(): This field shows the amount of data your query processes. It’s useful for understanding the scope of the query but does not account for free tier or billing policies.
getTotalBytesBilled(): This field indicates the amount of data billed for the query. It includes only the data that counts towards billing, accounting for any free tier, rounding policies, or minimum billing requirements.
Which to Use for Cost Calculation:
Use TotalBytesBilled: For accurate cost calculations, use TotalBytesBilled because it reflects the actual data you will be billed for after considering the free tier and billing policies.
Example of Retrieving Job Statistics
Here’s how you might retrieve and use these statistics in Java:
Job job = bigquery.create(JobInfo.of(queryConfig));
job = job.waitFor();

if (job != null && job.getStatus().getError() == null) {
    JobStatistics.QueryStatistics stats = job.getStatistics();
    Long totalBytesBilled = stats.getTotalBytesBilled();
    Long totalBytesProcessed = stats.getTotalBytesProcessed();
    System.out.printf("Total Bytes Processed: %d\n", totalBytesProcessed);
    System.out.printf("Total Bytes Billed: %d\n", totalBytesBilled);

    // Calculate cost in USD
    double costPerTB = 5.0; // USD
    double costInUsd = (totalBytesBilled / (double) (1L << 40)) * costPerTB;
    System.out.printf("Estimated cost in USD: %.6f\n", costInUsd);
    // Convert to EUR as needed
} else if (job == null) {
    // Guard against NullPointerException: waitFor() returns null if the job no longer exists
    System.out.println("Job no longer exists");
} else {
    System.out.println("Error: " + job.getStatus().getError().toString());
}
Hi @ms4446 My company is using the Capacity pricing model, purchasing slots for BigQuery. I'm calculating the cost of each query so that everyone on my team can review their query performance. I checked the INFORMATION_SCHEMA.JOBS table using the total_slot_ms column with the formula ROUND((total_slot_ms / 3600000) * hour_slot_price, 2), but this only accounts for about a quarter of the daily cost. However, using ROUND((total_bytes_billed / POW(10, 12)) * 6.25, 2) gives a number that is closer to the actual cost.
SELECT
  DATE(creation_time) AS creation_date,
  job_id,
  user_email,
  creation_time,
  start_time,
  end_time,
  total_bytes_processed,
  total_bytes_billed,
  ROUND((total_bytes_billed / POW(10, 12)) * 6.25, 2) AS data_processed_cost_usd,
  total_slot_ms,
  ROUND((total_slot_ms / 3600000) * 0.0768, 2) AS slot_time_cost_usd,
  SAFE_DIVIDE(total_slot_ms, TIMESTAMP_DIFF(end_time, start_time, MILLISECOND)) AS avg_slots,
  query
FROM
  `region-eu`.INFORMATION_SCHEMA.JOBS
WHERE 1 = 1
  AND job_type = 'QUERY'
  AND state = 'DONE'
  AND creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY) -- Last 7 days
ORDER BY
  slot_time_cost_usd DESC
Do you have any recommendations? Should we continue using total_bytes_billed to calculate the cost of each query even though we are on Capacity pricing? Many thanks!
Since your company is using the Capacity pricing model by purchasing slots, the cost associated with each query should ideally reflect the consumption of those slots rather than data processed or bytes billed, which is more relevant to the On-Demand pricing model.
The total_slot_ms metric, available in the INFORMATION_SCHEMA.JOBS table, is the most relevant measure of how many slots a query consumes and should be the primary factor in cost calculations. To determine the cost per query, use the formula:
Cost per query = (total_slot_ms / 3,600,000) × hour_slot_price
This formula directly correlates the time slots are in use with your committed costs. If you find that the sum of these calculated costs does not match the total daily costs, it could be due to idle slot time or other system overheads not directly attributed to specific queries.
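To illustrate that gap, here is a hypothetical reconciliation in Java; the slot price, commitment size, and per-query slot times are all made-up numbers:

```java
public class SlotCostReconciliation {
    // Per-query cost under capacity pricing: slot-ms -> slot-hours, times slot-hour price
    static double perQueryCost(long totalSlotMs, double hourSlotPrice) {
        return (totalSlotMs / 3_600_000.0) * hourSlotPrice;
    }

    public static void main(String[] args) {
        double hourSlotPrice = 0.0768; // hypothetical rate
        long committedSlots = 100;     // hypothetical commitment

        // You pay for the commitment around the clock, whether or not queries run
        double dailyCommittedCost = committedSlots * 24 * hourSlotPrice;

        // Hypothetical slot time consumed by the day's queries
        long[] querySlotMs = {7_200_000L, 3_600_000L, 1_800_000L};
        double attributed = 0;
        for (long ms : querySlotMs) {
            attributed += perQueryCost(ms, hourSlotPrice);
        }

        // The remainder is idle or otherwise unattributed slot time
        System.out.printf("Committed: $%.2f, attributed to queries: $%.4f, unattributed: $%.4f%n",
                dailyCommittedCost, attributed, dailyCommittedCost - attributed);
    }
}
```

In a setup like this, summing per-query costs will always undershoot the daily bill; the difference is the cost of the commitment sitting idle.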
While total_bytes_billed is traditionally used in On-Demand pricing models, where costs are based on data processed, it can still serve as a supplementary metric under Capacity pricing. It helps assess query efficiency by comparing data processed against slot usage. A query that processes large amounts of data but consumes fewer slots might indicate efficient usage, even under Capacity pricing.
However, the most critical aspect of managing costs under this pricing model is optimizing slot utilization. Unused slots still contribute to the overall costs, so maximizing their usage across queries is essential. Tools like BigQuery's reservation management can help in better allocating slots across teams or projects, ensuring that your committed resources are fully utilized.
In conclusion, for accurate cost tracking under Capacity pricing, focus on total_slot_ms for cost calculation while using total_bytes_billed as a secondary efficiency measure. This approach will provide a comprehensive understanding of your BigQuery usage and help in optimizing costs effectively.
Many thanks @ms4446 It's clear to me now.