
Cloud Run Job causes a memory leak when writing files with Filestore mounted with NFS.


I created a task that runs for over an hour as a Cloud Run job. The files it produces are 1-10 GB in size, and the job needs a lot of memory because files written to the local filesystem are held in the in-memory filesystem. So we provisioned Filestore, mounted it over NFS, and saved the files under /mnt/example, expecting the job to run with less memory.

However, the memory usage of the Cloud Run job still grew by roughly the size of the file, which was not what I expected.

(Screenshot: Cloud Run job memory usage graph, 2024-09-26)

 

How can I make the job write these large files while using less memory?

 

The following is a simplified version of the actual commands used:

 

gcloud run jobs create mnt-example \
    --project example \
    --region asia-northeast1 \
    --image asia-northeast1-docker.pkg.dev/example/example-app/mnt-example:latest \
    --add-volume name=nfs,type=nfs,location=10.118.2.2:/example \
    --add-volume-mount volume=nfs,mount-path=/mnt/example \
    --cpu 1 \
    --memory 1Gi \
    --tasks 1 \
    --task-timeout 3h
gcloud run jobs execute mnt-example \
    --project example \
    --region asia-northeast1

 

 

Hi @shikajiro,

The issue is that while the data is stored on Filestore, the application still needs to buffer the data in memory before writing it to the NFS mount. Writing large files directly to a network file system can be memory-intensive because the application typically needs to hold a significant portion of the file in memory before flushing it to disk. 

To reduce memory usage when writing large files to Filestore via NFS in Cloud Run Jobs, consider these strategies:

  • Buffered I/O and Streaming:
    • Use buffered I/O: Instead of writing to the file byte by byte, use buffered I/O operations. This minimizes the number of system calls, significantly improving performance and reducing memory usage. Most programming languages provide libraries to handle buffered I/O efficiently (e.g., BufferedWriter in Java, io.BufferedWriter in Python).
    • Streaming: Write data in smaller chunks or streams instead of loading the entire file into memory before writing. Process the data in manageable segments, writing each segment to the file and then releasing it from memory (see the sketch after this list).

  • Optimize the Application:
    • Code Review: Analyze your code to identify areas where large amounts of data are being held in memory unnecessarily. Optimize your data structures and algorithms to minimize memory footprint.
    • Asynchronous Writing: If possible, refactor your code to write to the file asynchronously. This allows other parts of your application to continue executing while data is being written, potentially improving overall performance and reducing memory pressure.

  • Consider Alternative Approaches:
    • Cloud Storage (GCS): If possible, consider using Cloud Storage (GCS) directly. GCS is designed for storing large files and is better optimized for this purpose than NFS over the network. You would write your files to GCS directly instead of to Filestore (a short upload sketch follows below).
    • Different File System: Although you're already using Filestore, consider other options depending on the requirements. If you need more write performance, you might investigate using a different Filestore instance type or explore using a different file system altogether.
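
To make the streaming point concrete, here is a minimal Python sketch. The mount path, chunk size, and the generate_chunks() data source are illustrative assumptions, not taken from your job; the idea is that only one chunk is ever resident in memory:

import os

MOUNT_PATH = "/mnt/example/output.bin"  # assumed output path on the NFS mount
CHUNK_SIZE = 1024 * 1024                # 1 MiB per chunk; tune to your workload

def generate_chunks():
    # hypothetical stand-in for whatever produces your data incrementally
    for _ in range(16):
        yield os.urandom(CHUNK_SIZE)

# open() in binary write mode returns an io.BufferedWriter; the buffering
# argument caps how much data Python holds before flushing to the mount
with open(MOUNT_PATH, "wb", buffering=CHUNK_SIZE) as f:
    for chunk in generate_chunks():
        f.write(chunk)        # one chunk in memory at a time
    f.flush()
    os.fsync(f.fileno())      # push dirty pages out to the NFS server

The same idea carries over to asynchronous writing: hand each chunk to a background thread or task that performs the write, so the producer never has to accumulate data while waiting on the NFS round trip.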

The key is to avoid loading the entire dataset into memory at once. Address the core problem by changing how the data is written to disk. Prioritize the first two points before considering increasing memory.
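
If you do end up evaluating Cloud Storage, a rough sketch of a streamed upload with a recent google-cloud-storage client could look like the following (the bucket and object names are placeholders, and stream_to_gcs is a hypothetical helper). The writer returned by blob.open("wb") buffers and uploads in chunks, so the whole file never has to sit in memory:

from google.cloud import storage

def stream_to_gcs(chunks, bucket_name="example-bucket", blob_name="output/result.bin"):
    # hypothetical helper: stream an iterable of byte chunks into a single GCS object
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(blob_name)
    with blob.open("wb") as writer:   # file-like writer; chunked upload
        for chunk in chunks:
            writer.write(chunk)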

I hope the above information is helpful.

 

@ChieKo 

 

Thanks for the reply.
I don't expect my implementation to use much memory when writing files, because it only appends a small amount of data to the same file many times.
 
Specifically, the code we are testing continuously saves Zoom call audio, and this function is called many times during the call. It never opens a large file, so it should not need much memory.

 

void ZoomSDKAudioRawDataDelegate::writeToFile(const string &path, AudioRawData *data)
{
    static std::ofstream file;
    file.open(path, std::ios::out | std::ios::binary | std::ios::app);

    if (!file.is_open())
        return Log::error("failed to open audio file path: " + path);

    file.write(data->GetBuffer(), data->GetBufferLen());

    file.flush();
    file.close();

    stringstream ss;
    ss << "Writing " << data->GetBufferLen() << "b to " << path << " at " << data->GetSampleRate() << "Hz";

    Log::info(ss.str());
}

 

 
We also tried mounting a GCS bucket, but abandoned that approach because the large number of writes ran into the GCS write rate limit.
 
Also, the link in the `Buffered I/O and Streaming` section was about Firestore, not Filestore.
 

Same problem here. o/

I have even tried to open (in append mode) and close the file on every chunk. To no avail.

    print(f"Downloading from {args.url}")
    response = requests.get(args.url, stream=True, headers=headers)
    print(f"{response.headers=}")
    if 200 <= response.status_code <= 299:
        print(f"Saving to {args.output}")
        total_size = int(response.headers.get("content-length", 0))
        chunk_size = 8192 * 1024
        with tqdm(
            total=total_size, unit="iB", unit_scale=True, mininterval=1.0
        ) as progress_bar:
            for chunk in response.iter_content(chunk_size=chunk_size):
                if chunk:
                    with open(args.output, "ab") as f:
                        f.write(chunk)
                        progress_bar.update(len(chunk))
                    # os.sync()
    else:
        print(f"Failed to download file: {response.status_code=}")