Hi,
1. I have a very large file stored in GCS that is too big to fit in memory at once. Is it possible to read the file line by line?
2. My GCS bucket is mounted to my Compute Engine instance using gcsfuse; does that affect the answer to the first question?
3. Does gcsfuse affect the speed/stability of reading/writing to GCS in general, compared to reading/writing with the storage client as follows?
from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.bucket(bucket_name)
blob = bucket.blob(blob_name)

# Mode can be specified as wb/rb for bytes mode.
# See: https://docs.python.org/3/library/io.html
with blob.open("w") as f:
    f.write("Hello world")

with blob.open("r") as f:
    print(f.read())
You can't quite read files line by line; you read them in chunks of fixed size. You probably don't want to read one line at a time anyway: the bigger the chunk, the less expensive the read and the higher the overall throughput. I believe blob.open("r") does this chunked buffering by default. See the chunk_size parameter docs for a description.
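If you read the object directly with the client library, a minimal sketch looks like this (the bucket and object names are placeholders, and the chunk_size value is just an illustrative choice, not a recommendation):

from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.bucket("my-bucket")       # placeholder bucket name
blob = bucket.blob("path/to/large-file.txt")      # placeholder object name

# blob.open("r") returns a buffered, file-like reader; iterating it
# yields lines while the transport downloads fixed-size chunks underneath.
with blob.open("r", chunk_size=16 * 1024 * 1024) as f:  # 16 MiB per fetch
    for line in f:
        print(line, end="")  # stand-in for real per-line processing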
Fuse should behave in a similar way. You configure this using the --sequential-read-size-mb setting when you mount a bucket.
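For example, a mount command with that flag set would look something like this (the 64 here is just an illustrative value):

gcsfuse --sequential-read-size-mb=64 my-bucket /path/to/mount/point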
Thanks for the reply! I am a bit confused about this: the --sequential-read-size-mb setting only applies when mounting a bucket, which is a one-time thing. How does that relate to my question, where the "chunk size" might change each time I read a file?
You can't change the chunk size every time you read the file. What happens under the covers is something like this: the client fetches chunk_size bytes from GCS, and then the Python layer feeds that to your code line by line (by finding line breaks). When you run out of bytes in the current chunk, the next chunk is downloaded. It's really the same way "normal" disk I/O works, but it's worth understanding that you download bytes in chunks.
In short, you should get the behavior you want by default.
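To make that concrete, here's a rough sketch of the mechanics, assuming only a file-like object with a read(n) method; the real client libraries do this for you, so this is purely illustrative:

def iter_lines(reader, chunk_size=1024 * 1024):
    """Yield complete lines, fetching chunk_size bytes per read."""
    leftover = ""
    while True:
        chunk = reader.read(chunk_size)  # one "download" of up to chunk_size bytes
        if not chunk:                    # end of file
            if leftover:
                yield leftover           # final line with no trailing newline
            return
        buffer = leftover + chunk
        *lines, leftover = buffer.split("\n")  # keep the partial last line buffered
        for line in lines:
            yield line + "\n"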
Got it, so just to make sure I understand correctly: if I have a file stored in GCS at path file_path, mounted via gcsfuse, I can write something like this and it will work as expected?
with open(file_path) as infile:
    for line in infile:
        print(line)
yes.
And also:
1. What is the default --sequential-read-size-mb if I did not set it explicitly?
2. I already mounted my bucket using
gcsfuse my-bucket /path/to/mount/point
without specifying any options. Is there a way to update the option after the bucket has already been mounted?
Looks like the default is 200MB. That will be useful in figuring out what value you want. I would be surprised if you could change this value without restarting the client (i.e. unmounting and remounting), but I don't know for certain. Here's hoping someone smarter picks this up.
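If remounting is indeed required, the usual pattern would be something like this (the 64 is just an illustrative value; fusermount -u is the standard way to unmount a FUSE filesystem on Linux):

fusermount -u /path/to/mount/point
gcsfuse --sequential-read-size-mb=64 my-bucket /path/to/mount/point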
Got it, thanks! Just to make sure I understand: with the following code, if file_path is on local disk, the memory high-water mark will just be that of a single line; but if file_path is in a directory mounted by gcsfuse, the high-water mark would be 200MB, since under the hood the client requests 200MB of data from GCS at a time?
with open(file_path) as infile:
    for line in infile:
        print(line)
That's what I would expect, although I can't say exactly how the local filesystem buffering works. It's likely going to be more than a single line, since you can't know how many bytes you need to read to find the next newline, but you can afford to read in much smaller chunks.
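If you're curious about the local side, here's a small sketch for inspecting and overriding Python's read buffer (io.DEFAULT_BUFFER_SIZE is only the fallback; the actual buffer is chosen from the device's block size, and the 1 MiB value below is just illustrative):

import io

print(io.DEFAULT_BUFFER_SIZE)  # CPython's fallback buffer size, typically 8192 bytes

# file_path as in the snippets above; buffering requests a specific buffer size
with open(file_path, "r", buffering=1024 * 1024) as infile:  # 1 MiB buffer
    for line in infile:
        print(line, end="")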
Thanks very much!