Streaming Data to and from Google Cloud Storage

One great feature of Google Cloud Storage is its ability to support streaming uploads and downloads. This lets us upload data of arbitrary length without having to write it to disk first. It can be a great option in a variety of scenarios, like archiving tweets to Cloud Storage or saving data being scraped from the internet by a bot. It is also very useful when you plan to process your data later and don’t want to store it in a structured format right away.

To test this out, I wrote a simple Python program that prints “Hello World !” to the console a million times.

# sayhello.py
for i in range(10000 * 100):
    print(i, "Hello World !")

I created a bucket on Cloud Storage where I could stream the output of sayhello.py using this simple command:

python sayhello.py | gsutil cp - gs://mybucket/helloworld.dat

[Screenshot: the uploaded file in my bucket]
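If you prefer to drive the pipeline from Python itself rather than the shell, the same streaming upload can be sketched by wrapping gsutil in a subprocess. This is just a sketch, reusing the placeholder bucket and object names from above and assuming gsutil is installed and authenticated:

# stream_upload.py: a sketch of piping into `gsutil cp -` from Python
import subprocess

proc = subprocess.Popen(
    ["gsutil", "cp", "-", "gs://mybucket/helloworld.dat"],
    stdin=subprocess.PIPE,
)
for i in range(10000 * 100):
    proc.stdin.write(f"{i} Hello World !\n".encode())
proc.stdin.close()  # a clean EOF tells gsutil the stream is complete
proc.wait()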

Another great thing you can do is compress the stream before writing it to Cloud Storage. It is important to use a compression method that does not corrupt the data if our stream ends abruptly (due to an exception or error). This is where the ncompress utility for Linux comes into the picture. Per its documentation, it performs LZW compression on the data. You can install it with the command below.

sudo apt-get install ncompress

The output of our Python program can simply be piped into the ‘compress’ command, which in turn can be piped into gsutil. gsutil natively supports uploading and downloading data as a stream to and from Google Cloud Storage.

python sayhello.py | compress | gsutil cp - gs://mybucket/helloworld.zzz
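As an aside, the same chunk-at-a-time idea works inside Python too, using the standard library’s zlib module. Note this is only a sketch, and it emits DEFLATE output rather than the LZW format that ‘compress’ writes, so ‘uncompress’ will not read it back:

# compresshello.py: incremental compression, nothing buffered in full
import sys
import zlib

compressor = zlib.compressobj()
for i in range(10000 * 100):
    chunk = f"{i} Hello World !\n".encode()
    sys.stdout.buffer.write(compressor.compress(chunk))
sys.stdout.buffer.write(compressor.flush())  # emit whatever is still pending

Its output can be piped into gsutil exactly like before, e.g. python compresshello.py | gsutil cp - gs://mybucket/helloworld.zz (the object name here is just illustrative).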

If we check the size of the file named helloworld.zzz, we can see that it is almost ten times smaller than the uncompressed object.
[Screenshot: the uploaded compressed file]

To retrieve our data, we can simply download the file as a stream and pipe it to ‘uncompress’.

gsutil cp gs://mybucket/helloworld.zzz - | uncompress > data.txt

[Screenshot: tail of the uncompressed file data.txt]
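The download direction can be scripted the same way. Here is a sketch that consumes the uncompressed object (helloworld.dat from earlier) as a stream and counts its lines without ever touching the disk:

# stream_download.py: a sketch; lines are processed as they arrive
import subprocess

proc = subprocess.Popen(
    ["gsutil", "cp", "gs://mybucket/helloworld.dat", "-"],
    stdout=subprocess.PIPE,
)
count = 0
for line in proc.stdout:
    count += 1
proc.wait()
print(count, "lines streamed from the bucket")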

Note: This approach works very well in the case of unhandled exceptions in the source program (‘sayhello.py’ in our case). I also confirmed that even if the process running sayhello.py is killed, the data written before it went down is persisted in Cloud Storage.

However, if you kill the pipeline itself with CTRL+C, nothing gets stored in the cloud. It looks like gsutil needs a clean exit for the data to be saved.
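If you do want CTRL+C to leave you with a saved object, one possible workaround (my assumption, not documented gsutil behaviour) is to run gsutil in its own session so the terminal’s SIGINT only reaches the producer, then close the pipe cleanly:

# interruptible_upload.py: a sketch of getting a clean gsutil exit on CTRL+C
import subprocess

proc = subprocess.Popen(
    ["gsutil", "cp", "-", "gs://mybucket/helloworld.dat"],
    stdin=subprocess.PIPE,
    start_new_session=True,  # keep the terminal's SIGINT away from gsutil
)
try:
    i = 0
    while True:  # genuinely indefinite this time
        proc.stdin.write(f"{i} Hello World !\n".encode())
        i += 1
except KeyboardInterrupt:
    pass  # the producer was interrupted, but gsutil is still running
finally:
    proc.stdin.close()  # clean EOF lets gsutil finalize the upload
    proc.wait()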

Streaming transfers are also supported in the client libraries for most languages (except Python and Ruby). Check out the official documentation here:
https://cloud.google.com/storage/docs/streaming#storage-stream-upload-object-php