GCSFS
A pythonic file-system interface to Google Cloud Storage.
Please file issues and feature requests on GitHub; pull requests are welcome.
This package depends on fsspec, and inherits many useful behaviours from there, including integration with Dask, and the facility for key-value dict-like objects of the type used by zarr.
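For instance, a dict-like mapping suitable for zarr can be created with the fsspec get_mapper method (a brief sketch; the project and path below are placeholders):

import gcsfs

fs = gcsfs.GCSFileSystem(project='my-google-project')
store = fs.get_mapper('my-bucket/zarr-store')  # dict-like key-value view of the prefix

# zarr, if installed, can read and write arrays directly through this mapping:
# import zarr
# z = zarr.open(store, mode='r')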
Installation
The GCSFS library can be installed using conda:
conda install -c conda-forge gcsfs
or pip:
pip install gcsfs
or by cloning the repository:
git clone https://github.com/fsspec/gcsfs/
cd gcsfs/
pip install .
Examples
Locate and read a file:
>>> import gcsfs
>>> fs = gcsfs.GCSFileSystem(project='my-google-project')
>>> fs.ls('my-bucket')
['my-file.txt']
>>> with fs.open('my-bucket/my-file.txt', 'rb') as f:
...     print(f.read())
b'Hello, world'
Read with delimited blocks:
>>> fs.read_block(path, offset=1000, length=10, delimiter=b'\n')
b'A whole line of text\n'
Write with blocked caching:
>>> with fs.open('mybucket/new-file', 'wb') as f:
...     f.write(2*2**20 * b'a')
...     f.write(2*2**20 * b'a')  # data is flushed and file closed
>>> fs.du('mybucket/new-file')
{'mybucket/new-file': 4194304}
Because GCSFS faithfully copies the Python file interface, it can be used smoothly with other projects that consume the file interface, like gzip or pandas:
>>> import gzip
>>> import pandas as pd
>>> with fs.open('mybucket/my-file.csv.gz', 'rb') as f:
...     g = gzip.GzipFile(fileobj=f)  # Decompress data with gzip
...     df = pd.read_csv(g)           # Read CSV file with Pandas
Credentials
Several modes of authentication are supported:
- If token=None (default), GCSFS will attempt to use your default gcloud credentials, or attempt to get credentials from the google metadata service, or fall back to anonymous access. This will work for most users without further action. Note that the default project may also be found, but it is often best to supply this anyway (it only affects bucket-level operations).
- If token='cloud', we assume we are running within google (compute or container engine) and fetch the credentials automatically from the metadata service.
- If token=dict(...) or token=<filepath>, you may supply a token generated by the gcloud utility. This can be:
  - a python dictionary
  - the path to a file containing the JSON returned by logging in with the gcloud CLI tool (e.g., ~/.config/gcloud/application_default_credentials.json or ~/.config/gcloud/legacy_credentials/<YOUR GOOGLE USERNAME>/adc.json)
  - the path to a service account key
  - a google.auth.credentials.Credentials object
  Note that ~ will not be automatically expanded to the user home directory; it must be expanded manually with a utility like os.path.expanduser().
- You can also generate tokens via OAuth2 in the browser using token='browser', which gcsfs then caches in a special file, ~/.gcs_tokens, and can subsequently be accessed with token='cache'.
- Anonymous-only access can be selected using token='anon', e.g. to access public resources such as ‘anaconda-public-data’.
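As an illustrative sketch of a few of these modes (the project name and key-file path below are placeholders):

import os.path
import gcsfs

# Anonymous access to public data
fs_anon = gcsfs.GCSFileSystem(token='anon')

# Default behaviour (token=None): gcloud credentials, then the metadata
# service, then anonymous access
fs = gcsfs.GCSFileSystem(project='my-google-project')

# A service account key file; note that '~' must be expanded manually
fs_sa = gcsfs.GCSFileSystem(
    project='my-google-project',
    token=os.path.expanduser('~/keys/service-account.json'),
)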
The acquired session tokens are not preserved when serializing the instances, so it is safe to pass them to worker processes on other machines if using in a distributed computation context. If credentials are given by a file path, however, then this file must exist on every machine.
Integration
The libraries intake, pandas and dask accept URLs with the prefix “gcs://”, and will use gcsfs to complete the IO operation in question. The IO functions take an argument storage_options, which will be passed to GCSFileSystem, for example:
df = pd.read_excel("gcs://bucket/path/file.xls",
                   storage_options={"token": "anon"})
This makes it possible to pass any credentials or other arguments that gcsfs needs.
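The same pattern works with dask; for instance (the bucket path below is a placeholder):

import dask.dataframe as dd

# storage_options is forwarded to GCSFileSystem, here requesting anonymous access
ddf = dd.read_csv("gcs://bucket/path/*.csv",
                  storage_options={"token": "anon"})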
Async
gcsfs is implemented using aiohttp, and offers async functionality. A number of methods of GCSFileSystem are async; for each of these, there is also a synchronous version with the same name but without the “_” prefix.
If you wish to call gcsfs from async code, then you should pass asynchronous=True, loop=loop to the constructor (the latter is optional, if you wish to use both async and sync methods). You must also explicitly await the client creation before making any GCS call.
import asyncio
from gcsfs import GCSFileSystem

async def run_program():
    gcs = GCSFileSystem(asynchronous=True)
    print(await gcs._ls(""))

asyncio.run(run_program())  # or call from your async code
Concurrent async operations are also used internally for bulk operations such as pipe/cat, get/put, cp/mv/rm. The async calls are hidden behind a synchronisation layer, so they are designed to be called from normal code. If you are not using async-style programming, you do not need to know how this works, but you might find the implementation interesting.
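For example, a single synchronous call can fetch several objects concurrently (the paths below are placeholders):

# cat() accepts a list of paths and returns a dict mapping path -> bytes;
# the downloads run concurrently under the hood
data = fs.cat(['my-bucket/a.txt', 'my-bucket/b.txt'])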
For every synchronous function there is an asynchronous one prefixed by “_”, but the open operation does not support async operation. If you need to open a file asynchronously, it is better to download it asynchronously to a temporary location and work with it from there.
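A minimal sketch of that pattern, using the async _get method and a placeholder object name:

import asyncio
import os
import tempfile
from gcsfs import GCSFileSystem

async def read_remote(path):
    gcs = GCSFileSystem(asynchronous=True)
    tmpdir = tempfile.mkdtemp()
    local = os.path.join(tmpdir, os.path.basename(path))
    # _get is the async counterpart of get(): it downloads the object
    # to a local file without blocking the event loop
    await gcs._get(path, local)
    with open(local, 'rb') as f:
        return f.read()

data = asyncio.run(read_remote('my-bucket/my-file.txt'))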
Proxy
gcsfs uses aiohttp for calls to the storage API, which by default ignores HTTP_PROXY/HTTPS_PROXY environment variables. To read proxy settings from the environment, provide session_kwargs as follows:
fs = GCSFileSystem(project='my-google-project', session_kwargs={'trust_env': True})
For further reference, see the aiohttp documentation on proxy support.
Contents
- API
- For Developers
- GCSFS and FUSE
- Changelog
  - 2025.3.0
  - 2025.2.0
  - 2024.12.0
  - 2024.10.0
  - 2024.9.0
  - 2024.6.1
  - 2024.6.0
  - 2024.5.0
  - 2024.3.1
  - 2024.2.0
  - 2023.12.2
  - 2023.12.1
  - 2023.12.0
  - 2023.10.0
  - 2023.9.2
  - 2023.9.1
  - 2023.9.0
  - 2023.6.0
  - 2023.5.0
  - 2023.4.0
  - 2023.3.0
  - 2023.1.0
  - 2022.11.0
  - 2022.10.0
  - 2022.8.1
  - 2022.7.1
  - 2022.5.0
  - 2022.3.0
  - 2022.02.0
  - 2022.01.0
  - 2021.11.1
  - 2021.11.0
  - 2021.10.1
  - 2021.10.0
  - 2021.09.0
  - 2021.08.1
  - 2021.07.0
  - 2021.06.1
  - 2021.06.0
  - 2021.05.0
  - Version 2021.04.0
  - Version 0.8.0
  - Version 0.7.0
  - Version 0.6.0
  - Version 0.5.3
  - Version 0.5.2
  - Version 0.5.1
  - Version 0.5.0
  - Version 0.4.0