Accelerate AI/ML data loading with Hyperdisk ML

Autopilot Standard

This guide covers how to simplify and accelerate the loading of AI/ML model weights on Google Kubernetes Engine (GKE) using Hyperdisk ML. The Compute Engine Persistent Disk CSI driver is the primary way for you to access Hyperdisk ML storage with GKE clusters.

Overview

Hyperdisk ML is a high-performance storage solution that you can use to scale out your applications. It provides high aggregate throughput to many virtual machines concurrently, making it well suited to AI/ML workloads that need access to large amounts of data.

When enabled in read-only-many mode, Hyperdisk ML can accelerate the loading of model weights by up to 11.9X relative to loading directly from a model registry. This acceleration is made possible by the Google Cloud Hyperdisk architecture, which allows scaling to 2,500 concurrent nodes at 1.2 TB/s. This helps you achieve better load times and reduce Pod over-provisioning for your AI/ML inference workloads.

The high-level steps to create and use Hyperdisk ML are as follows:

- Pre-cache or hydrate data in a Persistent Disk disk image: Load Hyperdisk ML volumes with data from an external data source (for example, Gemma weights loaded from Cloud Storage) that can be used for serving. The Persistent Disk for the disk image must be compatible with Google Cloud Hyperdisk.
- Create a Hyperdisk ML volume using a pre-existing Google Cloud Hyperdisk: Create a Kubernetes volume that references the Hyperdisk ML volume loaded with data. Optionally, you can create multi-zone storage classes to ensure that your data is available in all zones where your Pods run.
- Create a Kubernetes Deployment to consume the Hyperdisk ML volume: Reference the Hyperdisk ML volume with accelerated data loading for your applications to consume.

Multi-zone Hyperdisk ML volumes

Hyperdisk ML disks are only available in a single zone. Optionally, you can use the Hyperdisk ML multi-zone feature to dynamically link multiple zonal disks that contain the same content into a single logical PersistentVolumeClaim and PersistentVolume. Zonal disks referenced by the multi-zone feature must be located in the same region. For example, if your regional cluster is created in us-central1, the multi-zone disks must be located in zones of that region (for example, us-central1-a and us-central1-b).
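
The same-region requirement can be checked before you fill in a list of zones. The following is a minimal sketch, assuming the usual Compute Engine naming scheme in which a zone name is the region name plus a one-letter suffix (for example, us-central1-a belongs to us-central1). The function names region_of_zone and zones_share_region are illustrative and are not part of any Google Cloud tool.

```shell
#!/bin/sh
# Derive the region from a zone name by dropping the final "-<suffix>".
# Assumption: zones are named REGION-LETTER, for example us-central1-a.
region_of_zone() {
  printf '%s\n' "${1%-*}"
}

# Return 0 if every zone passed as an argument belongs to the same region,
# and 1 as soon as one zone belongs to a different region.
zones_share_region() {
  first_region=$(region_of_zone "$1")
  for zone in "$@"; do
    if [ "$(region_of_zone "$zone")" != "$first_region" ]; then
      return 1
    fi
  done
  return 0
}

if zones_share_region us-central1-a us-central1-b; then
  echo "same region"
fi
```

A check like this only validates the names; it does not confirm that Hyperdisk ML is actually offered in those zones.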

A common use case for AI/ML inference is to run Pods across zones for improved accelerator availability and cost efficiency with Spot VMs. Because Hyperdisk ML is zonal, if your inference server runs many Pods across zones, GKE automatically clones the disks across zones to ensure that your data follows your application.

Figure: Hydration of Hyperdisk ML from external data sources and creation of a multi-zone PV for accessing the data across zones.

Multi-zone Hyperdisk ML volumes have the following limitations:

Volume resize and volume snapshot operations are not supported.
Multi-zone Hyperdisk ML volumes are only supported in read-only mode.
When using pre-existing disks with a multi-zone Hyperdisk ML volume, GKE does not check that the disk content is the same across zones. If any of the disks contain diverging content, make sure that your application takes potential inconsistency between zones into account.

To learn more, see Create a multi-zone ReadOnlyMany Hyperdisk ML volume from a VolumeSnapshot.

Before you begin

Before you start, make sure you have performed the following tasks:

Enable the Google Kubernetes Engine API.

If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running gcloud components update.
Note: For existing gcloud CLI installations, make sure to set the compute/region and compute/zone properties. By setting default locations, you can avoid errors in gcloud CLI like the following: One of [--zone, --region] must be supplied: Please specify location.

Set your default region and zone to one of the supported values.
Ensure that your Google Cloud project has sufficient quota to create the nodes used in this guide. The example code for GKE cluster and Kubernetes resource creation requires the following minimum quota in the region of your choice: 88 C3 CPUs and 8 NVIDIA L4 GPUs.

Requirements

To use Hyperdisk ML volumes in GKE, your clusters must meet the following requirements:

Use Linux clusters running GKE version 1.30.2-gke.1394000 or later. If you use a release channel, ensure that the channel has the minimum GKE version or later that is required for this driver.
Make sure that the Compute Engine Persistent Disk CSI driver is enabled. The driver is enabled by default on new Autopilot and Standard clusters and cannot be disabled or edited when using Autopilot. If you need to enable the Compute Engine Persistent Disk CSI driver on your cluster, see Enabling the Compute Engine Persistent Disk CSI Driver on an existing cluster.
If you want to tune the readahead value, use GKE version 1.29.2-gke.1217000 or later.
If you want to use the multi-zone dynamically provisioned feature, use GKE version 1.30.2-gke.1394000 or later.
Hyperdisk ML is only supported on certain node types and zones. To learn more, see About Google Cloud Hyperdisk in the Compute Engine documentation.

Get access to the model

To get access to the Gemma models for deployment to GKE, you must first sign the license consent agreement, and then generate a Hugging Face access token.

You must sign the consent agreement to use Gemma. Follow these instructions:

Access the model consent page on Kaggle.com.
Verify consent using your Hugging Face account.
Accept the model terms.

Generate an access token

To access the model through Hugging Face, you'll need a Hugging Face token.

Follow these steps to generate a new token if you don't have one already:

Click Your Profile > Settings > Access Tokens.
Select New Token.
Specify a Name of your choice and a Role of at least Read.
Select Generate a token.
Copy the generated token to your clipboard.

Create a GKE cluster

You can serve LLMs on GPUs in a GKE Autopilot or Standard cluster. We recommend that you use an Autopilot cluster for a fully managed Kubernetes experience. To choose the GKE mode of operation that's the best fit for your workloads, see Choose a GKE mode of operation.

Autopilot

In Cloud Shell, run the following command:
```
gcloud container clusters create-auto hdml-gpu-l4 \
  --project=PROJECT \
  --region=REGION \
  --release-channel=rapid \
  --cluster-version=1.30.2-gke.1394000
```
Replace the following values:
- PROJECT: the Google Cloud project ID.
- REGION: a region that supports the accelerator type you want to use, for example, us-east4 for L4 GPU.
GKE creates an Autopilot cluster with CPU and GPU nodes as requested by the deployed workloads.

Configure kubectl to communicate with your cluster:

gcloud container clusters get-credentials hdml-gpu-l4 \
  --region=REGION

Standard

In Cloud Shell, run the following command to create a Standard cluster and node pools:
```
gcloud container clusters create hdml-gpu-l4 \
    --location=REGION \
    --num-nodes=1 \
    --machine-type=c3-standard-44 \
    --release-channel=rapid \
    --cluster-version=CLUSTER_VERSION \
    --node-locations=ZONES \
    --project=PROJECT

gcloud container node-pools create gpupool \
    --accelerator type=nvidia-l4,count=2,gpu-driver-version=latest \
    --location=REGION \
    --project=PROJECT \
    --node-locations=ZONES \
    --cluster=hdml-gpu-l4 \
    --machine-type=g2-standard-24 \
    --num-nodes=2
```
Replace the following values:
- CLUSTER_VERSION: the version of your GKE cluster (for example, 1.30.2-gke.1394000).
- REGION: the compute region for the cluster control plane. The region must support the accelerator you want to use, for example, us-east4 for L4 GPU. Check the regions where L4 GPUs are available.
- ZONES: the zones in which nodes are created. You can specify as many zones as needed for your cluster. All zones must be in the same region as the cluster's control plane, specified by the --location flag. For zonal clusters, --node-locations must contain the cluster's primary zone.
- PROJECT: the Google Cloud project ID.
The cluster creation might take several minutes.
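
As a rough cross-check of the quota figures listed in Before you begin, the node counts in the commands above can be multiplied out. This sketch assumes that ZONES lists exactly two zones (the guide does not state the number) and relies on --num-nodes being a per-zone count for regional clusters and their node pools. The variable names are illustrative.

```shell
#!/bin/sh
zones=2                 # assumption: ZONES lists two zones
cpu_nodes_per_zone=1    # --num-nodes=1 on the cluster's default node pool
vcpus_per_cpu_node=44   # c3-standard-44 has 44 vCPUs
gpu_nodes_per_zone=2    # --num-nodes=2 on gpupool
gpus_per_gpu_node=2     # count=2 in the --accelerator flag

# Total resources = zones x nodes per zone x resources per node.
c3_vcpus=$((zones * cpu_nodes_per_zone * vcpus_per_cpu_node))
l4_gpus=$((zones * gpu_nodes_per_zone * gpus_per_gpu_node))

echo "C3 vCPUs: ${c3_vcpus}, L4 GPUs: ${l4_gpus}"
```

Under these assumptions the totals are 88 C3 vCPUs and 8 L4 GPUs. If you list a different number of zones, scale your quota request accordingly.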

Configure kubectl to communicate with your cluster:

gcloud container clusters get-credentials hdml-gpu-l4 \
    --location=REGION

Pre-cache data to a Persistent Disk disk image

To use Hyperdisk ML, you pre-cache data in a disk image, and create a Hyperdisk ML volume for read access by your workload in GKE. This approach (also called data hydration) ensures that your data is available when your workload needs it.

To copy the data from Cloud Storage to pre-cache a Persistent Disk disk image, follow these steps:

Create a StorageClass that supports Hyperdisk ML

Save the following StorageClass manifest in a file named hyperdisk-ml.yaml.

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
    name: hyperdisk-ml
parameters:
    type: hyperdisk-ml
    provisioned-throughput-on-create: "2400Mi"
provisioner: pd.csi.storage.gke.io
allowVolumeExpansion: false
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
mountOptions:
  - read_ahead_kb=4096

Create the StorageClass by running this command:
```
kubectl create -f hyperdisk-ml.yaml
```

Create a ReadWriteOnce (RWO) PersistentVolumeClaim

Save the following PersistentVolumeClaim manifest in a file named producer-pvc.yaml. You'll use the StorageClass you created earlier. Make sure that your disk has sufficient capacity to store your data.

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: producer-pvc
spec:
  storageClassName: hyperdisk-ml
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 300Gi

Create the PersistentVolumeClaim by running this command:
```
kubectl create -f producer-pvc.yaml
```

Create a Kubernetes Job to populate the mounted Google Cloud Hyperdisk volume

This section shows an example of creating a Kubernetes Job that provisions a disk and downloads the Gemma 7B instruction tuned model from Hugging Face onto the mounted Google Cloud Hyperdisk volume.

To access the Gemma LLM that the examples in this guide use, create a Kubernetes Secret that contains the Hugging Face token:
```
kubectl create secret generic hf-secret \
    --from-literal=hf_api_token=HF_TOKEN \
    --dry-run=client -o yaml | kubectl apply -f -
```
Replace HF_TOKEN with the Hugging Face token you generated earlier.

Save the following example manifest as producer-job.yaml:

apiVersion: batch/v1
kind: Job
metadata:
  name: producer-job
spec:
  template:  # Template for the Pods the Job will create
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cloud.google.com/compute-class
                operator: In
                values:
                - "Performance"
            - matchExpressions:
              - key: cloud.google.com/machine-family
                operator: In
                values:
                - "c3"
            - matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values:
                - "ZONE"
      containers:
      - name: copy
        resources:
          requests:
            cpu: "32"
          limits:
            cpu: "32"
        image: huggingface/downloader:0.17.3
        command: [ "huggingface-cli" ]
        args:
        - download
        - google/gemma-1.1-7b-it
        - --local-dir=/data/gemma-7b
        - --local-dir-use-symlinks=False
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: hf_api_token
        volumeMounts:
          - mountPath: "/data"
            name: volume
      restartPolicy: Never
      volumes:
        - name: volume
          persistentVolumeClaim:
            claimName: producer-pvc
  parallelism: 1         # Run one Pod at a time
  completions: 1         # The Job is done once one Pod completes successfully
  backoffLimit: 4        # Maximum number of retries on failure

Replace ZONE with the compute zone where you want the Hyperdisk to be created. If you're using it with the Deployment example, ensure it is a zone that has G2 machine capacity.

Create the Job by running this command:
```
kubectl apply -f producer-job.yaml
```
It might take a few minutes for the Job to finish copying data to the volume. When the Job finishes, its status is marked "Complete".
To check the status of your Job, run the following command:
```
kubectl get job producer-job
```
Once the Job is complete, you can clean up the Job by running this command:
```
kubectl delete job producer-job
```

Create a ReadOnlyMany Hyperdisk ML volume from a pre-existing Google Cloud Hyperdisk

This section covers the steps for creating a ReadOnlyMany (ROM) PersistentVolume and PersistentVolumeClaim pair from a pre-existing Google Cloud Hyperdisk volume. To learn more, see Using pre-existing persistent disks as PersistentVolumes.

In GKE version 1.30.2-gke.1394000 and later, GKE automatically converts the access mode of a READ_WRITE_SINGLE Google Cloud Hyperdisk volume to READ_ONLY_MANY.

If you are using a pre-existing Google Cloud Hyperdisk volume on an earlier version of GKE, you must modify the access mode manually by running the following command:
```
gcloud compute disks update HDML_DISK_NAME \
    --zone=ZONE \
    --access-mode=READ_ONLY_MANY
```
Replace the following values:
- HDML_DISK_NAME: the name of your Hyperdisk ML volume.
- ZONE: the compute zone where the pre-existing Google Cloud Hyperdisk volume is created.

Create a PersistentVolume and PersistentVolumeClaim pair, referencing the disk you previously populated.

Save the following manifest as hdml-static-pv.yaml:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: hdml-static-pv
spec:
  storageClassName: "hyperdisk-ml"
  capacity:
    storage: 300Gi
  accessModes:
    - ReadOnlyMany
  claimRef:
    namespace: default
    name: hdml-static-pvc
  csi:
    driver: pd.csi.storage.gke.io
    volumeHandle: projects/PROJECT/zones/ZONE/disks/DISK_NAME
    fsType: ext4
    readOnly: true
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: topology.gke.io/zone
          operator: In
          values:
          - ZONE
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  namespace: default
  name: hdml-static-pvc
spec:
  storageClassName: "hyperdisk-ml"
  volumeName: hdml-static-pv
  accessModes:
  - ReadOnlyMany
  resources:
    requests:
      storage: 300Gi

Replace the following values:

- PROJECT: the project where your GKE cluster is created.
- ZONE: the zone where the pre-existing Google Cloud Hyperdisk volume is created.
- DISK_NAME: the name of the pre-existing Google Cloud Hyperdisk volume.

Create the PersistentVolume and PersistentVolumeClaim resources by running this command:
```
kubectl apply -f hdml-static-pv.yaml
```

Create a multi-zone ReadOnlyMany Hyperdisk ML volume from a VolumeSnapshot

This section covers the steps for creating a multi-zone Hyperdisk ML volume in ReadOnlyMany access mode. You use a VolumeSnapshot for a pre-existing Persistent Disk disk image. To learn more, see Back up Persistent Disk storage using volume snapshots.

To create the multi-zone Hyperdisk ML volume, follow these steps:

Create a VolumeSnapshot of your disk

Save the following manifest as a file called disk-image-vsc.yaml.

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: disk-image-vsc
driver: pd.csi.storage.gke.io
deletionPolicy: Delete
parameters:
  snapshot-type: images

Create the VolumeSnapshotClass by running the following command:
```
kubectl apply -f disk-image-vsc.yaml
```

Save the following manifest as a file called my-snapshot.yaml. You'll reference the PersistentVolumeClaim you created earlier in Create a ReadWriteOnce (RWO) PersistentVolumeClaim.

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: my-snapshot
spec:
  volumeSnapshotClassName: disk-image-vsc
  source:
    persistentVolumeClaimName: producer-pvc

Create the VolumeSnapshot by running the following command:
```
kubectl apply -f my-snapshot.yaml
```

Before you create the Hyperdisk ML volume, run the following command to wait until the VolumeSnapshot is marked "Ready":

kubectl wait --for=jsonpath='{.status.readyToUse}'=true \
    --timeout=300s volumesnapshot my-snapshot

Create a multi-zone StorageClass

If you want copies of your data to be accessible in more than one zone, specify the enable-multi-zone-provisioning parameter in your StorageClass, which creates disks in the zones you specified in the allowedTopologies field.

To create the StorageClass, follow these steps:

Save the following manifest as a file called hyperdisk-ml-multi-zone.yaml.

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: hyperdisk-ml-multi-zone
parameters:
  type: hyperdisk-ml
  provisioned-throughput-on-create: "4800Mi"
  enable-multi-zone-provisioning: "true"
provisioner: pd.csi.storage.gke.io
allowVolumeExpansion: false
reclaimPolicy: Delete
volumeBindingMode: Immediate
allowedTopologies:
- matchLabelExpressions:
  - key: topology.gke.io/zone
    values:
    - ZONE_1
    - ZONE_2
mountOptions:
  - read_ahead_kb=8192

Replace ZONE_1, ZONE_2, ..., ZONE_N with the zones where your storage can be accessed.

This example sets the volumeBindingMode to Immediate, allowing GKE to provision the PersistentVolumeClaim prior to any consumer referencing it.

Create the StorageClass by running the following command:
```
kubectl apply -f hyperdisk-ml-multi-zone.yaml
```

Create a PersistentVolumeClaim that uses the multi-zone StorageClass

The next step is to create a PersistentVolumeClaim that references the StorageClass.

GKE uses the content of the specified disk image to automatically provision a Hyperdisk ML volume in each zone listed in the allowedTopologies field of your StorageClass.

To create the PersistentVolumeClaim, follow these steps:

Save the following manifest as a file called hdml-consumer-pvc.yaml.

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: hdml-consumer-pvc
spec:
  dataSource:
    name: my-snapshot
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
  - ReadOnlyMany
  storageClassName: hyperdisk-ml-multi-zone
  resources:
    requests:
      storage: 300Gi

Create the PersistentVolumeClaim by running the following command:
```
kubectl apply -f hdml-consumer-pvc.yaml
```

Create a Deployment to consume the Hyperdisk ML volume

When using Pods with PersistentVolumes, we recommend that you use a workload controller (such as a Deployment or StatefulSet).

If you want to use a pre-existing PersistentVolume in ReadOnlyMany mode with a Deployment, refer to Use persistent disks with multiple readers.

To create and test your Deployment, follow these steps:

Save the following example manifest as vllm-gemma-deployment.yaml.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-gemma-deployment
spec:
  replicas: 2
  selector:
    matchLabels:
      app: gemma-server
  template:
    metadata:
      labels:
        app: gemma-server
        ai.gke.io/model: gemma-7b
        ai.gke.io/inference-server: vllm
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - gemma-server
              topologyKey: topology.kubernetes.io/zone
      containers:
      - name: inference-server
        image: us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:latest
        resources:
          requests:
            cpu: "2"
            memory: "25Gi"
            ephemeral-storage: "25Gi"
            nvidia.com/gpu: 2
          limits:
            cpu: "2"
            memory: "25Gi"
            ephemeral-storage: "25Gi"
            nvidia.com/gpu: 2
        command: ["python3", "-m", "vllm.entrypoints.api_server"]
        args:
        - --model=$(MODEL_ID)
        - --tensor-parallel-size=2
        env:
        - name: MODEL_ID
          value: /models/gemma-7b
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
        - mountPath: /models
          name: gemma-7b
      volumes:
      - name: dshm
        emptyDir:
            medium: Memory
      - name: gemma-7b
        persistentVolumeClaim:
          claimName: CLAIM_NAME
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4
---
apiVersion: v1
kind: Service
metadata:
  name: llm-service
spec:
  selector:
    app: gemma-server
  type: ClusterIP
  ports:
    - protocol: TCP
      port: 8000
      targetPort: 8000

Replace CLAIM_NAME with one of these values:

- hdml-static-pvc: if you are using a Hyperdisk ML volume from a pre-existing Google Cloud Hyperdisk.
- hdml-consumer-pvc: if you are using a Hyperdisk ML volume from a VolumeSnapshot disk image.

Create the Deployment and Service from the manifest that you saved, and then wait for the inference server to become available:

kubectl apply -f vllm-gemma-deployment.yaml
kubectl wait --for=condition=Available --timeout=700s deployment/vllm-gemma-deployment

To test that your vLLM server is up and running, follow these steps:

Run the following command to set up port forwarding to the model:
```
kubectl port-forward service/llm-service 8000:8000
```

Run a curl command to send a request to the model:

USER_PROMPT="I'm new to coding. If you could only recommend one programming language to start with, what would it be and why?"

curl -X POST http://localhost:8000/generate \
-H "Content-Type: application/json" \
-d @- <<EOF
{
    "prompt": "<start_of_turn>user\n${USER_PROMPT}<end_of_turn>\n",
    "temperature": 0.90,
    "top_p": 1.0,
    "max_tokens": 128
}
EOF

The following output shows an example of the model response:

{"predictions":["Prompt:\n<start_of_turn>user\nI'm new to coding. If you could only recommend one programming language to start with, what would it be and why?<end_of_turn>\nOutput:\nPython is often recommended for beginners due to its clear, readable syntax, simple data types, and extensive libraries.\n\n**Reasons why Python is a great language for beginners:**\n\n* **Easy to read:** Python's syntax is straightforward and uses natural language conventions, making it easier for beginners to understand the code.\n* **Simple data types:** Python has basic data types like integers, strings, and lists that are easy to grasp and manipulate.\n* **Extensive libraries:** Python has a vast collection of well-documented libraries covering various tasks, allowing beginners to build projects without reinventing the wheel.\n* **Large supportive community:**"]}

Tune the readahead value

If you have workloads that perform sequential I/O, they may benefit from tuning the readahead value. This typically applies to inference or training workloads that need to load AI/ML model weights into memory. Most workloads with sequential I/O typically see a performance improvement with a readahead value of 1024 KB or higher.
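
Readahead is sometimes reported in 512-byte sectors rather than in KB (for example, by blockdev --getra on Linux), so it can help to convert between the two units when you compare values. The following is a minimal sketch; the function name readahead_kb_to_sectors is illustrative.

```shell
#!/bin/sh
# Convert a readahead value in KB to 512-byte sectors.
# 1 KB = 1024 bytes = 2 sectors of 512 bytes each.
readahead_kb_to_sectors() {
  echo $(( $1 * 1024 / 512 ))
}

readahead_kb_to_sectors 1024   # prints 2048
readahead_kb_to_sectors 4096   # prints 8192
```

On a Linux node, the current value in KB for a block device can typically be read from /sys/block/DEVICE/queue/read_ahead_kb, where DEVICE is a placeholder for the device name.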

Tune the readahead value for new volumes

You can specify this option by adding read_ahead_kb to the mountOptions field on your StorageClass. The following example shows how you can tune the readahead value to 4096 KB. This will apply to new dynamically provisioned PersistentVolumes created using the hyperdisk-ml StorageClass.

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
    name: hyperdisk-ml
parameters:
    type: hyperdisk-ml
provisioner: pd.csi.storage.gke.io
allowVolumeExpansion: false
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
mountOptions:
  - read_ahead_kb=4096

Tune the readahead value for existing volumes

For statically provisioned volumes, or pre-existing PersistentVolumes, you can specify this option by adding read_ahead_kb to the spec.mountOptions field. The following example shows how you can tune the readahead value to 4096 KB.

apiVersion: v1
kind: PersistentVolume
metadata:
  name: DISK_NAME
spec:
  accessModes:
  - ReadOnlyMany
  capacity:
    storage: 300Gi
  csi:
    driver: pd.csi.storage.gke.io
    fsType: ext4
    readOnly: true
    volumeHandle: projects/PROJECT/zones/ZONE/disks/DISK_NAME
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: topology.gke.io/zone
          operator: In
          values:
          - ZONE
  storageClassName: hyperdisk-ml
  mountOptions:
  - read_ahead_kb=4096

Replace the following values:

- PROJECT: the project where your GKE cluster is created.
- DISK_NAME: the name of the pre-existing Google Cloud Hyperdisk volume.
- ZONE: the zone where the pre-existing Google Cloud Hyperdisk volume is created.

Test and benchmark your Hyperdisk ML volume performance

This section shows how you can use Flexible I/O Tester (FIO) to benchmark the performance of your Hyperdisk ML volumes for reading pre-existing data. You can use these metrics to evaluate your volume's performance for specific workloads and configurations.

Save the following example manifest as benchmark-job.yaml:

apiVersion: batch/v1
kind: Job
metadata:
  name: benchmark-job
spec:
  template:  # Template for the Pods the Job will create
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cloud.google.com/compute-class
                operator: In
                values:
                - "Performance"
            - matchExpressions:
              - key: cloud.google.com/machine-family
                operator: In
                values:
                - "c3"

      containers:
      - name: fio
        resources:
          requests:
            cpu: "32"
        image: litmuschaos/fio
        args:
        - fio
        - --filename
        - /models/gemma-7b/model-00001-of-00004.safetensors:/models/gemma-7b/model-00002-of-00004.safetensors:/models/gemma-7b/model-00003-of-00004.safetensors:/models/gemma-7b/model-00004-of-00004.safetensors:/models/gemma-7b/model-00004-of-00004.safetensors
        - --direct=1
        - --rw=read
        - --readonly
        - --bs=4096k
        - --ioengine=libaio
        - --iodepth=8
        - --runtime=60
        - --numjobs=1
        - --name=read_benchmark
        volumeMounts:
        - mountPath: "/models"
          name: volume
      restartPolicy: Never
      volumes:
      - name: volume
        persistentVolumeClaim:
          claimName: CLAIM_NAME
  parallelism: 1         # Run one Pod at a time
  completions: 1         # The Job is done once one Pod completes successfully
  backoffLimit: 1        # Maximum number of retries on failure

Replace CLAIM_NAME with the name of your PersistentVolumeClaim (for example, hdml-static-pvc).

Create the Job by running the following command:
```
kubectl apply -f benchmark-job.yaml
```

Use kubectl logs to view the output of the fio tool:

kubectl logs job/benchmark-job -f

The output looks similar to the following:

read_benchmark: (g=0): rw=read, bs=4M-4M/4M-4M/4M-4M, ioengine=libaio, iodepth=8
fio-2.2.10
Starting 1 process

read_benchmark: (groupid=0, jobs=1): err= 0: pid=32: Fri Jul 12 21:29:32 2024
read : io=18300MB, bw=2407.3MB/s, iops=601, runt=  7602msec
    slat (usec): min=86, max=1614, avg=111.17, stdev=64.46
    clat (msec): min=2, max=33, avg=13.17, stdev= 1.08
    lat (msec): min=2, max=33, avg=13.28, stdev= 1.06
    clat percentiles (usec):
    |  1.00th=[11072],  5.00th=[12352], 10.00th=[12608], 20.00th=[12736],
    | 30.00th=[12992], 40.00th=[13120], 50.00th=[13248], 60.00th=[13376],
    | 70.00th=[13504], 80.00th=[13632], 90.00th=[13888], 95.00th=[14016],
    | 99.00th=[14400], 99.50th=[15296], 99.90th=[22144], 99.95th=[25728],
    | 99.99th=[33024]
    bw (MB  /s): min= 2395, max= 2514, per=100.00%, avg=2409.79, stdev=29.34
    lat (msec) : 4=0.39%, 10=0.31%, 20=99.15%, 50=0.15%
cpu          : usr=0.28%, sys=8.08%, ctx=4555, majf=0, minf=8203
IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=99.8%, 16=0.0%, 32=0.0%, >=64=0.0%
    submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
    complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
    issued    : total=r=4575/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
    latency   : target=0, window=0, percentile=100.00%, depth=8

Run status group 0 (all jobs):
READ: io=18300MB, aggrb=2407.3MB/s, minb=2407.3MB/s, maxb=2407.3MB/s, mint=7602msec, maxt=7602msec

Disk stats (read/write):
nvme0n2: ios=71239/0, merge=0/0, ticks=868737/0, in_queue=868737, util=98.72%
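
The headline numbers in this sample output can be reproduced from its raw counters, which is a useful sanity check when you read your own results. The following sketch uses only values copied from the output above (the io, runt, and issued fields) plus the 4096k block size from the Job manifest. The variable names are illustrative.

```shell
#!/bin/sh
# Values copied from the sample fio output.
io_mb=18300        # total data read, in MB (the "io=" field)
runtime_ms=7602    # elapsed time, in ms (the "runt=" field)
ios=4575           # read requests issued (the "issued" field)
block_mb=4         # block size, from --bs=4096k

# Throughput = data read / elapsed seconds.
bw=$(awk -v io="$io_mb" -v ms="$runtime_ms" \
  'BEGIN { printf "%.1f", io / (ms / 1000) }')

# IOPS = requests issued / elapsed seconds, truncated to an integer.
iops=$(awk -v n="$ios" -v ms="$runtime_ms" \
  'BEGIN { printf "%d", n / (ms / 1000) }')

# Cross-check: requests x block size should equal the data read.
total_mb=$((ios * block_mb))

echo "bandwidth=${bw} MB/s iops=${iops} total=${total_mb} MB"
```

Running the script prints bandwidth=2407.3 MB/s iops=601 total=18300 MB, which matches the bw, iops, and io fields in the sample output.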

Monitor throughput or IOPS on a Hyperdisk ML volume

To monitor the provisioned performance of your Hyperdisk ML volume, see Analyze provisioned IOPS and throughput in the Compute Engine documentation.

To update the provisioned throughput or IOPS of an existing Hyperdisk ML volume, or to learn about additional Google Cloud Hyperdisk parameters you can specify in your StorageClass, refer to Scale your storage performance using Google Cloud Hyperdisk.

Troubleshooting

This section provides troubleshooting guidance to resolve issues with Hyperdisk ML volumes on GKE.

The disk access mode cannot be updated

The following error occurs when a Hyperdisk ML volume is already attached to, and in use by, a node in ReadWriteOnce access mode.

AttachVolume.Attach failed for volume ... Failed to update access mode:
failed to set access mode for zonal volume ...
'Access mode cannot be updated when the disk is attached to instance(s).'., invalidResourceUsage

To resolve this issue, delete all Pods that are referencing the disk using a PersistentVolume in ReadWriteOnce mode. Wait for the disk to be detached, and then re-create the workload that consumes the PersistentVolume in ReadOnlyMany mode.

The disk cannot be attached with `READ_WRITE` mode

The following error indicates that GKE attempted to attach a Hyperdisk ML volume in READ_ONLY_MANY access mode to a GKE node using ReadWriteOnce access mode.

AttachVolume.Attach failed for volume ...
Failed to Attach: failed cloud service attach disk call ...
The disk cannot be attached with READ_WRITE mode., badRequest

GKE automatically updates the Hyperdisk ML volume's accessMode from READ_WRITE_SINGLE to READ_ONLY_MANY when the volume is used by a PersistentVolume in ReadOnlyMany access mode. However, GKE won't automatically update the access mode from READ_ONLY_MANY to READ_WRITE_SINGLE. This is a safety mechanism to ensure that multi-zone disks are not written to by accident, because that could result in diverging content between multi-zone disks.

To resolve this issue, we recommend that you follow the Pre-cache data to a Persistent Disk disk image workflow if you need updated content. If you need more control over the Hyperdisk ML volume's access mode and other settings, see Modify the settings for a Google Cloud Hyperdisk volume.

Quota exceeded - Insufficient throughput quota

The following error indicates that there was insufficient Hyperdisk ML throughput quota at the time of disk provisioning.

failed to provision volume with StorageClass ... failed (QUOTA_EXCEEDED): Quota 'HDML_TOTAL_THROUGHPUT' exceeded

To resolve this issue, see Disk Quotas to learn more about Hyperdisk quota and how to increase the disk quota in your project.

For additional troubleshooting guidance, refer to Scale your storage performance with Google Cloud Hyperdisk.