
AWS SDK Part 1 (S3)

 

IAM USER CREATION:

AWS services require authentication to interact with them. Normally, this is done through IAM users, roles, or temporary credentials, but it is possible to use Boto3 without an IAM user by leveraging the root user's access key (which is not recommended due to security risks).


 Using the Root User’s Access Key (Not Recommended)

Although AWS does not create an access key for the root user by default, you can manually create one and use it.

Steps to Generate a Root User Access Key (Not Recommended)

  1. Login to AWS Console as the root user.

  2. Navigate to IAM → Security Credentials.

  3. Scroll to Access Keys and click Create New Access Key.

  4. Copy the Access Key ID and Secret Access Key (you won’t see the secret key again).

  5. Configure AWS CLI with:

    bash
    aws configure
    • Enter Access Key ID (from step 4)
    • Enter Secret Access Key (from step 4)
    • Enter Default region name (e.g., us-east-1)
    • Enter Default output format (json or table)
  6. Now you can use Boto3 to access AWS services (a quick verification sketch follows below).
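Once the CLI has stored those keys, Boto3 picks them up automatically from ~/.aws/credentials. As a quick check that the credentials work, here is a minimal sketch (the key values are placeholders; normally you omit them and let Boto3 read the configured profile) that builds a session explicitly and lists your buckets:

python

import boto3

# Placeholder credentials -- in practice, omit these and let Boto3 read
# the values saved by `aws configure` from ~/.aws/credentials.
session = boto3.Session(
    aws_access_key_id="YOUR_ACCESS_KEY_ID",
    aws_secret_access_key="YOUR_SECRET_ACCESS_KEY",
    region_name="us-east-1",
)

s3 = session.client("s3")
for bucket in s3.list_buckets()["Buckets"]:
    print(bucket["Name"])  # sanity check that the credentials are valid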

Create an IAM User with Python SDK (Boto3)

Let’s write a code snippet to create an IAM user, assign a policy, and generate an access key for programmatic access.


Code: Create an IAM User

python

import boto3
from botocore.exceptions import ClientError

# Initialize IAM client
iam_client = boto3.client('iam')

# Step 1: Create an IAM User
user_name = "MyGenerativeAIUser"  # Replace with your desired username
try:
    response = iam_client.create_user(UserName=user_name)
    print(f"IAM User '{user_name}' created successfully!")
except ClientError as e:
    print(f"Error creating user: {e}")

# Step 2: Attach a Policy to the User
try:
    policy_arn = "arn:aws:iam::aws:policy/AmazonS3FullAccess"  # Example policy
    iam_client.attach_user_policy(UserName=user_name, PolicyArn=policy_arn)
    print(f"Policy '{policy_arn}' attached to user '{user_name}'!")
except ClientError as e:
    print(f"Error attaching policy: {e}")

# Step 3: Create Access Keys for Programmatic Access
try:
    access_key_response = iam_client.create_access_key(UserName=user_name)
    access_key_id = access_key_response['AccessKey']['AccessKeyId']
    secret_access_key = access_key_response['AccessKey']['SecretAccessKey']
    print(f"Access Key ID: {access_key_id}")
    print(f"Secret Access Key: {secret_access_key}")
except ClientError as e:
    print(f"Error creating access keys: {e}")

Why Is This Not Recommended?

  • Root user has unlimited privileges—if compromised, it can fully control your AWS account.
  • Best practice: Use IAM users or roles instead.

Working with Amazon S3 Using Python SDK (Boto3)

Since IAM user setup is complete, let’s start with Amazon S3 (Simple Storage Service), which is one of the most commonly used AWS services in Generative AI workflows.


🚀 Lesson 1: Creating an S3 Bucket

Before we can upload or manage files, we need to create an S3 bucket.


📌 Code: Create an S3 Bucket

Real-Time Example: Secure Private Bucket

If you need a private S3 bucket with proper security, here is a production-ready way to set it up:

python

import boto3
from botocore.exceptions import ClientError

s3_client = boto3.client('s3', region_name='us-west-2')

bucket_name = "my-secure-bucket-98765"  # Change to a unique name

try:
    response = s3_client.create_bucket(
        Bucket=bucket_name,
        CreateBucketConfiguration={'LocationConstraint': 'us-west-2'},
        ACL='private',                    # Ensures only the bucket owner has access
        ObjectLockEnabledForBucket=True   # Enables object lock for compliance
    )
    print(f"Secure S3 Bucket '{bucket_name}' created successfully!")
except ClientError as e:
    print(f"Error creating bucket: {e}")

Important Parameters in create_bucket()

When creating an S3 bucket using:

python

s3_client.create_bucket(Bucket='my-bucket-name', CreateBucketConfiguration={'LocationConstraint': 'us-west-2'})

Important Parameters to Know:

  1. Bucket

    • The name of the bucket must be unique across AWS.
    • Ensures uniqueness to avoid conflicts when multiple apps need different buckets.
  2. CreateBucketConfiguration

    • Defines bucket properties, such as the region.
    • Needed when creating buckets in regions other than us-east-1 (see the helper sketch after this list).
  3. LocationConstraint

    • Specifies the AWS region for the bucket.
    • Helps avoid latency by creating the bucket closer to users.
  4. ACL (Access Control List)

    • Controls the access level of the bucket (e.g., private, public-read, etc.).
    • Should be set as private for sensitive data or public-read only when necessary.
  5. ObjectLockEnabledForBucket

    • Enables object lock to prevent data modification.
    • Useful for compliance and legal data retention.
  6. GrantRead / GrantWrite

    • Assigns specific permissions to AWS users.
    • Used when sharing access with another AWS account.
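To make the us-east-1 rule from points 2 and 3 concrete, here is a small helper sketch (the bucket name is a placeholder) that adds CreateBucketConfiguration only when the target region actually requires it:

python

import boto3
from botocore.exceptions import ClientError

def create_bucket_in_region(bucket_name, region):
    """Create a bucket, adding LocationConstraint only when it is required."""
    s3_client = boto3.client('s3', region_name=region)
    try:
        if region == "us-east-1":
            # us-east-1 rejects an explicit LocationConstraint
            s3_client.create_bucket(Bucket=bucket_name)
        else:
            s3_client.create_bucket(
                Bucket=bucket_name,
                CreateBucketConfiguration={"LocationConstraint": region},
            )
        print(f"Bucket '{bucket_name}' created in {region}")
    except ClientError as e:
        print(f"Error creating bucket: {e}")

# Hypothetical bucket name: names are global, so pick your own unique one
create_bucket_in_region("my-generative-ai-bucket-13579", "us-west-2")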

Common Mistakes to Avoid:

  1. Not specifying LocationConstraint

    • Always define LocationConstraint when creating a bucket outside us-east-1.
  2. Using a non-unique bucket name

    • AWS bucket names are global—always choose a unique name.
  3. Setting ACL to public-read accidentally

    • This makes the bucket publicly accessible—use private unless needed.
  4. Forgetting to enable object lock for compliance

    • If you need to store immutable data, always enable ObjectLockEnabledForBucket=True.

🚀 Lesson 2: Uploading Files to S3 Using Boto3

Now that you've successfully created an S3 bucket, let's move to uploading files to it. This is crucial because in real-world Generative AI projects, you'll often store datasets, model weights, or generated outputs in S3.

📌 Code: Upload a File to S3

python

import boto3
from botocore.exceptions import ClientError

# Initialize S3 client
s3_client = boto3.client('s3', region_name='us-west-2')

# Define variables
bucket_name = "my-generative-ai-bucket-12345"  # Replace with your bucket name
local_file_path = "C:\\Users\\YourName\\Documents\\example.txt"  # Replace with your file path
s3_object_name = "uploaded-example.txt"  # The name to store the file as in S3

try:
    # Upload the file
    s3_client.upload_file(local_file_path, bucket_name, s3_object_name)
    print(f"File '{local_file_path}' uploaded successfully as '{s3_object_name}' in '{bucket_name}'!")
except ClientError as e:
    print(f"Error uploading file: {e}")

📌 Important Parameters in upload_file()

  1. Filename

    • Local file path to upload.
    • Always define the full path for the file on your system.
  2. Bucket

    • The name of the S3 bucket.
    • Ensure the bucket exists before running this command.
  3. Key

    • The object name in S3 (can be different from the filename).
    • Use a structured naming convention (e.g., models/model-v1.pth).
  4. ExtraArgs

    • Optional arguments like ACL and metadata.
    • Use for setting access permissions or defining file metadata.
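ExtraArgs is the one parameter the upload example above does not show, so here is a hedged sketch (bucket, key, and metadata values are placeholders) that sets a content type and custom metadata during upload:

python

import boto3
from botocore.exceptions import ClientError

s3_client = boto3.client('s3', region_name='us-west-2')

try:
    # Placeholder names -- adjust to your own file, bucket, and key
    s3_client.upload_file(
        "local-report.pdf",
        "my-generative-ai-bucket-12345",
        "reports/2024/report.pdf",
        ExtraArgs={
            "ContentType": "application/pdf",                        # served with the right MIME type
            "Metadata": {"project": "genai", "owner": "data-team"},  # custom object metadata
        },
    )
    print("Upload with ExtraArgs completed.")
except ClientError as e:
    print(f"Error uploading file: {e}")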

🚨 Common Mistakes to Avoid in upload_file()

  1. FileNotFoundError

    • Ensure the local file path is correct and the file exists before uploading.
  2. Access Denied Error

    • Your IAM user must have s3:PutObject permissions to upload files.
  3. BucketNotFound

    • Verify that the bucket exists before running the upload command.

🚀 Lesson 3: Downloading Files from S3 Using Boto3

Now that you have successfully uploaded a file to S3, let’s learn how to download files from S3 to your local system.


📌 Code: Download a File from S3

python

import boto3
from botocore.exceptions import ClientError

# Initialize S3 client
s3_client = boto3.client('s3', region_name='us-west-2')

# Define variables
bucket_name = "my-generative-ai-bucket-12345"  # Replace with your bucket name
s3_object_name = "uploaded-example.txt"  # Name of the file in S3
local_download_path = "C:\\Users\\YourName\\Downloads\\downloaded-example.txt"  # Where to save the file

try:
    # Download the file from S3
    s3_client.download_file(bucket_name, s3_object_name, local_download_path)
    print(f"File '{s3_object_name}' downloaded successfully to '{local_download_path}'!")
except ClientError as e:
    print(f"Error downloading file: {e}")

📌 Important Parameters in download_file()

  1. Bucket

    • The name of the S3 bucket.
    • Ensure the bucket exists before attempting to download.
  2. Key

    • The file name (object key) stored in S3.
    • Case-sensitive—must match the exact name in S3.
  3. Filename

    • The local file path where the file will be saved.
    • Set to a valid path on your system to store the downloaded file.

✅ Real-Time Example: Handling Large Files Efficiently

If you’re dealing with large files (e.g., AI models, datasets), use download_fileobj() instead of download_file():

python

with open(local_download_path, 'wb') as f:
    s3_client.download_fileobj(bucket_name, s3_object_name, f)
  • download_fileobj() is more memory-efficient because it streams the file instead of loading it into memory at once.

🔴 Common Mistakes & Fixes in download_file()

  1. FileNotFoundError

    • Ensure the object exists in S3 before attempting to download.
  2. Access Denied Error

    • IAM user must have s3:GetObject permission to access the file.
  3. Invalid Local Path

    • Verify that the local path exists before downloading the file.
  4. Incorrect File Name (Key Error)

    • Ensure the s3_object_name matches exactly, as S3 is case-sensitive.

🚀 Lesson 4: Listing All Files in an S3 Bucket

Now that we've uploaded and downloaded files, let's learn how to list all objects (files) inside an S3 bucket. This is useful when managing AI model artifacts, datasets, or logs.


📌 Code: List All Files in an S3 Bucket

python

import boto3
from botocore.exceptions import ClientError

# Initialize S3 client
s3_client = boto3.client('s3', region_name='us-west-2')

# Define the bucket name
bucket_name = "my-generative-ai-bucket-12345"  # Replace with your bucket name

try:
    # List objects in the bucket
    response = s3_client.list_objects_v2(Bucket=bucket_name)

    # Check if the bucket has files
    if "Contents" in response:
        print("Files in S3 Bucket:")
        for obj in response["Contents"]:
            print(f"- {obj['Key']} (Size: {obj['Size']} bytes, Last Modified: {obj['LastModified']})")
    else:
        print("Bucket is empty.")
except ClientError as e:
    print(f"Error listing files: {e}")

📌 Important Parameters in list_objects_v2()

  1. Bucket

    • Specifies the bucket to list objects from.
    • Ensure the bucket exists before running this command.
  2. Prefix

    • Filters objects that start with a specific string.
    • Useful for structuring datasets (e.g., "models/" lists only model files).
  3. MaxKeys

    • Limits the number of objects returned in a single response.
    • Helps in handling large buckets efficiently.
  4. ContinuationToken

    • Used for pagination if there are more than 1000 objects.
    • Required when listing large datasets to fetch remaining objects.

✅ Real-Time Example: List Only .txt Files

If you want to list only text files inside the bucket:

python

response = s3_client.list_objects_v2(Bucket=bucket_name, Prefix="data/", MaxKeys=100)

if "Contents" in response:
    for obj in response["Contents"]:
        if obj["Key"].endswith(".txt"):
            print(f"- {obj['Key']} (Size: {obj['Size']} bytes)")
  • Prefix "data/" → Lists only files inside the data/ folder.
  • Filters .txt files by checking if the key ends with ".txt".

🔴 Common Mistakes & Fixes in list_objects_v2()

  1. BucketNotFound

    • Ensure the bucket name is correct and exists before listing objects.
  2. Access Denied Error

    • IAM user must have s3:ListBucket permission to retrieve object listings.
  3. Empty List Response

    • If the bucket has no files, handle the "Contents" key properly to avoid errors.
  4. Handling Pagination

    • If the bucket contains more than 1000 files, use ContinuationToken to fetch additional objects.

What is ContinuationToken in list_objects_v2()?

By default, the list_objects_v2() method in Boto3 can only return up to 1,000 objects at a time. If your S3 bucket contains more than 1,000 files, you need to use ContinuationToken to paginate through the results.


📌 Real-Time Use Case

Imagine you have 10,000 images in your S3 bucket for AI training. When you call:

python

response = s3_client.list_objects_v2(Bucket=bucket_name)
  • It will return only the first 1,000 files.
  • To get the next batch, you need to use ContinuationToken.
  • This token remembers the last file listed, so the next call starts from where it left off.

🚀 Example: Listing All Files in a Large S3 Bucket

python

import boto3
from botocore.exceptions import ClientError

# Initialize S3 client
s3_client = boto3.client('s3', region_name='us-west-2')

# Define bucket name
bucket_name = "my-generative-ai-bucket-12345"  # Replace with your bucket name

continuation_token = None

try:
    while True:
        # Fetch files from S3 with pagination
        if continuation_token:
            response = s3_client.list_objects_v2(Bucket=bucket_name, ContinuationToken=continuation_token)
        else:
            response = s3_client.list_objects_v2(Bucket=bucket_name)

        # Check if the bucket contains files
        if "Contents" in response:
            for obj in response["Contents"]:
                print(f"- {obj['Key']} (Size: {obj['Size']} bytes)")

            # Check if more results exist
            if response.get("IsTruncated"):  # True if there are more files to fetch
                continuation_token = response["NextContinuationToken"]
            else:
                break  # Stop when all files are retrieved
        else:
            print("Bucket is empty.")
            break
except ClientError as e:
    print(f"Error listing files: {e}")

🔍 How This Works

  1. First API Call:
    • Calls list_objects_v2() without a ContinuationToken (fetches first 1,000 objects).
  2. Check if more files exist:
    • If "IsTruncated": True, AWS signals there are more files left.
  3. Retrieve NextContinuationToken:
    • This tells AWS where to resume on the next call.
  4. Loop Until All Files Are Retrieved:
    • We call list_objects_v2() again using ContinuationToken.

✅ Why is ContinuationToken Important?

  1. S3 Bucket has more than 1,000 files

    • AWS returns only the first 1,000 objects. Pagination ensures you retrieve all files.
  2. Retrieving all AI model checkpoints

    • Large ML projects store thousands of versions—pagination is needed to fetch them all.
  3. Processing Large Datasets

    • When storing massive datasets, pagination is required to iterate through all files efficiently.

🔴 Common Mistakes & Fixes

  1. Only fetching 1,000 objects

    • Use ContinuationToken to request the next batch of objects.
  2. Not checking IsTruncated

    • Always check if more data exists before stopping to ensure completeness.
  3. Using incorrect ContinuationToken

    • Ensure you use the NextContinuationToken from the previous API response for continuity.
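If you prefer not to manage ContinuationToken by hand, boto3 also ships a built-in paginator that runs the same loop internally; a minimal sketch (the data/ prefix is just an example filter):

python

import boto3
from botocore.exceptions import ClientError

s3_client = boto3.client('s3', region_name='us-west-2')
bucket_name = "my-generative-ai-bucket-12345"  # Replace with your bucket name

try:
    paginator = s3_client.get_paginator("list_objects_v2")
    # The paginator handles ContinuationToken / NextContinuationToken for you
    for page in paginator.paginate(Bucket=bucket_name, Prefix="data/"):
        for obj in page.get("Contents", []):
            print(f"- {obj['Key']} (Size: {obj['Size']} bytes)")
except ClientError as e:
    print(f"Error listing files: {e}")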

🚀 Lesson 5: Deleting Files from S3 Using Boto3

Now that we’ve learned how to upload, download, and list files, let’s move to deleting files from S3, which is crucial for managing storage costs and cleaning up old data.


📌 Code: Delete a Single File from S3

python

import boto3
from botocore.exceptions import ClientError

# Initialize S3 client
s3_client = boto3.client('s3', region_name='us-west-2')

# Define bucket and file to delete
bucket_name = "my-generative-ai-bucket-12345"  # Replace with your bucket name
s3_object_name = "uploaded-example.txt"  # Replace with the file name you want to delete

try:
    # Delete the file
    response = s3_client.delete_object(Bucket=bucket_name, Key=s3_object_name)
    print(f"File '{s3_object_name}' deleted successfully from '{bucket_name}'!")
except ClientError as e:
    print(f"Error deleting file: {e}")

📌 Important Parameters in delete_object()

  1. Bucket

    • The name of the S3 bucket.
    • Ensure the bucket exists before deleting.
  2. Key

    • The exact file name (object key) stored in S3.
    • File names are case-sensitive in S3.

🚀 Deleting Multiple Files at Once

If you need to delete multiple files from an S3 bucket in one call, use delete_objects().

python

import boto3
from botocore.exceptions import ClientError

s3_client = boto3.client('s3', region_name='us-west-2')

bucket_name = "my-generative-ai-bucket-12345"  # Replace with your bucket name
files_to_delete = ["file1.txt", "file2.png", "file3.ipynb"]  # List of file names to delete

try:
    # Format the files list for the request
    delete_objects = {"Objects": [{"Key": file} for file in files_to_delete]}

    # Delete multiple files
    response = s3_client.delete_objects(Bucket=bucket_name, Delete=delete_objects)

    # Confirm deletion
    deleted_files = [obj["Key"] for obj in response.get("Deleted", [])]
    print(f"Files deleted successfully: {deleted_files}")
except ClientError as e:
    print(f"Error deleting files: {e}")

✅ Real-Time Use Case: Delete Files Based on Condition

If you need to delete all files older than X days, you can do:

python

from datetime import datetime, timezone

# Fetch all files
response = s3_client.list_objects_v2(Bucket=bucket_name)

if "Contents" in response:
    for obj in response["Contents"]:
        file_name = obj["Key"]
        last_modified = obj["LastModified"]

        # Delete if file is older than 30 days
        if (datetime.now(timezone.utc) - last_modified).days > 30:
            s3_client.delete_object(Bucket=bucket_name, Key=file_name)
            print(f"Deleted old file: {file_name}")

🔴 Common Mistakes & Fixes in delete_object()

  1. KeyNotFound Error

    • Ensure the file exists in the S3 bucket before attempting to delete it.
  2. Access Denied Error

    • IAM user must have s3:DeleteObject permission to delete files.
  3. BucketNotFound Error

    • Verify that the bucket name is correct and exists before deleting an object.
  4. Deleting a Non-Empty Bucket

    • You must delete all objects first before deleting the bucket itself.

🚀 Lesson 7: Enabling S3 Versioning and Uploading Versioned Files

S3 versioning allows you to keep multiple versions of the same file, preventing accidental overwrites and enabling file recovery.


📌 Step 1: Enable Versioning on an S3 Bucket

Before uploading versioned files, you must enable versioning on your S3 bucket.

python

import boto3
from botocore.exceptions import ClientError

# Initialize S3 client
s3_client = boto3.client('s3', region_name='us-west-2')

# Define bucket name
bucket_name = "my-generative-ai-bucket-12345"  # Replace with your bucket name

try:
    # Enable versioning on the bucket
    s3_client.put_bucket_versioning(
        Bucket=bucket_name,
        VersioningConfiguration={'Status': 'Enabled'}
    )
    print(f"Versioning enabled for bucket: {bucket_name}")
except ClientError as e:
    print(f"Error enabling versioning: {e}")

📌 Step 2: Uploading a File with Versioning

Once versioning is enabled, every upload of the same file creates a new version, instead of replacing the previous file.

python

import boto3
from botocore.exceptions import ClientError

# Initialize S3 client
s3_client = boto3.client('s3', region_name='us-west-2')

# Define variables
bucket_name = "my-generative-ai-bucket-12345"  # Replace with your bucket name
local_file_path = "C:\\Users\\YourName\\Documents\\example.txt"  # Replace with your file
s3_object_name = "example.txt"  # Keep the same name to store different versions

try:
    # Upload file (a new version is created when versioning is enabled)
    response = s3_client.upload_file(local_file_path, bucket_name, s3_object_name)
    print(f"File '{s3_object_name}' uploaded successfully. A new version was created.")
except ClientError as e:
    print(f"Error uploading file: {e}")

📌 Step 3: Listing All Versions of a File

To see all the versions of a specific file in S3, use list_object_versions().

python

# Initialize S3 client
s3_client = boto3.client('s3', region_name='us-west-2')

# List all versions of a specific file
response = s3_client.list_object_versions(Bucket=bucket_name, Prefix=s3_object_name)

if "Versions" in response:
    print(f"Versions of {s3_object_name}:")
    for version in response["Versions"]:
        print(f"- Version ID: {version['VersionId']} (Last Modified: {version['LastModified']})")
else:
    print(f"No versions found for {s3_object_name}.")

📌 Step 4: Downloading a Specific Version

If you want to download an older version of a file, specify its VersionId.

python

# Define the version ID you want to download
version_id = "ENTER_VERSION_ID_HERE"  # Replace with a valid version ID

# Download a specific version of the file
s3_client.download_file(
    bucket_name,
    s3_object_name,
    "downloaded-versioned-file.txt",
    ExtraArgs={'VersionId': version_id}
)
print(f"Downloaded specific version of '{s3_object_name}' successfully!")

📌 Step 5: Deleting a Specific Version

Instead of deleting the whole file, you can delete a specific version.

python

s3_client.delete_object(Bucket=bucket_name, Key=s3_object_name, VersionId=version_id)
print(f"Deleted version {version_id} of '{s3_object_name}'.")

🔍 Why Use Versioning?

  1. Prevent Accidental Overwrites

    • Older versions of files are retained, allowing easy restoration (see the restore sketch after this list).
  2. Data Recovery

    • Previous file versions can be retrieved if needed.
  3. Audit Tracking

    • Keeps track of changes made to important datasets or AI models.
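To illustrate points 1 and 2, restoring an older version is typically done by copying that version back onto the same key, which makes it the newest version again; a hedged sketch (the version ID is a placeholder you would take from list_object_versions()):

python

import boto3
from botocore.exceptions import ClientError

s3_client = boto3.client('s3', region_name='us-west-2')
bucket_name = "my-generative-ai-bucket-12345"   # Replace with your bucket name
s3_object_name = "example.txt"
old_version_id = "ENTER_VERSION_ID_HERE"        # Taken from list_object_versions()

try:
    # Copying an old version onto the same key makes it the latest version again
    s3_client.copy_object(
        Bucket=bucket_name,
        Key=s3_object_name,
        CopySource={"Bucket": bucket_name, "Key": s3_object_name, "VersionId": old_version_id},
    )
    print(f"Restored version {old_version_id} of '{s3_object_name}' as the latest version.")
except ClientError as e:
    print(f"Error restoring version: {e}")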

🔴 Common Mistakes & Fixes

  1. File overwritten instead of versioning

    • Ensure versioning is enabled before uploading new versions of a file.
  2. Trying to delete an object without specifying VersionId

    • If versioning is enabled, you must provide VersionId when deleting a file.
  3. Not seeing older versions

    • Use list_object_versions() instead of list_objects_v2() to retrieve all versions.

🚀 Lesson 8: Deleting an Entire S3 Bucket (Including All Versions)

Now that you've learned how to delete objects and manage versioning, let's move to completely deleting an S3 bucket, including all files and all versions.


The script below removes everything in the bucket, including:

  1. Regular (unversioned) objects
  2. Versioned objects (if versioning is enabled)
  3. Delete markers (for versioned objects)
python

import boto3
from botocore.exceptions import ClientError

# Initialize S3 resource
s3 = boto3.resource('s3', region_name='us-west-2')

# Define the bucket name
bucket_name = "my-generative-ai-bucket-12345"  # Replace with your bucket name
bucket = s3.Bucket(bucket_name)

try:
    # Step 1: Delete ALL objects (for both versioned & non-versioned buckets)
    for obj in bucket.objects.all():
        obj.delete()
    print(f"All standard objects deleted from '{bucket_name}'!")

    # Step 2: Delete ALL versions (if versioning is enabled)
    for obj_version in bucket.object_versions.all():
        obj_version.delete()
    print(f"All versioned objects deleted from '{bucket_name}'!")

    # Step 3: Delete the bucket
    bucket.delete()
    print(f"S3 bucket '{bucket_name}' deleted successfully!")
except ClientError as e:
    print(f"Error: {e}")

🔍 Key Points for Deleting an S3 Bucket

  1. Delete all objects

    • AWS does not allow you to delete a non-empty bucket.
  2. Delete versions (if enabled)

    • Versioned objects must be explicitly removed before bucket deletion.
  3. Delete the bucket

    • The final step after clearing all contents.

🔴 Common Mistakes & Fixes

  1. BucketNotEmpty Error

    • Ensure you delete all objects before attempting to delete the bucket.
  2. Access Denied Error

    • IAM user must have s3:DeleteBucket and s3:DeleteObject permissions.
  3. Trying to delete a non-existing bucket

    • Verify that the bucket name is correct before attempting deletion.

🔍 Do We Still Need Step 2?

Do we really need to delete versions explicitly after deleting all standard objects? The answer depends on whether versioning is enabled on the bucket.


✅ What Happens in Step 1?

python

for obj in bucket.objects.all():
    obj.delete()
  • If versioning is disabled: This completely removes all objects.
  • If versioning is enabled: This does not delete previous versions. Instead, it creates delete markers, meaning the files become "hidden" but are still stored in S3.

✅ What Happens in Step 2?

python

for obj_version in bucket.object_versions.all():
    obj_version.delete()
  • If versioning is disabled: This step is unnecessary (because all files are already deleted in Step 1).
  • If versioning is enabled: This removes all previous versions, which Step 1 does not do.

Multipart Uploads in S3

🔍 Why Use Multipart Uploads?

  • Uploads large files efficiently by splitting them into parts.
  • Allows parallel uploads, making the process faster.
  • Resumable uploads in case of failure, reducing the risk of losing progress.
  • Required for objects larger than 5 GB, since a single PUT is limited to 5 GB (the maximum S3 object size is 5 TB).

📌 Full Code: Upload a Large File Using Multipart Upload

python
import boto3
from botocore.exceptions import ClientError

# Initialize S3 client
s3_client = boto3.client('s3', region_name='us-west-2')

# Define bucket and file details
bucket_name = "my-generative-ai-bucket-12345"  # Replace with your bucket name
local_file_path = "C:\\Users\\YourName\\Documents\\large_file.mp4"  # Replace with your file path
s3_object_name = "large_file.mp4"  # Name in S3
part_size = 5 * 1024 * 1024  # 5MB per part

def multipart_upload():
    """Uploads a large file to S3 using multipart upload."""
    upload_id = None
    try:
        # Step 1: Initiate multipart upload
        response = s3_client.create_multipart_upload(Bucket=bucket_name, Key=s3_object_name)
        upload_id = response["UploadId"]
        print(f"Multipart upload initiated: Upload ID = {upload_id}")

        # Step 2: Upload file parts
        parts = []
        with open(local_file_path, "rb") as file:
            part_number = 1
            while chunk := file.read(part_size):
                response = s3_client.upload_part(
                    Bucket=bucket_name,
                    Key=s3_object_name,
                    PartNumber=part_number,
                    UploadId=upload_id,
                    Body=chunk
                )
                parts.append({"PartNumber": part_number, "ETag": response["ETag"]})
                print(f"Uploaded part {part_number}")
                part_number += 1

        # Step 3: Complete multipart upload
        s3_client.complete_multipart_upload(
            Bucket=bucket_name,
            Key=s3_object_name,
            UploadId=upload_id,
            MultipartUpload={"Parts": parts}
        )
        print("Multipart upload completed successfully!")
    except ClientError as e:
        print(f"Error: {e}")
        # Abort the upload only if it was actually initiated
        if upload_id:
            s3_client.abort_multipart_upload(Bucket=bucket_name, Key=s3_object_name, UploadId=upload_id)
            print("Multipart upload aborted.")

multipart_upload()

📌 How Multipart Upload Works

  1. create_multipart_upload()

    • Starts the multipart upload and returns an UploadId.
  2. upload_part()

    • Uploads chunks of the file in parallel.
  3. complete_multipart_upload()

    • Combines all uploaded parts into a single file.
  4. abort_multipart_upload()

    • Cancels the upload if an error occurs.

✅ Real-Time Use Cases

  1. Uploading a 10GB AI dataset

    • Prevents upload failure by splitting the file into smaller chunks.
  2. Handling network interruptions

    • Resumable uploads allow retrying failed parts instead of restarting from scratch.
  3. Parallel uploads for speed

    • Multiple parts upload simultaneously, reducing total upload time.

🚀 How to Make Multipart Upload Truly Parallel?

To upload parts concurrently, we can use the concurrent.futures.ThreadPoolExecutor to send multiple upload_part() requests simultaneously.


✅ Optimized Parallel Multipart Upload

python
import boto3
import concurrent.futures
from botocore.exceptions import ClientError

# Initialize S3 client
s3_client = boto3.client('s3', region_name='us-west-2')

# Define bucket and file details
bucket_name = "my-generative-ai-bucket-12345"  # Replace with your bucket name
local_file_path = "C:\\Users\\YourName\\Documents\\large_file.mp4"  # Replace with your file path
s3_object_name = "large_file.mp4"  # Name in S3
part_size = 5 * 1024 * 1024  # 5MB per part

def upload_part(part_number, data, upload_id):
    """Uploads a single part in parallel."""
    try:
        response = s3_client.upload_part(
            Bucket=bucket_name,
            Key=s3_object_name,
            PartNumber=part_number,
            UploadId=upload_id,
            Body=data
        )
        print(f"Uploaded part {part_number}")
        return {"PartNumber": part_number, "ETag": response["ETag"]}
    except ClientError as e:
        print(f"Error uploading part {part_number}: {e}")
        return None

def multipart_upload():
    """Uploads a large file to S3 using parallel multipart upload."""
    upload_id = None
    try:
        # Step 1: Initiate multipart upload
        response = s3_client.create_multipart_upload(Bucket=bucket_name, Key=s3_object_name)
        upload_id = response["UploadId"]
        print(f"Multipart upload initiated: Upload ID = {upload_id}")

        # Step 2: Read file and prepare parts
        chunks = []
        with open(local_file_path, "rb") as file:
            part_number = 1
            while chunk := file.read(part_size):
                chunks.append((part_number, chunk, upload_id))
                part_number += 1

        # Step 3: Upload parts in parallel
        with concurrent.futures.ThreadPoolExecutor() as executor:
            results = executor.map(lambda p: upload_part(*p), chunks)

        # Collect successful uploads (map preserves part order)
        parts = [part for part in results if part]

        # Step 4: Complete multipart upload
        if parts:
            s3_client.complete_multipart_upload(
                Bucket=bucket_name,
                Key=s3_object_name,
                UploadId=upload_id,
                MultipartUpload={"Parts": parts}
            )
            print("Multipart upload completed successfully!")
        else:
            raise Exception("No parts were successfully uploaded.")
    except Exception as e:
        print(f"Error: {e}")
        if upload_id:
            s3_client.abort_multipart_upload(Bucket=bucket_name, Key=s3_object_name, UploadId=upload_id)
            print("Multipart upload aborted.")

multipart_upload()

✅ What’s Different in This Code?

  1. Processing

    • Before (Sequential Upload): Uploads one part at a time (blocking).
    • Now (Parallel Upload): Uploads multiple parts simultaneously.
  2. Performance

    • Before: Slower for large files.
    • Now: Faster, as multiple chunks upload in parallel.
  3. Error Handling

    • Before: a single failed part aborts the entire upload.
    • Now: each part reports its own failure, and failed parts are filtered out before completing the upload (or the upload is aborted if none succeed).

🚀 Key Optimizations

  1. Reads all parts first and stores them in chunks before uploading.
  2. Uses ThreadPoolExecutor.map() to upload multiple parts concurrently.
  3. Filters out failed parts before completing the upload.
  4. Fails safely if no parts are successfully uploaded.

✅ Real-Time Use Cases

  1. Uploading AI-generated videos

    • Speeds up upload by sending chunks simultaneously.
  2. Uploading multi-GB datasets

    • Prevents bottlenecks caused by single-threaded execution.
  3. Resilient to slow networks

    • Faster recovery from failed uploads.

🔴 Common Mistakes & Fixes

  1. Not all parts uploaded

    • Use a part check before completing the upload.
  2. Upload speed not improving

    • Increase ThreadPoolExecutor() worker count for better concurrency.
  3. Memory issues with huge files

    • Process chunks one by one instead of storing everything in memory.
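If you would rather not hand-roll chunking and threading at all, boto3's managed transfer layer handles multipart splitting, parallel uploads, and memory-friendly streaming for you; a minimal sketch using TransferConfig (the thresholds and names below are example values):

python

import boto3
from boto3.s3.transfer import TransferConfig
from botocore.exceptions import ClientError

s3_client = boto3.client('s3', region_name='us-west-2')

# Example values: tune chunk size and concurrency to your network and file sizes
config = TransferConfig(
    multipart_threshold=8 * 1024 * 1024,   # switch to multipart above 8 MB
    multipart_chunksize=8 * 1024 * 1024,   # 8 MB parts
    max_concurrency=10,                    # parallel part uploads
    use_threads=True,
)

try:
    s3_client.upload_file(
        "large_file.mp4",                  # local path (placeholder)
        "my-generative-ai-bucket-12345",   # bucket (placeholder)
        "large_file.mp4",                  # key in S3
        Config=config,
    )
    print("Managed multipart upload completed.")
except ClientError as e:
    print(f"Error uploading file: {e}")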

Presigned URLs in S3 (Secure File Access Without Credentials)

Now that you've mastered multipart uploads, let's move to Presigned URLs, which are useful for securely accessing private S3 objects without exposing credentials.


🔍 Why Use Presigned URLs?

  1. Securely share files

    • Allows temporary access to private S3 files.
  2. Restrict access duration

    • URLs automatically expire after a set time.
  3. Download or upload without IAM user credentials

    • Users can access files without needing AWS credentials.
  4. Integrate with AI workflows

    • Model training data, logs, or AI-generated outputs can be shared via time-limited URLs.

📌 Code: Generate a Presigned URL for Downloading a File

python
import boto3
from botocore.exceptions import ClientError

# Initialize S3 client
s3_client = boto3.client('s3', region_name='us-west-2')

# Define bucket and file details
bucket_name = "my-generative-ai-bucket-12345"  # Replace with your bucket name
s3_object_name = "example.txt"  # Replace with your file in S3
expiration_time = 3600  # URL expires in 1 hour (seconds)

def generate_presigned_url():
    """Generate a presigned URL for downloading an S3 object."""
    try:
        url = s3_client.generate_presigned_url(
            'get_object',
            Params={'Bucket': bucket_name, 'Key': s3_object_name},
            ExpiresIn=expiration_time
        )
        print(f"Presigned URL (valid for {expiration_time // 60} minutes):\n{url}")
        return url
    except ClientError as e:
        print(f"Error generating presigned URL: {e}")
        return None

# Generate and print the presigned URL
generate_presigned_url()

📌 Code: Generate a Presigned URL for Uploading a File

If you want external users to upload a file to your S3 bucket, use this:

python

def generate_presigned_upload_url():
    """Generate a presigned URL for uploading an S3 object."""
    try:
        url = s3_client.generate_presigned_url(
            'put_object',
            Params={'Bucket': bucket_name, 'Key': s3_object_name},
            ExpiresIn=expiration_time
        )
        print(f"Presigned Upload URL (valid for {expiration_time // 60} minutes):\n{url}")
        return url
    except ClientError as e:
        print(f"Error generating presigned upload URL: {e}")
        return None

# Generate and print the presigned upload URL
generate_presigned_upload_url()

✅ How Presigned URLs Work

  1. generate_presigned_url('get_object')

    • Generates a URL to download a file.
    • Shareable with end-users for secure access (a client-side download example follows this list).
  2. generate_presigned_url('put_object')

    • Generates a URL to upload a file.
    • Allows users to upload files securely without AWS credentials.
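On the consumer side, a presigned GET URL can be fetched with any plain HTTP client and no AWS SDK; a hedged sketch with placeholder values, mirroring the upload example shown later in this section:

python

import requests

# Placeholder: paste the URL returned by generate_presigned_url('get_object')
presigned_get_url = "https://my-generative-ai-bucket-12345.s3.amazonaws.com/example.txt?X-Amz-..."
local_path = "downloaded-example.txt"

response = requests.get(presigned_get_url, timeout=60)
if response.status_code == 200:
    with open(local_path, "wb") as f:
        f.write(response.content)  # save the downloaded bytes locally
    print(f"Downloaded to {local_path}")
else:
    print(f"Download failed: {response.status_code} - {response.text}")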

🚀 Real-World Use Cases

  1. AI Model Hosting

    • Serve trained AI models to users without exposing S3 access.
  2. Secure File Sharing

    • Share reports, logs, or datasets with temporary access.
  3. Client-side Uploads

    • Users can upload AI-generated images without needing AWS credentials.

🔴 Common Mistakes & Fixes

  1. "Access Denied" Error

    • Ensure your IAM user has s3:GetObject (for downloads) or s3:PutObject (for uploads).
  2. Presigned URL Expired Too Soon

    • Increase ExpiresIn value (max: 7 days).
  3. Generated URL Doesn’t Work

    • Ensure the file exists in S3 before sharing the URL.

🚀 Optimized Code for Uploading a File Using a Presigned URL

python

import requests

# Define presigned URL (Replace with your actual URL)
presigned_url = "https://my-generative-ai-bucket-12345.s3.amazonaws.com/uploaded-example7.txt?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIA44Y6CIJ5ZI5Y7X6H%2F20250202%2Fus-west-2%2Fs3%2Faws4_request&X-Amz-Date=20250202T235804Z&X-Amz-Expires=3600&X-Amz-SignedHeaders=host&X-Amz-Signature=3ba2452dcda4048e5b4093d69d2c7754ede41c3225e14bc1e7302fec1e5c7d7c"

# Define file path (use an absolute path if needed, e.g., C:\\Users\\YourName\\S3\\data\\b.txt)
file_path = r"S3\data\b.txt"

# Read file and upload using requests
try:
    with open(file_path, "rb") as file:
        headers = {"Content-Type": "application/octet-stream"}  # Optional, ensures correct upload
        response = requests.put(presigned_url, data=file, headers=headers)

    # Check response status
    if response.status_code == 200:
        print("✅ Upload successful!")
    else:
        print(f"❌ Upload failed: {response.status_code} - {response.text}")
except FileNotFoundError:
    print("❌ Error: File not found. Check the file path.")
except requests.exceptions.RequestException as e:
    print(f"❌ Upload failed: {e}")

S3 Transfer Acceleration (Faster Uploads & Downloads)

Now that you've mastered Presigned URLs, let’s move to S3 Transfer Acceleration, which helps you upload and download files faster using Amazon CloudFront’s global network.


🔍 Why Use S3 Transfer Acceleration?

  1. Slow uploads from distant locations

    • Uses CloudFront’s edge locations to accelerate data transfer.
  2. Large AI datasets or model weights

    • Reduces upload time by optimizing the route to S3.
  3. Global users need fast access

    • Improves performance for users far from the S3 bucket’s region.

📌 Step 1: Enable Transfer Acceleration on an S3 Bucket

Before using acceleration, you must enable it on the bucket.

python
import boto3
from botocore.exceptions import ClientError

# Initialize S3 client
s3_client = boto3.client('s3', region_name='us-west-2')

# Define bucket name
bucket_name = "my-generative-ai-bucket-12345"  # Replace with your bucket name

def enable_transfer_acceleration():
    """Enable S3 Transfer Acceleration for a bucket."""
    try:
        response = s3_client.put_bucket_accelerate_configuration(
            Bucket=bucket_name,
            AccelerateConfiguration={"Status": "Enabled"}
        )
        print(f"S3 Transfer Acceleration enabled for '{bucket_name}'!")
    except ClientError as e:
        print(f"Error enabling Transfer Acceleration: {e}")

# Run the function to enable acceleration
enable_transfer_acceleration()

📌 Step 2: Upload a File Using Transfer Acceleration

Once enabled, use the accelerated endpoint to upload files.

python
import boto3
from botocore.exceptions import ClientError

# Use accelerated S3 client
s3_accelerated_client = boto3.client(
    's3',
    region_name='us-west-2',
    endpoint_url="https://s3-accelerate.amazonaws.com"
)

# Define file details
bucket_name = "my-generative-ai-bucket-12345"
local_file_path = "C:\\Users\\YourName\\Documents\\large_file.mp4"
s3_object_name = "large_file.mp4"

def upload_via_acceleration():
    """Upload a file to S3 using Transfer Acceleration."""
    try:
        s3_accelerated_client.upload_file(local_file_path, bucket_name, s3_object_name)
        print(f"File '{s3_object_name}' uploaded successfully using Transfer Acceleration!")
    except ClientError as e:
        print(f"Error uploading file: {e}")

# Run the function
upload_via_acceleration()

📌 Step 3: Download a File Using Transfer Acceleration

python
def download_via_acceleration(download_path):
    """Download a file from S3 using Transfer Acceleration."""
    try:
        s3_accelerated_client.download_file(bucket_name, s3_object_name, download_path)
        print(f"File '{s3_object_name}' downloaded successfully via Transfer Acceleration!")
    except ClientError as e:
        print(f"Error downloading file: {e}")

# Example usage
download_via_acceleration("C:\\Users\\YourName\\Downloads\\downloaded_large_file.mp4")

📌 How to Verify If Acceleration is Enabled?

Run this status check:

python
def check_transfer_acceleration():
    """Check if Transfer Acceleration is enabled on an S3 bucket."""
    response = s3_client.get_bucket_accelerate_configuration(Bucket=bucket_name)
    status = response.get("Status", "Not Enabled")
    print(f"Transfer Acceleration status: {status}")

# Run the function
check_transfer_acceleration()

✅ When Should You Use Transfer Acceleration?

Use It When...

  1. Uploading large AI models or datasets from remote locations.
  2. You have global users accessing the bucket.
  3. You need to minimize network latency for faster uploads.

Don't Use It When...

  1. Your uploads are already fast in the same AWS region.
  2. Your users are mostly in the same AWS region.
  3. You only work with small files (<100MB), where acceleration isn't necessary.

🔴 Common Mistakes & Fixes

  1. "Transfer Acceleration Not Enabled" Error

    • Run enable_transfer_acceleration() before using accelerated endpoints.
  2. Uploads still slow

    • Ensure you're using the s3-accelerate.amazonaws.com endpoint (or the client Config shown after this list).
  3. Not seeing improvement

    • Only use Transfer Acceleration if your location is far from the AWS region of your bucket.
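As an alternative to hard-coding the accelerate endpoint URL, botocore's client Config can switch the endpoint for you; a minimal sketch under the same placeholder bucket and file names:

python

import boto3
from botocore.config import Config
from botocore.exceptions import ClientError

# Let botocore route requests through the accelerate endpoint automatically
s3_accelerated_client = boto3.client(
    's3',
    region_name='us-west-2',
    config=Config(s3={"use_accelerate_endpoint": True}),
)

try:
    s3_accelerated_client.upload_file(
        "large_file.mp4",                  # local path (placeholder)
        "my-generative-ai-bucket-12345",   # bucket (placeholder)
        "large_file.mp4",                  # key in S3
    )
    print("Upload via the accelerate endpoint completed.")
except ClientError as e:
    print(f"Error uploading file: {e}")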

Object Tagging in S3 (Organizing Your Data Efficiently)

Now that we’ve covered S3 Transfer Acceleration, let's move on to Object Tagging, which helps in categorizing, searching, and managing files in S3.


🔍 Why Use Object Tagging?

  1. Organizing AI datasets

    • Tag files as training, testing, or validation.
  2. Cost optimization

    • Apply lifecycle policies based on tags (e.g., archive old data); see the lifecycle sketch after this list.
  3. Access control

    • Restrict permissions using tags (e.g., limit access to sensitive files).
  4. Efficient searching

    • Find files quickly by filtering based on metadata.
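To make the cost-optimization point concrete, tags can drive lifecycle rules; a hedged sketch (rule ID, tag values, and the 30-day window are all example values) that transitions tagged objects to Glacier:

python

import boto3
from botocore.exceptions import ClientError

s3_client = boto3.client('s3', region_name='us-west-2')
bucket_name = "my-generative-ai-bucket-12345"  # Replace with your bucket name

# Hypothetical rule: move anything tagged env=archive to Glacier after 30 days
lifecycle_rule = {
    "Rules": [
        {
            "ID": "archive-tagged-objects",
            "Filter": {"Tag": {"Key": "env", "Value": "archive"}},
            "Status": "Enabled",
            "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
        }
    ]
}

try:
    s3_client.put_bucket_lifecycle_configuration(
        Bucket=bucket_name,
        LifecycleConfiguration=lifecycle_rule,
    )
    print("Tag-based lifecycle rule applied.")
except ClientError as e:
    print(f"Error applying lifecycle rule: {e}")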

📌 Step 1: Upload a File with Tags

When uploading a file, we can add tags to categorize it.

python

import boto3
from botocore.exceptions import ClientError

# Initialize S3 client
s3_client = boto3.client('s3', region_name='us-west-2')

# Define bucket and file details
bucket_name = "my-generative-ai-bucket-12345"  # Replace with your bucket name
local_file_path = "C:\\Users\\YourName\\Documents\\example.txt"  # Replace with your file path
s3_object_name = "example.txt"

def upload_file_with_tags():
    """Upload a file to S3 with tags."""
    try:
        s3_client.upload_file(
            local_file_path,
            bucket_name,
            s3_object_name,
            ExtraArgs={
                "Tagging": "project=AI&env=production"  # Key=Value format
            }
        )
        print(f"File '{s3_object_name}' uploaded successfully with tags!")
    except ClientError as e:
        print(f"Error uploading file: {e}")

# Run function to upload file with tags
upload_file_with_tags()

📌 Step 2: Add/Update Tags for an Existing File

If a file is already in S3, we can update or add new tags.

python

def add_tags_to_existing_file():
    """Add or update tags for an existing file in S3."""
    try:
        s3_client.put_object_tagging(
            Bucket=bucket_name,
            Key=s3_object_name,
            Tagging={
                "TagSet": [
                    {"Key": "project", "Value": "AI"},
                    {"Key": "env", "Value": "staging"}  # Change from production to staging
                ]
            }
        )
        print(f"Tags updated for '{s3_object_name}'!")
    except ClientError as e:
        print(f"Error updating tags: {e}")

# Run function to update tags
add_tags_to_existing_file()

📌 Step 3: Retrieve Tags for a File

You can fetch tags of an object to check how it’s categorized.

python

def get_file_tags():
    """Retrieve tags of a file in S3."""
    try:
        response = s3_client.get_object_tagging(Bucket=bucket_name, Key=s3_object_name)
        tags = response["TagSet"]
        print(f"Tags for '{s3_object_name}': {tags}")
    except ClientError as e:
        print(f"Error retrieving tags: {e}")

# Run function to get tags
get_file_tags()

📌 Step 4: Remove Tags from a File

If you want to remove all tags, use this:

python

def remove_file_tags():
    """Remove all tags from a file in S3."""
    try:
        s3_client.delete_object_tagging(Bucket=bucket_name, Key=s3_object_name)
        print(f"All tags removed from '{s3_object_name}'!")
    except ClientError as e:
        print(f"Error removing tags: {e}")

# Run function to delete tags
remove_file_tags()

S3 Bucket Policies & Access Control (IAM, Bucket Policies, and Public Access)

Now that you've mastered Object Tagging, let's move on to S3 Bucket Policies & Access Control, which are crucial for securing and managing access to your S3 bucket.

🔍 Why Use Bucket Policies & Access Control?

  1. Restrict Unauthorized Access

    • Ensures only authorized users/services can access the bucket.
  2. Public or Private File Access

    • Controls whether a file is accessible via a public URL.
  3. Secure AI Model Storage

    • Protects AI datasets and models from unintended modifications.
  4. Enable Cross-Account Access

    • Allows trusted AWS accounts to access the bucket securely.

✅ Three Ways to Control Access in S3

  1. IAM Policies

    • Best for: Controlling access based on AWS users & roles.
    • Applied to: AWS Users, Groups, or Roles (an inline-policy sketch follows this list).
  2. Bucket Policies

    • Best for: Managing access to the entire bucket.
    • Applied to: S3 Buckets.
  3. Block Public Access Settings

    • Best for: Ensuring a bucket is never publicly accessible.
    • Applied to: S3 Buckets.
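For the first option, an identity-based policy is attached to the user or role rather than to the bucket; a hedged sketch (user name, policy name, and bucket are placeholders) that attaches an inline policy with put_user_policy():

python

import json
import boto3
from botocore.exceptions import ClientError

iam_client = boto3.client('iam')

# Placeholder names -- adjust to your own user and bucket
user_name = "MyGenerativeAIUser"
bucket_name = "my-generative-ai-bucket-12345"

policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
            "Resource": [
                f"arn:aws:s3:::{bucket_name}",
                f"arn:aws:s3:::{bucket_name}/*",
            ],
        }
    ],
}

try:
    iam_client.put_user_policy(
        UserName=user_name,
        PolicyName="S3BucketReadWrite",
        PolicyDocument=json.dumps(policy_document),
    )
    print(f"Inline policy attached to '{user_name}'.")
except ClientError as e:
    print(f"Error attaching inline policy: {e}")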

📌 Step 1: Restrict or Allow Public Access to an S3 Bucket

By default, AWS blocks public access for security. You must disable block public access settings if you need to allow public reads.

python
import boto3
from botocore.exceptions import ClientError

# Initialize S3 client
s3_client = boto3.client('s3', region_name='us-west-2')

# Define bucket name
bucket_name = "my-generative-ai-bucket-12345"  # Replace with your bucket name

def modify_public_access_settings(allow_public=True):
    """Enable or disable public access settings for the S3 bucket."""
    try:
        public_access_config = {
            "BlockPublicAcls": not allow_public,
            "IgnorePublicAcls": not allow_public,
            "BlockPublicPolicy": not allow_public,
            "RestrictPublicBuckets": not allow_public
        }
        s3_client.put_public_access_block(
            Bucket=bucket_name,
            PublicAccessBlockConfiguration=public_access_config
        )
        status = "Enabled" if allow_public else "Disabled"
        print(f"✅ Public access settings updated: Public access is now {status}.")
    except ClientError as e:
        print(f"❌ Error modifying public access settings: {e}")

# Example usage: Set `allow_public=False` to block public access
modify_public_access_settings(allow_public=True)
  • 🔹 Set allow_public=False to completely restrict public access.
  • 🔹 Set allow_public=True to allow public access (use with caution!).

📌 Step 2: Apply an S3 Bucket Policy to Control Access

A Bucket Policy allows or denies access to the entire bucket.

python
import json

def apply_bucket_policy():
    """Apply an S3 bucket policy to allow public read access."""
    try:
        bucket_policy = {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Sid": "PublicReadGetObject",
                    "Effect": "Allow",
                    "Principal": "*",
                    "Action": "s3:GetObject",
                    "Resource": f"arn:aws:s3:::{bucket_name}/*"
                }
            ]
        }
        s3_client.put_bucket_policy(
            Bucket=bucket_name,
            Policy=json.dumps(bucket_policy)
        )
        print(f"✅ Bucket policy applied to '{bucket_name}'!")
    except ClientError as e:
        print(f"❌ Error applying bucket policy: {e}")

# Apply public read policy
apply_bucket_policy()
  • 🔹 This makes all objects in the bucket publicly readable.
  • 🔹 Modify Principal to restrict access to a specific AWS account.

📌 Step 3: Grant Read Access to a Specific File (Recommended Fix for ACL Issue)

Instead of using ACLs, AWS now recommends using a Bucket Policy to grant access to specific objects.

python
def make_file_public(file_key):
    """Grant public read access to a specific file using a Bucket Policy (NOT ACLs)."""
    try:
        # Define policy for a single file
        bucket_policy = {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Sid": "PublicReadObject",
                    "Effect": "Allow",
                    "Principal": "*",
                    "Action": "s3:GetObject",
                    "Resource": f"arn:aws:s3:::{bucket_name}/{file_key}"
                }
            ]
        }

        # Apply policy
        s3_client.put_bucket_policy(
            Bucket=bucket_name,
            Policy=json.dumps(bucket_policy)
        )
        print(f"✅ Public read access granted for '{file_key}' using a Bucket Policy!")
    except ClientError as e:
        print(f"❌ Error applying bucket policy: {e}")

# Example usage: Make a single file public
make_file_public("example.txt")
  • 🔹 This avoids the common failure where put_object_acl() is rejected because the bucket enforces Object Ownership (ACLs disabled).
  • 🔹 This method ensures that only this specific file is public, not the entire bucket.

📌 Step 4: Grant Cross-Account Access (Allow Another AWS Account to Access Bucket)

If you need to allow another AWS account to access the bucket, modify the bucket policy.

python
def allow_cross_account_access(account_id):
    """Allow another AWS account to access the bucket."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowCrossAccount",
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{account_id}:root"},
                "Action": "s3:*",
                "Resource": [
                    f"arn:aws:s3:::{bucket_name}",
                    f"arn:aws:s3:::{bucket_name}/*"
                ]
            }
        ]
    }
    try:
        s3_client.put_bucket_policy(Bucket=bucket_name, Policy=json.dumps(policy))
        print(f"✅ Cross-account access granted to AWS Account: {account_id}")
    except ClientError as e:
        print(f"❌ Error applying cross-account access policy: {e}")

# Example: Allow account ID "123456789012"
allow_cross_account_access("123456789012")
  • 🔹 Replace account_id with the AWS Account ID you want to allow access.

🚀 Which Method Should You Use?

  1. Restrict access per AWS user or role

    • Use IAM Policies.
  2. Manage access for the entire bucket

    • Use Bucket Policies.
  3. Allow or block all public access

    • Use Block Public Access Settings.
  4. Allow cross-account access

    • Use Bucket Policies with Principal.
  5. Grant public read access to a single file

    • Use Bucket Policies (NOT ACLs).

✅ Summary of S3 Public Access Settings

  1. BlockPublicAcls

    • Effect: Prevents public ACLs.
    • Best Use Case: Use when relying on Bucket Policies instead of ACLs.
  2. IgnorePublicAcls

    • Effect: Ignores existing ACLs.
    • Best Use Case: Use to completely disable ACL-based access.
  3. BlockPublicPolicy

    • Effect: Prevents public bucket policies.
    • Best Use Case: Use to restrict public access even via policies.
  4. RestrictPublicBuckets

    • Effect: Fully blocks public access.
    • Best Use Case: Strongest security measure to prevent accidental exposure.
