Building an AI Agent That Automates Terraform Infrastructure Provisioning

by Pramodh Kumar M • Published: June 30, 2026 • 20 min read

AI generated Terraform that skips the plan step has already caused real incidents, from accidental resource replacements to security group deletions, even as AI now produces configuration faster than teams can review it. The safe way to let an agent provision infrastructure is to make the plan and the approval permanent parts of the loop, not optional steps. This tutorial builds that agent in Go: it writes Terraform, validates it, shows you the plan, and applies only after you approve. Everything runs in a browser playground that gives you a sandbox AWS account, so you create real AWS resources without your own account or a bill.

Highlights

The agent enforces plan before apply.
Every apply and destroy waits for your approval.
You provision against a sandbox AWS account at no cost.
The agent writes the HCL itself.
One key reaches the model, with no separate provider account.
Misconfiguration is the real risk, and the plan is where you catch it.

Why build an agent for Terraform provisioning

Terraform became the default syntax for infrastructure as code across AWS, Azure, GCP, and Kubernetes, and the practice is still growing fast. Market research cited in industry roundups put the infrastructure as code market near 908 million dollars in 2023 and on track for roughly 3.3 billion by 2030. That is a large and rising amount of production infrastructure whose entire shape lives in HCL (HashiCorp Configuration Language).

AI changes the economics of writing that HCL, and not entirely for the better. Models now generate Terraform faster than most teams can review it, and the failure mode is specific: configuration that gets applied without a plan. The same trend report notes that AI generated Terraform that skips the plan has produced real incidents, including resource replacements, security group deletions, and IAM policy overwrites. The plan output is precisely what would have caught those changes.

Misconfiguration, not exploitation, is where most cloud damage starts. Security research attributes more than 60 percent of cloud breaches to misconfigurations, and scanners find plenty to flag: Checkov catches an average of 14 high severity issues per 1,000 lines of IaC in teams that were not scanning before. An agent that generates infrastructure has to treat the plan and a review as mandatory, or it just produces those misconfigurations faster.

Go is the natural language for the agent. Terraform and its providers are written in Go, the agent compiles to a single binary you can drop onto any node, and shelling out to the terraform CLI is clean and predictable. If you followed the Kubernetes version of this agent, the skeleton here is identical: the same model client, the same reasoning loop, the same approval gate. Only the tools change, from kubectl to terraform.

What you are building

The agent is a command line program with a control loop. You give it a request in plain language, such as provisioning an S3 bucket for application logs. The program sends that request to a language model along with a description of the tools it may use. The model writes the Terraform configuration, then the agent validates it, runs a plan so the change is visible, asks you to approve, and applies. You can also ask it what it currently manages.

The provider is configured for you. A small provider.tf sets up the AWS provider before the agent runs, and the sandbox supplies the credentials, so the model writes only resource blocks and never a provider block. That keeps the generated configuration focused and stops the model from inventing credentials.

Seven tools define the agent's reach. Five are safe and run without interruption: writing configuration to a file, validating it, running a plan, listing managed state, and showing the details of one resource. Two change real infrastructure, apply and destroy, and both pause for explicit approval. Writing a file is safe because nothing reaches the cloud until apply, which is why only apply and destroy sit behind the gate.
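
The split between safe and gated tools can be sketched as a simple lookup. This is a minimal illustration and not part of the agent built in the steps below: the `mutating` map and the `needsApproval` helper are hypothetical names, while the seven tool names are the ones this tutorial uses.

```go
package main

import "fmt"

// mutating holds the tools that change real infrastructure.
// Hypothetical helper for illustration only; the tutorial's executor
// encodes the same split inside its switch statement.
var mutating = map[string]bool{
	"terraform_apply":   true,
	"terraform_destroy": true,
}

// needsApproval reports whether a tool must pause for the operator.
func needsApproval(tool string) bool {
	return mutating[tool]
}

func main() {
	tools := []string{
		"write_config", "terraform_validate", "terraform_plan",
		"show_state", "show_resource", "terraform_apply", "terraform_destroy",
	}
	for _, t := range tools {
		fmt.Printf("%-20s needs approval: %v\n", t, needsApproval(t))
	}
}
```

Keeping the gated set this small is the design choice that matters: anything not listed runs freely, so a new tool that touches the cloud has to be added to the gate deliberately.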

Here is the flow in one pass. A request enters through the command line. The loop calls the model, which returns a tool action as JSON. The executor runs it, appends the result to the conversation, and loops again. The model usually writes config, validates, plans, and then requests apply, at which point the agent stops and waits for your yes. Each resource lands in its own file, so successive requests add to what exists rather than overwriting it. A turn limit caps the loop so a confused agent cannot spin forever.

Choosing the right environment

You need two things: a Terraform workspace that can actually provision resources, and model access for the agent to reason with. A browser based Terraform and AWS playground covers the first, because it ships Terraform pre installed and is wired to a sandbox AWS account, so you create real AWS resources with no account of your own and no cost. A single KodeKey covers the second, so a reader with a KodeKloud account needs no separate provider account.

Generate a KodeKey API key from your KodeKloud account before you start. It is an account credential rather than a session token, and the place you create it is a settings page rather than a playground, so generating the key does not consume your one active playground slot. The same key reaches several models through one OpenAI compatible endpoint. KodeKey is built for learning, with a modest monthly request allowance, which suits a tutorial and a handful of runs rather than heavy iteration.

Install Go and confirm Terraform is present:

# Install a current Go toolchain on the playground node
curl -sSL https://go.dev/dl/go1.23.4.linux-amd64.tar.gz -o go.tar.gz
sudo tar -C /usr/local -xzf go.tar.gz
export PATH=$PATH:/usr/local/go/bin
go version

# Confirm the terraform CLI is available
terraform version

Export your KodeKey so it never touches the source:

export KODEKEY_API_KEY="your-kodekey-here"

One requirement is easy to miss. The agent makes an outbound HTTPS call to the model endpoint on every turn, and terraform init downloads the AWS provider once, so the environment needs egress. Most playgrounds allow outbound traffic. If yours blocks it, run the agent from a machine with internet access.
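
A small preflight check can catch a missing CLI or key before the first model call. The sketch below is an optional addition, not part of the tutorial's code: `preflight` is a hypothetical helper, and it only checks that `terraform` is on the PATH and that `KODEKEY_API_KEY` is set. It does not test outbound network access.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// preflight returns the problems found before the agent starts.
// found and getenv are injected so the check can be exercised without
// touching the real environment; both names are illustrative.
func preflight(found func(name string) bool, getenv func(string) string) []string {
	var problems []string
	if !found("terraform") {
		problems = append(problems, "terraform CLI not found on PATH")
	}
	if getenv("KODEKEY_API_KEY") == "" {
		problems = append(problems, "KODEKEY_API_KEY is not set")
	}
	return problems
}

func main() {
	// onPath wraps exec.LookPath into a yes/no answer.
	onPath := func(name string) bool {
		_, err := exec.LookPath(name)
		return err == nil
	}
	problems := preflight(onPath, os.Getenv)
	if len(problems) == 0 {
		fmt.Println("preflight: ok")
		return
	}
	for _, p := range problems {
		fmt.Println("preflight:", p)
	}
}
```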

Step 1: Create the project and the Terraform workspace

Make a directory, initialize a Go module, and write the provider configuration. The agent, its source, and the Terraform files all live in this one directory.

mkdir tfagent && cd tfagent
go mod init tfagent

Create provider.tf. The sandbox account already supplies credentials to the environment, so the file only needs the provider version and a region. If the playground uses a different region, set it here.

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = "us-east-1"
}

Initialize Terraform once to download the provider:

terraform init

Now start main.go with the package declaration, imports, and constants. KodeKey is OpenAI compatible, so the endpoint is a chat completions URL and the model id is a plain name. Pick any id that KodeKey lists; a more capable model follows the multi step write, plan, apply flow more reliably than a small one.

package main

import (
	"bufio"
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"os"
	"os/exec"
	"strings"
	"time"
)

const (
	baseURL   = "https://api.ai.kodekloud.com/v1/chat/completions"
	model     = "claude-haiku-4-5"
	maxTokens = 1500
)

Step 2: Model the chat API in Go

The model speaks a simple chat shape: a list of messages, each with a role and text. One extra type, Action, models the JSON contract where every reply is either a tool call or a final answer. Using json.RawMessage for the tool arguments lets each tool decode only the fields it needs.

type ChatMessage struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type ChatRequest struct {
	Model     string        `json:"model"`
	Messages  []ChatMessage `json:"messages"`
	MaxTokens int           `json:"max_tokens,omitempty"`
}

type ChatResponse struct {
	Choices []struct {
		Message ChatMessage `json:"message"`
	} `json:"choices"`
	Error *struct {
		Message string `json:"message"`
	} `json:"error,omitempty"`
}

type Action struct {
	Action  string          `json:"action"`
	Tool    string          `json:"tool,omitempty"`
	Args    json.RawMessage `json:"args,omitempty"`
	Message string          `json:"message,omitempty"`
}
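
To make the two-stage decode concrete, here is a minimal sketch of how a reply could be unpacked: first into `Action`, then from the raw `args` into a struct that holds only what one tool needs. The `showResourceArgs` type and the `decodeAddress` helper are hypothetical names used for illustration; the `Action` type is repeated so the example compiles on its own.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Action mirrors the tutorial's reply contract.
type Action struct {
	Action  string          `json:"action"`
	Tool    string          `json:"tool,omitempty"`
	Args    json.RawMessage `json:"args,omitempty"`
	Message string          `json:"message,omitempty"`
}

// showResourceArgs holds only the field show_resource cares about.
type showResourceArgs struct {
	Address string `json:"address"`
}

// decodeAddress decodes the outer action first, then the tool-specific args.
// Args stays as raw bytes until a tool decides how to interpret it.
func decodeAddress(reply string) (string, error) {
	var a Action
	if err := json.Unmarshal([]byte(reply), &a); err != nil {
		return "", fmt.Errorf("decode action: %w", err)
	}
	var args showResourceArgs
	if err := json.Unmarshal(a.Args, &args); err != nil {
		return "", fmt.Errorf("decode args for %s: %w", a.Tool, err)
	}
	return args.Address, nil
}

func main() {
	reply := `{"action":"tool","tool":"show_resource","args":{"address":"aws_s3_bucket.logs"}}`
	addr, err := decodeAddress(reply)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("address:", addr)
}
```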

Step 3: Call the model through KodeKey

This function marshals the conversation, sets the Authorization header with your KodeKey as a bearer token, posts to the endpoint, and returns the assistant text. Surfacing the raw body on a decode failure saves real debugging time, and a timeout keeps a stalled request from hanging the agent.

func callModel(apiKey string, messages []ChatMessage) (string, error) {
	payload, err := json.Marshal(ChatRequest{Model: model, Messages: messages, MaxTokens: maxTokens})
	if err != nil {
		return "", fmt.Errorf("marshal request: %w", err)
	}
	req, err := http.NewRequest(http.MethodPost, baseURL, bytes.NewReader(payload))
	if err != nil {
		return "", fmt.Errorf("build request: %w", err)
	}
	req.Header.Set("Authorization", "Bearer "+apiKey)
	req.Header.Set("Content-Type", "application/json")

	resp, err := (&http.Client{Timeout: 90 * time.Second}).Do(req)
	if err != nil {
		return "", fmt.Errorf("send request: %w", err)
	}
	defer resp.Body.Close()

	raw, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", fmt.Errorf("read response: %w", err)
	}

	var cr ChatResponse
	if err := json.Unmarshal(raw, &cr); err != nil {
		return "", fmt.Errorf("decode response (HTTP %d): %w (raw: %s)", resp.StatusCode, err, string(raw))
	}
	if cr.Error != nil {
		return "", fmt.Errorf("api error (HTTP %d): %s", resp.StatusCode, cr.Error.Message)
	}
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("api returned HTTP %d: %s", resp.StatusCode, strings.TrimSpace(string(raw)))
	}
	if len(cr.Choices) == 0 {
		return "", fmt.Errorf("no choices returned (HTTP %d): %s", resp.StatusCode, string(raw))
	}
	return cr.Choices[0].Message.Content, nil
}
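
Because the request allowance is limited, it can help to exercise the request and decode path without calling the real endpoint. The sketch below swaps in a stub `http.RoundTripper` that returns a canned chat completion and never opens a connection. It is an illustration under assumptions: `stubClient`, `ask`, the placeholder URL, and the `stub-model` id are hypothetical, and `callModel` as written above builds its own client, so using a stub with it would mean passing a client in.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

type ChatMessage struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

// roundTripFunc lets a plain function act as an http.RoundTripper.
type roundTripFunc func(*http.Request) (*http.Response, error)

func (f roundTripFunc) RoundTrip(r *http.Request) (*http.Response, error) { return f(r) }

// stubClient answers every request with one canned assistant message.
func stubClient(content string) *http.Client {
	return &http.Client{Transport: roundTripFunc(func(r *http.Request) (*http.Response, error) {
		body, err := json.Marshal(map[string]any{
			"choices": []map[string]any{
				{"message": ChatMessage{Role: "assistant", Content: content}},
			},
		})
		if err != nil {
			return nil, err
		}
		return &http.Response{
			StatusCode: http.StatusOK,
			Header:     http.Header{"Content-Type": []string{"application/json"}},
			Body:       io.NopCloser(bytes.NewReader(body)),
			Request:    r,
		}, nil
	})}
}

// ask posts one user message through client and returns the reply text.
func ask(client *http.Client, url, prompt string) (string, error) {
	payload, err := json.Marshal(map[string]any{
		"model":    "stub-model", // hypothetical id, never sent anywhere real
		"messages": []ChatMessage{{Role: "user", Content: prompt}},
	})
	if err != nil {
		return "", err
	}
	resp, err := client.Post(url, "application/json", bytes.NewReader(payload))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	var out struct {
		Choices []struct {
			Message ChatMessage `json:"message"`
		} `json:"choices"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return "", err
	}
	if len(out.Choices) == 0 {
		return "", fmt.Errorf("no choices returned")
	}
	return out.Choices[0].Message.Content, nil
}

func main() {
	client := stubClient(`{"action":"final","message":"stubbed"}`)
	reply, err := ask(client, "http://stub.invalid/v1/chat/completions", "hello")
	fmt.Println(reply, err)
}
```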

Step 4: Describe the tools and the reply contract

The tool catalogue lives in the system prompt, and the model replies in a fixed JSON shape. The prompt spells out each tool, marks which ones change real infrastructure, and states the order of operations: write, validate, plan, then apply. It also forbids two failure modes that matter for infrastructure: claiming a change happened without a confirming result, and skipping the plan. A small parser pulls the JSON object out of the reply even when the model wraps it in prose or a code fence.

const systemPrompt = `You are an infrastructure provisioning agent that manages AWS resources with Terraform.
The AWS provider is already configured in provider.tf to use a sandbox AWS account, so you
write only resource configuration and never a provider block.

Tools you can call:
- write_config(filename, content): write Terraform HCL to a .tf file named after the resource, such as s3.tf or ecr.tf.
- terraform_validate(): check that the configuration is valid. Read only.
- terraform_plan(): preview what Terraform will create, change, or destroy. Read only.
- show_state(): list the resources Terraform currently manages. Read only.
- show_resource(address): show the full attributes of one managed resource, such as aws_s3_bucket.logs. Read only.
- terraform_apply(summary): apply the planned changes. This creates or changes real resources and needs approval.
- terraform_destroy(summary): destroy ALL managed resources, not a single one. This needs approval.

On every turn reply with exactly ONE JSON object and nothing else.
To call a tool: {"action":"tool","tool":"<name>","args":{ ... }}
To finish:      {"action":"final","message":"<your concise answer>"}

Rules you must follow:
- To create or change a resource, your FIRST action is always write_config for that resource. Never validate, plan, or apply for a new resource before its configuration is written.
- Put each resource in its own file named after it, such as s3.tf, ecr.tf, or ebs.tf. Never write a different resource into a file that already holds one you created, because overwriting it makes the next apply destroy that resource. Reuse a file only to correct the resource it already contains.
- After writing, validate, then run a plan so the diff is visible before any apply.
- To change anything you must emit an apply or destroy action. Describing a change is not performing it.
- Only state that infrastructure changed after a TOOL RESULT confirms it. Never invent a result.
- If a plan shows no changes for something you were asked to create, you have not written its configuration yet. Write it with write_config, then plan and apply.
- To remove a single resource, rewrite its file without that resource, leaving a comment if the file would otherwise be empty, then plan and apply. That destroys only that resource. Use terraform_destroy only to tear down everything.
- Write only resource blocks plus any data source you need, since the provider is already configured.
- For globally unique names like S3 buckets, append the AWS account id from a data aws_caller_identity source. This is stable. Never use timestamp() or the random provider for uniqueness.`

func parseAction(reply string) (Action, error) {
	s := strings.TrimSpace(reply)
	if i := strings.Index(s, "{"); i >= 0 {
		s = s[i:]
	}
	var a Action
	err := json.NewDecoder(strings.NewReader(s)).Decode(&a)
	return a, err
}
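
The parser above starts at the first opening brace, so a stray brace in surrounding prose makes the decode fail and the loop has to spend a turn asking again. A more tolerant variant could try each brace in turn and accept the first object that carries an `action` field. This is a hypothetical alternative, not the tutorial's parser; `parseActionLenient` is an illustrative name.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

type Action struct {
	Action  string          `json:"action"`
	Tool    string          `json:"tool,omitempty"`
	Args    json.RawMessage `json:"args,omitempty"`
	Message string          `json:"message,omitempty"`
}

// parseActionLenient tries every '{' as a possible start of the action object
// and returns the first candidate that decodes and has a non-empty action.
func parseActionLenient(reply string) (Action, error) {
	for i := 0; i < len(reply); i++ {
		if reply[i] != '{' {
			continue
		}
		var a Action
		err := json.NewDecoder(strings.NewReader(reply[i:])).Decode(&a)
		if err == nil && a.Action != "" {
			return a, nil
		}
	}
	return Action{}, errors.New("no action object found in reply")
}

func main() {
	fence := strings.Repeat("`", 3) // a markdown code fence
	replies := []string{
		fence + "json\n{\"action\":\"tool\",\"tool\":\"show_state\",\"args\":{}}\n" + fence,
		"Plan first {as required}: {\"action\":\"final\",\"message\":\"done\"}",
		"no JSON at all",
	}
	for _, r := range replies {
		a, err := parseActionLenient(r)
		fmt.Printf("action=%q tool=%q err=%v\n", a.Action, a.Tool, err)
	}
}
```

The trade-off is that a lenient parser hides sloppy replies from the model, so the nudge back to the contract fires less often.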

Step 5: Execute Terraform actions

This is where intent becomes infrastructure. A helper shells out to the terraform CLI and captures output, combining both streams on failure so the model can read the error. The executor dispatches each tool call. Writing config is a guarded file write: it stays inside the working directory, accepts only .tf files, and refuses to overwrite the provider configuration or write an empty file. The read only commands run directly. Apply and destroy each pause for approval through a shared confirm helper before they touch anything.

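// runTerraform runs the terraform CLI in the current directory. On success it
// returns trimmed stdout; on failure it returns stdout and stderr combined so
// the model can read the error text.
func runTerraform(args ...string) (string, error) {
	cmd := exec.Command("terraform", args...)
	var out, errBuf bytes.Buffer
	cmd.Stdout = &out
	cmd.Stderr = &errBuf
	if err := cmd.Run(); err != nil {
		combined := strings.TrimSpace(out.String() + "\n" + errBuf.String())
		return combined, fmt.Errorf("terraform failed: %w", err)
	}
	return strings.TrimSpace(out.String()), nil
}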

func executeTool(a Action, reader *bufio.Reader) string {
	run := func(args ...string) string {
		out, err := runTerraform(args...)
		if err != nil {
			if out == "" {
				return "ERROR: " + err.Error()
			}
			return "ERROR: " + out
		}
		if out == "" {
			return "(no output)"
		}
		return out
	}

	confirm := func(verb, summary string) bool {
		fmt.Printf("\n[approval needed] terraform %s. %s\nProceed? (y/n): ", verb, summary)
		answer, _ := reader.ReadString('\n')
		return strings.TrimSpace(strings.ToLower(answer)) == "y"
	}

	switch a.Tool {

	case "write_config":
		var in struct {
			Filename string `json:"filename"`
			Content  string `json:"content"`
		}
		json.Unmarshal(a.Args, &in)
		if in.Filename == "" {
			in.Filename = "main.tf"
		}
		if strings.ContainsAny(in.Filename, `/\`) || !strings.HasSuffix(in.Filename, ".tf") {
			return "ERROR: filename must be a .tf file in the working directory"
		}
		if in.Filename == "provider.tf" {
			return "ERROR: provider.tf is reserved; name the file after the resource instead"
		}
		if strings.TrimSpace(in.Content) == "" {
			return "ERROR: content is required"
		}
		note := ""
		if _, err := os.Stat(in.Filename); err == nil {
			note = " (overwrote an existing file; if it held a different resource, apply will destroy it)"
		}
		if err := os.WriteFile(in.Filename, []byte(in.Content), 0o644); err != nil {
			return "ERROR: " + err.Error()
		}
		return "wrote " + in.Filename + note

	case "terraform_validate":
		return run("validate", "-no-color")

	case "terraform_plan":
		return run("plan", "-input=false", "-no-color")

	case "show_state":
		return run("state", "list")

	case "show_resource":
		var in struct {
			Address string `json:"address"`
		}
		json.Unmarshal(a.Args, &in)
		if in.Address == "" {
			return "ERROR: address is required, for example aws_s3_bucket.logs"
		}
		return run("state", "show", in.Address)

	case "terraform_apply":
		var in struct {
			Summary string `json:"summary"`
		}
		json.Unmarshal(a.Args, &in)
		if !confirm("apply", in.Summary) {
			return "Operator denied the apply. No infrastructure was changed."
		}
		return run("apply", "-input=false", "-auto-approve", "-no-color")

	case "terraform_destroy":
		var in struct {
			Summary string `json:"summary"`
		}
		json.Unmarshal(a.Args, &in)
		if !confirm("destroy", in.Summary) {
			return "Operator denied the destroy. No infrastructure was changed."
		}
		return run("destroy", "-input=false", "-auto-approve", "-no-color")

	default:
		return "unknown tool: " + a.Tool
	}
}
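
One property of the executor is worth knowing: terraform_plan and terraform_apply each compute their own plan, so the apply acts on whatever Terraform calculates at that moment, which can differ from the reviewed plan if files or cloud state changed in between. Terraform can save a plan with `-out` and later apply exactly that file. The sketch below only builds the argument lists for that variant and does not run Terraform; `planArgs`, `applyArgs`, and the `tfplan` file name are hypothetical and not part of the tutorial's code.

```go
package main

import (
	"fmt"
	"strings"
)

// planArgs builds the arguments for a plan that is saved to planFile.
func planArgs(planFile string) []string {
	return []string{"plan", "-input=false", "-no-color", "-out=" + planFile}
}

// applyArgs builds the arguments that apply a previously saved plan.
// Terraform applies a saved plan file without asking for confirmation,
// so -auto-approve is not needed here; the agent's own approval gate
// still has to run before this command does.
func applyArgs(planFile string) []string {
	return []string{"apply", "-input=false", "-no-color", planFile}
}

func main() {
	const planFile = "tfplan" // illustrative file name
	fmt.Println("terraform", strings.Join(planArgs(planFile), " "))
	fmt.Println("terraform", strings.Join(applyArgs(planFile), " "))
}
```

In the executor, these slices would be passed to the existing run helper, for example `run(planArgs("tfplan")...)`.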

Step 6: Build the agent loop

The loop ties everything together. It seeds the conversation with the system prompt and the request, then on each turn calls the model, records the reply, and parses it. A tool action runs the tool and appends the result as the next user message. A final action prints the answer and stops, unless the model tries to finish before it has called any tool, in which case the loop pushes back once and asks it to do the work first. That check is the guardrail against an agent that narrates a result it never produced, which is the most important safeguard when a smaller model is tempted to summarize an outcome instead of executing it. The loop adds two more checks for this agent. For a create or change request, it pushes back the first time the model tries to validate, plan, or apply before write_config has been called, and again the first time the model tries to finish without having written any configuration. Each nudge fires once per run, so these are strong prompts rather than hard blocks, and they stop the common case where the model inspects stale files and quits without provisioning anything. A small keyword classifier decides whether a request needs to write configuration, so plain questions and full teardowns skip that requirement. When a reply is not valid JSON, the loop nudges the model back to the contract and retries. The turn limit is a touch higher than in the Kubernetes agent, since provisioning takes more steps.

func hasPrefixAny(s string, prefixes ...string) bool {
	for _, p := range prefixes {
		if strings.HasPrefix(s, p) {
			return true
		}
	}
	return false
}

func containsAny(s string, subs ...string) bool {
	for _, sub := range subs {
		if strings.Contains(s, sub) {
			return true
		}
	}
	return false
}

func requestNeedsWrite(goal string) bool {
	lower := strings.ToLower(strings.TrimSpace(goal))
	isQuery := hasPrefixAny(lower, "what", "which", "show", "list", "describe", "how many", "tell me", "do you", "get ", "is ", "are ", "does ")
	isTeardown := containsAny(lower, "everything", "tear down", "teardown", "destroy all", "delete all", "remove all", "all resources", "all infrastructure")
	return !isQuery && !isTeardown
}

func main() {
	apiKey := os.Getenv("KODEKEY_API_KEY")
	if apiKey == "" {
		fmt.Println("set KODEKEY_API_KEY before running the agent")
		os.Exit(1)
	}
	if len(os.Args) < 2 {
		fmt.Println("usage: agent \"your request in plain language\"")
		os.Exit(1)
	}

	goal := strings.Join(os.Args[1:], " ")
	needsWrite := requestNeedsWrite(goal)

	reader := bufio.NewReader(os.Stdin)
	messages := []ChatMessage{
		{Role: "system", Content: systemPrompt},
		{Role: "user", Content: goal},
	}

	toolUsed := false
	wroteConfig := false
	nudgedNoTool := false
	nudgedNoWrite := false
	nudgedIncomplete := false

	for turn := 0; turn < 16; turn++ {
		reply, err := callModel(apiKey, messages)
		if err != nil {
			fmt.Println("error:", err)
			os.Exit(1)
		}
		messages = append(messages, ChatMessage{Role: "assistant", Content: reply})

		action, err := parseAction(reply)
		if err != nil {
			messages = append(messages, ChatMessage{
				Role:    "user",
				Content: "That was not valid JSON. Reply with exactly one JSON object as instructed.",
			})
			continue
		}

		if action.Action == "final" {
			// A final answer with no tool call at all is a fabricated result.
			if !toolUsed && !nudgedNoTool {
				nudgedNoTool = true
				messages = append(messages, ChatMessage{
					Role: "user",
					Content: "You have not called any tools, so nothing has been written, planned, or applied. " +
						"Do not report a result you did not produce. If the request needs infrastructure work, " +
						"start now with write_config, then validate, plan, and apply.",
				})
				continue
			}
			// A create request that never wrote configuration provisioned nothing.
			if needsWrite && !wroteConfig && !nudgedIncomplete {
				nudgedIncomplete = true
				messages = append(messages, ChatMessage{
					Role: "user",
					Content: "You are finishing a create or change request without having written any configuration, " +
						"so nothing was provisioned. Call write_config with the resource HCL, then validate, plan, and apply.",
				})
				continue
			}
			fmt.Println("\n" + action.Message)
			return
		}

		// For a create or change request, write configuration before inspecting or applying.
		if needsWrite && !wroteConfig && !nudgedNoWrite &&
			(action.Tool == "terraform_validate" || action.Tool == "terraform_plan" || action.Tool == "terraform_apply") {
			nudgedNoWrite = true
			messages = append(messages, ChatMessage{
				Role: "user",
				Content: "For a create or change request you must write the configuration first. " +
					"Call write_config with the HCL for the requested resource, then validate, plan, and apply.",
			})
			continue
		}

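		fmt.Printf("-> calling %s %s\n", action.Tool, string(action.Args))
		result := executeTool(action, reader)
		toolUsed = true
		// Print the plan for the operator. Without this, the plan text reaches only
		// the model, and the approval prompt would rest on the model's summary alone.
		if action.Tool == "terraform_plan" {
			fmt.Println(result)
		}
		if action.Tool == "write_config" && !strings.HasPrefix(result, "ERROR:") {
			wroteConfig = true
		}
		messages = append(messages, ChatMessage{Role: "user", Content: "TOOL RESULT:\n" + result})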
	}

	fmt.Println("\nreached the turn limit without a final answer")
}
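
The keyword classifier is a heuristic, and running it against a few sample requests makes its edges visible. The sketch below uses a condensed copy of `requestNeedsWrite` with the same keyword lists; the `hasAny` helper and the sample requests are made up for illustration. Two of the samples are misclassified: a removal phrased as "get rid of" starts with a query prefix, and a change that contains "remove all" looks like a teardown, so in both cases the write-first check would be skipped.

```go
package main

import (
	"fmt"
	"strings"
)

// hasAny reports whether match(s, c) holds for any candidate c.
func hasAny(s string, match func(s, c string) bool, candidates ...string) bool {
	for _, c := range candidates {
		if match(s, c) {
			return true
		}
	}
	return false
}

// requestNeedsWrite is a condensed copy of the tutorial's classifier.
func requestNeedsWrite(goal string) bool {
	lower := strings.ToLower(strings.TrimSpace(goal))
	isQuery := hasAny(lower, strings.HasPrefix,
		"what", "which", "show", "list", "describe", "how many",
		"tell me", "do you", "get ", "is ", "are ", "does ")
	isTeardown := hasAny(lower, strings.Contains,
		"everything", "tear down", "teardown", "destroy all", "delete all",
		"remove all", "all resources", "all infrastructure")
	return !isQuery && !isTeardown
}

func main() {
	samples := []string{
		"provision an S3 bucket for storing application logs",
		"what resources are you managing right now?",
		"destroy all resources",
		"get rid of the logs bucket",                    // classified as a query
		"remove all public access from the logs bucket", // classified as a teardown
	}
	for _, s := range samples {
		fmt.Printf("needsWrite=%-5v %s\n", requestNeedsWrite(s), s)
	}
}
```

Misclassification here only relaxes a nudge, since apply and destroy still wait for approval, but it is a reason to phrase requests plainly or to extend the keyword lists.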

Step 7: Build and run the agent

Compile the binary in the same directory as provider.tf.

go build -o agent .

Ask it to provision something. The agent writes the HCL, validates it, plans it, and then stops for your approval before anything is created:

./agent "provision an S3 bucket for storing application logs"

A typical run prints each step, shows the plan, and waits:

-> calling write_config {"filename":"s3.tf","content":"resource \"aws_s3_bucket\" \"logs\" {\n  bucket = \"app-logs-${data.aws_caller_identity.current.account_id}\"\n}\n\ndata \"aws_caller_identity\" \"current\" {}\n"}
-> calling terraform_validate {}
-> calling terraform_plan {}
-> calling terraform_apply {"summary":"create one S3 bucket for application logs"}

[approval needed] terraform apply. create one S3 bucket for application logs
Proceed? (y/n): y

Created the application logs bucket as app-logs-058264544314. terraform apply added 1
resource, and aws_s3_bucket.logs is now tracked in state.

Type n at the prompt and the agent reports that the apply was declined and stops, leaving your infrastructure untouched. Ask what it manages and it reads state directly:

./agent "what resources are you managing right now?"

-> calling show_state {}

You currently manage one resource, aws_s3_bucket.logs, the application logs bucket.

To tear everything down at the end, ask the agent to destroy all resources. It calls terraform_destroy, which removes every managed resource after the same approval. To drop a single resource instead, ask the agent to remove it, and the agent rewrites that resource's file so the next apply destroys only that one. You now have an agent that generates infrastructure, always shows you the plan, and changes nothing without your consent.

More requests to try

The agent provisions whatever the model can express in HCL and the sandbox permits, so the same write, plan, approve loop reaches well beyond a single bucket. Try these requests in order, simplest first, and keep in mind that which services succeed depends on the set of AWS services the playground sandbox allows. Each resource lands in its own file, so successive requests accumulate rather than replace each other, and show_state lists everything the agent manages.

Single resource requests are the most reliable, since the model writes one block and plans it:

./agent "create an ECR repository named payment-service for container images"
./agent "create a 10 GB encrypted EBS volume in availability zone us-east-1a for database storage"

Once something exists, ask the agent about it. The show_resource tool reads the live attributes straight from state, so the agent can answer questions about a specific resource:

./agent "what is the repository URL of my ECR repository?"

-> calling show_state {}
-> calling show_resource {"address":"aws_ecr_repository.payment_service"}

Your ECR repository payment-service is ready. Its URL is
<account-id>.dkr.ecr.us-east-1.amazonaws.com/payment-service.

More involved resources work the same way, but they ask more of both the model and the sandbox. An EC2 instance needs an image, so a good request leads the agent to add a data aws_ami lookup for the latest Amazon Linux rather than a fixed id, and the sandbox has to permit EC2, which bills even at the smallest size:

./agent "launch a t2.micro EC2 instance running the latest Amazon Linux"

A DynamoDB table is a good step up that still stays within one provider. The model writes a single block with the key schema and billing mode, and it applies with only the AWS provider already in place:

./agent "create a DynamoDB table named sessions with a string partition key called id"

The agent stops where a request needs more than a declarative AWS resource. A Lambda function is the clear boundary, since it needs a zipped deployment package and the archive provider, which would mean adding a provider to provider.tf and running terraform init again. This agent assumes its providers are already installed and does not run init, so a request to create a Lambda makes the model write an archive_file data source whose provider is missing, and validate fails before anything is provisioned.

Across every one of these, the plan stays your safeguard. Read it before you approve: a capable model usually produces clean configuration on the first try, while a smaller one may need a correction before the plan looks right.

Hardening this for production

The playground version is honest about what it is, and a few changes turn it into something a team could run.

Move off the sandbox and KodeKey. For real work, point provider.tf at your own cloud account with least privilege credentials, and point the model at a production endpoint. The agent speaks the OpenAI compatible protocol, so that switch is a base URL, a model id, and a key, with no change to the loop or the tools.
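
One way to make that switch without recompiling is to read the endpoint and model id from the environment and fall back to the tutorial's values. This is a sketch under assumptions: `envOr` is a hypothetical helper, and `AGENT_BASE_URL` and `AGENT_MODEL` are variable names invented for this example. In the agent, the two constants would become variables set this way.

```go
package main

import (
	"fmt"
	"os"
)

// envOr returns the value of key from getenv, or fallback when it is unset or empty.
// getenv is injected so the lookup can be exercised without the real environment.
func envOr(getenv func(string) string, key, fallback string) string {
	if v := getenv(key); v != "" {
		return v
	}
	return fallback
}

func main() {
	// The variable names are hypothetical; the fallbacks are the tutorial's constants.
	baseURL := envOr(os.Getenv, "AGENT_BASE_URL", "https://api.ai.kodekloud.com/v1/chat/completions")
	model := envOr(os.Getenv, "AGENT_MODEL", "claude-haiku-4-5")
	fmt.Println("endpoint:", baseURL)
	fmt.Println("model:   ", model)
}
```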

Use remote, locked state. Local state is fine for a single learner, but a shared workspace needs a remote backend, such as an S3 bucket with a DynamoDB lock, so two runs cannot corrupt each other and the state is recoverable. Drift makes this matter even more, since around 67 percent of teams report significant drift between their code and their cloud.

Run apply in a pipeline, not from a shell. Applying from a developer machine is an anti pattern at scale. The stronger pattern is to have the agent open a change with its plan attached, run policy checks, and apply through CI after review, which keeps a record and a second set of eyes on every change.

Add a policy layer in front of apply. The human approval gate is a strong first control, but scanners and policy engines hold even when no human is watching. Run Checkov or tfsec on the configuration and Open Policy Agent against the plan, so a public bucket, an open security group, or an over permissive IAM role fails before it is created.
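
As a much simpler stand-in for a real policy engine, the agent could read the machine-readable plan and flag anything that deletes or replaces a resource before the approval prompt appears. The sketch below assumes the JSON comes from `terraform show -json` run on a saved plan file and reads only the `resource_changes` list. The `destructiveChanges` helper and the sample document are illustrative, and this does not replace Checkov, tfsec, or Open Policy Agent.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// planJSON captures only the part of the plan document this check reads.
type planJSON struct {
	ResourceChanges []struct {
		Address string `json:"address"`
		Change  struct {
			Actions []string `json:"actions"`
		} `json:"change"`
	} `json:"resource_changes"`
}

// destructiveChanges returns the addresses of resources the plan would delete.
// A replacement shows up as a delete paired with a create, so it is flagged too.
func destructiveChanges(raw []byte) ([]string, error) {
	var p planJSON
	if err := json.Unmarshal(raw, &p); err != nil {
		return nil, fmt.Errorf("decode plan json: %w", err)
	}
	var flagged []string
	for _, rc := range p.ResourceChanges {
		for _, act := range rc.Change.Actions {
			if act == "delete" {
				flagged = append(flagged, rc.Address)
				break
			}
		}
	}
	return flagged, nil
}

func main() {
	// Hand-written sample, not output from a real run.
	sample := []byte(`{"resource_changes":[
		{"address":"aws_s3_bucket.logs","change":{"actions":["create"]}},
		{"address":"aws_security_group.web","change":{"actions":["delete"]}},
		{"address":"aws_instance.app","change":{"actions":["delete","create"]}}
	]}`)
	flagged, err := destructiveChanges(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, addr := range flagged {
		fmt.Println("destructive change:", addr)
	}
}
```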

Keep modules and blast radius small. Large modules make plans slow and changes risky, so split configuration into focused units and scope each agent run to one. Smaller surfaces mean faster plans and far less that can go wrong in a single apply.

Record everything. Every tool call, its arguments, the plan, the approval decision, and the apply result should land in an audit log, because the question after any incident is what changed and when.
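
A minimal way to start is one JSON line per tool call, appended to a file or shipped to a log collector. The sketch below is an assumption about shape, not a prescribed format: `auditEntry`, `auditLine`, and `appendAudit` are hypothetical names, and the fields simply mirror the items listed above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"os"
	"time"
)

// auditEntry is one record per tool call. Approved is a pointer so that
// tools without an approval step omit the field instead of logging false.
type auditEntry struct {
	Time     string          `json:"time"`
	Tool     string          `json:"tool"`
	Args     json.RawMessage `json:"args,omitempty"`
	Approved *bool           `json:"approved,omitempty"`
	Result   string          `json:"result"`
}

// auditLine encodes an entry as a single JSON line without a trailing newline.
func auditLine(e auditEntry) (string, error) {
	b, err := json.Marshal(e)
	if err != nil {
		return "", fmt.Errorf("encode audit entry: %w", err)
	}
	return string(b), nil
}

// appendAudit writes one entry followed by a newline to w.
func appendAudit(w io.Writer, e auditEntry) error {
	line, err := auditLine(e)
	if err != nil {
		return err
	}
	_, err = fmt.Fprintln(w, line)
	return err
}

func main() {
	approved := true
	entry := auditEntry{
		Time:     time.Now().UTC().Format(time.RFC3339),
		Tool:     "terraform_apply",
		Args:     json.RawMessage(`{"summary":"create one S3 bucket for application logs"}`),
		Approved: &approved,
		Result:   "(apply output would go here)",
	}
	// os.Stdout stands in for an append-only log file.
	if err := appendAudit(os.Stdout, entry); err != nil {
		fmt.Fprintln(os.Stderr, "audit:", err)
	}
}
```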

Build your own or reach for a framework

You do not have to hand write an agent loop, so it helps to know when this approach earns its keep.

Approach	Best when	Trade off
Hand written Go loop	You want full control, a single binary, and tight guardrails on what the agent may apply	You own the loop, retries, and tool plumbing yourself
Agent framework	You want memory, connectors, and orchestration provided and your stack favors that ecosystem	More dependencies and less visibility into how decisions are made
IaC automation platform	You run many provisioning workflows across teams and need state, policy, and drift management at scale	Heavier operational footprint and a longer path to a first result

For provisioning specifically, the hand written Go path is attractive because you can see and constrain every action, the plan is always in the loop, and the binary deploys anywhere. As your needs grow, you can adopt a platform for policy and orchestration without throwing away the guardrails you built here.

Conclusion and next steps

A provisioning agent is not magic. It is a loop that calls a model, a tool catalogue the model can choose from, an executor that runs the terraform CLI, and guardrails that make the plan and the approval mandatory. You built all four in Go, provisioned a real resource in a sandbox AWS account, and kept every apply behind your explicit yes.

From here, extend the agent along the same pattern. Add a tool that runs Checkov or tfsec and feed the findings back so the model can fix its own configuration before applying. Move state to a remote backend and bind least privilege credentials. Add structured audit logging, then teach the agent to manage modules and multiple environments. Each addition is another tool and another guardrail, never a rewrite, which is the quiet advantage of building the loop yourself.

FAQs

Q1: What do I need to know before building an AI agent that provisions Terraform?

You need three foundations. The first is basic Go: structs, functions, error handling, and the standard library packages for HTTP and JSON, all of which this tutorial uses directly. The second is working knowledge of Terraform and HCL, since the agent generates configuration and you need to read a plan to trust what it will do. The third is access to a model, and with a KodeKloud account you can generate a single KodeKey that reaches several models, so you do not need a separate provider account to follow along. You do not need machine learning experience, because the model is a service you call rather than something you train. If you want to firm up the Terraform side first, the Infrastructure as Code learning path sequences HCL, providers, state, and modules from the basics upward.

Q2: Where can I run this agent without a cloud account or API credits?

A browser based Terraform and AWS playground is the simplest option, because it ships Terraform pre installed and is wired to a sandbox AWS account, so you create real AWS resources with no account of your own and no cost. The Terraform and AWS playground is well suited to this tutorial, and if you would rather use a pure local emulator the Terraform and LocalStack playground works too with a LocalStack provider configuration. For the model, generate a key from your KodeKey keyspace, which is an account page rather than a playground, so it does not use up your one active playground slot. Install Go with the tarball commands shown above, write a small provider.tf, run terraform init, export the key as KODEKEY_API_KEY, and build the binary. KodeKey is meant for learning, so its monthly request allowance is modest, and each agent run spends one request per step, which is plenty for a few provisioning runs but not for heavy iteration. Confirm the playground allows outbound HTTPS, since the agent calls the model on every turn and terraform init downloads the provider.

Q3: Should I build my own agent in Go or use an existing agent framework?

It depends on what you value most. Build your own in Go when you want a single self contained binary and tight control over exactly what the agent may apply and how each apply gets approved. That control matters a great deal for provisioning, where an unreviewed change can replace or delete real resources. Reach for a framework when you want memory, connectors, and orchestration provided for you and your team already works in that ecosystem, accepting more dependencies and less visibility into how decisions are made in return. For most provisioning automation, the hand written Go approach starts simpler and keeps the plan and the approval in plain sight, and you can graduate to a platform later without discarding the guardrails you wrote.

Q4: How do I stop the agent from creating something dangerous or expensive?

Use layered controls rather than trusting the model alone. The first layer is the order of operations, since the agent always writes, validates, and plans before it can apply, which means you see the diff first. The second is the human approval gate, which pauses every apply and destroy for an explicit yes. The third is policy as code: run Checkov or tfsec on the configuration and Open Policy Agent against the plan, so an insecure or costly resource fails automatically regardless of what the agent generated. The fourth is least privilege credentials, so the agent's cloud identity can only create the narrow set of resources it is meant to manage. Together these turn a fast generator into a reviewed, bounded provisioning workflow.

Q5: Do I have to use one specific model provider for the agent?

No. The agent talks to an OpenAI compatible endpoint, and a single KodeKey reaches several models, so you can change the model id without touching the rest of the code. Moving to production is the same kind of change: point the base URL at a direct provider or your own gateway, set the production key, and pick a model, while the loop, the tools, and the approval gate stay identical. Treat the model as a swappable component behind one function rather than something baked into the agent. You can also run a local model by pointing the base URL at a compatible server, which keeps the request off the public internet entirely.

Q6: Does this work with OpenTofu and against my own AWS account, not just the sandbox?

Yes on both counts. OpenTofu is a drop in fork of Terraform that reads the same HCL and uses a compatible state format, so you can switch the agent by changing the binary it calls from terraform to tofu, with no other change to the tools. The sandbox already provisions real AWS resources, and to point at your own account instead you supply real credentials and a least privilege role rather than relying on the sandbox, while the same write, plan, approve, apply loop applies your changes. Practicing the switch is easy, since KodeKloud offers an OpenTofu playground alongside the Terraform and AWS playground, and the Infrastructure as Code learning path covers the provider and state concepts you will lean on as you scale up.