AI 4 Coding

Day 1: Common Workflows and Data Safeguarding Practices

Distributional Impact of Policies. Fiscal Policy and Growth Department

2026-04-20

Why structure your AI workflow?

AI coding assistants are probabilistic — output quality depends on context quality.

A structured workflow narrows the AI’s degrees of freedom step by step, so it has enough context to generate correct, reproducible code.

Without structure:

  • Vague prompts → generic, unusable output
  • Silent errors in data transformations
  • Hard to reproduce or audit

With structure:

  • Each step builds on the previous
  • Misunderstandings caught early
  • Reproducible, documented results

Eight-step workflow

  1. Set objective & constraints — state task, target spec, and language up front
  2. Define inputs & outputs — unit of observation, file format, ID conventions, parameterized paths
  3. Create a data dictionary — metadata from raw files; AI never sees actual microdata
  4. Gather external metadata — codebooks, catalogs, survey documentation
  5. Verify understanding — mapping table: target variable → source → transformation → open questions
  6. Develop code iteratively — master script first, then one module at a time; test and iterate
  7. Independent verification — fresh AI session audits code and summary statistics
  8. Document — objective, decisions, limitations, reproducibility checklist

Note

Steps 3–4 involve raw data. Review Safeguarding Data before proceeding.

References and verification are non-negotiable

Always give AI direct links to sources

  • AI can fetch live documentation (codebooks, catalogs, DIME Wiki pages) if you provide URLs
  • Without references, AI invents plausible-sounding but wrong variable definitions
  • Paste the relevant section if web access is unavailable

Always verify independently

  • Step 7 must run in a separate AI session — the original agent is biased toward confirming its own work
  • Provide concrete summary statistics to the auditor, not just code
  • Resolve every issue raised before finalizing

Tip

The mapping table from Step 5 is the single most important artifact. Review every row before any code is written.
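
A minimal sketch of how the Step 5 mapping table could be kept as a reviewable structure, assuming Python. The `MappingRow` class, the `unresolved` helper, and the variable names (`hhid`, `s1q4`) are hypothetical and only illustrate the four columns named in Step 5:

```python
from dataclasses import dataclass, field


@dataclass
class MappingRow:
    target: str           # variable in the target specification
    source: str           # variable(s) in the raw data, per the data dictionary
    transformation: str   # plain-language rule the code must implement
    open_questions: list = field(default_factory=list)


def unresolved(rows):
    """Return the target variables that still have open questions."""
    return [row.target for row in rows if row.open_questions]


# Hypothetical rows, for illustration only.
mapping = [
    MappingRow("hh_id", "hhid", "rename; keep as string"),
    MappingRow("age_group", "s1q4", "cut into 5-year bands",
               ["Is age recorded in completed years?"]),
]

print(unresolved(mapping))  # ['age_group']
```

Coding would start only once `unresolved(mapping)` returns an empty list.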

Data safeguarding practices while using AI

If sensitive or confidential data leaks to AI models, the result can be a data breach or privacy violation with substantial legal and reputational consequences.

It is our responsibility to:

  1. Control what data is shared with the AI.
  2. Review and curate the context AI uses.
  3. Properly anonymize, aggregate, or simulate data before sharing with AI.
  4. Follow organizational policies and legal requirements for responsible AI use.

Familiarize yourself with the World Bank’s AI Use Policy and Responsible AI Guidelines.

In Positron Assistant, each model provider has different policies on data usage, retention, and training. See: Positron Assistant Provider Privacy | Copilot Responsible Use | Copilot Chat Responsible Use

Key data exposure channels

An AI assistant works on text, so it does not normally load raw data files into its context. Data can still enter the context through these channels:

  1. 💬 User-written prompts that include sensitive information

  2. 🖨️ AI suggests code that prints raw data in the console

  3. 📂 AI instructs the IDE to read text data files (.csv) into context

  4. 🔍 AI proposes shell/command line commands that load data files into context (e.g. .csv, .dta)

In any of these cases, sensitive data becomes part of the context and there is always a risk of data leakage.

Every chat session is an isolated instance. Even when a provider does not use your data for training, it may still retain the data temporarily.

Remedy 1: Be careful about what you write in prompts

Warning

  • Never share passwords, personally identifiable information (PII), or any other sensitive data in prompts.
  • Never upload data to AI.

Write instead:

Example prompt
"Using `/file.md` as the data dictionary for my household survey, identify
which variables may contain sensitive information.

Then write Stata code to load `data/raw/survey_2024.dta` that:
- Drops variables `a`, `b`, and `c` immediately after loading
- Never prints or displays these variables at any point"
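
As an extra safeguard, a draft prompt can be screened locally before it is sent. The sketch below is illustrative only: the patterns and the `flag_prompt` name are assumptions, and an empty result does not prove that a prompt is safe.

```python
import re

# Illustrative patterns only; a real project would tune these to its own data.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "long_number": re.compile(r"\b\d{9,}\b"),  # ID- or phone-like digit runs
    "secret": re.compile(
        r"\b(password|passwd|api[_-]?key|token)\s*[:=]\s*\S+", re.IGNORECASE
    ),
}


def flag_prompt(text):
    """Return the sorted names of the patterns found in a draft prompt."""
    return sorted(name for name, rx in PATTERNS.items() if rx.search(text))


print(flag_prompt("Drop variables a, b, and c after loading data/raw/survey_2024.dta"))
# []
print(flag_prompt("my password = hunter2, contact jane@example.org"))
# ['email', 'secret']
```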

Remedy 2: Instruct AI to NOT read specific files

Use system instructions to prevent AI from accessing sensitive data files:

Step 1 — Create instruction files:

  • .github/copilot-instructions.md (for Copilot)
  • CLAUDE.md (for Claude Code)

Step 2 — Add ignore rules:

.github/copilot-instructions.md
NEVER read, index, or propose shell commands
that would reveal the contents of:

- `data/raw/`
- `data/raw/survey_2024.dta`

If reading is absolutely necessary:
1. Ask for permission first.
2. Explain why you need it.
3. Wait for confirmation.

Step 3 — Add .gitignore:

.gitignore
# Raw and confidential data
data/raw/
data/raw/survey_2024.dta

Caveat

These instructions are not a hard barrier. AI can lose track of them, especially in long sessions, so repeat them in your prompts when working on sensitive projects.
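
Because instructions can be forgotten, a proposed shell command can also be checked by hand or by a small local script before it is approved. The sketch below is a naive illustration, assuming the protected folder is `data/raw` as in the rules above; the `touches_protected` helper is hypothetical and does not handle globs, `..` segments, or absolute paths.

```python
import shlex
from pathlib import PurePosixPath

PROTECTED = [PurePosixPath("data/raw")]  # mirrors the ignore rules above


def touches_protected(command, protected=PROTECTED):
    """True if any argument of a proposed shell command lies in a protected folder."""
    for token in shlex.split(command):
        path = PurePosixPath(token)
        if any(path == folder or folder in path.parents for folder in protected):
            return True
    return False


print(touches_protected("head -n 5 data/raw/survey_2024.csv"))  # True
print(touches_protected("ls data/metadata"))                    # False
```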

Remedy 3: Instruct AI to NOT print raw data

Add to your system instructions (.github/copilot-instructions.md or CLAUDE.md):

System instruction
You are not allowed to print raw data in the console. You must produce code
that does not print raw data. It is allowed however to present aggregated data,
summary statistics, regression results and any other data that makes it
impossible to identify individual observations.
If in doubt that some code may print raw data, ask for permission before
proposing this code.

Example of code that is not allowed because it displays raw data:

```stata
use data/raw/survey_2024.dta, clear
list var1 var2 var3
```

Tip

Add this to the same system instruction files used in Remedy 2. It then applies automatically to every chat session in your project.
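
For contrast, a sketch of output that stays on the allowed side of this rule: aggregates only, with small groups suppressed. This assumes Python; the `safe_summary` name and the threshold of 5 are assumptions, and the right threshold depends on your organization's disclosure rules.

```python
import statistics

MIN_CELL = 5  # assumed disclosure threshold; set per your organization's rules


def safe_summary(values, min_cell=MIN_CELL):
    """Return count, mean, and standard deviation, or suppress small groups."""
    clean = [v for v in values if v is not None]
    if len(clean) < max(min_cell, 2):  # stdev needs at least two observations
        return {"n": len(clean), "note": "suppressed: too few observations"}
    return {
        "n": len(clean),
        "mean": round(statistics.mean(clean), 2),
        "sd": round(statistics.stdev(clean), 2),
    }


print(safe_summary([10, 20, 30, 40, 50]))  # {'n': 5, 'mean': 30, 'sd': 15.81}
print(safe_summary([10, 20, None]))        # suppressed, only n is reported
```

Minimum and maximum are left out on purpose, since each one is the value of a single observation.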

Remedy 4: Use metadata, move data outside the project

The idea: generate lightweight metadata files describing each dataset. Keep raw data outside the working folder. Share only metadata with AI.

Process:

  1. Write code (Stata/R/Python) that produces one .md file per data file containing data description and codebook information.

  2. Run it locally and save the metadata inside the project; the raw data stays outside.

  3. Tell AI: “Use only the metadata files to understand the data structure.”
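
A minimal sketch of steps 1 and 2, assuming Python and a CSV input so that only the standard library is needed (for `.dta` files one would typically read the data with a package such as pandas instead). The function names and the Markdown layout are illustrative assumptions. The output records variable names, inferred types, and missing counts, and never copies a cell value.

```python
import csv
from pathlib import Path


def describe_csv(data_path):
    """Build a Markdown codebook for one CSV file without copying any cell values."""
    with open(data_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        columns = reader.fieldnames or []
        n_rows = 0
        missing = {c: 0 for c in columns}
        numeric = {c: True for c in columns}
        for row in reader:
            n_rows += 1
            for c in columns:
                value = (row[c] or "").strip()
                if value == "":
                    missing[c] += 1
                else:
                    try:
                        float(value)
                    except ValueError:
                        numeric[c] = False
    lines = [f"# {Path(data_path).name}", "", f"Rows: {n_rows}", "",
             "| variable | type | missing |", "|---|---|---|"]
    for c in columns:
        if missing[c] == n_rows:
            kind = "empty"
        else:
            kind = "numeric" if numeric[c] else "string"
        lines.append(f"| {c} | {kind} | {missing[c]} |")
    return "\n".join(lines) + "\n"


def write_metadata(data_path, metadata_dir):
    """Write <name>.md into the project's metadata folder; raw data stays where it is."""
    out = Path(metadata_dir) / (Path(data_path).stem + ".md")
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(describe_csv(data_path), encoding="utf-8")
    return out
```

Review the variable names themselves before sharing the file, since a name can occasionally be sensitive too.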

Resulting folder structure:

C:\WBG\ai\
├── my-project\                           ← AI works here
│   ├── .gitignore                        ← excludes files from version control
│   ├── .github/copilot-instructions.md   ← system instructions
│   ├── CLAUDE.md                         ← system instructions
│   ├── data/metadata/                    ← metadata `.md`
│   └── code/
└── confidential-data\                    ← AI cannot see
    └── survey_2024.dta
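
One way to keep the confidential folder out of the project's code is to pass its location as a parameter rather than hard-coding it. The sketch below assumes Python; the environment variable name `CONFIDENTIAL_DATA_DIR` and the `raw_data_dir` helper are hypothetical choices made for this illustration.

```python
import os
from pathlib import Path


def raw_data_dir(project_root):
    """Locate the confidential folder via an environment variable, not a hard-coded path."""
    # CONFIDENTIAL_DATA_DIR is a hypothetical variable name chosen for this sketch.
    location = os.environ.get("CONFIDENTIAL_DATA_DIR")
    if not location:
        raise RuntimeError("Set CONFIDENTIAL_DATA_DIR to the folder holding raw data.")
    data_dir = Path(location).resolve()
    root = Path(project_root).resolve()
    # Refuse to continue if the data folder sits inside the project the AI can see.
    if data_dir == root or root in data_dir.parents:
        raise ValueError("Raw data must live outside the project folder.")
    return data_dir
```

With this layout the scripts inside `my-project` contain no path to the confidential folder, so neither the repository nor the AI context reveals where the data lives.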

Caveat: If you allow AI to run code against the external data, anything that code prints to the console re-enters the context. Review code before it runs.

Remedy 5: Data anonymization

When confidential microdata must be analyzed but cannot be shared in its original form, anonymization transforms the dataset so that re-identification is no longer reasonably possible.

  • Suppression — remove direct identifiers (names, SSNs)
  • Generalization — reduce granularity (age → age group)
  • Top/bottom coding — cap extreme values (income outliers)
  • Noise addition — add random noise to continuous quasi-identifiers
  • Shuffling — randomly reorder values to break correlations
  • Pseudonymization — replace with codes/hashes (names, IDs, emails)
  • Statistical privacy models: k-anonymity (indistinguishable groups), l-diversity (diverse sensitive values), t-closeness (distribution similarity).
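
Three of the techniques listed above (generalization, top coding, pseudonymization) can be sketched in a few lines, assuming Python's standard library. The function names, the 5-year band width, and the 16-character code length are illustrative assumptions; for real releases, use a dedicated tool and a disclosure-risk assessment.

```python
import hashlib
import hmac


def generalize_age(age, width=5):
    """Generalization: replace an exact age with a band such as '30-34'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def top_code(value, cap):
    """Top coding: cap extreme values at a chosen ceiling."""
    return min(value, cap)


def pseudonymize(identifier, secret_key):
    """Pseudonymization: keyed hash, so codes cannot be rebuilt without the key."""
    digest = hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


print(generalize_age(34))           # 30-34
print(top_code(1_000_000, 250_000)) # 250000
```

The key passed to `pseudonymize` must be stored outside the project; a plain unkeyed hash of a name or ID can often be reversed by hashing a list of candidate values.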

Tools: R: sdcMicro, deident | Python: pycanon, Faker | Stata: manual or via R/Python.
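
To make the k-anonymity idea concrete: k is the size of the smallest group of records that share the same quasi-identifier values. A minimal sketch, assuming Python and a hypothetical list of records; the tools above implement this and the stricter models properly.

```python
from collections import Counter


def k_anonymity(records, quasi_identifiers):
    """Return k: the size of the smallest group sharing the same quasi-identifier values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values()) if groups else 0


# Hypothetical records, for illustration only.
records = [
    {"age_group": "30-34", "region": "North", "income": 100},
    {"age_group": "30-34", "region": "North", "income": 120},
    {"age_group": "35-39", "region": "South", "income": 90},
    {"age_group": "35-39", "region": "South", "income": 95},
]

print(k_anonymity(records, ["age_group", "region"]))            # 2
print(k_anonymity(records, ["age_group", "region", "income"]))  # 1
```

Adding the exact income as a quasi-identifier drops k to 1, which is why continuous variables are usually generalized or top-coded first.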