Day 1: Common Workflows and Data Safeguarding Practices
Distributional Impact of Policies. Fiscal Policy and Growth Department
2026-04-20
AI coding assistants are probabilistic — output quality depends on context quality.
A structured workflow narrows the AI’s degrees of freedom step by step, so it has enough context to generate correct, reproducible code.
Without structure:
With structure:
Note
Steps 3–4 involve raw data. Review Safeguarding Data before proceeding.
Always give AI direct links to sources
Always verify independently
Tip
The mapping table from Step 5 is the single most important artifact. Review every row before any code is written.
If sensitive or confidential data leaks to AI models, it can lead to data breaches and privacy violations bearing substantial legal and reputational risks.
It is our responsibility to:
Familiarize yourself with the World Bank’s AI Use Policy and Responsible AI Guidelines.
In Positron Assistant, each model provider has different policies on data usage, retention, and training. See: Positron Assistant Provider Privacy | Copilot Responsible Use | Copilot Chat Responsible Use
Normally, an AI assistant does not load raw data into its context, because it works with text such as prompts and code. However, it can propose actions that may expose data:
💬 User-written prompts that include sensitive information
🖨️ AI suggests code that prints raw data in the console
📂 AI instructs the IDE to read text data files (.csv) into context
🔍 AI proposes shell/command line commands that load data files into context (e.g. .csv, .dta)
In any of these cases, sensitive data becomes part of the context and there is always a risk of data leakage.
Every chat session is an isolated instance. Depending on the provider, data may not be used for training, but it may still be retained temporarily.
Yan et al. (2024). On Protecting the Data Privacy of Large Language Models (LLMs): A Survey. arXiv:2403.05156
Warning
Write instead:
Example prompt
"Using `/file.md` as the data dictionary for my household survey, identify
which variables may contain sensitive information.
Then write Stata code to load `data/raw/survey_2024.dta` that:
- Drops variables `a`, `b`, and `c` immediately after loading
- Never prints or displays these variables at any point"
Use system instructions to prevent AI from accessing sensitive data files:
Step 1 — Create instruction files:
.github/copilot-instructions.md (for Copilot)
claude.md (for Claude Code)
Step 2 — Add ignore rules:
Add to your system instructions (.github/copilot-instructions.md or claude.md):
System instruction
You are not allowed to print raw data in the console. You must produce code
that does not print raw data. It is allowed however to present aggregated data,
summary statistics, regression results and any other data that makes it
impossible to identify individual observations.
If in doubt that some code may print raw data, ask for permission before
proposing this code.
Code example that may show data and is not allowed:
```stata
use data/raw/survey_2024.dta, clear
list var1 var2 var3
```
Tip
Add this into the same system instruction files from Remedy 2. It will apply to every chat session in your project automatically.
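A system instruction is not a hard guarantee, so a local check on AI-proposed code before it runs can serve as a second line of defense. The sketch below is one possible way to do that, assuming a short and non-exhaustive list of Stata commands that display observation-level data. The function name `find_raw_display_lines` and the command list are hypothetical, and Stata command abbreviations (such as `br` for `browse`) are not covered.

```python
import re

# Hypothetical, non-exhaustive list of Stata commands that show
# observation-level data. Extend it to match your own project rules.
RAW_DISPLAY_COMMANDS = ("list", "browse", "edit")

# Match a flagged command as the first word on a line.
_PATTERN = re.compile(r"^\s*(" + "|".join(RAW_DISPLAY_COMMANDS) + r")\b")


def find_raw_display_lines(stata_code: str) -> list[tuple[int, str]]:
    """Return (line_number, line) pairs that start with a flagged command."""
    flagged = []
    for number, line in enumerate(stata_code.splitlines(), start=1):
        stripped = line.strip()
        # Skip full-line Stata comments.
        if stripped.startswith("*") or stripped.startswith("//"):
            continue
        if _PATTERN.match(line):
            flagged.append((number, stripped))
    return flagged


proposed = """use data/raw/survey_2024.dta, clear
list var1 var2 var3
summarize var1
"""
print(find_raw_display_lines(proposed))
```

Any flagged line is a prompt to stop and review the code by hand, not proof that the rest of the script is safe.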
The idea: generate lightweight metadata files describing each dataset. Keep raw data outside the working folder. Share only metadata with AI.
Process:
Write code (Stata/R/Python) that produces one .md file per data file, containing the data description and codebook information.
Execute it locally, saving the metadata into the project; raw data stays outside.
Tell AI: “Use only the metadata files to understand the data structure.”
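The first step of the process above could look roughly like the following Python sketch. It assumes the data has already been loaded locally into a list of dictionaries (one per observation). The function name `build_codebook_md`, the column layout of the table, and the sample rows are all hypothetical. Only variable-level summaries are written, never individual values.

```python
def build_codebook_md(dataset_name, rows, labels=None):
    """Build a Markdown codebook with variable-level summaries only."""
    labels = labels or {}

    # Collect variable names in first-seen order.
    variables = []
    for row in rows:
        for name in row:
            if name not in variables:
                variables.append(name)

    lines = [
        f"# Codebook: {dataset_name}",
        "",
        f"Observations: {len(rows)}",
        "",
        "| Variable | Label | Type | Non-missing | Distinct values |",
        "|---|---|---|---|---|",
    ]
    for name in variables:
        # Treat None as missing; no raw values are written to the output.
        values = [row.get(name) for row in rows if row.get(name) is not None]
        types = sorted({type(v).__name__ for v in values}) or ["unknown"]
        lines.append(
            f"| {name} | {labels.get(name, '')} | {'/'.join(types)} "
            f"| {len(values)} | {len(set(values))} |"
        )
    return "\n".join(lines) + "\n"


# Synthetic rows for illustration only (not real survey data).
rows = [
    {"hhid": 1, "region": "North", "income": 120.5},
    {"hhid": 2, "region": "South", "income": None},
    {"hhid": 3, "region": "North", "income": 98.0},
]
md = build_codebook_md("survey_2024", rows, labels={"hhid": "Household ID"})
print(md)
```

The returned string would then be saved as a .md file under data/metadata/ in the project, while the data file itself stays in the confidential folder.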
Resulting folder structure:
C:\WBG\ai\
├── my-project\                          ← AI works here
│   ├── .gitignore                       ← Excl. version
│   ├── .github/copilot-instructions.md  ← System instructions
│   ├── claude.md                        ← System instructions
│   ├── data/metadata/                   ← metadata `.md`
│   └── code/
└── confidential-data\                   ← AI cannot see
    └── survey_2024.dta
Caveat: If you allow AI to run code against external data, that code may print results to the console, and the console output then enters the AI's context. Control execution carefully.
When confidential microdata must be analyzed but cannot be shared in its original form, anonymization transforms the dataset so that re-identification is no longer reasonably possible.
Tools: R: sdcMicro, deident | Python: pycanon, Faker | Stata: manual or via R/Python.
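As a rough illustration of two common building blocks, the Python sketch below replaces an identifier with a keyed hash and generalizes exact ages into bands. This is pseudonymization plus generalization only; on its own it does not guarantee that re-identification is no longer reasonably possible, which is what dedicated tools such as sdcMicro help assess. The function names, the variable names (`hhid`, `name`, `age`), the band width, and the sample rows are all hypothetical.

```python
import hashlib
import hmac


def pseudonymize_id(value, secret_key: bytes) -> str:
    """Replace an identifier with a keyed hash (HMAC-SHA256), truncated."""
    digest = hmac.new(secret_key, str(value).encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def age_band(age: int, width: int = 10, top: int = 80) -> str:
    """Generalize an exact age into a band; top-code ages at or above `top`."""
    if age >= top:
        return f"{top}+"
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def deidentify(rows, secret_key: bytes, drop=("name",)):
    """Drop direct identifiers, pseudonymize the ID, and band the age."""
    cleaned = []
    for row in rows:
        record = {k: v for k, v in row.items() if k not in drop}
        record["hhid"] = pseudonymize_id(row["hhid"], secret_key)
        record["age"] = age_band(row["age"])
        cleaned.append(record)
    return cleaned


# Synthetic rows for illustration only (not real survey data).
rows = [
    {"hhid": 101, "name": "Respondent A", "age": 37, "region": "North"},
    {"hhid": 102, "name": "Respondent B", "age": 85, "region": "South"},
]
# The key must be kept secret and stored outside the AI-visible project.
result = deidentify(rows, secret_key=b"replace-with-a-secret-key")
print(result)
```

Using a keyed hash rather than a plain hash means the mapping cannot be rebuilt by someone who only knows the range of possible IDs.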
worldbank.github.io/ai4coding © 2026 World Bank.