# 6. Metadata Reviewer
`MetadataReviewerClient` uses a multi-agent LLM pipeline to scan dataset series metadata for quality issues: incorrect content, inconsistencies, typos, missing fields, and more. It returns a ranked list of detected issues with suggested corrections.
## 6.1. The Problem
Metadata quality assurance at scale is difficult. Manually reviewing hundreds of metadata records for typos, inconsistencies, incorrect values, and missing fields is time-consuming and inconsistent. Reviewers applying the same criteria across many records will drift in their judgments over time, and the process leaves no structured audit trail.
LLM-only approaches reduce the labor cost but do not by themselves address persistence, governance, or auditability. A multi-agent pipeline addresses these gaps by decomposing the review task into specialized, sequential agents—each with a defined role—so that detection, filtering, classification, and scoring are handled separately and consistently.
## 6.2. How It Works
Each metadata submission is processed by a five-agent pipeline:
| Agent | Role |
|---|---|
| `primary` | Initial scan for issues: typos, inconsistencies, missing or redundant fields |
| `secondary` | Independent re-scan to catch issues the primary agent missed |
| `critic` | Filters false positives by applying exclusion rules to the combined issue list |
| `categorizer` | Assigns one of six issue categories to each confirmed issue |
| `severity_scorer` | Assigns a severity score (1–5) to each confirmed issue |
The six issue categories assigned by the categorizer are:

- Typo / Language
- Formatting / Structure
- Missing / Redundant Information
- Inconsistency / Conflict
- Incorrect / Invalid Content
- Ambiguity / Unclear
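The hand-off between the five agents can be illustrated with plain functions standing in for the LLM-backed agents. Everything below (function names, detection rules, the sample metadata) is illustrative only, not the library's implementation:

```python
def primary_scan(metadata: dict) -> list[dict]:
    # Stand-in for the primary agent: flag empty fields as missing information.
    return [{"field": k, "detected_issue": "empty value"}
            for k, v in metadata.items() if not v]

def secondary_scan(metadata: dict) -> list[dict]:
    # Stand-in for the secondary agent: an independent re-scan with its own rule.
    issues = []
    if len(metadata.get("title", "")) < 3:
        issues.append({"field": "title", "detected_issue": "title too short"})
    return issues

def critic_filter(issues: list[dict]) -> list[dict]:
    # Stand-in for the critic: drop duplicate findings (a simple exclusion rule).
    seen, kept = set(), []
    for issue in issues:
        key = (issue["field"], issue["detected_issue"])
        if key not in seen:
            seen.add(key)
            kept.append(issue)
    return kept

def review(metadata: dict) -> list[dict]:
    # Sequential pipeline: scan twice, filter, then categorize and score.
    issues = critic_filter(primary_scan(metadata) + secondary_scan(metadata))
    for issue in issues:
        issue["issue_category"] = "Missing / Redundant Information"  # categorizer
        issue["issue_severity"] = 3                                  # severity_scorer
    return issues

sample = review({"title": "", "abstract": "A study of X"})
```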
## 6.3. Output Schema
Each detected issue is returned as a structured record:
| Field | Description |
|---|---|
| `detected_issue` | Description of the problem identified |
| `issue_category` | One of the six category labels above |
| `issue_severity` | Integer 1–5 (see scale below) |
|  | The problematic field and its current value |
| `suggested_metadata` | Proposed correction for the field |
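Putting the schema together, a single returned issue might look like the record below. The field names match those used in the Quick Start example; the values, and the shape of `suggested_metadata`, are hypothetical:

```python
# A hypothetical issue record matching the schema above.
issue = {
    "detected_issue": "Country name 'Untied States' appears to be a typo",
    "issue_category": "Typo / Language",
    "issue_severity": 2,
    "suggested_metadata": {"country": "United States"},  # illustrative shape
}
```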
Severity scale:
| Score | Label | Meaning |
|---|---|---|
| 1 | Trivial | Minor cosmetic issue; no practical impact |
| 2 | Low | Small error unlikely to cause confusion |
| 3 | Moderate | Noticeable issue that may mislead users |
| 4 | High | Significant error that affects usability or trust |
| 5 | Critical | Incorrect or missing information that renders the metadata unreliable |
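Because issues come back as a ranked list with `issue_severity` attached, downstream triage is a simple sort and filter. A minimal sketch with hypothetical issue values:

```python
issues = [
    {"detected_issue": "typo in title", "issue_severity": 1},
    {"detected_issue": "contradictory date fields", "issue_severity": 4},
    {"detected_issue": "missing license", "issue_severity": 5},
]

# Rank most severe first, then keep only High/Critical findings (4-5).
ranked = sorted(issues, key=lambda i: i["issue_severity"], reverse=True)
actionable = [i for i in ranked if i["issue_severity"] >= 4]
```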
## 6.4. LLM-Agnostic Design
The metadata reviewer is designed to be independent of any specific LLM provider. The `MetadataReviewerCore` holds a `model_client` — any object implementing AutoGen’s `ChatCompletionClient` protocol — and passes it directly to each agent in the pipeline. No provider-specific logic lives inside the reviewer itself.
This means the same five-agent pipeline runs identically regardless of whether the underlying model is a hosted API, an Azure deployment, or a locally served model. Swapping providers requires only changing how the client is constructed.
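Provider-independence of this kind is structural (duck) typing: any object exposing the right methods can serve as the model client. A minimal sketch using `typing.Protocol`; the `CompletionClient` protocol and its `create` method are illustrative stand-ins, not AutoGen's actual `ChatCompletionClient` signature:

```python
from typing import Protocol

class CompletionClient(Protocol):
    """Illustrative stand-in for a chat-completion protocol."""
    def create(self, messages: list[str]) -> str: ...

class FakeClient:
    """Conforms to the protocol without inheriting from anything."""
    def create(self, messages: list[str]) -> str:
        return "ok"

def run_agent(client: CompletionClient, prompt: str) -> str:
    # An agent only calls protocol methods, so any conforming client works.
    return client.create([prompt])

reply = run_agent(FakeClient(), "scan this record")
```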
### 6.4.1. Factory Classmethods
The four built-in factory classmethods cover the most common providers. Each uses a lazy import: the provider package is only imported when that classmethod is called, so installing only `autogen-ext[openai]` will not cause import errors as long as `from_anthropic` is never used.
| Provider | Classmethod | Key Parameters |
|---|---|---|
| OpenAI | `from_openai` | `model`, `api_key` |
| Azure OpenAI | `from_azure` | `model`, `azure_endpoint`, `azure_deployment`, `api_version`, `azure_ad_token_provider` |
| Anthropic Claude | `from_anthropic` | `model`, `api_key` |
| Ollama (local) | `from_ollama` | `model`, `port` |
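The lazy-import pattern the factories rely on can be sketched as follows. `ReviewerClient` and `from_json_demo` are hypothetical names, with the standard-library `json` module standing in for a provider SDK:

```python
class ReviewerClient:
    """Hypothetical client illustrating lazy imports in factory classmethods."""

    def __init__(self, model_client):
        self.model_client = model_client

    @classmethod
    def from_json_demo(cls, payload: str) -> "ReviewerClient":
        # The "provider" package is imported only when this factory runs,
        # so callers who never use it need not have it installed.
        import json
        return cls(model_client=json.loads(payload))

demo = ReviewerClient.from_json_demo('{"model": "demo"}')
```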
## 6.5. Installation
Install with the extras for your chosen LLM provider:
```bash
pip install ai4data[metadata-reviewer,openai]     # OpenAI
pip install ai4data[metadata-reviewer,anthropic]  # Anthropic
pip install ai4data[metadata-reviewer,azure]      # Azure OpenAI
pip install ai4data[metadata-reviewer,ollama]     # Ollama (local)
```
Requirements: Python >= 3.11, `autogen-agentchat`, and credentials for your chosen provider.
## 6.6. Quick Start
**OpenAI**

```python
from ai4data.metadata.reviewer import MetadataReviewerClient

client = MetadataReviewerClient.from_openai(model="gpt-4o", api_key="sk-...")
```
**Anthropic**

```python
client = MetadataReviewerClient.from_anthropic(model="claude-sonnet-4-6", api_key="sk-ant-...")
```
**Azure OpenAI**

```python
from azure.identity import DefaultAzureCredential, get_bearer_token_provider

token_provider = get_bearer_token_provider(
    DefaultAzureCredential(), "https://cognitiveservices.azure.com/.default"
)

client = MetadataReviewerClient.from_azure(
    model="gpt-4o",
    azure_endpoint="https://<resource>.openai.azure.com/",
    azure_deployment="<deployment>",
    api_version="2024-02-01",
    azure_ad_token_provider=token_provider,
)
```
**Ollama (local)**

```python
client = MetadataReviewerClient.from_ollama(model="llama3.2", port=11434)
```
**Submit and retrieve results**

```python
# Submit metadata for review — returns immediately
job = client.submit(metadata_dict)

# Block until complete
result = job.wait_sync(timeout=300)

# Each item is a detected issue
for issue in result:
    print(f"[{issue['issue_severity']}/5] {issue['detected_issue']}")
    print(f"  Category: {issue['issue_category']}")
    print(f"  Suggested fix: {issue['suggested_metadata']}")
```
`client.submit()` returns a `Job` handle immediately; the pipeline runs in a background thread. Use `job.wait_sync(timeout)` to block, or `await job.wait(timeout)` in an async context.
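The submit-then-wait pattern can be mimicked with a toy handle built on `threading`. This is a conceptual sketch of how such a handle behaves, not the library's actual `Job` implementation:

```python
import threading

class ToyJob:
    """Toy sketch of a background-job handle (illustrative only)."""

    def __init__(self, fn):
        self._result = None
        self._done = threading.Event()
        # Work starts immediately in a background thread, so the
        # constructor (like submit()) returns without blocking.
        threading.Thread(target=self._run, args=(fn,), daemon=True).start()

    def _run(self, fn):
        self._result = fn()  # the real pipeline would run the five agents here
        self._done.set()

    def wait_sync(self, timeout=None):
        # Block the caller until the background work finishes.
        if not self._done.wait(timeout):
            raise TimeoutError("review did not finish in time")
        return self._result

job = ToyJob(lambda: [{"detected_issue": "demo", "issue_severity": 1}])
result = job.wait_sync(timeout=5)
```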
### 6.6.1. Bringing Your Own Client
For full control over model client configuration — or for any AutoGen-compatible provider not covered by the factory classmethods — construct the client yourself and pass it directly to the constructor:
```python
from ai4data.metadata.reviewer import MetadataReviewerClient

client = MetadataReviewerClient(model_client=your_model_client)
```
Any object that conforms to AutoGen’s `ChatCompletionClient` interface works. For example, using `AzureOpenAIChatCompletionClient` with a static token instead of a token provider:
```python
from autogen_ext.models.openai import AzureOpenAIChatCompletionClient
from ai4data.metadata.reviewer import MetadataReviewerClient

model_client = AzureOpenAIChatCompletionClient(
    model="gpt-4o",
    azure_endpoint="https://<resource>.openai.azure.com/",
    azure_deployment="<deployment>",
    api_version="2024-02-01",
    azure_ad_token="<static-token>",
)

client = MetadataReviewerClient(model_client=model_client)
```