Provide Context

querychat automatically gathers schema information about your tables — column names, types, numerical ranges, and categorical values — and makes it available to the LLM on demand via the querychat_get_schema tool. The LLM calls this tool before writing SQL to understand the structure of the tables it’s querying.

Importantly, we are not sending your raw data to the LLM and asking it to do complicated math. The LLM only needs to understand the structure and schema of your data in order to write SQL queries.

You can get even better results by providing additional context:

  1. Add a data dictionary to describe tables, columns, relationships, and domain terminology (recommended)
  2. Add a data description for a simpler alternative when working with a single, straightforward table
  3. Add custom instructions to guide the LLM’s behavior
  4. Use a fully custom prompt template if you want complete control
library(querychat)
library(palmerpenguins)

Data dictionary

A data dictionary is a YAML file that describes your tables, columns, relationships, and domain-specific terminology. It’s the recommended way to provide context, especially when working with multiple tables or when your data has domain-specific meaning that isn’t obvious from column names alone.

qc <- QueryChat$new(
  orders, "orders",
  data_dict = "data-dict.yaml"
)
qc$add_table(customers, "customers")

Format

A data dictionary has three top-level sections: tables, relationships, and glossary.

# data-dict.yaml
version: "0.2.0"

tables:
  orders:
    description: One row per customer order.
    columns:
      - name: order_id
        type: number(id)
        constraints: [primary_key]
        description: Unique order identifier.
      - name: customer_id
        type: number(id)
        constraints: [foreign_key]
        description: References customers.id.
      - name: total
        type: number(quantity)
        description: Order total in USD.
      - name: status
        type: enum
        values: [pending, shipped, delivered, cancelled]
        description: Current order status.

  customers:
    description: One row per customer.
    columns:
      - name: id
        type: number(id)
        constraints: [primary_key]
        description: Unique customer identifier.
      - name: name
        type: string
        description: Full name.
      - name: region
        type: string
        description: Geographic sales region.

relationships:
  - description: Each order belongs to one customer.
    cardinality: many-to-one
    join: orders.customer_id = customers.id

glossary:
  AOV: Average order value — total revenue divided by number of orders.
  churn: A customer who has not placed an order in the last 90 days.

Tables

Each entry under tables describes one table. The key must match the table name you pass to QueryChat$new() or $add_table().

  • description: What this table represents (one sentence is usually enough).
  • columns: A list of column annotations. Each column can have:
    • name: Column name (must match the actual column)
    • type: Semantic type hint — string, number, number(id), number(quantity), date, enum
    • constraints: Optional list — primary_key, foreign_key
    • description: What this column means in plain English
    • values: For enum columns, the list of possible values

Columns listed in the data dictionary are excluded from the auto-generated schema (since your description supersedes the auto-detected metadata). Columns not listed are still auto-detected as usual.

Relationships

The relationships section tells the LLM how to join tables. Each entry has:

  • description: A plain-English description of the relationship
  • cardinality: one-to-one, one-to-many, or many-to-one
  • join: The join condition (e.g., orders.customer_id = customers.id)

Glossary

The glossary section defines domain-specific terms that users might use in their questions. This helps the LLM translate business language into correct SQL.

Data description

For simple single-table use cases where a full data dictionary would be overkill, you can provide a data description — a free-form markdown file or string that describes what the data represents. Use the data_description parameter:

qc <- QueryChat$new(
  penguins,
  data_description = "data_description.md"
)

querychat doesn’t need this in any particular format — just provide what a human would find helpful:

<!-- data_description.md -->

This dataset contains information about Palmer Archipelago penguins.

- species: Penguin species (Adelie, Chinstrap, Gentoo)
- island: Island where observed (Torgersen, Biscoe, Dream)
- bill_length_mm: Bill length in millimeters
- body_mass_g: Body mass in grams

Additional instructions

You can add custom instructions to guide the LLM’s behavior using the extra_instructions parameter:

qc <- QueryChat$new(
  penguins,
  extra_instructions = "instructions.md"
)

Or as a string:

instructions <- "
- Use British spelling conventions
- Stay on topic and only discuss the data
- Refuse to answer unrelated questions
"

qc <- QueryChat$new(
  penguins,
  extra_instructions = instructions
)

LLMs may not always follow your instructions perfectly. Test extensively when changing instructions or models.

Default prompt

For full visibility into the system prompt that querychat generates for the LLM, you can inspect the system_prompt field. This is useful for debugging and understanding exactly what context the LLM is working with:

qc <- QueryChat$new(penguins)
cat(qc$system_prompt)

Custom template

If you want more control over the system prompt, you can provide a custom prompt template using the prompt_template parameter. This is for more advanced users who want to fully customize the LLM’s behavior. See the QueryChat reference for details on the available template variables.