LLMQA

Bases: QAIngredient

from_args(model=None, few_shot_examples=None, context_formatter=lambda df: df.to_markdown(index=False), list_options_in_prompt=True, k=None) classmethod

Creates a partial class with predefined arguments.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| few_shot_examples | Optional[List[dict]] | A list of Example dictionaries for few-shot learning. If not specified, will use default_examples.json as the default. | None |
| context_formatter | Callable[[DataFrame], str] | A callable that formats a pandas DataFrame into a string. Defaults to a lambda that converts the DataFrame to markdown without the index. | lambda df: df.to_markdown(index=False) |
| k | Optional[int] | Number of few-shot examples to use for each ingredient call. Default is None, which will use all few-shot examples on all calls. If specified, will initialize a haystack-based DPR retriever to filter examples. | None |

Returns:

| Type | Description |
| --- | --- |
| Type[QAIngredient] | A partial class of QAIngredient with predefined arguments. |
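The "partial class" idea can be sketched generically. The class and method below are illustrative stand-ins, not BlendSQL internals: a new subclass is created whose calls are pre-bound to the keyword arguments passed to `from_args`.

```python
from typing import Any, Dict, Type

class Ingredient:
    """Illustrative stand-in for a BlendSQL ingredient base class."""

    # Arguments pre-bound by from_args(); empty on the base class.
    default_kwargs: Dict[str, Any] = {}

    @classmethod
    def from_args(cls, **kwargs: Any) -> Type["Ingredient"]:
        # Build a new subclass whose calls are pre-bound to `kwargs`,
        # mirroring the "partial class" behavior described above.
        return type(f"Partial{cls.__name__}", (cls,), {"default_kwargs": kwargs})

    def run(self, **call_kwargs: Any) -> Dict[str, Any]:
        # Call-time arguments override the predefined ones.
        return {**self.default_kwargs, **call_kwargs}

PartialQA = Ingredient.from_args(k=2, list_options_in_prompt=True)
print(PartialQA().run(question="Which weighs the most?"))
```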

Examples:

from blendsql import blend, LLMQA
from blendsql.ingredients.builtin import DEFAULT_QA_FEW_SHOT

ingredients = {
    LLMQA.from_args(
        few_shot_examples=[
            *DEFAULT_QA_FEW_SHOT,
            {
                "question": "Which weighs the most?",
                "context": {
                    "Animal": ["Dog", "Gorilla", "Hamster"],
                    "Weight": ["20 pounds", "350 lbs", "100 grams"]
                },
                "answer": "Gorilla",
                # Below are optional
                "options": ["Dog", "Gorilla", "Hamster"]
            }
        ],
        # Will fetch `k` most relevant few-shot examples using embedding-based retriever
        k=2,
        # Lambda to turn the pd.DataFrame to a serialized string
        context_formatter=lambda df: df.to_markdown(
            index=False
        )
    )
}
smoothie = blend(
    query=blendsql,
    db=db,
    ingredients=ingredients,
    default_model=model,
)

run(model, question, context_formatter, few_shot_retriever=None, options=None, list_options_in_prompt=None, modifier=None, output_type=None, context=None, value_limit=None, long_answer=False, **kwargs)

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| question | str | The question to map onto the values. Will also be the new column name. | required |
| context | Optional[DataFrame] | Table subset to use as context in answering question. | None |
| model | Model | The Model (blender) we will make calls to. | required |
| context_formatter | Callable[[DataFrame], str] | Callable defining how we want to serialize table context. | required |
| few_shot_retriever | Callable[[str], List[AnnotatedQAExample]] | Callable which takes a string and returns the n most similar few-shot examples. | None |
| options | Optional[Collection[str]] | Optional collection with which we try to constrain generation. | None |
| list_options_in_prompt | bool | Defines whether we include options in the prompt for the current inference example. | None |
| modifier | ModifierType | If we expect an array of scalars, this defines the regex we want to apply. Used directly for constrained decoding at inference time if we have a guidance model. | None |
| output_type | Optional[Union[DataType, str]] | In the absence of example_outputs, gives the Model some signal as to what we expect as output. | None |
| regex | | Optional regex to constrain answer generation. | required |
| value_limit | Optional[int] | Optional limit on how many rows from context we use. | None |
| long_answer | bool | If true, we more closely mimic long-form end-to-end question answering. If false, we just give the answer with no explanation or context. | False |

Returns:

| Type | Description |
| --- | --- |
| Union[str, int, float, tuple] | The response from the model. The response will only be a tuple if modifier is not None. |

Description

Sometimes, simply selecting data from a given database is not enough to sufficiently answer a user's question.

The QAIngredient is designed to return data of variable types, and is best used in cases where we need:

1) Unstructured, free-text responses ("Give me a summary of all my spending on coffee")
2) Complex, unintuitive relationships extracted from table subsets ("How many consecutive days did I spend on coffee?")
3) Multi-hop reasoning from unstructured data, grounded in a structured schema (using the options arg)

Formally, this is an aggregate function which transforms a table subset into a single value.
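As a loose analogy, a conventional SQL aggregate such as COUNT(*) also maps a table subset to a scalar; here the reduction is performed by an LLM rather than a fixed function. A plain-pandas sketch of the "subset in, single value out" shape (the helper name is made up for illustration):

```python
import pandas as pd

def qa_aggregate(df: pd.DataFrame, question: str):
    # Stand-in for the LLM call: DataFrame subset in, single value out.
    # A real QAIngredient would prompt a model with the serialized table.
    if question == "How many rows are in this subset?":
        return len(df)
    raise NotImplementedError(question)

subset = pd.DataFrame({
    "Symbol": ["HBAN", "AIG", "AIG", "NTRS", "HBAN"],
    "Sector": ["Financials"] * 5,
})
print(qa_aggregate(subset, "How many rows are in this subset?"))  # prints 5
```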

The following query demonstrates usage of the builtin LLMQA ingredient.

{{
    LLMQA(
        'How many consecutive days did I buy stocks in Financials?', 
        (
            SELECT account_history."Run Date", account_history.Symbol, constituents."Sector"
              FROM account_history
              LEFT JOIN constituents ON account_history.Symbol = constituents.Symbol
              WHERE Sector = "Financials"
              ORDER BY "Run Date" LIMIT 5
        )
    )
}} 
This is slightly more complicated than the rest of the ingredients.

Behind the scenes, we wrap the call to LLMQA in a SELECT clause, ensuring that the ingredient's output gets returned.

SELECT {{QAIngredient}}
The LLM gets both the question asked, alongside the subset of the SQL database fetched by our subquery.

| Run Date | Symbol | Sector |
| --- | --- | --- |
| 2022-01-14 | HBAN | Financials |
| 2022-01-20 | AIG | Financials |
| 2022-01-24 | AIG | Financials |
| 2022-01-24 | NTRS | Financials |
| 2022-01-25 | HBAN | Financials |

From examining this table, we see that we bought stocks in the Financials sector on 2 consecutive days (2022-01-24 and 2022-01-25). The LLM answers the question in an end-to-end manner, returning the result 2.

The QAIngredient can be used as a standalone end-to-end QA tool, or as a component within a larger BlendSQL query.

For example, the BlendSQL query below translates to the valid (but rather confusing) question:

"Show me stocks in my portfolio, whose price is greater than the number of consecutive days I bought Financial stocks multiplied by 10. Only display those companies which offer a media streaming service."

 SELECT Symbol, "Last Price" FROM portfolio WHERE "Last Price" > {{
  LLMQA(
        'How many consecutive days did I buy stocks in Financials?', 
        (
            SELECT account_history."Run Date", account_history.Symbol, constituents."Sector"
              FROM account_history
              LEFT JOIN constituents ON account_history.Symbol = constituents.Symbol
              WHERE Sector = "Financials"
              ORDER BY "Run Date" LIMIT 5
        )
    )
  }} * 10
  AND {{LLMMap('Offers a media streaming service?', 'portfolio::Description')}} = 1

Constrained Decoding with options

Perhaps we want the answer to the above question in a different format. We can call our LLM ingredient in a constrained setting by passing an options argument, where we provide either semicolon-separated options, or a reference to a column.

{{
    LLMQA(
        'How many consecutive days did I buy stocks in Financials?', 
        (
            SELECT account_history."Run Date", account_history.Symbol, constituents."Sector"
              FROM account_history
              LEFT JOIN constituents ON account_history.Symbol = constituents.Symbol
              WHERE Sector = "Financials"
              ORDER BY "Run Date" LIMIT 5
        ),
        options='one consecutive day!;two consecutive days!;three consecutive days!'
    )
}}

Running the above BlendSQL query, we get the output two consecutive days!.

This options argument can also be a reference to a given column.

For example (from the HybridQA dataset):

 SELECT capacity FROM w WHERE venue = {{
        LLMQA(
            'Which venue is named in honor of Juan Antonio Samaranch?',
            (SELECT title, content FROM documents WHERE content MATCH 'venue'),
            options='w::venue'
        )
}}

Or, from our running example:

{{
  LLMQA(
      'Which did I buy the most?',
      (
        SELECT account_history."Run Date", account_history.Symbol, constituents."Sector"
          FROM account_history
          LEFT JOIN constituents ON account_history.Symbol = constituents.Symbol
          WHERE Sector = "Financials"
          ORDER BY "Run Date" LIMIT 5
      ),
      options='account_history::Symbol'
  )
}}

The above BlendSQL query will yield the result AIG, since it appears in the Symbol column of account_history.
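The two spellings of options seen above (a semicolon-delimited string, and a table::column reference) can be sketched with a toy resolver. The function and in-memory "database" below are illustrative only, not BlendSQL's actual implementation:

```python
from typing import Dict, List, Set

def resolve_options(spec: str, db: Dict[str, Dict[str, List[str]]]) -> Set[str]:
    # Illustrative only: 'table::column' pulls the distinct values of a column,
    # otherwise the string is split on semicolons into literal options.
    if "::" in spec:
        table, column = spec.split("::", 1)
        return set(db[table][column])
    return set(spec.split(";"))

db = {"account_history": {"Symbol": ["HBAN", "AIG", "AIG", "NTRS", "HBAN"]}}

# Column reference: the distinct symbols {'AIG', 'HBAN', 'NTRS'} (set order may vary)
print(resolve_options("account_history::Symbol", db))
# Literal options: three semicolon-separated strings
print(resolve_options("one day!;two days!;three days!", db))
```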