QAIngredient
Usage
LLMQA
Bases: QAIngredient
classmethod from_args(model=None, few_shot_examples=None, context_formatter=lambda df: df.to_markdown(index=False), list_options_in_prompt=True, k=None)

Creates a partial class with predefined arguments.
Parameters:

Name | Type | Description | Default
---|---|---|---
few_shot_examples | Optional[List[dict]] | A list of Example dictionaries for few-shot learning. If not specified, will use default_examples.json as default. | None
context_formatter | Callable[[DataFrame], str] | A callable that formats a pandas DataFrame into a string. Defaults to a lambda that converts the DataFrame to markdown without the index. | lambda df: df.to_markdown(index=False)
k | Optional[int] | Determines the number of few-shot examples to use for each ingredient call. Default is None, which uses all few-shot examples on all calls. If specified, will initialize a haystack-based DPR retriever to filter examples. | None
Returns:

Type | Description
---|---
Type[QAIngredient] | A partial class of QAIngredient with predefined arguments.
Examples:

```python
from blendsql import blend, LLMQA
from blendsql.ingredients.builtin import DEFAULT_QA_FEW_SHOT

ingredients = {
    LLMQA.from_args(
        few_shot_examples=[
            *DEFAULT_QA_FEW_SHOT,
            {
                "question": "Which weighs the most?",
                "context": {
                    "Animal": ["Dog", "Gorilla", "Hamster"],
                    "Weight": ["20 pounds", "350 lbs", "100 grams"]
                },
                "answer": "Gorilla",
                # Below are optional
                "options": ["Dog", "Gorilla", "Hamster"]
            }
        ],
        # Will fetch `k` most relevant few-shot examples using embedding-based retriever
        k=2,
        # Lambda to turn the pd.DataFrame to a serialized string
        context_formatter=lambda df: df.to_markdown(index=False)
    )
}
smoothie = blend(
    query=blendsql,
    db=db,
    ingredients=ingredients,
    default_model=model,
)
```
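The `context_formatter` passed above serializes the context table to markdown, but any callable mapping a `pd.DataFrame` to a `str` can be supplied. As an illustrative sketch (the CSV choice here is hypothetical, not a library default), a formatter that serializes to CSV instead:

```python
import pandas as pd

# Illustrative alternative: serialize the context table as CSV rather than markdown.
# Any Callable[[pd.DataFrame], str] is a valid context_formatter.
csv_formatter = lambda df: df.to_csv(index=False)

df = pd.DataFrame({
    "Animal": ["Dog", "Gorilla", "Hamster"],
    "Weight": ["20 pounds", "350 lbs", "100 grams"],
})
print(csv_formatter(df))
```

A more compact serialization like CSV can be useful when the context table is large and prompt length is a concern.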
run(model, question, context_formatter, few_shot_retriever=None, options=None, list_options_in_prompt=None, modifier=None, output_type=None, context=None, value_limit=None, long_answer=False, **kwargs)
Parameters:

Name | Type | Description | Default
---|---|---|---
question | str | The question to map onto the values. Will also be the new column name. | required
context | Optional[DataFrame] | Table subset to use as context in answering the question. | None
model | Model | The Model (blender) we will make calls to. | required
context_formatter | Callable[[DataFrame], str] | Callable defining how we want to serialize table context. | required
few_shot_retriever | Callable[[str], List[AnnotatedQAExample]] | Callable which takes a string, and returns the n most similar few-shot examples. | None
options | Optional[Collection[str]] | Optional collection with which we try to constrain generation. | None
list_options_in_prompt | bool | Defines whether we include options in the prompt for the current inference example. | None
modifier | ModifierType | If we expect an array of scalars, this defines the regex we want to apply. Used directly for constrained decoding at inference time if we have a guidance model. | None
output_type | Optional[Union[DataType, str]] | In the absence of example_outputs, gives the Model some signal as to what we expect as output. | None
regex | | Optional regex to constrain answer generation. | required
value_limit | Optional[int] | Optional limit on how many rows from context we use. | None
long_answer | bool | If true, we more closely mimic long-form end-to-end question answering. If false, we just give the answer with no explanation or context. | False
Returns:

Type | Description
---|---
Union[str, int, float] | Union[str, int, float, tuple] containing the response from the model. Response will only be a tuple if
Description
Sometimes, simply selecting data from a given database is not enough to sufficiently answer a user's question.
The `QAIngredient` is designed to return data of variable types, and is best used in cases where we need either:

1) Unstructured, free-text responses ("Give me a summary of all my spending on coffee")
2) Complex, unintuitive relationships extracted from table subsets ("How many consecutive days did I spend on coffee?")
3) Multi-hop reasoning from unstructured data, grounded in a structured schema (using the `options` arg)
Formally, this is an aggregate function which transforms a table subset into a single value.
The following query demonstrates usage of the builtin `LLMQA` ingredient.
```sql
{{
    LLMQA(
        'How many consecutive days did I buy stocks in Financials?',
        (
            SELECT account_history."Run Date", account_history.Symbol, constituents."Sector"
            FROM account_history
            LEFT JOIN constituents ON account_history.Symbol = constituents.Symbol
            WHERE Sector = "Financials"
            ORDER BY "Run Date" LIMIT 5
        )
    )
}}
```
Behind the scenes, we wrap the call to `LLMQA` in a `SELECT` clause, ensuring that the ingredient's output gets returned.

```sql
SELECT {{QAIngredient}}
```
"Run Date" | Symbol | Sector
---|---|---
2022-01-14 | HBAN | Financials
2022-01-20 | AIG | Financials
2022-01-24 | AIG | Financials
2022-01-24 | NTRS | Financials
2022-01-25 | HBAN | Financials
From examining this table, we see that we bought stocks in the Financials sector on 2 consecutive days (2022-01-24 and 2022-01-25).
The LLM answers the question in an end-to-end manner, returning the result `2`.
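For intuition, the "longest consecutive run" the LLM is computing here can be reproduced in plain Python over the dates in the table above (a sketch for illustration only; BlendSQL delegates this reasoning to the LLM, not to code like this):

```python
from datetime import date

# Unique purchase dates from the Financials subset shown above
days = sorted({
    date(2022, 1, 14), date(2022, 1, 20),
    date(2022, 1, 24), date(2022, 1, 25),
})

# Scan adjacent dates, tracking the longest run of consecutive calendar days
longest = run = 1
for prev, cur in zip(days, days[1:]):
    if (cur - prev).days == 1:
        run += 1
        longest = max(longest, run)
    else:
        run = 1
print(longest)  # -> 2
```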
The `QAIngredient` can be used as a standalone end-to-end QA tool, or as a component within a larger BlendSQL query.
For example, the BlendSQL query below translates to the valid (but rather confusing) question:
"Show me stocks in my portfolio, whose price is greater than the number of consecutive days I bought Financial stocks multiplied by 10. Only display those companies which offer a media streaming service."
```sql
SELECT Symbol, "Last Price" FROM portfolio WHERE "Last Price" > {{
    LLMQA(
        'How many consecutive days did I buy stocks in Financials?',
        (
            SELECT account_history."Run Date", account_history.Symbol, constituents."Sector"
            FROM account_history
            LEFT JOIN constituents ON account_history.Symbol = constituents.Symbol
            WHERE Sector = "Financials"
            ORDER BY "Run Date" LIMIT 5
        )
    )
}} * 10
AND {{LLMMap('Offers a media streaming service?', 'portfolio::Description')}} = 1
```
Constrained Decoding with options
Perhaps we want the answer to the above question in a different format. We can call our LLM ingredient in a constrained setting by passing an `options` argument, where we provide either semicolon-separated options or a reference to a column.
```sql
{{
    LLMQA(
        'How many consecutive days did I buy stocks in Financials?',
        (
            SELECT account_history."Run Date", account_history.Symbol, constituents."Sector"
            FROM account_history
            LEFT JOIN constituents ON account_history.Symbol = constituents.Symbol
            WHERE Sector = "Financials"
            ORDER BY "Run Date" LIMIT 5
        ),
        options='one consecutive day!;two consecutive days!;three consecutive days!'
    )
}}
```
Running the above BlendSQL query, we get the output `two consecutive days!`.
This `options` argument can also be a reference to a given column. For example (from the HybridQA dataset):
```sql
SELECT capacity FROM w WHERE venue = {{
    LLMQA(
        'Which venue is named in honor of Juan Antonio Samaranch?',
        (SELECT title, content FROM documents WHERE content MATCH 'venue'),
        options='w::venue'
    )
}}
```
Or, from our running example:
```sql
{{
    LLMQA(
        'Which did I buy the most?',
        (
            SELECT account_history."Run Date", account_history.Symbol, constituents."Sector"
            FROM account_history
            LEFT JOIN constituents ON account_history.Symbol = constituents.Symbol
            WHERE Sector = "Financials"
            ORDER BY "Run Date" LIMIT 5
        ),
        options='account_history::Symbol'
    )
}}
```
The above BlendSQL query will yield the result `AIG`, since it appears most frequently in the `Symbol` column from `account_history`.