QAIngredient
Usage
LLMQA
Bases: QAIngredient
classmethod from_args(model=None, few_shot_examples=None, context_formatter=lambda df: df.to_markdown(index=False), list_options_in_prompt=True, k=None)

Creates a partial class with predefined arguments.
Parameters:

Name | Type | Description | Default
---|---|---|---
few_shot_examples | Optional[List[dict]] | A list of Example dictionaries for few-shot learning. If not specified, will use default_examples.json as default. | None
context_formatter | Callable[[DataFrame], str] | A callable that formats a pandas DataFrame into a string. Defaults to a lambda that converts the DataFrame to markdown without the index. | lambda df: df.to_markdown(index=False)
k | Optional[int] | Determines the number of few-shot examples to use for each ingredient call. Default is None, which uses all few-shot examples on all calls. If specified, will initialize a haystack-based DPR retriever to filter examples. | None
Returns:

Type | Description
---|---
Type[QAIngredient] | A partial class of QAIngredient with predefined arguments.
Examples:

```python
from blendsql import blend, LLMQA
from blendsql.ingredients.builtin import DEFAULT_QA_FEW_SHOT

ingredients = {
    LLMQA.from_args(
        few_shot_examples=[
            *DEFAULT_QA_FEW_SHOT,
            {
                "question": "Which weighs the most?",
                "context": {
                    "Animal": ["Dog", "Gorilla", "Hamster"],
                    "Weight": ["20 pounds", "350 lbs", "100 grams"]
                },
                "answer": "Gorilla",
                # Below are optional
                "options": ["Dog", "Gorilla", "Hamster"]
            }
        ],
        # Will fetch `k` most relevant few-shot examples using embedding-based retriever
        k=2,
        # Lambda to turn the pd.DataFrame to a serialized string
        context_formatter=lambda df: df.to_markdown(index=False)
    )
}
smoothie = blend(
    query=blendsql,
    db=db,
    ingredients=ingredients,
    default_model=model,
)
```
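The `context_formatter` passed above serializes the context table to markdown, but any callable mapping a `pd.DataFrame` to a `str` can be supplied. As an illustrative sketch (the CSV choice here is hypothetical, not a library default), a formatter that serializes to CSV instead:

```python
import pandas as pd

# Illustrative alternative: serialize the context table as CSV rather than markdown.
# Any Callable[[pd.DataFrame], str] is a valid context_formatter.
csv_formatter = lambda df: df.to_csv(index=False)

df = pd.DataFrame({
    "Animal": ["Dog", "Gorilla", "Hamster"],
    "Weight": ["20 pounds", "350 lbs", "100 grams"],
})
print(csv_formatter(df))
```

A more compact serialization like CSV can be useful when the context table is large and prompt length is a concern.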
run(model, question, context_formatter, few_shot_retriever=None, options=None, list_options_in_prompt=None, modifier=None, output_type=None, context=None, value_limit=None, long_answer=False, **kwargs)
Parameters:

Name | Type | Description | Default
---|---|---|---
question | str | The question to map onto the values. Will also be the new column name. | required
context | Optional[DataFrame] | Table subset to use as context in answering the question. | None
model | Model | The Model (blender) we will make calls to. | required
context_formatter | Callable[[DataFrame], str] | Callable defining how we want to serialize table context. | required
few_shot_retriever | Callable[[str], List[AnnotatedQAExample]] | Callable which takes a string, and returns the n most similar few-shot examples. | None
options | Optional[Collection[str]] | Optional collection with which we try to constrain generation. | None
list_options_in_prompt | bool | Defines whether we include options in the prompt for the current inference example. | None
modifier | ModifierType | If we expect an array of scalars, this defines the regex we want to apply. Used directly for constrained decoding at inference time if we have a guidance model. | None
output_type | Optional[Union[DataType, str]] | In the absence of example_outputs, gives the Model some signal as to what we expect as output. | None
regex | | Optional regex to constrain answer generation. | required
value_limit | Optional[int] | Optional limit on how many rows from context we use. | None
long_answer | bool | If true, we more closely mimic long-form end-to-end question answering. If false, we just give the answer with no explanation or context. | False
Returns:

Type | Description
---|---
Union[str, int, float] | Union[str, int, float, tuple] containing the response from the model. Response will only be a tuple if
Description
Sometimes, simply selecting data from a given database is not enough to sufficiently answer a user's question.
The `QAIngredient` is designed to return data of variable types, and is best used in cases where we need either:

1) Unstructured, free-text responses ("Give me a summary of all my spending on coffee")
2) Complex, unintuitive relationships extracted from table subsets ("How many consecutive days did I spend on coffee?")
3) Multi-hop reasoning from unstructured data, grounded in a structured schema (using the `options` arg)
Formally, this is an aggregate function which transforms a table subset into a single value.
The following query demonstrates usage of the builtin `LLMQA` ingredient.
```sql
{{
    LLMQA(
        'How many consecutive days did I buy stocks in Financials?',
        (
            SELECT account_history."Run Date", account_history.Symbol, constituents."Sector"
            FROM account_history
            LEFT JOIN constituents ON account_history.Symbol = constituents.Symbol
            WHERE Sector = "Financials"
            ORDER BY "Run Date" LIMIT 5
        )
    )
}}
```
Behind the scenes, we wrap the call to `LLMQA` in a `SELECT` clause, ensuring that the ingredient's output gets returned.

```sql
SELECT {{QAIngredient}}
```
"Run Date" | Symbol | Sector
---|---|---
2022-01-14 | HBAN | Financials
2022-01-20 | AIG | Financials
2022-01-24 | AIG | Financials
2022-01-24 | NTRS | Financials
2022-01-25 | HBAN | Financials
From examining this table, we see that we bought stocks in the Financials sector on 2 consecutive days (2022-01-24 and 2022-01-25).
The LLM answers the question in an end-to-end manner, returning the result `2`.
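For intuition, the "longest consecutive run" the LLM is computing here can be reproduced in plain Python over the dates in the table above (a sketch for illustration only; BlendSQL delegates this reasoning to the LLM, not to code like this):

```python
from datetime import date

# Unique purchase dates from the Financials subset shown above
days = sorted({
    date(2022, 1, 14), date(2022, 1, 20),
    date(2022, 1, 24), date(2022, 1, 25),
})

# Scan adjacent dates, tracking the longest run of consecutive calendar days
longest = run = 1
for prev, cur in zip(days, days[1:]):
    if (cur - prev).days == 1:
        run += 1
        longest = max(longest, run)
    else:
        run = 1
print(longest)  # -> 2
```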
The `QAIngredient` can be used as a standalone end-to-end QA tool, or as a component within a larger BlendSQL query.
For example, the BlendSQL query below translates to the valid (but rather confusing) question:
"Show me stocks in my portfolio, whose price is greater than the number of consecutive days I bought Financial stocks multiplied by 10. Only display those companies which offer a media streaming service."
```sql
SELECT Symbol, "Last Price" FROM portfolio WHERE "Last Price" > {{
    LLMQA(
        'How many consecutive days did I buy stocks in Financials?',
        (
            SELECT account_history."Run Date", account_history.Symbol, constituents."Sector"
            FROM account_history
            LEFT JOIN constituents ON account_history.Symbol = constituents.Symbol
            WHERE Sector = "Financials"
            ORDER BY "Run Date" LIMIT 5
        )
    )
}} * 10
AND {{LLMMap('Offers a media streaming service?', 'portfolio::Description')}} = 1
```
Constrained Decoding with options
Perhaps we want the answer to the above question in a different format. We can call our LLM ingredient in a constrained setting by passing an `options` argument, where we provide either semicolon-separated options or a reference to a column.
```sql
{{
    LLMQA(
        'How many consecutive days did I buy stocks in Financials?',
        (
            SELECT account_history."Run Date", account_history.Symbol, constituents."Sector"
            FROM account_history
            LEFT JOIN constituents ON account_history.Symbol = constituents.Symbol
            WHERE Sector = "Financials"
            ORDER BY "Run Date" LIMIT 5
        ),
        options='one consecutive day!;two consecutive days!;three consecutive days!'
    )
}}
```
Running the above BlendSQL query, we get the output `two consecutive days!`.
This `options` argument can also be a reference to a given column. For example (from the HybridQA dataset):
```sql
SELECT capacity FROM w WHERE venue = {{
    LLMQA(
        'Which venue is named in honor of Juan Antonio Samaranch?',
        (SELECT title, content FROM documents WHERE content MATCH 'venue'),
        options='w::venue'
    )
}}
```
Or, from our running example:
```sql
{{
    LLMQA(
        'Which did I buy the most?',
        (
            SELECT account_history."Run Date", account_history.Symbol, constituents."Sector"
            FROM account_history
            LEFT JOIN constituents ON account_history.Symbol = constituents.Symbol
            WHERE Sector = "Financials"
            ORDER BY "Run Date" LIMIT 5
        ),
        options='account_history::Symbol'
    )
}}
```
The above BlendSQL query will yield the result `AIG`, since it appears most frequently in the `Symbol` column from `account_history`.