In [2]:

Copied!





import pandas as pd
import nest_asyncio
nest_asyncio.apply()

from blendsql.db import Pandas
from blendsql.common.utils import tabulate
import blendsql
import pandas as pd
import nest_asyncio
nest_asyncio.apply()

from blendsql.db import Pandas
from blendsql.common.utils import tabulate
import blendsql

BlendSQL by Example¶

This notebook introduces BlendSQL, and some of the usecases it can support.

Importantly, the novelty of BlendSQL isn't from the ability to constrain language models according to some regular expression or context-free grammar. We can credit projects like guidance and outlines for that. Instead, the novelty of BlendSQL is its ability to infer these constraints according to the surrounding SQL syntax and closely align generation to the structure of the database.

SQL, as a grammar, has a lot of rules. Just take these SQLite syntax diagrams for example. These rules include things like, IN statement should be followed by a list of items, <, >, should contain numerics, but = could contain any datatype, etc. We can use these to inform language-model functions, which we call 'ingredients', and denote in double curly brackets ({{ and }}).

A Note on Models¶

This demo, unless noted otherwise, uses the amazing Azure AI with serverside Guidance integration, described here. It allows us to access a Phi-3.5-mini on Azure, and utilize it in a constrained setting (i.e. have it follow a regular expression pattern, interleave text with generation calls, etc.)

If you don't have an Azure access key, you can swap out the model below for any of the other model integrations that BlendSQL supports.

To begin, let's set up a local database using from blendsql.db import Pandas.

In [3]:

Copied!





people_db = Pandas(
    {
        "People": pd.DataFrame(
            {
               'Name': [
                   'George Washington', 
                   'John Quincy Adams', 
                   'Thomas Jefferson', 
                   'James Madison', 
                   'James Monroe', 
                   'Alexander Hamilton',
                   'Sabrina Carpenter',
                   'Charli XCX',
                   'Elon Musk',
                   'Michelle Obama',
                   'Elvis Presley',
               ],
               'Known_For': [
                   'Established federal government, First U.S. President',
                   'XYZ Affair, Alien and Sedition Acts',
                   'Louisiana Purchase, Declaration of Independence',
                   'War of 1812, Constitution',
                   'Monroe Doctrine, Missouri Compromise',
                   'Created national bank, Federalist Papers',
                   'Nonsense, Emails I Cant Send, Mean Girls musical',
                   'Crash, How Im Feeling Now, Boom Clap',
                   'Tesla, SpaceX, Twitter/X acquisition',
                   'Lets Move campaign, Becoming memoir',
                   '14 Grammys, King of Rock n Roll'
               ]
            }
        ),
        "Eras": pd.DataFrame(
            {
                'Years': [
                    '1800-1900',
                    '1900-2000',
                    '2000-Now'
                ]
            }
        )
    }
)
# Print the tables in our database
for tablename in people_db.tables():
    print(tablename)
    print(tabulate(people_db.execute_to_df(f"SELECT * FROM {tablename};")))
people_db = Pandas(
    {
        "People": pd.DataFrame(
            {
               'Name': [
                   'George Washington', 
                   'John Quincy Adams', 
                   'Thomas Jefferson', 
                   'James Madison', 
                   'James Monroe', 
                   'Alexander Hamilton',
                   'Sabrina Carpenter',
                   'Charli XCX',
                   'Elon Musk',
                   'Michelle Obama',
                   'Elvis Presley',
               ],
               'Known_For': [
                   'Established federal government, First U.S. President',
                   'XYZ Affair, Alien and Sedition Acts',
                   'Louisiana Purchase, Declaration of Independence',
                   'War of 1812, Constitution',
                   'Monroe Doctrine, Missouri Compromise',
                   'Created national bank, Federalist Papers',
                   'Nonsense, Emails I Cant Send, Mean Girls musical',
                   'Crash, How Im Feeling Now, Boom Clap',
                   'Tesla, SpaceX, Twitter/X acquisition',
                   'Lets Move campaign, Becoming memoir',
                   '14 Grammys, King of Rock n Roll'
               ]
            }
        ),
        "Eras": pd.DataFrame(
            {
                'Years': [
                    '1800-1900',
                    '1900-2000',
                    '2000-Now'
                ]
            }
        )
    }
)
# Print the tables in our database
for tablename in people_db.tables():
    print(tablename)
    print(tabulate(people_db.execute_to_df(f"SELECT * FROM {tablename};")))

Eras
┌───────────┐
│ Years     │
├───────────┤
│ 1800-1900 │
│ 1900-2000 │
│ 2000-Now  │
└───────────┘
People
┌────────────────────┬──────────────────────────────────────────────────────┐
│ Name               │ Known_For                                            │
├────────────────────┼──────────────────────────────────────────────────────┤
│ George Washington  │ Established federal government, First U.S. President │
│ John Quincy Adams  │ XYZ Affair, Alien and Sedition Acts                  │
│ Thomas Jefferson   │ Louisiana Purchase, Declaration of Independence      │
│ James Madison      │ War of 1812, Constitution                            │
│ James Monroe       │ Monroe Doctrine, Missouri Compromise                 │
│ Alexander Hamilton │ Created national bank, Federalist Papers             │
│ Sabrina Carpenter  │ Nonsense, Emails I Cant Send, Mean Girls musical     │
│ Charli XCX         │ Crash, How Im Feeling Now, Boom Clap                 │
│ Elon Musk          │ Tesla, SpaceX, Twitter/X acquisition                 │
│ Michelle Obama     │ Lets Move campaign, Becoming memoir                  │
│ Elvis Presley      │ 14 Grammys, King of Rock n Roll                      │
└────────────────────┴──────────────────────────────────────────────────────┘

In [4]:

Copied!





# Define a utility function to make query execution easier
blend = lambda query, *args, **kwargs: blendsql.blend(
    query,
    db=kwargs.get("db", people_db),
    ingredients={blendsql.LLMQA, blendsql.RAGQA, blendsql.LLMMap, blendsql.LLMJoin},
    # This model can be changed, according to what your personal setup is
    default_model=kwargs.get("model", blendsql.models.AzurePhiModel(env="..", caching=False)),
    verbose=True
)
# Define a utility function to make query execution easier
blend = lambda query, *args, **kwargs: blendsql.blend(
    query,
    db=kwargs.get("db", people_db),
    ingredients={blendsql.LLMQA, blendsql.RAGQA, blendsql.LLMMap, blendsql.LLMJoin},
    # This model can be changed, according to what your personal setup is
    default_model=kwargs.get("model", blendsql.models.AzurePhiModel(env="..", caching=False)),
    verbose=True
)

The Elephant in the Room - Aren't LLM Functions in SQL Super Slow?¶

Short answer - compared to nearly all native SQL operations, yes.

However, when using remote APIs like OpenAI or Anthropic, we can dramatically speed up processing times by batching async requests. Below demonstrates that, for a table with 17,686 rows and 1,165 unique values in the column we process, it takes only about 6.5 seconds to run our query with gpt-4o-mini (or about 0.005 seconds per value).

By default, we allow 10 concurrent async requests. Depending on your own quotas set by the API provider, you may be able to increase this number using:

import blendsql

# Set the limit for max async calls at a given time below
blendsql.config.set_async_limit(20)

A Note on Query Optimizations¶

Because LLM functions are relatively slow compared to other SQL functions, when we perform query optimizations behind the scenes, we make sure to execute all native SQL functions before any LLM-based functions. This ensures the language model only receives the smallest set of data it needs to faithfully evaluate a given SQL expression

In [56]:

Copied!

db = blendsql.db.SQLite(blendsql.common.utils.fetch_from_hub("california_schools.db"))
print("{} total rows in the table".format(db.execute_to_list("SELECT COUNT(*) FROM schools LIMIT 10;")[0]))
print("{} total unique values in the 'City' column".format(db.execute_to_list("SELECT COUNT(DISTINCT City) FROM schools LIMIT 10;")[0]))
db = blendsql.db.SQLite(blendsql.common.utils.fetch_from_hub("california_schools.db"))
print("{} total rows in the table".format(db.execute_to_list("SELECT COUNT(*) FROM schools LIMIT 10;")[0]))
print("{} total unique values in the 'City' column".format(db.execute_to_list("SELECT COUNT(DISTINCT City) FROM schools LIMIT 10;")[0]))

17686 total rows in the table
1165 total unique values in the 'City' column

In [63]:

Copied!





smoothie = blend(
    """
    SELECT City, {{LLMMap('Is this in the Bay Area?', 'schools::City', options='t;f')}} AS 'In Bay Area?' FROM schools;
    """,
    # Override the default database and model arguments
    db=db,
    model=blendsql.models.OpenaiLLM('gpt-4o-mini', caching=False, env='..')
)
print(f"Finished in {smoothie.meta.process_time_seconds} seconds")
print(tabulate(smoothie.df.head(10)))
smoothie = blend(
    """
    SELECT City, {{LLMMap('Is this in the Bay Area?', 'schools::City', options='t;f')}} AS 'In Bay Area?' FROM schools;
    """,
    # Override the default database and model arguments
    db=db,
    model=blendsql.models.OpenaiLLM('gpt-4o-mini', caching=False, env='..')
)
print(f"Finished in {smoothie.meta.process_time_seconds} seconds")
print(tabulate(smoothie.df.head(10)))

Executing `SELECT * FROM schools` and setting to `32d0_schools_0`...
CREATE TEMP TABLE "32d0_schools_0" (
	"CDSCode" TEXT, 
	"NCESDist" TEXT, 
	"NCESSchool" TEXT, 
	"StatusType" TEXT, 
	"County" TEXT, 
	"District" TEXT, 
	"School" TEXT, 
	"Street" TEXT, 
	"StreetAbr" TEXT, 
	"City" TEXT, 
	"Zip" TEXT, 
	"State" TEXT, 
	"MailStreet" TEXT, 
	"MailStrAbr" TEXT, 
	"MailCity" TEXT, 
	"MailZip" TEXT, 
	"MailState" TEXT, 
	"Phone" TEXT, 
	"Ext" TEXT, 
	"Website" TEXT, 
	"OpenDate" TEXT, 
	"ClosedDate" TEXT, 
	"Charter" FLOAT, 
	"CharterNum" TEXT, 
	"FundingType" TEXT, 
	"DOC" TEXT, 
	"DOCType" TEXT, 
	"SOC" TEXT, 
	"SOCType" TEXT, 
	"EdOpsCode" TEXT, 
	"EdOpsName" TEXT, 
	"EILCode" TEXT, 
	"EILName" TEXT, 
	"GSoffered" TEXT, 
	"GSserved" TEXT, 
	"Virtual" TEXT, 
	"Magnet" FLOAT, 
	"Latitude" FLOAT, 
	"Longitude" FLOAT, 
	"AdmFName1" TEXT, 
	"AdmLName1" TEXT, 
	"AdmEmail1" TEXT, 
	"AdmFName2" TEXT, 
	"AdmLName2" TEXT, 
	"AdmEmail2" TEXT, 
	"AdmFName3" TEXT, 
	"AdmLName3" TEXT, 
	"AdmEmail3" TEXT, 
	"LastUpdate" TEXT
)
Executing  `{{LLMMap('Is this in the Bay Area?', 'schools::City', options='t;f')}}`...
Using options '['t', 'f']'
Making calls to Model with batch_size 5: |          | 234/? [00:00<00:00, 30475.61it/s]
LLMMap with OpenaiLLM(gpt-4o-mini) only returned 1165 out of 1166 values
Finished LLMMap with values:
{
    "Hayward": true,
    "Newark": true,
    "Oakland": true,
    "Berkeley": true,
    "San Leandro": true,
    "-": false,
    "Dublin": true,
    "Fremont": false,
    "Sacramento": true,
    "Alameda": null
}
Combining 1 outputs for table `schools`
CREATE TEMP TABLE "32d0_schools" (
	"CDSCode" TEXT, 
	"NCESDist" TEXT, 
	"NCESSchool" TEXT, 
	"StatusType" TEXT, 
	"County" TEXT, 
	"District" TEXT, 
	"School" TEXT, 
	"Street" TEXT, 
	"StreetAbr" TEXT, 
	"City" TEXT, 
	"Zip" TEXT, 
	"State" TEXT, 
	"MailStreet" TEXT, 
	"MailStrAbr" TEXT, 
	"MailCity" TEXT, 
	"MailZip" TEXT, 
	"MailState" TEXT, 
	"Phone" TEXT, 
	"Ext" TEXT, 
	"Website" TEXT, 
	"OpenDate" TEXT, 
	"ClosedDate" TEXT, 
	"Charter" FLOAT, 
	"CharterNum" TEXT, 
	"FundingType" TEXT, 
	"DOC" TEXT, 
	"DOCType" TEXT, 
	"SOC" TEXT, 
	"SOCType" TEXT, 
	"EdOpsCode" TEXT, 
	"EdOpsName" TEXT, 
	"EILCode" TEXT, 
	"EILName" TEXT, 
	"GSoffered" TEXT, 
	"GSserved" TEXT, 
	"Virtual" TEXT, 
	"Magnet" FLOAT, 
	"Latitude" FLOAT, 
	"Longitude" FLOAT, 
	"AdmFName1" TEXT, 
	"AdmLName1" TEXT, 
	"AdmEmail1" TEXT, 
	"AdmFName2" TEXT, 
	"AdmLName2" TEXT, 
	"AdmEmail2" TEXT, 
	"AdmFName3" TEXT, 
	"AdmLName3" TEXT, 
	"AdmEmail3" TEXT, 
	"LastUpdate" TEXT, 
	"Is this in the Bay Area?" BOOLEAN
)
Final Query:
SELECT "32d0_schools".City AS City,  "32d0_schools"."Is this in the Bay Area?"  AS "In Bay Area?" FROM "32d0_schools"

Finished in 6.575707912445068 seconds
┌─────────────┬────────────────┐
│ City        │   In Bay Area? │
├─────────────┼────────────────┤
│ Hayward     │              1 │
│ Newark      │              1 │
│ Oakland     │              1 │
│ Berkeley    │              1 │
│ Oakland     │              1 │
│ Oakland     │              1 │
│ Oakland     │              1 │
│ Hayward     │              1 │
│ San Leandro │              1 │
│ Hayward     │              1 │
└─────────────┴────────────────┘

Classification with 'LLMMap' and GROUP BY', Constrained by a Column's Values¶

Below, we set up a BlendSQL query leveraging the LLMMap ingredient. This is a unary function similar to the LENGTH or ABS functions in standard SQLite. It takes a single argument (a value from a column) and returns a transformed output, which is then assigned to a new column.

Below, we set up a language-model function which takes in the values from the Name column of the People table, and outputs a value exclusively selected from the Eras::Years column.

In [66]:

Copied!





smoothie = blend("""
SELECT GROUP_CONCAT(Name, ', ') AS 'Names',
{{LLMMap('In which time period did the person live?', 'People::Name', options='Eras::Years')}} AS "Lived During Classification"
FROM People
GROUP BY "Lived During Classification"
""")
print(smoothie.df)
smoothie = blend("""
SELECT GROUP_CONCAT(Name, ', ') AS 'Names',
{{LLMMap('In which time period did the person live?', 'People::Name', options='Eras::Years')}} AS "Lived During Classification"
FROM People
GROUP BY "Lived During Classification"
""")
print(smoothie.df)

Executing  `{{LLMMap('In which time period did the person live?', 'People::Name', options='Eras::Years')}}`...
Using options '['2000-Now', '1900-2000', '1800-1900']'
Making calls to Model with batch_size 5: |          | 3/? [00:01<00:00,  1.85it/s]    
Finished LLMMap with values:
{
    "Elvis Presley": "1900-2000",
    "John Quincy Adams": "1800-1900",
    "James Monroe": "1800-1900",
    "Elon Musk": "2000-Now",
    "George Washington": "1800-1900",
    "Alexander Hamilton": "1800-1900",
    "James Madison": "1800-1900",
    "Sabrina Carpenter": "2000-Now",
    "Thomas Jefferson": "1800-1900",
    "Charli XCX": "2000-Now"
}
Combining 1 outputs for table `People`
Created temp table e88a_People
Final Query:
SELECT GROUP_CONCAT(Name, ', ') AS "Names",  "e88a_People"."In which time period did the person live?"  AS "Lived During Classification" FROM "e88a_People" GROUP BY "Lived During Classification"

┌───────────────────────────────────────────────────────┬───────────────────────────────┐
│ Names                                                 │ Lived During Classification   │
├───────────────────────────────────────────────────────┼───────────────────────────────┤
│ George Washington, John Quincy Adams, Thomas Jeffe... │ 1800-1900                     │
│ Sabrina Carpenter, Charli XCX, Elon Musk              │ 2000-Now                      │
│ Michelle Obama, Elvis Presley                         │ 1900-2000                     │
└───────────────────────────────────────────────────────┴───────────────────────────────┘

Constrained Decoding - The Presidents Challenge¶

Why does constrained decoding matter? Imagine we want to select all the information we have in our table about the first 3 presidents of the U.S. In the absence of relevant data stored in our database, we turn to our language model. But one thing thwarts our plans - the language model doesn't know that we've stored the 2nd president's name in our database as 'John Quincy Adams', not 'John Adams'.

In [23]:

Copied!





# Setting `infer_gen_constraints=False` - otherwise, this counter-example would work
smoothie = blend("""
SELECT * FROM People
WHERE People.Name IN {{LLMQA('First 3 presidents of the U.S?', output_type='List[str]')}}
""", infer_gen_constraints=False)
# The final query 'SELECT * FROM People WHERE Name IN  ('George Washington','John Adams','Thomas Jefferson')' only yields 2 rows
print(smoothie.df)
# Setting `infer_gen_constraints=False` - otherwise, this counter-example would work
smoothie = blend("""
SELECT * FROM People
WHERE People.Name IN {{LLMQA('First 3 presidents of the U.S?', output_type='List[str]')}}
""", infer_gen_constraints=False)
# The final query 'SELECT * FROM People WHERE Name IN  ('George Washington','John Adams','Thomas Jefferson')' only yields 2 rows
print(smoothie.df)

Executing  `{{LLMQA('First 3 presidents of the U.S?', output_type='List[str]')}}`...
Final Query:
SELECT * FROM People WHERE People.Name IN  ('George Washington','John Adams','Thomas Jefferson')

┌───────────────────┬───────────────────────────────────────────────────────┐
│ Name              │ Known_For                                             │
├───────────────────┼───────────────────────────────────────────────────────┤
│ George Washington │ Established federal government, First U.S. Preside... │
│ Thomas Jefferson  │ Louisiana Purchase, Declaration of Independence       │
└───────────────────┴───────────────────────────────────────────────────────┘

Constrained decoding comes to our rescue. By specifying infer_gen_constraints=True (which is the default), BlendSQL infers from the surrounding SQL syntax that we expect a value from People.Name, and we force the generation to only select from values present in the Name column - which leads to the expected response.

In [38]:

Copied!





smoothie = blend("""
SELECT * FROM People
WHERE People.Name IN {{LLMQA('First 3 presidents of the U.S?')}}
""", infer_gen_constraints=True)
print(smoothie.df)
smoothie = blend("""
SELECT * FROM People
WHERE People.Name IN {{LLMQA('First 3 presidents of the U.S?')}}
""", infer_gen_constraints=True)
print(smoothie.df)

Executing  `{{LLMQA('First 3 presidents of the U.S?')}}`...
Using options '{'George Washington', 'James Monroe', 'Thomas Jefferson', 'James Madison', 'John Quincy Adams', 'Michelle Obama', 'Elon Musk', 'Charli XCX', 'Elvis Presley', 'Alexander Hamilton', 'Sabrina Carpenter'}'
Final Query:
SELECT * FROM People WHERE People.Name IN  ('George Washington','John Quincy Adams','James Monroe')

┌───────────────────┬───────────────────────────────────────────────────────┐
│ Name              │ Known_For                                             │
├───────────────────┼───────────────────────────────────────────────────────┤
│ George Washington │ Established federal government, First U.S. Preside... │
│ John Quincy Adams │ XYZ Affair, Alien and Sedition Acts                   │
│ James Monroe      │ Monroe Doctrine, Missouri Compromise                  │
└───────────────────┴───────────────────────────────────────────────────────┘

Constrained Decoding - The Alphabet Challenge¶

In BlendSQL, we can utilize the power of constrained decoding to guide a language model's generation towards the structure we expect. In other words, rather than taking a "prompt-and-pray" approach in which we meticulously craft a natural language prompt which (hopefully) generates a list of 3 strings, we can interact with the logit space to ensure this is the case¹.

[!NOTE]
These guarantees are only made possible with open models, i.e. where we can access the underlying logits. For closed-models like OpenAI and Anthropic, we rely on prompting (i.e. 'Datatype: List[str]') and make predictions "optimistically"

To demonstrate this, we can use the LLMQA ingredient. This ingredient optionally takes in a table subset as context, and returns either a scalar value or a list of scalars.

Since BlendSQL can infer the shape of a valid generation according to the surrounding SQL syntax, when we use the LLMQA ingredient in a VALUES or IN clause, it will generate a list by default.

In [26]:

Copied!





smoothie = blend("""
SELECT * FROM ( VALUES {{LLMQA('What are the first letters of the alphabet?')}} )
""")
print(smoothie.df)
smoothie = blend("""
SELECT * FROM ( VALUES {{LLMQA('What are the first letters of the alphabet?')}} )
""")
print(smoothie.df)

Executing  `{{LLMQA('What are the first letters of the alphabet?')}}`...
Final Query:
SELECT * FROM (VALUES ( 'A' ))

┌────────┐
│ col0   │
├────────┤
│ A      │
└────────┘

Ok, so we were able to generate the first letter of the alphabet... what if we want more?

Rather than modify the prompt itself (which can be quite finicky), we can leverage the regex-inspired quantifier argument. This will take either the strings '*' (zero-or-more) or '+' (one-or-more), in addition to tighter bounds of '{3}' (exactly 3) or '{1,6}' (between 1 and 6).

In [27]:

Copied!





smoothie = blend("""
SELECT * FROM ( VALUES {{LLMQA('What are the first letters of the alphabet?', quantifier='{3}')}} )
""")
print(smoothie.df)
smoothie = blend("""
SELECT * FROM ( VALUES {{LLMQA('What are the first letters of the alphabet?', quantifier='{3}')}} )
""")
print(smoothie.df)

Executing  `{{LLMQA('What are the first letters of the alphabet?', quantifier='{3}')}}`...
Final Query:
SELECT * FROM (VALUES ( 'A','B','C' ))

┌────────┬────────┬────────┐
│ col0   │ col1   │ col2   │
├────────┼────────┼────────┤
│ A      │ B      │ C      │
└────────┴────────┴────────┘

What if we want to generate the letters of a different alphabet? We can use the options argument for this, which takes either a reference to another column in the form 'tablename::columnname', or a set of semicolon-separated strings.

In [28]:

Copied!





smoothie = blend("""
SELECT * FROM ( VALUES {{LLMQA('What are the first letters of the alphabet?', options='α;β;γ;δ', quantifier='{3}')}} )
""")
print(smoothie.df)
smoothie = blend("""
SELECT * FROM ( VALUES {{LLMQA('What are the first letters of the alphabet?', options='α;β;γ;δ', quantifier='{3}')}} )
""")
print(smoothie.df)

Executing  `{{LLMQA('What are the first letters of the alphabet?', options='α;β;γ;δ', quantifier='{3}')}}`...
Using options '{'γ', 'δ', 'β', 'α'}'
Final Query:
SELECT * FROM (VALUES ( 'α','β','γ' ))

┌────────┬────────┬────────┐
│ col0   │ col1   │ col2   │
├────────┼────────┼────────┤
│ α      │ β      │ γ      │
└────────┴────────┴────────┘

Agent-Based Inference with CTE Expressions¶

The above example opens up the opportunity to rewrite the query as more of an agent-based flow. SQL is a bit odd in that it's executed bottom-up, i.e. to execute the following query:

SELECT the_answer FROM final_table WHERE final_table.x IN 
    (SELECT some_field FROM initial_table)

...We first gather some_field from initial_table, and then go and fetch the_answer, despite the author (human or AI) having written the second step, first. This is similar to the point made by Google in the pipe-syntax paper about how SQL syntactic clause order doesn't match semantic evaluation order.

At the end of the day, we have two agents performing the following tasks -

Brainstorm some greek letters
Using the output of the previous task, select only the first 3

With BlendSQL, we can use common table expressions (CTEs) to more closely mimic this order of 'agents'.

In [29]:

Copied!





smoothie = blend("""
WITH letter_agent_output AS (
    SELECT * FROM (VALUES {{LLMQA('List some greek letters')}})
) SELECT {{
    LLMQA(
        'What is the first letter of the alphabet?', 
        options=(SELECT * FROM letter_agent_output)
    )}}
""")
print(smoothie.df)
smoothie = blend("""
WITH letter_agent_output AS (
    SELECT * FROM (VALUES {{LLMQA('List some greek letters')}})
) SELECT {{
    LLMQA(
        'What is the first letter of the alphabet?', 
        options=(SELECT * FROM letter_agent_output)
    )}}
""")
print(smoothie.df)

Executing  `{{ LLMQA( 'What is the first letter of the alphabet?', options=(SELECT * FROM letter_agent_output) )}}`...
Executing `SELECT * FROM (VALUES ({{LLMQA('List some greek letters')}}))` and setting to `letter_agent_output`
Executing  `{{LLMQA('List some greek letters')}}`...
Final Query:
SELECT * FROM (VALUES ( 'alpha','beta','gamma','delta','epsilon','zeta','eta','theta','iota','kappa','lambda','mu','nu','xi','omicron','pi','rho','sigma','tau','upsilon','phi','chi','psi','omega' ))
Created temp table letter_agent_output
No BlendSQL ingredients found in query:
SELECT * FROM letter_agent_output
Executing as vanilla SQL...
Using options '{'mu', 'sigma', 'kappa', 'lambda', 'eta', 'rho', 'gamma', 'theta', 'nu', 'phi', 'iota', 'zeta', 'psi', 'omega', 'upsilon', 'beta', 'xi', 'tau', 'pi', 'delta', 'chi', 'alpha', 'omicron', 'epsilon'}'
Final Query:
SELECT  'alpha'

┌───────────┐
│ 'alpha'   │
├───────────┤
│ alpha     │
└───────────┘

Using `output_type` to Influence Generation¶

BlendSQL does its best to infer datatytpes given surrounding syntax. Sometimes, though, the user may want to override those assumptions, or inject new ones that were unable to be inferred.

The output_type argument takes a Python-style type annotation like int, str, bool or float. Below we use that to guide the generation towards one-or-more integer.

In [30]:

Copied!





smoothie = blend("""
SELECT * FROM ( VALUES {{LLMQA('Count up, starting from 1', output_type='int', quantifier='+')}} )
""")
print(smoothie.df)
smoothie = blend("""
SELECT * FROM ( VALUES {{LLMQA('Count up, starting from 1', output_type='int', quantifier='+')}} )
""")
print(smoothie.df)

Executing  `{{LLMQA('Count up, starting from 1', output_type='int', quantifier='+')}}`...
Using regex '(\d{1,18})'
Final Query:
SELECT * FROM (VALUES ( '1','2' ))

┌────────┬────────┐
│   col0 │   col1 │
├────────┼────────┤
│      1 │      2 │
└────────┴────────┘

RAG for Unstructured Reasoning¶

In addition to using the LLMQA ingredient as a method for generating with tight syntax-aware constraints, we can also relax a bit and let the model give us an unstructured generation for things like summarization.

Also, we can use the context argument to provide relevant table context. This allows us to condition generation on a curated set of data (and do cool stuff with nested reasoning).

In [5]:

Copied!





# Give a short summary of the person who had a musical by Lin-Manuel Miranda written about them
smoothie = blend("""
SELECT {{
    LLMQA(
        'Give me a very short summary of this person', 
        context=(
            SELECT * FROM People 
            WHERE People.Name = {{LLMQA('Who has a musical by Lin-Manuel Miranda written about them?', options='People::Name')}}
        )
    )
}} AS "Summary"
""")
print(smoothie.df)
# Give a short summary of the person who had a musical by Lin-Manuel Miranda written about them
smoothie = blend("""
SELECT {{
    LLMQA(
        'Give me a very short summary of this person', 
        context=(
            SELECT * FROM People 
            WHERE People.Name = {{LLMQA('Who has a musical by Lin-Manuel Miranda written about them?', options='People::Name')}}
        )
    )
}} AS "Summary"
""")
print(smoothie.df)

Executing  `{{ LLMQA( 'Give me a very short summary of this person', context=( SELECT * FROM People WHERE People.Name = {{LLMQA('Who has a musical by Lin-Manuel Miranda written about them?', options='People::Name')}} ) ) }}`...
Executing  `{{LLMQA ( 'Who has a musical by Lin-Manuel Miranda written about them?' , options= 'People::Name' ) }}`...
Using options '{'Elon Musk', 'James Madison', 'Elvis Presley', 'Thomas Jefferson', 'Sabrina Carpenter', 'Michelle Obama', 'John Quincy Adams', 'Alexander Hamilton', 'James Monroe', 'Charli XCX', 'George Washington'}'
Final Query:
SELECT * FROM People WHERE People.Name =  'Alexander Hamilton'

Warning: can't backtrack over " Question‧:"; this may confuse the model

Final Query:
SELECT  'Founder of national bank, author of Federalist Papers '  AS "Summary"

┌───────────────────────────────────────────────────────┐
│ Summary                                               │
├───────────────────────────────────────────────────────┤
│ Founder of national bank, author of Federalist Pap... │
└───────────────────────────────────────────────────────┘

In [35]:

Copied!





# A two-step reasoning problem:
#   1) Identify who, out of the table, is a singer using `LLMMap`
#   2) Where the previous step yields `TRUE`, select the one that wrote the song Espresso.
smoothie = blend("""
WITH Musicians AS 
(SELECT Name FROM People WHERE {{LLMMap('Is a singer?', 'People::Name')}} = TRUE)
SELECT Name AS "Espresso Singer" FROM Musicians WHERE Musicians.Name = {{LLMQA('Who wrote the song Espresso?')}}
""")
print(smoothie.df)
# A two-step reasoning problem:
#   1) Identify who, out of the table, is a singer using `LLMMap`
#   2) Where the previous step yields `TRUE`, select the one that wrote the song Espresso.
smoothie = blend("""
WITH Musicians AS 
(SELECT Name FROM People WHERE {{LLMMap('Is a singer?', 'People::Name')}} = TRUE)
SELECT Name AS "Espresso Singer" FROM Musicians WHERE Musicians.Name = {{LLMQA('Who wrote the song Espresso?')}}
""")
print(smoothie.df)

Executing  `{{LLMQA('Who wrote the song Espresso?')}}`...
Executing `SELECT Name AS Name FROM People WHERE {{LLMMap('Is a singer?', 'People::Name')}} = TRUE` and setting to `Musicians`
Executing  `{{LLMMap('Is a singer?', 'People::Name')}}`...
When inferring `options` in infer_gen_kwargs, encountered a column node with no table specified!
Should probably mark `schema_qualify` arg as True
Using regex '(t|f)'
Making calls to Model with batch_size 5: |          | 3/? [00:01<00:00,  2.69it/s]    
Finished LLMMap with values:
{
    "Elvis Presley": true,
    "Michelle Obama": false,
    "George Washington": false,
    "Alexander Hamilton": false,
    "John Quincy Adams": false,
    "James Monroe": false,
    "Elon Musk": false,
    "James Madison": false,
    "Sabrina Carpenter": true,
    "Thomas Jefferson": false
}
Combining 1 outputs for table `People`
Created temp table 9f31_People
Final Query:
SELECT Name AS Name FROM "9f31_People" WHERE  "9f31_People"."Is a singer?"  = TRUE
Created temp table Musicians
Using options '{'Sabrina Carpenter', 'Charli XCX', 'Elvis Presley'}'
Final Query:
SELECT Musicians.Name AS "Espresso Singer" FROM Musicians WHERE Musicians.Name =  'Sabrina Carpenter'

┌───────────────────┐
│ Espresso Singer   │
├───────────────────┤
│ Sabrina Carpenter │
└───────────────────┘

Internet-Connected RAG¶

So we know how to use a table subset as a context, by writing subqueries. But what if the knowledge we need to answer a question isn't present in the universe of our table?

For this, we have the RAGQA ingredient (retrieval-augmented generation question-answering). Currently it only supports Bing via Azure as a source, but the idea is that in the future, it will support more forms of unstructured retrieval.

Let's ask a question that requires a bit more world-knowledge to answer.

In [91]:

Copied!





smoothie = blend("""
SELECT {{LLMQA("Who's birthday is June 28, 1971?")}} AS "Answer"
""")
print(smoothie.df)
smoothie = blend("""
SELECT {{LLMQA("Who's birthday is June 28, 1971?")}} AS "Answer"
""")
print(smoothie.df)

Executing  `{{LLMQA("Who's birthday is June 28, 1971?")}}`...

Warning: can't backtrack over "\n"; this may confuse the model

Final Query:
SELECT  'Not specified in the context'  AS "Answer"

┌──────────────────────────────┐
│ Answer                       │
├──────────────────────────────┤
│ Not specified in the context │
└──────────────────────────────┘

Ok, that's fair.

Now let's try again, using constrained decoding via options and using the RAGQA ingredient to fetch relevant context via a Bing web search first.

In [92]:

Copied!





smoothie = blend("""
SELECT * FROM People WHERE Name = {{RAGQA("Who's birthday is June 28, 1971?", source='bing', options='People::Name')}}
""")
print(smoothie.df)
smoothie = blend("""
SELECT * FROM People WHERE Name = {{RAGQA("Who's birthday is June 28, 1971?", source='bing', options='People::Name')}}
""")
print(smoothie.df)

Unpacked alias `{{RAGQA("Who's birthday is June 28, 1971?", source='bing', options='People::Name')}}` to `
{{
    LLMQA(
        "Who's birthday is June 28, 1971?", 
        (
            SELECT {{
                BingWebSearch("Who's birthday is June 28, 1971?")
            }} AS "Search Results"
        ), options='People::Name'
    )
}}
`
Executing  `{{RAGQA("Who's birthday is June 28, 1971?", source='bing', options='People::Name')}}`...
When inferring `options` in infer_gen_kwargs, encountered a column node with no table specified!
Should probably mark `schema_qualify` arg as True
Executing  `{{ BingWebSearch ( "Who's birthday is June 28, 1971?" ) }}`...
Final Query:
SELECT  '## DOCUMENT 1

Elon Reeve Musk was born on June 28, 1971, in Pretoria, South Africa''s administrative capital. [7] [8] He is of British and Pennsylvania Dutch ancestry.[9] [10] His mother, Maye (née Haldeman), is a model and dietitian born in Saskatchewan, Canada, and raised in South Africa.[11] [12] [13] His father, Errol Musk, is a South African electromechanical engineer, pilot, sailor, consultant ...

## DOCUMENT 2

Weekday: June 28th, 1971 was a Monday. People born on June 28th, 1971 turned 53 this year (2024). Birthdays of famous people, actors, celebrities and stars on June 28th. With 365 days 1971 is a normal year and no leap year.'  AS "Search Results"
Using options '{'Thomas Jefferson', 'Charli XCX', 'James Madison', 'Sabrina Carpenter', 'Michelle Obama', 'John Quincy Adams', 'James Monroe', 'George Washington', 'Elvis Presley', 'Elon Musk', 'Alexander Hamilton'}'
Final Query:
SELECT * FROM People WHERE Name =  'Elon Musk'

┌───────────┬──────────────────────────────────────┐
│ Name      │ Known_For                            │
├───────────┼──────────────────────────────────────┤
│ Elon Musk │ Tesla, SpaceX, Twitter/X acquisition │
└───────────┴──────────────────────────────────────┘

Nice! Elon Musk was indeed born on June 28th, 1971. You can check out the BlendSQL logs above to validate this given the web context.

In [ ]: