Technical walkthrough

All the below logic can be found in the blend() function from blendsql/blendsql.py.

Example

We can take the following query as an example.

--- 'Show me dividends from tech companies that manufacture cell phones'
SELECT "Run Date", Account, Action, ROUND("Amount ($)", 2) AS 'Total Dividend Payout ($$)', Name
    FROM account_history
    LEFT JOIN constituents ON account_history.Symbol = constituents.Symbol
    WHERE constituents.Sector = 'Information Technology'
    AND {{
           LLM(
               'does this company manufacture cell phones?',
               'constituents::Name',
            )
    }} = 1
    AND lower(account_history.Action) like "%dividend%"

1) Generate a Session UUID

This uuid allows us to create new temporary tables containing the output of our BlendSQL functions rather than overwriting the original, underlying SQL tables.

import uuid 

session_uuid = str(uuid.uuid4())[:5]

2) Identify All Subqueries

Here, we define subqueries as any select statement from a single table. These may or may not have their own SELECT clause.

In our example, we only have a single query.

3) For Each Table, Generate Select Statements

Iterating through our subqueries, we now need to write each table reference as its own SELECT statement.

To do this, we iterate through each table and get the following queries.

Notice how in the account_history query, we change the specific columns to the *, so we get everything.

SELECT * FROM account_history
    WHERE lower(account_history.Action) like "%dividend%"

SELECT * FROM constituents
    WHERE constituents.Sector = 'Information Technology'
    AND {{
           LLM(
               'does this company manufacture cell phones?',
               'constituents::Name',
            )
    }} = 1

4) Abstract Away Selects

In our constituents subquery, we have an expensive LLM operation.

In order to make sure we pass minimal required data to our BlendSQL ingredients while still honoring the SQL logic, we abstract away external functions to True to calculate the theoretical upper bound of data that might get returned.

-- Abstracted query
SELECT * FROM constituents
    WHERE constituents.Sector = 'Information Technology'
    AND TRUE

We execute each of these queries and assign them to new temporary tables, f"{session_uuid}_{tablename}_{subquery_idx}".

4) Execute BlendSQL Ingredients on our New Tables

Now we can execute some external functions.

For example, if we have a session_id of '1234':

SELECT Symbol FROM "1234_constituents_0" 
    WHERE sector = 'Information Technology' 
    AND {{
            LLM(
                'does this company manufacture cell phones?', 
                'constituents::Name'
            )
        }} = 1

The table "1234_constituents_0" now only has those entries where sector = 'Information Technology'. This minimizes the data that the {{LLM()}} call actually needs to process.

Once we've executed our functions, we now have a table with a new column, 'does this company manufacture cell phones?'.

We do a left join with the original constituents table to create the new session table "1234_constituents".

Then, we can move to the next subquery. In this case, there is no BlendSQL ingredient, so we're done with our processing.

5) Execute our Final SQL Query

At the end of our processing, we have the underlying SQL tables and query in a state that we can execute it like any other SQLite script.

SELECT "Run Date", Account, Action, ROUND("Amount ($)", 2) AS 'Total Dividend Payout ($$)', Name
    FROM account_history
    LEFT JOIN '1c0b_constituents' ON account_history.Symbol = '1c0b_constituents'.Symbol
    WHERE '1c0b_constituents'.Sector = 'Information Technology'
    AND {{
           LLM(
               'does this company manufacture cell phones?',
               'constituents::Name',
            )
    }} = 1
    AND lower(account_history.Action) like "%dividend%"

Notice how account_history was not modified by a session_id, but constituents was. This is because the logic of filtering by LIKE %dividend% can just be done using raw SQL on the original table, we don't need any complicated BlendSQL processing.