Language Models, SQL, and Types, Oh My!
The purpose of this post is to:
1. Document some of my thoughts on combining language models with structured data
2. Run some benchmarks with my new RTX 5080 🙂
3. Extend an invitation to folks interested in database systems and language models to collaborate on BlendSQL - reach out at parkervg5@gmail.com
- Introduction
- Tables + LLMs: A Working Example
- Query Optimizations
- Type Constraints
- Benchmarking
- Conclusion
Introduction
Combining language models and structured data is a widely studied topic, with many recent papers exploring it. Perhaps the most well-known example is the Binding Language Models in Symbolic Languages paper. To the best of my knowledge, the EHRXQA paper was the first to propose putting calls to an ML model directly into a SQL query.
Love it or hate it, SQL is used everywhere. Because it is used everywhere, language models see a lot of it during pre-training. Even setting aside this complacent, hands-thrown-in-the-air ‘well, there’s nothing we can do to stop SQL…’ mindset, I genuinely believe SQL is a better basis for an intermediate representation than other custom APIs.
In compiler design, an intermediate representation could be an abstract syntax tree (AST) that later gets transformed into machine code. In NLP, especially for ‘reasoning’ tasks, languages like Lean have become increasingly popular as an intermediate representation: rather than answering a complex math problem directly, can we synthesize an executable Lean program to assist us? (LINC, Kimina-Prover, DeepSeek-Prover)
These ideas can be extrapolated to non-math domains, like multi-hop reasoning over tables and text. Instead of directly eliciting a response from a language model with a context of 400K tables and 5M documents, the idea is to write a program that interleaves SQL and language model calls into an optimized intermediate representation.
What makes a ‘better intermediate representation’ here? To me, it is the following:
- Token Efficiency
- How can we maximize the information density of our tokens? Can we achieve the same execution output with fewer tokens?
- Downstream Performance
- For a given task, does our intermediate representation execute to a result that is accurate?
- Ease of Acquisition
- Is our representation easy for a user - whether language model or human - to learn?1
What do the above look like for different approaches?
A Quick Note on BlendSQL
BlendSQL is a query language I’ve been working on for combining SQL with language models across structured and unstructured data. I’ve been shifting recently to focus on local models with constrained decoding. v0.0.40 was just released - more info there!
Tables + LLMs: A Working Example
We can use the following question from the TAG dataset as an example:
How many test takers are there at the school/s in a county with population over 2 million?
The two tables in the database that we care about are taken from the BIRD-SQL dataset, and are decently large.
SELECT COUNT(*) FROM schools
-- 17686
SELECT COUNT(*) FROM satscores
-- 2269
-- We only display a subset of columns below
SELECT * FROM schools LIMIT 2
| CDSCode | StatusType | County | District | School | Street |
|---|---|---|---|---|---|
| 33672156107957 | Active | Riverside | Riverside Unified | William Howard Taft Elementary | 959 Mission Grove Parkway North |
| 19647336120489 | Active | Los Angeles | Los Angeles Unified | Para Los Niños Charter | 1617 East Seventh Street |
-- We only display a subset of columns below
SELECT * FROM satscores LIMIT 2
| cds | dname | NumTstTakr | AvgScrRead | AvgScrMath |
|---|---|---|---|---|
| 30664643036001 | Capistrano Unified | 406 | 537 | 530 |
| 54105460000000 | Tulare County Office of Education | 28 | 541 | 498 |
The gist of the above tables: they have many columns, none of which contain information about the population of counties. Hence the premise: we need to use a language model as a sort of search engine to fill in the blanks of the database for us.
In LOTUS, a pandas-like API, the question ‘How many test takers are there at the school/s in a county with population over 2 million?’ is represented as the program below:
import re

import pandas as pd
import lotus  # LOTUS registers the .sem_map accessor on DataFrames (LM configuration omitted here)

scores_df = pd.read_csv("../pandas_dfs/california_schools/satscores.csv")
schools_df = pd.read_csv("../pandas_dfs/california_schools/schools.csv")

# Ask the language model for the population of every unique county
unique_counties = pd.DataFrame(schools_df["County"].unique(), columns=["County"])
unique_counties = unique_counties.sem_map(
    "What is the population of {County} in California? Answer with only the number without commas. Respond with your best guess."
)

# Parse the model's free-text answers and keep counties over 2 million
counties_over_2m = set()
for _, row in unique_counties.iterrows():
    try:
        if int(re.findall(r"\d+", row._map)[-1]) > 2000000:
            counties_over_2m.add(row.County)
    except Exception:  # the model's answer may not contain a parseable number
        pass

# Only now do we filter and join the structured data
schools_df = schools_df[schools_df["County"].isin(counties_over_2m)]
merged_df = pd.merge(scores_df, schools_df, left_on="cds", right_on="CDSCode")
prediction = int(merged_df["NumTstTakr"].sum())
In BlendSQL, we can write this as:
SELECT SUM(ss.NumTstTakr) AS TotalTestTakers
FROM satscores ss
JOIN schools s ON s.CDSCode = ss.cds
WHERE {{
    LLMMap(
        'Approximately, what is the population of this California county?',
        s.County
    )
}} > 2000000
Why do the differences between the two programs matter?
Query Optimizations
In the pandas-style API, the user has to take special care to write an optimized version of the structured operations. In fact, the code as written is not optimized: since the `pd.merge` happens after the call to the language model in `sem_map`, we invoke the language model on rows of data that could have been filtered out ahead of time by the merge/`JOIN` operation! As a result, the pandas-style intermediate representation passes 58 `County` values to the language model, while the auto-optimizing BlendSQL passes the minimum-necessary 49.
Here’s the SQL query to find all the extra values we don’t need to pass to the LM:
SELECT COUNT(DISTINCT s."County")
FROM schools s
WHERE s."County" NOT IN (
SELECT DISTINCT sc."County"
FROM schools sc
JOIN satscores sat ON sc."CDSCode" = sat.cds
);
-- 9
By building off of well-studied query optimizations in SQL, the responsibility for query optimization falls to the execution engine, not the user. We follow a naive but effective heuristic for cost estimation during query planning: set all language model calls to a cost of ∞, and all native SQL operations to a cost of 0.
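To make that concrete, here’s a minimal sketch of what the ‘LLM calls cost ∞, native SQL costs 0’ heuristic buys us when ordering operations. This is my own toy illustration, not BlendSQL’s actual planner code.

```python
import math

# Toy cost model: any predicate backed by a language model is infinitely
# expensive, while native SQL predicates and joins are free.
def predicate_cost(predicate: dict) -> float:
    return math.inf if predicate["is_llm"] else 0.0

predicates = [
    {"expr": "{{LLMMap('population?', s.County)}} > 2000000", "is_llm": True},
    {"expr": "s.CDSCode = ss.cds", "is_llm": False},
]

# Sorting by cost pushes every native SQL operation ahead of any LLM call,
# so the model only ever sees rows that survived the cheap structured filters.
execution_order = sorted(predicates, key=predicate_cost)
```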
This isn’t quite as absurd as it seems. Since BlendSQL compiles to SQL at the end of the day, we get to rely on the amazing optimizations that folks at PostgreSQL, SQLite, and DuckDB have worked hard on - and integrate with many different DBMSs. When we execute the above BlendSQL script, we get the following logging messages (enabled via `verbose=True`):
from blendsql import BlendSQL
from blendsql.models import LlamaCpp

bsql = BlendSQL(
    # Below could also be a PostgreSQL connection,
    # or Dict[str, pd.DataFrame]
    'path_to_sqlite.db',
    model=LlamaCpp(
        "Meta-Llama-3.1-8B-Instruct.Q6_K.gguf",
        "QuantFactory/Meta-Llama-3.1-8B-Instruct-GGUF",
    ),
    verbose=True
)
result = bsql.execute("""
SELECT SUM(ss.NumTstTakr) AS TotalTestTakers
FROM satscores ss
JOIN schools s ON s.CDSCode = ss.cds
WHERE {{
    LLMMap(
        'Approximately, what is the population of this California county?',
        s.County
    )
}} > 2000000
""")
This gives us a peek into the query optimizations going on behind the scenes. First, we execute the much less expensive `JOIN` clause, transforming the `SUM(ss.NumTstTakr) AS TotalTestTakers` node in the `SELECT` clause to `"schools"."County"`. Why do we do this? In an effort to leverage the highly optimized SQLite database management system as much as possible, there’s no need to eagerly write any other columns to a temporary table - the language model only needs access to the `County` column of the `schools` table. The temporary table creation allows us to filter down the subset of data we ultimately pass to the language model functions, by creating a UUID prefix we reference throughout the execution session.
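As a rough sketch of what this temp-table step might look like under the hood (the table name, UUID prefix format, and SQL below are my own illustration, not BlendSQL’s internals):

```python
import sqlite3
import uuid

con = sqlite3.connect("path_to_sqlite.db")

# Materialize only the column the LLM ingredient needs, *after* the cheap
# JOIN has already filtered the rows, under a session-scoped UUID prefix.
session_prefix = f"blendsql_{uuid.uuid4().hex[:8]}"
con.execute(f"""
    CREATE TEMP TABLE "{session_prefix}_schools" AS
    SELECT s."County"
    FROM schools s
    JOIN satscores ss ON s."CDSCode" = ss."cds"
""")
counties = [r[0] for r in con.execute(f'SELECT "County" FROM "{session_prefix}_schools"')]
```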
Additionally, when we execute the `LLMMap` function, it will handle the `DISTINCT` logic for us (or, the `.unique()` bit in the pandas code). This ensures we don’t, for example, ask the language model for the population of Los Angeles more than once.
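In spirit, that deduplication looks something like the sketch below, with a hypothetical `ask_model` standing in for the real constrained LM call and approximate populations hard-coded for illustration:

```python
# Toy illustration of LLMMap's deduplication: call the model once per
# distinct value, then broadcast the answers back to every row.
def ask_model(county: str) -> int:  # hypothetical stand-in for the constrained LM call
    return {"Los Angeles": 9_800_000, "Alameda": 1_600_000}.get(county, 0)

county_values = ["Los Angeles", "Alameda", "Los Angeles", "Los Angeles"]
unique_values = dict.fromkeys(county_values)        # one entry per distinct county
answers = {v: ask_model(v) for v in unique_values}  # a single call per distinct value
mapped = [answers[v] for v in county_values]        # broadcast back to all rows
```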
Type Constraints
See the bit in the logs above that says `Using regex '(\d+)'`? This is telling us something about the type of constrained decoding we’re performing.
What is constrained decoding? In short, it is a set of techniques that limits a language model’s generation at the decoding level, such that it adheres to a structure we’ve defined.2 Importantly, the novelty of BlendSQL isn’t from the ability to constrain language models according to some regular expression or context-free grammar - we can credit projects like guidance and outlines for that. Instead, the novelty of BlendSQL is its ability to infer these constraints according to the surrounding SQL syntax.
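As a toy illustration of the idea (real libraries like guidance and outlines compile the pattern into an automaton and mask token logits directly; this sketch does neither):

```python
import regex  # third-party `regex` package, which supports partial matching

def allowed_continuations(prefix: str, candidates: list[str], pattern: str) -> list[str]:
    """Keep only the candidate tokens that could still lead to a full match of `pattern`."""
    return [
        tok for tok in candidates
        if regex.fullmatch(pattern, prefix + tok, partial=True) is not None
    ]

# With the pattern (\d+), digit tokens survive and everything else is masked out.
print(allowed_continuations("123", [",", "4", " thousand", "5"], r"(\d+)"))  # ['4', '5']
```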
SQL, as a grammar, has a lot of rules. Just take these SQLite syntax diagrams, for example. These rules include things like: an `IN` statement should be followed by a tuple of items; `<` and `>` should compare numerics or dates, but `=` could take any datatype; etc. We can use these rules to inform language-model functions, which we call ‘ingredients’ and denote with double curly brackets (`{{` and `}}`).
As a result, the output of the scalar `LLMMap` function is restricted to a string matching the regular expression `(\d+)`. We know that this is the datatype we expect, since after traversing the AST of the query, we see a ‘greater-than’ predicate with the integer `2000000` on the right-hand side. In order for this to be a grammatically valid SQLite predicate, the left-hand side of the `>` expression should match this type.3
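For a flavor of how this kind of inference can work, here’s a small sketch using sqlglot. It illustrates the general idea only, not BlendSQL’s actual implementation, and the `llm_placeholder` column name is made up:

```python
import sqlglot
from sqlglot import exp

query = "SELECT * FROM schools WHERE llm_placeholder > 2000000"
tree = sqlglot.parse_one(query)

for comparison in tree.find_all(exp.GT):
    rhs = comparison.right
    # A 'greater-than' against a numeric literal implies the other side
    # should be numeric too, so we can constrain generation accordingly.
    if isinstance(rhs, exp.Literal) and not rhs.is_string:
        print(r"constrain LLM output with regex: (\d+)")
```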
What does the actual prompt we pass to the language model look like? Something like this, where `{constrained_gen()}` is a placeholder for a batched call to a constrained generation function using guidance.
Complete the docstring for the provided Python function.

def f(s: str) -> int:
    """Approximately, what is the population of this California county?

    Args:
        s (str): Value from the "schools"."County" column in a SQL database.

    Returns:
        int: Answer to the above question for each value `s`.

    Examples:
        ```python
        # f() returns the output to the question 'Approximately, what is the population of this California county?'
        f("Los Angeles") == {constrained_gen()}
        f("Alameda") == {constrained_gen()}
        f("Calaveras") == {constrained_gen()}
        ```
    """
    ...
Benchmarking
How well do these two intermediate representations actually perform on a real dataset? To try and answer this question, I use 60 questions from the TAG-Bench dataset. As mentioned before, this takes the databases from the BIRD-SQL dataset and annotates some new question-answer pairs that might require reasoning beyond what standard text-to-SQL can provide.
A Note on Accuracy
An important note on accuracy below: The TAG dataset has many questionable annotations, which I’ve described in this GitHub issue. For example:
- Of the top 10 players taller than 180 ordered by average heading accuracy descending, what are the top 3 most unique sounding names?
  - Where ‘Per Mertesacker’ is a more unique name than ‘Miroslav Klose’, according to some unknown criteria
- Among the magnet schools with SAT test takers of over 500, which school name sounds most futuristic?
  - In which the language model must hold the opinion that ‘Polytechnic High’ is ‘more futuristic’ than ‘Millikan High’
- Of the schools with the top 3 SAT excellence rate, order their counties by academic reputation from strongest to weakest.
  - I think Los Angeles might take issue with the fact that Santa Clara has a ‘stronger academic reputation’ than them in the ground truth. (not taking sides - go UCSB!)
I don’t think we should put too much emphasis on accuracy here without some extra clarification by the authors on how these subjective questions were annotated. I primarily view this dataset as a way to benchmark latency for intermediate representations over structured data with language models.
Experimental Setup
I use the LOTUS queries written by the authors of the TAG-Bench paper here, and the BlendSQL queries for the dataset here. The BlendSQL queries were executed against a SQLite database - for these sorts of analytical queries, though, I’d imagine DuckDB would be faster. An experiment for the future.
As shown below, using a single consumer-grade GPU, BlendSQL can match the performance of a 70b model hosted on 8 A100s, in 1/3 of the time. I also measure the average tokens per program, using the `meta-llama/Llama-3.1-8B-Instruct` tokenizer.4
|  | Model | Hardware | Accuracy (↑) | Execution Time (s) (↓) | Avg. Tokens per Program (↓) |
|---|---|---|---|---|---|
| LOTUS5 | Llama-3.1-70b-instruct | 8 A100, 80GB | 0.55 | 3.0475 | 127 |
| BlendSQL | Llama-3.1-8B-Instruct.Q6_K | 1 RTX 5080, 16GB | 0.55 | 0.935 | 76 |
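For reference, the per-program token counts above can be reproduced with something like the sketch below. It assumes access to the gated `meta-llama/Llama-3.1-8B-Instruct` tokenizer on Hugging Face, and the `programs` list is a placeholder, not the real set of benchmark programs:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

def num_tokens(program: str) -> int:
    """Token count for a single LOTUS or BlendSQL program string."""
    return len(tokenizer.encode(program, add_special_tokens=False))

programs = ["SELECT SUM(ss.NumTstTakr) AS TotalTestTakers FROM satscores ss"]  # placeholder
avg_tokens = sum(num_tokens(p) for p in programs) / len(programs)
```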
Now, for a more granular latency test on my RTX 5080. In the test below, I use `Meta-Llama-3.1-8B-Instruct.Q4_K_M.gguf`, hosted via Ollama for LOTUS and llama.cpp for BlendSQL.
Note: one LOTUS program (`pipeline_38()`) took 58 seconds to run. To me, it looks like an incorrect program (the question mentions Lewis Hamilton, but the code calls the LM for all drivers), and to try and be as fair as possible, I’ve removed it from the latency comparisons below.
Using a local model, BlendSQL programs execute in half the time it takes their LOTUS, pandas-style counterparts, on average.
Conclusion
There’s a lot more to BlendSQL I haven’t discussed here - and, a lot more work to be done in improving BlendSQL as an intermediate representation. I’m more than happy to talk over some thoughts and share ideas regarding combining language models with structured data - please reach out if interested!
1. I think about the language strangeness budget a lot here. ↩
2. I highly recommend this .txt blog for a more in-depth explanation. ↩
3. SQLite uses type affinity, meaning that in the clause `WHERE an_int_column > '3000'`, the string `'3000'` will get coerced into an integer, and all will be well. But if the language model outputs something like `'Answer: 3,000'` (or even just `'3,000'`), then type affinity can’t save us. ↩
4. To try and make this an even comparison, I normalized the LOTUS `df = pd.read_csv("some/long/path.csv")` bits using this code. ↩
5. These numbers are taken from Table 1 of their paper, ignoring the ‘Aggregation’ questions. ↩