Let's try out Pydough!

Marco Gorelli Explores Pydough and SQL Generation

Picture of Marco Gorelli

Marco Gorelli

Like many data practitioners, I have a love-hate relationship with SQL. I like how it’s mostly standardised, portable, popular, and lets you express complex logic. But I also dislike how it can sometimes feel clunky and hard to read.

When a colleague suggested I look into Pydough as a way of generating SQL, I was instantly interested. I’m aware of ORMs (object-relational mapping), and how my own project Narwhals has a way of generating SQL from dataframe code. My experience with asking LLMs to generate SQL is that they’re not too bad at it either. So, what else might this Pydough project offer that’s not already covered?

The answer is that Pydough is neither an ORM nor is it a dataframe API. Instead, it provides its own way of encoding knowledge which allows you to generate SQL whilst keeping your queries simple and human-readable. I think Pydough offers an interesting take on SQL generation, and once the tool becomes more widespread and adopted, I can see myself using it more!

Motivation: SQL example

Let’s look at an example. Say we have tables:

  • students
  • grades

…and I want to find the average grade for each student (across all the exams they took).

In SQL, this would require a join. I would need to remember which key from “students” corresponds with which one in “grades”, and the keys may not have human-readable names. The resulting SQL may look a bit like this:

				
					```sql
select
    students.name, avg(grades.grade)
from students s
join grades g
    on s.student_key = g.student_key
group by students.name
```

				
			

This is very explicit, which can be a good thing. But it also feels a bit verbose for being a translation of something as simple as “find the average grade for each student”.

Does Pydough offer a simpler way?

The Pydough solution

In this case, the Pydough solution is as simple as

				
					```python
students.CALCULATE(name, x=AVG(grades.grade))
```

				
			

Semantic layer

A core concept to Pydough is to decouple business logic from infrastructure debt via the metadata graph. Rather than manually repeating how tables map onto each other in each SQL query, the idea is that you encode this information once in a metadata graph and then express your queries in a more human-readable manner.

In this case, our metadata graph captures:

  • That we have two tables: “students” and “grades”.
  • How two tables map onto each other: each student appears only once in “student”, and each student-subject combination only appears once in “grades”.

See demo_graph.json to see exactly what it looks like.

And voilà, that’s all we need! We can now express our logic in a very terse and human-readable way. To get the average grade per student, all we need is:

				
					```python
students.CALCULATE(name, x=AVG(grades.grade))
```

				
			

…and Pydough will take care of doing all the appropriate joins based on the information encoded in the metadata graph!

This is a more pure-logic solution which is free from some of the noise of the SQL one. As our queries involve more and more complicated cross-table comparisons, I can see how this will really start making the actual business logic simpler and easier to follow than having to make sense of endless joins and human-unreadable primary keys!

"I could have used an ORM or an LLM"

I can already anticipate the objections. It’s true that SQL generation isn’t a new concept, and that well-tested solutions exist. However:

  • ORMs still require knowledge of SQL, and still require writing out many joins by hand.
  • LLMs are infamous for hallucinations.

A Pydough solution, on the other hand, can keep the user focused on business logic without having to fear LLMs dreaming up non-existent syntax.

What about LLMs?

LLMs are pretty good at generating SQL these days. But can they generate Pydough code?

At the moment, the answer appears to be “no” – GPT 5.2 generates queries like

				
					```
students.select(
    key,
    name,
    avg_grade = grades.avg(grade)
)
```

				
			

which aren’t actually valid. However, Pydough is a new project, and its usage is not yet widespread. I think it’s reasonable to assume that, once it becomes more popular, LLM generation may become very good indeed, especially given how conceptually simple the Pydough queries look. For the time being, think of it as a safe target for expressing business logic in natural-looking language which then gets translated – hallucination-free! – to SQL.

Conclusion

We’ve looked at how Pydough can be used to generate SQL in a way which is both human-readable and easy to reason about. Although the project is new, it looks promising, and my personal experience has been that the team has been very responsive when I’ve asked questions on their Slack channel. I’ll be keeping an eye on it, and suggest you do the same!

Share:

Related Articles

Share Your Feedback

Share your feedback on Collab and Nexus. We read every response.

Collab (Desktop App)

Nexus (Intelligence Hub)

Closing