The Hidden Risk in AI Code Generation: Why “Almost Correct” Isn’t Enough

Marco Gorelli explores whether an LLM can translate Polars to SQL.

Structured Query Language (SQL) is the lingua franca of the data world. Whether a bank is calculating liquidity or a retailer is auditing inventory, the road eventually leads to a SQL database. However, a new generation of data scientists is moving toward “dataframe” tools like Polars and pandas, which offer a more expressive way to manipulate data.

This creates a “translation gap” in the enterprise: engineers write in Python, but the data lives in SQL.

Naturally, the industry has turned to Generative AI to bridge this gap. If an LLM can translate French to English, surely it can translate Polars to SQL? We tested this hypothesis using the industry leader, GPT-5.1, against top-tier open source challengers.

The Experiment: The “Null” Trap

To test the models, we presented a common data engineering scenario: counting unique values in a dataset that contains “null” (missing) data.

We fed a simple Polars snippet to three models:

1. GPT-5.1 (Proprietary, OpenAI)

2. DeepSeek V3.1 (Open Source, MIT License)

3. Qwen3 Coder (Open Source, Alibaba)

The prompt:

> Given a table `df` with values
> 
> ```
> {'price': [1, 4, 2, 3], 'vendor': ['a', 'a', None, 'b']}
> ```
> 
> can you translate this Polars code to SQL
> 
> ```py
> print(df.select(pl.col('price') - pl.col('price').mean()))
> print(df.select(pl.col('vendor').n_unique()))
> ```
>
> ?

The task was to translate into SQL a snippet that counts unique vendors, including the row where the vendor name is missing.

The Results: A Collective Hallucination

On the surface, the AI models performed well: all three instantly generated valid SQL syntax. A junior engineer copying this code would see no error messages.

However, the logic was flawed across the board.

In Polars, `n_unique` counts missing values by default, treating null as its own distinct value. In SQL, the standard `COUNT(DISTINCT ...)` ignores them entirely.
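The divergence is easy to reproduce with nothing but the Python standard library. The sketch below uses an in-memory SQLite database to stand in for a SQL engine, and a plain Python `set` (which keeps `None` as a member) to mimic Polars' `n_unique` default. This is an illustration of the two counting conventions, not the benchmark harness itself:

```python
import sqlite3

# The same data as in the prompt, as (price, vendor) rows.
rows = [(1, "a"), (4, "a"), (2, None), (3, "b")]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE df (price INTEGER, vendor TEXT)")
conn.executemany("INSERT INTO df VALUES (?, ?)", rows)

# Standard SQL: COUNT(DISTINCT ...) silently skips NULLs.
sql_count = conn.execute("SELECT COUNT(DISTINCT vendor) FROM df").fetchone()[0]

# Polars-style: null is one of the distinct values.
polars_style_count = len({vendor for _, vendor in rows})

print(sql_count, polars_style_count)  # 2 3
```

Same table, same question, two different answers, and neither raises an error.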

The AI models failed to account for this nuance. If this code were deployed in a logistics environment, a warehouse with 1,000 unlabelled items (nulls) would be reported as empty. In financial reporting, this “silent error” could lead to material misstatements of assets.

Open Source Performance

The open source models (DeepSeek and Qwen) failed in the exact same way as the expensive proprietary model. When we adjusted the prompt to explicitly warn the models about the null value behavior, all three corrected themselves immediately.

This suggests that for enterprise infrastructure, open source models you run on your own private servers are now a viable, cost-effective alternative to proprietary APIs. They are just as capable, and just as fallible, as their closed-source counterparts.

The Fix: Deterministic Guardrails

The failure of these models highlights that Generative AI is probabilistic, not deterministic. It guesses the most likely next word. It does not “know” math or logic.

To rely on AI for important data translations, enterprises need a “human in the loop” or, better yet, a “compiler in the loop.”

  • Prompt Engineering: As our test showed, adding context (“Remember, Polars counts nulls”) fixes the error. However, relying on users to know every edge case is not a scalable strategy.
  • Deterministic Layers: The robust solution is to use software libraries designed for translation, such as the open source project Narwhals. Unlike an LLM, which guesses the translation, tools like Narwhals use strict logic rules to transpile code.
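For reference, a null-safe translation does exist, though none of the models produced it unprompted. One possible form (a sketch, not the only correct one) counts the distinct non-null vendors and adds one back whenever at least one null vendor is present, verified here against SQLite:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE df (price INTEGER, vendor TEXT)")
conn.executemany(
    "INSERT INTO df VALUES (?, ?)",
    [(1, "a"), (4, "a"), (2, None), (3, "b")],
)

# COUNT(DISTINCT vendor) ignores NULLs, so add 1 back
# if any NULL vendor exists (0 otherwise).
query = """
SELECT COUNT(DISTINCT vendor)
       + MAX(CASE WHEN vendor IS NULL THEN 1 ELSE 0 END)
FROM df
"""
print(conn.execute(query).fetchone()[0])  # 3
```

This is exactly the kind of mechanical rewrite a deterministic transpiler can apply every time, rather than hoping a prompt reminds the model of it.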

Conclusion

The dream of an AI that can autonomously manage enterprise data pipelines is enticing, but we’re not there yet.

Our benchmark shows that while AI is an incredible accelerator for drafting code, it lacks the precision required for execution without oversight. The future of data engineering is wrapping AI in deterministic guardrails to ensure that when we ask for a count, we get the right answer.
