The Hidden Risk in AI Code Generation: Why “Almost Correct” Isn’t Enough

Marco Gorelli explores whether an LLM can translate Polars to SQL.

Structured Query Language (SQL) is the lingua franca of the data world. Whether a bank is calculating liquidity or a retailer is auditing inventory, the road eventually leads to a SQL database. However, a new generation of data scientists is moving toward “dataframe” tools like Polars and pandas, which offer a more expressive way to manipulate data.

This creates a “translation gap” in the enterprise: engineers write in Python, but the data lives in SQL.

Naturally, the industry has turned to Generative AI to bridge this gap. If an LLM can translate French to English, surely it can translate Polars to SQL? We tested this hypothesis using the industry leader, GPT-5.1, against top-tier open source challengers.

The Experiment: The “Null” Trap

To test the models, we presented a common data engineering scenario: counting unique values in a dataset that contains “null” (missing) data.

We fed a simple Polars snippet to three models:

1. GPT-5.1 (Proprietary, OpenAI)

2. DeepSeek V3.1 (Open Source, MIT License)

3. Qwen3 Coder (Open Source, Alibaba)

The prompt:

> Given a table `df` with values
> 
> ```
> {'price': [1, 4, 2, 3], 'vendor': ['a', 'a', None, 'b']}
> ```
> 
> can you translate this Polars code to SQL
> 
> ```py
> print(df.select(pl.col('price') - pl.col('price').mean()))
> print(df.select(pl.col('vendor').n_unique()))
> ```
>
> ?

The task was to translate into SQL a snippet that counts unique vendors, including the row where the vendor name is missing.

The Results: A Collective Hallucination

On the surface, the AI models performed well: all three instantly generated valid SQL syntax. A junior engineer copying this code would see no error messages.

However, the logic was flawed across the board.

In Polars, `n_unique` counts missing values by default, treating null as its own distinct value. In SQL, the standard `COUNT(DISTINCT ...)` ignores them entirely.
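The divergence is easy to reproduce with nothing but the Python standard library. The sketch below uses an in-memory SQLite database to stand in for a SQL engine, and a plain Python `set` (which keeps `None` as a member) to mimic Polars' `n_unique` default. This is an illustration of the two counting conventions, not the benchmark harness itself:

```python
import sqlite3

# The same data as in the prompt, as (price, vendor) rows.
rows = [(1, "a"), (4, "a"), (2, None), (3, "b")]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE df (price INTEGER, vendor TEXT)")
conn.executemany("INSERT INTO df VALUES (?, ?)", rows)

# Standard SQL: COUNT(DISTINCT ...) silently skips NULLs.
sql_count = conn.execute("SELECT COUNT(DISTINCT vendor) FROM df").fetchone()[0]

# Polars-style: null is one of the distinct values.
polars_style_count = len({vendor for _, vendor in rows})

print(sql_count, polars_style_count)  # 2 3
```

Same table, same question, two different answers, and neither raises an error.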

The AI models failed to account for this nuance. If this code were deployed in a logistics environment, a warehouse with 1,000 unlabelled items (nulls) would be reported as empty. In financial reporting, this “silent error” could lead to material misstatements of assets.

Open Source Performance

The open source models (DeepSeek and Qwen) failed in the exact same way as the expensive proprietary model. When we adjusted the prompt to explicitly warn the models about the null value behavior, all three corrected themselves immediately.

This suggests that for enterprise infrastructure, open source models you run on your own private servers are now a viable, cost-effective alternative to proprietary APIs. They are just as capable, and just as fallible, as their closed-source counterparts.

The Fix: Deterministic Guardrails

The failure of these models highlights that Generative AI is probabilistic, not deterministic. It guesses the most likely next word. It does not “know” math or logic.

To rely on AI for important data translations, enterprises need a “human in the loop” or, better yet, a “compiler in the loop.”

  • Prompt Engineering: As our test showed, adding context (“Remember, Polars counts nulls”) fixes the error. However, relying on users to know every edge case is not a scalable strategy.
  • Deterministic Layers: The robust solution is to use software libraries designed for translation, such as the open source project Narwhals. Unlike an LLM, which guesses the translation, tools like Narwhals use strict logic rules to transpile code.
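For reference, a null-safe translation does exist, though none of the models produced it unprompted. One possible form (a sketch, not the only correct one) counts the distinct non-null vendors and adds one back whenever at least one null vendor is present, verified here against SQLite:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE df (price INTEGER, vendor TEXT)")
conn.executemany(
    "INSERT INTO df VALUES (?, ?)",
    [(1, "a"), (4, "a"), (2, None), (3, "b")],
)

# COUNT(DISTINCT vendor) ignores NULLs, so add 1 back
# if any NULL vendor exists (0 otherwise).
query = """
SELECT COUNT(DISTINCT vendor)
       + MAX(CASE WHEN vendor IS NULL THEN 1 ELSE 0 END)
FROM df
"""
print(conn.execute(query).fetchone()[0])  # 3
```

This is exactly the kind of mechanical rewrite a deterministic transpiler can apply every time, rather than hoping a prompt reminds the model of it.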

Conclusion

The dream of an AI that can autonomously manage enterprise data pipelines is enticing, but we’re not there yet.

Our benchmark shows that while AI is an incredible accelerator for drafting code, it lacks the precision required for execution without oversight. The future of data engineering is wrapping AI in deterministic guardrails to ensure that when we ask for a count, we get the right answer.
