How We Taught an AI to Navigate a Database That Forgot Its Own Map
- Ritvik Mandyam
- Jan 22
- 6 min read
Let’s talk about the specific kind of dread you feel when you open a new database schema for the first time.
You know the feeling. You’ve been promised a "structured data warehouse." You have visions of clean Third Normal Form, beautiful foreign key constraints, and column names that actually describe what’s inside them. You fire up your connection, query the information schema, and…
Silence.
Zero foreign keys. No primary key constraints. Table names like xref_p10_not_deleted and litscan_statistics. It’s not a database; it’s a data dump. A terabyte-sized pile of digital spaghetti.
This was exactly where we found ourselves when we tried to use the RNACentral database with Factal. RNACentral is an incredible resource - a massive, unified atlas of RNA sequences. But like many high-performance scientific databases, it drops database-level constraints to maximize write speed.
For a human analyst, this is annoying. For an LLM trying to generate SQL? It’s a lobotomy. Without foreign keys, the LLM is flying blind. It doesn't know that urs_taxid links to rna or that job_id links to litscan_jobs. It hallucinates joins, invents relationships, and generally queries like a drunk junior dev at 3 AM. We watched in horror as our LLM tried to use the quality assurance data to figure out taxonomic classifications.
We realized we couldn't just "prompt harder." We needed to build a machine that could reconstruct the lost map.
Here is how we built a generic, neurosymbolic engine (i.e., LLM creativity plus the reliability of deterministic code) to reverse-engineer database schemas - and how it turned rnacen.pipeline_tracking_genome_mapping into something a human can actually read.
The "Cold Start" Problem (or: Why RAG Wasn't Enough)
Our first thought was, "Let's just use RAG! We'll feed the documentation to the LLM."
That works for one database. But we’re building Factal to work with your database. We can't scour the internet for the documentation of your company's legacy ERP system from 2004. We needed a system that could look at raw schema metadata - just table names and columns - and figure out how the world fits together.
We started with a simple idea: The Schema Graph.
We loaded the tables as nodes. We looked for Foreign Keys to create edges. And since RNACentral (like many data warehouses) doesn't enforce FKs, we got… 166 nodes and 0 edges.
A graph with no edges isn't a graph. It's just a list.
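In code, that first pass was barely more than this - a minimal sketch using networkx and the standard information_schema views (the connection string and schema name are illustrative, not our exact setup):

```python
import networkx as nx
import psycopg2

# Connect to the target database (connection details are illustrative).
conn = psycopg2.connect("dbname=rnacen")

graph = nx.DiGraph()

with conn.cursor() as cur:
    # Every table becomes a node.
    cur.execute("""
        SELECT table_name FROM information_schema.tables
        WHERE table_schema = 'rnacen'
    """)
    for (table,) in cur.fetchall():
        graph.add_node(table)

    # Every *declared* foreign key becomes an edge.
    cur.execute("""
        SELECT tc.table_name, ccu.table_name AS referenced_table
        FROM information_schema.table_constraints tc
        JOIN information_schema.constraint_column_usage ccu
          ON tc.constraint_name = ccu.constraint_name
        WHERE tc.constraint_type = 'FOREIGN KEY'
          AND tc.table_schema = 'rnacen'
    """)
    for source, target in cur.fetchall():
        graph.add_edge(source, target)

print(graph.number_of_nodes(), "nodes,", graph.number_of_edges(), "edges")
# On RNACentral: 166 nodes, 0 edges.
```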
Phase 1: The Heuristic Waterfall (aka "Guessing Responsibly")
If the database won't tell us where the keys are, we have to guess. But we wanted to guess scientifically.
We built a heuristic inference engine. It doesn’t use AI; it uses old-school, deterministic rules. It scans every column in the database and runs a "Waterfall" of checks:
The "Explicit ID" Match: Does table_a have a column named id? If so, that’s the Primary Key. (You’d be surprised how often this works).
The "Name Match": Does table_a have a column named table_a_id? That’s also a PK.
The "Natural Key" Match: In scientific data, the ID is often the thing itself. In the accession table, the PK is the accession column.
Once we identified the Primary Keys, we could hunt for the Foreign Keys. We looked for columns ending in _id and matched them to tables. We looked for direct name matches.
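A stripped-down version of that waterfall looks roughly like this (function names are simplified for illustration; the real engine runs many more checks):

```python
def infer_primary_key(table: str, columns: list[str]) -> str | None:
    """Run the PK waterfall: explicit id, name match, then natural key."""
    # 1. The "Explicit ID" match: a column literally named `id`.
    if "id" in columns:
        return "id"
    # 2. The "Name Match": table_a has a column named table_a_id.
    if f"{table}_id" in columns:
        return f"{table}_id"
    # 3. The "Natural Key" match: the table is named after its key column.
    if table in columns:
        return table
    return None


def infer_foreign_keys(table: str, columns: list[str],
                       primary_keys: dict[str, str]) -> list[tuple[str, str, str]]:
    """Guess FK edges by matching *_id columns and direct name matches to known tables."""
    edges = []
    for col in columns:
        # `job_id` points at a table named `job`, if one exists.
        if col.endswith("_id") and col[:-3] in primary_keys:
            edges.append((table, col[:-3], col))
        # Direct name match: a column named exactly like another table's natural key.
        elif col in primary_keys and primary_keys[col] == col and col != table:
            edges.append((table, col, col))
    return edges
```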
Suddenly, our graph had edges! We saw connections!
We ran a community detection algorithm (Louvain Modularity) to see the structure of the database. We expected beautiful clusters of related tables.
Instead, we got "The Archipelago."
We saw dozens of tiny islands. The xref table - which is huge - was floating in the void. The litscan tables were off having a party by themselves.
Why? Because bioinformatics is weird. They don't link xref to rna using rna_id. They use upi. They don't use taxonomy_id; they use taxid. As if THAT wasn't bad enough, taxid points to the rnc_taxonomy table, but urs_taxid links to rna, too. Our heuristic waterfall was too rigid. It missed the domain-specific glue.
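For reference, the structure check that exposed the archipelago is only a few lines (a sketch that assumes the schema graph built earlier; louvain_communities needs a reasonably recent networkx):

```python
import networkx as nx
from networkx.algorithms import community


def describe_topology(graph: nx.Graph) -> None:
    """Print how fragmented the schema graph is (graph comes from the earlier sketch)."""
    undirected = graph.to_undirected()

    # Connected components = the "islands" of the archipelago.
    islands = list(nx.connected_components(undirected))
    print(f"{len(islands)} disconnected islands")

    # Louvain community detection shows which tables cluster
    # together inside each island.
    clusters = community.louvain_communities(undirected, seed=42)
    for cluster in sorted(clusters, key=len, reverse=True)[:5]:
        print(len(cluster), "tables:", sorted(cluster)[:4], "...")
```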
Phase 2: Neurosymbolic Repair (The "Human in the Loop")

We realized we couldn't hard-code every possible naming convention in the universe. We also didn't want to just ask an LLM to "fix the graph," because it would hallucinate connections that don't exist. And we really didn't want to link the tables to each other by hand - we'd used enough database design tools to know that's error-prone and generally not much fun.
We needed a middle ground: Neurosymbolic Rules.
We built a DSL (Domain Specific Language) that allowed a user to define high-level connection logic using rules like:
"Any column named taxid links to the taxonomy table."
"Columns named upi or urs link to the rna table."
"If a column implies a 'database source', link it to rnc_database."
We then built a Hybrid Executor:
The Compiler: An LLM translates natural language into structured JSON rules.
The Machine: Deterministic code applies the "Exact" rules across thousands of columns in milliseconds.
The Agent: For "Semantic" rules (like "implies a database source"), the code asks the LLM to verify specific column candidates.
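Because the compiled rules are just data, the deterministic half of the executor stays simple. A rough sketch - the JSON shape and function names here are our illustration, not the exact Factal internals:

```python
import re

# What the "Compiler" LLM emits: structured rules, not prose.
# "exact" and "pattern" rules are applied by plain code; "semantic" rules
# get escalated to the "Agent" LLM for per-column verification.
RULES = [
    {"match": "exact",   "column": "taxid",     "target_table": "rnc_taxonomy"},
    {"match": "exact",   "column": "upi",       "target_table": "rna"},
    {"match": "pattern", "column": r"(.+)_id$", "target_table": r"\1"},
    {"match": "semantic", "description": "implies a database source",
     "target_table": "rnc_database"},
]


def apply_exact_rules(tables: dict[str, list[str]], rules: list[dict]) -> list[tuple]:
    """Deterministically apply exact/pattern rules across every column."""
    edges = []
    for table, columns in tables.items():
        for col in columns:
            for rule in rules:
                if rule["match"] == "exact" and col == rule["column"]:
                    edges.append((table, rule["target_table"], col))
                elif rule["match"] == "pattern":
                    m = re.match(rule["column"], col)
                    if m and m.expand(rule["target_table"]) in tables:
                        edges.append((table, m.expand(rule["target_table"]), col))
                # "semantic" rules are deferred to the LLM agent.
    return edges
```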
This changed everything. We could feed the system a generic set of rules ("Link _id columns to their tables") AND a specific set of bio-rules ("Link upi to rna").
Suddenly, the "Archipelago" merged. The xref table connected to rna. litscan connected to jobs. We went from 50 disjoint islands to one giant, interconnected continent. All of a sudden, our graph of the database looked just like the one on the RNACentral website.
Phase 3: Topology-Aware Labeling (The "Graph Doctor")

Now that we had a connected graph, we could finally figure out what these tables actually did.
We used a technique we call "Hub-and-Spoke Semantic Propagation."
If you ask an LLM "What is the xref_p10 table?", it hallucinates. It doesn't know. But if you look at the graph, you see that xref_p10 is a satellite node connected to the massive xref hub, which is connected to rna.
We topologically sorted the graph. We identified the "Hubs" (tables with the most connections, like rna and publications). We asked the LLM to label those first, giving it all our painfully-gleaned metadata.
LLM: "rna contains RNA Sequences."
LLM: "publications contains Academic Papers."
Then, we moved to the satellites. When we asked the LLM to label xref_p10, we passed it the context of its neighbors: "This table is a partition of xref. It connects to rna (RNA Sequences) and publications (Academic Papers)."
The Result: The LLM correctly inferred: "This is a partition of the Cross-Reference table, linking specific RNA sequences to external academic publications."
It worked. We converted rnacen.ensembl_compara - a truly terrifying name - into "Ensembl RNA Homology and Comparative Data." We didn't write that label. The machine did.
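The labeling pass itself is little more than an ordered walk over the graph. Here is a minimal sketch - describe_table stands in for whatever LLM call you use, and we simplify our ordering to plain degree rather than the full topology-aware sort:

```python
import networkx as nx


def label_tables(graph: nx.Graph, metadata: dict[str, str], describe_table) -> dict[str, str]:
    """Label hubs first, then pass each satellite its neighbors' labels as context.

    `describe_table(table, table_metadata, neighbor_labels)` is a stand-in
    for the actual LLM call; `metadata` holds the per-table column info
    gathered earlier.
    """
    labels: dict[str, str] = {}

    # Hubs first: the most-connected tables (rna, publications, ...) are
    # labeled using only their own metadata.
    for table in sorted(graph.nodes, key=graph.degree, reverse=True):
        neighbor_labels = {
            n: labels[n] for n in graph.neighbors(table) if n in labels
        }
        # Satellites automatically inherit context: by the time we reach
        # xref_p10, its hub neighbors already carry labels.
        labels[table] = describe_table(table, metadata.get(table, ""), neighbor_labels)

    return labels
```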
The "Hairball" vs. The "Archipelago"
One of the coolest things we discovered during implementation is that not all databases are broken in the same way.
The Archipelago (RNACentral): Missing keys mean everything is disconnected.
Fix: Use Weakly Connected Components to find the islands, then use rules to build bridges.
The Hairball (Salesforce/Enterprise SQL): Too many keys mean everything is connected to everything (usually via a User or Log table).
Fix: Use Louvain Community Detection to slice the hairball into functional clusters ("Sales", "Inventory", "Support").
Our engine now automatically detects the topology. If it sees islands, it builds bridges. If it sees a hairball, it gets the scissors.
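The dispatch is a simple check on the inferred graph. A sketch, with illustrative thresholds:

```python
import networkx as nx
from networkx.algorithms import community


def choose_strategy(graph: nx.Graph) -> str:
    """Decide whether a schema graph needs bridges or scissors."""
    undirected = graph.to_undirected()
    n_components = nx.number_connected_components(undirected)
    density = nx.density(undirected)

    # Archipelago: many disconnected components -> infer the missing edges.
    if n_components > 1:
        return "build_bridges"

    # Hairball: one dense blob -> carve it into functional clusters.
    if density > 0.1:  # illustrative threshold
        clusters = community.louvain_communities(undirected, seed=42)
        return f"cut_into_{len(clusters)}_clusters"

    return "already_reasonable"
```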
Why This Matters (Beyond Bioinformatics)
We used RNACentral as a stress test because it is objectively a nightmare for automated tools. But the problems we solved are universal.
Every company has that one legacy PostgreSQL database that runs the billing system. The one where the original architect left three years ago, foreign keys were "temporarily disabled" in 2019, and the table names are acronyms only the CEO understands. The data lake that everyone gets their reports from but no one remembers the schema for.
Our approach - Heuristic Inference + Neurosymbolic Rules + Topological Context - means you don't need to manually document that mess. You point Factal at it.
It finds the hidden skeleton of your data. It labels the organs. And most importantly, it gives your AI the context it needs to write SQL that actually works.
We’re a pretty cynical bunch at Factal. We’ve seen a lot of "AI Magic" demos that break the moment you use real data. But watching our system crawl through a 166-table scientific database, infer the missing relationships, and confidently label a table "RNA Secondary Structure Pipeline Tracking"?
That felt like magic. But the best kind of magic: the kind you can debug. Not to mention the kind of magic that lets us ask Factal to help us "pinpoint genomic 'hotspots' that may indicate dense clusters of transcriptional activity" - a phrase that we, as people WITHOUT bioinformatics backgrounds, barely understand ourselves - and have it give us the correct answer.



