The Impact of Schema Representation in the Text2Cypher Task

Makbule Gulcin Ozsoy

Machine Learning Engineer, Neo4j

Learn how different schema formats affect the Text2Cypher task, how to produce them, and why they matter

The Text2Cypher task focuses on converting natural language questions into Cypher queries. In late 2024, we shared our Neo4j Text2Cypher (2024) dataset, along with our analysis of the performance of both baseline and initial fine-tuned models.

During these experiments, we made an interesting observation: The way the schema is presented in prompts significantly influences performance. This blog post dives into the impact of different schema formats on the Text2Cypher task and why they matter.

In the next sections, we'll use one of the Neo4j demo databases ('companies') as the example database.

The companies graph example schema with companies, articles, people, industries, and locations

Schema Representation Matters

Complex database schemas can pose challenges when using LLMs for natural-language-to-query translation, including:

  • Increased token usage, since large schemas inflate prompt size
  • Schema elements that are irrelevant to the current query
  • The LLM being overwhelmed by schema size and complexity (similar to the needle-in-a-haystack problem)

To address this, schema linking connects natural language queries with the corresponding database elements (see Re-examining the Role of Schema Linking in Text-to-SQL).

Schema filtering has emerged as a common technique to improve accuracy and reduce noise in prompts. This approach selectively presents only the most relevant schema information, enhancing query-generation quality while also reducing token costs by not sending the whole schema with every question, an especially important factor in the era of LLM-powered applications.

In our previous experiments, we used the following prompts for the Text2Cypher task.

System and user prompt for Text2Cypher task

As you may notice, the main body of the prompt is relatively short — about 150 tokens when tokenized. However, once a schema is added, the prompt size increases significantly, reaching more than 2,700 tokens.
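
To give a concrete sense of how these counts can be measured, here is a minimal sketch that tokenizes a prompt with and without the schema, using the tokenizer of the model evaluated later in this post (the prompt strings are placeholders):

```python
from transformers import AutoTokenizer

# Tokenizer of the model evaluated later in this post (an unsloth mirror of Llama 3.1 8B Instruct)
tokenizer = AutoTokenizer.from_pretrained("unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit")

prompt_without_schema = "..."  # placeholder: the system + user prompt shown above
prompt_with_schema = "..."     # placeholder: the same prompt with the enhanced schema appended

for name, text in [("no schema", prompt_without_schema), ("with schema", prompt_with_schema)]:
    n_tokens = len(tokenizer(text)["input_ids"])
    print(f"{name}: {n_tokens} tokens")
```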

While recent LLM models can handle large contexts, this increase in size may still be a concern when resources are limited or when considering the cost of accessing these models. In such cases, schema filtering could offer a practical solution.

In this blog post, we explore straightforward schema filtering techniques and measure their impact on Text2Cypher performance. Stay tuned for our next post on findings and insights on creating more efficient and cost-effective Cypher generation models.

Schema Linking Techniques

In the Text2Cypher task, the schema information is already included in the Neo4j Text2Cypher (2024) dataset, alongside the natural language questions and ground-truth Cypher queries. While building the dataset, we combined multiple data sources. For the schema field, if the information was directly available, we used it as is; if no schema was provided, we extracted it from a running Neo4j database.

In this work, we begin with the initial schema-extraction implementation used for such cases and then apply our schema filtering techniques.

Enhanced Schema

This is the initial approach used for schema extraction. We used the Enhanced Schema approach provided by the Neo4j GraphRAG Python package:
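
A minimal sketch of the extraction step is shown below, assuming the `get_schema` helper exposed by the `neo4j_graphrag.schema` module (function and parameter names may vary between package versions) and placeholder connection details:

```python
import neo4j
from neo4j_graphrag.schema import get_schema  # schema helper in the Neo4j GraphRAG Python package

# Placeholder connection details for a running database (e.g., the 'companies' demo graph)
driver = neo4j.GraphDatabase.driver("neo4j://localhost:7687", auth=("neo4j", "password"))

# Enhanced schema: node/relationship properties with data types and example values
enhanced_schema = get_schema(driver, is_enhanced=True)

# Base schema (see the next section): the same call without the enhancements
base_schema = get_schema(driver, is_enhanced=False)

print(enhanced_schema)
```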

The output schema contains nodes, relationships, and their properties. The property information for nodes and relationships includes data types and example values. For instance, if the property is the name of an Actor node, the examples might include ['Tom Hanks', 'Julia Roberts', ...]. An example output is shown below (newlines added for readability):

Base Schema

This approach follows the same method as the Enhanced Schema but without the enhancements (i.e., is_enhanced=False). The output schema is similar to the one above, except that it won't include example values of properties, and the formatting differs. An example output is shown below (newlines added for readability):

Pruned By Exact-Match Schema

This approach dynamically prunes the schema based on the input natural language question. After the structured schema is extracted (using the Neo4j GraphRAG Python package), it is further refined with regular-expression searches on node and relationship labels. Specifically, for each node (or relationship), the label and its properties are searched for in the input question, and only matching elements are kept. Additionally, the metadata and indexing fields are removed from the schema if they exist.
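
A minimal sketch of the idea is shown below; it assumes the structured schema is a dictionary with node_props, rel_props, and relationships keys (as returned by the GraphRAG package's structured-schema helper), and the helper name prune_schema_by_exact_match is our own illustrative choice, not the code used in the experiments:

```python
import re

def prune_schema_by_exact_match(structured_schema: dict, question: str) -> dict:
    """Keep only nodes/relationships whose label or property names appear in the question."""

    def mentioned(name: str) -> bool:
        # Whole-word, case-insensitive search for the label or property name in the question
        return re.search(rf"\b{re.escape(name)}\b", question, flags=re.IGNORECASE) is not None

    def keep(label: str, props: list) -> bool:
        return mentioned(label) or any(mentioned(p["property"]) for p in props)

    # Metadata and indexing fields are intentionally not carried over
    return {
        "node_props": {
            label: props
            for label, props in structured_schema.get("node_props", {}).items()
            if keep(label, props)
        },
        "rel_props": {
            rel_type: props
            for rel_type, props in structured_schema.get("rel_props", {}).items()
            if keep(rel_type, props)
        },
        "relationships": structured_schema.get("relationships", []),
    }
```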

An example output is shown below (newlines added for readability):

NER Masked and Pruned By Exact-Match Schema

This approach dynamically prunes the base schema based on the input natural language question. Named entities are first identified using the spaCy framework and masked with their respective entity types (e.g., "Neo4j" becomes "ORGANIZATION"). Then the same method as in the Pruned By Exact-Match Schema is applied. The goal is to reduce confusion between data values and schema fields. An example output is shown below (newlines added for readability):

Note that the masked question will look like this:
“… mention the organization ORGANIZATION”
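
A minimal sketch of the masking step is shown below. The spaCy model name and the example question are illustrative, and the entity labels produced (e.g., ORG) are spaCy's defaults, which may be mapped to other names such as ORGANIZATION:

```python
import spacy

# Medium English pipeline with an NER component; model choice is illustrative
nlp = spacy.load("en_core_web_md")

def mask_entities(question: str) -> str:
    """Replace named entities with their entity labels, e.g. 'Neo4j' -> 'ORG'."""
    doc = nlp(question)
    masked = question
    # Replace from the end of the string so earlier character offsets stay valid
    for ent in reversed(doc.ents):
        masked = masked[: ent.start_char] + ent.label_ + masked[ent.end_char :]
    return masked

masked_question = mask_entities("Which articles mention the organization Neo4j?")
# The exact-match pruning from the previous section is then applied to the masked question
```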

Pruned By Similarity Schema

This approach dynamically prunes the base schema based on the input natural language question. It's similar to the Pruned By Exact-Match Schema, but instead of relying on exact token matches, it uses embedding similarity. For the similarity computations, we used the spaCy framework with a medium-size model and a limited embedding size. After calculating the similarities, only the most similar properties and labels are retained.
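
As a rough illustration, the exact-match check from the earlier sketch can be replaced with an embedding-similarity test along these lines (the en_core_web_md model and the 0.4 threshold are illustrative choices, not necessarily the ones used in the experiments):

```python
import spacy

nlp = spacy.load("en_core_web_md")  # medium spaCy model that ships with word vectors

def is_similar(schema_term: str, question: str, threshold: float = 0.4) -> bool:
    """True if the embedding similarity between a schema label/property and the question exceeds the threshold."""
    return nlp(schema_term).similarity(nlp(question)) >= threshold

# Swapping this predicate in for the exact-match check of the earlier sketch keeps only
# the labels and properties that are sufficiently similar to the question.
```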

An example output is shown below (newlines added for readability):

There are many additional heuristics that can be used to further prune schemas. For example, one promising approach is to collect only relevant nodes and relationships after extracting named entities. However, this would require more advanced Cypher queries, potentially increasing complexity. Additionally, various model-based or LLM-questioning approaches for schema filtering could be considered for future work.

Experimental Setup and Results

For our experiments, we used the Neo4j Text2Cypher (2024) dataset. To focus on the schema filtering process, we only included rows where database access was available. This resulted in 22,093 instances for the training set and 2,471 for the test set.

For evaluation, we used the unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit foundational model to explore and understand the impact of different schema filtering approaches.

Impact on Token Distribution

Since we’re using a foundational model for these experiments, we present the token distribution over the test set. The distributions of the training and test sets are quite similar. The table below presents the token statistics for each schema filtering approach.

Token count distributions

Additionally, the following figure highlights token distributions, with the p95 value marked.

Token count distributions, with the p95 value marked

The results show that the Enhanced Schema produces significantly longer prompts than the other approaches. Even switching to the Base Schema reduces the p95 token length to roughly one-third of the original. Further pruning via exact match (with or without Named Entity Recognition (NER) masking) reduces the p95 token length to about one-sixth of the original. Using embedding similarity instead of exact matching increases the schema length somewhat but still brings the p95 token length down to about one-fourth of the original.

Reducing token lengths directly translates to lower costs (both for training and inference). To illustrate, let’s consider a scenario with 20,000 instances (entries) where the average token length per instance matches the median values in the table above. We assume that output token lengths remain the same and only input token lengths are factored into the cost calculations. Below is a cost comparison across different models, based on pricing from vendor websites as of Feb. 27, 2025.
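
As a back-of-the-envelope illustration of how such a cost estimate is computed (the median token counts and the per-million-token price below are placeholders, not the measured values or any real vendor price):

```python
instances = 20_000

# Placeholder median input-token lengths per schema approach (not the measured values)
median_input_tokens = {"Enhanced Schema": 2_700, "Pruned By Exact-Match Schema": 450}

price_per_million_input_tokens = 2.50  # placeholder USD price, not a real vendor quote

for approach, tokens in median_input_tokens.items():
    cost = instances * tokens / 1_000_000 * price_per_million_input_tokens
    print(f"{approach}: ${cost:,.2f}")
```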

As expected, cost scales linearly with token usage. In practice, factors like output token count, caching, and batch processing can further impact costs.

Still, the table highlights a key takeaway: Shorter prompts lead to significant cost reductions.

While pruning reduces token length and cost, it may introduce some computational overhead. Unlike Enhanced Schema and Base Schema, which can be extracted once and cached, pruning must be performed dynamically for each question by filtering based on words in the input. However, this additional step is unlikely to cause significant computational latency, especially for simpler methods like the Pruned by Exact-Match approach, which relies on regular expression matching.

Impact on Performance

As described in previous blog posts about Neo4j’s efforts on Text2Cypher (i.e., Results from baseline and fine-tuned models), we used several evaluation metrics to assess performance. Continuing with the same approach from those works, we conducted two types of evaluation procedures.

Evaluation techniques illustration
  • Translation-based (lexical) evaluation: This approach compares the generated Cypher queries with the ground-truth queries textually. In this post, we report the Google BLEU score as our metric for translation-based evaluation (a minimal sketch follows this list).
  • Execution-based evaluation: In this procedure, both the generated and ground-truth Cypher queries are executed on the target database, and the outputs are compared using the same metrics applied in the translation-based evaluation. In this post, we report the Exact-Match score as our metric for this kind of evaluation.
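
A minimal sketch of the translation-based metric, using the google_bleu metric from the Hugging Face `evaluate` package (the example queries are illustrative, and the exact evaluation configuration in our experiments may differ):

```python
import evaluate

# The google_bleu metric from the Hugging Face `evaluate` package
google_bleu = evaluate.load("google_bleu")

# Illustrative generated query and ground-truth reference
predictions = ["MATCH (p:Person) RETURN p.name"]
references = [["MATCH (p:Person) RETURN p.name AS name"]]

result = google_bleu.compute(predictions=predictions, references=references)
print(result)  # {'google_bleu': ...}
```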

The performance of our target model, unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit, with different schema filtering approaches is shown below.

The evaluation results show that longer prompts actually lead to lower performance. The best result is obtained with the Pruned By Exact-Match Schema. Using NER masking or similarity-based matching didn't improve performance here, but these approaches might still be useful for other datasets. Therefore, further exploration of schema filtering methods is recommended for specific datasets or practical applications.

Summary and Next Steps

Our exploration of the impact of the schema field on Text2Cypher performance has highlighted the value of simple, efficient schema filtering techniques. By applying these methods, we reduced prompt token length to as little as one-sixth of the original, achieving both cost reduction and performance improvement.

While schema filtering has helped with cost and performance, further experimentation is required to make conclusive decisions. As noted in the academic literature, schema filtering may not always enhance performance when paired with more advanced LLMs [1,4,5]. In the future, we plan to explore the effectiveness of these schema filtering approaches on a range of LLM models, from smaller to larger ones. We also aim to explore more advanced techniques, like model-based or multi-step approaches, as suggested in the literature.

Stay tuned as we continue to uncover insights for optimizing Cypher generation models!

References

  1. Caferoğlu, Hasan Alp, and Özgür Ulusoy. “E-sql: Direct schema linking via question enrichment in text-to-sql.” arXiv preprint arXiv:2409.16751 (2024). (https://arxiv.org/pdf/2409.16751)
  2. Lei, Wenqiang, Weixin Wang, Zhixin Ma, Tian Gan, Wei Lu, Min-Yen Kan, and Tat-Seng Chua. “Re-examining the Role of Schema Linking in Text-to-SQL.” In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 6943–6954. 2020. (https://aclanthology.org/2020.emnlp-main.564.pdf)
  3. Cao, Zhenbiao, Yuanlei Zheng, Zhihao Fan, Xiaojin Zhang, Wei Chen, and Xiang Bai. “Rsl-sql: Robust schema linking in text-to-sql generation.” arXiv preprint arXiv:2411.00073 (2024). (https://arxiv.org/pdf/2411.00073v2)
  4. Maamari, Karime, Fadhil Abubaker, Daniel Jaroslawicz, and Amine Mhedhbi. “The death of schema linking? text-to-sql in the age of well-reasoned language models.” arXiv preprint arXiv:2408.07702 (2024). (https://arxiv.org/pdf/2408.07702)
  5. Chung, Yeounoh, Gaurav T. Kakkar, Yu Gan, Brenton Milne, and Fatma Ozcan. “Is Long Context All You Need? Leveraging LLM’s Extended Context for NL2SQL.” arXiv preprint arXiv:2501.12372 (2025). (https://arxiv.org/pdf/2501.12372)


The Impact of Schema Representation in the Text2Cypher Task was originally published in the Neo4j Developer Blog on Medium.