The Art of Text Splitting: From Basic to Advanced Techniques for Language Models

Isaac Kargar
7 min read · Dec 23, 2024


Text splitting is a crucial yet often overlooked aspect of working with Language Models (LLMs). In this comprehensive guide, we’ll explore the five levels of text splitting, with a special focus on the advanced techniques that are shaping the future of AI applications. Here is a great video by Greg Kamradt that explains everything in detail:

Introduction

When working with LLMs, how you prepare and segment your data can significantly impact performance. As emphasized in the field, “your goal is not to chunk for chunking’s sake, but to get data in a format where it can be retrieved for value later.”

The Evolution of Text Splitting

Level 1: Character Splitting

The most basic form of text splitting, character splitting breaks text into chunks based on a fixed character count.

# Example implementation using LangChain
from langchain_text_splitters import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    chunk_size=35,
    chunk_overlap=0,
    separator="",
)
text = "This is the text I would like to chunk up. It is an example text for this exercise."
chunks = text_splitter.split_text(text)
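
With a 35-character window and no separator, the splitter simply slices the string every 35 characters, so words get cut mid-stream. The output looks roughly like this:

print(chunks)
# ['This is the text I would like to ch',
#  'unk up. It is an example text for t',
#  'his exercise.']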

Pros:

  • Simple implementation
  • Predictable chunk sizes
  • Minimal processing overhead

Cons:

  • Often breaks words mid-sentence
  • Ignores natural text boundaries
  • Not suitable for semantic understanding
  • Rarely used in production environments

Here is a tool you can use to visualize how this method and the next two work:

Level 2: Recursive Character Splitting

A more intelligent approach that respects text structure using a hierarchy of separators.

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " ", ""],
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
)
# Example text with natural boundaries
text = """
Chapter 1: Introduction
AI is transforming rapidly.
We need to adapt quickly.
The future is exciting.
Innovation continues daily.
"""
chunks = splitter.split_text(text)

Key Features:

  • Respects document structure
  • Uses multiple separator levels
  • Maintains paragraph integrity
  • Configurable chunk overlap
  • Industry standard for many applications

Use Cases:

  • General documentation
  • Articles and blog posts
  • Email content
  • Standard text documents

Level 3: Document-Specific Splitting

Document-specific splitting represents a significant evolution in text chunking strategy, recognizing that different document types require tailored approaches for optimal processing. Unlike the one-size-fits-all methods of basic character splitting, this level adapts to the unique structures and features of various document formats.

Markdown Documents:

When processing markdown files, the splitting strategy respects the hierarchical structure of headers and formatting. Think of it like chapters in a book — each major section (H1) might contain subsections (H2, H3), with natural breaks that preserve the document’s logical flow. The system recognizes markdown-specific elements such as:

  • Header hierarchies
  • Code blocks
  • Lists and nested lists
  • Block quotes
  • Table structures

from langchain_text_splitters import MarkdownHeaderTextSplitter

markdown_document = "# Foo\n\n ## Bar\n\nHi this is Jim\n\nHi this is Joe\n\n ### Boo \n\n Hi this is Lance \n\n ## Baz\n\n Hi this is Molly"

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on)
md_header_splits = markdown_splitter.split_text(markdown_document)
md_header_splits
[Document(page_content='Hi this is Jim  \nHi this is Joe', metadata={'Header 1': 'Foo', 'Header 2': 'Bar'}),
Document(page_content='Hi this is Lance', metadata={'Header 1': 'Foo', 'Header 2': 'Bar', 'Header 3': 'Boo'}),
Document(page_content='Hi this is Molly', metadata={'Header 1': 'Foo', 'Header 2': 'Baz'})]

See more examples here.
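
The header-level splits can still be long, so a common follow-up step is to pass them through a character-level splitter while keeping the header metadata attached. A minimal sketch, using the same LangChain splitters as above (the chunk sizes here are arbitrary):

from langchain_text_splitters import RecursiveCharacterTextSplitter

# Further split each header-level document into size-bounded chunks;
# the header metadata stays attached to every resulting chunk.
char_splitter = RecursiveCharacterTextSplitter(chunk_size=250, chunk_overlap=30)
final_splits = char_splitter.split_documents(md_header_splits)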

Python Code:

from langchain_text_splitters import PythonCodeTextSplitter

python_text = """
class Person:
  def __init__(self, name, age):
    self.name = name
    self.age = age

p1 = Person("John", 36)

for i in range(10):
 print (i)
"""

code_splitter = PythonCodeTextSplitter(
    chunk_size=100,
    chunk_overlap=0,
)

code_splitter.create_documents([python_text])
[Document(metadata={}, page_content='class Person:\n  def __init__(self, name, age):\n    self.name = name\n    self.age = age'),
Document(metadata={}, page_content='p1 = Person("John", 36)\n\nfor i in range(10):\n print (i)')]
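
The same idea generalizes beyond Python: RecursiveCharacterTextSplitter.from_language applies language-aware separators (class and function definitions before the usual paragraph and line fallbacks). A brief sketch with arbitrary chunk sizes:

from langchain_text_splitters import Language, RecursiveCharacterTextSplitter

# Language-aware splitting: Python-specific separators are tried first,
# then the generic paragraph/line/word fallbacks.
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=100,
    chunk_overlap=0,
)
python_docs = python_splitter.create_documents([python_text])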

Read more here:

PDF Processing:

#!pip3 install "unstructured[all-docs]"
from unstructured.partition.pdf import partition_pdf

# Get elements
raw_pdf_elements = partition_pdf(
    filename="document.pdf",
    # Use the pdf format to find embedded image blocks
    extract_images_in_pdf=True,
    # Use a layout model (YOLOX) to get bounding boxes (for tables) and find titles
    # Titles are any sub-section of the document
    infer_table_structure=True,
    # Post-processing to aggregate text once we have the titles
    chunking_strategy="by_title",
    # Chunking params to aggregate text blocks:
    # attempt to start a new chunk after 3800 chars,
    # combine chunks smaller than 2000 chars,
    # hard max of 4000 chars per chunk
    max_characters=4000,
    new_after_n_chars=3800,
    combine_text_under_n_chars=2000,
    image_output_dir_path="static/pdfImages/",
)
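
Once partition_pdf returns its elements, a typical next step is to separate tables from ordinary text so each can be handled as described below. This sketch follows the common unstructured post-processing pattern; the variable names are illustrative:

# Separate table elements from composite text chunks
table_elements = []
text_elements = []
for element in raw_pdf_elements:
    type_name = str(type(element))
    if "unstructured.documents.elements.Table" in type_name:
        # Tables carry an HTML rendering that LLMs can parse more reliably
        table_elements.append(element.metadata.text_as_html)
    elif "unstructured.documents.elements.CompositeElement" in type_name:
        text_elements.append(str(element))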

Special Considerations:

Tables:

  • Extract as HTML/markdown tables
  • Preserve structure for LLM comprehension
  • Maintain column relationships

Images:

  • Extract and process separately
  • Generate text descriptions
  • Link to original context

Code:

  • Preserve function boundaries
  • Maintain class hierarchies
  • Keep import statements together

Level 4: Semantic Splitting — Beyond Structure to Meaning

Semantic splitting represents a paradigm shift in how we approach document chunking for language models. Unlike traditional methods that rely on physical text structure, this approach delves into the actual meaning and relationships within the content. It employs two sophisticated methods:

  • Hierarchical clustering with positional rewards: this method generates an embedding for each sentence and clusters sentences by semantic similarity, while adding a “positional reward” that preserves the logical flow of the document. It is particularly effective for short sentences that follow longer ones, ensuring they stay with their surrounding context.
  • Sequential comparison with buffer analysis: this method creates overlapping groups of sentences and analyzes the embedding distances between consecutive groups. It identifies natural break points by flagging distances above a threshold, such as the 95th percentile, which mark significant semantic shifts. Think of it as finding the natural “joints” where topics change, much like how a skilled butcher knows exactly where to cut along the natural seams of meat. Plotting these distances gives a clear picture of where the content naturally divides. What makes this method particularly powerful is its ability to adapt to content complexity rather than relying on rigid rules. While it requires more computational resources than simpler approaches, it produces more meaningful chunks that better preserve context and relationships, which becomes especially valuable for complex documents where topic transitions aren’t clearly marked by structural elements like headers or paragraphs. As language models continue to evolve and computing resources become more accessible, semantic splitting increasingly represents the future of intelligent document processing. A minimal sketch of this idea follows the list below.
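
Here is a minimal sketch of the sequential-comparison approach. The embed() function is a stand-in for any sentence-embedding model, and the 95th-percentile threshold is the one described above:

import re
import numpy as np

def embed(texts):
    # Stand-in for a real sentence-embedding model that maps
    # a list of strings to a list of vectors.
    raise NotImplementedError("plug in your embedding model here")

def semantic_split(text, buffer_size=1, percentile=95):
    # 1. Split into sentences and build overlapping "buffered" groups
    sentences = re.split(r"(?<=[.?!])\s+", text.strip())
    groups = [
        " ".join(sentences[max(0, i - buffer_size): i + buffer_size + 1])
        for i in range(len(sentences))
    ]

    # 2. Embed each group and measure the cosine distance between neighbours
    vectors = np.asarray(embed(groups))
    vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    distances = 1 - np.sum(vectors[:-1] * vectors[1:], axis=1)

    # 3. Break wherever the distance exceeds the chosen percentile
    threshold = np.percentile(distances, percentile)
    chunks, current = [], [sentences[0]]
    for i, dist in enumerate(distances):
        if dist > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i + 1])
    chunks.append(" ".join(current))
    return chunks

If you would rather not roll your own, the langchain_experimental package provides a SemanticChunker that implements a similar percentile-based breakpoint strategy.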

For implementation and more details, please check the following notebook:


Level 5: Agentic Splitting — The Future of Intelligent Text Processing

Agentic splitting represents the cutting edge of text processing technology, mimicking how a human would naturally organize and understand content. At its core, this method begins by breaking down text into “propositions” — standalone statements that can exist independently while maintaining their meaning. For example, a complex sentence like “John went to the store, where he bought milk” becomes two clear propositions: “John went to the store” and “John bought milk.”

The system then employs an AI agent that makes intelligent decisions about how these propositions should be grouped together. What makes this approach revolutionary is its chunk management system, which maintains rich metadata for each group of propositions. Every chunk receives a unique identifier, a dynamic title that evolves as content is added, and a detailed summary that captures the essence of its contents. The agent continuously evaluates new propositions against existing chunks, deciding whether to create new chunks or add to existing ones based on semantic relevance and context. This process isn’t just about grouping similar content; it’s about understanding the relationships between different pieces of information and maintaining those connections through metadata.

While more computationally intensive than traditional methods, agentic splitting produces remarkably coherent and contextually aware chunks. The system can adapt to complex documents, understanding subtle topic shifts and maintaining contextual relationships that simpler methods might miss. Think of it as having an intelligent assistant that not only organizes your documents but understands the intricate web of relationships between different pieces of information. As language models become more sophisticated and computing costs decrease, this approach represents the future of how we’ll process and understand complex documents.
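
To make the chunk-management idea concrete, here is a minimal sketch. The llm_decide and llm_summarize helpers are hypothetical stand-ins for language-model calls; the data structure mirrors the metadata described above (unique id, evolving title, summary):

import uuid
from dataclasses import dataclass, field

@dataclass
class Chunk:
    chunk_id: str
    title: str
    summary: str
    propositions: list = field(default_factory=list)

def llm_decide(proposition, chunks):
    # Hypothetical LLM call: return the id of the chunk this proposition
    # belongs to, or None if a new chunk should be created.
    raise NotImplementedError

def llm_summarize(propositions):
    # Hypothetical LLM call: return an updated (title, summary) pair.
    raise NotImplementedError

class AgenticChunker:
    def __init__(self):
        self.chunks = {}

    def add_proposition(self, proposition):
        target_id = llm_decide(proposition, list(self.chunks.values()))
        if target_id is None:
            # No existing chunk fits: create a new one with fresh metadata
            chunk = Chunk(chunk_id=str(uuid.uuid4()), title="", summary="")
            self.chunks[chunk.chunk_id] = chunk
        else:
            chunk = self.chunks[target_id]
        chunk.propositions.append(proposition)
        # Re-summarize so the title and summary evolve as content is added
        chunk.title, chunk.summary = llm_summarize(chunk.propositions)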

For implementation and more details, please check the following notebook:

Conclusion

Text splitting is evolving from simple character-based approaches to sophisticated AI-driven methods. While advanced techniques like semantic and agentic splitting are currently more resource-intensive, they represent the future of document processing for language models. Understanding and implementing these methods appropriately can significantly improve the performance of your LLM applications.

The key is to remember that the goal isn’t just to split text, but to prepare it in a way that maximizes the language model’s ability to understand and work with the content effectively.
