Our Approach to Table Chunking
Ragie applies specialized chunking strategies to some of the elements we extract from documents. When we index a document we extract its elements and categorize them by type. This allows us to improve retrieval by applying additional post-processing and by varying the chunking approach to suit the type of content. Today I want to focus on our approach to chunking tabular data.
Ragie extracts tabular data from rich document formats like Word and PDF in addition to typical tabular file formats like CSVs and spreadsheets. When we extract tables we create a structured representation that gets used when chunking.
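Ragie's internal representation isn't published, but as a minimal sketch, a structured table representation only needs the header names plus the rows of cell text, for example:

```python
from dataclasses import dataclass


@dataclass
class Table:
    """A minimal structured table: header names plus rows of cell text.

    This is an illustrative stand-in, not Ragie's actual data model.
    """
    headers: list[str]
    rows: list[list[str]]


# A table extracted from, say, a PDF or spreadsheet might look like:
quarterly = Table(
    headers=["Region", "Q1", "Q2"],
    rows=[
        ["North", "120", "135"],
        ["South", "98", "110"],
    ],
)
```

Keeping the headers separate from the rows is what later lets a chunker re-attach them to every chunk it emits.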
Naively chunking tables for semantic retrieval presents a number of problems:
- A chunk may end partway through a table, so the subsequent chunk contains table data without the table headers, and that contextual information is lost
- A chunk may end in the middle of a row, splitting a record across multiple chunks
- If a data format like XML, JSON, or YAML is used to represent the table and the data exceeds the chunk size, the resulting chunks will very likely be invalid in that format
- Many data formats used to represent a table repeat the keys for every record, making it more likely that the table will be split across multiple chunks and, more crucially, repeating key names so frequently that hybrid search results suffer
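The last two problems are easy to reproduce. As an illustration (using toy data, not Ragie's pipeline), naively slicing a JSON-encoded table into fixed-size chunks yields fragments that no longer parse, while the key names repeat once per record:

```python
import json

# A small table serialized as JSON: one object per record.
records = [
    {"region": r, "q1": q1, "q2": q2}
    for r, q1, q2 in [("North", 120, 135), ("South", 98, 110), ("East", 101, 99)]
]
payload = json.dumps(records)

# Every key name repeats once per record, inflating term frequency
# for those tokens in hybrid (keyword + vector) search.
region_key_count = payload.count('"region"')

# Naive fixed-size chunking ignores the format's structure entirely.
chunk_size = 60
chunks = [payload[i:i + chunk_size] for i in range(0, len(payload), chunk_size)]


def is_valid_json(s: str) -> bool:
    try:
        json.loads(s)
        return True
    except json.JSONDecodeError:
        return False


# Each slice starts or ends mid-object, so the chunks fail to parse.
invalid = [c for c in chunks if not is_valid_json(c)]
```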
To address these issues we developed a specialized table chunker. The Ragie table chunker starts with a structured representation of the data and produces one or more chunks, rendering the data as markdown-formatted tables. The general approach follows these steps:
- If the full table in markdown format fits within the chunk size, it is returned as a single chunk
- If not, the table is processed row by row, creating a new table and chunk for as many rows as fit within the chunk size
- For tables with many columns, if a single row cannot fit within the chunk size, the chunk size is relaxed up to the maximum size for an embedding
- If a single row exceeds the maximum size for an embedding it is then split
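The steps above can be sketched in Python. This is a simplified illustration rather than Ragie's implementation: sizes are measured in characters for simplicity, and `chunk_size` / `max_embed_size` are assumed parameters:

```python
def render_markdown(headers: list[str], rows: list[list[str]]) -> str:
    """Render headers plus rows as a markdown table."""
    lines = ["| " + " | ".join(headers) + " |",
             "| " + " | ".join("---" for _ in headers) + " |"]
    lines += ["| " + " | ".join(row) + " |" for row in rows]
    return "\n".join(lines)


def chunk_table(headers: list[str], rows: list[list[str]],
                chunk_size: int, max_embed_size: int) -> list[str]:
    full = render_markdown(headers, rows)
    if len(full) <= chunk_size:                      # step 1: whole table fits
        return [full]

    chunks: list[str] = []
    batch: list[list[str]] = []
    for row in rows:
        if len(render_markdown(headers, batch + [row])) <= chunk_size:
            batch.append(row)                        # step 2: pack rows greedily
            continue
        if batch:                                    # flush the current chunk
            chunks.append(render_markdown(headers, batch))
            batch = []
        single = render_markdown(headers, [row])
        if len(single) <= max_embed_size:            # step 3: relax the limit
            batch = [row]                            #   for a single wide row
        else:                                        # step 4: split an oversized row
            chunks += [single[i:i + max_embed_size]
                       for i in range(0, len(single), max_embed_size)]
    if batch:
        chunks.append(render_markdown(headers, batch))
    return chunks
```

Note that every chunk produced by steps 1–3 begins with the header and separator rows, which is what keeps the data associated with its column names.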
With this approach, in most typical cases table data is never disassociated from its table headers, rows are never split mid-record, and the table headers aren't repeated excessively across chunks.