LLMs Love Structure: Using Markdown for Better PDF Analysis

Dhaval Nagar / CEO

There's a strong case for using Markdown text over plain text when working with LLMs and structured data extracted from PDFs.

PDF File converted to Plain Text

Data extracted and formatted with Markdown provides structure and context to the data which is generally lost with plain text. For Retrieval-Augmented Generation (RAG) based applications, data available in markdown text format holds significant importance.

RAG enables injecting data that is previously unknown to the LLM model. This enables the language model to work on private data or reduce hallucination by providing the latest data compared to the model's training data. RAG appends text chunks relevant to the user's query to the LLM prompt, this is directly dependent on the context window limit of the model, so to generate better output it becomes essential to provide better into in the prompt.

Why Markdown is advantageous:

  • Preserving Structure: Markdown helps you represent the structure inherent in PDFs (tables, headings, lists, etc.). LLMs can understand and leverage this structure during tasks like question-answering or summarization. Plain text flattens the information, making it harder for the model to identify relationships.
  • Enhanced Semantics: Markdown elements like headers (#, ##), bold (**), italics (*), and code blocks (```) add semantic hints about the text's importance and type. LLMs trained on Markdown can pick up on these cues, improving their comprehension. All the latest LLMs can understand Markdown very well. We are using Claude 3 Haiku for all of our experiments and it interpreted the data fairly well.
  • Cleaner Representation: Markdown allows for a more visually organized and readable representation of tabular data. This translates to a better input format for the LLM as compared to potentially messy and unformatted plain text tables. Imaging a page with table and text data immediately after that table, when flattened the last data in the table can blend easily with the next line and may generate convoluted result.
  • Context Preservation: Cleaner representation helps preserve context. Additionally, if your PDF had figures, you could include image links or alt-text in Markdown within the context of the text. This provides valuable information to LLMs that plain text would lack.
  • Chunking: Essential for Retrieval Augmented Generation (RAG) systems, chunking (otherwise known as "splitting") breaks down larger documents in smaller parts for easier processing. While Markdown doesn't directly influence the vector representations themselves within a vector database, it plays a crucial role in preserving and utilizing context during the retrieval process within RAG systems that use vector storage.

How to make it work:

  • PDF Parsing and Markdown generation: Using a reliable PDF parsing library or service, like PyMuPDF or LlamaParse, to extract structured text from your PDF. This would include accurately identifying tables, headings, etc. We are currently using LlamaParse to convert PDF files into Markdowns. LlamaParse understands headers, list, embedded tables, can automatically parse text out of images, and handles a huge range of fonts.
  • Embedding Generation: PDF files can be generated or converted from variety of source files like word documents or powerpoint slides. Using effective chunking (splitting) strategy ensures that related content stays together, or there is some degree of overlapping between chunks. For example, in case of a PPT converted to PDF, use page-page chunking after parsing the PDF.

LlamaParse is a proprietary PDF parsing service. Which means you will have to upload your document to their endpoint, this may not be suitable for a lot of use cases. The Free Plan allows parsing 1000 pages per day.

This example PDF is a Shipping Receipt for a Print company. The receipt has multiple pages and each page has specific headers and footers. Following a Markdown approach with RAG we are able to split and re-build the context needed for the final output.

PDF File converted to Markdown with Table Columns and Rows are intact

Caveats

  • PDFs with complex layouts and structure might be difficult to extract in proper format. Even for the latest libraries or specialized ML services it might be very difficult to accurately parse and preserve the format.

For some cases, plain text with whitespace can also work, however it's definitely not as robust or versatile as Markdown formatting. For example, trying to represent headings, lists, code, or emphasis using only whitespace quickly becomes impractical and unreliable. Subtle details conveyed through formatting, such as italics for emphasis or code blocks for specific commands, are lost in plain text.

Summary

While Markdown is not a strict requirement, but when a structure can influence the quality of the output, the format can make a lot of difference.

If your PDFs contain structured data that you wish to utilize with an LLM, Markdown offers a way to maintain that structure and provide semantic cues. It potentially leads to better output performance compared to unstructured plain text.

More articles

Running Ollama, Llava-Phi3 on a small AWS EC2 Instance for Image Analysis

Ollma is a great way to run models locally, better privacy, performance, and cost utilization. In this post we will are using a small EC2 instance to do Image analysis with the custom Llava-Phi3 model.

Read more

From GPT to 3D Print in Minutes

Few days back a colleague shared a video where the guy built a Voice to (3D) Print automation using ChatGPT, Python, Rhino 3D, and Grasshopper. We thought it would be cool to give it a try.

Read more

Tell us about your project

Our office

  • 408-409, SNS Platina
    Opp Shrenik Residecy
    Vesu, Surat, India
    Google Map