Supported Formats

Complete reference of file types you can upload to your knowledge bases.

Overview

Functional AI supports 21 file formats across documents, code, and data. All files are processed to extract text content for semantic search.

Key principle: If it contains extractable text, it can be used.

Images Not Processed

Images, charts, and diagrams within documents are not processed. Only text content is extracted and indexed.

Document Formats

PDF Files (.pdf)

Aspect | Details
Best for | Manuals, reports, documentation, whitepapers
Text extraction | Excellent for text-based PDFs, poor for scanned images
Max size | 500 MB

PDF Quality

Use PDFs with selectable text rather than scanned images. If you have scanned documents, run OCR first.

PDF Optimization & Issues

Optimization tips:

  • Remove cover pages and blank pages
  • Compress images to reduce file size
  • Convert image-heavy PDFs to text-only versions
  • For scanned documents, use OCR tools
  • Test whether text is selectable by trying to highlight it

Common issues:

  • Scanned PDFs are treated as images, so no text is extracted
  • Password-protected PDFs cannot be processed
  • Very large PDFs (100+ MB) take significant time to process

Microsoft Word (.doc, .docx)

Aspect | Details
Best for | Business documents, policies, guides, procedures
Text extraction | Excellent
Max size | 500 MB

Word Optimization Tips
  • Use clear headings (Heading 1, Heading 2, etc.)
  • Avoid text boxes and complex layouts
  • Remove unnecessary images
  • Use tables for structured data (converted to text)
  • Save as .docx (more reliable than the older .doc format)

Plain Text (.txt)

Aspect | Details
Best for | Simple content, logs, notes, transcripts
Text extraction | Perfect (already plain text)
Max size | 500 MB

Markdown (.md)

Aspect | Details
Best for | Technical documentation, README files, knowledge bases
Text extraction | Excellent (headings preserved as text)
Max size | 500 MB

Why Markdown Works Great

A clean heading structure makes chunking more effective, and there is no complex formatting to interfere with text extraction.
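To illustrate why headings help, here is a minimal sketch of heading-based chunking. It is an illustration only and not Functional AI's actual chunking logic, which this page does not describe; the function name `split_by_headings` is hypothetical.

```python
import re

def split_by_headings(markdown_text):
    """Split Markdown into (heading, body) chunks at ATX headings (#, ##, ...)."""
    chunks = []
    heading = None  # text before the first heading has no title
    body = []

    def flush():
        # Keep a chunk if it has a heading or any non-blank body text.
        if heading is not None or any(line.strip() for line in body):
            chunks.append((heading, "\n".join(body).strip()))

    for line in markdown_text.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)$", line)
        if match:
            flush()  # close the previous chunk before starting a new one
            heading = match.group(2).strip()
            body = []
        else:
            body.append(line)
    flush()
    return chunks
```

Each chunk carries its heading as a label, so a search hit can be traced back to a named section. The sketch does not skip fenced code blocks, so a `# comment` line inside one would be treated as a heading.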

Presentation Formats

PowerPoint (.pptx)

Aspect | Details
Best for | Slide content, training materials, presentations
Text extraction | Good (text from slides)
Max size | 500 MB

PowerPoint: What Gets Extracted

Extracted:

  • Slide titles and body text
  • Bullet points and lists
  • Table content

Not extracted:

  • Images and diagrams
  • Charts and graphs
  • Embedded videos
  • Animations

Optimization tips:

  • Ensure critical information is in text, not images
  • Use slide titles effectively (they help with chunking)
  • Consider exporting to PDF for more control

Data Formats

JSON (.json)

Aspect | Details
Best for | Structured data, product catalogs, configurations, API responses
Text extraction | Excellent (preserves structure as text)
Max size | 500 MB

Best use cases:

  • Product catalogs with descriptions
  • FAQ data in structured format
  • Configuration documentation

JSON Optimization

Tips for better results:

  • Keep nesting reasonable (3-4 levels max)
  • Use descriptive key names
  • Include text descriptions alongside data values

Example - Good JSON structure:

{
  "products": [
    {
      "id": "SKU-001",
      "name": "Premium Widget",
      "description": "High-quality widget with advanced features",
      "features": [
        "Durable construction",
        "2-year warranty"
      ],
      "faq": [
        {
          "question": "Is this dishwasher safe?",
          "answer": "Yes, top rack only"
        }
      ]
    }
  ]
}
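
The nesting guideline can be checked before upload with a short script. This is a minimal sketch: the page does not say how levels are counted, so it assumes that only objects add a level and arrays do not. The function name `object_depth` is hypothetical.

```python
import json

def object_depth(value):
    """Return how many objects are nested inside each other (arrays add no level)."""
    if isinstance(value, dict):
        return 1 + max((object_depth(v) for v in value.values()), default=0)
    if isinstance(value, list):
        return max((object_depth(v) for v in value), default=0)
    return 0  # strings, numbers, booleans, null

# A trimmed version of a product catalog entry.
catalog = json.loads(
    '{"products": [{"id": "SKU-001", "faq": [{"question": "Is this dishwasher safe?"}]}]}'
)
within_guideline = object_depth(catalog) <= 4  # 4 is the upper end of "3-4 levels"
```

Under this counting rule, a catalog whose products each hold a list of FAQ objects has three levels: the root object, each product, and each FAQ entry.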

Code Files

Code files are useful for technical documentation, API references, or coding assistants.

Language | Extensions | Best For
Python | .py | Python modules, API implementations
JavaScript/TypeScript | .js, .ts | Frontend code, Node.js modules
HTML | .html | Web pages, email templates
CSS | .css | Style documentation, design systems
Java | .java | Java API documentation
C/C++ | .c, .cpp | Systems programming reference
C# | .cs | .NET application code
Ruby | .rb | Rails applications
PHP | .php | PHP application code
Go | .go | Go service documentation
Shell | .sh | Bash scripts, automation docs

All code files: Max size 500 MB, perfect text extraction

Code File Best Practices
  • Include comprehensive comments and docstrings
  • Remove credentials and secrets
  • Focus on well-documented, exemplary code
  • Consider if Markdown documentation might be clearer
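
As a rough aid for the "remove credentials and secrets" step, the sketch below flags lines that look like hard-coded secrets. The patterns are illustrative assumptions, not a complete scanner: they will miss some secrets and flag some harmless lines, so treat the output as a prompt for manual review. The name `find_possible_secrets` is hypothetical.

```python
import re

# Heuristic patterns only (assumed for illustration).
SECRET_PATTERNS = [
    # e.g. API_KEY = "....", db_password: '....' (quoted value of 8+ characters)
    re.compile(
        r"(api[_-]?key|secret|password|passwd|token)\w*\s*[:=]\s*['\"][^'\"]{8,}['\"]",
        re.IGNORECASE,
    ),
    # PEM private key header
    re.compile(r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"),
]

def find_possible_secrets(source_text):
    """Return (line_number, line) pairs that match any heuristic pattern."""
    hits = []
    for number, line in enumerate(source_text.splitlines(), start=1):
        if any(pattern.search(line) for pattern in SECRET_PATTERNS):
            hits.append((number, line.strip()))
    return hits
```

Run it over each code file before upload and replace any flagged values with placeholders.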

Other Formats

LaTeX (.tex)

Best for academic papers and technical documents. Max size: 500 MB.

Complete File Type List

Category | Supported Extensions
Documents | .pdf, .doc, .docx, .txt, .md, .pptx
Data | .json
Code | .py, .js, .ts, .html, .css
Code (additional) | .java, .c, .cpp, .cs, .rb, .php, .go, .sh
Other | .tex
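
The list above can be turned into a quick pre-upload check. This is a minimal sketch, not part of the product: `check_upload` is a hypothetical helper, and it assumes that extensions are matched case-insensitively and that 1 MB means 1024 × 1024 bytes, neither of which this page confirms.

```python
from pathlib import Path

# The 21 extensions listed on this page.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".doc", ".docx", ".txt", ".md", ".pptx",
    ".json",
    ".py", ".js", ".ts", ".html", ".css",
    ".java", ".c", ".cpp", ".cs", ".rb", ".php", ".go", ".sh",
    ".tex",
}
MAX_SIZE_BYTES = 500 * 1024 * 1024  # 500 MB (assumes binary megabytes)

def check_upload(filename, size_bytes):
    """Return a list of problems; an empty list means the file looks uploadable."""
    problems = []
    extension = Path(filename).suffix.lower()  # assumption: case-insensitive match
    if extension not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported extension: {extension or '(none)'}")
    if size_bytes > MAX_SIZE_BYTES:
        problems.append("file exceeds the 500 MB limit")
    return problems
```

For a file on disk, pass `path.stat().st_size` as `size_bytes`. The check covers only type and size; it cannot tell whether a PDF is scanned or password-protected.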

Format Recommendations

Best Formats for RAG

Use Case | Recommended Format
Product documentation | PDF, Markdown
FAQs | Markdown, Text
Policies | PDF, Word
Technical docs | Markdown
Data/catalogs | JSON
Code reference | Native code files

Formats to Avoid

Format | Issue | Alternative
Scanned PDFs | No text extraction | Use OCR first
Image files | Not processed | Convert to text
Video/Audio | Not supported | Provide transcripts
Spreadsheets (CSV, XLSX) | Not supported | Convert to JSON
Password-protected | Cannot read | Remove protection
Very large files | Slow processing | Split into sections
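
For the spreadsheet row above, a CSV export can be converted to JSON with the Python standard library. This is a minimal sketch that assumes the first row holds column names; `csv_to_json` is a hypothetical helper name. XLSX files would first need to be exported to CSV from the spreadsheet application.

```python
import csv
import io
import json

def csv_to_json(csv_text):
    """Convert CSV text with a header row into a JSON array of objects."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # ensure_ascii=False keeps accented and non-Latin characters readable.
    return json.dumps(rows, indent=2, ensure_ascii=False)

catalog_json = csv_to_json("id,name\nSKU-001,Premium Widget\n")
```

Every value comes out as a string, which is fine for text search. Descriptive column names become descriptive JSON keys, in line with the JSON tips earlier on this page.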

File Size Optimization

PDF Optimization

Tools: Adobe Acrobat, SmallPDF, iLovePDF

Techniques:

  1. Compress images: Reduce image quality/resolution
  2. Remove pages: Delete cover, blank, and unnecessary pages
  3. Remove metadata: Author info, comments, etc.
  4. Convert to grayscale: If color isn't essential

Example: A 50 MB PDF manual can often be reduced to 5-10 MB without losing text quality.

When to Split Files

Instead of one massive file, consider splitting when:

  • A single file exceeds 50 MB (smaller files are easier to manage)
  • Content covers distinct topics
  • Updates affect only portions

Example: Instead of "Complete Product Manual.pdf" (100 MB), split into:

  • "Installation Guide.pdf" (5 MB)
  • "User Manual.pdf" (20 MB)
  • "Troubleshooting.pdf" (10 MB)
  • "Technical Specifications.pdf" (8 MB)

Next Steps

Upload Files to Your Store