Document Loaders
Combining language models with your own text data is a powerful way to differentiate them. The first step is to load the data into "documents", a fancy way of saying pieces of text. This module aims to make that easy.
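To make the idea of a "document" concrete, here is a minimal, self-contained sketch. It assumes a document is just a piece of text plus some metadata about its origin. The `Document` class and the `load_text_file` helper below are hypothetical stand-ins written for illustration and are not the library's own classes; the field names `page_content` and `metadata` are chosen to resemble the library's document objects.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Dict, List


@dataclass
class Document:
    """Illustrative stand-in: a piece of text plus metadata about its origin."""
    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


def load_text_file(path: str, encoding: str = "utf-8") -> List[Document]:
    """Hypothetical loader: read one text file and wrap it in a Document.

    Returns a list because loaders for other sources (directories,
    chat exports, web pages) typically yield several documents.
    """
    text = Path(path).read_text(encoding=encoding)
    return [Document(page_content=text, metadata={"source": str(path)})]
```

Under these assumptions, calling `load_text_file("notes.txt")` would return a one-element list whose `page_content` holds the file's text and whose `metadata["source"]` records the path. Loaders for other sources would follow the same shape and differ only in how they obtain the text.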
Much of this functionality is driven by the Unstructured Python package, which transforms many types of files (text, PowerPoint, images, HTML, PDF, and others) into text data.
For detailed instructions on getting set up with Unstructured, see the Unstructured installation guidelines.
The following document loaders are provided:
- CoNLL-U
- Airbyte JSON
- Apify Dataset
- AZLyrics
- Azure Blob Storage Container
- Azure Blob Storage File
- BigQuery Loader
- Bilibili
- Blackboard
- College Confidential
- Confluence
- Copy Paste
- CSV Loader
- DataFrame Loader
- Diffbot
- Directory Loader
- Discord
- DuckDB Loader
- EPubs
- EverNote
- Facebook Chat
- Figma
- GCS Directory
- GCS File Storage
- Git
- GitBook
- Google Drive
- Gutenberg
- Hacker News
- HTML
- iFixit
- Images
- Image captions
- IMSDb
- Markdown
- Notebook
- Notion
- Notion DB Loader
- Obsidian
- PowerPoint
- ReadTheDocs Documentation
- Roam
- S3 Directory
- S3 File
- Sitemap Loader
- Slack (Local Exported Zipfile)
- Subtitle Files
- Telegram
- Unstructured File Loader
- URL
- Selenium URL Loader
- Playwright URL Loader
- Web Base
- WhatsApp Chat
- Word Documents
- YouTube