- by: axiner
- docsloader
- This is a document loader.
This package can be installed using pip (Python>=3.11):
pip install docsloader
- if you want to install all dependencies:
pip install docsloader[all]
- if you want to install specific dependencies:
  - txt:
  pip install docsloader[txt]
  - csv:
  pip install docsloader[csv]
  - md:
  pip install docsloader[md]
  - xlsx:
  pip install docsloader[xlsx]
  - pptx:
  pip install docsloader[pptx]
  - docx:
  pip install docsloader[docx]
  - pdf:
  pip install docsloader[pdf]
  - img:
  pip install docsloader[img]
  - auto:
  pip install docsloader[auto]
The docsloader package provides asynchronous document loaders for various file suffixes. It includes dedicated loaders
for specific file types and an AutoLoader that automatically selects the appropriate loader based on file suffix.
The package supports loading documents from the following file suffixes:
- Text Files: .txt
- CSV Files: .csv
- Markdown Files: .md
- HTML Files: .html, .htm
- Excel Files: .xlsx, .xls
- PowerPoint Files: .pptx, .ppt
- Word Files: .docx, .doc
- PDF Files: .pdf
- Image Files: .jpg, .jpeg, .png
The package provides the following loader classes:
- TxtLoader: For text files
- CsvLoader: For CSV files
- MdLoader: For Markdown files
- HtmlLoader: For HTML files
- XlsxLoader: For Excel files
- PptxLoader: For PowerPoint files
- DocxLoader: For Word files
- PdfLoader: For PDF files
- ImgLoader: For image files
- AutoLoader: Automatically selects the appropriate loader based on file suffix
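To picture how suffix-based selection could work, here is a minimal sketch using only the standard library. It is not docsloader's actual implementation: the table `SUFFIX_TO_LOADER` and the helper `pick_loader_name` are hypothetical names, and pairing legacy suffixes such as .xls or .doc with the newer-format loaders is an assumption drawn from the lists above.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical lookup table built from the suffixes and class names listed above.
SUFFIX_TO_LOADER = {
    ".txt": "TxtLoader",
    ".csv": "CsvLoader",
    ".md": "MdLoader",
    ".html": "HtmlLoader",
    ".htm": "HtmlLoader",
    ".xlsx": "XlsxLoader",
    ".xls": "XlsxLoader",
    ".pptx": "PptxLoader",
    ".ppt": "PptxLoader",
    ".docx": "DocxLoader",
    ".doc": "DocxLoader",
    ".pdf": "PdfLoader",
    ".jpg": "ImgLoader",
    ".jpeg": "ImgLoader",
    ".png": "ImgLoader",
}


def pick_loader_name(path_or_url: str) -> str:
    """Return the loader class name for a local path or URL (illustrative only)."""
    # For http(s) URLs, look only at the path component so query strings are ignored.
    parsed = urlparse(path_or_url)
    path = parsed.path if parsed.scheme in ("http", "https") else path_or_url
    # Normalise Windows separators, then compare the suffix case-insensitively.
    suffix = PurePosixPath(path.replace("\\", "/")).suffix.lower()
    try:
        return SUFFIX_TO_LOADER[suffix]
    except KeyError:
        raise ValueError(f"unsupported file suffix: {suffix!r}") from None
```

A table like this keeps the supported suffixes in one place and makes an unsupported suffix fail early with a clear error.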
All loader classes implement asynchronous load methods for efficient document processing.
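Because `load()` is consumed with `async for`, several sources can be drained concurrently on one event loop. The sketch below illustrates that pattern with a stand-in class so it runs without docsloader installed: `FakeLoader`, `collect`, and `load_many` are hypothetical names, and the string "documents" it yields are placeholders, not the package's real output.

```python
import asyncio
from typing import AsyncIterator


class FakeLoader:
    """Hypothetical stand-in for a loader; it only mimics an async `load()`."""

    def __init__(self, path_or_url: str) -> None:
        self.path_or_url = path_or_url

    async def load(self) -> AsyncIterator[str]:
        for i in range(2):
            await asyncio.sleep(0)  # hand control back to the loop, as real I/O would
            yield f"{self.path_or_url}#chunk{i}"


async def collect(loader) -> list:
    """Drain an async `load()` generator into a list."""
    return [doc async for doc in loader.load()]


async def load_many(paths: list[str]) -> list[list]:
    # One collection coroutine per source; gather keeps results in input order.
    return await asyncio.gather(*(collect(FakeLoader(p)) for p in paths))


if __name__ == "__main__":
    print(asyncio.run(load_many(["a.txt", "b.md"])))
```

Swapping `FakeLoader` for a real loader should only require that it expose the same `load()` shape, which is an assumption to verify against the package itself.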
import asyncio

from docsloader import AutoLoader
from toollib.log import init_logger

logger = init_logger(__name__)


async def main(path_or_url: str):
    # AutoLoader picks the concrete loader from the file suffix.
    loader = AutoLoader(
        path_or_url=path_or_url,
        rm_tmpfile=False,
    )
    # load() is consumed with `async for`, one document at a time.
    async for doc in loader.load():
        logger.info(doc)


if __name__ == "__main__":
    asyncio.run(main(path_or_url=r"E:/NewFolder/测试.docx"))

This project is released under the MIT License (MIT). See LICENSE