datahtml

datahtml is a library for crawling HTML and XML content and extracting data from it.

datahtml lets you:

Extract ld+json data from HTML
Extract frequently used meta tags from HTML (those used for SEO and social media, among others)
Extract article data from an HTML page, usually from newspaper sites
Parse RSS feeds from sites
Crawl some specific social media sites like Google and YouTube
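
To make the first two items concrete, here is a minimal sketch of what extracting ld+json blocks and meta tags from a page involves. It uses only the Python standard library and is not datahtml's own API: the names `MetadataParser`, `extract_metadata`, and `SAMPLE` are hypothetical and exist only for this illustration.

```python
import json
from html.parser import HTMLParser


class MetadataParser(HTMLParser):
    """Collect ld+json blocks and <meta> tags from an HTML document."""

    def __init__(self):
        super().__init__()
        self.ld_json = []          # parsed JSON-LD objects, in document order
        self.meta = {}             # meta "property" or "name" -> content
        self._in_ld_json = False   # True while inside an ld+json <script>
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "script" and attrs.get("type") == "application/ld+json":
            self._in_ld_json = True
            self._buffer = []
        elif tag == "meta":
            # Open Graph tags use "property"; classic SEO tags use "name".
            key = attrs.get("property") or attrs.get("name")
            if key and attrs.get("content") is not None:
                self.meta[key] = attrs["content"]

    def handle_data(self, data):
        if self._in_ld_json:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_ld_json:
            self._in_ld_json = False
            try:
                self.ld_json.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed blocks instead of failing the whole page


def extract_metadata(html):
    """Hypothetical helper: return ld+json objects and meta tags from html."""
    parser = MetadataParser()
    parser.feed(html)
    return {"ld_json": parser.ld_json, "meta": parser.meta}


# Hypothetical sample page, used only to exercise the sketch.
SAMPLE = """
<html><head>
<meta property="og:title" content="Example headline">
<meta name="description" content="A short summary">
<script type="application/ld+json">
{"@type": "NewsArticle", "headline": "Example headline"}
</script>
</head><body></body></html>
"""

result = extract_metadata(SAMPLE)
print(result["meta"])
print(result["ld_json"])
```

Malformed ld+json blocks are skipped rather than raised, on the assumption that a crawler would rather keep the rest of the page's metadata than fail outright.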

Under the hood, datahtml uses libraries such as BeautifulSoup, Newspaper2k, and feedparser, among others, but it takes an opinionated approach to crawling based on our experience doing so.
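
As a rough illustration of the RSS parsing mentioned in the feature list, the sketch below reads the items of an RSS 2.0 feed. datahtml relies on feedparser for this; the example uses the standard library's `xml.etree.ElementTree` instead so that it runs without extra dependencies. The function `parse_rss` and the sample feed are hypothetical and not part of datahtml.

```python
import xml.etree.ElementTree as ET


def parse_rss(xml_text):
    """Hypothetical helper: list the items of an RSS 2.0 document as dicts."""
    root = ET.fromstring(xml_text)
    entries = []
    for item in root.iter("item"):
        entries.append({
            # findtext returns the default when the child element is missing
            "title": item.findtext("title", default="").strip(),
            "link": item.findtext("link", default="").strip(),
            "published": item.findtext("pubDate", default="").strip(),
        })
    return entries


# Hypothetical feed, used only to exercise the sketch.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First story</title>
      <link>https://example.com/first</link>
      <pubDate>Mon, 01 Jan 2024 10:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Second story</title>
      <link>https://example.com/second</link>
    </item>
  </channel>
</rss>"""

entries = parse_rss(SAMPLE_FEED)
for entry in entries:
    print(entry["title"], entry["link"])
```

Real-world feeds vary far more than this (Atom, namespaces, malformed XML), which is the kind of case a dedicated parser such as feedparser is designed to handle.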

Quickstart

pip install datahtml

from datahtml import web, crawler

c = crawler.LocalCrawler()
w = web.download("https://www.infobae.com", crawler=c)
w.links()

License

datahtml is distributed under the terms of the MPL-2.0 license.

algorinfo/datahtml

Folders and files

Latest commit

History

Repository files navigation

datahtml

Quickstart

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages