
Raw Data

A simple SQLite store for raw data.

It provides a thin REST API over different namespaces. Each namespace maps to its own SQLite database.

This project aims to be simple. Having a SQLite store makes it easy to move files directly when needed.

A fileserver is embedded for that purpose, along with the option to take a snapshot of each namespace.

Future work could include a sharding strategy to split load, and an index for text data.

New: if the -stream option is selected, each new entry is streamed per namespace to a Redis instance.

Use cases

For small objects (~1 MB).

My use case is to store crawled data (~700 KB), up to 500k objects per namespace.

Bigger files are discouraged: each file is fully loaded into memory on every request, since SQLite doesn't provide a way to stream data directly.

Defaults to consider

  1. A default namespace is created on startup.
  2. No auth; reverse-proxy auth is easy to add using nginx. In the future an auth endpoint could be included in the app.
  3. Every object is compressed and decompressed using zlib (see the sketch after this list).
  4. -stream can be used to stream each new object to Redis.
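
Regarding item 3, a minimal sketch of the zlib round trip using Go's standard compress/zlib package; the project's actual helper names may differ:

package main

import (
	"bytes"
	"compress/zlib"
	"fmt"
	"io"
)

// compress zlib-compresses a payload before it is stored as a BLOB.
func compress(data []byte) ([]byte, error) {
	var buf bytes.Buffer
	w := zlib.NewWriter(&buf)
	if _, err := w.Write(data); err != nil {
		return nil, err
	}
	if err := w.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// decompress restores the original payload when an object is read back.
func decompress(data []byte) ([]byte, error) {
	r, err := zlib.NewReader(bytes.NewReader(data))
	if err != nil {
		return nil, err
	}
	defer r.Close()
	return io.ReadAll(r)
}

func main() {
	c, _ := compress([]byte("bigswag"))
	out, _ := decompress(c)
	fmt.Println(string(out)) // prints: bigswag
}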

Also check the default config values:

var (
	listenAddr   = Env("RD_LISTEN_ADDR", ":6667")
	nsDir        = Env("RD_NS_DIR", "data/")
	redisAddr    = Env("RD_REDIS_ADDR", "localhost:6379")
	redisPass    = Env("RD_REDIS_PASS", "")
	redisDB      = Env("RD_REDIS_DB", "0")
	redisNS      = Env("RD_REDIS_NS", "RD")
	streamNo     = Env("RD_STREAM", "false")
	eStreamLimit = Env("RD_STREAM_LIMIT", "1000")
)
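
The Env helper is not shown above; a plausible minimal implementation, assuming it only reads an environment variable with a fallback default, would be:

package config

import "os"

// Env returns the value of the environment variable key,
// or def when the variable is unset or empty.
func Env(key, def string) string {
	if v := os.Getenv(key); v != "" {
		return v
	}
	return def
}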

Data schema inside each SQLite store

Data Schema V1:

CREATE TABLE IF NOT EXISTS data (
	data_id    TEXT PRIMARY KEY,
	data       BLOB NOT NULL,
	created_at TEXT DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX IF NOT EXISTS created_ix ON data(created_at);
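
As a hypothetical sketch of how a store might write and read a row against this schema (assuming the mattn/go-sqlite3 driver and a per-namespace file such as data/default.db; the project's actual data-access code may differ):

package main

import (
	"database/sql"
	"log"

	_ "github.com/mattn/go-sqlite3" // registers the "sqlite3" driver
)

func main() {
	// Each namespace maps to its own SQLite file under the namespace dir.
	// Assumes the schema above has already been applied.
	db, err := sql.Open("sqlite3", "data/default.db")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Upsert an object; the blob is assumed to be zlib-compressed already.
	_, err = db.Exec(
		`INSERT INTO data (data_id, data) VALUES (?, ?)
		 ON CONFLICT(data_id) DO UPDATE SET data = excluded.data`,
		"wehave", []byte("compressed-bytes"),
	)
	if err != nil {
		log.Fatal(err)
	}

	// Read the blob back by ID.
	var blob []byte
	err = db.QueryRow(`SELECT data FROM data WHERE data_id = ?`, "wehave").Scan(&blob)
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("read %d bytes", len(blob))
}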

API

  • GET /status

    • 200 if everything is ok
  • GET /files

    • Fileserver. Lists all the SQLite files, one per namespace.
  • POST /v1/namespace

    • Creates a namespace: { "name": "my_namespace" } (see the sketch after this list)
  • GET /v1/namespace

    • List namespaces
  • GET /v1/namespace/{namespace}/_backup

    • Takes a backup. This action is synchronous, so consider the request time for big files (> 6 GB).
  • GET /v1/data/{namespace}

    • Lists objects as an API response, with data base64 encoded and uncompressed.
  • GET /v1/data/{namespace}/_list

    • Lists only the ID and created_at fields.
  • PUT /{namespace}/{key}

    • 201 if created, anything else = fail
    • If the key already exists, the data is replaced with the newly sent payload.
  • POST /{namespace}/{key}

    • 201 if created, anything else = fail
  • DELETE /{namespace}/{key}

    • 200 Deleted
  • GET /{namespace}/ (will be removed in the next release)

    • Lists objects as an API response, with data base64 encoded.
    • This should be moved to the API endpoints. Filter options will be included in future versions.
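
A hypothetical client-side sketch of the namespace-creation call in Go; the endpoint and payload follow the list above, and the address is the default listen address:

package main

import (
	"bytes"
	"fmt"
	"log"
	"net/http"
)

func main() {
	// POST /v1/namespace with the documented JSON payload.
	body := bytes.NewBufferString(`{"name": "my_namespace"}`)
	resp, err := http.Post("http://localhost:6667/v1/namespace", "application/json", body)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.StatusCode)
}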

Usage

Running the service:

rawdata volume -help
Usage of volume:
  -listen string
    	Address to listen (default ":6667")
  -namespace string
    	Namespace dir (default "data/")
  -redis-ns string
    	Which key namespace use for redis (default "RD")
  -stream
    	Enable stream data to redis
  -stream-limit string
    	How many message by stream (default "1000")
rawdata volume
2023/04/05 17:48:50 new.go:57: NS Loading for default
2023/04/05 17:48:50 new.go:102: Starting from /home/nuxion/Projects/algorinfo/rawdata
2023/04/05 17:48:50 new.go:106: With stream disabled
2023/04/05 17:48:50 volume.go:100: Running web mode on:  :6667

By default, a namespace named default is created:

Create or update an object

curl -v -L -X PUT -d bigswag localhost:6667/default/wehave

Create a new object

curl -v -L -X POST -d bigswag localhost:6667/default/wehave

Get object (uncompressed original format)

curl -v -L localhost:6667/default/wehave

Delete object

curl -v -L -X DELETE localhost:6667/default/wehave
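
When -stream is enabled, each new object is also published to Redis. Below is a speculative consumer sketch using the go-redis client; the stream key RD:default is an assumption built from the RD_REDIS_NS default and the namespace name, since the actual key layout and entry fields are not documented here:

package main

import (
	"context"
	"fmt"
	"log"

	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})

	// Block waiting for entries newer than "now" on the assumed stream key.
	res, err := rdb.XRead(ctx, &redis.XReadArgs{
		Streams: []string{"RD:default", "$"}, // "$" = only new entries
		Block:   0,                           // 0 = block indefinitely
	}).Result()
	if err != nil {
		log.Fatal(err)
	}
	for _, stream := range res {
		for _, msg := range stream.Messages {
			fmt.Println(msg.ID, msg.Values)
		}
	}
}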

Similar projects and inspiration for this work

  1. https://github.com/geohot/minikeyvalue: it uses a Go web server as a coordinator that manages keys in a LevelDB store and redirects each request to an nginx server used as a volume. The problem is the same as before, namely that it is not suitable for small data in the long term, but I took the idea of having an easy way to restore the data if something fails.

  2. Some time later, I found that the previous work was based on this paper.

  3. https://github.com/chrislusf/seaweedfs

  4. Google: http://infolab.stanford.edu/~backrub/google.html

Roadmap

  • Migrate to sqlc
  • Queue for intensive inserts (using channels)
  • Worker to read data from Redis (?)
  • JWT auth
  • Automatic backup to an object store
  • Streaming response for a list of objects from a namespace
  • Store/Bucket struct that performs all the actions related to operations on objects
  • General config SQLite store for the app?
  • Optional WAL mode for stores
  • Locks
  • Notifications through web services (using simple Redis pub/sub) per namespace
  • Backup should be a goroutine: lock the namespace for writes when starting, and emit notifications when ending (HTTP 423 should be returned by POST endpoints in the meantime)

References

  1. Building your data lake
  2. Data Lake
  3. Data blocks in HDFS
  4. Facebook photo storage
  5. The Anatomy of a Large-Scale Web Search Engine
  6. When to use SQLite
  7. 35% faster than the filesystem
  8. Redis INFO
