This is the first article on my new blog, and this one will be quick and direct! There is not much to explain.
Motivation
Terabytes of small files stored on an OCFS2 file system shared between two hosts, giving a lot of trouble, for example:
- Terrible I/O (probably DLM related, since the distributed locking is done over the network);
- Backup always outdated (mostly because of the problem above);
- Split brain and STONITH;
And the most important one:
- I hate cluster filesystems;
The project
Months ago, I started to think about how to break up this old-school "enterprise" architecture into something modern, scalable, highly available and easy to handle, and I immediately thought of the way Amazon S3 works; for my project, I'd like to have the same simplicity and scalability that S3 has. Obviously, it is more logical to use S3 instead of building something like it, but in my case I already had a private cloud infrastructure, so we chose to build something! At this point I don't need the "buckets" feature that exists in S3, so this article will not cover a way to do it. To start, I wrote down on paper some requirements that I thought would be truly important to build the solution:
- virtualized infrastructure;
As mentioned, we already had our own private cloud.
- distributed storage;
A great challenge. How to distribute the files and still know where they are?
- hot backups;
Just in case of a node failure. We MUST have the files available.
- cold backup;
A safe copy.
- restful api;
To handle files: uploads and deletes.
After a little research into ways to get all of these features, I decided to use NGINX with the Lua and DAV modules, together with some data disks to distribute my files. The DAV module can handle the PUT and DELETE methods, so I didn't need to write anything to handle uploads and deletes. To divide and balance the file tree (shard) I used the MD5 algorithm (as I usually do). I drew the architecture with 5 servers (I will use some fictional IP addresses to illustrate it):
2 frontend servers (to serve and upload content);
- frontend01 (192.168.0.10);
- frontend02 (192.168.0.11);
3 (or more) storage servers (to attach my virtual disks and split the data);
The great thing here is that both the frontends and the storages run on top of NGINX!
Hands on
Install NGINX with the Lua and DAV modules enabled. I will not teach you how to do it; you can check how here: http://wiki.nginx.org/HttpLuaModule#Installation and here: http://nginx.org/en/docs/http/ngx_http_dav_module.html. Also install dnsmasq on all 5 servers, to resolve DNS locally.
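As a side note, dnsmasq answers queries from /etc/hosts by default, so a minimal configuration sketch is enough (the upstream resolver address is an assumption; use your own):

```
# /etc/dnsmasq.conf (sketch)
# answer only on localhost; nginx's "resolver 127.0.0.1" will query this
listen-address=127.0.0.1
# forward anything not found in /etc/hosts to an upstream resolver
server=192.168.0.1
```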
Backend Servers
For each backend I decided to use 6 virtual disks: 3 volumes for DATA and 3 volumes for BACKUP, formatted as XFS, 100G each. Following the MD5 algorithm, I will have 16 possible characters (0-9, a-f) to split between the DATA disks. Tips:
How to create LVM volumes: http://tldp.org/HOWTO/LVM-HOWTO/
XFS how-to: http://techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi?coll=linux&db=bks&srch=&fname=/SGI_Admin/LX_XFS_AG/sgi_html/ch03.html
With all volumes created, let's set up the sharding and mount them. First, split the 16 chars between the 3 storage servers, and we will have:
Storage server 1: chars 1, 2, 3, 4, 5
Storage server 2: chars 6, 7, 8, 9, 0
Storage server 3: chars a, b, c, d, e, f
Considering you may want to mount them in /media:
Use the names you gave to the volumes
theflockers ~ $ sudo mkdir /media/data0{0,1,2}
theflockers ~ $ sudo mount -t xfs /dev/mapper/vol_data00 /media/data00 -o inode64,nobarrier
theflockers ~ $ sudo mount -t xfs /dev/mapper/vol_data01 /media/data01 -o inode64,nobarrier
theflockers ~ $ sudo mount -t xfs /dev/mapper/vol_data02 /media/data02 -o inode64,nobarrier
cross backup
theflockers ~ $ sudo mkdir /media/backup_data0{6,7,8}
theflockers ~ $ sudo mount -t xfs /dev/mapper/vol_backup_data06 /media/backup_data06 -o inode64,nobarrier
theflockers ~ $ sudo mount -t xfs /dev/mapper/vol_backup_data07 /media/backup_data07 -o inode64,nobarrier
theflockers ~ $ sudo mount -t xfs /dev/mapper/vol_backup_data08 /media/backup_data08 -o inode64,nobarrier
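These mounts must survive a reboot, so they also belong in /etc/fstab; a sketch for one data and one backup volume (the UUIDs are hypothetical, get the real ones with blkid):

```
# /etc/fstab (fragment)
UUID=aaaaaaaa-1111-2222-3333-444444444444  /media/data00         xfs  inode64,nobarrier  0 0
UUID=bbbbbbbb-5555-6666-7777-888888888888  /media/backup_data06  xfs  inode64,nobarrier  0 0
```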
IMPORTANT: Don't forget to put the new disks' information into fstab, preferably by disk UUID. Continuing the sharding setup:
root ~ # mkdir /media/data00/{1,2}
root ~ # mkdir /media/data01/{3,4}
root ~ # mkdir /media/data02/5
root ~ # mkdir /media/backup_data06/{a,b}
root ~ # mkdir /media/backup_data07/{c,d}
root ~ # mkdir /media/backup_data08/{e,f}
Then, consolidate all directories in a single place:
theflockers ~ $ sudo mkdir -p /media/storage/{live,backup}
theflockers ~ $ cd /media/storage/
theflockers /media/storage $ sudo find /media/data0{0,1,2} -maxdepth 1 -mindepth 1 -type d | xargs -i sudo ln -s '{}' live/
theflockers /media/storage $ sudo find /media/backup_data0{6,7,8} -maxdepth 1 -mindepth 1 -type d | xargs -i sudo ln -s '{}' backup/
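To see what this consolidation produces, here is a throwaway sketch (scratch directory, hypothetical shard names) that mimics the commands above without touching /media:

```shell
# Recreate the live/ symlink farm in a scratch directory.
tmp=$(mktemp -d)
mkdir -p "$tmp/data00/1" "$tmp/data00/2" "$tmp/data01/3" "$tmp/data01/4"
mkdir -p "$tmp/storage/live"
# one symlink per shard directory, all reachable under live/
find "$tmp/data00" "$tmp/data01" -maxdepth 1 -mindepth 1 -type d \
    | xargs -I{} ln -s '{}' "$tmp/storage/live/"
ls "$tmp/storage/live"   # the shard dirs 1 2 3 4, each a symlink
```

Every shard directory, whichever disk it lives on, now appears under a single live/ tree, which is what the nginx `root /media/storage` relies on later.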
Repeat the disk configuration on the other storage servers and also set up the sharding:
** Server 2 **
theflockers ~ $ sudo mkdir /media/data0{3,4,5}
theflockers ~ $ sudo mount -t xfs /dev/mapper/vol_data03 /media/data03 -o inode64,nobarrier
theflockers ~ $ sudo mount -t xfs /dev/mapper/vol_data04 /media/data04 -o inode64,nobarrier
theflockers ~ $ sudo mount -t xfs /dev/mapper/vol_data05 /media/data05 -o inode64,nobarrier
# cross backup
theflockers ~ $ sudo mkdir /media/backup_data0{0,1,2}
theflockers ~ $ sudo mount -t xfs /dev/mapper/vol_backup_data00 /media/backup_data00 -o inode64,nobarrier
theflockers ~ $ sudo mount -t xfs /dev/mapper/vol_backup_data01 /media/backup_data01 -o inode64,nobarrier
theflockers ~ $ sudo mount -t xfs /dev/mapper/vol_backup_data02 /media/backup_data02 -o inode64,nobarrier
theflockers ~ $ sudo mkdir /media/data03/{6,7}
theflockers ~ $ sudo mkdir /media/data04/{8,9}
theflockers ~ $ sudo mkdir /media/data05/0
theflockers ~ $ sudo mkdir /media/backup_data00/{1,2}
theflockers ~ $ sudo mkdir /media/backup_data01/{3,4}
theflockers ~ $ sudo mkdir /media/backup_data02/5
theflockers ~ $ sudo mkdir -p /media/storage/{live,backup}
theflockers ~ $ cd /media/storage/
theflockers /media/storage $ sudo find /media/data0{3,4,5} -maxdepth 1 -mindepth 1 -type d | xargs -i sudo ln -s '{}' live/
theflockers /media/storage $ sudo find /media/backup_data0{0,1,2} -maxdepth 1 -mindepth 1 -type d | xargs -i sudo ln -s '{}' backup/
** Server 3 **
theflockers ~ $ sudo mkdir /media/data0{6,7,8}
theflockers ~ $ sudo mount -t xfs /dev/mapper/vol_data06 /media/data06 -o inode64,nobarrier
theflockers ~ $ sudo mount -t xfs /dev/mapper/vol_data07 /media/data07 -o inode64,nobarrier
theflockers ~ $ sudo mount -t xfs /dev/mapper/vol_data08 /media/data08 -o inode64,nobarrier
# cross backup
theflockers ~ $ sudo mkdir /media/backup_data0{3,4,5}
theflockers ~ $ sudo mount -t xfs /dev/mapper/vol_backup_data03 /media/backup_data03 -o inode64,nobarrier
theflockers ~ $ sudo mount -t xfs /dev/mapper/vol_backup_data04 /media/backup_data04 -o inode64,nobarrier
theflockers ~ $ sudo mount -t xfs /dev/mapper/vol_backup_data05 /media/backup_data05 -o inode64,nobarrier
theflockers ~ $ sudo mkdir /media/data06/{a,b}
theflockers ~ $ sudo mkdir /media/data07/{c,d}
theflockers ~ $ sudo mkdir /media/data08/{e,f}
theflockers ~ $ sudo mkdir /media/backup_data03/{6,7}
theflockers ~ $ sudo mkdir /media/backup_data04/{8,9}
theflockers ~ $ sudo mkdir /media/backup_data05/0
theflockers ~ $ sudo mkdir -p /media/storage/{live,backup}
theflockers ~ $ cd /media/storage/
theflockers /media/storage $ sudo find /media/data0{6,7,8} -maxdepth 1 -mindepth 1 -type d | xargs -i sudo ln -s '{}' live/
theflockers /media/storage $ sudo find /media/backup_data0{3,4,5} -maxdepth 1 -mindepth 1 -type d | xargs -i sudo ln -s '{}' backup/
So let's start the NGINX configuration. Once you have all your backend servers configured, it's time to set up NGINX to serve the paths you created:
theflockers ~ $ sudo vim /etc/nginx/conf.d/backend.conf

server {
    listen 80;
    server_name ~^(?P<shard>([0-9]|[a-f])).static.domain.tld$;
    server_name ~^(?P<shard>([0-9]|[a-f])).backup.static.domain.tld$;

    access_log /shop/logs/live/nginx/static.access.log;
    error_log  /shop/logs/live/nginx/static.error.log;

    root /media/storage;

    # try to reach live files. If error, fall back to backup
    try_files /live$uri /backup$uri 404.html;

    client_max_body_size 20m;

    location / {
        root /media/storage/live;

        # DAV configuration
        dav_methods PUT DELETE;
        dav_access user:rw group:rw all:r;
        create_full_put_path on;

        # Important! Limit only to your frontends
        limit_except GET {
            # put your real network here
            allow 192.168.0.0/24;
            deny all;
        }
    }
}
Copy this config to all your backend (storage) servers. Once you're done, it's time to set up your frontends.
Frontend Servers
Frontend servers don't need much disk, unless you want to keep your logs locally; I advise you to send the logs to a remote log server. We need to reach the backend servers. First, let's configure access to the live files:
theflockers ~ $ sudo vim /etc/nginx/conf.d/static-backends.conf

# proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

location ~ '^/([0-9a-f])/' {
    dav_methods PUT DELETE;
    proxy_pass http://$1.static.domain.tld$uri;
}
Then, backups:
theflockers ~ $ sudo vim /etc/nginx/conf.d/static-backup-backends.conf

# proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

location ~ '^/backup_([0-9a-f])/' {
    dav_methods PUT DELETE;
    proxy_pass http://$1.backup.static.domain.tld$uri;
}
Now we'll use the Lua module to do the magic with the backends. Configuring the public HTTP frontend access:
theflockers ~ $ sudo vim /etc/nginx/conf.d/frontend.conf

# initialize the lua module
init_by_lua '
    rex = require "rex_posix"
    md5 = require "md5"
';

# main upload endpoint (PUT)
server {
    # default listen port, bind all addresses
    listen 80;
    server_name upload-endpoint.static.domain.tld;

    # need to resolve names
    # local resolver (dnsmasq)
    resolver 127.0.0.1;

    client_max_body_size 40m;

    access_log /var/log/nginx/static-upload.access.log;
    error_log  /var/log/nginx/static-upload.error.log;

    auth_basic "REST endpoint";
    auth_basic_user_file /etc/storage.pwd;

    location / {
        # your authorized upload source (or any, if open)
        allow 192.168.0.10;
        deny all;

        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header Host "static.domain.tld";
        proxy_pass http://127.0.0.1$uri;
    }
}

# READ ONLY
server {
    # default listen port, bind all addresses
    listen 80;
    server_name static.domain.tld;

    # need to resolve names
    # local resolver (dnsmasq)
    resolver 127.0.0.1;
    set_real_ip_from 127.0.0.1/32;

    # max file size
    client_max_body_size 40m;

    add_header Access-Control-Allow-Origin *;

    access_log /var/log/nginx/static.access.log;
    error_log  /var/log/nginx/static.error.log;

    root /tmp;

    # backends
    include "/etc/nginx/static-backends.conf";
    include "/etc/nginx/static-backup-backends.conf";

    set $backend "";
    set $backup_backend "";

    #
    # REWRITE DATA TO SEND TO THE CORRECT BACKEND
    #
    rewrite_by_lua '
        local uri = ngx.var.uri
        -- get the path to generate the storage path
        pattern = "^/(.*)$"
        file = rex.match(uri, pattern)
        sum = md5.sumhexa(file)
        bcode = sum:sub(1,1)
        stor = "/" .. sum:sub(1,1) .. "/" .. sum:sub(2,3)
        path = stor .. "/" .. file
        ngx.var.backend = "/" .. bcode
        ngx.var.backup_backend = bcode
        ngx.req.set_uri(path, true)
    ';

    try_files $uri $backend$uri /backup_${backup_backend}$uri =404;
}
Again, once that is done, let's adjust the DNS entries on the frontends. The best option would be to configure the backend information in your internal DNS server, but if you don't have DNS servers, you can do it in the /etc/hosts of your frontends. Not a best practice, but it will solve your problem. Example configuration for both frontends:
theflockers ~ $ sudo vim /etc/hosts

127.0.0.1    localhost localhost.localdomain
192.168.0.10 frontend01.domain.tld frontend01
192.168.0.11 frontend02.domain.tld frontend02
192.168.0.12 backend01.domain.tld backend01
192.168.0.13 backend02.domain.tld backend02
192.168.0.14 backend03.domain.tld backend03

# LIVE
192.168.0.12 1.static.domain.tld
192.168.0.12 2.static.domain.tld
192.168.0.12 3.static.domain.tld
192.168.0.12 4.static.domain.tld
192.168.0.12 5.static.domain.tld
192.168.0.13 6.static.domain.tld
192.168.0.13 7.static.domain.tld
192.168.0.13 8.static.domain.tld
192.168.0.13 9.static.domain.tld
192.168.0.13 0.static.domain.tld
192.168.0.14 a.static.domain.tld
192.168.0.14 b.static.domain.tld
192.168.0.14 c.static.domain.tld
192.168.0.14 d.static.domain.tld
192.168.0.14 e.static.domain.tld
192.168.0.14 f.static.domain.tld

# CROSS BACKUP
192.168.0.13 1.backup.static.domain.tld
192.168.0.13 2.backup.static.domain.tld
192.168.0.13 3.backup.static.domain.tld
192.168.0.13 4.backup.static.domain.tld
192.168.0.13 5.backup.static.domain.tld
192.168.0.14 6.backup.static.domain.tld
192.168.0.14 7.backup.static.domain.tld
192.168.0.14 8.backup.static.domain.tld
192.168.0.14 9.backup.static.domain.tld
192.168.0.14 0.backup.static.domain.tld
192.168.0.12 a.backup.static.domain.tld
192.168.0.12 b.backup.static.domain.tld
192.168.0.12 c.backup.static.domain.tld
192.168.0.12 d.backup.static.domain.tld
192.168.0.12 e.backup.static.domain.tld
192.168.0.12 f.backup.static.domain.tld
It's done! Restart NGINX on all nodes! Time to test! To test the solution, we need to put some data into the storage. Let's say I have a file named "myimage.jpg" and I want to upload it. I can do it using curl:
theflockers ~ $ curl -u login:password -X PUT -T myimage.jpg http://upload-endpoint.static.domain.tld/myimage.jpg
To GET the image:
theflockers ~ $ curl --head http://static.domain.tld/myimage.jpg
HTTP/1.1 200 OK
Date: Wed, 06 Aug 2014 03:59:08 GMT
Server: Nginx
Last-Modified: Tue, 01 Aug 2014 0:01:38 GMT
Accept-Ranges: bytes
Content-Length: 698836
Content-Type: image/jpg
How did it work?
- When you PUT the file, the Lua module calculated the Hash of “myimage.jpg” (a6a5e76984ac38726b09f825c31374c7);
- Got the first char ( substr($hash, 0, 1) => “a” ) for the dataset (disk);
- Built the storage backend URL (a.static.domain.tld);
- And the backup storage backend URL (a.backup.static.domain.tld);
- Proxied connection to backend “a”;
- Built the storage file location ( /media/storage/ + substr($hash, 0, 1) + ‘/’ + substr($hash, 1, 2) => “a/6a” );
- Stored the file at the disk (/media/storage/a/6a/myimage.jpg);
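The steps above can be reproduced from any shell with md5sum, to predict where a given file name will land (using the same domain names as in the configs; the `shard_of` helper is just for illustration):

```shell
# Predict shard, backend URLs and on-disk path for a file name,
# the same way the Lua rewrite does it.
shard_of() {
    name="$1"
    hash=$(printf '%s' "$name" | md5sum | cut -d' ' -f1)
    shard=$(printf '%s' "$hash" | cut -c1)      # first hex char -> backend
    subdir=$(printf '%s' "$hash" | cut -c2-3)   # chars 2-3 -> subdirectory
    echo "backend : http://$shard.static.domain.tld"
    echo "backup  : http://$shard.backup.static.domain.tld"
    echo "on disk : /media/storage/$shard/$subdir/$name"
}

shard_of myimage.jpg
```

Because the hash is computed from the file name alone, any frontend can locate any file without a central index, which is the whole point of the scheme.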
Ok guys! Hope this one could be useful!