Wed. Jan 22nd, 2025

This is the first article on my new blog, and it will be quick and direct! There is not much to explain.

Motivation

Terabytes of small files stored on an OCFS2 filesystem shared between two hosts were giving us a lot of trouble, for example:

  • Terrible I/O (probably DLM-related, since the distributed locking is done over the network);
  • Backups always outdated (mostly because of the problem above);
  • Split brain and STONITH;

And the most important one:

  • I hate cluster filesystems;

The project

Months ago, I started to think about how to break up this old-school “enterprise” architecture into something modern, scalable, highly available, and easy to handle, and I immediately thought of the way Amazon S3 works; for my project, I wanted the same simplicity and scalability that S3 has. Obviously, it would be more logical to use S3 instead of building something like it, but in my case I already had a private cloud infrastructure, so we chose to build something! At this point I don’t need the “buckets” feature that exists in S3, so this article will not cover it. To start, I wrote down on paper some requirements that I considered truly important for the solution:

  • virtualized infrastructure;

As mentioned, we already had our own private cloud.

  • distributed storage;

A great challenge. How to distribute the files and still know where they are?

  • hot backups;

Just in case of a node failure. We MUST have the files available.

  • cold backup;

A safe copy of the data.

  • restful api;

To handle files: uploads and deletes.

After a little research on how to get all of these features, I decided to use NGINX with the Lua and DAV modules, together with some data disks to distribute my files. The DAV module can handle the PUT and DELETE methods, so I didn’t need to write anything to handle the uploads and deletes. To divide and balance the file tree (shard), I used the MD5 algorithm, as I usually do; there is a quick sketch of the sharding rule right after the server list below. I drew the architecture with 5 servers; I will use some fictional IP addresses to illustrate it.

2 frontend servers (to serve and upload content):

  • frontend01 (192.168.0.10);
  • frontend02 (192.168.0.11);

3 (or more) storage servers (to attach my virtual disks and split the data):

  • backend01 (192.168.0.12);
  • backend02 (192.168.0.13);
  • backend03 (192.168.0.14).
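
Here is the sketch of the sharding rule mentioned above: the file name is hashed with MD5, and the first hex character of the digest picks the shard (and therefore the storage server) that owns the file, while characters 2–3 pick a subdirectory inside it. Using the example file from the end of this article:

theflockers ~ $ echo -n "myimage.jpg" | md5sum
a6a5e76984ac38726b09f825c31374c7  -

The first char (“a”) selects the backend, and chars 2–3 (“6a”) select the subdirectory, so the file ends up under a/6a/ on the server that owns shard “a”.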

The great thing here is that both the frontends and the storages run on top of NGINX!

Hands on

Install NGINX with the Lua and DAV modules enabled. I will not teach you how to do it; you can check how to do it here: http://wiki.nginx.org/HttpLuaModule#Installation and here: http://nginx.org/en/docs/http/ngx_http_dav_module.html. Also install dnsmasq on all 5 servers, to resolve DNS locally.
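
On a Debian-style distro, one hypothetical shortcut is the nginx-extras package, which ships both the Lua and DAV modules (adapt the package names to your distribution):

theflockers ~ $ sudo apt-get install nginx-extras dnsmasq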

Backend Servers

For each backend I decided to use 6 virtual disks: 3 volumes for DATA and 3 volumes for BACKUP, formatted as XFS, 100G each. Following the MD5 algorithm, I have 16 possible characters (0–9 and a–f) to split between the DATA disks. Tips:

How to create LVM volumes: http://tldp.org/HOWTO/LVM-HOWTO/
XFS how-to: http://techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi?coll=linux&db=bks&srch=&fname=/SGI_Admin/LX_XFS_AG/sgi_html/ch03.html

With all volumes created, let’s set up the sharding and mount them. First, split the 16 chars between the 3 storage servers, and we will have:

  • Storage server 1: chars 1, 2, 3, 4, 5;
  • Storage server 2: chars 6, 7, 8, 9, 0;
  • Storage server 3: chars a, b, c, d, e, f.

Considering you may want to mount them in /media:

Use the names you gave to the volumes

theflockers ~ $ sudo mkdir /media/data0{0,1,2}
theflockers ~ $ sudo mount -t xfs /dev/mapper/vol_data00 /media/data00 -o inode64,nobarrier
theflockers ~ $ sudo mount -t xfs /dev/mapper/vol_data01 /media/data01 -o inode64,nobarrier
theflockers ~ $ sudo mount -t xfs /dev/mapper/vol_data02 /media/data02 -o inode64,nobarrier

cross backup

theflockers ~ $ sudo mkdir /media/backup_data0{6,7,8}
theflockers ~ $ sudo mount -t xfs /dev/mapper/vol_backup_data06 /media/backup_data06 -o inode64,nobarrier
theflockers ~ $ sudo mount -t xfs /dev/mapper/vol_backup_data07 /media/backup_data07 -o inode64,nobarrier
theflockers ~ $ sudo mount -t xfs /dev/mapper/vol_backup_data08 /media/backup_data08 -o inode64,nobarrier

IMPORTANT: Don’t forget to put the new disks into /etc/fstab; mounting by disk UUID is preferred. A hypothetical entry for the first volume (the UUID is a placeholder; get the real one with blkid):
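
theflockers ~ $ sudo blkid /dev/mapper/vol_data00
/dev/mapper/vol_data00: UUID="2f5e9a1c-0000-0000-0000-000000000000" TYPE="xfs"
theflockers ~ $ sudo vim /etc/fstab
UUID=2f5e9a1c-0000-0000-0000-000000000000 /media/data00 xfs inode64,nobarrier 0 0

Repeat for each DATA and BACKUP volume. Continuing the sharding setup: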

root ~ # mkdir /media/data00/{1,2}
root ~ # mkdir /media/data01/{3,4}
root ~ # mkdir /media/data02/5
root ~ # mkdir /media/backup_data06/{a,b}
root ~ # mkdir /media/backup_data07/{c,d}
root ~ # mkdir /media/backup_data08/{e,f}

Then, consolidate all directories in a unique place:

theflockers ~ $ sudo mkdir -p /media/storage/{live,backup}
theflockers ~ $ cd /media/storage
theflockers /media/storage $ find /media/data0{0,1,2} -maxdepth 1 -mindepth 1 -type d | xargs -i sudo ln -s '{}' live/
theflockers /media/storage $ find /media/backup_data0{6,7,8} -maxdepth 1 -mindepth 1 -type d | xargs -i sudo ln -s '{}' backup/
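
If everything went right, the live tree on server 1 should look roughly like this (a hypothetical listing; dates and link details will differ):

theflockers /media/storage $ ls -l live/
lrwxrwxrwx 1 root root 15 Jan 22 10:00 1 -> /media/data00/1
lrwxrwxrwx 1 root root 15 Jan 22 10:00 2 -> /media/data00/2
lrwxrwxrwx 1 root root 15 Jan 22 10:00 3 -> /media/data01/3
lrwxrwxrwx 1 root root 15 Jan 22 10:00 4 -> /media/data01/4
lrwxrwxrwx 1 root root 15 Jan 22 10:00 5 -> /media/data02/5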

Repeat the disk configuration on the other storage servers and also set up the sharding:

** Server 2 **

theflockers ~ $ sudo mkdir /media/data0{3,4,5}
theflockers ~ $ sudo mount -t xfs /dev/mapper/vol_data03 /media/data03 -o inode64,nobarrier
theflockers ~ $ sudo mount -t xfs /dev/mapper/vol_data04 /media/data04 -o inode64,nobarrier
theflockers ~ $ sudo mount -t xfs /dev/mapper/vol_data05 /media/data05 -o inode64,nobarrier

cross backup

theflockers ~ $ sudo mkdir /media/backup_data0{0,1,2}
theflockers ~ $ sudo mount -t xfs /dev/mapper/vol_backup_data00 /media/backup_data00 -o inode64,nobarrier
theflockers ~ $ sudo mount -t xfs /dev/mapper/vol_backup_data01 /media/backup_data01 -o inode64,nobarrier
theflockers ~ $ sudo mount -t xfs /dev/mapper/vol_backup_data02 /media/backup_data02 -o inode64,nobarrier
theflockers ~ $ sudo mkdir /media/data03/{6,7}
theflockers ~ $ sudo mkdir /media/data04/{8,9}
theflockers ~ $ sudo mkdir /media/data05/0
theflockers ~ $ sudo mkdir /media/backup_data00/{1,2}
theflockers ~ $ sudo mkdir /media/backup_data01/{3,4}
theflockers ~ $ sudo mkdir /media/backup_data02/5
theflockers ~ $ sudo mkdir -p /media/storage/{live,backup}
theflockers ~ $ cd /media/storage
theflockers /media/storage $ find /media/data0{3,4,5} -maxdepth 1 -mindepth 1 -type d | xargs -i sudo ln -s '{}' live/
theflockers /media/storage $ find /media/backup_data0{0,1,2} -maxdepth 1 -mindepth 1 -type d | xargs -i sudo ln -s '{}' backup/

** Server 3 **

theflockers ~ $ sudo mkdir /media/data0{6,7,8}
theflockers ~ $ sudo mount -t xfs /dev/mapper/vol_data06 /media/data06 -o inode64,nobarrier
theflockers ~ $ sudo mount -t xfs /dev/mapper/vol_data07 /media/data07 -o inode64,nobarrier
theflockers ~ $ sudo mount -t xfs /dev/mapper/vol_data08 /media/data08 -o inode64,nobarrier

cross backup

theflockers ~ $ sudo mkdir /media/backup_data0{3,4,5}
theflockers ~ $ sudo mount -t xfs /dev/mapper/vol_backup_data03 /media/backup_data03 -o inode64,nobarrier
theflockers ~ $ sudo mount -t xfs /dev/mapper/vol_backup_data04 /media/backup_data04 -o inode64,nobarrier
theflockers ~ $ sudo mount -t xfs /dev/mapper/vol_backup_data05 /media/backup_data05 -o inode64,nobarrier
theflockers ~ $ sudo mkdir /media/data06/{a,b}
theflockers ~ $ sudo mkdir /media/data07/{c,d}
theflockers ~ $ sudo mkdir /media/data08/{e,f}
theflockers ~ $ sudo mkdir /media/backup_data03/{6,7}
theflockers ~ $ sudo mkdir /media/backup_data04/{8,9}
theflockers ~ $ sudo mkdir /media/backup_data05/0
theflockers ~ $ sudo mkdir -p /media/storage/{live,backup}
theflockers ~ $ cd /media/storage
theflockers /media/storage $ find /media/data0{6,7,8} -maxdepth 1 -mindepth 1 -type d | xargs -i sudo ln -s '{}' live/
theflockers /media/storage $ find /media/backup_data0{3,4,5} -maxdepth 1 -mindepth 1 -type d | xargs -i sudo ln -s '{}' backup/

So let’s start the NGINX configuration. Once you have all your backend servers configured, it’s time to set up NGINX to serve the paths you created:

theflockers ~ $ sudo vim /etc/nginx/conf.d/backend.conf
server {

  listen 80;

  server_name ~^(?P<shard>[0-9a-f])\.static\.domain\.tld$;
  server_name ~^(?P<shard>[0-9a-f])\.backup\.static\.domain\.tld$;

  access_log /shop/logs/live/nginx/static.access.log;
  error_log /shop/logs/live/nginx/static.error.log;

  root /media/storage;

  # try to reach live files; if missing, fall back to the backup copy
  try_files /live$uri /backup$uri =404;
  client_max_body_size 20m;

  location / {
    root /media/storage/live;

    # DAV configurations
    dav_methods PUT DELETE;
    dav_access user:rw group:rw all:r;
    create_full_put_path on;

    # Important! Limit only to your frontends
    limit_except GET {

      # put your real network
      allow 192.168.0.0/24;
      deny all;
    }
  }
}
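
Before moving on, you can sanity-check the DAV setup on one backend. A hypothetical smoke test (run from a host inside the allowed 192.168.0.0/24 network, assuming the DNS names set up later in this article already resolve; the file name is illustrative):

theflockers ~ $ curl -X PUT -T test.txt http://a.static.domain.tld/a/6a/test.txt
theflockers ~ $ curl --head http://a.static.domain.tld/a/6a/test.txt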

Copy this config to all your backend (storage) servers. Once you’re done, it’s time to set up your frontends.

Frontend Servers

Frontend servers don’t need much disk, unless you want to keep your logs locally; I advise you to send the logs to a remote log server. We need to reach the backend servers. First, let’s configure the access to live files:

theflockers ~ $ sudo vim /etc/nginx/static-backends.conf
#
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

location ~ '^/([0-9a-f])/' {
    dav_methods PUT DELETE;
    proxy_pass http://$1.static.domain.tld$uri;
}
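
The $1 in proxy_pass is the shard character captured by the location regex: a request for /a/6a/myimage.jpg is proxied to http://a.static.domain.tld/a/6a/myimage.jpg, which matches the shard server_name on the backend that owns shard “a”.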

Then, backups:

theflockers ~ $ sudo vim /etc/nginx/static-backup-backends.conf
#
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

location ~ '^/backup_([0-9a-f])/' {
    dav_methods PUT DELETE;
    proxy_pass http://$1.backup.static.domain.tld$uri;
}

Now we’ll use the Lua module to do the magic with the backends. Configuring HTTP public frontend access:
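
The rex_posix and md5 Lua modules required below are not bundled with NGINX. Assuming you manage Lua modules with LuaRocks, one hypothetical way to install them (the rocks are named lrexlib-posix and md5; adjust to your environment):

theflockers ~ $ sudo luarocks install lrexlib-posix
theflockers ~ $ sudo luarocks install md5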

theflockers ~ $ sudo vim /etc/nginx/conf.d/frontend.conf
# initialize the lua module

init_by_lua '
  rex = require "rex_posix"
  md5 = require "md5"
';

# main upload endpoint (PUT)
server {

  # default listen port and bind all addresses
  listen 80;
  server_name upload-endpoint.static.domain.tld;

  # need to resolve names
  # local resolver (dnsmasq)
  resolver 127.0.0.1;
  client_max_body_size 40m;
  access_log /var/log/nginx/static-upload.access.log;
  error_log /var/log/nginx/static-upload.error.log;
  
  auth_basic "REST endpoint";
  auth_basic_user_file /etc/storage.pwd;
  
  location / {
    # your authorized upload source or any, if open
    allow 192.168.0.10;
    deny all;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header Host "static.domain.tld";
    proxy_pass http://127.0.0.1$uri;
  }
}

# READ ONLY

server {
  # default listen port and bind all addresses
  listen 80;
  server_name static.domain.tld;

  # need to resolve names
  # local resolver (dnsmasq)
  resolver 127.0.0.1;

  set_real_ip_from 127.0.0.1/32;

  # max file size
  client_max_body_size 40m;
  add_header Access-Control-Allow-Origin *;

  access_log /var/log/nginx/static.access.log;
  error_log /var/log/nginx/static.error.log;

  root /tmp;

  # backends
  include "/etc/nginx/static-backends.conf";
  include "/etc/nginx/static-backup-backends.conf";

  set $backend "";
  set $backup_backend "";

  #
  # REWRITE DATA TO SEND TO CORRECT BACKEND
  #
  rewrite_by_lua '
    local uri = ngx.var.uri
    -- strip the leading slash to get the file name used for hashing
    local pattern = "^/(.*)$"
    local file = rex.match(uri, pattern)
    local sum = md5.sumhexa(file)
    local bcode = sum:sub(1,1)

    local stor = "/" .. bcode .. "/" .. sum:sub(2,3)
    local path = stor .. "/" .. file
    ngx.var.backend = "/" .. bcode
    ngx.var.backup_backend = bcode
    ngx.req.set_uri(path, true)
  ';

  try_files $uri $backend$uri /backup_${backup_backend}$uri =404;
}

Again, once that’s done, let’s adjust the DNS entries on the frontends. The best option would be to configure the backend information in your internal DNS server, but if you don’t have one, you can do it in /etc/hosts on your frontends. Not a best practice, but it will solve your problem. Example configuration for both frontends:

theflockers ~ $ sudo vim /etc/hosts
127.0.0.1 localhost localhost.localdomain
192.168.0.10 frontend01.domain.tld frontend01
192.168.0.11 frontend02.domain.tld frontend02
192.168.0.12 backend01.domain.tld backend01
192.168.0.13 backend02.domain.tld backend02
192.168.0.14 backend03.domain.tld backend03

# LIVE

192.168.0.12 1.static.domain.tld
192.168.0.12 2.static.domain.tld
192.168.0.12 3.static.domain.tld
192.168.0.12 4.static.domain.tld
192.168.0.12 5.static.domain.tld
192.168.0.13 6.static.domain.tld
192.168.0.13 7.static.domain.tld
192.168.0.13 8.static.domain.tld
192.168.0.13 9.static.domain.tld
192.168.0.13 0.static.domain.tld
192.168.0.14 a.static.domain.tld
192.168.0.14 b.static.domain.tld
192.168.0.14 c.static.domain.tld
192.168.0.14 d.static.domain.tld
192.168.0.14 e.static.domain.tld
192.168.0.14 f.static.domain.tld

# CROSS BACKUP
192.168.0.13 1.backup.static.domain.tld
192.168.0.13 2.backup.static.domain.tld
192.168.0.13 3.backup.static.domain.tld
192.168.0.13 4.backup.static.domain.tld
192.168.0.13 5.backup.static.domain.tld
192.168.0.14 6.backup.static.domain.tld
192.168.0.14 7.backup.static.domain.tld
192.168.0.14 8.backup.static.domain.tld
192.168.0.14 9.backup.static.domain.tld
192.168.0.14 0.backup.static.domain.tld
192.168.0.12 a.backup.static.domain.tld
192.168.0.12 b.backup.static.domain.tld
192.168.0.12 c.backup.static.domain.tld
192.168.0.12 d.backup.static.domain.tld
192.168.0.12 e.backup.static.domain.tld
192.168.0.12 f.backup.static.domain.tld

It’s done! Restart NGINX on all nodes! Time to test! To test the solution, we need to put some data into the storage. Let’s say I have a file named “myimage.jpg” and I want to upload it. I can do it using curl:

theflockers ~ $ curl -u login:password -X PUT -T myimage.jpg http://upload-endpoint.static.domain.tld/myimage.jpg

To GET the image:

theflockers ~ $ curl --head http://static.domain.tld/myimage.jpg
HTTP/1.1 200 OK
Date: Wed, 06 Aug 2014 03:59:08 GMT
Server: nginx
Last-Modified: Fri, 01 Aug 2014 00:01:38 GMT
Accept-Ranges: bytes
Content-Length: 698836
Content-Type: image/jpeg
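
And since the DAV module also handles DELETE, removing the file should work through the same upload endpoint (same hypothetical credentials as in the PUT example):

theflockers ~ $ curl -u login:password -X DELETE http://upload-endpoint.static.domain.tld/myimage.jpg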

How did it work?

  • When you PUT the file, the Lua code calculated the MD5 hash of “myimage.jpg” (a6a5e76984ac38726b09f825c31374c7);
  • Got the first char ( substr($hash, 0, 1) => “a” ) to pick the dataset (disk);
  • Built the storage backend URL (a.static.domain.tld);
  • And the backup storage backend URL (a.backup.static.domain.tld);
  • Proxied the connection to backend “a”;
  • Built the storage file location ( /media/storage/ + substr($hash, 0, 1) + ‘/’ + substr($hash, 1, 2) => “a/6a” );
  • Stored the file on disk (/media/storage/live/a/6a/myimage.jpg on the backend);
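
You can double-check on the storage node itself; for shard “a” that is backend03 in this setup (a hypothetical listing; owner and timestamps will differ):

theflockers ~ $ ssh backend03 ls -l /media/storage/live/a/6a/
-rw-rw-r-- 1 nginx nginx 698836 Aug  6 03:50 myimage.jpg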

Ok guys! Hope this one is useful!
