 # Project
 
-To run this project use:
-
-```sh
-make run
-```
-
-To compile this project use:
-
-```sh
-make build
-```
-
-# TODO
-- [x] createdb.py
-- [x] testdb.py
-- [x] ls.py
-- [ ] meta-data.py
-  - [x] implement `reg` to receive information from the data nodes
-
-
-# Assignment 04: Distributed File systems
-
-The components to implement are:
-
-* **Metadata server**, which will function as an inodes repository
-* **Data servers**, that will serve as the disk space for file data blocks
-* **List client**, that will list the files available in the DFS
-* **Copy client**, that will copy files from and to the DFS
-
-# Objectives
-
-* Study the main components of a distributed file system
-* Get familiarized with File Management
-* Implementation of a distributed system
-
-# Prerequisites
-
-* Python:
-  * [www.python.org](http://www.python.org/)
-* Python SocketServer library: for **TCP** socket communication.
-  * https://docs.python.org/3/library/socketserver.html
-* uuid: to generate unique IDs for the data blocks
-  * https://docs.python.org/3/library/uuid.html
-* **Optionally** you may read about the json and sqlite3 libraries used in the
-skeleton of the program.
-  * https://docs.python.org/3/library/json.html
-  * https://docs.python.org/3/library/sqlite3.html
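As a one-line editorial illustration of the uuid prerequisite listed above (a sketch, not project code):

```python
import uuid

# A fresh, effectively unique id for one data block
# (hypothetical usage; the real data node may format ids differently).
block_id = uuid.uuid4().hex  # 32 hexadecimal characters
```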
-
-### **The metadata server's database manipulation functions.**
-
-No expertise in database management is required to accomplish this project.
-However, sqlite3 is used to store the file inodes in the metadata server. You
-don't need to understand the functions, but you need to read the documentation
-of the functions that interact with the database. The metadata server database
-functions are defined in file mds_db.py.
+This is a distributed filesystem written in C!
 
-#### **Inode**
+## Documentation
 
-For this implementation an **inode** consists of:
+You can read the documentation by:
+- simply opening the source files in `./src`
+- running `make docs`. This will create a directory called `docs`. Open it in
+your browser like `file:///path/to/this/project/docs/index.html`.
+- visiting the website: [dfs-docs](https://sona-tau.github.io/dfs-docs/files.html).
 
-* File name
-* File size
-* List of blocks
+## Usage
 
-#### **Block List**
+First, you have to set up the project.
 
-The **block list** consists of a list of:
+```sh
+make all
+```
 
-* data node address - to know the data node where the block is stored
-* data node port - to know the service port of the data node
-* data node block_id - the id assigned to the block
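To make the inode and block-list layout described above concrete, here is an editorial Python sketch (the field names are assumptions, not the real mds_db.py schema):

```python
# Hypothetical in-memory form of one inode as described above:
# file name, file size, and the list of (address, port, block_id) blocks.
inode = {
    "fname": "/home/cheo/asig.cpp",
    "fsize": 30,
    "blocks": [
        ("136.145.10.2", 8001, "a1b2c3"),  # block 0 lives on data node 1
        ("136.145.10.3", 8002, "d4e5f6"),  # block 1 lives on data node 2
    ],
}

def block_count(inode):
    """Number of data blocks referenced by an inode."""
    return len(inode["blocks"])
```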
+This will compile all the necessary executables and put them in a directory
+called `./.build`. The executables provided are:
+- `metadata-server`
+- `data-node`
+- `ls`
+- `copy`
 
-Functions:
-
-* AddDataNode(address, port): Adds a new data node to the metadata server.
-Receives the IP address and port, i.e. the information to connect to the data node.
-
-* GetDataNodes(): Returns a list of registered data node tuples **(address, port)**.
-Useful to know to which data nodes the data blocks can be sent.
-* InsertFile(filename, fsize): Inserts a filename with its file size into the
-database.
-* GetFiles(): Returns a list of the attributes of the files stored in the DFS:
-(filename, file size)
-* AddBlockToInode(filename, blocks): Adds the list of data block information of
-a file. The data block information consists of (address, port, block_id).
-* GetFileInode(filename): Returns the file size and the list of data block
-information of a file: (fsize, block_list)
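The real implementations live in mds_db.py; as a hedged illustration of the sqlite3 pattern behind functions like InsertFile and GetFiles, here is a self-contained sketch (table and column names are assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # the project stores dfs.db on disk
conn.execute("CREATE TABLE inode (fname TEXT PRIMARY KEY, fsize INT)")

def InsertFile(fname, fsize):
    """Insert a filename with its file size into the database."""
    conn.execute("INSERT INTO inode VALUES (?, ?)", (fname, fsize))
    conn.commit()

def GetFiles():
    """Return (fname, fsize) tuples for every file in the DFS."""
    return conn.execute("SELECT fname, fsize FROM inode").fetchall()

InsertFile("/home/hola.txt", 200)
```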
-
-### **The packet manipulation functions:**
-
-The packet library is designed to serialize the communication data using the
-json library. No expertise with json is required to accomplish this assignment.
-These functions were developed to ease the packet generation process of the
-project. The packet library is defined in file Packet.py.
-
-In this project all packet objects have a packet type among the following
-command type options:
-
-* reg: to register a data node
-* list: to ask for a list of files
-* put: to put a file in the DFS
-* get: to get files from the DFS
-* dblks: to add the data block ids to the files.
-
-#### **Functions:**
-
-##### **General Functions**
-
-* getEncodedPacket(): returns a serialized packet ready to send through the
-network. First you need to build the packets. See Build**<X>**Packet
-functions.
-* DecodePacket(packet): Receives a serialized message and turns it into a
-packet object.
-* getCommand(): Returns the command type of the packet.
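The general functions above boil down to json round-trips. A minimal editorial sketch of the idea (field names are assumptions, not the actual Packet.py layout):

```python
import json

def build_reg_packet(addr, port):
    # A registration packet: command type plus the node's contact info.
    return {"command": "reg", "addr": addr, "port": port}

def encode_packet(packet):
    # Serialize the packet to bytes, ready to send over a TCP socket.
    return json.dumps(packet).encode()

def decode_packet(data):
    # Turn a serialized message back into a packet object.
    return json.loads(data.decode())

wire = encode_packet(build_reg_packet("10.0.0.5", 8001))
packet = decode_packet(wire)
```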
-
-##### **Packet Registration Functions**
-
-* BuildRegPacket(addr, port): Builds a registration packet.
-* getAddr(): Returns the IP address of a server. Useful for registration
-packets.
-* getPort(): Returns the port number of a server. Useful for registration
-packets.
-
-##### **Packet List Functions**
-
-* BuildListPacket(): Builds a list packet for file listing.
-* BuildListResponse(filelist): Builds a list response packet with the list of
-files.
-* getFileArray(): Returns a list of files.
-
-##### **Get Packet Functions**
-
-* BuildGetPacket(fname): Builds a get packet to get a file name.
-* BuildGetResponse(metalist, fsize): Builds a list of data node servers with
-the blocks of a file, and the file size.
-* getFileName(): Returns the file name in a packet.
-* getDataNodes(): Returns a list of data servers.
-
-##### **Put Packet Functions (Put Blocks)**
-
-* BuildPutPacket(fname, size): Builds a put packet to put fname and file size
-in the metadata server.
-* getFileInfo(): Returns the file info in a packet.
-* BuildPutResponse(metalist): Builds a list of data node servers where the data
-blocks of a file can be stored, i.e. a list of available data servers.
-* BuildDataBlockPacket(fname, block_list): Builds a data block packet.
-Contains the file name and the list of blocks for the file. See [block
-list](http://ccom.uprrp.edu/~jortiz/clases/ccom4017/asig04/#block_list) to
-review the content of a block list.
-* getDataBlocks(): Returns a list of data blocks.
-
-##### **Get Data block Functions (Get Blocks)**
-
-* BuildGetDataBlockPacket(blockid): Builds a get data block packet. Useful
-when requesting a data block from a data node.
-* getBlockID(): Returns the block_id from a packet.
-
-# Instructions
-
-Write and complete code for an unreliable and insecure distributed file server
-following the specifications below.
-
-### **Design specifications.**
-
-For this project you will design and complete a distributed file system. You
-will write a DFS with tools to list the files, and to copy files from and to
-the DFS.
-
-Your DFS will consist of:
-
-* A metadata server: which will contain the metadata (inode) information of the
-files in your file system. It will also keep a registry of the data servers
-that are connected to the DFS.
-* Data nodes: The data nodes will contain chunks (some blocks) of the files that
-you are storing in the DFS.
-* List command: A command to list the files stored in the DFS.
-* Copy command: A command that will copy files from and to the DFS.
-
-### **The metadata server**
-
-The metadata server contains the metadata (inode) information of the files in
-your file system. It will also keep a registry of the data servers that are
-connected to the DFS.
-
-Your metadata server must provide the following services:
-
-1. Listen to the data nodes that are part of the DFS. Every time a new data
-node registers with the DFS, the metadata server must keep the contact
-information of that data node, that is, (IP Address, Listening Port).
-    * To ease the implementation of the DFS, the directory file system must
-contain three things:
-        * the path of the file in the file system (filename)
-        * the nodes that contain the data blocks of the files
-        * the file size
-2. Every time a client (commands list or copy) contacts the metadata server
-for:
-    * get: requesting to read a file: the metadata server must check if the
-file is in the DFS database, and if it is, it must return the nodes with the
-block_ids that contain the file.
-    * put: requesting to write a file: the metadata server must:
-        * insert in the database the path of the new file (with its name), and
-its size.
-        * return a list of available data nodes where to write the chunks of
-the file.
-    * dblks: then store the data blocks that have the information of the data
-nodes and the block ids of the file.
-    * list: requesting to list files:
-        * the metadata server must return a list with the files in the DFS and
-their size.
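The services above can be pictured as one dispatch over the packet's command type. This is an editorial sketch only: plain dicts stand in for the sqlite database and the TCP layer, and all names are assumptions:

```python
def handle_packet(packet, db):
    """Dispatch one client/data-node request by its command type.

    Sketch: db is {"nodes": [...], "inodes": {...}}; the real server
    would read packets from sockets and use the mds_db functions.
    """
    cmd = packet["command"]
    if cmd == "reg":    # a data node announcing (addr, port)
        db["nodes"].append((packet["addr"], packet["port"]))
        return {"status": "ok"}
    if cmd == "list":   # files in the DFS with their sizes
        return {"files": [(f, i["fsize"]) for f, i in db["inodes"].items()]}
    if cmd == "put":    # record a new file, offer data nodes to write to
        db["inodes"][packet["fname"]] = {"fsize": packet["fsize"], "blocks": []}
        return {"nodes": db["nodes"]}
    if cmd == "dblks":  # attach (addr, port, block_id) tuples to the inode
        db["inodes"][packet["fname"]]["blocks"] += packet["blocks"]
        return {"status": "ok"}
    if cmd == "get":    # return the block list for an existing file
        inode = db["inodes"].get(packet["fname"])
        return {"blocks": inode["blocks"]} if inode else {"error": "not found"}
    return {"error": "unknown command"}
```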
+NOTE: The `make all` command also produces lots of other intermediary files
+that are useful for debugging and testing. Please ignore these!
 
-The metadata server must be run:
+> If this is not the first time you run the project, you might want to clear the
+> data directory. In the following configuration, you can do this by simply
+> running `make clean-data`.
 
-python meta-data.py <port, default=8000>
+The first thing you have to do after compiling everything is to create the
+database. To do this, run the following command:
 
-If no port is specified, port 8000 will be used by default.
+```sh
+createdb
+```
 
-### **The data node server**
+This command does not take any parameters.
 
-The data node is the process that receives and saves the data blocks of the
-files. It must first register with the metadata server as soon as it starts its
-execution. The data node receives the data from the clients when the client
-wants to write a file, and returns the data when the client wants to read a
-file.
+Then, you have to start the metadata server:
 
-Your data node must provide the following services:
+```sh
+metadata-server Port
+```
 
-1. put: Listen to writes:
-    * The data node will receive blocks of data, store them using a unique id,
-and return the unique id.
-    * Each node must have its own block storage path. You may run more than one
-data node per system.
-2. get: Listen to reads:
-    * The data node will receive requests for data blocks, and it must read the
-data block and return its content.
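The put/get services above reduce to: store bytes under a fresh unique id and hand the id back; later, read the bytes by id. A self-contained editorial sketch (the flat on-disk layout is an assumption):

```python
import os
import tempfile
import uuid

def put_block(data_path, data):
    """Store one data block under a fresh unique id; return the id."""
    block_id = uuid.uuid4().hex
    with open(os.path.join(data_path, block_id), "wb") as f:
        f.write(data)
    return block_id

def get_block(data_path, block_id):
    """Read a stored data block back by its id."""
    with open(os.path.join(data_path, block_id), "rb") as f:
        return f.read()

# Demo against a throwaway directory standing in for the node's data path.
data_path = tempfile.mkdtemp()
bid = put_block(data_path, b"hello blocks")
```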
+where:
+- `Port` is any valid port number. This is the port that the
+metadata server will be listening on.
 
-The data nodes must be run:
+After starting the metadata server, start a couple of data nodes:
 
-    python data-node.py <server address> <port> <data path> <metadata
-port, default=8000>
+```sh
+data-node IPv4 Port Path Port
+```
 
-The server address is the metadata server address, the port is the data node's
-port number, the data path is a path to a directory to store the data blocks,
-and the metadata port is the optional metadata port if it was run on a port
-other than the default port.
+where:
+- `IPv4` is any valid `IPv4` address. This is the IP address of the metadata
+server.
+- the first `Port` is any valid port number. This is the port that the data
+node will be listening on.
+- `Path` is the file path to the data directory for this data node.
+- the second `Port` is any valid port number. This is the port that the
+metadata server is listening on.
 
-**Note:** Since you most probably do not have many different computers at your
-disposal, you may run more than one data node on the same computer, but their
-listening ports and their data block directories must be different.
+You can now copy files to and from the server. To do this, use the `copy`
+command:
 
-### **The list client**
+```sh
+copy IPv4 Port [-s] Path [-s] Path
+```
 
-The list client just sends a list request to the metadata server and then waits
-for a list of file names with their size.
+where:
+- `IPv4` is any valid `IPv4` address. This is the IP address of the metadata
+server.
+- `Port` is any valid port number. This is the port that the
+metadata server is listening on.
+- the first `Path` is the file path to the source file that you want to copy.
+- the second `Path` is the file path to the destination of the file you want to copy.
+Notice the `[-s]`: it must be supplied exactly once. This flag
+indicates that the next path represents a file that is on the server. For
+example, the following are correct ways to use this command:
 
-The output must look like:
+```sh
+copy 136.145.10.2 42069 -s /home/root/.bashrc /home/cheo/.bashrc
+copy 136.145.10.2 42069 /etc/passwd -s /home/sona/important_files.txt
+```
 
-/home/cheo/asig.cpp 30 bytes
-/home/hola.txt 200 bytes
-/home/saludos.dat 2000 bytes
+The following would be incorrect ways to use this command:
 
-The list client must be run:
+```sh
+copy 127.0.0.1 58008 -s /home/root/.bashrc -s /home/cheo/.bashrc
+# ERROR:             ^^                    ^^
+# The -s flag appears twice!
 
-python ls.py <server>:<port, default=8000>
+copy 127.0.0.1 58008 /etc/passwd /home/sona/important_files.txt
+# ERROR:             ^           ^
+# The -s flag does not appear!
+```
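A hypothetical helper showing how the copy client could enforce the `-s` placement rule illustrated above (an editorial sketch, not the project's actual parser):

```python
def parse_copy_args(argv):
    """Validate copy's arguments: IPv4, Port, then two paths, with
    exactly one of them preceded by -s (the server-side path).

    Returns (addr, port, paths, server_side) where server_side is
    "src" or "dst". Hypothetical helper; the real copy client may differ.
    """
    addr, port, rest = argv[0], int(argv[1]), argv[2:]
    if len(rest) != 3 or rest.count("-s") != 1 or rest[2] == "-s":
        raise ValueError("exactly one path must be marked with -s")
    paths = [p for p in rest if p != "-s"]
    server_side = "src" if rest[0] == "-s" else "dst"
    return addr, port, paths, server_side
```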
 
-Where server is the metadata server IP, and port is the metadata server port. If
-the default port is not indicated, the default port 8000 is used and no ':'
-character is necessary.
 
-### **The copy client**
+To list the files that are on the server, you can use the `ls` command:
 
-The copy client is more complicated than the list client. It is in charge of
-copying the files from and to the DFS.
+```sh
+ls IPv4 Port
+```
 
-The copy client must:
+where:
+- `IPv4` is any valid `IPv4` address. This is the IP address of the metadata
+server.
+- `Port` is any valid port number. This is the port that the
+metadata server is listening on.
 
-1. Write files in the DFS
-    * The client must send to the metadata server the file name and size of the
-file to write.
-    * Wait for the metadata server response with the list of available data
-nodes.
-    * Send the data blocks to each data node.
-        * You may decide to divide the file over the number of data servers.
-        * You may divide the file into X size blocks and send them to the data
-servers in round robin.
-2. Read files from the DFS
-    * Contact the metadata server with the file name to read.
-    * Wait for the block list with the block id and data server information.
-    * Retrieve the file blocks from the data servers.
-    * This part will depend on the division algorithm used in step (1).
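The round-robin option in step 1 above can be sketched as follows (the block size and helper name are assumptions; a reader reassembles the file by walking the same sequence):

```python
def split_round_robin(data, nodes, block_size=1024):
    """Cut data into block_size chunks and deal them to the data
    nodes in round-robin order. Returns (node, chunk) pairs in
    file order."""
    assignments = []
    for i in range(0, len(data), block_size):
        node = nodes[(i // block_size) % len(nodes)]
        assignments.append((node, data[i:i + block_size]))
    return assignments
```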
+### Example
 
-The copy client must be run:
+If you want to run this project as an example, try the following:
 
-Copy from DFS:
+First, make sure you compile all the executables.
+```sh
+make all
+```
 
-python copy.py <server>:<port>:<dfs file path> <destination file>
+Remember, this will put all the executables in a directory called `.build`.
 
-To DFS:
+Then, launch the metadata server first!
 
-python copy.py <source file> <server>:<port>:<dfs file path>
 
-Where server is the metadata server IP address, and port is the metadata server
-port.
+```sh
+./.build/metadata-server 127.0.0.1 42069
+```
 
-# Creating an empty database
+Leave this running in the background. Now launch several data nodes in other
+terminals:
 
-The script createdb.py generates an empty database *dfs.db* for the project.
+Terminal 1:
+```sh
+./.build/data-node 127.0.0.1 8001 ./.build/d1 42069
+```
 
-    python createdb.py
+Terminal 2:
+```sh
+./.build/data-node 127.0.0.1 8002 ./.build/d2 42069
+```
 
-# Deliverables
+Terminal 3:
+```sh
+./.build/data-node 127.0.0.1 8003 ./.build/d3 42069
+```
 
-* The source code of the programs (well documented)
-* A README file with:
-    * a description of the programs, including a brief description of how they
-work.
-    * who helped you or discussed issues with you to finish the program.
-* A video description of the project with implementation details. For any
-doubts, please consult the professor.
+I recommend that you put the data directories for the data nodes inside `.build`
+so that cleaning up is a lot easier.
 
-# Rubric
+Now we're ready to start copying files!
 
-* (10 pts) the programs run
-* (80 pts) quality of the working solutions
-    * (20 pts) Metadata server implemented correctly
-    * (25 pts) Data server implemented correctly
-    * (10 pts) List client implemented correctly
-    * (25 pts) Copy client implemented correctly
-* (10 pts) quality of the README
-    * (10 pts) description of the programs.
-* No project will be graded without submission of the video explaining how the
-project was implemented.
+There is a `./test.sh` script that can help you test out your files. It will
+create a 500 MB file. I tested this with files up to 5 GB, so if you want to
+try that, just add another 0 in `./test.sh`.
test.sh (+17 −10)
 #!/usr/bin/env sh
 
-gum log -l "info" -t ansic "Erasing test files"
+mylog() {
+    local MSG="$*"
+    local DATE="$(date +"%a %b %d %H:%M:%S %Y")"
+
+    printf '%s \033[96mINFO\033[0m %s\n' "$DATE" "$MSG"
+}
+
+mylog "Erasing test files"
 echo "rm 500MB.bin another_500MB.bin"
 rm 500MB.bin another_500MB.bin
 
 echo ""
 
-gum log -l "info" -t ansic "Testing ls"
+mylog "Testing ls"
 echo "./.build/ls 127.0.0.1 8000"
 ./.build/ls 127.0.0.1 8000
 
 echo ""
-gum log -l "info" -t ansic "Testing copy from client to server"
+mylog "Testing copy from client to server"
 echo ""
 
-gum log -l "info" -t ansic "Creating 500MB file with random bytes, called \"500MB.bin\""
+mylog "Creating 500MB file with random bytes, called \"500MB.bin\""
 echo "cat /dev/random | head -c500000000 > 500MB.bin"
 cat /dev/random | head -c500000000 > 500MB.bin
 
-gum log -l "info" -t ansic "The size of 500MB.bin is:"
+mylog "The size of 500MB.bin is:"
 echo "cat 500MB.bin | wc -c"
 cat 500MB.bin | wc -c
 
 echo ""
 
-gum log -l "info" -t ansic "Copying file to the server"
+mylog "Copying file to the server"
 echo "./.build/copy 127.0.0.1 8000 500MB.bin -s /somewhere/in/the/server/500MB.bin"
 ./.build/copy 127.0.0.1 8000 500MB.bin -s /somewhere/in/the/server/500MB.bin
 
 echo ""
 
-gum log -l "info" -t ansic "Testing ls"
+mylog "Testing ls"
 echo "./.build/ls 127.0.0.1 8000"
 ./.build/ls 127.0.0.1 8000
 
 echo ""
-gum log -l "info" -t ansic "Testing copy from server to client"
+mylog "Testing copy from server to client"
 echo ""
 
-gum log -l "info" -t ansic "Copying file from the server to the client"
+mylog "Copying file from the server to the client"
 echo "./.build/copy 127.0.0.1 8000 -s /somewhere/in/the/server/500MB.bin another_500MB.bin"
 ./.build/copy 127.0.0.1 8000 -s /somewhere/in/the/server/500MB.bin another_500MB.bin
 
 echo ""
 
-gum log -l "info" -t ansic "Checking if both files are the same"
+mylog "Checking if both files are the same"
 diff -s 500MB.bin another_500MB.bin