File Storage System: Organize Files By Type & Extension
This article explores how to implement a file storage system that automatically organizes files based on their type and extension. Sorting uploads into predictable directories simplifies file management and makes individual files faster to locate. We'll cover the architecture, implementation details, and testing strategy for building such a system.
Objective: Type and Extension Based File Storage
The primary objective is to create a simple, practical file storage system that organizes files by their type and extension. This eliminates the need for complex AI-driven classification in the initial version (V1). The system aims to streamline file management by automatically categorizing files into predefined directories based on their file type, such as images, videos, documents, and more. This ensures that users can quickly locate and access their files without manually sorting them.
Key Benefits
- Enhanced Organization: Files are automatically sorted into appropriate categories.
- Improved Retrieval Speed: Quickly locate files by navigating through organized directories.
- Simplified Management: Reduces the manual effort required to manage and maintain files.
- Scalability: The system can be easily scaled to accommodate growing storage needs.
Problem Statement Alignment
This file storage system directly addresses the need to "Accept any media type through a unified frontend" and provides "Support for other file types, organizing them appropriately." By automatically categorizing files, the system ensures that all uploaded media, regardless of type, are stored in a logical and accessible manner. This alignment with the problem statement makes the system a valuable component of a comprehensive file management solution.
Storage Organization Strategy
The storage organization strategy revolves around a well-defined directory structure that categorizes files based on their type and extension. This structure ensures that files are stored in a logical and easily navigable manner.
Directory Structure
The directory structure is designed to be intuitive and scalable. The root directory, /storage/, contains several subdirectories, each representing a specific file type. Within each file type directory, further subdirectories are created for different file extensions. This hierarchical structure ensures that files are organized in a granular manner, making it easy to locate specific files.
/storage/
├── images/
│ ├── jpg/
│ ├── png/
│ ├── gif/
│ ├── svg/
│ ├── webp/
│ └── bmp/
├── videos/
│ ├── mp4/
│ ├── avi/
│ ├── mov/
│ ├── mkv/
│ ├── webm/
│ └── flv/
├── audio/
│ ├── mp3/
│ ├── wav/
│ ├── flac/
│ └── ogg/
├── documents/
│ ├── pdf/
│ ├── docx/
│ ├── doc/
│ ├── txt/
│ ├── rtf/
│ └── md/
├── spreadsheets/
│ ├── xlsx/
│ ├── xls/
│ └── csv/
├── presentations/
│ ├── pptx/
│ └── ppt/
├── archives/
│ ├── zip/
│ ├── tar/
│ ├── gz/
│ └── rar/
├── code/
│ ├── py/
│ ├── js/
│ ├── go/
│ ├── java/
│ └── cpp/
└── other/
  └── unknown/
Explanation of Directories
- /images/: Stores image files, further categorized by extensions like JPG, PNG, GIF, SVG, WEBP, and BMP.
- /videos/: Stores video files, categorized by extensions like MP4, AVI, MOV, MKV, WEBM, and FLV.
- /audio/: Stores audio files, categorized by extensions like MP3, WAV, FLAC, and OGG.
- /documents/: Stores document files, categorized by extensions like PDF, DOCX, DOC, TXT, RTF, and MD.
- /spreadsheets/: Stores spreadsheet files, categorized by extensions like XLSX, XLS, and CSV.
- /presentations/: Stores presentation files, categorized by extensions like PPTX and PPT.
- /archives/: Stores archive files, categorized by extensions like ZIP, TAR, GZ, and RAR.
- /code/: Stores code files, categorized by extensions like PY, JS, GO, JAVA, and CPP.
- /other/: Stores files that do not fall into any of the above categories, with an unknown subdirectory for files with unrecognized extensions.
File Classification Logic
The file classification logic is the core of the file storage system, responsible for determining the correct category for each uploaded file. This is achieved through a combination of MIME type mapping and extension-based fallback mechanisms.
MIME Type Mapping
MIME types provide a standardized way to identify the type of a file. The file storage system uses a mapping of MIME types to specific categories. When a file is uploaded, the system detects its MIME type and uses this mapping to determine the appropriate directory for storage.
package storage

import (
	"mime/multipart"
	"path/filepath"
	"strings"
)
var mimeToCategory = map[string]string{
// Images
"image/jpeg": "images/jpg",
"image/png": "images/png",
"image/gif": "images/gif",
"image/svg+xml": "images/svg",
"image/webp": "images/webp",
// Videos
"video/mp4": "videos/mp4",
"video/avi": "videos/avi",
"video/quicktime": "videos/mov",
"video/x-matroska": "videos/mkv",
// Documents
"application/pdf": "documents/pdf",
"application/msword": "documents/doc",
"application/vnd.openxmlformats-officedocument.wordprocessingml.document": "documents/docx",
"text/plain": "documents/txt",
"text/markdown": "documents/md",
// Spreadsheets
"application/vnd.ms-excel": "spreadsheets/xls",
"application/vnd.openxmlformats-officedocument.spreadsheetml.sheet": "spreadsheets/xlsx",
"text/csv": "spreadsheets/csv",
// Archives
"application/zip": "archives/zip",
"application/x-tar": "archives/tar",
"application/gzip": "archives/gz",
// Audio
"audio/mpeg": "audio/mp3",
"audio/wav": "audio/wav",
"audio/flac": "audio/flac",
}
// ClassifyFile returns the storage category ("type/extension") for an upload.
// It tries the detected MIME type first, then the filename extension, and
// finally falls back to "other/unknown".
func (fs *FileStorage) ClassifyFile(file *multipart.FileHeader) (string, error) {
	// 1. Detect MIME type
	mimeType, err := fs.detectMIME(file)
	if err != nil {
		return "", err
	}
	// 2. Map to category
	if category, ok := mimeToCategory[mimeType]; ok {
		return category, nil
	}
	// 3. Fallback to the lowercased extension
	ext := strings.ToLower(filepath.Ext(file.Filename))
	if category := fs.classifyByExtension(ext); category != "" {
		return category, nil
	}
	// 4. Default to other
	return "other/unknown", nil
}
Extension-Based Fallback
In cases where the MIME type is not available or not recognized, the file storage system falls back to using the file extension to determine the category. This ensures that even files with unknown MIME types can be classified and stored appropriately.
var extensionToCategory = map[string]string{
".jpg": "images/jpg",
".jpeg": "images/jpg",
".png": "images/png",
".gif": "images/gif",
".mp4": "videos/mp4",
".avi": "videos/avi",
".pdf": "documents/pdf",
".docx": "documents/docx",
".xlsx": "spreadsheets/xlsx",
".zip": "archives/zip",
".mp3": "audio/mp3",
// ... etc
}
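The lookup itself can be a thin wrapper around the map. This is a sketch only: categoryForName is a hypothetical name (the article calls its helper classifyByExtension and does not show it), and the map here is a shortened copy of the one above.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// Shortened copy of the extension table, for illustration only.
var extensionToCategory = map[string]string{
	".jpg":  "images/jpg",
	".jpeg": "images/jpg",
	".png":  "images/png",
	".pdf":  "documents/pdf",
	".zip":  "archives/zip",
}

// categoryForName (hypothetical helper) returns the category for a
// filename's extension. ok is false when the name has no extension or
// the extension is not in the table.
func categoryForName(name string) (category string, ok bool) {
	ext := strings.ToLower(filepath.Ext(name)) // ".JPEG" and ".jpeg" match alike
	category, ok = extensionToCategory[ext]
	return category, ok
}

func main() {
	for _, name := range []string{"Holiday.JPEG", "report.pdf", "README", "data.xyz"} {
		category, ok := categoryForName(name)
		fmt.Printf("%-14s %-16q matched=%v\n", name, category, ok)
	}
}
```

Note that filepath.Ext returns only the final suffix, so a name like backup.tar.gz is looked up as .gz.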
Storage Implementation
The storage implementation details how files are physically stored on the system. This involves handling file uploads, generating storage paths, and ensuring data integrity.
File Storage Handler
The FileStorage struct is responsible for managing the storage of files. It includes methods for classifying files, generating unique filenames, and writing files to disk.
type FileStorage struct {
basePath string
index *MetadataIndex
hasher *ContentHasher
}
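func (fs *FileStorage) StoreFile(file *multipart.FileHeader, metadata map[string]string) (*StorageResult, error) {
	// 1. Classify file
	category, err := fs.ClassifyFile(file)
	if err != nil {
		return nil, fmt.Errorf("classify %q: %w", file.Filename, err)
	}
	// 2. Hash content
	hash, err := fs.hasher.Hash(file)
	if err != nil {
		return nil, fmt.Errorf("hash %q: %w", file.Filename, err)
	}
	// 3. Check for duplicates; identical content is stored only once
	if existing := fs.index.FindByHash(hash); existing != nil {
		return fs.createReference(existing, file.Filename)
	}
	// 4. Generate storage path. filepath.Base drops directory components
	// from the client-supplied name so it cannot escape the category directory.
	prefix := hash
	if len(prefix) > 12 {
		prefix = prefix[:12]
	}
	filename := fmt.Sprintf("%s_%s", prefix, filepath.Base(file.Filename))
	fullPath := filepath.Join(fs.basePath, category, filename)
	// 5. Store file
	if err := fs.writeFile(file, fullPath); err != nil {
		return nil, err
	}
	// 6. Index metadata
	fs.index.Add(hash, fullPath, metadata)
	return &StorageResult{
		Hash:     hash,
		Path:     fullPath,
		Category: category,
		Size:     file.Size,
	}, nil
}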
Metadata Indexing
Metadata indexing is crucial for efficient file retrieval. The file storage system maintains an index of file metadata, including hash, original name, stored path, category, and more. This index allows for quick searching and retrieval of files based on various criteria.
Metadata Structure
The FileMetadata struct defines the structure of the metadata stored for each file. This includes essential information such as the file's hash, original name, stored path, and category, as well as additional metadata provided by the user.
type FileMetadata struct {
Hash string `json:"hash"`
OriginalName string `json:"original_name"`
StoredPath string `json:"stored_path"`
Category string `json:"category"`
MimeType string `json:"mime_type"`
Size int64 `json:"size"`
UploadedAt time.Time `json:"uploaded_at"`
Metadata map[string]string `json:"metadata"`
}
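For unit tests or a first prototype, the index can be stubbed in memory before a database is involved. The memoryIndex type below is a hypothetical stand-in, not the article's MetadataIndex, and it carries only a subset of the FileMetadata fields.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// FileMetadata is a trimmed copy of the article's struct.
type FileMetadata struct {
	Hash         string
	OriginalName string
	StoredPath   string
	Category     string
	UploadedAt   time.Time
}

// memoryIndex (hypothetical) keeps metadata in a map keyed by content hash.
type memoryIndex struct {
	mu     sync.RWMutex
	byHash map[string]FileMetadata
}

func newMemoryIndex() *memoryIndex {
	return &memoryIndex{byHash: make(map[string]FileMetadata)}
}

// Add stores m and reports true, or reports false if the hash is
// already indexed (a duplicate upload).
func (ix *memoryIndex) Add(m FileMetadata) bool {
	ix.mu.Lock()
	defer ix.mu.Unlock()
	if _, exists := ix.byHash[m.Hash]; exists {
		return false
	}
	ix.byHash[m.Hash] = m
	return true
}

// FindByHash returns the entry for hash, if any.
func (ix *memoryIndex) FindByHash(hash string) (FileMetadata, bool) {
	ix.mu.RLock()
	defer ix.mu.RUnlock()
	m, ok := ix.byHash[hash]
	return m, ok
}

func main() {
	ix := newMemoryIndex()
	entry := FileMetadata{
		Hash:         "abc123",
		OriginalName: "photo.jpg",
		StoredPath:   "/storage/images/jpg/abc123_photo.jpg",
		Category:     "images/jpg",
		UploadedAt:   time.Now(),
	}
	fmt.Println("first add:", ix.Add(entry))
	fmt.Println("second add:", ix.Add(entry))
}
```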
Index Database (SQLite for simplicity)
For simplicity, the file storage system uses SQLite as the index database. SQLite is a lightweight, file-based database that is easy to set up and maintain. The database schema includes indexes for hash, category, and uploaded_at, allowing for efficient querying.
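CREATE TABLE file_metadata (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    hash TEXT NOT NULL UNIQUE, -- UNIQUE already creates an index on hash
    original_name TEXT NOT NULL,
    stored_path TEXT NOT NULL,
    category TEXT NOT NULL,
    mime_type TEXT NOT NULL,
    size INTEGER NOT NULL,
    uploaded_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    metadata JSON
);
-- SQLite does not accept inline INDEX clauses inside CREATE TABLE,
-- so the remaining indexes are created with separate statements.
CREATE INDEX idx_category ON file_metadata (category);
CREATE INDEX idx_uploaded_at ON file_metadata (uploaded_at);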
Retrieval API
The retrieval API provides endpoints for querying and retrieving files based on various criteria. This includes querying by hash, category, name, and type.
Query by Hash
GET /files/{hash}
Query by Category
GET /files?category=images/jpg&limit=100
Search by Name
GET /files?name=vacation&limit=50
List by Type
GET /files/types/images
Implementation Tasks
To successfully implement the file storage system, the following tasks need to be completed:
- [ ] Create internal/storage/classifier.go
- [ ] Implement MIME detection with mimetype library
- [ ] Create extension mapping tables
- [ ] Implement directory structure creation
- [ ] Add file storage handler
- [ ] Create metadata index (SQLite)
- [ ] Implement deduplication logic
- [ ] Add retrieval APIs
- [ ] Create comprehensive tests
- [ ] Add migration for existing files
Testing Strategy
A comprehensive testing strategy is essential to ensure the reliability and performance of the file storage system. This includes both unit tests and integration tests.
Unit Tests
Unit tests focus on testing individual components of the system, such as the file classification logic.
func TestFileClassification(t *testing.T) {
tests := []struct{
filename string
mimeType string
expected string
}{
{"photo.jpg", "image/jpeg", "images/jpg"},
{"video.mp4", "video/mp4", "videos/mp4"},
{"doc.pdf", "application/pdf", "documents/pdf"},
{"unknown.xyz", "application/octet-stream", "other/unknown"},
}
// ...
}
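The elided test body depends on how MIME detection is injected, which the article leaves open. One way to make the table easy to exercise is to separate the decision logic into a pure function. The sketch below assumes such a function, here called classify, with shortened lookup tables; it runs the same table as a standalone program rather than under go test.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// Shortened lookup tables, for illustration only.
var mimeToCategory = map[string]string{
	"image/jpeg":      "images/jpg",
	"video/mp4":       "videos/mp4",
	"application/pdf": "documents/pdf",
}

var extensionToCategory = map[string]string{
	".jpg": "images/jpg",
	".mp4": "videos/mp4",
	".pdf": "documents/pdf",
}

// classify (hypothetical pure function) applies the three-step chain:
// MIME type, then lowercased extension, then "other/unknown".
func classify(mimeType, filename string) string {
	if c, ok := mimeToCategory[mimeType]; ok {
		return c
	}
	if c, ok := extensionToCategory[strings.ToLower(filepath.Ext(filename))]; ok {
		return c
	}
	return "other/unknown"
}

func main() {
	cases := []struct {
		filename, mimeType, expected string
	}{
		{"photo.jpg", "image/jpeg", "images/jpg"},
		{"video.mp4", "video/mp4", "videos/mp4"},
		{"doc.pdf", "application/pdf", "documents/pdf"},
		{"scan.PDF", "application/octet-stream", "documents/pdf"}, // extension fallback
		{"unknown.xyz", "application/octet-stream", "other/unknown"},
	}
	for _, tc := range cases {
		got := classify(tc.mimeType, tc.filename)
		status := "ok"
		if got != tc.expected {
			status = "FAIL"
		}
		fmt.Printf("%-4s %-12s -> %s\n", status, tc.filename, got)
	}
}
```

The extra scan.PDF case covers the path where the MIME type is generic and the extension decides the category, which the original table does not exercise.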
Integration Tests
Integration tests focus on testing the interaction between different components of the system, such as file upload and retrieval.
# Test file upload and retrieval
./test/integration/test_file_storage.sh
Performance Targets
To ensure the file storage system is performant, the following targets should be met:
- Classification: <1ms per file
- Storage: <10ms per file (excluding network I/O)
- Deduplication check: <5ms per file
- Index query: <10ms
- Support 1000+ concurrent uploads
Conclusion
Implementing a file storage system that organizes files by type and extension provides a practical and efficient solution for managing digital assets. By following the strategies outlined in this article, you can build a system that keeps files organized, makes them faster to locate, and reduces manual file management. For further reading on file systems and storage solutions, see Wikipedia's article on file systems.