When most labs start a biobank project, spreadsheets should always be their first choice for storing and working with . They’re cheap, fast, and easy to use, everyone’s familiar with them, and with Google Sheets and Excel’s collaborative features, any lab member should be able to update it with ease.
Spreadsheets encourage experimentation and flexibility, and doesn’t impose many rigid structures around the type of information the biobank needs to collect. This makes them very easy to pick up.
Spreadsheets should comfortably manage a few hundred items in the bank. But when the collection grows to many hundreds to thousands of items, they start becoming unwieldy. Spreadsheets are also mostly ideal for simple data. It’s hard to upload images and PDFs to a spreadsheet. It’s even harder to manage multiple tables of data that need to reference each other.
For example, a simple biobank might manage one table for phages, another for bacterial hosts, and a third for the physical batches of phages. The list of phage batches should reference the list of phages and the list of hosts, but also contain its own details, like batch lot number, expiration date, the person responsible, etc. While there can only be one listing for P. aeruginosa and a phage Pae7, there can be multiple listings of phage batches.
While spreadsheets are great for new and smaller projects, they get unwieldy when more structure, consistency, and reliability is needed. For that, there are databases.
Spreadsheets are tools meant to explore and interact freely and quickly with data. Databases are tools meant to make data predictable and consistent, which requires structure, constraints, order, and correctness.
These additional constraints give databases a lot more power — there’s a reason why databases power the modern world — but they also add a lot more complexity. The additional structures means users can’t just “move things around” anymore in the data. Data is now organized into well-characterized columns: email column can only contain valid email addresses. Any other content won’t be saved.
For a long time, databases were very hard and expensive to use, but that’s not the case anymore, as there are plenty of tools that bridge the gap between spreadsheets and “real” databases.
Today, databases don’t have to be scary anymore. Gone are the days of having to write SQL code just to see the content in a database. Instead, there are a number of tools today that are just as easy (or even easier) than Google Sheets that have great database-like functionality. Airtable for example looks similar to a spreadsheet, but runs like a database. You can create multiple tabs, like a spreadsheet. You can also link and reference rows from one table to another, and filter, sort, and view your data in different ways. For Phage Directory, we’ve been using Airtable to power anything from the website to our blog that we’ve been running for over six years.
Airtable is a database with a spreadsheet-like interface. This is an example of a library system
Airtable enforces the structure and consistency of databases, but has a very approachable interface. They even support uploading files and images, and turning databases into websites.
Another great tool that we use a lot at Phage Directory is Notion, which is kind of a cross between Apple Notes and Airtable. In fact, this entire post is written in Notion, and served to the labspace.ai website using a custom data pipeline (the original post can be found here: https://phagedirectory.notion.site/Data-banks-for-Biobanks-fff6478089c681d0be0ccb30e50aa8b8).
While use Notion to write blog posts for Labspace, Notion also comes with data structures, that make it great for a database with lots of text content — like blog posts!
Notion offers a lot of flexibility for labs that use bio banks as lab resources, as Notion can double up as a great electronic lab notebook. It has collaborative and comment features like a Google Doc, and multiple people can work in it at the same time.
If Notion and Airtable are so great, do labs even need “real” databases like Postgres or even SQLite? Notion and Airtable still come with lots of constraints, but they should handle 80% of the cases. If you’re trying to store hundreds of thousands of genomes, from raw reads all the way through annotations in Notion, that’s probably not a good idea.