REM Analytics Moves To Local Storage To Reduce Bottlenecks

We’re Koor. We offer tools and expertise to help you manage your data storage. Here’s how we helped one customer migrate to local storage to reduce bottlenecks in their genome-processing pipeline.

REM Analytics develops diverse genetic tests for a range of applications, from validating food authenticity and safety to mapping human and soil microbiota. Designing these tests requires a lot of genome processing, which makes for intensive computing work.

Priced out of cloud storage

Cloud storage is expensive for everyone, but for research organizations relying on funding from grants, the costs can be prohibitive. REM Analytics started out using Microsoft Azure for their data storage needs. “We were doing a lot of genome processing and assembly of genomes that was very memory intensive … the budget was just not working at all for the computing power and the memory that we wanted,” said Emily Jamieson, Bioengineer at REM Analytics.

So they made the decision to go on-premises—a brave choice given that their IT team consisted of two software engineers and two bioengineers. “None of us had ever set up an on-premises cluster or dealt with storage or anything like that!” said Emily. “We were looking for a local storage solution that would allow us to do dynamic provisioning, with the flexibility to scale up and down for storage that we needed to access from different nodes.”

On-premises with Rook Ceph

The team at REM Analytics considered different provisioners and container storage interfaces, including the CSI from the K0s project and the open-local system. “We settled on Rook Ceph as it seemed to be a robust, well-documented system for on-premises storage, with all the flexibility of dynamic provisioning,” said Emily. “It also seemed to have a lot of scope for future growth of our system.”

One of the challenges with on-premises storage is that if you’re used to working with cloud providers, suddenly there’s a lot of configuration and decision making that’s no longer abstracted away. “Things like storage and networking just happen by default in the cloud,” said Emily. “It takes away the guesswork of ‘What resources do I need behind my storage? How do I want to set up my storage? How do I want to structure my database to work with the storage?’”

“If you’re going on-premises, you’ve got these added layers that you need to think about before you get to the application. There are resources out there, but they’re often quite niche. There are so many different options about how to set up an on-premises cluster … it’s hard to get an overview of all the options, work out the best approach, and then make that approach work for your case.”

Help, Rook Ceph is hard

“We set up something that worked with a lot of trial and error … I think we set it up and tore it down at least three times,” said Emily. “Like completely wiping the disks and starting everything again.”

REM Analytics runs a number of different applications to process their genomes, with everything going through a MongoDB database. Some of those application requests were extremely memory intensive and were creating bottlenecks for downstream steps of their pipeline. They needed to set up a database with the highest possible read-write speeds and accessibility from different applications in parallel, so they could process the genomes as fast as possible.

“We realized that Rook Ceph was such a big thing with so many parameters and so many possibilities, that we were only just understanding the tip of the iceberg,” said Emily. They reached out to Koor for input on how best to set up their storage classes and get the most out of Rook Ceph for their hardware and applications.

Solution: Local storage

“The Koor team helped us understand why Ceph wasn’t necessarily the best option for setting up a database with high read-write speeds to cope with multiple parallel requests,” said Emily. “They helped us with setting up local storage and the database in the best way to leverage that local storage so that it isn’t a bottleneck for anything downstream. It’s made our system more robust.”

“We have other bottlenecks in the downstream steps of the pipeline, but we know that the first step is good,” said Emily. By starting from the bottom and moving up, REM Analytics is now confident they have a good foundation for their underlying storage. “Now we’re working on the database structure with the help of MongoDB, and after that we’ll work on optimizing the application itself.”

“It’s still very much a work in progress, but now we can shift from spending most of our time getting things set up and optimized, and focus more on designing our genome tests.”

Advice for choosing storage management

It’s hard to architect a storage solution on your own because every organization has unique needs, and generic best practices aren’t always going to apply to your situation. “For some, security will be absolutely vital,” said Emily. “For us, the data on our database is only valuable to us. It can’t be of much use to anybody else because it’s so specific to our application.”

If you’re trying to decide on the right storage solution, Emily suggests focusing on what is most important for your use case. “Is it volume? Is it read-write speed? Is it memory? Is it high availability? If our application dies for a couple of hours, it’s not catastrophic. For other companies, a 30-second outage could cost you thousands of dollars.”

There are many different factors you could optimize for, so it’s important to define them up front and be clear about which to prioritize for your use case. “You also need a clear understanding of both the tools and the setup of those tools that are going to give you the best performance for your needs.”

Emily also recommends getting expert advice if necessary. “There isn’t a huge amount of information out there about what works best for a given use case, especially with on-premises,” she said. REM Analytics invested a lot of time in researching and trying to figure out the best option for themselves, and ended up with Ceph. They still use Ceph for their smaller applications, but not for their database. “That’s where advice from people like Koor can be helpful, as they try to understand what you need and what setup is going to work for you.”

Koor can help

Are you having issues with your data storage? Is it keeping you up at night? Do you need help thinking through the options? Contact Koor, and let us help you figure out solutions that will work for your situation.

Our appreciation and gratitude to Rebecca Dodd, who made this post possible.

Dave Mount, August 09, 2023