AI storage: NAS vs SAN vs object for training and inference
Artificial intelligence relies on vast amounts of data.
Enterprises that take on AI projects, especially for large language models (LLMs) and generative AI, need to capture large volumes of data for model training, as well as to store outputs from AI-enabled systems.
That data, however, is unlikely to be in a single system or location. Customers will draw on multiple data sources, including structured data in databases and often unstructured data. Some of these information sources will be on-premises and others in the cloud.
To deal with AI’s hunger for data, system architects need to look at storage across storage area networks (SAN), network-attached storage (NAS) and, potentially, object storage.
In this article, we look at the pros and cons of block, file and object storage for AI projects and the challenges of finding the right blend for organisations.
The current generation of AI projects is rarely, if ever, characterised by a single source of data. Instead, generative AI models draw on a wide range of data, much of it unstructured. This includes documents, images, audio, video and computer code, to name a few.
“Everything about generative AI is about understanding relationships. You have the source data still in your unstructured data, either file or object, and your vectorised data sitting on block”
Patrick Smith, Pure Storage
When it comes to training LLMs, the more data sources the better. But, at the same time, enterprises link LLMs to their own data sources, either directly or through retrieval-augmented generation (RAG), which improves the accuracy and relevance of results. That data might be documents, but can include enterprise applications that hold data in a relational database.
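As a rough illustration of the retrieval step in RAG, the sketch below ranks pre-computed document embeddings against a query vector by cosine similarity. The documents and four-dimensional vectors are toy placeholders; a real system would use an embedding model and a vector database rather than an in-memory dictionary.

```python
# Minimal sketch of RAG retrieval, with toy data for illustration only.
import numpy as np

corpus = {
    "policy.pdf":  np.array([0.1, 0.8, 0.3, 0.2]),
    "report.docx": np.array([0.7, 0.1, 0.5, 0.4]),
    "faq.txt":     np.array([0.2, 0.7, 0.4, 0.1]),
}

def top_k(query_vec: np.ndarray, k: int = 2) -> list[str]:
    """Rank documents by cosine similarity to the query embedding."""
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    ranked = sorted(corpus, key=lambda doc: cosine(query_vec, corpus[doc]),
                    reverse=True)
    return ranked[:k]

# The retrieved documents would be prepended to the LLM prompt as context.
print(top_k(np.array([0.15, 0.75, 0.35, 0.15])))
```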
“A lot of AI is driven by unstructured data, so applications point at files, images, video, audio – all unstructured data,” says Patrick Smith, field chief technology officer EMEA at storage supplier Pure Storage. “But people also look at their production datasets and want to tie them to their generative AI projects.”
This, he adds, includes adding vectorisation to databases, which is commonly supported by the main relational database suppliers, such as Oracle.
For system architects who support AI projects, this raises the question of where best to store data. The simplest option would be to leave data sources as they are, but this is not always possible.
This could be because data needs further processing, the AI application needs to be isolated from production systems, or current storage systems lack the throughput the AI application requires.
In addition, vectorisation usually leads to large increases in data volumes – a tenfold increase is not unusual – and this puts more demands on production storage.
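A back-of-envelope calculation shows where that multiplier can come from. The figures below – 500-character chunks and 1,536-dimension float32 embeddings – are illustrative assumptions, not figures from any particular product:

```python
# Rough sketch of why vectorisation multiplies data volume.
source_bytes = 1 * 1024**3      # 1 GiB of source text
chunk_chars = 500               # characters per chunk (~1 byte each)
dims = 1536                     # embedding dimensions (assumed)
bytes_per_vector = dims * 4     # float32 = 4 bytes per dimension

chunks = source_bytes // chunk_chars
vector_bytes = chunks * bytes_per_vector
print(f"{chunks:,} chunks -> {vector_bytes / 1024**3:.1f} GiB of vectors "
      f"({vector_bytes / source_bytes:.1f}x the source data)")
# Under these assumptions, 1 GiB of text yields roughly 12 GiB of vectors.
```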
This means storage needs to be flexible and able to scale, and AI data handling requirements differ at each stage of a project. Training demands large volumes of raw data; inference – running the model in production – might not require as much data, but needs higher throughput and minimal latency.
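To get a feel for the difference between those two profiles, a crude probe like the sketch below contrasts one bulk sequential pass (training-style, measured as throughput) with many small random reads (inference-style, measured as latency per read). The file name and sizes are arbitrary choices, and results will largely reflect the OS page cache unless run against a cold network mount:

```python
# Crude probe contrasting training-style and inference-style I/O.
import os, random, time

PATH = "probe.bin"
SIZE = 256 * 1024 * 1024                      # 256 MiB test file
with open(PATH, "wb") as f:
    f.write(os.urandom(SIZE))

# Training-style: one large sequential pass, measured as throughput.
start = time.perf_counter()
with open(PATH, "rb") as f:
    while f.read(8 * 1024 * 1024):            # 8 MiB sequential reads
        pass
mb_per_s = 256 / (time.perf_counter() - start)

# Inference-style: many small random reads, measured as latency.
start = time.perf_counter()
with open(PATH, "rb") as f:
    for _ in range(1000):
        f.seek(random.randrange(0, SIZE - 4096))
        f.read(4096)                          # 4 KiB random reads
avg_us = (time.perf_counter() - start) / 1000 * 1e6

print(f"sequential: {mb_per_s:.0f} MB/s, random 4 KiB read: {avg_us:.0f} µs")
os.remove(PATH)
```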
Enterprises tend to keep the bulk of their unstructured data on file-access NAS storage. NAS has the advantages of being relatively low cost and easier to manage and scale than alternatives such as direct-attached storage (DAS) or block-access SAN storage.
Structured data is more likely to be held on block storage. Usually this will be on a SAN, although direct-attached storage might be sufficient for smaller AI projects.
Here, achieving the best performance – in terms of IOPS and throughput from the storage array – offsets the greater complexity of SAN compared with NAS. Enterprise production systems, such as enterprise resource planning (ERP) and customer relationship management (CRM), will use SAN or DAS to store their data in database files. So, in practice, data for AI is likely to be drawn from both SAN and NAS environments.
“AI data can be stored either in NAS or SAN. It’s all about the way the AI tools want or need to access the data,” says Bruce Kornfeld, chief product officer at StorMagic. “You can store AI data on a SAN, but AI tools won’t typically read the blocks. They’ll use a type of file access protocol to get to the block data.”
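In practice, that file-access path usually looks like ordinary file I/O against a mount point, as in the hypothetical sketch below. The /mnt/nas/training path and file suffixes are assumptions for illustration; the filesystem hides whether the underlying bytes sit on NAS, SAN or DAS:

```python
# Sketch of how an AI data pipeline typically reaches storage: through a
# file access protocol (an NFS or SMB mount) rather than raw blocks.
from pathlib import Path

DATA_ROOT = Path("/mnt/nas/training")   # assumed mount point for illustration

def iter_documents(suffixes=(".txt", ".json", ".pdf")):
    """Yield training files found under the mount point."""
    for path in DATA_ROOT.rglob("*"):
        if path.is_file() and path.suffix.lower() in suffixes:
            yield path

for doc in iter_documents():
    raw = doc.read_bytes()   # ordinary file I/O over the mounted protocol
    # ... hand `raw` to the preprocessing / tokenisation stage ...
```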
It is not necessarily the case that one protocol will be better than the other. It depends very much on the nature of the data sources and on the output of the AI system.
For a primarily document or image-based AI system, NAS might be fast enough. For an application such as autonomous driving or surveillance, systems might use a SAN or even high-speed local storage.
Again, data architects will need to distinguish between the training and inference phases of their projects, and consider whether the overhead of moving data between storage systems outweighs the performance benefits, especially in training.
This has led some organisations to look at object storage as a way of unifying data sources for AI. Object storage is increasingly in use among enterprises, and not just in cloud storage – on-premises object stores are gaining market share too.
Object storage has some advantages for AI, not least its flat structure and global namespace, low management overheads, ease of expansion and low cost.
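That flat structure means data is addressed by key over an HTTP API rather than a filesystem path. The sketch below assumes an S3-compatible store – most on-premises object stores expose the S3 API – with placeholder endpoint and bucket names:

```python
# Sketch of object access for AI data via the S3 API; the endpoint and
# bucket names are placeholders, not real resources.
import boto3

s3 = boto3.client("s3", endpoint_url="https://objects.example.internal")

def iter_objects(bucket: str, prefix: str = "training/"):
    """Walk the flat namespace; 'directories' are just key prefixes."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            yield obj["Key"]

for key in iter_objects("ai-training-data"):
    body = s3.get_object(Bucket="ai-training-data", Key=key)["Body"].read()
    # ... feed `body` into the training pipeline ...
```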
Performance, however, has not been a strength for object storage. This has tended to make it more suited to tasks such as archiving than applications that demand low latency and high levels of data throughput.
Suppliers are working to close the performance gap, however. Pure Storage and NetApp sell storage systems that can handle file and object and, in some cases, block too. These include Pure’s FlashBlade and hardware that runs NetApp’s ONTAP storage operating system. These technologies give storage managers the flexibility to use the best data formats without creating silos tied to specific hardware.
Others, such as Hammerspace with its Hyperscale NAS, aim to squeeze additional performance out of equipment that runs the Network File System (NFS). This, they argue, prevents bottlenecks where storage fails to keep up with data-hungry graphics processing units (GPUs).
But until better-performing object storage systems become more widely available, or more enterprises move to universal storage platforms, AI is likely to use NAS, SAN, object and even DAS in combination.
That said, the balance between the elements is likely to change during the lifetime of an AI project, and as AI tools and their applications evolve.
At Pure Storage, Smith has seen requests for new hardware to handle unstructured data, while block and vector database requirements are being met on existing hardware for most customers.
“Everything about generative AI is about understanding relationships,” he says. “You have the source data still in your unstructured data, either file or object, and your vectorised data sitting on block.”
Read more about AI and storage
Storage technology explained: AI and data storage: In this guide, we examine the data storage needs of artificial intelligence, the demands it places on data storage, the suitability of cloud and object storage for AI, and key AI storage products.
Storage technology explained: Vector databases at the core of AI: We look at the use of vector data in AI and how vector databases work, plus vector embedding, the challenges for storage of vector data and the key suppliers of vector database products.