Why Is DigiXT the Best Enterprise Data Management Platform?
Introduction
Enterprise data is in an era of disruption. The industry has moved from the age of legacy repositories to the era of cloud and big data, and modern business needs call for a new kind of data management platform.
About DigiXT
DigiXT is the next-generation data platform that unifies big data, artificial intelligence, and the cloud to unlock new ways of working. It brings all your data into one place and lets you decide who uses that data, and when and where it is used. DigiXT helps organizations in the public sector, finance & banking, insurance, manufacturing, retail, and media create a more effective organization from top to bottom.
DigiXT provides the most comprehensive application for managing Big Data Analytics and Machine Learning, enabling organizations to capture, understand, and act on real-time information, accelerate decision-making, and deliver faster business results. It is a single platform that delivers crucial capabilities across all major data management needs, spanning Genomics, Clinical Research, Advertising and Media, Banking & Finance, and more.
With total control in your hands, DigiXT provides the heartbeat for all your digital needs: a comprehensive ecosystem for Digital Transformation.
5 C Architecture
DigiXT Connect brings you the power of data engineering in a compact, plug-and-play package. Use DigiXT Connect to connect to any type of data source, including SQL and NoSQL databases, IoT devices, APIs, and streaming feeds, with support for structured, semi-structured, and unstructured data.
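DigiXT Connect exposes its own connector tooling; purely as an illustration of the kind of heterogeneous ingestion such a layer abstracts away, here is a minimal Python sketch that pulls structured rows from a SQL database and semi-structured records from a REST API into a common tabular form. The API endpoint, database file, and table names are hypothetical placeholders.

```python
# Illustrative only: the kind of heterogeneous ingestion a Connect-style layer abstracts away.
# The API endpoint, database file, and table names below are hypothetical placeholders.
import sqlite3
import requests
import pandas as pd

def read_sql_source(db_path: str, query: str) -> pd.DataFrame:
    """Pull structured rows from a relational source."""
    with sqlite3.connect(db_path) as conn:
        return pd.read_sql_query(query, conn)

def read_api_source(url: str) -> pd.DataFrame:
    """Pull semi-structured JSON records from a REST API and flatten them."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return pd.json_normalize(response.json())

# Land both feeds in one tabular form for downstream layers.
orders = read_sql_source("transactions.db", "SELECT * FROM orders")
devices = read_api_source("https://example.com/api/v1/device-telemetry")
```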
The next level of modern data management, DigiXT Construct goes beyond what is already possible. Combining declarative ETL pipelines with real-time visibility into the health of your business, it offers stream and batch processing along with end-to-end pipeline tracking.
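DigiXT Construct's own pipeline definitions are not shown here; as a rough sketch of what "declarative" means in practice, the example below describes a batch pipeline as data rather than code and has a tiny runner execute and log each step. All step and pipeline names are hypothetical.

```python
# Illustration of a declarative batch pipeline: the pipeline is described as data,
# and a generic runner executes it. Step and pipeline names are hypothetical.
import pandas as pd

def drop_nulls(df: pd.DataFrame) -> pd.DataFrame:
    return df.dropna()

def add_revenue(df: pd.DataFrame) -> pd.DataFrame:
    return df.assign(revenue=df["quantity"] * df["unit_price"])

STEPS = {"drop_nulls": drop_nulls, "add_revenue": add_revenue}

# Declarative definition: what to do, not how to do it.
pipeline = {"name": "daily_sales", "steps": ["drop_nulls", "add_revenue"]}

def run_pipeline(df: pd.DataFrame, definition: dict) -> pd.DataFrame:
    """Execute each declared step in order, logging progress for pipeline tracking."""
    for step_name in definition["steps"]:
        df = STEPS[step_name](df)
        print(f"[{definition['name']}] completed step: {step_name} ({len(df)} rows)")
    return df

raw = pd.DataFrame({"quantity": [2, None, 5], "unit_price": [9.5, 4.0, 3.2]})
result = run_pipeline(raw, pipeline)
```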
The DigiXT Contain layer consists of a high-performance object storage solution that meets the needs of today's modern enterprises. With DigiXT, you can store, manage, and query objects in real time using an on-prem/Cloud object store that supports pushdown and is proven at petabyte scale. It supports real-time distributed OLAP data storage suitable for user-facing analytics with smart indexing, and it enables zero-copy reads of large data sets for lightning-fast data access in AI and machine learning applications.
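The storage engines behind Contain are not detailed here; as one concrete illustration of the zero-copy idea, this sketch writes a small Parquet file and reads it back through a memory map with pyarrow, so large column buffers are referenced rather than copied into Python. It demonstrates the general technique, not DigiXT's internal implementation.

```python
# A minimal sketch of zero-copy columnar access with Apache Arrow / Parquet.
# It illustrates the general technique, not DigiXT Contain's internals.
import pyarrow as pa
import pyarrow.parquet as pq

# Write a small columnar dataset to Parquet.
table = pa.table({"sensor_id": [1, 2, 3], "reading": [20.1, 19.8, 21.4]})
pq.write_table(table, "readings.parquet")

# Read it back through a memory map: column buffers reference the mapped file
# instead of being copied, which is what makes large-scale ML reads fast.
mapped = pq.read_table("readings.parquet", memory_map=True)
print(mapped.to_pandas())
```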
Need to manage your data, or need to find out what is stored in your databases? Our DigiXT Consume layer serves as a unified gateway, exposing and aggregating data from polyglot storage. It incorporates an MPP (Massively Parallel Processing) distributed query engine that accepts standard SQL and can be connected to through JDBC, offers data virtualization support, and serves as a convenient data discovery tool that also facilitates collaboration.
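DigiXT's actual JDBC connection details would come from its documentation; the sketch below only shows the general pattern of querying a JDBC-exposed SQL engine from Python with the jaydebeapi package. The driver class, URL, credentials, and JAR path are placeholders.

```python
# Generic pattern for querying a JDBC-exposed SQL engine from Python.
# Driver class, URL, credentials, and JAR path are hypothetical placeholders.
import jaydebeapi

conn = jaydebeapi.connect(
    "com.example.digixt.jdbc.Driver",         # placeholder driver class
    "jdbc:digixt://digixt-host:8443/lake",    # placeholder connection URL
    ["analyst", "secret"],                    # placeholder credentials
    "/opt/drivers/digixt-jdbc.jar",           # placeholder driver JAR
)
cursor = conn.cursor()
try:
    cursor.execute("SELECT region, COUNT(*) AS total FROM sales GROUP BY region")
    for region, total in cursor.fetchall():
        print(region, total)
finally:
    cursor.close()
    conn.close()
```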
The DigiXT Control layer provides a single, simple console for monitoring and managing all DigiXT modules. It handles authentication and authorization and supports log and audit management, security management, script management, and general administration of your DigiXT modules.
Built-in AI Tools
DigiXT’s platform is engineered to take on the challenges of today’s AI marketplace through a highly scalable, innovative set of tools. From deep learning to natural language processing to data analytics and big data, DigiXT has the most powerful AI tools to help you build smarter products that give your users more control and better results.
Call to Action
We truly believe in the saying "Coming together is a beginning, staying together is progress, and working together is success" (courtesy of Henry Ford). We intend to grow through collaboration and are keen to work with data centers, systems integrators, and consulting entities who can all add value to the platform and take a share of the pie in delivering value to clients regionally and globally.
Please feel free to reach out to me at arun@saal.ai for more information on potential collaboration.
Conclusion
The world is evolving at breathtaking speed, and the future is now. Our destiny depends on remaining agile and ready to adapt to emerging changes in the market, especially in a digital era where technology touches almost every aspect of our daily lives. We have come together, pooling our resources, creativity, experience, and expertise to design and develop a platform that revolutionizes the Big Data and AI domains. Internationalization is key to digital transformation, and there are plenty of opportunities for UAE-based companies to capture market share by creating digital tools that local and regional customers can use. DigiXT is a good example of how such tools are being created and nurtured in the UAE.
About the Author: Arun Arikath has vast experience in commercializing technology products, services, and high-tech initiatives in various domains. Before joining Saal, Arun held key roles with regional and global technology organizations. In his current role as Chief Commercial Officer of Saal, he is focusing on growth and client adoption of the flagship Big Data and AI platform – DigiXT.
Fetching Polyglot Data is Cool, but How About Persistence?
The term "big data" refers to enormous volumes of data: millions upon millions of records originating from a variety of sources, characterized by the famed three Vs (Volume, Velocity, and Variety). DigiXT manages volume and velocity using cutting-edge distributed computing frameworks. We use templating to configure a wide range of data sources, including structured data, unstructured data, APIs, and real-time feeds. This addresses one of the critical requirements for managing ingestion from businesses that employ many distinct data storage solutions for different types of data.
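DigiXT's templating mechanism is internal to the product; as a loose illustration of the idea, the sketch below fills a generic source template with per-source parameters using Python's standard-library string.Template. The configuration fields are hypothetical.

```python
# Illustration of template-driven source configuration (field names are hypothetical).
from string import Template

SOURCE_TEMPLATE = Template(
    "type=$source_type;host=$host;port=$port;format=$data_format;schedule=$schedule"
)

sources = [
    {"source_type": "postgres", "host": "crm-db", "port": 5432,
     "data_format": "structured", "schedule": "hourly"},
    {"source_type": "rest_api", "host": "billing.example.com", "port": 443,
     "data_format": "semi-structured", "schedule": "realtime"},
]

# One template, many concrete source configurations.
for source in sources:
    print(SOURCE_TEMPLATE.substitute(source))
```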
While there are various extract-transform-load (ETL) technologies and platforms that can acquire polyglot data sources, only a few data platforms in the industry can support "polyglot" data lakes. DigiXT is thrilled to take a huge step forward in the world of data platforms by offering multi-format storage depending on the use case.
Shifting perspectives of Data Lakes
We aim to shift people's perceptions of how data lakes are used. To begin with, a data lake is not simply a repository where all data serves only one purpose. Let's take a look at what a company's transactional data repositories are like:
Figure 1: Transactional Data Stores: Many Types
Given the circumstances above, it follows that a data lake should not store all of its data in a single format tailored to only one use case. We should build the data lake(s) to accommodate numerous use cases, just as we adopted transactional data stores for multiple reasons. That implies flexibility in the formats in which we store data in the data lake, and that flexibility needs to be configurable. This is one of the DigiXT data platform's key architectural tenets.
Polyglot Persistence in DigiXT
The primary objective of a data lake is to meet a range of consumer demands rather than simply storing data in one location. With the emergence of demands such as artificial intelligence and machine learning, vast amounts of data must be fed in for model training and development, so storing the data in an appropriate format is critical. At the same time, another use case may require searching the data, while yet another may need data loaded quickly and then used for aggregated consumer insights. We are unlikely to achieve good performance for all of these use cases with a single data format and storage. And what about caching frequently used data? A key-value formatted store is useful in that case.
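To make the idea concrete, here is a toy sketch, not DigiXT's storage implementation, that persists the same records in a columnar file for analytics and in a key-value cache for frequently accessed lookups, the two access patterns mentioned above. The record fields are invented for the example.

```python
# Toy illustration of polyglot persistence: the same records are persisted in a
# columnar format for analytics and cached in key-value form for fast lookups.
# This is a sketch of the concept, not DigiXT's storage implementation.
import pyarrow as pa
import pyarrow.parquet as pq

records = [
    {"customer_id": "c1", "segment": "retail", "monthly_spend": 120.0},
    {"customer_id": "c2", "segment": "corporate", "monthly_spend": 890.0},
]

# Columnar copy: good for aggregations and ML feature extraction.
pq.write_table(pa.Table.from_pylist(records), "customers.parquet")

# Key-value copy: good for caching frequently requested profiles.
kv_cache = {row["customer_id"]: row for row in records}

print(pq.read_table("customers.parquet").to_pandas()["monthly_spend"].mean())
print(kv_cache["c2"]["segment"])
```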
Here is our solution:
Figure 2: DigiXT's Polyglot Persistent Data Lake Approach
Unlike other data platforms in the market, we are adopting an innovative approach that keeps storage configurable based on use cases. We carefully design the data elements so that they are not duplicated. All individual abstractions are then stored in a common storage layer backed by object storage technology. The data is encrypted, and stringent policies govern access; only appropriate authentication and authorization give external users and applications access to the layer. We also provide an MPP-based distributed query engine for retrieving data with plain SQL. Access, users, queries, and performance are all thoroughly audited. The query engine can be coupled with the organization's security infrastructure, such as LDAP, Okta, or OpenID. Standard access drivers are provided for dashboarding and reporting applications.
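The specific object store and policy engine behind DigiXT are not described here; as a general illustration of the "encrypted, access-controlled object storage" idea, this sketch uploads an object to an S3-compatible endpoint with server-side encryption using boto3. The endpoint, bucket, credentials, and file path are placeholders.

```python
# Generic illustration of writing to an S3-compatible object store with
# server-side encryption. Endpoint, bucket, credentials, and paths are placeholders.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://objectstore.example.com",   # placeholder endpoint
    aws_access_key_id="PLACEHOLDER_KEY",               # placeholder credentials
    aws_secret_access_key="PLACEHOLDER_SECRET",
)

# Upload a curated data file with encryption at rest.
with open("curated_customers.parquet", "rb") as data:  # placeholder local file
    s3.put_object(
        Bucket="data-lake",                             # placeholder bucket
        Key="curated/customers/2024-01-01.parquet",
        Body=data,
        ServerSideEncryption="AES256",
    )
```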
Advantages
The advantages are as follows:
- Extensible for new kinds of evolving high-performance storage
- Not tied to a vendor-proprietary format, so no lock-in
- Maximum performance benefits and ROI for the use case under consideration
- Concurrency benefits, since not everyone is accessing the same storage format
- Since the underlying storage for almost all formats is the same (object storage), disaster recovery and fault tolerance are easy to manage
- Scaling and distributing the base storage is easy and manageable
- A single access point for all data, using simple SQL-92-standard queries
For more details, connect with us.
An Essential Guide to Our Scalable Data Ingestion Platform
Introduction
We recently received a solution request from a valued customer to ingest data from multiple sources into a storage medium so that it could be accessed, utilized, and analyzed for a variety of revenue-generating use cases. Massive amounts of data must be ingested in near real time, and the data must be queryable without the additional learning curve required to understand the nuances of big data ecosystem technologies like Spark or Flink.
This article discusses our initial approach, its drawbacks and difficulties, and how it helped us develop a truly innovative platform capable of handling billions of messages in real time. In situations where the persistence and consumption of huge volumes of incoming data are critical, this feature is an essential component of the DigiXT Data Platform (DDP).
Requirement
We received the request from one of the most prominent names in the telecommunication services industry. For future research, the customer wanted to archive millions of rows of data generated by telecom towers. Note that a single tower, on average, creates tens of millions of rows every second, and there are ten of these towers in total.
Figure 1: Problem Statement
Assume that roughly a billion rows per minute is typical. This data must be recorded so that it can be used to answer various analytical queries. Furthermore, the records must be retained for at least six months, the solution must accommodate concurrent queries, and query time must be as low as possible. As with any data-driven endeavor, SAAL's team began by delving deep into the data.
First Attempt
Googling to see whether anything existing can be incorporated, and, if required, adding some enhancements to current solutions, is the usual and seemingly more dependable way to solve any technical problem.
However, at Saal, we want the solution to be unique, scalable, performant, and secure, as well as to offer some additional advantages over the conventional alternatives. In our search, we found a commercial product that claimed to be capable of consuming billions of messages, provided the payload contained time-series data.
Telco data is inherently time-series, so a database like InfluxDB or RocksDB seemed ideal; many in the industry already employ them, as we have seen. This article also briefly goes over how to utilize RocksDB to ingest time-series data.
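As a brief sketch of that idea, the example below ingests time-series points into RocksDB through the python-rocksdb bindings (an assumption; any binding with put and prefix iteration works the same way). The key layout, metric name plus a big-endian timestamp, is our own convention chosen so that a prefix scan returns a metric's points in time order.

```python
# Sketch of time-series ingestion into RocksDB via the python-rocksdb bindings.
# Keys are "<metric>|<big-endian timestamp>" so an ordered scan over a metric's
# key prefix returns its points in time order. The key layout is our own convention.
import struct
import time
import rocksdb

db = rocksdb.DB("telemetry.db", rocksdb.Options(create_if_missing=True))

def write_point(metric: str, timestamp: int, value: float) -> None:
    key = metric.encode() + b"|" + struct.pack(">Q", timestamp)
    db.put(key, struct.pack(">d", value))

def scan_metric(metric: str):
    prefix = metric.encode() + b"|"
    it = db.iteritems()
    it.seek(prefix)
    for key, raw in it:
        if not key.startswith(prefix):
            break
        ts = struct.unpack(">Q", key[len(prefix):])[0]
        yield ts, struct.unpack(">d", raw)[0]

write_point("tower_42.throughput", int(time.time()), 812.5)
for ts, value in scan_metric("tower_42.throughput"):
    print(ts, value)
```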
Once convinced of the database selection, we proceeded with the Proof of Concept (POC), beginning the ingestion with the assistance of a highly skilled technical team from the commercial vendor. The first several days were a catastrophe, but we gradually gained momentum. We began ingesting data, and when we saw billions of messages, we knew things were improving and we were getting closer to our objective. As the ingestion rate rose and the data flow was sustained for longer, we discovered that disk I/O was extremely slow and the CPU spent most of its time waiting on disk I/O. The "dstat" tool in Linux can readily identify this. We were running the most recent version of Ubuntu on the POC machine, which was quite powerful in terms of compute: 96 cores and 768 GB of RAM. Here is an example of dstat output:
Figure 2: dstat analysis on the database
We can see that the CPU is doing almost nothing, the wait times are increasing, and the I/O is quite heavy, indicating that the system is unable to write efficiently as more data is ingested. This might be a problem with the database settings or the drive, but the disk was attached over iSCSI to SSDs, which are well suited to random reads and writes. As we saw, the connections, WAL-based writing, block size parameters, data compaction, and so on introduced too much delay and could not achieve the throughput the solution required. Unfortunately, the commercial product was unable to fulfill the requirements, even though it worked flawlessly at small scale.
Applying Innovation: The SAAL Difference
Honest reflection on the problem reveals what matters most here: the ability to ingest, store, and retrieve information. Given this, the SAAL team pondered whether we should rely on a database at all. Beyond safely storing, protecting, and retrieving data, we do not require full ACID functionality here. Durability and consistency are crucial, but file systems provide both. We do not need to join several tables, because the requirement is for all data to live in a single table. What we do need is effective compression and a mechanism for efficiently querying large amounts of data.
We quickly concluded that a columnar format in compressed mode was the best option, and because the data is massive, we examined some of the best formats used in the big data world:
As demonstrated, the Parquet 2 format provides the most benefits. Because individual tables in the big data ecosystem can be petabytes in size, attaining quick query response times necessitates intelligent filtering of table data based on criteria in the WHERE or HAVING clauses. Large tables are often partitioned using one or more columns that may efficiently filter the range of data. Date columns, for example, are frequently used as partition keys so that data partitions may be eliminated when a date range is given in SQL queries.
In addition to partition-level filtering, the Parquet file format enables file-level filtering based on the minimum and maximum values of each column in the file. These minimum/maximum column values are saved in the file's footer.
If the file's range of data, bounded by its minimum and maximum values, does not overlap the range of data requested by the query, the system skips the whole file during scans. Filtering on file-level minimum/maximum statistics was formerly coarse-grained: if a whole file could not be skipped, the full file had to be read. With the addition of Parquet Page Indexes to the Parquet format, scanners can reduce the amount of data read from disk even further, providing a substantial performance boost for SELECT queries in SAAL's DigiXT Data Platform.
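As a small, hedged illustration of the filtering described above (not the DDP ingestion code itself), the sketch below writes a date-partitioned Parquet dataset with pyarrow, reads it back with a filter so non-matching partitions and row groups can be skipped, and inspects the per-column min/max statistics stored in a file footer. The column names and paths are invented for the example.

```python
# Illustration of Parquet partition pruning and footer min/max statistics with pyarrow.
# This demonstrates the technique described above, not the DDP ingestion code itself.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "event_date": ["2024-01-01", "2024-01-01", "2024-01-02"],
    "tower_id":   [42, 17, 42],
    "throughput": [812.5, 640.2, 901.7],
})

# Partition by date so queries with a date predicate skip whole directories.
pq.write_to_dataset(table, root_path="telemetry", partition_cols=["event_date"])

# Reading with a filter prunes partitions (and row groups, via min/max stats).
subset = pq.read_table("telemetry", filters=[("event_date", "=", "2024-01-01")])
print(subset.num_rows)

# The min/max statistics used for skipping live in each file's footer.
pq.write_table(table, "single.parquet")
stats = pq.ParquetFile("single.parquet").metadata.row_group(0).column(2).statistics
print(stats.min, stats.max)
```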
The next step was to produce the Parquet files. Rather than relying on the database, we used a memory-mapped format to generate Parquet files on the client side, keeping them in line with industry best practices. As a result, we relieved the bottleneck at the database server, a tremendous improvement that let us achieve much higher ingestion rates than usual. The high-level architectural overview is presented below:
Basic benchmarking showed that we could ingest around 50 billion messages per day, with query times of under a minute.
The end result is as follows: we achieved 254,389 parallel splits of Parquet columnar data and queried 211 billion rows in three minutes using three nodes running the query in parallel.
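For a rough sense of scale, derived from the figures above: 211 billion rows in three minutes works out to roughly 1.17 billion rows scanned per second across the three nodes, or about 390 million rows per second per node, while ingesting 50 billion messages per day averages out to just under 580,000 messages per second.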
Summary
Thanks to this innovative approach, the platform suits any large-scale data ingestion requirement, including IoT applications and telco use cases. The DigiXT Data Platform offers an out-of-the-box JDBC/ODBC driver through which downstream applications can access the ingested data for analytics, machine learning, and reporting, facilitating a complete data-driven ecosystem.
For additional information, please contact us.