In today’s digital age, data has become a critical resource, often called the oil of the new era. When you watch a video, your likes, favorites, comments and other actions all generate data, and that is only a tiny part of the world’s total. In the time it takes to read this sentence, more than 4 million GB of data are generated globally. Remarkably, about 90% of the world’s data was generated in the past two years, and the pace keeps accelerating, with global data volume doubling roughly every four years. Product iteration, business decision-making, AI development and much else all rely on data.
For large companies, handling massive amounts of data is no easy task. At an abstract level, the data pipeline consists of five stages: collection, ingestion, computation, storage, and consumption, but real systems are far messier. There are many open source components to choose from, their order is not fixed, and they are intertwined.

1. Data collection and ingestion: Collection means obtaining data from various sources. Databases such as MySQL are designed primarily for transactional workloads, so for analytics they are usually treated as sources from which data is extracted rather than queried directly. There is also streaming data from IoT devices such as smart homes and smart cars, as well as event data from various applications. Once you have the sources, the data must be ingested into the pipeline: some data first flows through streaming frameworks such as Kafka, some lands in a data lake via periodic batch ingestion, and in some cases it is computed on immediately after ingestion.

2. Data computation: Computation is mainly divided into batch processing and stream processing. Modern frameworks such as Spark and Flink unify the two and can handle both scenarios, gradually replacing Hadoop MapReduce, which supports only batch processing. Batch processing crunches large volumes of data on a schedule, for example summarizing the sales of all products once a day; stream processing suits real-time data, processing each record as soon as it arrives.

3. Data storage: There are many kinds of storage. Data lakes hold unprocessed raw data for downstream uses such as machine learning; data warehouses hold processed, structured data and typically serve BI, data visualization, and other query scenarios.
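To make the batch/stream distinction concrete, here is a minimal pure-Python sketch (deliberately not tied to Spark's or Flink's real APIs): a batch job aggregates a full day of sales records at once, while a stream job updates the same totals one event at a time as each record arrives.

```python
from collections import defaultdict

# A day's worth of hypothetical sales events: (product, amount)
events = [("book", 12.0), ("pen", 2.5), ("book", 8.0), ("pen", 1.5)]

def batch_totals(events):
    """Batch style: process the whole dataset at once on a schedule."""
    totals = defaultdict(float)
    for product, amount in events:
        totals[product] += amount
    return dict(totals)

class StreamTotals:
    """Stream style: keep running state, updated per incoming event."""
    def __init__(self):
        self.totals = defaultdict(float)

    def on_event(self, product, amount):
        self.totals[product] += amount

stream = StreamTotals()
for product, amount in events:
    stream.on_event(product, amount)

# Both approaches converge to the same daily summary.
assert batch_totals(events) == dict(stream.totals)
print(batch_totals(events))
```

The difference is not the arithmetic but the trigger: the batch function runs once over a complete dataset, while the stream class maintains state that is always up to date.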
In recent years, to simplify the stack, many integrated storage services have emerged that combine the two.

4. Data consumption: All of the preceding work ultimately exists so the data can be consumed efficiently. The data can feed data-science prediction and analysis, power dashboards and reports for PMs and executives, and serve as training data for AI. Because pipeline tasks depend on one another, their execution order must be scheduled sensibly: tools such as Airflow let users declare task dependencies as a DAG and then schedule each step.
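The DAG-scheduling idea can be sketched without Airflow itself. The following uses Python's standard-library `graphlib` to compute a valid execution order for a hypothetical pipeline; real Airflow declares dependencies with its own DAG and operator classes, so this only illustrates the ordering concept.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline tasks mapped to their upstream dependencies.
dag = {
    "ingest": set(),              # no upstream dependencies
    "clean": {"ingest"},          # runs only after ingest finishes
    "aggregate": {"clean"},
    "report": {"aggregate"},
    "train_model": {"clean"},     # a second branch downstream of clean
}

# A topological order guarantees every task runs after its dependencies.
order = list(TopologicalSorter(dag).static_order())
print(order)
```

A real scheduler adds retries, timetables, and parallel execution of independent branches (here `report` and `train_model` could run concurrently), but the dependency ordering above is the core contract a DAG expresses.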
Generally speaking, an enterprise's big data architecture must be assembled from many open source components. Offline data, real-time data, batch processing, stream processing and other workloads all appear in the business, involving a large number of components and driving up development, operations, and maintenance costs.
In the era of data-centric artificial intelligence, almost all software is being redesigned. Take the newly released Tencent tc house-X data platform as an example; it illustrates several differences of the AI era.

1. Integrated design: Building a data architecture the traditional way is like assembling the building blocks yourself, which is time-consuming and labor-intensive. tc house-X, by contrast, works out of the box, like a castle that arrives already built. This integrated design is not only convenient but also avoids the multiple data copies that traditional stacks scatter across components: users can create multiple virtual data warehouses over a single copy of the data to support different businesses, avoiding the risk of inconsistency and saving storage space. Moreover, the virtual warehouses are resource-isolated from one another, so a heavy computing task will not degrade the query experience of other businesses, and each virtual warehouse can scale independently.

2. Flexibility from cloud-native design: tc house-X's compute and storage scale independently, greatly reducing resource waste. For example, after some of Tencent's businesses migrated to the platform, compute consumption fell to less than 1/10 of the original. Alongside the resource savings, the team pursues performance with a self-developed core engine: after the Tencent Meeting team migrated, it used only 1/3 of the original compute resources while query performance improved 2 to 4 times.

3. Intelligence: The platform's intelligence has two aspects: AI for data, and data for AI.
AI for data means using AI to make the data platform more capable: for example, letting users query data in natural language, which helps users who do not know SQL; or analyzing load patterns over time with machine learning to predict and dynamically adjust the resources required, saving customers money. Data for AI means making the data platform serve AI better. Under the traditional architecture, the big data and AI stacks are separate, so two systems must be built and operated independently and data must be imported and exported repeatedly; tc house-X integrates the two so that AI workloads can work with the data where it lives.
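As a toy illustration of the load-prediction idea (the platform's actual models are not public, so everything here is an assumption), this sketch forecasts the next period's load with a simple moving average and sizes compute capacity with some safety headroom.

```python
import math

def moving_average_forecast(history, window=3):
    """Predict next-period load as the mean of the last `window` observations."""
    recent = history[-window:]
    return sum(recent) / len(recent)

def plan_capacity(history, capacity_per_node=100.0, headroom=1.2):
    """Provision enough nodes for the forecast load plus 20% headroom."""
    forecast = moving_average_forecast(history)
    return max(1, math.ceil(forecast * headroom / capacity_per_node))

# Hypothetical hourly query load (requests per second).
hourly_load = [240.0, 260.0, 250.0, 300.0, 320.0, 310.0]
print(plan_capacity(hourly_load))  # forecasts 310 rps and provisions 4 nodes
```

A production autoscaler would use far richer models (seasonality, spikes, cost constraints), but the loop is the same: observe load, forecast, adjust resources before demand arrives rather than after.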
Platforms like tc house-X point in this direction, and as the technology continues to develop, I believe more innovative products of this kind will emerge, pushing enterprises to new heights in data processing and application. How do you see the future of data platforms? Leave a comment and share your thoughts, and don’t forget to like and share this article so that more people can understand the mysteries of big data architecture.