Data challenge
In Vietnam, the data held by agencies and organizations is often scattered and hard to access. Each branch or unit keeps its own database, and only part of the data is shared or stored centrally. Moreover, data formats are largely non-standardized: each company or organization usually stores information in its own structure, in forms such as text files, Excel files, and databases. The full content often exists only as physical paper records, while the digitized information is only a summary that does not accurately reflect the originals and is not interconnected.

Technology challenge
At present, the Hadoop ecosystem is the most widely used open source big data platform. Hadoop includes basic components such as the HDFS distributed file system, the HBase semi-structured database, the MapReduce computation model, and the Hive query processor. Hadoop was developed in 2006 for distributed data storage and processing. However, fast access to massive data scattered across thousands of servers remains a major subject of open source research and development. Interactive querying of large amounts of data (with responses in less than a few tens of seconds) is a practical requirement in analysis, monitoring, and forecasting problems. MapReduce is designed to process distributed data in batches, with execution times ranging from minutes to hours. The Hive query processor provides access to big data through SQL-like scripts, but its query processing is slow. Hive cannot respond in less than a dozen seconds, because each query statement is translated into and executed as a sequence of MapReduce jobs.
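The MapReduce model described above can be illustrated with a minimal, single-machine sketch in plain Python. It is not Hadoop or Hive code: the `visits` records, the `region` field, and all function names are hypothetical, chosen only to show how a SQL-like statement such as `SELECT region, COUNT(*) FROM visits GROUP BY region` breaks down into map, shuffle, and reduce phases. In a real cluster each phase runs across many servers and passes intermediate results between them, which is part of why response times are long.

```python
from collections import defaultdict

def map_phase(records):
    """Map: emit a (region, 1) pair for every input record."""
    for record in records:
        yield record["region"], 1

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the grouped values to get one count per key."""
    return {key: sum(values) for key, values in groups.items()}

def count_by_region(records):
    """Run the three phases in sequence, as one MapReduce job would."""
    return reduce_phase(shuffle_phase(map_phase(records)))

# Hypothetical sample data standing in for a large distributed table.
visits = [
    {"region": "Hanoi"},
    {"region": "Da Nang"},
    {"region": "Hanoi"},
]

print(count_by_region(visits))
```

A query with joins or several aggregation steps would need more than one such job chained together, with each job waiting for the previous one to finish.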

Human challenge
Using Hadoop requires skills in system operation, software development, and specialized data mining. Hadoop does not suit the vast majority of traditional users, who are used to working with small data stored in relational database management systems and to using SQL queries for data mining. In Vietnam, no training program formally includes big data storage and processing technology in its curriculum. The number of big data engineers and specialists is negligible compared with the big data potential in Vietnam. They are mainly trained abroad or self-taught at large companies that pioneer big data mining. In addition, operating big data technology such as Hadoop requires the ability to manage, tune, and optimize distributed systems made up of multiple layers, such as the storage layer, the network layer, and the server layer. Choosing technologies, tools, and algorithms for big data sets requires a great deal of expert experience. The picture below (Figure 2) shows a panorama of current technologies and tools for big data.

Infrastructure challenge
Big data storage and exploitation requires a huge investment in computing infrastructure, because it demands a great deal of storage and computing power, which in most cases means a cluster of up to tens of thousands of servers. This is also the main reason that the pioneers in big data are global internet companies such as Google, Amazon, and Facebook. Small and medium companies with limited capital cannot afford to build their own computing infrastructure powerful enough for big data mining. However, the recent development of cloud computing reduces the cost of infrastructure investment, as companies can lease server clusters for a period of time as the need arises.

For more information contact:

Mr Nguyen Tung Linh - 0989 890 326