This is especially true for those of us who are early adopters blithely chopping our way through the heavy underbrush looking for El Dorado. Having just come through a 12-month development project for a large-scale entity resolution system running on HDFS and Hadoop, we will happily share some of our key lessons learned while the wounds are still fresh.
1. Software Configuration Management
The distributed processing technologies of Big Data are in constant flux. The good news is, every year new open source technologies emerge to help cut through the difficulties of distributed processing. Beyond those, new Big Data technologies seem to appear monthly, and new versions of current technologies seem to come out weekly. This, coupled with the fact that almost all new tools rely on particular versions of Java, makes configuration management an ongoing challenge.
Tool stacks for job flows often use three to five of these tools, and it is not uncommon to use ten or more. Adding to the problem of managing interactions between tool versions, there is the additional complication of platform management. Even though these tools require a real distributed computing environment to take advantage of their full horsepower, for convenience and economy the applications are often developed on a single-server platform setup with virtual machines to emulate a distributed processing environment.
When the target production environment is in the cloud, it can sometimes be difficult to recreate the same cloud environment in the virtual development environment regardless of the platform used. Workflows that run flawlessly in the development environment may have problems when uploaded to the cloud, which leads to the next issue.
2. Debugging, Testing, and Validation
Part of the appeal of Big Data IT is pushing the parallelism to the middleware. This allows programmers to primarily focus on application logic and let the system worry about the distribution of work. Unfortunately, the added distributed processing layer can make software debugging difficult, so much so that debugging map-reduce has become its own small part of the software consultation industry.
One reason debugging is so difficult is because it is not always obvious when a problem is in the application logic, the result of a glitch, or an unexpected default setting in the underlying distributed processing environment. Error messages from the distributed environment are often less than helpful, assuming the messages are even visible. Messages logged in processor nodes are not automatically pushed back to the master node and may not be visible at the application layer without thoughtful design and coding.
Testing has its challenges as well. In traditional single-processor environments, regression testing can exercise many specific functions in a short amount of time by simply scaling back the size of the test data. However, in distributed processing environments, the initialization time can be significant and is often independent of the data size. A large suite of regression test cases can take several hours to run, even though the data size of each test case is quite small, simply because of the time required to initialize each test case.
For testing purposes, we often tend to use smaller data sets so the final outputs can be manually validated and reviewed, especially when there is a test failure. However, when using small test datasets, care should be taken to force the system to actually use multiple processing nodes. Without intervention, distributed processing systems like Hadoop Map/Reduce may default to a single reducer. This may cause the test to miss problems and interactions that only emerge when the system actually distributes data across nodes to process large datasets.
While constructing test data has always been difficult, it can be especially challenging for the distributed environment. There needs to be enough data to actually exercise distributed processing and assessing performance, but at the same time, it needs to be constructed in a way that the correct outputs are known.
3. Real-Time Access
Many Big Data applications run in two modes: a background batch process for life cycle maintenance of the data, and a foreground process that demands real-time or near real-time access. Master data management systems often require this dual action.
The distributed processing tools are more than up to the challenge of processing the large batches of Big Data to update the system, but were not initially designed to deliver the sub-second response times required to support real-time transactional identity resolution. Transactional processing is where the first issue of software configuration management can make magnitudes of difference in the system’s performance. Fine-tuning to achieve acceptable levels of performance often takes considerable experimentation with different combinations of configurations, along with clever design tricks.
While many key concepts of Big Data float around the Internet, the practical IT implementation concepts are still core issues many companies face. We spent the last year developing our scalable Entity Resolution software (HiPER) using distributed processing.
To take advantage of our expertise in distributed processing, entity resolution, and/or hacking through jungles of emerging Big Data concepts, please contact Black Oak Analytics or request a consultation to learn more about our HiPER platform.