Case Studies: Big Data and Scientific Research



This is the fifth and final article in an editorial series aimed at providing a roadmap for scientific researchers who wish to capitalize on the rapid growth of big data technology to collect, transform, analyze, and visualize large collections of scientific data.

In the last article, we reviewed the open data movement in scientific research and how it relates to big data. The entire insideBIGDATA Guide to Scientific Research is available for download in the insideBIGDATA White Paper Library.

To illustrate how rapidly the scientific community is adopting the big data technology stack, this section examines a number of research projects that have benefited from these tools. These project profiles also show how big data regularly converges with traditional HPC architectures. In each case, significant amounts of data are collected and analyzed in pursuit of an unprecedented understanding of nature and the universe.

Tulane University

As part of its efforts to rebuild after Hurricane Katrina, Tulane University has partnered with Dell and Intel to create a new HPC cluster to enable analysis of large scientific datasets. The cluster is essential for fueling big data analysis in support of scientific research in life sciences and other fields. For example, the school has many oncology research projects that involve statistical analysis of large data sets. Tulane also has researchers studying nanotechnology, the manipulation of matter at the molecular level, involving large amounts of data.

Tulane worked with Dell to design a new HPC cluster dubbed Cypress, consisting of 124 Xeon-based Dell PowerEdge C8220X server nodes connected through the high-density, low-latency Z9500 switch, delivering a theoretical peak compute performance of over 350 teraflops. Dell also leveraged its relationship with Intel, which in turn leveraged its relationship with Cloudera, the leading Hadoop distribution, enabling Tulane to perform big data analysis using Hadoop in an HPC environment.
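Figures like "350 teraflops theoretical peak" come from a simple multiplication over the cluster's hardware. The sketch below shows that arithmetic; the per-node parameters are illustrative assumptions, not Cypress's actual configuration (the source gives only the node count and the total). CPU cores alone rarely account for such totals; coprocessors and accelerators often contribute much of the peak.

```python
# Back-of-the-envelope theoretical peak for an HPC cluster:
# peak FLOPS = nodes * sockets/node * cores/socket * clock (GHz) * FLOPs/cycle.
# All per-node values below are hypothetical, chosen only to illustrate the
# formula -- they are not the published Cypress specifications.

def peak_teraflops(nodes, sockets_per_node, cores_per_socket,
                   clock_ghz, flops_per_cycle):
    """Theoretical peak in teraflops (1 TF = 1e12 FLOP/s)."""
    total_cores = nodes * sockets_per_node * cores_per_socket
    return total_cores * clock_ghz * flops_per_cycle / 1000.0

# Example: 124 dual-socket nodes, 10 cores per socket, 2.8 GHz,
# 8 double-precision FLOPs per cycle (typical for AVX-era Xeons).
tf = peak_teraflops(124, 2, 10, 2.8, 8)
print(f"{tf:.1f} teraflops")  # prints 55.6 teraflops
```

Sustained (measured) performance is always lower than this theoretical ceiling, which is why benchmarks such as LINPACK report a separate Rmax figure.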

The use of Cypress allows Tulane to conduct new scientific research in areas such as epigenetics (the study of mechanisms that regulate gene activity), cytometry (the measurement of particular subsets of cells within a tissue type of the human body), primate research, sports-related concussion research, and human brain mapping.

Arizona State University

ASU worked with Dell to create a powerful HPC cluster that supports big data analysis. As a result, ASU has created a holistic Next Generation Cyber Capability (NGCC) using Dell and Intel technologies, capable of handling structured and unstructured data, as well as supporting various biomedical genomics tools and platforms.

ASU turned to Dell and Intel to expand its HPC cluster. The resulting NGCC delivers 29.98 teraflops of sustained performance for HPC, big data, and massively parallel (or transactional) processing with 150 nodes and 2,400 cores. The HPC side of the NGCC consists of 100 Dell PowerEdge M620 servers with Intel® Xeon® E5-2660 processors and 1,360 cores. The transactional portion of the NGCC includes 20 Dell PowerEdge M420 servers, each with Intel Xeon E5-2430 processors.

The HPC cluster and the Cloudera Hadoop distribution on which the NGCC is based can handle datasets of over 300 terabytes of genomic data. In addition, ASU uses the NGCC to understand certain types of cancer by analyzing genetic sequences and patient mutations.
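The reason Hadoop suits genomic workloads like these is the MapReduce pattern: a map phase emits key-value pairs from each slice of the data, and a reduce phase aggregates them per key across the cluster. The pure-Python sketch below mimics that pattern on a toy task, counting k-mers (length-k subsequences) in DNA reads; the function names and data are illustrative, not part of the NGCC deployment, and on a real Hadoop cluster each phase would run in parallel across HDFS blocks.

```python
# Illustrative single-machine stand-in for the MapReduce pattern Hadoop
# applies to genomic data: counting k-mers across DNA reads.
# All names and data here are hypothetical examples.
from collections import Counter
from itertools import chain

def map_kmers(read, k=3):
    """Map phase: emit every k-mer (length-k subsequence) in one read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def reduce_counts(mapped):
    """Reduce phase: sum the occurrences of each k-mer key."""
    return Counter(chain.from_iterable(mapped))

reads = ["GATTACA", "ATTAC"]
counts = reduce_counts(map_kmers(r) for r in reads)
print(counts["ATT"])  # "ATT" occurs once in each read, so prints 2
```

At the 300-terabyte scale the source describes, the same two-phase structure holds; the framework's job is distributing the map tasks, shuffling keys to reducers, and surviving node failures.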

National Center for Supercomputing Applications

The National Center for Supercomputing Applications (NCSA) provides computational, data, networking, and visualization resources and services that help scientists, engineers, and academics at the University of Illinois at Urbana-Champaign and across the country. The organization manages several supercomputing resources, including the iForge HPC cluster based on Dell and Intel technologies.

A particularly compelling scientific research project housed at NCSA is the Dark Energy Survey (DES), a survey of the southern sky aimed at understanding the accelerating expansion of the universe. The project relies on the iForge cluster and ingests around 1.5 terabytes of data per day.

Translational Genomics Research Institute

To advance health through genomic sequencing and personalized medicine, the Translational Genomics Research Institute (TGen) needs a robust, scalable high-performance computing environment complemented by powerful big data analysis tools for its Dell | Hadoop platform. TGen optimized its infrastructure by implementing the Dell Statistica analytics software and upgrading its existing Dell HPC cluster with Dell PowerEdge M1000e blade enclosures, Dell PowerEdge M420 blade servers, and Intel processors. The increased performance accelerated experimental results, allowing researchers to extend treatments to more patients.

As gene sequencers increase in speed and capacity, TGen has expanded its HPC cluster to 96 nodes, using state-of-the-art PowerEdge servers with Intel® Xeon® processors that deliver 19 teraflops of processing power. The cluster supports 1 million CPU hours per month and 100 percent year-over-year data growth. To handle this level of big data, TGen expanded its existing Terascala storage to hold 1 petabyte.


The explosion of big data is transforming the way scientists conduct their research. Grants and research programs aim to improve the core technologies for managing and processing big data sets and to accelerate scientific research with big data. The emerging field of data science is changing the direction and speed of scientific research by allowing researchers to refine their work by tapping into giant datasets. Scientists have long worked with data; what is new is the overwhelming scale of that data, which poses an infrastructure challenge. Researchers must now be able to tame large datasets with new big data and HPC software tools to make rapid progress in their fields.

If you prefer, the full insideBIGDATA Guide to Scientific Research is available for download in PDF format from the insideBIGDATA White Paper Library, courtesy of Dell and Intel.


