AMPLab, UC Berkeley, here.
We make sense of the world around us by turning data into information. For years, research in fields such as machine learning (ML), data mining, databases, information retrieval, natural language processing, and speech recognition has steadily improved techniques for revealing the information lying within otherwise opaque datasets. But computer science is now on the verge of a new era in data analysis because of several recent developments: the rise of the warehouse-scale computer (WSC), the massive explosion in online data, the increasing diversity and time-sensitivity of queries, and the advent of crowdsourcing. Together these trends — often referred to collectively as Big Data — have the potential to usher in a new era in data analysis, but realizing this opportunity requires us to confront several significant scientific challenges:
WSCs and cloud computing have made the world’s largest computing facilities generally available. However, the programming environments developed for these WSCs are only effective on a narrow range of tasks. Supporting the more varied demands of general data analysis will require a new software infrastructure for WSCs incorporating flexible programming abstractions specifically tailored to the highly parallel datacenter computing environment.
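As one illustration of the kind of programming abstraction involved, here is a minimal data-parallel map/merge sketch in Python. All names are hypothetical and this is not a specific AMPLab system; a real WSC framework would add distribution across machines, scheduling, fault tolerance, and data locality.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def map_count(partition):
    """Map phase: count words within one partition of the input."""
    return Counter(partition.split())

def word_count(partitions, workers=4):
    """Scatter partitions across workers, then merge the partial counts.

    A datacenter framework would run the map phase on many machines;
    a local thread pool stands in here only to keep the sketch runnable.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(map_count, partitions)
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total
```

The point of such an abstraction is that the analyst writes only the per-partition logic (`map_count`) and the merge step, while the framework decides how to parallelize.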
Massive amounts of new online data provide significantly more raw material for data analysis. However, this data comes from diverse sources with no common schema and is of variable quality. We need radically new data management techniques to tame these huge, heterogeneous and highly imperfect datasets.
The great diversity of data sources will enable a far greater range of queries than those supported by traditional data analysis systems, and the ever-increasing size of the datasets means that traditional data-analytics algorithms will require more computational resources and incur higher delays. We thus need far more flexible, scalable, and tunable analysis algorithms so that, over a wide range of queries, explicit tradeoffs can be made between delay, cost, and quality-of-answer.
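The delay/cost/quality knob described above can be sketched with simple uniform sampling: answering from a small sample is fast and cheap but imprecise, and a larger sample tightens the answer at higher cost. This is illustrative only; the function name and the normal-approximation error bound are our assumptions, not a particular AMPLab algorithm.

```python
import random
import statistics

def approximate_mean(data, sample_fraction=0.1, seed=0):
    """Estimate the mean of `data` from a uniform random sample.

    `sample_fraction` is the tradeoff knob: a larger fraction costs
    more time/compute but yields a tighter confidence interval.
    Returns (estimate, half_width of a rough 95% confidence interval).
    """
    rng = random.Random(seed)
    n = max(2, int(len(data) * sample_fraction))
    sample = rng.sample(data, n)
    estimate = statistics.fmean(sample)
    # Rough 95% interval via the normal approximation to the sample mean.
    half_width = 1.96 * statistics.stdev(sample) / n ** 0.5
    return estimate, half_width
```

A query planner built on this idea could pick `sample_fraction` automatically from a user-specified deadline or error target.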
Crowdsourcing allows, for the first time, large-scale and on-demand invocation of human input. For problems that are “ML-hard” (i.e., are difficult for traditional machine learning and other automated tools), crowdsourcing provides an attractive alternative. To be widely useful, however, these crowdsourcing methods must be tightly integrated within more general data analytics frameworks.
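One simple way such integration could look, sketched under our own assumptions (the `classifier` and `crowd_votes` callables are hypothetical stand-ins, not a real crowdsourcing API): route each item through an automated model, and fall back to a majority vote over crowd answers only when the model's confidence is low.

```python
from collections import Counter

def label_items(items, classifier, crowd_votes, confidence_threshold=0.8):
    """Hybrid ML/crowd labeling sketch.

    `classifier(item)` returns (label, confidence); when confidence is
    below the threshold, the item is escalated to `crowd_votes(item)`,
    which returns a list of worker answers resolved by majority vote.
    """
    labels = {}
    for item in items:
        label, confidence = classifier(item)
        if confidence >= confidence_threshold:
            labels[item] = label          # cheap automated path
        else:
            votes = crowd_votes(item)     # expensive human path
            labels[item] = Counter(votes).most_common(1)[0][0]
    return labels
```

The design choice here is the essence of "ML-hard" handling: human effort is spent only on the items the automated tools cannot decide.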
Meeting these challenges will require an entirely new approach that transcends and reshapes disciplinary boundaries. The AMPLab is a five-year collaborative effort at UC Berkeley, involving students, researchers, and faculty from a wide swath of computer science and data-intensive application domains, to address the Big Data analytics problem. AMP stands for “Algorithms, Machines, and People”. AMPLab envisions a world where massive data, cloud computing, communication, and people resources can be continually, flexibly, and dynamically brought to bear on a range of hard problems by people connected to the cloud via devices of increasing power and sophistication.
Along with traditional research funding agencies, AMPLab is sponsored by, and works with, many of the world’s leading technology companies and innovative start-ups. These companies participate in twice-yearly multi-day research retreats, provide advice and real-world insight, and interact closely with researchers on projects of mutual interest throughout the research process. AMPLab researchers (including faculty) share an open-plan collaborative research space designed to encourage interactions across the various areas of expertise in the lab.