Big Data: The power of petabytes
Source: 一起赢论文网     Date: 2019-03-04

BY MICHAEL EISENSTEIN

Fifteen years ago, it was a landmark achievement. Ten years ago, it was an intriguing but highly expensive research tool. Now, falling costs, soaring accuracy and a steadily expanding base of scientific knowledge have brought genome sequencing to the cusp of routine clinical care. A growing number of institutions are conducting genome-wide 'dragnet' searches to identify the mutations responsible for rare diseases. "The rate at which we're finding causative variants in those cases is going up," says Russ Altman, a bioinformatician at Stanford School of Medicine in California. "At some centres, it's up to 50% of cases." Genomic variants can also reveal 'driver' mutations that might reveal a tumour's therapeutic vulnerabilities, or provide clues to whether a specific individual may or may not respond to a drug (the drug's 'pharmacogenetic' properties).
The US$1,000 genome, initially conceived as a price point at which sequencing could become a component of personalized medicine, has arrived. "Our capacity for data generation relative to price has increased in a way that is almost unprecedented in science: roughly six orders of magnitude in the past seven or eight years," says Paul Flicek, a specialist in computational genomics at the European Molecular Biology Laboratory's European Bioinformatics Institute in Cambridge, UK. The HiSeq X Ten system developed by Illumina of San Diego, California, can sequence more than 18,000 human genomes per year, for example.
The biomedical research community is diving in whole-heartedly, with population-scale programmes that are intended to explore the clinical power of the genome. In 2014 the United Kingdom launched the 100,000 Genomes Project, and both the United States (under the Precision Medicine Initiative) and China (in a programme to be run by BGI of Shenzhen) have unveiled plans to analyse genomic data from one million individuals.
Many other programmes are under way that, although more regional in focus, are still 'big data' operations. A partnership between Geisinger Health System, based in Danville, Pennsylvania, and biotech firm Regeneron Pharmaceuticals of Tarrytown, New York, for instance, aims to generate sequence data for more than 250,000 people. Meanwhile, a growing number of hospitals and service providers worldwide are sequencing the genomes of people with cancers or rare hereditary disorders (see 'DNA sequencing soars').
Some researchers worry that the flood of data could overwhelm the computational pipelines needed for analysis and generate unprecedented demand for storage: one article estimated that the output from genomics may soon dwarf data heavyweights such as YouTube. Many also worry that today's big data lacks the richness to provide clinical value. "I don't know if a million genomes is the right number, but clearly we need more than we've got," says Marc Williams, director of the Geisinger Genomic Medicine Institute.

[Standfirst: Researchers are struggling to analyse the steadily swelling troves of '-omic' data in the quest for patient-centred health care. Credit: Tatiana Plakhova. Big Data in Biomedicine Outlook, Nature, vol. 527, S3, 5 November 2015. © 2015 Macmillan Publishers Limited. All rights reserved.]

THE MEANING OF MUTATIONS
Clinical genomics today is largely focused on identifying single-nucleotide variants (individual 'typos' in the genomic code that can disrupt gene function). And rather than looking at the full genome, many centres focus instead on the exome: the subset of sequences containing protein-coding genes. This reduces the amount of data being analysed nearly 100-fold, but the average exome still contains more than 13,000 single-nucleotide variants.
Roughly 2% of these are predicted to affect the composition of the resulting protein, and finding the culprit for a given disease is a daunting challenge.
For decades, biomedical researchers have dutifully deposited their discoveries of single-nucleotide variants in public resources such as the Human Gene Mutation Database, run by the Institute of Medical Genetics at Cardiff University, UK, or dbSNP, maintained by the US National Center for Biotechnology Information. However, the effects of these mutations were often determined from cell culture or animal models, or even theoretical predictions, providing insufficient guidance for clinical diagnostic tools. "In many cases, associations were made with relatively low levels of evidence," says Williams.
The situation is even more complicated for structural variants, such as duplicated or missing chunks of genome sequence, which are far more difficult to detect with existing sequencing technologies than single-nucleotide variants. At the whole-genome scale, each person has millions of variants. Many of these are in sequences that do not encode proteins but instead regulate gene activity, so they can still contribute to disease. However, the extent and function of these regulatory regions are poorly defined. Although capturing all this variability is desirable, it may not offer the best short-term returns for clinical sequencing. "You're shooting yourself in the foot if you're collecting data you don't know how to interpret," says Altman.
Efforts are now under way to rectify this problem. The Clinical Genome Resource, which was set up by the US National Human Genome Research Institute, is a database of disease-related variants, and contains information that could guide medical responses to these variants as well as the evidence supporting those associations.
Genomics England, which runs the 100,000 Genomes Project, aims to bolster progress in this area by establishing 'clinical interpretation partnerships': doctors and researchers will collaborate to establish robust models of diseases that can potentially be mapped to specific genetic alterations.
However, quantity is as important as quality. Mutations with a strong detrimental effect confer an evolutionary disadvantage, so they tend to be exceedingly rare and require large sample sizes to detect. Establishing statistically meaningful disease associations for variants with weak effects also needs large numbers of people.
In Iceland, deCODE Genetics has demonstrated the power of population-scale genomics, combining extensive genealogy and medical-history records with genome data from 150,000 people (including 15,000 whole-genome sequences). These findings have allowed deCODE to extrapolate the population-wide distribution of known genetic risk factors, including gene variants linked to breast cancer, diabetes and Alzheimer's disease. They have also enabled studies in humans that normally require the creation of genetically modified animals. "We have established that there are about 10,000 Icelanders who have loss-of-function mutations in both copies of about 1,500 different genes," says Kári Stefánsson, the company's chief executive. "We're putting significant effort into figuring out what impact the knockout of these genes has on individuals."
This work was helped by the homogeneous nature of the Icelandic population, but other projects require a broadly representative spectrum of donors. Efforts such as the international 1000 Genomes Project have catalogued some of the world's genetic diversity, but most data are heavily skewed towards Caucasian populations, making them less useful for clinical discovery.
"Because they come from the genetic mother ship, so to speak, people of African ancestry carry a lot more genetic variants than non-Africans," says Isaac Kohane, a bioinformatician at Harvard Medical School in Boston, Massachusetts. Variants that seem unusual in Caucasians might be common in Africans, and may not actually cause disease.
Part of the problem stems from the reference genome: the yardstick sequence by which scientists identify apparent abnormalities, developed by the multinational Genome Reference Consortium. The first version was cobbled together from a few random donors of undefined ethnicity, but the latest iteration, known as GRCh38, incorporates more information about human genomic diversity.

INTO THE CLOUD
Harvesting genomes or even exomes at the population scale produces a vast amount of data, perhaps up to 40 petabytes (40 million gigabytes) each year. Nevertheless, raw storage is not the primary computational concern. "Genomicists are a tiny fraction of the people who need bigger hard drives," says Flicek. "I don't think storage is a significant problem."
A greater concern is the amount of variant data being analysed from each individual. "The computation scales linearly with respect to the number of people," says Marylyn Ritchie, a genomics researcher at Pennsylvania State University in State College. "But as you add more variables, it becomes exponential as you start to look at different combinations." This becomes particularly problematic if there are additional data related to clinical symptoms or gene expression. Processing data of this magnitude from thousands of people can paralyse tools for statistical analysis that might work adequately in a small laboratory study.
Scaling up requires improvisation, but there is no need to start from scratch. "Fields like meteorology, finance and astronomy have been integrating different types of data for a long time," says Ritchie.
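Ritchie's contrast between linear and combinatorial scaling can be made concrete with a rough count. The Python sketch below is an illustration only, not a method described in the article: it assumes a hypothetical cost model in which each person contributes one genotype lookup per variant combination tested, and it borrows the figure of 13,000 exome variants quoted earlier. The function names are invented for this example.

```python
from math import comb


def lookups(n_people: int, n_variants: int, order: int = 1) -> int:
    """Genotype lookups needed to test every `order`-way variant combination.

    Hypothetical cost model: one lookup per person per combination.
    """
    return n_people * comb(n_variants, order)


def all_combinations(n_variants: int) -> int:
    """Number of non-empty variant subsets: the exponential worst case."""
    return 2 ** n_variants - 1


if __name__ == "__main__":
    variants = 13_000  # exome variant count quoted in the article
    for people in (1_000, 2_000):
        # Doubling the cohort doubles the work at any fixed order.
        print(people, lookups(people, variants, 1), lookups(people, variants, 2))
    # Growth in the number of variables considered together is far steeper.
    for order in (1, 2, 3):
        print(order, comb(variants, order))
    print(all_combinations(20))
```

Under these assumptions, doubling the cohort doubles the count, whereas moving from single variants to pairs multiplies it by roughly 6,500.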
"I've been to meetings where I talk to people from Google and Facebook, and our 'big data' is nothing like their big data," she says. "We should talk to them, figure out how they've done it and adopt it into our field."
Unfortunately, many talented programmers with the skills to wrangle big data sets are lured away by Silicon Valley. Philip Bourne, associate director for data science at the US National Institutes of Health (NIH), believes that this is partly due to a lack of recognition and advancement within a publication-driven system of scientific credit that leaves software creators and data managers out in the cold. "Some of these people truly want to be scholars, but they can't get the stature of faculty: that's just not right," says Bourne.
Processing power is another limiting factor. "This is not a desktop game: the real practitioners are proficient in massively parallel computation with hundreds if not thousands of CPUs, each with large memory," says Kohane.

[Figure: DNA SEQUENCING SOARS. Human genomes are being sequenced at an ever-increasing rate. The 1000 Genomes Project has aggregated hundreds of genomes; The Cancer Genome Atlas (TCGA) has gathered several thousand; and the Exome Aggregation Consortium (ExAC) has sequenced more than 60,000 exomes. Dotted lines show three possible future growth curves. Axes: year (2001 to 2025) against cumulative number of human genomes (log scale, 10^0 to 10^9). Labelled points: Human Genome Project, 1st personal genome, 1000 Genomes, TCGA, ExAC, current amount. Lines: recorded growth, and projections that double every 7 months (historical growth rate), every 12 months (Illumina estimate) and every 18 months (Moore's law). Source: Stephens, Z. D. et al. PLoS Biol. 13, e1002195 (2015); CC BY 4.0, http://creativecommons.org/licenses/by/4.0]
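The three projection curves in 'DNA sequencing soars' differ only in their doubling time (7, 12 or 18 months). As a rough sketch of how quickly such curves drift apart, the Python below applies plain exponential doubling. The starting count of 250,000 genomes is a placeholder chosen for illustration, not a number read from the figure, and the function names are hypothetical.

```python
import math


def projected_genomes(current: float, months_ahead: float, doubling_months: float) -> float:
    """Cumulative genomes after `months_ahead`, doubling every `doubling_months`."""
    return current * 2 ** (months_ahead / doubling_months)


def months_to_reach(current: float, target: float, doubling_months: float) -> float:
    """Months needed to grow from `current` to `target` at a fixed doubling time."""
    return doubling_months * math.log2(target / current)


if __name__ == "__main__":
    start = 250_000  # placeholder starting count (assumption, not from the figure)
    scenarios = {
        "historical growth rate": 7,
        "Illumina estimate": 12,
        "Moore's law": 18,
    }
    for label, doubling in scenarios.items():
        after_ten_years = projected_genomes(start, 120, doubling)
        print(f"{label}: {after_ten_years:.3g} genomes after 10 years")
```

Because the doubling time sits in the exponent, the gap between the scenarios itself grows exponentially: a curve that doubles every 7 months passes through 12 doublings in 7 years, against 7 doublings for the 12-month curve.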
Many groups that analyse massive amounts of sequence data are moving to cloud-based architectures, in which the data are deposited within a large pool of computational resources and can then be analysed with whatever processing power is required. "There's been a gradual evolution towards this idea that you bring your algorithms to the data," says Tim Hubbard, head of bioinformatics at Genomics England. For Genomics England, this architecture is contained in a secure government facility, with strict control over external access. Other research groups are turning to commercial cloud systems, such as those provided by Amazon or Google.

PRIVACY PROTECTION
In principle, cloud-based hosting can encourage sharing and collaboration on data sets. But regulations on patient consent and privacy rights surrounding highly sensitive clinical information pose tricky ethical and legal issues. In the European Union, collaboration is impeded by member states having different rules on data handling. Sharing with non-EU nations relies on cumbersome mechanisms to establish adequacy of data protection, or restrictive bilateral agreements with individual organizations.
To help solve this problem, a multinational coalition, the Global Alliance for Genomics and Health, developed the Framework for Responsible Sharing of Genomic and Health-Related Data. The Framework includes guidelines on privacy and consent, as well as on accountability and legal consequences for those who break the rules. "In data-transfer agreements, you could save yourself pages and pages of rules if the institution, researcher and funder agree to follow the Framework," says Bartha Knoppers, a bioethicist at McGill University in Montreal, Canada, who chairs the Alliance's regulatory and ethics working group.
The Framework also calls for 'safe havens' that allow the research community to analyse centralized banks of genomic data that have been identity-masked but not fully de-identified, so they remain useful.
"We want to link it to clinical data and to medical records, because we're never going to get to precision medicine otherwise, so we're going to have to use coded data," explains Knoppers.
Integrating genomics into electronic health records is becoming increasingly important for many European nations. "Our objective is to put this into the standard National Health Service," says Hubbard. The UK 100,000 Genomes Project may be the furthest along at the moment, but other countries are following. Belgium recently announced an initiative to explore medical genomics, for example.
All these nations benefit from having centralized, government-run health-care systems. In the United States, the situation is more fragmented, with different providers relying on distinct health-record systems, supplied by different vendors, that are generally not designed to handle complex genomic data. The NIH launched the Electronic Medical Records and Genomics (eMERGE) Network in 2007 to define best practices.

FROM DATA TO DIAGNOSIS
The immediate goal of genomically enriched health records is to explain the implications of gene variants to physicians, and one of its earliest implementations is pharmacogenetics. The Clinical Pharmacogenetics Implementation Consortium has translated known drug-gene interactions reported in PharmGKB (a database run by Altman and his colleagues) for clinical use. For example, people with certain variants may respond poorly to particular anticoagulants, leading to increased risk of heart attack. "The issue there is, how do you take a practitioner who has 12 minutes per patient and about 45 seconds of time allocated for prescribing drugs, and influence their practice in a meaningful way?" says Altman.
As long as deciding how to adapt care to genetic findings remains a job for humans, this process will remain time- and labour-intensive. Nevertheless, combining genotype and phenotype information is proving fruitful from a research perspective.
Most clinically relevant gene variants were identified through genome-wide association studies, in which large populations of people with a given disease were examined to identify closely associated genetic signatures. Researchers can now work backwards from health records to determine what clinical manifestations are prevalent among individuals with a given genetic variant.
And the genome is only part of the story: other '-omes' may also be useful barometers of health. In July, Jun Wang stepped down as chief executive of BGI to start up an organization to analyse BGI's planned million-genome cohort alongside equivalent data sets from the proteome, transcriptome and metabolome. "I will be initiating a new institution to focus on using artificial intelligence to explore this kind of big data," he says.

IT TAKES PATIENTS
As researchers strive to integrate data from health records and clinical trials with genomic and other physiological data, patients are starting to contribute. "When we're focused on things like behaviour, nutrition, exercise, smoking and alcohol, you can't get better data than what patients report," says Ritchie. Wearable devices, such as smartphones and FitBits, are collecting data on exercise and heart rate, and the volume of such data is soaring (see page S12) as it can be gathered with minimal effort on the wearer's part.
Each patient may become a big-data producer. "The data we generate at home or in the wild will vastly exceed what we accumulate in clinical care," says Kohane. "We're trying to create these big collages of different data modalities, from the genomic to the environmental to the clinical, and link them back to the patient."
As these developments materialize, they could create computational crunches that will make today's 'big data' struggles seem like pocket-calculator problems. And as scientists find ways to crunch the data, patients will be the ultimate winners.

Michael Eisenstein is a freelance s
