A Normal University Student: My Biostatistics Research at Harvard

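Out of respect for the privacy of students and parents, the material shown below is only part of the research material.

Much more academic material cannot be made public, either for intellectual property reasons or at parents' request. Thank you for your understanding.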

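The author, student Z, is a senior majoring in statistics at a key normal university in mainland China. She hopes to enter a top US university to study statistics or business analytics. To that end, she took part in research at a top US university to strengthen her academic background, broaden her horizons, and gain real knowledge. With guidance from research teachers at top US universities, she secured a research opportunity in biostatistics at Harvard University.

The text below was written by the student during her research and is offered for parents' reference. Two of her weekly research summaries, originally written in Chinese, are attached at the end.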

Summer Research Monthly Report

Z L

I am honored to have the opportunity to take part in the summer research program at Harvard Medical School, and I would like to report on my research project here. In recent years machine learning has become very popular in the field of artificial intelligence, and it is also a new tool for improving prediction accuracy. My major is closely related to machine learning, so before coming to the United States I had already decided to learn some machine learning algorithms as quickly as I could and to apply them in a medical field, although I did not yet know what my own research topic would be. Before leaving, I also thought about the difficulties the project might involve. The first is data acquisition and preprocessing: data is one of the key factors in successful machine learning. Professor N, a machine learning professor and a member of an Amazon AI team, once said that no matter how good an algorithm is, the best way to drive progress in machine learning is to obtain large amounts of data. The second is improving the algorithm. Both expectations were borne out during a month of research.

After the first meeting I understood that the project offered a great deal of freedom: choosing the topic, forming the hypotheses, finding the data, and even solving the problems along the way were all up to me, while my mentor's rich background could help me in every respect. The mentor even said I could choose a topic from the financial sector, which I took to be a little joke, ha. Because I had no medical background, I spent some time early in the project catching up on tumors and genes so that I could choose the topic well. In the end I settled on Tumor Gene Identification as my project, with the research carried out mainly on the MATLAB platform.

After settling on the research content, I analyzed the existing data. The data have few samples but very high gene dimensionality, and they are already labeled. Given these characteristics, and based on my literature study, I decided to use an SVM prediction model for training and classification. (A little more about the data: they were sent to me by another classmate. There were some problems with them at the beginning, so I added another data set. All data in the project come from the TCGA database.)

Then I met the first difficulty: data preprocessing. The quality of the data affects the later SVM classification, so I spent a lot of time on data processing, which has three steps. First, the data are normalized so that they are on the same scale, which removes differences in magnitude as far as possible. Second, extraneous genes and redundant genes are removed, so that the remaining genes are either mutated or mutating and are not duplicated. To remove extraneous genes I chose the information index to classification method, which builds on the common signal-to-noise ratio method and is a good way to account for the effect of variance on classification results. To remove redundant genes I chose the correlation-coefficient redundancy elimination method, which decides whether a gene should be eliminated from its similarity to the other genes. The final classification results show that this gene feature extraction is very effective. Third, I applied principal component analysis to the genes. After these three steps only 134 genes were left, which greatly reduces the dimension and matches the expected result.

After preprocessing, the data were randomly divided into a training set and a test set. The training set was first fed into the SVM model to determine sample type, and the accuracy was as high as 98.8889%, which is a good result. The model could therefore go on to prediction, and its classification accuracy on the test set was 99.2063%. With this the project study ended, and the classification performance was the result I had hoped for.
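The first two preprocessing steps described above can be sketched in a few lines. This is only an illustration in plain Python, not the author's MATLAB code: the plain signal-to-noise score stands in for the information index to classification method (which the report says is built on it), and the function names, the toy matrix, `top_k`, and `max_corr` are all assumptions made for the example.

```python
import math
from statistics import mean, pstdev


def zscore_columns(X):
    """Step 1: scale every gene (column) to zero mean and unit variance."""
    scaled_cols = []
    for col in zip(*X):
        mu, sd = mean(col), pstdev(col)
        scaled_cols.append([(v - mu) / sd if sd > 0 else 0.0 for v in col])
    return [list(row) for row in zip(*scaled_cols)]


def snr_score(col, y):
    """Signal-to-noise score of one gene: |mean_1 - mean_0| / (sd_1 + sd_0)."""
    pos = [v for v, label in zip(col, y) if label == 1]
    neg = [v for v, label in zip(col, y) if label == 0]
    spread = pstdev(pos) + pstdev(neg)
    return abs(mean(pos) - mean(neg)) / spread if spread > 0 else 0.0


def pearson(u, v):
    """Pearson correlation coefficient between two genes."""
    mu_u, mu_v = mean(u), mean(v)
    cov = sum((a - mu_u) * (b - mu_v) for a, b in zip(u, v))
    norm = math.sqrt(sum((a - mu_u) ** 2 for a in u) *
                     sum((b - mu_v) ** 2 for b in v))
    return cov / norm if norm > 0 else 0.0


def select_genes(X, y, top_k, max_corr):
    """Step 2: keep the top_k genes by SNR, then greedily drop any gene whose
    absolute correlation with an already kept gene exceeds max_corr."""
    cols = list(zip(*X))
    scores = [snr_score(col, y) for col in cols]
    ranked = sorted(range(len(cols)), key=lambda j: scores[j], reverse=True)[:top_k]
    kept = []
    for j in ranked:
        if all(abs(pearson(cols[j], cols[k])) <= max_corr for k in kept):
            kept.append(j)
    return kept


# Toy data (hypothetical): 6 samples x 4 genes. Gene 1 is an exact multiple
# of gene 0 (redundant); gene 2 barely differs between the two classes.
X = [
    [1.0, 2.0, 5.0, 0.2],
    [1.2, 2.4, 3.0, 0.1],
    [0.9, 1.8, 4.0, 0.3],
    [3.0, 6.0, 4.5, 0.9],
    [3.2, 6.4, 3.5, 1.1],
    [2.9, 5.8, 4.2, 1.0],
]
y = [0, 0, 0, 1, 1, 1]

kept = select_genes(zscore_columns(X), y, top_k=3, max_corr=0.99)
print("kept gene indices:", kept)
```

On this toy matrix the uninformative gene falls outside the top three, and only one of the two duplicated genes survives the correlation check. Note that the greedy check compares each candidate with every kept gene, so its cost grows roughly with the square of the number of genes, which may be one reason an elimination step like this becomes slow on real expression data.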

In fact, before settling on the SVM algorithm I also tried the lasso algorithm and a neural network. Lasso, like principal component analysis, reduces dimensionality, and both work well for feature selection and extraction. The BP neural network is a commonly used prediction algorithm, but data processed only with lasso were still noisy and high-dimensional, so training the BP neural network on them did not achieve the desired result. In the end I found that the SVM algorithm suits the characteristics of genetic data, and that doing a large amount of effective preprocessing before applying the classifier leads to better results.
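To make the lasso step concrete, here is a minimal sketch of lasso by cyclic coordinate descent in plain Python. It is not the author's implementation: the objective scaling, the function names, the toy data, and the penalty `lam` are assumptions for illustration. One point worth keeping in mind is that lasso reduces dimension by setting some coefficients exactly to zero (it selects original features), whereas principal component analysis builds new combined features.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t; values within [-t, t] become exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimize (1/2n) * ||y - Xw||^2 + lam * ||w||_1, one coordinate at a time."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # equals y - Xw while w is all zeros
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of feature j with the residual that ignores feature j.
            rho = sum(X[i][j] * (residual[i] + X[i][j] * w[j]) for i in range(n)) / n
            new_w = soft_threshold(rho, lam) / col_sq[j]
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                w[j] = new_w
    return w


# Toy data (hypothetical): the response depends only on feature 0.
X = [
    [1.0, 0.5, -0.2],
    [-1.0, 0.3, 0.4],
    [2.0, -0.6, 0.1],
    [-2.0, -0.2, -0.3],
]
y = [3.0, -3.0, 6.0, -6.0]

w = lasso_coordinate_descent(X, y, lam=0.5)
selected = [j for j, wj in enumerate(w) if wj != 0.0]
print("coefficients:", w)
print("selected features:", selected)
```

Here only feature 0 keeps a nonzero coefficient, and that coefficient is shrunk below the true value of 3 by the penalty. Features that survive this kind of selection can still be noisy, which is consistent with the report's observation that lasso alone did not give the neural network clean enough input.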

In this project, the improvement to the SVM approach was concentrated in data processing: feature selection and extraction worked well, and classifier training then gave good results too. The project still needs further study. First, although the classification results are good, the computation is very time-consuming, especially the gene elimination steps, which take up to an hour; I hope this can be sped up in the future. Second, the data used here are open data. If the model were applied in hospitals, whose databases are more realistic, complex, and large, it is unclear whether the same processing would still achieve good results. So there is still a lot of work to do on SVM-based analysis of gene expression data.

Teaching in China and the United States differs a lot, even at university level. Although I had taken part in some projects with my teachers in China, there I mostly followed the teacher's lead step by step. This project offered far more freedom, so there were many more algorithms I wanted to try; sometimes I could not find a direction, and I overturned my earlier ideas many times while looking for a new and more suitable one, which also caused some time-management problems. In addition, when talking with my mentor I sometimes felt that I did not have fully formed ideas to discuss; this is something I should change in my future studies. Boston is a very attractive city where science and technology have become a pillar industry, a direction in which many cities now want to develop. In my free time I often met interesting people and things in the library or on campus, which makes me really look forward to my future studies. Besides the gains in knowledge, my spoken English has also improved, thanks not only to my mentor but also to my host family. Finally, the project was completed with the help of my dear mentor L, L, teacher Z, and teacher L. Thank you all very much!

Appendix: the student's weekly summaries during the research:

Week 3 report

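Following on from the previous weekly report, my focus is on statistical methods, so I decided to use machine learning algorithms for classification prediction of the ALK gene in lung cancer. The first task is feature extraction, for which methods such as principal component analysis are usually used. A paper I once read on building a fiscal evaluation system used the lasso algorithm to group features effectively, so I wanted to use it to extract features better suited to analysis. After extraction, the feature set was much smaller than before. For the classification prediction that followed I chose the widely applicable neural network algorithm, but the classification output did not meet expectations; that is, prediction accuracy did not improve much. There are two possible reasons: first, the data set is too one-sided and may be too small; second, I chose the wrong algorithm. Next week I will switch to another machine learning method for the classification prediction and hope to improve the accuracy.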

Week 4 report

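Because the lasso and neural network models I applied last week did not improve prediction accuracy much, this week I switched to a new machine learning method, SVM, for the classification prediction. I had not studied this model before, so I first spent some time learning it. So far I have obtained preliminary classification results, but I am still not clear about the setting and use of some parameters, so I am asking for a little more time to obtain more accurate classification results. In experiments, failing to get the ideal classification result is a normal outcome; if several methods produce the same result, that suggests the result for this data set does fall into a single class. It is still a little early to settle on that conclusion, though, and I need a bit more time to check whether anything has gone wrong.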
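One common way to settle SVM parameter choices is to train with several candidate values and compare them on held-out data. The sketch below is an assumption-laden illustration in plain Python, not the MATLAB setup used in the project: it trains a linear soft-margin SVM by batch subgradient descent on the hinge loss, uses labels of +1 and -1, and sweeps a hypothetical grid of regularization strengths `lam` on a made-up two-dimensional data set.

```python
def decision(w, b, x):
    """Signed score w.x + b of one sample."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def train_linear_svm(X, y, lam=0.01, epochs=200, lr=0.1):
    """Minimize (lam/2)*||w||^2 + mean(max(0, 1 - y_i*(w.x_i + b)))
    by batch subgradient descent. Labels must be +1 or -1."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]
        grad_b = 0.0
        for xi, yi in zip(X, y):
            if yi * decision(w, b, xi) < 1:  # sample violates the margin
                for j in range(p):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * g for wj, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def accuracy(w, b, X, y):
    """Fraction of samples whose predicted sign matches the label."""
    preds = [1 if decision(w, b, x) >= 0 else -1 for x in X]
    return sum(p == t for p, t in zip(preds, y)) / len(y)


# Toy, well-separated data (hypothetical), split into training and validation.
X_train = [[2, 2], [3, 1], [1, 3], [2.5, 2.5],
           [-2, -2], [-3, -1], [-1, -3], [-2.5, -2.5]]
y_train = [1, 1, 1, 1, -1, -1, -1, -1]
X_val = [[2, 1], [1, 2], [-2, -1], [-1, -2]]
y_val = [1, 1, -1, -1]

grid = [0.001, 0.01, 0.1, 1.0]
results = []
for lam in grid:
    w, b = train_linear_svm(X_train, y_train, lam=lam)
    results.append((lam, accuracy(w, b, X_val, y_val)))

best_lam, best_acc = max(results, key=lambda r: r[1])
print("validation accuracy per lam:", results)
print("chosen lam:", best_lam)
```

On data this cleanly separated every candidate classifies the validation set correctly, so the sweep cannot tell them apart; on real gene expression data the validation accuracies would typically differ and guide the choice. A kernel SVM adds further parameters, such as the kernel width, that can be swept in the same way.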