Yogesh Pandit, Master’s in Bioinformatics Thesis Defense
Title: Context Specific Text Mining For Annotating Protein Interactions With Experimental Evidence
Mathew Palakal, Ph.D., Chair
Yunlong Liu, Ph.D.
Xiaowen Liu, Ph.D.
Proteins are the building blocks of a biological system. They interact with other proteins to produce distinct biological phenomena, and protein-protein interactions play a valuable role in understanding the molecular mechanisms at work in any biological system. Protein interaction databases are a rich source of protein interaction information. They gather large amounts of information from the published literature to enrich their data, and expert curators do most of this work manually. Manual annotation is time-consuming, and the amount of accessible, publicly available literature is growing so rapidly that manual curation alone cannot keep up. Tools are needed to process this large volume of data and extract the essential information that can help curators work faster. When extracting evidence of protein-protein interactions from the literature, the mere mention of a protein, as found by look-up approaches, is not enough to validate an interaction. Supporting protein interaction information with experimental evidence can help. In addition, to avoid redundant curation effort and to ease access, efforts are under way to relate manually curated information to standardized ontologies. In this study, we apply machine learning based classification techniques to classify any given protein interaction related document by interaction detection method. We use biological attributes and experimental factors, different combinations of which define a particular interaction detection method. Then, using the predicted detection methods, the proteins identified by named entity recognition techniques, and an analysis of the parts-of-speech composition, we search for sentences containing experimental evidence for a protein-protein interaction. We report an accuracy of 75.1% with an F-score of 47.6% on a dataset containing 2035 training documents and 300 test documents.
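The sentence search described above can be illustrated with a minimal sketch. It is not the implementation used in the thesis. It assumes the protein names and the method cue words are already available (in the study these would come from named entity recognition and the predicted detection method), and it leaves out the parts-of-speech analysis entirely. The function name `find_evidence_sentences`, the cue list, and the sample text are all hypothetical and made up for illustration.

```python
import re


def find_evidence_sentences(text, proteins, method_cues):
    """Return sentences mentioning at least two distinct proteins and one method cue.

    Simplified stand-in: protein names and cue words are supplied by the caller,
    and no parts-of-speech information is used.
    """
    # Naive sentence split on whitespace that follows ., ! or ?
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    hits = []
    for sentence in sentences:
        lowered = sentence.lower()
        # Whole-word, case-insensitive match for each known protein name
        found_proteins = {
            p for p in proteins
            if re.search(r"\b" + re.escape(p.lower()) + r"\b", lowered)
        }
        # Plain substring match for any experimental method cue
        has_cue = any(cue.lower() in lowered for cue in method_cues)
        if len(found_proteins) >= 2 and has_cue:
            hits.append(sentence)
    return hits


# Made-up sample text, for illustration only
text = (
    "BRCA1 binds BARD1 in vivo. "
    "We confirmed the BRCA1 and BARD1 interaction by co-immunoprecipitation. "
    "Yeast two-hybrid screens are widely used."
)
proteins = ["BRCA1", "BARD1"]
method_cues = ["co-immunoprecipitation", "immunoprecipitated"]

evidence = find_evidence_sentences(text, proteins, method_cues)
print(evidence)
```

Only the second sentence is kept, because it is the only one that names two proteins together with a method cue. A sentence that names the proteins without a method, or a method without the proteins, is not treated as experimental evidence.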