Moscow Seminar on Bioinformatics



Brief abstracts of talks

   


   

 


54.

23.1.1997

 

V.L.Surin

Scientific Center of Haematology, Russian Acad. Med. Sci.

Non-traditional variants of PCR: Application to oncology and phylogenetic studies

PCR with random primers is applied to analysis of molecular mechanisms of chromosomal translocations t(9;22) causing a number of blood oncological diseases. Biprimer RAPD-PCR systems are used to establish phylogenetic relationships between different animal species.


55.

6.2.1997

 

G.K. Frank, V.Ju. Makeev

Institute of Molecular Biology

G and T nucleotide contents show species-invariant negative correlation for all three codon positions

The nucleotide contents of the three codon positions show a number of statistical pairwise correlations, some of which are universal for all analyzed genomes. Among the most prominent are negative correlations between G and T contents, found in genes of all species analyzed. The pair A/C, which is complementary to G/T, shows a similar negative correlation in genes of most species. In the genes of several species, including all mammalian genes studied, positive correlations between A and T contents, and between G and C contents, are found. Since these regularities are observed in all three codon positions, they are connected with the amino acid content of proteins. Such correlations may originate from features of the mutation process and/or from checking of the translation reading frame. The well-known preference for G in the first codon position and its deficiency in the second is accompanied by the opposite bias in T content. In the third codon position there is no general nucleotide preference, but its content is often biased with regard to the GC content of the gene. G and T contents in this case are always shifted in opposite directions. Several ideas are proposed to explain this preference. [Frank & Makeev, 1997].
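The kind of measurement described above can be sketched as follows. This is only an illustration, assuming that each gene is given as an in-frame coding sequence over A, C, G, T of length at least 3 and that the correlation is an ordinary Pearson coefficient computed over a set of genes; the function names are hypothetical and the abstract does not specify the exact statistic used.

```python
from math import sqrt

def position_contents(gene):
    """Nucleotide fractions at codon positions 0, 1, 2 of one in-frame coding sequence."""
    usable = len(gene) - len(gene) % 3          # ignore an incomplete trailing codon
    contents = []
    for pos in range(3):
        column = gene[pos:usable:3]             # every third base, starting at this position
        contents.append({n: column.count(n) / len(column) for n in "ACGT"})
    return contents

def pearson(xs, ys):
    """Pearson correlation coefficient; NaN when one of the variables is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / sqrt(vx * vy)

def gt_correlation(genes, pos):
    """Correlation between G and T contents at one codon position across a gene set."""
    g = [position_contents(s)[pos]["G"] for s in genes]
    t = [position_contents(s)[pos]["T"] for s in genes]
    return pearson(g, t)
```

For a toy set such as `["AAGAAGAAG", "AATAATAAT", "AAGAATAAG"]`, the third-position G and T contents are perfectly anticorrelated; real data would of course need many genes per species.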


56.

15.5.1997

 

Nikita Vassetsky

Institute of Molecular Biology

Reconstruction of sequences of ex-functional genes

Eukaryotic genomes often carry sequences that once coded for genes but were later damaged by mutations. The only reliable way to reconstruct the functional sequence, namely compiling it from numerous copies, is applicable only to repeated genome elements. We propose two approaches to approximate the functional gene sequence for non-repeated defective (but not heavily damaged) genes. The first requires the sequence of a closely related functional gene. On the basis of their alignment, the effect of frameshift mutations in the defective gene is eliminated so that a putative amino acid sequence can be deduced. The other approach is based on the structural distinctions between coding and non-coding nucleotide sequences (commonly used to estimate the coding potential of a sequence). The tested sequence is optimized to be "coding-like", which eliminates the effect of frameshift mutations. Of course, these methods produce no more than a rough approximation to the real functional sequence. Still, this may suffice to gain valuable information (e.g., the resulting amino acid sequence may prove similar to a functional protein even though the original nucleotide sequences are dissimilar).


58.

5.6.1997

 

Igor Rogozin

CNR - Istituto Tecnologie Biomediche Avanzate; Institute of Cytology and Genetics, Novosibirsk

Analysis of donor splice sites in different eukaryotic organisms

We present a new approach to functional site analysis. It is based on four main assumptions: each variation of nucleotide composition makes a different contribution to the overall binding free energy of the interaction between the functional site and another molecule; nonfunctioning site-like regions (pseudosites) are absent or rare in the genome; there may be errors in the sample of sites; and nucleotides at different site positions are considered to be mutually dependent. In this algorithm, the site set is divided into subsets, each described by a certain consensus. Donor splice sites of human protein-coding genes were analyzed. Comparison with other methods of donor splice site prediction demonstrated that the consensus sequences AG/GU(A,G), G/GUnAG, /GU(A,G)AG, /GU(A,G)nGU, G/GUA give a more accurate prediction than is achieved by a weight matrix or by the consensus (A,C)AG/GU(A,G)AGU with mismatches. The probability of a type I error, E1, for the obtained consensus set was about 0.05, and the probability of a type II error, E2, was 0.15. The analysis demonstrated that the accuracy of functional site prediction can be improved by taking into account correlations between the site positions. The accuracy of prediction using the human consensus sequences was tested on sequences from different organisms. Some differences in consensus sequences for the plant Arabidopsis sp., the invertebrate Caenorhabditis sp. and the fungus Aspergillus sp. were revealed. For the yeast Saccharomyces sp. only one conserved consensus, /GUA(U,A,C)G(U,A,C), was revealed (E1=0.03, E2=0.03). Yeast can be suggested as a very interesting model for the analysis of molecular mechanisms of splicing.
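To make the consensus notation concrete, here is a minimal sketch of how a candidate donor site could be tested against the consensus set quoted above. The reading of the notation is an assumption: "/" is taken as the exon/intron boundary, "n" as any nucleotide, and "(A,G)" as a set of alternatives. The definitions of E1 (fraction of real sites rejected) and E2 (fraction of pseudosites accepted) are likewise assumptions, since the abstract does not define them, and all function names are hypothetical. The sketch does not reproduce how the consensus set itself was derived.

```python
import re

# Consensus set quoted in the abstract for human donor splice sites.
CONSENSUS_SET = ["AG/GU(A,G)", "G/GUnAG", "/GU(A,G)AG", "/GU(A,G)nGU", "G/GUA"]

def parse_consensus(text):
    """Turn 'AG/GU(A,G)' into (exon_pattern, intron_pattern), each a list of allowed-base sets."""
    def parse_side(side):
        pattern = []
        for group, single in re.findall(r"\(([^)]*)\)|([ACGUn])", side):
            if group:
                pattern.append(set(group.split(",")))   # alternatives such as (A,G)
            elif single == "n":
                pattern.append(set("ACGU"))             # any nucleotide
            else:
                pattern.append({single})
        return pattern
    exon, intron = text.split("/")
    return parse_side(exon), parse_side(intron)

PARSED = [parse_consensus(c) for c in CONSENSUS_SET]

def matches(consensus, exon_tail, intron_head):
    """True if the last exon bases and the first intron bases fit one consensus."""
    exon_pat, intron_pat = consensus
    if len(exon_tail) < len(exon_pat) or len(intron_head) < len(intron_pat):
        return False
    tail = exon_tail[len(exon_tail) - len(exon_pat):]
    return (all(b in ok for b, ok in zip(tail, exon_pat))
            and all(b in ok for b, ok in zip(intron_head, intron_pat)))

def is_donor_site(exon_tail, intron_head):
    """A candidate is accepted if it fits at least one consensus of the set."""
    return any(matches(c, exon_tail, intron_head) for c in PARSED)

def error_rates(true_sites, pseudo_sites):
    """Assumed definitions: E1 = rejected real sites, E2 = accepted pseudosites (as fractions)."""
    e1 = sum(not is_donor_site(e, i) for e, i in true_sites) / len(true_sites)
    e2 = sum(is_donor_site(e, i) for e, i in pseudo_sites) / len(pseudo_sites)
    return e1, e2
```

For example, `is_donor_site("CAG", "GUAAGU")` accepts the candidate because it fits AG/GU(A,G).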


61.

11.9.1997

 

Sh. Sunyaev

Institute of Molecular Biology

Towards a statistical model of a protein family: application to remote homologue hunting

Profile search is one of the most popular methods of sequence analysis and homologue hunting. Given a multiple alignment of a protein family, it derives the family profile, i.e., a matrix of position-dependent scores. This makes it possible to detect remote homologs invisible to pairwise alignment routines. However, currently used methods are purely empirical and do not provide any statistical model of protein family formation. We present a new semi-empirical statistical method of profile construction that outperforms existing profile methods as well as Hidden Markov Models. Further improvement of the statistical model and possible development of an appropriate alignment strategy will also be discussed. [Сюняев и др., 1999; Sunyaev et al., 1999a; Sunyaev et al., 2000c].
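As an illustration of what a profile is, the sketch below builds a plain log-odds matrix of position-dependent scores from a gap-free toy alignment and slides it along a sequence. It is the conventional empirical construction (column counts plus a pseudocount, uniform background), chosen here as an assumption for illustration; it is not the semi-empirical method presented in the talk, and the function names are hypothetical.

```python
from math import log

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"   # the 20 standard amino acids

def build_profile(alignment, pseudocount=1.0):
    """One dict of log-odds scores per alignment column (uniform background assumed)."""
    background = 1.0 / len(ALPHABET)
    profile = []
    for col in range(len(alignment[0])):
        counts = {a: pseudocount for a in ALPHABET}
        for seq in alignment:
            if seq[col] in counts:              # gap characters are simply skipped
                counts[seq[col]] += 1
        total = sum(counts.values())
        profile.append({a: log(counts[a] / total / background) for a in ALPHABET})
    return profile

def best_window(profile, sequence):
    """Best ungapped match of the profile in a sequence, as (score, start), or None."""
    width = len(profile)
    best = None
    for start in range(len(sequence) - width + 1):
        score = sum(profile[i].get(sequence[start + i], 0.0) for i in range(width))
        if best is None or score > best[0]:
            best = (score, start)
    return best
```

With the toy alignment `["ACDE", "ACDE", "ACDF"]`, scanning `"GGGACDEGGG"` places the best window at offset 3. A real profile search would also allow gaps and use a realistic background distribution.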


62.

25.9.1997

 

Igor Berezovsky

Institute of Molecular Biology

Hierarchy of regions of an amino acid sequence according to their role in the energetic properties of protein spatial structure

We represent an amino acid sequence by the energy curve of the interactions between parts of the spatial structure. Each point is set in correspondence with the Van der Waals interaction energy between the regions of the globule separated by this point. We determine positions corresponding to the minimal interaction between parts of the globule. Zero level of the energy means independence of the adjacent regions. Residues corresponding to minima on the curve are boundaries of structurally independent parts of the globule. After several iterations for different values of the minimal energy we divide the sequence into a hierarchy of structural segments.

We analyze several families of proteins, determine and compare the boundaries of domains and modules, describe the differences in the contribution of these structural units to maintaining the spatial structure, and compare the structural division with the results of sequence alignment.

[Berezovsky et al., 1997; Berezovsky et al., 1999; Berezovsky et al., 2000a; Berezovsky et al., 2000b].


63.

2.10.1997

 

Eugene V. Koonin

National Center for Biotechnology Information, NLM, NIH

A genomic perspective on protein families

To extract maximum information from the rapidly accumulating genome sequences, all conserved genes need to be classified according to their homologous relationships. By comparing proteins encoded in 7 complete genomes from 5 major phylogenetic lineages (5 bacterial, one archaeal, and one eukaryotic) and elucidating consistent patterns of sequence similarities, we delineated 720 Clusters of Orthologous Groups (COGs). Each COG consists of individual orthologous proteins or orthologous sets of paralogs from at least three lineages. Orthologs typically have the same function, allowing transfer of functional information from one member to an entire COG. This automatically yields a number of functional predictions for poorly characterized genomes. The COGs comprise a framework for functional and evolutionary genome analysis. The COG system was used to analyze three additional genomes, those of the giant symbiotic plasmid from Rhizobium sp., the pathogenic bacterium Helicobacter pylori, and the nematode Caenorhabditis elegans (~60% of the genome). This analysis included semi-automatic functional annotation of the conserved portion of each gene set, and identification of common and rare phylogenetic patterns, which differ significantly in bacteria and eukaryotes. A systematic survey of conserved families missing in H. pylori suggests major revisions of the central metabolic pathways in this bacterium.


64.

8.1.1998

 

Boris Galitsky, I.M.Gelfand, Alexander Kister

Rutgers University

Classification of immunoglobulins

Immunoglobulin (human heavy chain) sequences were analyzed in terms of patterns (keywords) of small amino acid fragments. Representing the sequences as combinations of 17 keywords for the fragments revealed that 6 main combinations describe the majority of sequences (60% exactly, 40% approximately).

An important feature of the new classification principle is that knowledge of a few keywords, or even of the residues at several key positions, allows one to determine the class affiliation of an immunoglobulin and thus to predict the residue, or residue type, at almost any position of the sequence.

The prediction algorithm, designed for the molecular language, displays the features of a semantic processor for natural language. The rule-based classification principle, developed on immunoglobulin sequences, is applicable to a wide variety of protein families.

[Galitsky et al., 1998; Galitsky et al., 1999].


65.

22.1.1998

 

V.A.Shepelev

Institute of Molecular Genetics

On the distribution of dinucleotides in nucleic acid sequences

The distribution of dinucleotides in nucleic acid sequences can be described by a set of dinucleotide frequencies as well as by relative frequencies (odds-ratios). It is well known that in general the odds-ratio deviates from unity. This leads to the concept of a genome signature, which implies that the set of odds-ratios is to some extent specific for particular genomes and taxa. The so-called empirical distribution function yields a more detailed description of the dinucleotide distribution. Theoretical distributions are derived for reference purposes (zero-order model). An alternative approach deals with the distribution of waiting times for different dinucleotides. Examples of distributions for large mammalian and human viruses are given. Special features of the distributions for higher eukaryotes are also shown.
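The odds-ratio mentioned above is commonly computed as the observed dinucleotide frequency divided by the product of the two mononucleotide frequencies. A minimal sketch under that assumption, for a single strand and a hypothetical function name (the abstract does not state whether strands are symmetrized):

```python
from collections import Counter

def dinucleotide_odds_ratios(seq):
    """Odds-ratio f(xy) / (f(x) * f(y)) for every dinucleotide whose bases occur in seq."""
    mono = Counter(seq)
    di = Counter(seq[i:i + 2] for i in range(len(seq) - 1))   # overlapping dinucleotides
    n_mono, n_di = len(seq), len(seq) - 1
    ratios = {}
    for x in "ACGT":
        for y in "ACGT":
            fx, fy = mono[x] / n_mono, mono[y] / n_mono
            if fx > 0 and fy > 0:                             # undefined if a base is absent
                ratios[x + y] = (di[x + y] / n_di) / (fx * fy)
    return ratios
```

Values near 1 correspond to the zero-order expectation; the set of 16 ratios for a genome is what the abstract calls its signature.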


66.

5.2.1998

 

V.Ju.Makeev

Institute of Molecular Biology

Probabilistic approach to segmentation of DNA sequences

The problem of DNA segmentation is considered. Each DNA sequence is represented as a set of statistically independent blocks, within each of which individual nucleotides appear with Bernoulli probabilities. The total probability of the whole sequence is calculated, and this probability is maximized to obtain the ideal segmentation. To estimate the probabilities of nucleotides within the putative blocks, a Dirichlet probabilistic approach is used. [Ramensky et al., 1999].
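One way to read this description is sketched below. It is an assumption-laden illustration, not the authors' procedure: each block is scored by the probability of its nucleotides under a Bernoulli model whose parameters are integrated out against a symmetric Dirichlet prior (alpha = 1 by default), and the product over blocks is maximized by a simple quadratic-time dynamic programme. The abstract does not say how the maximization is carried out, and the function names are hypothetical.

```python
from math import lgamma

ALPHABET = "ACGT"

def block_log_prob(counts, alpha=1.0):
    """Log probability of a block with given base counts, Dirichlet(alpha) prior integrated out."""
    k, total = len(counts), sum(counts)
    result = lgamma(k * alpha) - lgamma(total + k * alpha)
    for n in counts:
        result += lgamma(n + alpha) - lgamma(alpha)
    return result

def segment(seq, alpha=1.0):
    """Interior block boundaries maximizing the product of block probabilities."""
    n = len(seq)
    prefix = [[0] * len(ALPHABET)]              # prefix[i][a] = count of base a in seq[:i]
    for ch in seq:
        row = prefix[-1][:]
        row[ALPHABET.index(ch)] += 1
        prefix.append(row)
    best = [0.0] + [float("-inf")] * n          # best[j] = best log probability of seq[:j]
    back = [0] * (n + 1)                        # back[j] = start of the last block in seq[:j]
    for end in range(1, n + 1):
        for start in range(end):
            counts = [prefix[end][a] - prefix[start][a] for a in range(len(ALPHABET))]
            score = best[start] + block_log_prob(counts, alpha)
            if score > best[end]:
                best[end], back[end] = score, start
    boundaries = []
    pos = n
    while pos > 0:                              # walk the back-pointers from the end
        pos = back[pos]
        if pos > 0:
            boundaries.append(pos)
    return sorted(boundaries)
```

On the toy input `"A" * 20 + "G" * 20` the single boundary is found at position 20. On compositionally mixed sequences this bare criterion tends to over-segment, so a practical version would need some additional control on the number of blocks.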


67.

12.2.1998

 

Pavel Khil

Institute of Bioorganic Chemistry

Phylogenetic analysis of long terminal repeats of the HERV-K family endogenous retroviruses

Sequences of 45 long terminal repeats (LTRs) of the HERV-K family of human endogenous retroviruses, which we had earlier mapped precisely on human chromosome 19, were determined, and a nearest neighbour dendrogram was constructed. No correlation was observed between the degree of identity of the LTR pairs and their relative positions on the chromosome. Thus sequences of distantly located LTRs, even those positioned on different chromosome arms, could be highly similar to each other, whereas those of closely located LTRs could differ significantly. We conclude that the LTRs were transposed randomly across the chromosome in the course of evolution. The alignment of the LTR sequences allowed us to assign most of the LTRs to two major subfamilies. The LTRs belonging to the first subfamily (LTR-I) are characterised by higher intrasubfamily sequence divergence than those of the second subfamily (LTR-II). The two subfamilies are easily discernible due to the presence of characteristic deletions/insertions in the LTR sequences. The higher divergence of the first subfamily members suggests that their propagation started at earlier stages of evolution, probably soon after the insertion of their ancestor into the primate genome. In turn, each of the subfamilies includes several distinct branches with various degrees of intragroup divergence and with characteristic diagnostic features, suggesting that the members of the branches represent amplified copies of particular master genes that appeared in different periods of evolution. The sequences of the LTRs demonstrate a characteristic distribution of conserved and variable regions, indicating that the LTRs might have some sequence-dependent functions in the primate genome.


70.

4.6.1998

 

M.G.Sadovsky

Institute of Biophysics, Russian Acad. Sci., Siberian Branch, Krasnoyarsk

Genetic texts, vocabularies and information

Genetic sequences are considered as texts (genetic texts, GT). Each GT corresponds to a frequency vocabulary, that is, the list of all subwords of a fixed length with their counts. The fundamental problems are the reconstitution of a longer vocabulary (in particular, the text itself) given some vocabulary, and the comparison of two GTs given their vocabularies. In the first case, we demonstrate the existence of a critical length after which the GT can be reconstituted unambiguously and study the behaviour of this length for various genes and their fragments. In the second case, we consider the problem of reconstitution of an ensemble of vocabularies. We introduce the quality of reconstitution of a vocabulary of length q+1 (and in general q+s) given the vocabulary of length q. This value characterizes the informational capacity of a GT.

We classify GTs by their statistical characteristics. By definition close GTs have in some sense similar vocabularies. Using automated classification we demonstrate that functionally similar sequences are close in the above sense. We demonstrate the connection between the structure (vocabulary) and taxonomy. GTs having close vocabularies are non-randomly distributed in the set of families. It is interesting that a point in the classification space is exactly a family: there is no correlation with higher taxonomic levels.

We develop a method of sequence comparison not using alignment. Sequence vocabularies are compared via an intermediate object, namely the hybrid vocabulary which is the statistical ancestor for all sequences in a group. The statistical ancestor is the vocabulary that can be obtained from any set under comparison by adding some minimal amount of information. The set of words in the hybrid vocabulary is the union of the word sets of the compared vocabularies, whereas the word frequencies are the averages of the frequencies in these vocabularies. This provides for the minimum total entropy of the considered vocabularies relative to the hybrid vocabulary. We present the results of comparison of sequences from EMBL.
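The vocabulary and hybrid-vocabulary constructions described above can be sketched as follows. The sketch assumes linear (non-circular) texts, counts normalized to frequencies, and the Kullback-Leibler divergence as the measure of entropy relative to the hybrid; these choices and the function names are illustrative assumptions rather than details given in the abstract.

```python
from collections import Counter
from math import log

def vocabulary(text, q):
    """Frequency vocabulary: every subword of length q with its relative frequency."""
    counts = Counter(text[i:i + q] for i in range(len(text) - q + 1))
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()}

def hybrid_vocabulary(vocabs):
    """Union of the word sets, with frequencies averaged over the compared vocabularies."""
    words = set().union(*vocabs)
    return {w: sum(v.get(w, 0.0) for v in vocabs) / len(vocabs) for w in words}

def relative_entropy(vocab, reference):
    """Kullback-Leibler divergence of a vocabulary from a reference containing all its words."""
    return sum(f * log(f / reference[w]) for w, f in vocab.items() if f > 0)

def distance_via_hybrid(vocabs):
    """Total relative entropy of the vocabularies with respect to their hybrid."""
    hybrid = hybrid_vocabulary(vocabs)
    return sum(relative_entropy(v, hybrid) for v in vocabs)
```

Identical vocabularies give a distance of zero, while vocabularies with no words in common give log 2 per sequence when two sequences are compared.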


71.

2.7.1998

 

Boris Galitsky

Rutgers University

Natural language understanding and formal scenarios. Some applications.

The talk will focus on the logical aspects of natural language understanding. The issues of logical programming and peculiarities of metaprogramming technique are addressed as a basis for representation of natural language (NL) semantics.

The approach of NL understanding in the expandable problem domain is implemented, allowing real-time introduction of new facts and definitions of new concepts.

The syntactic analysis systems of Apresian/Boguslavsky (IPPI) and START of Katz (MIT) will be presented with respect to their compatibility with the semantic processor, which is based on advanced reasoning involving time, space, action, knowledge and belief. A semantic subsystem for filtering speech recognition results illustrates some mechanisms of reasoning under inconsistent conditions.


72.

6.7.1998

 

Yury Wolf

National Center for Biotechnology Information, Bethesda

Distribution of protein folds in the three superkingdoms of life

A protein fold recognition procedure was developed on the basis of iterative database search using the PSI-BLAST program. In the completely sequenced genomes, folds could be automatically identified for 20-30% of the proteins, with 5-6% more detectable by additional analysis of conserved motifs. The distribution of the most common folds is very similar in bacteria and archaea but distinct in eukaryotes, with parasitic bacteria differing from free-living ones. In all superkingdoms, the P-loop NTPases are the most abundant fold. In bacteria and archaea, the next most common folds are TIM-barrels, ferredoxin-like proteins and methyltransferases, whereas in eukaryotes the second to fourth places belong to protein kinases, β-propellers and TIM-barrels. Several statistical aspects of fold distribution are discussed.


76.

15.10.1998

 

I.A. Zakharov

Institute of General Genetics

Polymorphism and sex ratios in populations: field studies and mathematical models

The optimal sex structure in polygamous species is one in which the population contains, at maximal frequency, the females that leave the largest number of viable and fertile daughters in their progeny. Males should be fewer than females, since reducing their share increases the food and other resources available to females. The share of males, however, cannot be arbitrarily small. Its limit is set, first, by how many females a male can find and fertilize during the breeding season and, second, by the share of males at which the harmful consequences of inbreeding begin to show, when males will, with appreciable probability, fertilize daughters related to them.

In some insects, the number of males in the population is minimized through the presence of females that produce male-free progeny. The total progeny size of such females is 0.5 of that of normal females, but daughters from male-free progeny show higher survival because they are supplied with additional food resources.

The population in this case consists of m males, n1 normal females, and n2 "male-free" females. The ratio m:n1:n2 is stably reproduced across generations.


84.

13.5.1999

 

S.A.Spirin

Belozersky Institute of Physico-Chemical Biology

PSI-BLAST and its Russian analog

I will discuss two approaches used to estimate the quality of a local alignment. In the first approach the quality is defined as the sum of substitution weights. This approach is used in the most popular algorithms for local alignment and databank screening, such as the Smith-Waterman algorithm, BLAST, and FastA. In the other approach the quality is defined as the so-called power of a local alignment. Although the second approach is much less popular, it has some advantages.
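The first quality measure, the sum of substitution weights, is what the Smith-Waterman recurrence maximizes. A minimal sketch with toy weights (match +2, mismatch -1, linear gap -2, all chosen here only for illustration) is given below; the "power" measure is not defined in the abstract and is therefore not sketched.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b as a sum of substitution and gap weights."""
    prev = [0] * (len(b) + 1)                   # previous row of the score matrix
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            weight = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(0,                     # start a new local alignment here
                         prev[j - 1] + weight,  # align a[i-1] with b[j-1]
                         prev[j] + gap,         # gap in b
                         cur[j - 1] + gap)      # gap in a
            best = max(best, cur[j])
        prev = cur
    return best
```

For instance, `smith_waterman_score("TTTACGTTT", "GGGACGGGG")` scores the shared fragment ACG. Real databank screening uses a full substitution matrix and affine gap penalties.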

PSI-BLAST is a relatively new tool for databank screening. Its main idea is to screen with a so-called profile built from the results of one or more preliminary screenings. It is built on ordinary BLAST and therefore uses the sum as the quality measure.

Recently V.K.Nikolaev wrote a program based on the ideas of PSI-BLAST, but using power as quality. The first tests of the program are promising. I will explain the algorithm of this program.

[Николаев и др., 1997].


85.

22.6.1999

 

Gregory Kucherov

INRIA-Lorraine

On maximal repetitions in sequences

I will talk about maximal repetitions, called "tandem repeats" in the biological literature. In the first part I'll present some theoretical background. In particular, I'll mention important data structures (the suffix tree and the DAWG, Directed Acyclic Word Graph) and describe the main ideas behind our algorithm, which finds all so-called maximal repetitions in a sequence in time linear in the length of the sequence. In the second part, I'll talk about our implementation of this algorithm and computer experiments on DNA sequences.

The first part is joint work with Roman Kolpakov from Moscow University. The second part partly describes recent work by Mathieu Giraud (a student at ENS Lyon).
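To make the notion concrete, the sketch below enumerates maximal repetitions by brute force. It assumes the usual definition: a substring with minimal period p and length at least 2p that cannot be extended to the left or right without breaking that period. This quadratic-time enumeration is only an illustration with a hypothetical function name; it is not the linear-time algorithm of the talk, which relies on the data structures mentioned above.

```python
def maximal_repetitions(s):
    """All maximal repetitions of s as (start, end, period), end inclusive."""
    n = len(s)
    found = {}                                  # (start, end) -> smallest period seen
    for p in range(1, n // 2 + 1):              # periods are tried in increasing order
        k = 0
        while k + p < n:
            if s[k] != s[k + p]:
                k += 1
                continue
            start = k
            while k + p < n and s[k] == s[k + p]:
                k += 1
            # positions start..k-1 all satisfy s[i] == s[i+p],
            # so s[start .. k-1+p] has period p and cannot be extended
            if k - start >= p:                  # at least two full periods
                found.setdefault((start, k - 1 + p), p)
    return sorted((a, b, p) for (a, b), p in found.items())
```

Because periods are tried in increasing order and `setdefault` keeps the first one recorded, each repetition is reported with its minimal period. In "mississippi", for example, the repetition "ississi" (positions 1 to 7, period 3) is found along with the three doubled letters.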


86.

23.9.1999

 

D.A. Filimonov

Institute of Biomedical Chemistry, Russian Acad. Med. Sci.

Computer-aided estimation of the properties of chemical compounds using incomplete empirical information: mathematical foundations of biological activity prediction

Estimating the physico-chemical properties and biological activity of chemical compounds is necessary for solving many problems in biology, medicine and ecology, since the empirical data for any given compound cover only part of the huge variety of properties. The aim of our work is to develop methods for estimating the properties of compounds that use the available empirical data on their structures and properties and apply computer technologies for extracting knowledge from the available information.

Prediction of biological activity spectra in the PASS system (http://www.ibmh.msk.su/PASS/default.htm) rests on the traditional SAR/QSAR/QSPR/Molecular Modelling hypothesis:

Activity = Function (Molecular structure)

In PASS, molecular structure is described by descriptors of the molecular basis of atomic neighbourhoods (MoBAO), and more than 500 activities are represented qualitatively, as "presence/absence" of an effect, in a training set of more than 30,000 substances.

In this formulation, the task of predicting biological activity reduces to the problem of constructing a decision algorithm, which is the subject of the talk.

The optimal algorithm for predicting biological activity was chosen among various classes and variants of estimates of the likelihood that the compound under prediction exhibits an activity. The choice was made using leave-one-out and leave-two-out cross-validation and random splitting of the training set into two independent subsets, on the basis of two criteria that we developed: the maximal error of prediction (MEP) and the invariant accuracy of prediction (IAP). The main results of these studies will be presented and discussed.

Prediction results in PASS are reported as two estimates for each predicted activity. By construction these are estimates of the probabilities of errors of the first and second kind, but they can also be interpreted as probabilities of belonging to the classes of active and inactive compounds. This gives the user clear options for solving a variety of practical problems.
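The leave-one-out validation scheme mentioned in this abstract can be sketched generically as follows. Everything specific here is an assumption made for illustration: compounds are represented as sets of descriptor strings, the stand-in classifier is a nearest neighbour by Jaccard similarity, and an error of the first kind is taken to mean an active compound predicted as inactive. Neither the PASS decision algorithm nor the MEP and IAP criteria are described in the abstract, so they are not reproduced.

```python
def leave_one_out(samples, labels, fit):
    """Predict each sample with a model trained on all the others.

    fit(train_samples, train_labels) must return a function predict(sample) -> bool.
    """
    predictions = []
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        predict = fit(train_x, train_y)
        predictions.append(predict(samples[i]))
    return predictions

def error_rates(labels, predictions):
    """Assumed convention: first kind = active called inactive, second kind = inactive called active."""
    on_actives = [p for y, p in zip(labels, predictions) if y]
    on_inactives = [p for y, p in zip(labels, predictions) if not y]
    first_kind = sum(not p for p in on_actives) / len(on_actives)
    second_kind = sum(bool(p) for p in on_inactives) / len(on_inactives)
    return first_kind, second_kind

def fit_nearest_neighbour(train_x, train_y):
    """Toy stand-in classifier: copy the label of the most similar training compound."""
    def jaccard(a, b):
        return len(a & b) / len(a | b) if a | b else 0.0
    def predict(sample):
        nearest = max(range(len(train_x)), key=lambda i: jaccard(sample, train_x[i]))
        return train_y[nearest]
    return predict
```

A call such as `leave_one_out(samples, labels, fit_nearest_neighbour)` followed by `error_rates(labels, predictions)` yields the two error estimates for one activity; `samples` and `labels` are expected to be lists of equal length containing both active and inactive compounds.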


87.

28.10.1999

 

Eugene Koonin

National Center for Biotechnology Information (NLM, NIH; Bethesda, USA)

Horizontal gene transfer: evidence and role in the evolution of prokaryotes

Orthologous gene families that are conserved in diverse bacterial, archaeal and eukaryotic genomes typically show patchy phylogenetic distribution, which suggests that horizontal gene transfer and lineage-specific gene loss played a major role in evolution. Distinguishing between these two types of events with confidence is not easy. However, combined analysis of patterns of phylogenetic distribution and tree topologies suggests parsimonious scenarios that favor horizontal transfer, differential gene loss or a combination thereof for individual gene families. Horizontal gene transfer appears to involve all functional categories of prokaryotic genes, with the possible exception of some of the core components of translation and transcription, but seems particularly prominent among genes that encode DNA repair and signal transduction system components. Frequently, horizontal transfer seems to be accompanied by the elimination of the original gene responsible for the respective function. Such events can be classified into two categories: i) non-orthologous gene displacement: replacement of a gene by an unrelated or distantly related gene coding for a functionally similar protein, and ii) xenologous gene displacement: replacement of a gene by an ortholog from a phylogenetically distant lineage. I will attempt to present a rough quantitative evaluation of the amount of relatively recent horizontal gene transfer in evolution. The conclusion will be that between very distant lineages, such as, for example, archaea and bacteria, it is significant but not overwhelming. By contrast, within tight taxonomic groups, such as the Euryarchaeota, gene exchange seems to be rampant. Apparent horizontal gene transfer and lineage-specific gene loss will be exemplified by a systematic analysis of the evolution of aminoacyl-tRNA synthetases, which includes a variety of evolutionary scenarios.

In spite of the prominence of horizontal gene transfer and differential gene loss, a clear phylogenetic signal can still be extracted from comparisons of entire protein sets from completely sequenced genomes. Phylogenetic trees produced by using parameters of the distribution of similarity scores between likely orthologs to calculate evolutionary distances between genomes will be discussed.


   

 

© Seminar, 1993 - 2016