Mathematical Methods in Bioinformatics (English edition)

Intended readers: undergraduate and graduate students

The book first converts the symbolic sequences of biological molecules into numerical sequences and then uses mathematical methods to extract biological information from them. Accordingly, after a comprehensive and systematic introduction to the numerical and graphical representations of DNA and protein sequences, it applies several orthogonal transforms to gene identification and protein comparison, and uses cluster analysis to study the properties of DNA and proteins and to classify them. Microarray and lipidome data are large, readily available biological data sets; we process them with statistical methods to identify causes of a certain cancer. Biological processes are dynamic, and the book introduces a variety of algorithms that simulate them with differential-equation and difference-equation models. Because biological data are often partially missing, the book closes with a number of methods for imputing missing data.

CHAPTER 1 SOME BIOLOGICAL CONCEPTS

We outline some biological concepts that are needed in the book.

1.1 Cell

The cell is the basic functional unit of all known living organisms. It is the smallest unit of life that is classified as a living thing, and is often called the building block of life.

Organisms can be classified as unicellular, consisting of a single cell (including most bacteria), or multicellular (including plants and animals). Humans contain about 10 trillion ($10^{13}$) cells. Most plant and animal cells are between 1 and 100 μm in size and are therefore visible only under the microscope.

Living cells can be classified into two categories: prokaryotes, such as bacteria, in which the cell does not have a distinct nucleus, and eukaryotes, such as most animal cells, in which the cells have distinct nuclei. The prokaryote cell is simpler, and therefore smaller, than a eukaryote cell, lacking a nucleus and most of the other organelles of eukaryotes. There are two kinds of prokaryotes, bacteria and archaea, which share a similar structure.

All cells possess genetic material: deoxyribonucleic acid (DNA), the hereditary material of genes, and ribonucleic acid (RNA), containing the information necessary to build various proteins, the cell's primary machinery.

1.2 Genetic Material: DNA, Gene and RNA

1.2.1 DNA

DNA is the basic information macromolecule of life. It consists of a polymer of nucleotides, in which each nucleotide is composed of a standard deoxyribose sugar and phosphate group unit, connected to a nitrogenous base of one of four types: adenine, guanine, cytosine, or thymine, abbreviated here A, G, C, and T, respectively.
Because of similarities in the chemical structure of their nitrogenous bases, adenine and guanine are classified as purines, while cytosine and thymine are classified as pyrimidines. Adjacent nucleotides in a single strand of DNA are connected by a chemical bond between the sugar of one and the phosphate group of the next. The classic double-helix structure of DNA is formed when two strands of DNA form hydrogen bonds between their nitrogenous bases, resulting in the familiar "ladder" structure. Under normal conditions, these hydrogen bonds form only between particular pairs of nucleotides, referred to as base pairs: adenine pairs only with thymine, and guanine pairs only with cytosine. Two strands of DNA are complementary if the sequence of bases on each is such that they pair properly along the entire length of both strands. Figures 1.1 and 1.2 show the structure of a four-base fragment of a DNA double helix.
Figure 1.1 Four kinds of deoxynucleotides (left) and the chemical structure of a fragment of a DNA double helix (right) (PLATE I)
Figure 1.2 The structure of the DNA double helix: side view (left) and top view (right). The bases are shown in four colours (PLATE I)

1.2.2 Gene

A gene is a molecular unit of heredity of a living organism. It is a name given to some stretches of DNA and RNA that code for a polypeptide or for an RNA chain that has a function in the organism. Living beings depend on genes, as they specify all proteins and functional RNA chains. Genes hold the information to build and maintain an organism's cells and pass genetic traits to offspring, although some organelles (e.g. mitochondria) are self-replicating and are not coded by the organism's DNA. All organisms have many genes corresponding to various biological traits, some of which are immediately visible, such as eye color or number of limbs, and some of which are not, such as blood type or increased risk for specific diseases, or the thousands of basic biochemical processes that comprise life.

The sequence in which the different bases occur in a particular strand of DNA represents the genetic information encoded on that strand; such a stretch is called a gene. In molecular genetics, an open reading frame (ORF) is the part of a gene that actually encodes a protein. The transcription termination pause site is located after the ORF, beyond the translation stop codon, because if transcription were to cease before the ribosome reaches the translation stop codon, an incomplete protein would be made. Normally, inserts that interrupt the reading frame of a subsequent region after the start codon cause a frameshift mutation of the sequence and dislocate the sequences for stop codons (terminators).

These genes are the classic focus of attention of geneticists. The gene structure and expression mechanism in typical eukaryote cells are complicated (Figure 1.3).
Genes themselves are often organized into exons, the sequences that will eventually be used by the cell (the corresponding stretches are called protein-coding regions), alternating with introns, which will be excised and discarded (the corresponding stretches are called non-protein-coding regions).
Figure 1.3 The gene structure and expression mechanism in typical eukaryote cells (PLATE II)

In all organisms, the genetic information is stored in DNA sequences. There are two major steps separating a protein-coding gene from its protein: first, the DNA on which the gene resides must be transcribed from DNA to messenger RNA (mRNA); and, second, it must be translated from mRNA to protein. Figure 1.4 illustrates the two steps. RNA-coding genes must still go through the first step, but are not translated into protein. The process of producing a biologically functional molecule of either RNA or protein is called gene expression, and the resulting molecule itself is called a gene product.
Figure 1.4 Two major steps delivering information from genes to protein (PLATE II)

1.2.3 RNA

RNA is part of a group of molecules known as the nucleic acids, which are one of the four major macromolecules (along with lipids, carbohydrates and proteins) essential for all known forms of life. Like DNA, RNA is made up of a long chain of components called nucleotides. Each nucleotide consists of a nucleobase, a ribose sugar, and a phosphate group. The sequence of nucleotides allows RNA to encode genetic information. All cellular organisms use messenger RNA (mRNA) to carry the genetic information that directs the synthesis of proteins. In addition, many viruses use RNA instead of DNA as their genetic material.

RNA is transcribed with only four bases (adenine, cytosine, guanine and uracil), but these bases and attached sugars can be modified in numerous ways as the RNAs mature. Messenger RNA (mRNA) is the RNA that carries information from DNA to the ribosomes, the sites of protein synthesis (translation) in the cell. The coding sequence of the mRNA determines the amino acid sequence in the protein that is produced. Many RNAs do not code for protein, however (about 97% of the transcriptional output is non-protein-coding in eukaryotes). These so-called non-coding RNAs (ncRNA) can be encoded by their own genes (RNA genes), but can also derive from mRNA introns. The most prominent examples of non-coding RNAs are transfer RNA (tRNA) and ribosomal RNA (rRNA), both of which are involved in the process of translation (Figure 1.5 illustrates their tertiary structure). There are also non-coding RNAs involved in gene regulation, RNA processing and other roles. Certain RNAs are able to catalyse chemical reactions such as cutting and ligating other RNA molecules, and the catalysis of peptide bond formation in the ribosome; these are known as ribozymes.
Figure 1.5 Tertiary structure (cloverleaf structure) of tRNA (left) and tertiary structure of rRNA (right) (PLATE III)

During transcription, both exons and introns are transcribed, in their linear order, into RNA called precursor messenger RNA (pre-mRNA). Thereafter, a process called splicing takes place, in which the intron sequences are excised and discarded from the RNA sequence. The remaining RNA segments, the ones corresponding to the exons, are ligated to form the mature RNA strand. A large number of non-coding regions are distributed throughout genomes. For instance, the density of protein coding in the human genome is very low: only about 2 percent of the DNA sequence in the human genome encodes proteins[1.1]. The non-coding DNA is usually responsible for the complex regulation of the genome and the functioning of genes.

Like DNA, RNA is made up of a series of nucleotides, but with several important differences: RNA is single-stranded, contains the sugar ribose, and substitutes the nitrogenous base uracil (abbreviated U) for thymine. After post-transcriptional modification, which includes the removal of introns, the RNA will go on to various fates within the cell. Of particular interest is mRNA, which will be translated into protein.

1.3 Protein and Amino Acids

We know that the entire genetic information of any living organism is coded by four different nucleotides. DNA molecules serve as a backup of the complete genetic information for the whole organism. As mentioned in the last section, particular and well-defined fragments of this information, the so-called coding sequences, are translated, using complex molecular mechanisms, into other information that is contained within protein sequences and coded with 20 different amino acids. The names of the amino acids are listed in Table 1.1.
Proteins are the main conductors and work force in any living process within a cell, tissue or organism. They are composed of sequentially linked amino acids but can only express their biological function when they achieve a certain active three-dimensional (3-D) structure. Their biological function as well as their active 3-D structure is determined primarily by the amino acid sequence within the protein. We introduce how they are constructed below.

Each of these amino acids is represented by one or more sequences of three RNA nucleotides known as a codon; for example, the RNA sequence AAG encodes the amino acid lysine. The combination of four possible nucleotides in groups of three results in $4^3 = 64$ codons, meaning that most amino acids are coded by more than one codon[1.2]. An organelle known as the ribosome performs the translation of mRNA into protein. The ribosome pairs each codon in the RNA sequence with the appropriate amino acid, and then adds the amino acid onto the growing protein. The process of translation is mediated by two special types of codon: start codons signal the location on the RNA molecule where translation should begin, while stop codons signal the location where translation should terminate. Once the sequence of amino acids that make up a particular protein is assembled, the protein dissociates from the ribosome and folds into a specific three-dimensional form. The function of a protein ultimately depends on both its three-dimensional structure and its amino acid sequence. Proteins go on to perform a variety of functions in the cell, covering all aspects of cellular function from metabolism to growth to division.

In molecular biology, protein structure describes the various levels of organization of protein molecules. Figure 1.6 shows the different levels of structure.
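The codon count and the role of start and stop codons can be illustrated with a short Python sketch. It is not part of the book: the names `CODONS`, `PARTIAL_CODE` and `translate` are hypothetical, and the codon table is deliberately partial, listing only a few entries of the standard genetic code, so translation simply stops at any codon the table does not know.

```python
from itertools import product

RNA_BASES = "ACGU"

# All ordered triples of the four RNA bases: 4**3 = 64 codons.
CODONS = ["".join(triple) for triple in product(RNA_BASES, repeat=3)]

# A deliberately partial codon table (a few entries of the standard genetic
# code). This is an illustrative assumption, not a complete table.
PARTIAL_CODE = {
    "AUG": "Met",                  # also the usual start codon
    "AAA": "Lys", "AAG": "Lys",    # two codons for the same amino acid
    "UUU": "Phe", "UUC": "Phe",
    "UAA": "Stop", "UAG": "Stop", "UGA": "Stop",
}


def translate(mrna):
    """Translate from the first AUG until a stop codon or an unknown codon."""
    start = mrna.find("AUG")
    if start < 0:
        return []  # no start codon, nothing is translated
    peptide = []
    for i in range(start, len(mrna) - 2, 3):
        residue = PARTIAL_CODE.get(mrna[i:i + 3])
        if residue is None or residue == "Stop":
            break
        peptide.append(residue)
    return peptide


print(len(CODONS))
print(translate("GGAUGAAGUUUUAAGG"))
```

In the example call, reading starts at the first AUG, proceeds codon by codon through AAG and UUU, and ends at the stop codon UAA; the bases before the start codon and after the stop codon are ignored.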
Figure 1.6 Four levels of protein structure: primary structure, secondary structure, tertiary structure, quaternary structure (PLATE III)

The primary structure refers to the linear amino acid sequence of the polypeptide chain. Secondary structure refers to highly regular local sub-structures. There are two main types of secondary structure, the alpha-helix and the beta-sheet, which are defined by patterns of hydrogen bonds between the main-chain peptide groups. They have a regular geometry, being constrained to specific values of the dihedral angles ψ and φ on the Ramachandran plot. Both the alpha-helix and the beta-sheet represent a way of saturating all the hydrogen bond donors and acceptors in the peptide backbone. Some parts of the protein are ordered but do not form any regular structures. They should not be confused with random coil, an unfolded polypeptide chain lacking any fixed three-dimensional structure.

Tertiary structure refers to the three-dimensional structure of a single protein molecule. The alpha-helices and beta-sheets are folded into a compact globule. The folding is driven by non-specific hydrophobic interactions (the burial of hydrophobic residues from water), but the structure is stable only when the parts of a protein domain are locked into place by specific tertiary interactions, such as salt bridges, hydrogen bonds, the tight packing of side chains and disulfide bonds. Disulfide bonds are extremely rare in cytosolic proteins, since the cytosol is generally a reducing environment.

Quaternary structure is the three-dimensional structure of a multi-subunit protein and how the subunits fit together. The quaternary structure is stabilized by the same non-covalent interactions and disulfide bonds as the tertiary structure. Complexes of two or more polypeptides (i.e. multiple subunits) are called multimers.
Specifically, a multimer is called a dimer if it contains two subunits, a trimer if it contains three subunits, and a tetramer if it contains four subunits. The subunits are frequently related to one another by symmetry operations, such as a 2-fold axis in a dimer. Multimers made up of identical subunits are referred to with a prefix of "homo-" (e.g. a homotetramer) and those made up of different subunits are referred to with a prefix of "hetero-" (e.g. a heterotetramer, such as the two alpha and two beta chains of hemoglobin).

1.4 Chromosome

In the nucleus of each cell, the DNA molecule is packaged into thread-like structures called chromosomes. Each chromosome is made up of DNA tightly coiled many times around proteins called histones that support its structure. When the cell is not dividing, chromosomes are not visible in the cell's nucleus, not even under a microscope; in this state the material is called chromatin. However, the DNA that makes up chromosomes becomes more tightly packed during cell division and is then visible under a microscope. Most of what researchers know about chromosomes was learned by observing chromosomes during cell division.

In most cases, the nuclear material of a prokaryotic cell consists of a single chromosome that is in direct contact with the cytoplasm. Here, the undefined nuclear region in the cytoplasm is called the nucleoid.

Plants, animals, fungi, slime moulds, protozoa, and algae are all eukaryotic. Their cells are about 15 times wider than a typical prokaryote and can be as much as 1000 times greater in volume. The major difference between prokaryotes and eukaryotes is that eukaryotic cells contain membrane-bound compartments in which specific metabolic activities take place. The most important among these is the cell nucleus, a membrane-delineated compartment that houses the eukaryotic cell's DNA. This nucleus gives the eukaryote its name, which means "true nucleus". Other differences include.
Figure 1.7 shows that in the eukaryote cell, DNA is organized into chromosomes, each of which is a continuous length of double-stranded DNA that can be hundreds of millions of base pairs long. Most human cells contain 23 pairs of chromosomes, one member of each pair paternally inherited and the other maternally inherited. The two chromosomes in a pair are virtually identical, with the exception of the sex chromosomes, for which there are two types, X and Y. Nearly every cell in the body contains identical copies of the full set of 23 pairs of chromosomes.
Figure 1.7 The double helix structure of DNA in relation to a chromosome. The chromosome is X-shaped because of its replication (PLATE IV)

1.5 Omics

The English-language neologism omics informally refers to a field of study in biology ending in -omics, such as genomics, proteomics and lipidomics. The related suffix -ome is used to address the objects of study of such fields, such as the genome, proteome and lipidome respectively. The suffix -ome as used in molecular biology refers to a totality of some sort; it is an example of a "neo-suffix" formed by abstraction from various Greek terms in -ωμα, a sequence that does not form an identifiable suffix in Greek.

1.5.1 Genomics

The genome is the entirety of an organism's hereditary information. It is encoded either in DNA or, for many types of virus, in RNA. The genome includes both the genes and the non-coding sequences of the DNA/RNA. The human genome is currently thought to contain approximately 30 000 to 40 000 genes[1.3].

Genomics is a discipline in genetics concerned with the study of the genomes of organisms. The field includes efforts to determine the entire DNA sequence of organisms and fine-scale genetic mapping. The field also includes studies of intragenomic phenomena such as heterosis, epistasis, pleiotropy and other interactions between loci and alleles within the genome. In contrast, the investigation of the roles and functions of single genes is a primary focus of molecular biology or genetics and is a common topic of modern medical and biological research. Research on single genes does not fall into the definition of genomics unless the aim of this genetic, pathway, and functional information analysis is to elucidate its effect on, place in, and response to the entire genome's networks.

In other words, genomics is the study of all the genes of a cell, or tissue, at the DNA (genotype), mRNA (transcriptome), or protein (proteome) levels, because a genome is the sum total of an individual organism's genes.
1.5.2 Microarray

Recent advances in high-throughput genomic technologies, in particular the microarray technique, enable the acquisition of different types of molecular biological data, for instance DNA-sequence and mRNA-expression data, on a genomic scale. Microarrays are used as a tool for analyzing gene expression data over a broad range of biological applications, such as cancer classification and cancer prognosis, and for studying a variety of biological processes, from differential gene expression in human tumors to yeast sporulation.

1.5.3 Proteomics

Proteomics is the large-scale study of proteins, particularly their structures and functions. Proteins are vital parts of living organisms, as they are the main components of the physiological metabolic pathways of cells. The term "proteomics" was first coined in 1997 to make an analogy with genomics, the study of the genes. The word "proteome", a blend of "protein" and "genome", was coined in 1994. The proteome is the entire complement of proteins, including the modifications made to a particular set of proteins, produced by an organism or system. This will vary with time and with the distinct requirements, or stresses, that a cell or organism undergoes. While proteomics generally refers to the large-scale experimental analysis of proteins, the term is often used specifically for protein purification and mass spectrometry (MS).

1.5.4 Lipidomics

Lipidomics is the large-scale study of pathways and networks of cellular lipids in biological systems. The word "lipidome" is used to describe the complete lipid profile within a cell, tissue or organism and is a subset of the "metabolome", which also includes the three other major classes of biological molecules: proteins/amino acids, sugars and nucleic acids.
Lipidomics is a relatively recent research field that has been driven by rapid advances in technologies such as mass spectrometry, nuclear magnetic resonance (NMR) spectroscopy, fluorescence spectroscopy, dual polarisation interferometry and computational methods, coupled with the recognition of the role of lipids in many metabolic diseases such as obesity, atherosclerosis, stroke, hypertension and diabetes. This rapidly expanding field complements the huge progress made in genomics and proteomics, all of which constitute the family of systems biology.

Lipidomics research involves the identification and quantification of the thousands of cellular lipid molecular species and their interactions with other lipids, proteins, and other metabolites. Investigators in lipidomics examine the structures, functions, interactions, and dynamics of cellular lipids and the changes that occur during perturbation of the system.

Han and Gross[1.4] first defined the field of lipidomics through integrating the specific chemical properties inherent in lipid molecular species with a comprehensive mass spectrometric approach. Although lipidomics is under the umbrella of the more general field of "metabolomics", lipidomics is itself a distinct discipline due to the uniqueness and functional specificity of lipids relative to other metabolites.

In lipidomics research, a vast amount of information quantitatively describing the spatial and temporal alterations in the content and composition of different lipid molecular species is accrued after perturbation of a cell through changes in its physiological or pathological state. Information obtained from these studies facilitates mechanistic insights into changes in cellular function. Therefore, lipidomics studies play an essential role in defining the biochemical mechanisms of lipid-related disease processes through identifying alterations in cellular lipid metabolism, trafficking and homeostasis.

REFERENCES

[1.1] Gibbs W W. The unseen genome: gems among the junk. Scientific American, 2003, 289: 46-53.
[1.2] Yin C. A novel exon finding algorithm based on the 3-base periodicity analysis of genome information. Chicago: University of Illinois, 2005.
[1.3] International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature, 2001, 409: 860-921.
[1.4] Han X, Gross R W. Global analyses of cellular lipidomes directly from crude extracts of biological samples by ESI mass spectrometry: a bridge to lipidomics. J Lipid Res., 2003, 44 (6): 1071-1079.
CHAPTER 2 GRAPHICAL REPRESENTATIONS OF DNA SEQUENCE

The previous chapter introduced DNA as a polymer composed of four nucleotides, A, G, T and C. The nucleotides join end to end to form a single DNA sequence. A DNA sequence can therefore be regarded as a symbolic sequence over the four symbols A, G, T and C. Alignment, the well-known method in bioinformatics, applies this symbolic sequence to the management and analysis of DNA data. Since the 1980s, many attempts have been made to obtain graphical representations of DNA sequences for visualizing, sorting and comparing them. This chapter introduces the graphical representations of a DNA sequence.

2.1 Three-Dimension (3-D) Graphical Representation

The pioneers of graphical representation, Hamori and Ruskin[2.1, 2.2], suggested a 3-D graphical representation that maps a DNA sequence into a 3-D space curve called the H curve. They used four vectors to represent A, T, C and G in three-dimensional space, with the integer position k advancing along the z-axis.

They define the vector g(k) to be a function of k, the position number of the nucleotides of an arbitrary DNA sequence. This function is allowed to take one of four values. Suppose that a, b and c are unit positive vectors pointing in the direction of the Cartesian x, y and z axes, respectively. When the particular value of k corresponds to nucleotide A, the value of the vector function is set to $g(k) = a - c$; when k corresponds to nucleotide T, $g(k) = -b - c$; when k corresponds to nucleotide C, $g(k) = -a - c$; and when k corresponds to nucleotide G, $g(k) = b - c$. Set $H(n) = \sum_{k=1}^{n} g(k)$ and let $H_{ij}$ designate the three-dimensional curve traced out by H(n) as n increases from i to j. The authors called this curve the H curve.

The H curve is a fundamental idea for the graphical representation of the DNA sequence.
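As a rough illustration of the construction just described, the following Python sketch accumulates the step vectors into the points of an H curve. It rests on two assumptions that the text as extracted does not settle: the garbled operators in the definition of g(k) are read as minus signs (so every step moves one unit down the z-axis), and H(n) is taken to be the running sum of g(1), ..., g(n). The names `H_STEPS` and `h_curve` are hypothetical.

```python
# Step vectors g(k) written as (x, y, z) components of a, b, c.
# Assumed reading of the garbled source: A = a - c, T = -b - c,
# C = -a - c, G = b - c.
H_STEPS = {
    "A": (1, 0, -1),
    "T": (0, -1, -1),
    "C": (-1, 0, -1),
    "G": (0, 1, -1),
}


def h_curve(seq):
    """Return the points H(0), H(1), ..., H(n) for a DNA sequence."""
    points = [(0, 0, 0)]  # H(0): the origin, before any nucleotide is read
    for base in seq.upper():
        dx, dy, dz = H_STEPS[base]
        x, y, z = points[-1]
        points.append((x + dx, y + dy, z + dz))
    return points


print(h_curve("ATCG"))
```

Because the z-coordinate changes by the same amount at every step, no two points of the curve coincide, which is one way to see why two different sequences cannot trace the same curve under these assumptions.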
This representation of a DNA sequence is a one-to-one correspondence; that is, the H curve is a "fingerprint" of the DNA sequence. H curves appear to be particularly suitable for the visual analysis and comprehension of both the local and the global features of long DNA sequences. However, they are not easy to visualize without good 3-D graphics software.

2.2 2-D Graphical Representation

The 2-D graphical representation of DNA sequences was first proposed by Gates[2.3, 2.4] (Figure 2.1), and rediscovered independently by Nandy[2.5] (Figure 2.2) and Leong et al[2.6] (Figure 2.3). Their methods are based on selecting the four cardinal directions in the (x, y) coordinate system to represent the four nucleotides in DNA sequences. The methods essentially consist of plotting a point corresponding to a nucleotide by moving one unit in the positive or negative direction of the x or y coordinate axis, depending on the defined association of a nucleotide with a cardinal direction. The cumulative plot of such points produces a graph that corresponds to the sequence of nucleotides, for example, in the gene fragment under consideration. There are thus three possible independent axes systems for plotting a 2-D graph of a DNA sequence, which are shown in Figures 2.1 to 2.3.

Figure 2.1 The representations of 4 nucleotides by Gates

Figure 2.2 The representations of 4 nucleotides by Nandy

Figure 2.3 The representations of 4 nucleotides by Leong et al

These 2-D graphical representations of a DNA sequence have problems. For example, the segments T, AGT, AGTCT, AGTCAGT, AGTCAGTCT, … of a DNA sequence all have the same graphical representation in Nandy's system; such a representation is said to have degeneracy. The same feature also appears in the other two systems mentioned. It is easy to see that various points and lines corresponding to some bases in a DNA sequence overlap, and that there are circuits in the graphical representation. The relative degree of degeneracy of a graphical representation is measured by the minimum length over all DNA sequences that form a circuit in it: the smaller the minimum circuit length, the higher the degeneracy. Clearly the minimum circuit length of the three graphical representations mentioned above is 2. Therefore, these three graphical representations of a DNA sequence have the highest degeneracy.

Guo et al[2.7] suggested a novel 2-D graphical representation of a DNA strand with low degeneracy. They designed four special vectors in the Cartesian (x, y) coordinate system to represent the four nucleotides A, T, C, G in the following manner:
which are illustrated in Figure 2.4.
Figure 2.4 The representations of A, T, C and G in Guo's system

They also proved the following. Let S be a DNA sequence whose graphical representation $G_d(S)$ forms a circuit with the minimum length |S|. Suppose $f_\alpha$ is the frequency with which the nucleotide $\alpha$ appears in the sequence, for $\alpha \in \{A, T, C, G\}$.
if and only if d is even.

This conclusion indicates that for large enough d, the graphical representation $G_d(S)$ of a DNA sequence has lower degeneracy. However, as d approaches infinity this system approximates Nandy's system.

The 2-D graphical representations introduced above suffer from degeneracy, but they have the advantage of being clear at a glance and very easy to visualize. We therefore need other 2-D graphical representations that represent the DNA sequence without degeneracy.

2.3 2-D Graphical Representations Without Degeneracy

Yau et al[2.8] presented a 2-D graphical representation of DNA sequences without degeneracy. We constructed a purine-pyrimidine graph on two quadrants of the Cartesian coordinate system, with the purines in the fourth quadrant and the pyrimidines in the first quadrant. The unit vectors representing the four nucleotides A, T, G, C are as follows:
which are shown in Figure 2.5(a). For comparison, Gates' system is shown in Figure 2.5(b).

The unit vectors representing A, T, G and C are different from those of Gates' method (Figure 2.5(b)). Only two quadrants of the Cartesian coordinates are utilized. Figure 2.6(a) illustrates two DNA graphs representing the first exon of the human and mouse β-globin gene, respectively, based on the four vectors we designed. As a comparison, the same two sequences plotted using Gates' approach are shown in Figure 2.6(b); they contain many circuits.
Figure 2.6 2-D graphs of both human and mouse β-globin exon-1 DNA sequences generated by the method of Yau et al (a) or Gates (b). Both sequences were obtained from NCBI GenBank (AF527577 or gi: 22094826 for human β-globin, and J00413 or gi: 193793 for mouse β-globin)

The 2-D graphical representation we presented[2.8] resolves representational degeneracy and is mathematically proven to eliminate circuit formation. Furthermore, given the x-projection and y-projection of any point (x, y) on the 2-D graphical representation of a DNA sequence, the numbers of A, G, C and T from the beginning of the sequence to that point can be found.

In addition, we[2.9] introduce another 2-D graphical representation of DNA sequences, in the first quadrant of the coordinate plane. We prove mathematically that this representation has no degeneracy, and we give formulae for the frequencies of A, G, C and T from the starting point of the sequence to the point with given x-projection and y-projection.

To construct the new 2-D graphical representation of DNA sequences without degeneracy in one quadrant, the four nucleotides A, G, C and T are represented by the vectors $\overrightarrow{OA} = (1, 0)$, $\overrightarrow{OG} = (\sqrt{3}/2, 1/2)$, $\overrightarrow{OC} = (1/2, \sqrt{3}/2)$ and $\overrightarrow{OT} = (0, 1)$. That is, OA lies on the x-axis, the angle between OG and the x-axis is 30 degrees, the angle between OC and the x-axis is 60 degrees, and OT lies on the y-axis, as shown in Figure 2.7, with |OA| = |OG| = |OC| = |OT| = 1.
Figure 2.7 Purine and pyrimidine in first quadrant

The purines lie below the bisector OQ of the first quadrant and the pyrimidines lie above it, with OQ as the boundary line.

The graphical representation of a DNA sequence is obtained as follows. If $S_1, S_2, \ldots, S_n$ is a DNA sequence of length n, where $S_i$ belongs to {A, T, C, G}, the corresponding graphical representation is the sequence of points $P_1, P_2, \ldots, P_n$ constructed such that the vector $P_{i-1}P_i$ corresponds to $S_i$, where $P_0$ is the origin and $|P_{i-1}P_i| = 1$. If $S_i = A$, then $P_{i-1}P_i$ is parallel to the x-axis; if $S_i = T$, then $P_{i-1}P_i$ is parallel to the y-axis; if $S_i = G$, then the angle between $P_{i-1}P_i$ and the ray from $P_{i-1}$ parallel to the x-axis is 30 degrees; if $S_i = C$, then the angle between $P_{i-1}P_i$ and the ray from $P_{i-1}$ parallel to the x-axis is 60 degrees.

To obtain the numerical sequence (the coordinates) of the points $P_1, P_2, \ldots, P_n$ corresponding to the DNA sequence $S_1, S_2, \ldots, S_n$ on a computer, we introduce a two-dimensional array $P_i = (x(i), y(i))$, $i = 1, 2, \ldots, n$. If $S_i = A$, then $P_i = P_{i-1} + (1, 0)$; if $S_i = G$, then $P_i = P_{i-1} + (\sqrt{3}/2, 1/2)$; if $S_i = C$, then $P_i = P_{i-1} + (1/2, \sqrt{3}/2)$; and if $S_i = T$, then $P_i = P_{i-1} + (0, 1)$, where $i = 1, 2, \ldots, n$ and $P_{i-1} = (0, 0)$ for $i = 1$. MATLAB code was written to compute the numerical sequence (x(i), y(i)) and the graphical representation of the DNA sequence. The computational results for both the human and mouse β-globin exon-1 DNA sequences are illustrated in Figure 2.8.
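The recurrence for the points P1, P2, …, Pn is short enough to sketch directly. The book reports using MATLAB; the version below is an independent Python sketch rather than the authors' code, and the names `STEPS` and `first_quadrant_points` are hypothetical. It assumes the four unit vectors at 0, 30, 60 and 90 degrees to the x-axis for A, G, C and T described above.

```python
import math

SQRT3 = math.sqrt(3.0)

# Unit step for each nucleotide: A at 0, G at 30, C at 60 and T at 90 degrees
# to the x-axis, as described in the text.
STEPS = {
    "A": (1.0, 0.0),
    "G": (SQRT3 / 2.0, 0.5),
    "C": (0.5, SQRT3 / 2.0),
    "T": (0.0, 1.0),
}


def first_quadrant_points(seq):
    """Return [P1, ..., Pn] with P0 = (0, 0) and Pi = P(i-1) + step(Si)."""
    x, y = 0.0, 0.0  # P0, the origin
    points = []
    for base in seq.upper():
        dx, dy = STEPS[base]
        x, y = x + dx, y + dy
        points.append((x, y))
    return points


for point in first_quadrant_points("AGCT"):
    print(point)
```

Every step has non-negative components in both x and y and at least one strictly positive component, so the walk never returns to a point it has already visited; this is the intuition behind the absence of circuits, although it is not a substitute for the proof of non-degeneracy given in the text.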
Figure 2.8 2-D graphs of both human (solid circle) and mouse (open circle) β-globin exon-1 gene sequences in the first quadrant of the Cartesian coordinate system. The β-globin sequences were obtained from NCBI GenBank at AF527577 and J00413 for human and mouse, respectively

To prove that there is no circuit or degeneracy in the new 2-D graphical representation of DNA sequences, we assume that the number of nucleotides forming a circuit is n, and that $f_A$, $f_G$, $f_C$ and $f_T$ are the frequencies corresponding to the number of appearances of A, G, C and T in the circuit, respectively. Hence $f_A + f_G + f_C + f_T = n$, and since the vectors $f_A\,\overrightarrow{OA}$, $f_G\,\overrightarrow{OG}$, $f_C\,\overrightarrow{OC}$ and $f_T\,\overrightarrow{OT}$ form a circuit, the following equation holds:

$$f_A\,(1, 0) + f_G\,(\sqrt{3}/2,\ 1/2) + f_C\,(1/2,\ \sqrt{3}/2) + f_T\,(0, 1) = (0, 0).$$
To obtain $m_x$, $n_x$ and $m_y$, $n_y$ from 2x and 2y, we may iteratively compare the decimal part of 2x with the decimal parts of multiples of $\sqrt{3}$ to get $n_x$; then $m_x = 2x - \sqrt{3}\,n_x$. Similarly, we can obtain $m_y$ and $n_y$.

From the mathematical arguments, we see that the graphical representation of DNA sequences in the first quadrant has some advantages. There is no degeneracy, and given the coordinates of any point in the representation, we can know the distribution of A, G, C and T from the starting point to the point considered. Furthermore, we also get some information about the DNA sequence from its representational graph. For graphical representation sequence P1, P2, …, Pn, if point Pn is locate