The material presented here is based on the following papers:
- A. Dal Palù, A. Dovier, F. Fogolari, and E. Pontelli. CLP-based protein fragment assembly. (Draft) Theory and Practice of Logic Programming, special issue dedicated to ICLP 2010, 10(4-6): pp. 709-724, July 2010.
- Alessandro Dal Palù, Agostino Dovier, Federico Fogolari, and Enrico Pontelli. Exploring Protein Fragment Assembly Using CLP. In IJCAI11, Twenty-second International Joint Conference on Artificial Intelligence, July 16-22, 2011, pp. 2590-2595, AAAI Press, Barcelona.
Install SICStus Prolog 4.* (if it is not already present on your PC).
It is free for 30 days. If you wish to use our code for more than 30 days and your institution is not interested in buying a SICStus Prolog licence, we will be glad to send you a version precompiled for your system that is independent of commercial software (just email us).
Download and decompress the Protein.rar file. The resulting “Protein” folder should contain all the needed files. We will refer to this folder simply as the main folder.
Under Linux, from the main folder, launch:
sicstus -l tuples.pl
(or, see below for goal details: sicstus -l tuples.pl --goal "pf_*(ID,*), halt.")
Under Windows, launch SICStus Prolog, set the main folder as the Prolog working directory, and compile the file ‘tuples.pl’. Setting the working directory is important: otherwise the auxiliary files needed during the computation cannot be found.
The main folder contains the file prot-list.pl, a database of peptides/proteins, each identified by an ID that complies with the Protein Data Bank standards.
If you use one of the protein IDs stored in the file prot-list.pl, launch the goal:
where the ‘*’ are search options explained below. If the protein you wish to fold is not in the database, write its primary sequence as a Prolog list of lowercase letters (e.g. [a,a,c,t,r,s] for AACTRS) and call
Then use “new” as its ID.
Alternatively, add it to the file prot-list.pl, using the same format as the other stored proteins, and recompile the file tuples.pl.
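The lowercase-list encoding described above can be produced mechanically from a one-letter sequence. The following is a minimal sketch in Python, used here only as a helper outside the tool itself; the function name to_prolog_list is hypothetical and not part of the distributed code. It only builds the list text to paste into the Prolog goal.

```python
def to_prolog_list(sequence: str) -> str:
    """Turn a one-letter sequence such as 'AACTRS' into Prolog list text.

    Hypothetical helper: whitespace is ignored, letters are lowercased,
    and anything that is not a plain letter a-z is rejected.
    """
    letters = [ch.lower() for ch in sequence if not ch.isspace()]
    for ch in letters:
        if not ("a" <= ch <= "z"):
            raise ValueError(f"unexpected character in sequence: {ch!r}")
    return "[" + ",".join(letters) + "]"


print(to_prolog_list("AACTRS"))  # prints [a,a,c,t,r,s]
```

The sketch does not check that each letter is an amino acid code accepted by the tool; that check is left out because the accepted alphabet is not stated here.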
The first call for a protein needs extra initial time to prepare auxiliary files, which are saved in the subdirectory “temp” (these files are not erased automatically; you are responsible for deleting them). Subsequent runs on the same protein reuse these files and are faster.
The outputs are stored in the subfolder “results”. They are in PDB format. You can view them in 3D with a molecular visualization tool such as ViewerLite, Protein Explorer, or RasMol.
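Besides viewing the results, it can be handy to read the C-alpha coordinates from a result file for further analysis. The sketch below is an illustration in Python, not part of the tool. It assumes the output files follow the standard fixed-column PDB layout for ATOM records (atom name in columns 13-16, x, y, z in columns 31-54) and that C-alpha atoms are named CA; the function name ca_coordinates is hypothetical.

```python
def ca_coordinates(pdb_text: str) -> list[tuple[float, float, float]]:
    """Extract (x, y, z) of every C-alpha atom from PDB-formatted text.

    Assumes standard fixed-column ATOM records; other record types
    (HETATM, REMARK, ...) are skipped.
    """
    coords = []
    for line in pdb_text.splitlines():
        # Columns 13-16 hold the atom name (0-based slice 12:16).
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            x = float(line[30:38])  # columns 31-38
            y = float(line[38:46])  # columns 39-46
            z = float(line[46:54])  # columns 47-54
            coords.append((x, y, z))
    return coords


# Usage sketch (the file name is a placeholder, not an actual output name):
# with open("results/some_output.pdb") as handle:
#     coords = ca_coordinates(handle.read())
```

If the generated files deviate from the fixed-column layout, a whitespace-based split of each line would be needed instead.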
For your convenience we have added the subfolder ORIGPROT, where you can find the deposited structures (also in PDB format) of the proteins tested in the paper (these data can also be retrieved easily by searching for their IDs in the Protein Data Bank).
The subfolder CLUSTER_RES, instead, contains the sequences we found in our tests.
From the command line you can choose one of the following search modes. Assume, e.g., that the protein ID is 1LE0.
This is the basic search. It looks for the first admissible conformation.
Enumerates all solutions (by default only improving ones, using Branch and Bound). The solutions are saved as distinct PDB files.
The same, with a timeout of TIME seconds.
Performs the local neighboring search with a timeout of TIME seconds.
Here are some computational results:
| | |Branch and Bound (2 DAYS)| | |LNS (6 HOURS)| | |
|PROTEIN ID|N|ENERGY|TIME (min)|RMSD|ENERGY|TIME (min)|RMSD|
In addition, several options control the general behaviour of the search; they can be set at the end of the file tuples.pl. Options are set with the command new_options([option1,option2,…,optionN]). The default value (used for the tests above) is the following:
The other options are:
- use_centroids Uses a model with C-alpha + centroid. If not specified, the search uses the C-alpha (backbone only) model.
- improve_energy Every time a structure is found, the search looks for a conformation with better energy. If not specified, the search enumerates all solutions.
- contiguous_check Checks (issuing only a warning) that the distance between consecutive C-alphas is within 3.6–4.0 Å.
- max_most_frequent(MMF) Limits to MMF the number of different fragments used for a single position. Since the fragments are sorted by decreasing probability, the most frequent ones are searched first.
- max_non_optimal_choices(MNOC) Limits the number of non-optimal choices made while exploring a branch of the search tree.
- structural constraints
- use_secondary Imposes the secondary structure described in the prot_list.txt file.
- use_original_codes If the database was created with a specific protein, this option can be used to impose the original torsion angles. Only one solution is found.
The original torsion angles are used only within the specified ranges Sk..Ek. Here the positions refer to amino acids in the sequence. Note that if no initial offset is specified, the first amino acid has position 1.
The original spatial positions are fixed within the specified ranges Sk..Ek. The positions refer to amino acids in the sequence. Note that if no initial offset is specified, the first amino acid has position 1.
Mutates the original protein sequence at position Pk with amino acid AAk. The positions refer to amino acids in the sequence. Note that if no initial offset is specified, the first amino acid has position 1.
- box(PosAA, [RX,RY,RZ], [[MinX,MaxX],[MinY,MaxY],[MinZ,MaxZ]])
The amino acid at position PosAA in the sequence is constrained to lie in the box identified by the Min/Max list. The R vector specifies the position of the first C-alpha in the original protein. To be used with use_original_codes([[1,E1],…]). Note that if no initial offset is specified, the first amino acid has position 1.
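Two of the geometric conditions described in the options above, the consecutive C-alpha distance check of contiguous_check and the containment test of box, can be illustrated outside the tool. The following Python sketch is not the tool's implementation: the function names are hypothetical, bounds are assumed to be inclusive, the box limits are assumed to be absolute coordinates (the role of the R vector is not modelled), and indices are 0-based, unlike the 1-based positions used by the options.

```python
import math


def consecutive_distance_warnings(coords, low=3.6, high=4.0):
    """Return (index, distance) for each pair of consecutive points whose
    distance falls outside [low, high]; index is that of the first point."""
    warnings = []
    for i in range(len(coords) - 1):
        d = math.dist(coords[i], coords[i + 1])
        if not (low <= d <= high):
            warnings.append((i, d))
    return warnings


def in_box(point, box):
    """Check a point against [[MinX,MaxX],[MinY,MaxY],[MinZ,MaxZ]]
    (bounds assumed inclusive and absolute)."""
    return all(lo <= c <= hi for c, (lo, hi) in zip(point, box))


chain = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 5.0, 0.0)]
for index, distance in consecutive_distance_warnings(chain):
    print(f"warning: distance after point {index} is {distance:.2f}")

print(in_box((1.0, 2.0, 2.0), [[0, 2], [0, 2], [0, 2]]))
```

In this example only the second pair of points is reported, since the first pair is 3.8 apart and therefore inside the 3.6–4.0 range.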
Thanks for reading this page. If you have any questions, feel free to contact us.
Alessandro Dal Palù, Agostino Dovier, Federico Fogolari, and Enrico Pontelli.