Barleymap stand-alone application
Barleymap is a tool which allows searching the position of sequences in sequence-enriched genetic/physical maps.
Barleymap was designed with 3 main goals in mind:
Therefore, there are three basic tasks which can be carried out with barleymap, depending on the input data used:
To do this, barleymap works with the following resources:
Barleymap has 4 different groups of tools, which are further explained in following sections:
The following builds have been tested:
Other versions of the previous aligners could also work with barleymap, as long as the aligner parameters or its output format remain as in the versions above.
Either:
Configure environmental variables (e.g. PATH and PYTHONPATH in linux) as needed.
For example:
mkdir /home/$USER/apps;
cd /home/$USER/apps;
curl -O https://github.com/Cantalapiedra/barleymap/archive/barleymap-3.0.tar.gz
tar -zxf barleymap-3.0.tar.gz
rm barleymap-3.0.tar.gz
mv barleymap-3.0 barleymap
export PATH=$PATH:/home/$USER/apps/barleymap/bin/
export PYTHONPATH=$PYTHONPATH:/home/$USER/apps/barleymap/
The barleymap web interface was designed to be used with a CherryPy server, which should be installed and configured independently.
Those interested on running their own barleymap web servers should check the CherryPy documentation to setup the server with their own infrastructure. For example, when using CherryPy after an Apache server you could need to add to its configuration files a reverse proxy, like:
ProxyPass /barleymap http://127.0.0.1:$CHERRYPYPORT/barleymap/
ProxyPassReverse /barleymap http://127.0.0.1:$CHERRYPYPORT/barleymap/
Barleymap web versions includes three shell scripts (_START, _RESTART and _STOP) to make it easier to start, restart and stop the application server.
Note that besides the configuration steps detailed in the next sections, for the barleymap web application it could be required, depending on the environment, to configure the src/bmap.conf and src/server.conf files.
To configure barleymap you will need to edit the following configuration files under the barleymap/conf directory:
Note that barleymap is distributed with .sample files, which are just examples of the previous configuration files. You could use them as templates to create you own configuration files. Just remember that the ones actually used by barleymap must not have the .sample suffix.
All barleymap configuration files have one field per row. Each row has 2 space-separated columns: the field name and the field value. To a edit a field, edit its value. Field names should be left unmodified.
The first thing you will need to do is configuring the paths.conf file under the barleymap/conf directory. The content of the paths.conf file must have the next fields (shown as in paths.conf.sample file):
# App absolute path
app_path PATH_TO_BARLEYMAP_DIR
# Relative paths to auxiliary apps
genmap_path app_aux/
split_blast_path app_aux/
# Absolute paths to temporary and datasets folders
tmp_files_path PATH_TO_BARLEYMAP_DIR/tmp_files
datasets_path PATH_TO_BARLEYMAP_DIR/datasets/
annot_path PATH_TO_BARLEYMAP_DIR/datasets_annotation/
maps_path PATH_TO_BARLEYMAP_DIR/maps/
########### Aligners
# Blast
blastn_app_path PATH_TO_NCBI_BLAST/bin/blastn
blastn_dbs_path PATH_TO_BLAST_DATABASES
# GMAP
gmap_app_path PATH_TO_GMAP/bin/gmap
gmap_dbs_path PATH_TO_GMAP_DATABASES
gmapl_app_path PATH_TO_GMAP/bin/gmapl
# miniprot
miniprot_app_path PATH_TO_MINIPROT/miniprot
miniprot_dbs_path PATH_TO_MINIPROT_DATABASES
# HS-Blastn
hsblastn_app_path PATH_TO_HSBLASTN/hs-blastn-src/hs-blastn
hsblastn_dbs_path PATH_TO_HSBLASTN_DATABASES
########### Other
citation Cantalapiedra_CP,_Boudiar_R,_Casas_AM,_Igartua_E,_Contreras-Moreira_B._BARLEYMAP:_physical_and_genetic_mapping_of_nucleotide_sequences_and_annotation_of_surrounding_loci_in_barley._Mol_Breeding_(2015)_35:13_DOI_10.1007/s11032-015-0253-1
stdalone_app https://github.com/Cantalapiedra/barleymap
First, you will need to edit the app_path field to point to the absolute path in which barleymap has been installed.
The values of fields genmap_path and split_blast_path, and also those under the section Other (citation and stdalone_app) should be left unmodified.
For most of the other fields, the directories they reference will most likely be empty at the moment. Thus, you could configure them already or wait until you decide where the data will be stored.
The tmp_files_path field indicates to barleymap where it should write temporary files to. The datasets_path, annot_path and maps_path fields tell barleymap from which directories should read data corresponding to datasets, annotation and maps, respectively. To be sure that barleymap is reading those paths correctly, using absolute paths are recommended.
Regarding the section Aligners, only the fields corresponding to the aligner or aligners which will be used by the current barleymap instance will need to be edited. For each aligner to be used, barleymap needs:
The absolute path to the binary file of the aligner. You will need to change the values of blastn_app_path, gmap_app_path, gmapl_app_path, miniprot_app_path and hsblastn_app_path fields.
The absolute path to the sequence databases (genome, sequence-enriched map, or any other sequence reference). You will need to change the values of blastn_dbs_path, gmap_dbs_path, miniprot_dbs_path and hsblastn_dbs_path.
Note that although both the standalone and the web versions need their own configuration files, the actual resources (databases, datasets and maps) can be shared by both applications by configuring the previous fields to point to the same directories.
A database represents a sequence reference (a genome or similar), which can be queried with one or more alignment tools (BLASTN, GMAP or HS-BLASTN in the current version of barleymap).
To look for position of sequences through alignment, barleymap requires at least one database to work with. It takes 3 steps to add a database to barleymap:
The databases.conf.sample file shows 3 databases as examples:
#name unique_id type
SpeciesAGenome speciesA std
SpeciesBGenome speciesB std
PolyploidGenome polyploid big
Note that the aligner to be used with each database is not specified: for a single barleymap database, you could actually have BLASTN, GMAP and HS-BLASTN sequence databases.
The first database, called “SpaciesAGenome” has the ID “speciesA”, and thus its actual files (e.g. from BLASTN) and folders (e.g. from GMAP) should be prefixed “speciesA” and placed under the directory indicated for that aligner in the paths.conf file. For example, if we created a GMAP index for this “speciesA” database, and the path for GMAP databases was configured as /home/user/barleymap/databases/gmap/ in paths.conf, we should move the “speciesA” folder created by GMAP so that the database will be at /home/user/barleymap/databases/gmap/speciesA.
Once at least one database has been correctly configured, the following tools can be used:
A map stores the positional arrangement, either physical or genetical, of sequences from one or several databases.
It takes 5 steps to add a map to barleymap:
An “anchored” map (e.g. a sequence-enriched genetic map) requires files for positions, since the positions from the databases, obtained through alignment (e.g. contig_1300 position 12430), need to be translated to map positions (e.g. chr1H position 44.1 cM). - Search type: it states which type of algorithm will be performed when searching sequences in the databases associated to this map. Can be either “greedy”, “hierarchical” or “exhaustive”.
The maps.conf.sample file shows 3 maps as examples:
#name id has_cm has_bp default_pos_type map_type search_type db_list folder_name main_datasets
MapName mapID cm_false bp_true bp physical greedy db_genome mapID_folder dataset1,dataset2,dataset3
Map2 map2 cm_true bp_false cm anchored hierarchical db_anchored1 map2_dir dataset4
PhysGenetMap physgenetmap cm_true bp_true cm anchored exhaustive db_anchored1,db_anchored2,db_anchored3 physgenetmap_path dataset3,dataset5
The first map, called “MapName”, with ID “mapID”, and stored in the folder mapID_folder, is a physical map (“cm_false”, “bp_true”, “bp”, “physical”), with a single database (with ID db_genome) of sequences associated to it. Having a single database makes irrelevant the “Search type” field (which has been set to “greedy”, arbitrarily). The map has 3 datasets associated as main datasets.
The second map (“Map2”) is a genetic map (“cm_true”, “bp_false”, “cm”, “anchored”) with a single database of sequences associated to it (db_anchored1). Again, the “Search type” will be irrelevant having only one database (and thus has been configured to “hierarchical”, arbitrarily).
The third map (“PhysGenetMap”) is a genetical and physical map (“cm_true”, “bp_true”, “cm”, “anchored”), with 3 databases associated to it (db_anchored1, db_anchored2 and db_anchored3). Here, the “Search type” algorithm was set to “exhaustive”, to keep searching each query in the next database, until a map position has been found for the it.
A map-database file contains the map position of the sequences of a database. For example, a file called map2.contigs_database could have:
>Map2
#Marker chr cM multiple_positions other_alignments
contig_1011389 1 0.106232294617565 No No
contig_1029771 1 0.106232294617565 No No
contig_110298 1 0.106232294617565 No No
contig_111381 1 0.106232294617565 No No
contig_1170672 1 0.106232294617565 No No
contig_1269062 1 0.106232294617565 No No
contig_13304 1 0.106232294617565 No No
contig_13532 1 0.106232294617565 No No
Rows starting with “>” or “#” will be ignored, so that it can be used for comments, map name or header fields.
Data rows have 5 or 6 (depending whether the map has cM, bp or both types of position) tab-delimited fields:
A “chrom” file has the information about the name and size of the chromosomes of a map. For example:
chr1H 1 558535432
chr2H 2 768075024
chr3H 3 699711114
chr4H 4 647060158
chr5H 5 670030160
chr6H 6 583380513
chr7H 7 657224000
chrUn 8 249774706
Each row has 3 or 4 (depending whether the map has cM, bp or both types of position) tab-delimited fields:
Once that at least one database and one map have been correctly configured, the following tools can be used:
The bmap_find and bmap_locate could be used, but there is no interest in running them without having configured datasets previously.
Some genes or markers are searched in sequence databases, genomes or maps very often. Therefore, it is advantageous to search them once and store the result, so that this position can be retrieved in sucessive searches without repeating the alignment process. This is the main purpose of datasets.
The other use of datasets is showing which other loci (genes, markers, etc) are present in the region in which a query of interest (e.g. a marker associated to a QTL) has been found.
Therefore, a dataset contains the map positions of markers, genes or other loci, which have been previously aligned and mapped. These positions can be retrieved using just the loci identifiers.
Creating a dataset takes 4 steps:
Create a file called dataset_ID.map_ID, for each map which will be associated to such dataset, and put it in the folder created for the dataset in the previous step (e.g. barleymap/datasets/dataset_ID/dataset_ID.map_ID). Note that these files could be created using bmap_build_datasets, as explained in following sections. However, if it is desired to create the files manually, see format of the dataset-map files below for details.
The datasets.conf.sample file shows 4 datasets as examples:
#name unique_id type filename file_type db_list synonyms_file records_prefix
DATASET1 dataset1 genetic_marker dataset1.fasta fna ANY /home/user/dataset1.syns no
DATASET2 dataset2 gene dataset2.gtf gtf DB1,DB2 no DAT2_
>DATASET3 dataset3 anchored dataset3.fasta fna ANY no no
>DATASET4 dataset4 map "maps_path" map DB3 no DAT3_
The first dataset (“DATASET1”), with ID “dataset1”, is of type genetic marker (“genetic_marker”), and its source are FASTA formatted sequences (“dataset1.fasta”, “fna”). It is associated to any database (“ANY”), it has a list of synonyms (“/home/user/dataset1.syns”) and there is no prefix shared by all its loci.
In contrast, the fourth dataset (“DATASET4”) is used to create a map (map). Note that its filename is indicated as “maps_path”, and thus the data for this dataset corresponds to the same file as the one used as map. It is associated to a single database (“DB3”), it has no synonyms and all its loci have the prefix “DAT3_”.
A dataset-map file contains the map position of a set of commonly used markers, genes, etc. For example:
>Map2
#Marker chr cM multiple_positions other_alignments
S_180989 1 0.106232294617565 No No
S_35790 1 0.21246458923513 No No
i_30945 1 3.64730878186969 No No
S_165910 1 3.68361907676158 No No
S_161137 1 4.10764872521246 No No
S_193700 1 4.10764872521246 No No
S_206684 1 4.10764872521246 No No
i_66630 1 4.95750708215297 No No
Note that this file has the same format as that of files for maps, and, as explained above, a map can also be configured as a dataset.
As with files for maps, rows starting with “>” or “#” will be ignored, so that it can be used for comments, map name or header fields.
Data rows have 5 or 6 (depending whether the map has cM, bp or both types of position) tab-delimited fields:
Once that at least one database, one map and one dataset have been correctly configured, the following tools can be used:
Datasets of type “gene” can be enriched with annotation data, including a description text, a class of feature (to be defined by the admin of the app), and lists of Gene Ontologies (GO), protein families (PFAM) and InterPro (IPR) identifiers.
Creating the annotation of a dataset of genes takes 4 steps:
Create a file for each annotation field for that dataset, and put it in the folder created for the dataset annotation in the previous step (e.g. barleymap/datasets_annotation/dataset_annot_ID.desc.tab for description texts). See [format of the dataset annotation files] (https://github.com/Cantalapiedra/barleymap#format-of-the-dataset-annotation-files) below for details.
Create a row in the conf/datasets_annotation.conf file, with 5 space-separated fields.
The datasets.conf.sample file shows the annotation of a single dataset:
#name unique_id dataset_unique_id filename type
DT2_DESC dt2_desc dataset2 dt2.desc.tab txt
DT2_CLASS dt2_class dataset2 dt2.class.tab class
DT2_GO dt2_go dataset2 dt2.go.tab go
DT2_PFAM dt2_pfam dataset2 dt2.pfam.tab pfam
DT2_IPR dt2_ipr dataset2 dt2.ipr.tab ipr
All the 5 annotation fields have been annotated for dataset2. Note that you could leave some fields without annotation if desired (for example, if a dataset lacks class data).
A dataset annotation file contain the identifier of the gene and the annotation field, which will be either a text (description and class) or an identifier (GO, PFAM, IPR). For example:
Gene000010 GO:0008270
Gene000040 GO:0006412
Gene000040 GO:0003735
Gene000040 GO:0005622
Gene000040 GO:0005840
Gene000090 GO:0004672
Gene000090 GO:0005524
NOTE: in the web version of barleymap some of the options and parameters, which can be changed in the standalone version, have been fixed. Check https://barleymap.eead.csic.es/barleymap/help.
The main tools which barleymap provides, both in the standalone and in the web version, are for alignment of sequences, finding markers, genes, etc, or locate features in a given position.
The alignment of sequences can be performed from the Align sequences button in barleymap web, or using the command bmap_align in the standalone version.
In the standalone version, info and a full list of parameters can be obtained running the tool with the “-h/–help” options:
Usage: bmap_align.py [OPTIONS] [FASTA_FILE]
typical: bmap_align.py --maps=map queries.fasta
Options:
-h, --help show this help message and exit
--aligner=ALIGNER Alignment software to use (default "gmap"). The "gmap"
option means to use only GMAP. The "blastn" option
means to use only Blastn. The "hsblastn" option means
to use only HS-Blastn. The order and aligners can be
explicitly specified by separating the names by ","
(e.g.: blastn,gmap --> First Blastn, then GMAP).
--thres-id=THRES_ID Minimum identity for valid alignments. Float between
0-100 (default 98.0).
--thres-cov=THRES_COV
Minimum coverage for valid alignments. Float between
0-100 (default 95.0).
--threads=N_THREADS Number of threads to perform alignments (default 1).
--maps=MAPS_PARAM Comma delimited list of Maps to show.
-b, --best-score Will return only best score hits.
--sort=SORT_PARAM Sort results by centimorgan (cm) or basepairs (bp)
(default: defined for each map in maps configuration.
-k, --show-multiples Queries with multiple positions will be shown (are
obviated by default).
-a, --anchored Show anchored features at positions of queries.
-g, --genes Genes at positions of queries will be shown.
-m, --markers Additional markers at positions of queries will be
shown. Ignored if -g.
-d, --show-all-features
All features will be used to enrich a map. By default,
only main datasets of each map are used to enrich.
-o, --show-on-markers
Additional features will shown for each query. By
default, they are shown by interval of markers
-e EXTEND_WINDOW, --extend=EXTEND_WINDOW
Centimorgans or basepairs (depending on sort) to
extend the search of -g or -m.(default 0.0)
-u, --show-unmapped Not found (unaligned, unmapped), will be shown.
-c, --collapse Mapping results and features (markers, genes) will be
shown at the same level.
-f cM positions will be output with all decimals
(default, 2 decimals).
-v, --verbose More information printed.
The user provides one or more FASTA formatted sequences to be searched, which are the queries. The user will choose also one or more maps (–maps), from which to obtain the position of the queries.
Note that the user does not choose one or more sequence databases. In fact, when he chooses a map, he is implicitly chosing the databases associated to that map in the map configuration.
The user can choose alignment thresholds: minimum alignment identity (–thres-id) and minimum query coverage (–thres-cov). He can also choose whether to obtain also a list of unmapped sequences (-u, –show-unmapped), and whether to include or not as mapped those queries with more than one position as result (-k, –show-multiples), or instead report them as unmapped. In the case of maps which are both genetical and physical, the user can choose whether the results will be sorted by cM or by bp (–sort).
The user may also choose whether to show only the resulting map, or also information about datasets, including genes (-g, –genes), markers (-m, –markers) or anchored (-a, –anchored) features in the region. Note that these data can be shown only when they are in the same position as each query (-o, –show-on-markers), or all data which can be found between each two queries. In addition, in both cases, the user can choose to extend the search of datasets up- and down-stream of the features positions (-e, –extend). Note that if the search is not extended (by default is 0.0) no information will be shown between markers, even without the “-o, –show-on-markers” option. Also, the user can choose whether to show information about main datasets only (those associated to the map in the configuration file), or about all the datasets with information for this map (-d, –show-all-features).
There are several parameters which can be changed in the standalone version only, and which are fixed in the web version. The user can choose to obtain results only from those alignments with the best score (-b, –best-score), among those over the alignment thresholds. This is active by default in the web version. Also, the user of the standalone version can choose the number of threads (–threads) to be used during alignment. Note that these number of threads is actually given as parameter to the actual aligner (BLASTN, GMAP, HS-BLASTN, etc.) and the actual barleymap process runs in a single core. In the standalone version, the user can also change the verbosity which will be output to stderr (-v, –verbose), and also whether the cM positions will be output with full decimals (-f) or formatted with 2 decimals (by default). Finally, in the standalone version the information about datasets can be shown as additional columns in the results table, or can be shown “inline” with the map results (-c, –collapse), using the same columns from the results table.
One important parameter is the aligner (or aligners), which can be changed in both the standalone (–aligner) and in the web version. In the latter, there are some fixed options, using BLASTN only, GMAP only, or GMAP followed by BLASTN. In the standalone version, either a single or a comma-separated list of aligners can be specified, and the aligners will be used in that order (check the alignment algorithm below).
For all the databases of a given map, barleymap searches all the queries in the first database, using the first aligner. If there are queries which have not been found, it uses the next aligner in the same database. If all the aligners have been used in this database, try with the next database, starting with the first aligner. Note that for protein sequences the only supported aligner is ‘miniprot’. Repeat until all the queries have been found or there are no more databases and aligners to use. With each query for which alignment targets have been found, search the map position of those targets, and associate those positions to the corresponding queries.
In short:
Given a list of unmapped queries *U*, a list of databases *D* from a map *m*, and a list of aligners *A*:
For each *d* in *D*,
for each *a* in *A*,
use *a* to search a target *t* in *d*, for each *q* in *U*.
Associate *q* to *t*.
Update *U*.
Break if *U* is empty.
End of loop.
For each *q* not in *U*, that is, each *q* associated to *t*:
Find a position *p* from *m* for *t*.
Report *p* as the position of *q*.
Figure. Schematic representation of the standalone Barleymap “hierarchical” algorithm. For the “greedy” algorithm all the query sequences are aligned to next database, instead of just the unaligned ones. For the “exhaustive” algorithm, the “align to next database” arrow should be moved to the box “Retrieve map positions” and change the tag “unaligned” to “unmapped”. Several types of input can be queried (1): identifiers (query IDs), genomic positions or FASTA (query sequences). The alignment modes (2) support genomic, transcript and protein sequences. For nucleotides, BLASTN is used for genomic sequences and GMAP for transcripts/cDNAs; the “auto” uses both to maximize the alignment rate. For amino acid sequences, MINIPROT is used to align proteins to genomes considering splicing and frameshifts. Genomic positions enter the pipeline at the mapping module (3), where targets are associated with the queries, and after filtering (4), enriched with surrounding genes and markers. Finally, gene annotation such as protein domains might also be appended to the results.
In addition there are three versions of the algorithm: “greedy”, “hierarchical” and “exhaustive”. These versions cannot be specified as parameter of the application. Instead, have to be associated to each of the maps in their configuration.
In the “greedy” version, the list of unaligned queries is only updated for each database. Therefore, every query is searched in every database.
In the “hierarchical” version, a query is removed from the list of unaligned queries when an alignment hit has been found for the query, regardless of whether the query has map position or not.
In the “exhaustive” version, a query is removed from the list of unaligned queries when the query an alignment hit has been found for the query, and a map position has been associated to it.
Finding markers (or other datasets features, like genes, etc.) can be performed from the Find markers button in barleymap web, or using the command bmap_find in the standalone version.
In the standalone version, info and a full list of parameters can be obtained running the tool with the “-h/–help” options:
Usage: bmap_find.py [OPTIONS] [IDs_FILE]
typical: bmap_find.py --maps=map queries.ids
Options:
-h, --help show this help message and exit
--maps=MAPS_PARAM Comma delimited list of Maps to show.
--sort=SORT_PARAM Sort results by centimorgan (cm) or basepairs (bp)
(default: defined for each map in maps configuration.
-k, --show-multiples Queries with multiple positions will be shown (are
obviated by default).
-a, --anchored Show anchored features at positions of queries.
-g, --genes Genes at positions of queries will be shown. Ignored
if -a
-m, --markers Additional markers at positions of queries will be
shown. Ignored if -g or -a.
-d, --show-all-features
All features will be used to enrich a map. By default,
only main datasets of each map are used to enrich.
-o, --show-on-markers
Additional features will shown for each query. By
default, they are shown by interval of markers
-e EXTEND_WINDOW, --extend=EXTEND_WINDOW
Centimorgans or basepairs (depending on sort) to
extend the search of -g or -m.(default 0.0)
-u, --show-unmapped Not found (unaligned, unmapped), will be shown.
-c, --collapse Mapping results and features (markers, genes) will be
shown at the same level.
-f cM positions will be output with all decimals
(default, 2 decimals).
-v, --verbose More information printed.
Most of the parameters work the same as in the bmap_align command or the “Align sequences” web application. Instead of FASTA formatted sequences, the input in this tool are identifiers of features (markers names, for example). Therefore, there are some parameters not present in this tool, like the aligner choice or the alignment thresholds. Note that, because of this, the positions stored and returned as output by this tool were obtained with a specific set of parameters. This includes also the list of databases and the algorithm configured for a given map when the positions were obtained by alignment. Check www.floresta.eead.csic.es/barleymap/ for details about the datasets used in the web version.
Locating features (markers, genes, etc.) in a region can be performed from the Locate by position button in barleymap web, or using the command bmap_locate in the standalone version.
In the standalone version, info and a full list of parameters can be obtained running the tool with the “-h/–help” options:
Usage: bmap_locate.py [OPTIONS] [IDs_FILE]
typical: bmap_locate.py --maps=map queries.tab
Options:
-h, --help show this help message and exit
--maps=MAPS_PARAM Comma delimited list of Maps to show.
--sort=SORT_PARAM Sort results by centimorgan (cm) or basepairs (bp)
(default: defined for each map in maps configuration.
-k, --show-multiples Queries with multiple positions will be shown (are
obviated by default).
-a, --anchored Show anchored features at positions of queries.
-g, --genes Genes at positions of queries will be shown. Ignored
if -a
-m, --markers Additional markers at positions of queries will be
shown. Ignored if -g or -a.
-d, --show-all-features
All features will be used to enrich a map. By default,
only main datasets of each map are used to enrich.
-o, --show-on-markers
Additional features will shown for each query. By
default, they are shown by interval of markers
-e EXTEND_WINDOW, --extend=EXTEND_WINDOW
Centimorgans or basepairs (depending on sort) to
extend the search of -g or -m.(default 0.0)
-u, --show-unmapped Not found (unaligned, unmapped), will be shown.
-c, --collapse Mapping results and features (markers, genes) will be
shown at the same level.
-f cM positions will be output with all decimals
(default, 2 decimals).
-v, --verbose More information printed.
In this case each query is a map position, instead of a sequence or a marker identifier, and it has 2 tab-separated fields:
The user input is a list of such map positions.
For example, for a physical map:
chr2 711398163
chr6 376635036
chr7 4184387
chr7 227367117
As explained above, the main barleymap tools, including those in the web version, return as result map positions for the queries. The hits from the alignment step remain hidden for the user. However, in some cases alignment results are preferred. These can be obtained with the secondary tools, which are available only in the standalone version.
The bmap_align_to_db command allows obtaining the alignment results from FASTA formatted queries to one or more databases. The parameters of bmap_align_to_db can be obtained running it with “-h/–help”:
Usage: bmap_align_to_db.py [OPTIONS] [FASTA_FILE]
typical: bmap_align_to_db.py --databases=MorexGenome queries.fasta
Options:
-h, --help show this help message and exit
--databases=DATABASES_PARAM
Comma delimited list of database names to align to
(default all).
--databases-ids=DATABASES_IDS_PARAM
Comma delimited list of database IDs to align to
(default all).
--aligner=ALIGNER Alignment software to use (default "gmap"). The "gmap"
option means to use only GMAP. The "blastn" option
means to use only Blastn. The "hsblastn" option means
to use only HS-Blastn. The order and aligners can be
explicitly specified by separating the names by ","
(e.g.: blastn,gmap --> First Blastn, then GMAP).
--thres-id=THRES_ID Minimum identity for valid alignments. Float between
0-100 (default 98.0).
--thres-cov=THRES_COV
Minimum coverage for valid alignments. Float between
0-100 (default 95.0).
--threads=N_THREADS Number of threads to perform alignments (default 1).
--search=SEARCH_TYPE Whether obtain the hits from all DBs (greedy), only
best score hits (best_score), or first hit found
(hierarchical); (default greedy).
--ref-type=REF_TYPE Whether use GMAP (std) or GMAPL (big), when using
--databases-ids only.
-v, --verbose More information printed.
Most of the parameters are the same used in the bmap_align command, and have already been explained above. The –databases (or –databases-ids), –search and –ref-type parameters are the same as some of the fields used to configure a map, and have also been explained (see Configuration of maps).
The result of bmap_align_to_db is a table with 12 fields:
Note that rows starting with “>” or “#” are for database name and column headers, respectively. An example of the output file from bmap_align_to_db:
>DB:150831_barley_pseudomolecules
#query_id subject_id identity query_coverage score strand qstart qend sstart send database algorithm
m_10235 chr1 100.00 100.00 120.0 - 1 121 80265474 80265594 genome2.0 gmap
m_19 chr7 100.00 100.00 196.0 - 1 197 17091119 17091315 genome2.0 gmap
m_10 chr1 100.00 100.00 210.0 + 1 211 71173958 71174168 genome2.0 gmap
m_1100 chr5 100.00 100.00 178.0 + 1 179 434303690 434303868 genome2.0 gmap
m_1012 chr2 100.00 100.00 226.0 + 1 227 29125654 29125880 genome2.0 gmap
The bmap_align_to_map command allows obtaining the alignment results from FASTA formatted queries to one or more maps, i.e. to the databases which have been configured in barleymap for those maps (see configuration of maps). The parameters of bmap_align_to_map can be obtained running it with “-h/–help”:
Usage: bmap_align_to_map.py [OPTIONS] [FASTA_FILE]
typical: bmap_align_to_map.py --maps=MorexGenome queries.fasta
Options:
-h, --help show this help message and exit
--maps=MAPS_PARAM Comma delimited list of Maps to show (default all).
--aligner=ALIGNER Alignment software to use (default "gmap"). The "gmap"
option means to use only GMAP. The "blastn" option
means to use only Blastn. The "hsblastn" option means
to use only HS-Blastn. The order and aligners can be
explicitly specified by separating the names by ","
(e.g.: blastn,gmap --> First Blastn, then GMAP).
--thres-id=THRES_ID Minimum identity for valid alignments. Float between
0-100 (default 98.0).
--thres-cov=THRES_COV
Minimum coverage for valid alignments. Float between
0-100 (default 95.0).
--threads=N_THREADS Number of threads to perform alignments (default 1).
--search=SEARCH_TYPE Whether obtain the hits from all DBs (greedy), only
best score hits (best_score), or first hit found
(hierarchical); (default greedy).
-v, --verbose More information printed.
Most of the parameters are the same used in the bmap_align command, and have already been explained above. The –search parameter would override the algorithm configured as default for the map.
The result of bmap_align_to_map is a table with the same format as that of bmap_align_to_db (see above).
Note that rows starting with “>” correspond to maps names. An example of the output file from bmap_align_to_map:
>>Map:morex_genome
#query_id subject_id identity query_coverage score strand qstart qend sstart send database algorithm
i_11_20855 chr1H 99.20 100.00 238.08 + 1 241 81133442 81133992 150831_barley_pseudomolecules gmap
i_12_10235 chr1H 100.00 100.00 120.0 - 1 121 80265474 80265594 150831_barley_pseudomolecules gmap
i_BK_19 chr7H 100.00 100.00 196.0 - 1 197 17091119 17091315 150831_barley_pseudomolecules gmap
i_BK_10 chr1H 100.00 100.00 210.0 + 1 211 71173958 71174168 150831_barley_pseudomolecules gmap
i_11_10030 chr1H 98.30 100.00 235.92 + 1 241 13590814 13591054 150831_barley_pseudomolecules gmap
These are several tools which are included with barleymap, but do not perform searches in the barleymap configured resources. These are intended to help in the configuration of the application, specially when dealing with datasets. Note that these tools use the barleymap main tools from the standalone version, and therefore are included only in the standalone version. These tools could be used with the standalone version to check or create resources to be used with barleymap web version though.
The bmap_config command returns a report including all the resources which have been configured with the app, including:
The bmap_build_datasets is a script which uses the datasets.conf file as input, and tries to generate the files for those datasets (markers, genes, maps, etc.), to be included as resources in barleymap.
Lets say that the user has a number of files from markers of genes which he wants to use as datasets in barleymap. The user must configure the datasets.conf file, in which the files with the data for those datasets will be referenced. Then, the user could need to generate the files with the actual positions of those datasets to the maps, so that they can be retrieved by barleymap. Basically, we would need to perform alignments with FASTA sequences, reformat gtf or bed files, etc. In turn, the user can use the bmap_build_datasets, which will run bmap_align for datasets of type FASTA, and will reformat GTF and BED files, etc. Then the user just will need to put those files under the appropiate directory where barleymap has been configured to read datasets.
When there are datasets which the user wants to keep in the configuration file, but must not be processed by bmap_build_datasets, the user can “comment” those datasets in the datasets.conf file with the prefix “>”. Note that if he uses “#” instead of “>” the dataset will be fully commented, and will not be used at all by barleymap.
The parameters of bmap_build_datasets are:
Usage: bmap_build_datasets
Options:
-h, --help show this help message and exit
--threads=N_THREADS Number of threads to perform alignments (default 1).
--dataset=DATASET_PARAM
A single dataset to process. By default all datasets
are processed.
-v, --verbose More information printed.
The –threads parameter will be used as input for datasets of type FASTA (to perform the alignments).
Also the user can run the bmap_build_datasets for a single dataset (–dataset parameter), for example when a new dataset is to be added to a barleymap application for which the other datasets had already been created previously.
The bmap_datasets_index is used to create an index file for the datasets. Currently, the script can be run for only a single dataset. For example, to process the datasets.conf file the user would need to loop over the file externally and run bmap_datasets_index for each of the iteration steps with:
bmap_datasets_index current_dataset
Note that the current_dataset is the file which barleymap uses to read the dataset (i.e. the one which could be generated with bmap_build_datasets). As a result, bmap_datasets_index returns a file called “current_dataset.idx”, which should be placed in the same directory where the current_dataset file is located and read by barleymap.
Note that for large dataset files, using index files will make the retrieval of markers, genes, etc. faster, whereas for small dataset files is likely better to not use index files.
README is part of Barleymap. Copyright (C) 2013-2014 Carlos P Cantalapiedra. Copyright (C) 2024 Bruno Contreras Moreira and Najla Ksouri (terms of use can be found within the distributed LICENSE file).
Cantalapiedra, C.P., Boudiar, R., Casas, A.M. et al. BARLEYMAP: physical and genetic mapping of nucleotide sequences and annotation of surrounding loci in barley. Mol Breeding 35, 13 (2015). https://doi.org/10.1007/s11032-015-0253-1
Ksouri N, Moreno MA, Contreras-Moreira B, Gogorcena Y (2024) Mapping the genomic landscape of peach and almond with PrunusMap. https://www.biorxiv.org/content/10.1101/2024.03.26.586732v1