LukProt - an animal evolution-centric eukaryotic protein database

Bibliographic Details
Main Author: Sobala, Łukasz F.
Format: Data Set
Language: English
Published: Zenodo 24-09-2024
Description
Summary: LukProt is the EukProt database with additional species added, mostly from undersampled animal taxa and some holozoan taxa. Please report any problems or suggestions to Lukasz Sobala: lukasz.sobala (at) hirszfeld.pl.

The database is composed of sequences translated from annotated genomes, transcriptomes or ESTs. Its main purpose is to be a convenient resource for looking up proteins of interest and to provide a single place where sequences from undersampled animal taxa can be found. The current version of the database (v1.5.1) is based on EukProt v3. The home of all public versions of LukProt is this page (Zenodo).

Proteomes that are novel in LukProt are denoted LPXXXXX, and those coming from AniProtDB are denoted APXXXXX. The sequence IDs from EukProt are conserved in LukProt. Each sequence is therefore assigned an ID in the following format: (A/E/L)PXXXXX_Species_epithet_(strain)_PYYYYYY, where XXXXX is a number from 00001 to 99999 and YYYYYY is a number from 000001 to 999999 that is unique to each sequence within a taxon. All IDs are compatible with the BLAST v5 "-parse_seqids" option, and the database can be readily deployed, for example on a server running SequenceServer. Within each of the source FASTA files, the source sequence identifier was kept after a blank space, so that it can still be retrieved if needed. A public BLAST server providing LukProt search is available at: https://lukprot.hirszfeld.pl/.
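As an illustration of the ID format described above, the following shell sketch splits an identifier into its parts. The function name parse_lukprot_id and the example ID used below are made up for illustration and are not taken from the database. The A and L prefixes are mapped to AniProtDB and LukProt as stated in the description; reading the E prefix as EukProt is an assumption.

```shell
# Sketch: split a LukProt-style sequence ID into proteome ID, source,
# taxon name and per-taxon sequence number (tab-separated output).
# Assumes the layout (A/E/L)PXXXXX_Species_epithet_(strain)_PYYYYYY.
parse_lukprot_id() {
    id=$1
    # Reject anything that does not match the documented layout.
    printf '%s\n' "$id" | grep -Eq '^[AEL]P[0-9]{5}_.+_P[0-9]{6}$' || return 1

    proteome=${id%%_*}      # text before the first underscore, e.g. LP00001
    seqnum=${id##*_P}       # digits after the last "_P"
    rest=${id#*_}           # drop the proteome ID and its underscore
    taxon=${rest%_P*}       # drop the trailing "_PYYYYYY"

    case $proteome in
        AP*) source_db=AniProtDB ;;
        EP*) source_db=EukProt ;;   # assumption: E marks EukProt proteomes
        LP*) source_db=LukProt ;;
    esac

    printf '%s\t%s\t%s\t%s\n' "$proteome" "$source_db" "$taxon" "$seqnum"
}
```

Calling parse_lukprot_id LP00001_Genus_species_P000001 (a hypothetical ID) prints the four fields separated by tabs; an identifier that does not match the layout makes the function return a non-zero status without printing anything.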
Comparison of EukProt v2/v3, LukProt v1.4.1 and LukProt v1.5.1 in their main areas of difference:

Taxogroup                    EukProt v2  EukProt v3  LukProt v1.4.1  LukProt v1.5.1
Holozoa (excluding Metazoa)  31          40          39              43
Ctenophora                   2           2           35              38
Porifera                     4           5           30              47
Placozoa                     2           2           3               6
Cnidaria                     3           5           65              88
Bilateria                    51          51          94              142

Included with the database are:

Ready-to-use main database files:
- LukProt_v1.5.1_single_species_FASTA.7z: FASTA files with the sequences, 7-zipped, uncompressed size: 17.6 GB. To concatenate all of them into one file, run this in the parent directory:
  for file in $(find . -type f -name "*.fasta"); do awk 'FNR==1{print ""}1' "$file" >> LukProt_v1.5.1.fa; done
  This creates a single FASTA file with all the sequences in the parent directory. awk is used to insert a line break between files because cat would sometimes merge the last sequence of one file with the header of the first sequence of the next.
- LukProt_v1.5.1_full_BLAST_db.7z: a preformatted, full BLAST database (NCBI BLAST database format version: v5, masked with segmasker), uncompressed size: 28.3 GB
- LukProt_v1.5.1_taxogroup_BLAST_db.7z: a collection of BLAST databases where each taxogroup is one BLAST database and is placed within the eukaryotic tree of life directory structure, uncompressed size: 26.3 GB
- LukProt_v1.5.1_single_species_BLAST_db.7z: a collection of BLAST databases where each proteome is one BLAST database and is placed within the eukaryotic tree of life directory structure, uncompressed size: 26.4 GB

Auxiliary database files:
- LukProt_v1.5.1.cdhit70.7z: the full database clustered at 70% identity using CD-HIT with the following command: cd-hit -g 1 -d 0 -T 20 -M 90000 -c 0.7 -uL 0.2 -uS 0.9 -s 0.2. Uncompressed sizes: FASTA file 11 GB, clstr file 2.5 GB
- LukProt_IDs_mapped.txt.gz: a text file mapping the LukProt IDs to the AniProtDB IDs and to the EukProt IDs that are different
- BUSCO_tables.ods: a spreadsheet with full result tables generated by BUSCO analysis
- OMAmer_output.zip: a folder with full results of OMAmer analyses (includes per-sequence taxonomy classification)
- OMArk_output.zip: a folder with the results of all OMArk analyses

Metadata:
- README.md: a README file describing the metadata
- LukProt_metadata_sheet.ods: main metadata file. A spreadsheet with information about each proteome (in the open .ods format, most compatible with LibreOffice)
- LukProt_metadata_other.zip: an archive with other metadata files, documented in the README. Contents include:
  - the LukProt taxonomy in various formats
  - supporting scripts for data manipulation and visualization
  - a recoloring script (modified by LFS, originally by Dr. Celine Petitjean). The script is in the public domain and is reuploaded here only for convenience.
  - other files, see README
- changelog.md: database changelog

Words of caution:
- The database has been synchronized to EukProt v3 in version v1.5.1. This means that identifiers were modified in comparison to LukProt v1.4.1. The convention is not expected to change in future updates.
- Many proteomes, especially transcriptome-based ones, may contain contamination from other species. In addition, the translation algorithms often introduce errors (e.g. the transcript may not represent a full-length protein). For this reason, to get accurate sequences from each organism, users are directed to the source data and to the included OMAmer, OMArk and BUSCO data for details.
- The taxonomy differs from UniEuk/EukMap, but UniEuk data were integrated where possible. A few NCBI taxids are missing.
- A number of proteomes listed in some of the metadata are unpublished and were held back.
- While the database contains metadata that present a particular phylogeny of animals, holozoans and other eukaryotes, no particular claims or hypotheses are made by the author(s). However, efforts will be made in the future to name clades officially, once they are more firmly established.

Acknowledgements:
- Andrew E. Allen Lab for creating the original PhyloDB.
- Daniel Richter et al. for creating EukProt and keeping it updated.
- Members of the Multicellgenome Lab, especially Michelle Leger (for donating her database), for the bioinformatics support and for doing great science.
- All the authors of the original data.
- National Science Centre of Poland for funding of the project 2020/36/C/NZ8/00081, "The role of glycosylation in the emergence of animal multicellularity", which enabled the creation of this database.
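The concatenation step described for the single-species FASTA archive can also be sketched as a small shell function. This is an alternative under stated assumptions, not the author's command: the function name concat_fasta and its two arguments are hypothetical, and the output file must not itself be a *.fasta file inside the searched directory. It relies on awk ending every output line with a newline, so files lacking a final newline are still separated correctly and no blank lines are added.

```shell
# Sketch: merge every *.fasta file under a directory into one FASTA file.
# Usage: concat_fasta SOURCE_DIR OUTPUT_FILE
concat_fasta() {
    src_dir=$1
    out_file=$2
    # awk '1' prints each input line followed by a newline, even when the
    # last line of a file has no trailing newline, so a header can never
    # be glued to the previous sequence. "-exec ... {} +" passes file
    # names directly to awk, which keeps paths with spaces intact.
    # Files are merged in whatever order find reports them.
    find "$src_dir" -type f -name '*.fasta' -exec awk '1' {} + > "$out_file"
}

# Example (hypothetical paths):
#   concat_fasta ./LukProt_v1.5.1_single_species_FASTA ./LukProt_v1.5.1.fa
```

Because the output is opened with ">", running the function again overwrites the merged file instead of appending to it, which avoids accidental duplication of sequences.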
Bibliography:
RelationTypeNote: Cites -- 10.5282/UBM/DATA.202
RelationTypeNote: Cites -- 10.5061/DRYAD.JDFN2Z3CV
RelationTypeNote: IsReferencedBy -- 10.24072/pci.genomics.100368
RelationTypeNote: Cites -- 10.5061/DRYAD.50DC6
RelationTypeNote: Cites -- 10.6084/m9.figshare.6124802.v1
RelationTypeNote: Cites -- 10.5524/100483
RelationTypeNote: Cites -- 10.7910/DVN/25071
RelationTypeNote: HasPart -- 10.6084/m9.figshare.12417881.v3
RelationTypeNote: HasVersion -- 10.5281/zenodo.7089121
RelationTypeNote: Cites -- 10.7939/R3794177K
RelationTypeNote: Cites -- 10.6084/m9.figshare.7108433.v1
RelationTypeNote: Cites -- 10.7910/DVN/INLEPM
RelationTypeNote: Cites -- 10.6084/m9.figshare.5848068.v1
RelationTypeNote: HasVersion -- 10.5281/zenodo.13765164
RelationTypeNote: IsDescribedBy -- 10.1101/2024.01.30.577650
RelationTypeNote: Cites -- 10.7910/DVN/24737
RelationTypeNote: Cites -- 10.5281/zenodo.10654583
RelationTypeNote: Cites -- 10.6084/m9.figshare.6233573.v1
RelationTypeNote: IsDescribedBy -- 10.1093/gbe/evae231
RelationTypeNote: Cites -- 10.5061/DRYAD.DNCJSXM47
RelationTypeNote: Cites -- 10.6084/m9.figshare.5426494.v1
RelationTypeNote: Cites -- 10.6084/m9.figshare.6819812
RelationTypeNote: Cites -- 10.5061/DRYAD.TN0F3
RelationTypeNote: Cites -- 10.5061/DRYAD.R2N70
RelationTypeNote: Cites -- 10.6084/m9.figshare.6125030.v1
RelationTypeNote: HasVersion -- 10.5281/zenodo.10522407
RelationTypeNote: Cites -- 10.5061/DRYAD.7717Q
RelationTypeNote: Cites -- 10.7939/R3S000
https://research.nhgri.nih.gov/aniprotdb/
RelationTypeNote: Cites -- 10.6084/m9.figshare.10001870.v3
RelationTypeNote: HasVersion -- 10.5281/zenodo.13829058
RelationTypeNote: Cites -- 10.5061/DRYAD.6CM1166
RelationTypeNote: IsCitedBy -- 10.1093/glycob/cwad041
RelationTypeNote: Cites -- 10.6084/m9.figshare.22126232.v1
RelationTypeNote: Cites -- 10.6084/m9.figshare.20497143.v1
RelationTypeNote: Cites -- 10.6084/m9.figshare.8299529.v2
RelationTypeNote: Cites -- 10.6084/M9.FIGSHARE.6771635.V2
RelationTypeNote: Cites -- 10.6084/M9.FIGSHARE.1334306.V3
RelationTypeNote: HasVersion -- 10.5281/zenodo.11321046
RelationTypeNote: Cites -- 10.7939/R30R9M73W
ISSN: 1759-6653
DOI: 10.5281/zenodo.7089120