Considerations for experimental use of the CP086569.1 sequence as a reference for mapping

Important! The original CP086569.1 from https://www.ncbi.nlm.nih.gov/nuccore/CP086569.1 has a couple of blank lines in the FASTA sequence, which will throw off samtools indexing. A corrected version of the CP086569.1 sequence is available here:
https://ybrowse.org/gbrowse2/gff/CP086569.1/CP086569.fasta

For WGS mapping we have created a simplified hg38 reference with chrY replaced by CP086569.1:
https://ybrowse.org/gbrowse2/gff/CP086569.1/hg38_CP086569.fasta

Don't forget to index the new reference genome (shown here for hg38_CP086569.fasta; use CP086569.fasta instead if you map to the Y sequence alone):

    bwa index hg38_CP086569.fasta
    samtools faidx hg38_CP086569.fasta

In general, BWA MEM mapping requires FastQ files. They can be extracted from the BAM file:

    #!/bin/bash
    NUM_THREADS=$(getconf _NPROCESSORS_ONLN)
    # samtools fastq expects name-collated input, so collate the
    # coordinate-sorted BAM first and stream it into samtools fastq
    samtools collate -u -O SAMPLE_sorted.bam | \
      samtools fastq -@ "$NUM_THREADS" \
        -1 SAMPLE_R1.fastq.gz \
        -2 SAMPLE_R2.fastq.gz \
        -s SAMPLE_unpaired.fastq.gz \
        -0 SAMPLE_weird.fastq.gz \
        -n -

BigY or YElite datasets may be mapped to the CP086569.fasta sequence directly, but for WGS we recommend using hg38_CP086569.fasta, which contains the other chromosomes from hg38 as a decoy:

    #!/bin/bash
    # YSEQID must be set to your sample ID before running this script
    NUM_THREADS=$(getconf _NPROCESSORS_ONLN)
    READS_1="fastq/${YSEQID}_R1.fastq.gz"
    READS_2="fastq/${YSEQID}_R2.fastq.gz"
    REF="hg38_CP086569.fasta"
    # -T takes the reference FASTA (-t would expect a name/length table such as the .fai)
    bwa mem -M -t "$NUM_THREADS" "$REF" "$READS_1" "$READS_2" | \
      samtools view -@ "$NUM_THREADS" -b -T "$REF" -o SAMPLE.bam -
    samtools sort -@ "$NUM_THREADS" -T /tmp/sorted -o SAMPLE_sorted.bam SAMPLE.bam
    samtools index -@ "$NUM_THREADS" SAMPLE_sorted.bam

Approximate positions for important regions on the CP086569.1 reference sequence:

PAR1:  CP086569.1:1..2500000
CEN:   CP086569.1:10000000..11700000
DYZ19: CP086569.1:20973597..21208904
PAR2:  CP086569.1:62024917..63000000 (Yq telomere)

The following positions are either CP086569.1 specific or they're errors in the assembly. Unless your sample matches the CP086569.1 reference, you should probably ignore them.
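One way to ignore these positions in practice is to drop them from your variant calls. The following is only a minimal sketch, not part of our pipeline: it assumes you have saved the Position/REF/ALT rows from this section to a plain text file (ref_specific.txt is a hypothetical name) and that your calls are in an uncompressed VCF. The function name filter_ref_specific is also made up for illustration. Note that it removes every record at a listed position, regardless of the alleles called there.

```shell
#!/bin/bash
# Sketch only: drop VCF records that fall on reference-specific positions.
# Usage: filter_ref_specific EXCLUDE_LIST < in.vcf > out.vcf
# EXCLUDE_LIST (hypothetical file) holds rows like "CP086569.1:2639156 G A".
filter_ref_specific() {
    awk -F'\t' -v list="$1" '
        BEGIN {
            # Load every "contig:position" key from the list; the
            # "Position REF ALT" header row has no colon and is skipped.
            while ((getline row < list) > 0) {
                split(row, col, " ")
                if (col[1] ~ /:/) skip[col[1]] = 1
            }
            close(list)
        }
        /^#/ { print; next }          # pass VCF header lines through
        !(($1 ":" $2) in skip)        # keep records not on a listed position
    '
}
```

It could then be called as: filter_ref_specific ref_specific.txt < SAMPLE.vcf > SAMPLE_filtered.vcf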
The list is only a first pass, based on comparing samples from A00, J1 and J2, so it is most definitely not complete. We'll try to add more positions as we find them.

Position             REF  ALT
CP086569.1:2639156   G    A
CP086569.1:2894059   C    A
CP086569.1:3095513   A    G
CP086569.1:3230741   T    C
CP086569.1:3317177   G    T
CP086569.1:3794525   G    A
CP086569.1:4312947   C    A
CP086569.1:4384233   G    C
CP086569.1:4499385   G    A
CP086569.1:4717925   A    G
CP086569.1:4857547   A    G
CP086569.1:5407684   A    G
CP086569.1:5510145   C    T
CP086569.1:6174800   C    T
CP086569.1:6371561   G    T
CP086569.1:6715088   T    A
CP086569.1:6895154   C    T
CP086569.1:7098564   A    G
CP086569.1:7306585   G    T
CP086569.1:7859831   A    G
CP086569.1:8180296   G    A
CP086569.1:8185608   A    G
CP086569.1:8612074   T    C
CP086569.1:8633813   T    C
CP086569.1:8724174   G    T
CP086569.1:8893294   A    G
CP086569.1:11889841  A    G
CP086569.1:11915848  C    T
CP086569.1:11987753  C    G
CP086569.1:12043695  G    C
CP086569.1:12176451  C    T
CP086569.1:12194508  T    G
CP086569.1:12275196  G    A
CP086569.1:12277712  A    G
CP086569.1:12292837  G    A
CP086569.1:12303370  T    C
CP086569.1:12323190  G    T
CP086569.1:12716776  T    C
CP086569.1:12730045  A    G
CP086569.1:12764225  T    C
CP086569.1:12908572  T    G
CP086569.1:13015599  A    G
CP086569.1:13100808  T    G
CP086569.1:13165700  T    C
CP086569.1:13190840  A    G
CP086569.1:13462810  C    A
CP086569.1:13775103  G    A
CP086569.1:14259019  A    G
CP086569.1:14413616  G    C
CP086569.1:14545767  T    C
CP086569.1:14597124  A    G
CP086569.1:14648700  T    C
CP086569.1:14720056  A    C
CP086569.1:14804158  A    C
CP086569.1:14994774  G    A
CP086569.1:15158295  A    G
CP086569.1:15248435  T    C
CP086569.1:15295430  G    T
CP086569.1:15322036  C    T
CP086569.1:15546349  G    A
CP086569.1:15753928  A    C
CP086569.1:15763566  A    G
CP086569.1:15851951  G    A
CP086569.1:16070398  T    G
CP086569.1:16170673  C    G
CP086569.1:16292212  T    G
CP086569.1:16318453  C    T
CP086569.1:16327655  T    A
CP086569.1:16454284  C    T
CP086569.1:16895928  C    T
CP086569.1:16943293  G    A
CP086569.1:17425252  T    A
CP086569.1:17468957  C    T
CP086569.1:17776123  A    C
CP086569.1:17809165  G    C
CP086569.1:17930135  G    A
CP086569.1:17958966  G    A
CP086569.1:18221163  G    A
CP086569.1:18352075  C    T
CP086569.1:19801237  A    T
CP086569.1:20028119  C    A
CP086569.1:20135100  A    G
CP086569.1:20164287  T    C
CP086569.1:20200505  T    C
CP086569.1:20206595  A    G
CP086569.1:20356288  G    A
CP086569.1:20372366  T    C
CP086569.1:20377158  C    G
CP086569.1:20475153  A    T
CP086569.1:20513708  T    C
CP086569.1:20714573  G    A
CP086569.1:20873454  T    G
CP086569.1:21242038  T    C
CP086569.1:21854008  C    G
CP086569.1:21960917  G    A
CP086569.1:21989103  T    C
CP086569.1:22095350  C    A
CP086569.1:22209074  C    T
CP086569.1:22285919  C    A
CP086569.1:22491462  T    C
CP086569.1:22629493  A    C
CP086569.1:27281106  A    G
CP086569.1:27330315  T    C
CP086569.1:27470822  T    A
CP086569.1:27817303  A    G
CP086569.1:27891394  G    T
CP086569.1:27892057  T    A
CP086569.1:28497247  T    C
CP086569.1:28681835  A    G
CP086569.1:29195036  A    C
CP086569.1:29521864  A    T
CP086569.1:29702625  A    G
CP086569.1:31218902  T    G
CP086569.1:33071900  C    G
CP086569.1:36983793  C    A
CP086569.1:36983805  A    G
CP086569.1:36983852  G    A
CP086569.1:39494969  C    A
CP086569.1:39494981  A    G
CP086569.1:39495028  G    A
CP086569.1:44667368  C    T
CP086569.1:44667430  C    A
CP086569.1:44667442  A    G
CP086569.1:48708221  T    G
CP086569.1:49087400  C    G
CP086569.1:54885039  C    G
CP086569.1:60760175  G    A
CP086569.1:61938128  T    G
CP086569.1:62001989  C    A
CP086569.1:62004625  T    A
CP086569.1:62004752  G    A
CP086569.1:62004810  C    T
CP086569.1:62004815  G    A
CP086569.1:62004841  C    T
CP086569.1:62004872  G    A
CP086569.1:62007271  C    T

Script example to convert the currently known hg38 SNPs to CP086569.1 positions.
This way you can always use the most up-to-date list of hg38 Y-SNPs:

8<----------------------------------------------------------------------------------
#!/bin/bash
wget https://ybrowse.org/gbrowse2/gff/snps_hg38.vcf.gz
wget https://ybrowse.org/gbrowse2/gff/CP086569.1/CP086569.fasta
wget https://ybrowse.org/gbrowse2/gff/CP086569.1/hg38_chrYToCP086569.over.chain.gz

# Convert coordinates from hg38 to CP086569.1
# http://crossmap.sourceforge.net/
CrossMap.py vcf hg38_chrYToCP086569.over.chain.gz snps_hg38.vcf.gz CP086569.fasta snps_chrCP086569.1.vcf

# Remove "chr" in front of CP086569.1
sed 's/chrCP086569/CP086569/g' snps_chrCP086569.1.vcf >snps_CP086569.1.vcf
rm -f snps_chrCP086569.1.vcf

# These are the non-mappable Y-SNPs, remaining in hg38 format
mv snps_chrCP086569.1.vcf.unmap snps_CP086569.1.unmap.vcf

bcftools sort -Oz snps_CP086569.1.vcf -o snps_CP086569.1_sorted.vcf.gz
tabix -p vcf snps_CP086569.1_sorted.vcf.gz
8<----------------------------------------------------------------------------------

Is the CP086569.1 reference better than hg38?

In most cases, no. It contains significantly more single-base errors than the hg38 reference, and the assembly structure is not clearly proven. However, it may be useful for samples in haplogroup J, and specifically J1, since it represents a Y chromosome that is more closely related to them than the hg38 sequence, which is known to represent R1b. There may be regions that apply to a J1 sample which don't exist on the hg38 reference sequence. If you have FastQ data from a WGS, then the previously unmapped reads are available and can be mapped to the new regions. We have found some interesting and possibly unique regions in the Yq12 range around 27..30 million.

Submitting CP086569.1 SNPs to YBrowse?

Sorry, we currently have no infrastructure to store CP086569.1 SNPs in a database. Since this is all experimental, we'll need to see how things develop.
Most likely we'll need to build something like a "Y Pangenome" to account for multiple parallel reference sequences (one for each haplogroup). We'll need to think about a concept to do this properly. So PLEASE DON'T SUBMIT CP086569.1 SNPs at this stage.

I hope this helps,
Thomas Krahn