Intro

Xsamkit is a set of tools for working with the .Xsam file format. It contains conversion tools for converting .sam to .Xsam, as well as .Xsam to .sam, .fastq and more. Xsamkit also provides a multitude of filtering options that allow fast querying of .Xsam files. This documentation outlines the available functionality and provides small example code snippets for each piece of functionality.

Xsamkit is built using Java, and requires Java 8 or later to run. We recommend using Java 8 to run Xsamkit, since the code has been developed on this version. The examples in this documentation assume that the Xsamkit program has been aliased as ’xsamkit’.

For a complete set of options, we can run the Xsamkit help menu:

$ xsamkit -help

Xsamkit is currently available on the Snap Store.

Useful Abbreviations

| Term | Definition |
| --- | --- |
| PE | Paired-end reads |
| SE | Single-end reads |

In this documentation, the ends of a mapped read are referred to as the left and the right. For our purposes, the left end of a read is the 3’ end, and the right end of a read is the 5’ end.

Left ( 3’ ) |==============| Right ( 5’ )

Multi-threading

Much of the functionality within Xsamkit has been multi-threaded to reduce processing time. By default, Xsamkit uses all of the processors available on the computer. We recommend using at least 4 cores when running Xsamkit.

Xsam file format

The .Xsam file format is based on the .sam format, but it has been sorted into relevant sections and extended with new custom .Xsam fields. The purpose of these custom fields is to cut computation by storing commonly used values for easy retrieval.

Duplicated reads

All D section reads in PE files are duplicated. The paired fields of each pair, the original and the duplicate, are based on the first read in the pair. The duplicate read pair will have its read and mate read reversed, with the flag bits shifted accordingly.
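For readers less familiar with SAM flags, here is a minimal sketch of what reversing the read and mate could look like at the flag level. It assumes that "shifted accordingly" means exchanging the standard SAM flag bit pairs (read/mate unmapped, read/mate reverse strand, first/second in pair). That is one interpretation of the description above, not a confirmed account of what Xsamkit does internally, and the function name `swap_mate_flags` is hypothetical.

```python
# Standard SAM flag bits that come in read/mate pairs.
# Assumption: these are the bits exchanged for a duplicate D section pair.
PAIRED_BITS = (
    (0x4, 0x8),    # read unmapped / mate unmapped
    (0x10, 0x20),  # read on reverse strand / mate on reverse strand
    (0x40, 0x80),  # first in pair / second in pair
)


def swap_mate_flags(flag: int) -> int:
    """Return the flag as seen from the mate's point of view."""
    swapped = flag
    for read_bit, mate_bit in PAIRED_BITS:
        read_set = bool(flag & read_bit)
        mate_set = bool(flag & mate_bit)
        # Only pairs where exactly one bit is set change when swapped.
        if read_set != mate_set:
            swapped ^= read_bit | mate_bit
    return swapped


# 99 = paired, proper pair, mate reverse, first in pair
print(swap_mate_flags(99))
```

Applying the function twice returns the original flag, which is a handy sanity check for this kind of swap.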

Xsam Header and Sections

The .Xsam format introduces 3 new sections which are categorised by how the reads map to the genome:

| Section | Paired-end | Single-end |
| --- | --- | --- |
| P (Positioned) | Both ends map to the same reference name | Single end maps to a reference name |
| D (Different) | The two ends map to different reference names | N/A |
| X | Neither end maps to a reference name | Single end does not map to a reference name |

It’s important to note here that these are shorthand for .Xsam’s mapping types. In the case of paired-end .Xsam files, the P section contains both PPE and PX types of reads.

Xsam header fields

The .Xsam header fields begin with @CO, to maintain independence from the original .sam header lines. An example of the header line is below:

@CO Xsam:P:mm01:1:10:8000:1455794

There’s a lot going on here. Here’s a breakdown of each term:

| Term | Definition |
| --- | --- |
| @CO | This is a custom Xsam header chunk |
| Xsam | Xsam file format |
| P | Section definition |
| mm01 | Reference name of this section |
| 1 | Span length section |
| 10 | Number of reads (or read pairs) in the chunk |
| 8000 | Number of bytes in the chunk |
| 1455794 | Byte position of the first byte, relative to the end of the header |
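To make the breakdown concrete, here is a minimal parsing sketch for a chunk header line of the shape shown above. It is not Xsamkit’s own parser: the names `ChunkHeader` and `parse_chunk_header` are hypothetical, and the sketch assumes the seven colon-separated terms from the example and a reference name that contains no colon.

```python
from typing import NamedTuple


class ChunkHeader(NamedTuple):
    section: str       # section definition, e.g. P
    reference: str     # reference name of this section, e.g. mm01
    span_section: int  # span length section
    n_reads: int       # number of reads (or read pairs) in the chunk
    n_bytes: int       # number of bytes in the chunk
    offset: int        # first byte, relative to the end of the header


def parse_chunk_header(line: str) -> ChunkHeader:
    """Parse one '@CO Xsam:...' chunk header line (sketch only)."""
    tag, payload = line.split(None, 1)  # split on a tab or spaces
    if tag != "@CO":
        raise ValueError("not a @CO header line")
    parts = payload.strip().split(":")
    # Assumption: exactly the seven terms from the example above.
    if len(parts) != 7 or parts[0] != "Xsam":
        raise ValueError("not an Xsam chunk header")
    _, section, reference, span, n_reads, n_bytes, offset = parts
    return ChunkHeader(section, reference, int(span),
                       int(n_reads), int(n_bytes), int(offset))


header = parse_chunk_header("@CO Xsam:P:mm01:1:10:8000:1455794")
print(header.reference, header.n_reads, header.offset)
```

Header lines with a different number of terms are rejected with a `ValueError` rather than guessed at.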

Xsam chunk span sections

Reference name chunks are split into 4 sections. These sections are defined by a minimum and maximum span length of the reads within the chunk, which can be specified by the user. The chosen sizes are stored in the .Xsam header so that Xsamkit can read them.

@CO	Xsam:S:P1:2000
@CO	Xsam:S:P2:100000
@CO	Xsam:S:P3:3000000

The above sample contains the default values, 2000, 100000 and 3000000.

With these values, section 1 contains all reads with a span length between 0 and 2000, section 2 contains reads with a span length between 2000 and 100000, and section 3 contains reads with a span length between 100000 and 3000000. Finally, all reads with a span length greater than 3000000 are kept in section 4.
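The mapping from span length to section can be sketched in a few lines. This is an illustration only: the function name `span_section` is hypothetical, and the sketch assumes that a span length exactly equal to a boundary belongs to the lower section, which the text above does not state explicitly except for section 4.

```python
import bisect

# Default boundaries from the header example above.
DEFAULT_BOUNDS = (2000, 100000, 3000000)


def span_section(span_length: int, bounds=DEFAULT_BOUNDS) -> int:
    """Return the span section (1 to 4) for a given span length.

    Assumption: a span length equal to a boundary stays in the
    lower section, so 2000 is section 1 and 2001 is section 2.
    """
    return bisect.bisect_left(bounds, span_length) + 1


for length in (150, 2000, 2001, 3000000, 3000001):
    print(length, span_section(length))
```

Passing a different `bounds` tuple mirrors what the `--p1=`, `--p2=` and `--p3=` values change.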

An example of a full set of header sections is defined below:

@CO	Xsam:P:mm01:1:173390:128234817:4409752
@CO	Xsam:P:mm01:2:13:9409:132644569
@CO	Xsam:P:mm01:3:51:37784:132653978
@CO	Xsam:P:mm01:4:417:310647:132691762

The values p1, p2 and p3 can be set on the command line with the respective --p1=, --p2= and --p3= flags.

Xsam fields

The .Xsam format contains a number of pre-computed fields, extracted from data in the original .sam file. The purpose of these fields is to provide instant access to useful data and speed up computation times of large files. Each field begins with the ’x’ character. Single-end (SE) files contain only information about the individual read, denoted by lower-case field names. Paired-end (PE) reads contain both information about the individual reads, as well as information about the read pair. Read pair information is denoted by upper-case field names. Below is a list of each field and what it means*:

| Field | Definition |
| --- | --- |
| xl:i: | Left end position of single read |
| xr:i: | Right end position of single read |
| xs:i: | Span length of single read |
| xd:A: | Mapping direction of single read, forward(f) or reverse(r) |
| xm:A: | Mapping type of a single read, unique(u), repeat(r), or non-mapped(x) |
| xx:i: | Specifies that this “un-mapped” read was originally mapped as a “repeat” but has been switched by xsamkit’s repeat handling mode. 1 denotes the switch, 0 otherwise |
| xa:A: | Currently not in use |
| xL:i: | Left position of the read pair |
| xR:i: | Right position of the read pair |
| xS:i: | Span length of the read pair |
| xW:i: | Window length of the read pair |
| xP:i: | Records duplicated read number in the field, unique to each duplicate |
| xQ:i: | Denotes a duplicate. 0 for original read pair, 1 for duplicate read pair. |
| xC:i: | Currently not in use |
| xD:i: | Currently not in use |
| xLseq:i: | (OPTIONAL:) flank extended left field |
| xRseq:i: | (OPTIONAL:) flank extended right field |

*All fields are required, unless specified otherwise.

The presence of an :i: denotes an integer value. :A: denotes a String.
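As a small illustration of how these fields can be read back, the sketch below pulls the custom fields out of the optional columns of a read line. The function name `parse_xsam_fields` and the example values are made up for illustration; only the field names and the :i: / :A: type markers come from the table above.

```python
def parse_xsam_fields(columns):
    """Collect custom .Xsam fields from a list of optional columns.

    Columns are expected in NAME:TYPE:VALUE form. Only names starting
    with a lower-case 'x' are kept; :i: values are converted to int
    and everything else is left as a string.
    """
    fields = {}
    for column in columns:
        parts = column.split(":", 2)
        if len(parts) != 3 or not parts[0].startswith("x"):
            continue  # not a custom .Xsam field
        name, kind, value = parts
        fields[name] = int(value) if kind == "i" else value
    return fields


# Hypothetical optional columns of a single read (values invented).
columns = ["NM:i:0", "xl:i:100", "xr:i:150", "xd:A:f", "xm:A:u"]
print(parse_xsam_fields(columns))
```

Standard .sam tags such as `NM` are skipped, so the result holds only the pre-computed .Xsam values.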

Sam To Xsam Conversion

The input file can also be provided in .sam format. When this occurs, the program will automatically convert the input into .Xsam format. Stream commands can be strung onto the same call, so that the command line looks no different from one with an .Xsam input file.

For example:

$ xsamkit \
    --if=example.sam \
    --st1=stream1,rng,mm01,1,1000000,LE,Xsam \
    --dir=streamOutput

Advanced Pile Up

The advanced pile up (APU) command creates an alignment coverage histogram (ACH) of the input Xsam file. The user must specify a footprint for the read, which denotes the point in the read to be incremented in the histogram. We introduce two types of footprints available for APU conversion. The first are line footprints:

| Line Footprint | Definition |
| --- | --- |
| Read | Increments the ACH for each base position in the read sequence |
| Span | Increments the ACH for the full span width. |

[Read More]
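To give a feel for the difference between the two line footprints, here is a rough sketch of a coverage histogram built from intervals. It is not the APU implementation: the function name `coverage_histogram` is hypothetical, positions are assumed to be 1-based and inclusive, and the sketch assumes that the Read footprint covers each read’s own bases while the Span footprint covers everything between the left and right ends of a pair.

```python
def coverage_histogram(intervals, length):
    """Count how many intervals cover each position 1..length.

    Uses a difference array: +1 where an interval starts, -1 just
    after it ends, then a running sum gives the coverage.
    """
    diff = [0] * (length + 2)
    for left, right in intervals:
        diff[left] += 1
        diff[right + 1] -= 1
    histogram = []
    running = 0
    for position in range(1, length + 1):
        running += diff[position]
        histogram.append(running)
    return histogram


# One hypothetical read pair on a reference of length 12:
# the two reads cover 2..4 and 8..10, so the pair spans 2..10.
read_footprint = coverage_histogram([(2, 4), (8, 10)], 12)
span_footprint = coverage_histogram([(2, 10)], 12)
print(read_footprint)
print(span_footprint)
```

With the Read footprint the gap between the two reads stays at zero, whereas the Span footprint fills it in.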

APU Kit

COMING SOON

APUkit is a separate piece of software designed to output peaks found within the .apu file, as well as to convert the step size and bin size of the file. APUkit requires the input .apu file to have a bin size of 1 and a step size of 1, which matches the output of the APU function of Xsamkit.

Command line:

$ apukit --if=input.apu --p=[stepSize,binSize [peakThreshold, peakBase]]

where:

[Read More]

Found a bug?

Xsamkit is a work in progress and by no means flawless. If you’ve found a bug, please let us know on GitLab.

De-duplication of Xsam files

De-duplication works by looking at user-defined coordinates. Xsamkit takes an Xsam file and filters for any reads which have the same values for those coordinates. The read with the highest PHRED score is output into a primos file, whilst all of the duplicates with lower PHRED scores are placed into a duplicates file. Any combination of these coordinates will be valid for de-duplication:

| Coordinate | Definition |
| --- | --- |
| POS1 | Pos value of the first read in the . |

[Read More]
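The general idea can be sketched as grouping reads by the chosen coordinates and keeping the best-scoring read in each group. This is a simplified illustration rather than Xsamkit’s algorithm: reads are modelled as plain dictionaries, the `phred` key and the function name `deduplicate` are hypothetical, and ties are assumed to be resolved in favour of the read seen first.

```python
def deduplicate(reads, coordinates):
    """Split reads into (kept, duplicates) by the given coordinates.

    Reads sharing the same values for all coordinates form a group.
    The read with the highest 'phred' value in a group is kept and
    the others are returned as duplicates.
    """
    best = {}
    duplicates = []
    for read in reads:
        key = tuple(read[name] for name in coordinates)
        current = best.get(key)
        if current is None:
            best[key] = read
        elif read["phred"] > current["phred"]:
            duplicates.append(current)  # previous best is demoted
            best[key] = read
        else:
            duplicates.append(read)
    return list(best.values()), duplicates


# Hypothetical reads; only POS1 is taken from the table above.
reads = [
    {"name": "a", "POS1": 100, "phred": 30},
    {"name": "b", "POS1": 100, "phred": 40},
    {"name": "c", "POS1": 200, "phred": 20},
]
kept, dupes = deduplicate(reads, ["POS1"])
print([r["name"] for r in kept], [r["name"] for r in dupes])
```

Passing several coordinate names at once corresponds to de-duplicating on a combination of coordinates.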

Footprints

A footprint simply denotes the points on a read which the program works with when executing sub-routines. Footprints vary from SE to PE input .Xsam files.

SE Footprints

| Footprint | Definition |
| --- | --- |
| Lb | Left Base of the read |
| Rb | Right Base of the read |
| Ab | Any Base of the read (Lb or Rb) |
| 5p | 5 prime - base on the 5 prime end of the read |
| 3p | 3 prime - base on the 3 prime end of the read |

PE Footprints

| Footprint | Definition |
| --- | --- |
| LE | Left External - the left-most base of the read pair (xL) |
| RE | Right External - the right-most base of the read pair (xR) |
| AE | Any External - LE or RE |
| LI | Left Internal - the right-most base of the left-most read |
| RI | Right Internal - the left-most base of the right-most read |
| AI | Any Internal - LI or RI |

Mapping Type

The mapping type of the read, as defined in the xm field.

[Read More]
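The PE footprint definitions can be worked through with a short sketch. Each read is represented here as a hypothetical `(left, right)` tuple of positions, and the function name `pe_footprints` is made up for illustration; the definitions themselves follow the PE Footprints table above.

```python
def pe_footprints(read_a, read_b):
    """Return the PE footprint positions for a pair of mapped reads.

    Each read is a (left, right) tuple of positions on the same
    reference. The order in which the two reads are passed does
    not matter.
    """
    left_read, right_read = sorted((read_a, read_b))
    le = min(read_a[0], read_b[0])  # left-most base of the pair
    re = max(read_a[1], read_b[1])  # right-most base of the pair
    li = left_read[1]               # right-most base of the left-most read
    ri = right_read[0]              # left-most base of the right-most read
    return {
        "LE": le, "RE": re, "AE": (le, re),
        "LI": li, "RI": ri, "AI": (li, ri),
    }


# Hypothetical pair: one read at 100..150, its mate at 300..350.
print(pe_footprints((300, 350), (100, 150)))
```

The "Any" footprints are returned as both candidate positions, since either one satisfies them.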

Developer Stuff

Functionality for working with the .Xsam file format is available in the core library.

The library is hosted on GitHub, and can easily be imported with JitPack.

Filtering

The bulk of the xsamkit functionality is centred around the ability to filter .Xsam based on different criteria. This documentation will touch on each filtering option, describing what it does and how to use it. A key concept to understand is read footprints.

Range Filtering

A range is defined as a section of a particular reference between two points - a left point, or min, and a right point, or max.

[Read More]
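As a rough illustration of the range idea, the sketch below keeps only the reads whose footprint position falls inside a range on a given reference. It is not how Xsamkit implements range filtering: reads are plain dictionaries, the `ref` key and the function name `in_range` are hypothetical, and both the min and max points are assumed to be inclusive.

```python
def in_range(read, reference, range_min, range_max, position_field="xL"):
    """True if the read's footprint position lies within the range.

    position_field names the value used as the footprint position.
    xL corresponds to the LE footprint of a read pair.
    Assumption: range_min and range_max are both inclusive.
    """
    if read["ref"] != reference:
        return False
    return range_min <= read[position_field] <= range_max


# Hypothetical read pairs, filtered on reference mm01 from 1 to 1000000.
reads = [
    {"ref": "mm01", "xL": 500},
    {"ref": "mm01", "xL": 2000000},
    {"ref": "mm02", "xL": 500},
]
kept = [r for r in reads if in_range(r, "mm01", 1, 1000000)]
print(kept)
```

Swapping `position_field` for another pre-computed field, such as `xR`, changes which footprint the range is tested against.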

Flank extension Flank extension (or flankext) refers to mapping the reads back to the genome, finding the upstream or downstream bases, and adding these as new custom .Xsam fields. It is up to the user to provide relevant fasta files which match the reference names in the .Xsam header. If these are not provided, or some are missing then Xsamkit will provide a relevant info message and exit. Below is an example of how to specify flank extension: [Read More]

.ini Files

Streams can be prepared ahead of time and stored in an .ini file. These files are built up of sections of streams that are converted at runtime. Using an .ini file is a great way to store previous stream combinations. Let’s take a look at an example of an .ini file:

[section01]
--st1=...
--st2=...

[section02]
--st1=..
--st2=..
--st3=..

This example .ini file contains 2 valid sections of streams, section01 and section02.

[Read More]
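To show how a file of this shape breaks down, here is a minimal sketch that reads the text into sections of stream arguments. It is an illustration, not Xsamkit’s .ini reader: the function name `parse_stream_ini` is hypothetical, the second stream definition is invented, and skipping lines that start with `;` or `#` is an assumption borrowed from common .ini conventions.

```python
def parse_stream_ini(text):
    """Map each [section] name to its list of stream arguments."""
    sections = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith((";", "#")):
            continue  # blank line or comment (assumed convention)
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]
            sections[current] = []
        elif current is None:
            raise ValueError("stream found before any [section]")
        else:
            sections[current].append(line)
    return sections


# The first stream reuses the conversion example from this
# documentation; the second is a made-up variation.
example = """
[section01]
--st1=stream1,rng,mm01,1,1000000,LE,Xsam
--st2=stream2,rng,mm01,1,2000000,LE,Xsam
"""
for name, streams in parse_stream_ini(example).items():
    print(name, streams)
```

Each section’s list can then be handed over as if the streams had been typed on the command line.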