## Intro

Xsamkit is a set of tools for working with the .Xsam file format. It includes conversion tools that convert .sam to .Xsam, and .Xsam to .sam, .fastq and other formats. Xsamkit also provides a range of filtering options that allow fast querying of .Xsam files. This documentation outlines the available functionality and gives a small example snippet for each feature.

Xsamkit is built using Java and requires Java 8 or later to run. We recommend running Xsamkit on Java 8, since the code has been developed on this version. The examples in this documentation assume that the Xsamkit program has been aliased as 'xsamkit'.

For a complete set of options, we can run the Xsamkit help menu:

    $ xsamkit -help

Xsamkit is currently available on the snap store.

## Useful Abbreviations

| Term | Definition |
|------|------------|
| PE | Paired-end reads |
| SE | Single-end reads |

In this documentation, the two ends of a mapped read are referred to as the left and the right. For our purposes, the left end of a read is the 3' end, and the right end of a read is the 5' end.

    Left (3') |==============| Right (5')

## Multi-threading

Much of the functionality within Xsamkit has been multi-threaded to reduce processing time. By default, Xsamkit uses all of the processors available on the computer. We recommend using at least 4 cores when running Xsamkit.

## Xsam file format

The .Xsam file format is based on the .sam format, but it has been sorted into relevant sections and extended with new custom .Xsam fields. The purpose of these custom fields is to cut computation by storing commonly used values for easy retrieval.

### Duplicated reads

All D section reads in PE files are duplicated. The paired fields of each pair, both the original and the duplicate, are based on the first read in the pair. The duplicate read pair has its read and mate read reversed, with the flag bits shifted accordingly.

## Xsam Header and Sections

The .Xsam format introduces 3 new sections, which are categorised by how the reads map to the genome:

| Section | Paired-end | Single-end |
|---------|------------|------------|
| P (Positioned) | Both ends map to the same reference name | Single end maps to a reference name |
| D (Different) | The two ends map to different reference names | N/A |
| X | Neither end maps to a reference name | Single end does not map to a reference name |

It is important to note that these are short-hand for .Xsam's mapping types. In paired-end .Xsam files, the P section contains both PPE and PX types of reads.

### Xsam header fields

The .Xsam header fields begin with @CO, to maintain independence from the original .sam header lines. An example of the header line is below:

    @CO Xsam:P:mm01:1:10:8000:1455794

There's a lot going on here.
Here's a breakdown of each term:

| Term | Definition |
|------|------------|
| @CO | This is a custom Xsam header chunk |
| Xsam | Xsam file format |
| P | Section definition |
| mm01 | Reference name of this section |
| 1 | Span length section |
| 10 | Number of reads (or read pairs) in the chunk |
| 8000 | Number of bytes in the chunk |
| 1455794 | Byte position of the first byte, relative to the end of the header |

#### Xsam chunk span sections

Reference name chunks are split into 4 sections. These sections are defined by a minimum and maximum span length of the reads within the chunk, which can be specified by the user. The chosen sizes are kept within the Xsam header to be read by Xsamkit.

    @CO Xsam:S:P1:2000
    @CO Xsam:S:P2:100000
    @CO Xsam:S:P3:3000000

The above sample contains the default values: 2000, 100000 and 3000000. With these values, all of the reads in section 1 have a span length between 0 and 2000. Section 2 contains reads with a span length between 2000 and 100000. Section 3 contains reads with a span length between 100000 and 3000000. Finally, all reads with a span length greater than 3000000 are kept in section 4. An example of a full set of header sections is shown below:

    @CO Xsam:P:mm01:1:173390:128234817:4409752
    @CO Xsam:P:mm01:2:13:9409:132644569
    @CO Xsam:P:mm01:3:51:37784:132653978
    @CO Xsam:P:mm01:4:417:310647:132691762

The values p1, p2 and p3 can be set on the command line with the respective --p1=, --p2= and --p3= flags.

### Xsam fields

The .Xsam format contains a number of pre-computed fields, extracted from data in the original .sam file. The purpose of these fields is to provide instant access to useful data and to speed up computation on large files. Each field begins with the 'x' character. Single-end (SE) files contain only information about the individual read, denoted by lower-case field names. Paired-end (PE) files contain information about the individual reads as well as information about the read pair. Read pair information is denoted by upper-case field names.
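
The naming convention above can be sketched in a few lines of Python. This is an illustration only and not part of Xsamkit: it assumes each field appears in a record as a TAG:TYPE:VALUE token in the style of .sam optional fields (for example `xl:i:1200`), and the names `XsamField` and `parse_xsam_field` are hypothetical. The values in the usage lines are made up.

```python
from typing import NamedTuple, Union


class XsamField(NamedTuple):
    tag: str                 # field name, e.g. "xl" or "xS"
    type_code: str           # "i" for integer, "A" for character
    value: Union[int, str]
    scope: str               # "read" (lower-case name) or "pair" (upper-case name)


def parse_xsam_field(token: str) -> XsamField:
    """Split one TAG:TYPE:VALUE token (assumed layout) and classify it.

    The scope follows the documented rule: the letter after the leading
    'x' is lower-case for single-read fields and upper-case for
    read-pair fields.
    """
    parts = token.split(":", 2)
    if len(parts) != 3:
        raise ValueError(f"expected TAG:TYPE:VALUE, got {token!r}")
    tag, type_code, raw = parts
    if len(tag) < 2 or not tag.startswith("x"):
        raise ValueError(f"not an Xsam field: {token!r}")
    value: Union[int, str] = int(raw) if type_code == "i" else raw
    scope = "pair" if tag[1].isupper() else "read"
    return XsamField(tag, type_code, value, scope)


# Illustrative tokens only; the numbers are invented for this sketch.
examples = [parse_xsam_field(t) for t in ("xl:i:1200", "xd:A:f", "xS:i:350")]
```

Here `examples[0].scope` is `"read"` and `examples[2].scope` is `"pair"`, which mirrors the lower-case and upper-case split described above.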

Below is a list of each field and what it means*:

| Field | Definition | SE? | PE? |
|-------|------------|-----|-----|
| xl:i: | Left end position of single read | | |
| xr:i: | Right end position of single read | | |
| xs:i: | Span length of single read | | |
| xd:A: | Mapping direction of single read: forward (f) or reverse (r) | | |
| xm:A: | Mapping type of a single read: unique (u), repeat (r), or non-mapped (x) | | |
| xx:i: | Specifies that this "un-mapped" read was originally mapped as a "repeat" but has been switched by Xsamkit's repeat handling mode. 1 denotes the switch, 0 otherwise | | |
| xa:A: | Currently not in use | | |
| xL:i: | Left position of the read pair | | |
| xR:i: | Right position of the read pair | | |
| xS:i: | Span length of the read pair | | |
| xW:i: | Window length of the read pair | | |
| xP:i: | Records duplicated read number in the field, unique to each duplicate | | |
| xQ:i: | Denotes a duplicate. 0 for original read pair, 1 for duplicate read pair | | |
| xC:i: | Currently not in use | | |
| xD:i: | Currently not in use | | |
| xLseq:i: | (OPTIONAL) Flank extended left field | | |
| xRseq:i: | (OPTIONAL) Flank extended right field | | |

*All fields are required unless specified otherwise. The presence of :i: denotes an integer value; :A: denotes a single-character string.

# Sam To Xsam Conversion

The input file can also be provided in .sam format. When this occurs, the program automatically converts the input into .Xsam format. Stream commands can still be strung together, so the command line looks no different from one that uses an .Xsam input file. For example:

    $ xsamkit \
        --if=example.sam \
        --st1=stream1,rng,mm01,1,1000000,LE,Xsam \
        --dir=streamOutput
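
Once a file is in .Xsam format, its `@CO Xsam` chunk header lines can be read back with very little code. The sketch below is not Xsamkit's own implementation: `ChunkHeader`, `parse_chunk_header` and `span_section` are hypothetical names, and it assumes that reference names contain no colons and that each span limit is an inclusive upper bound (the documentation only states that reads above 3000000 fall in section 4). It handles chunk lines only, not the `Xsam:S:` span-limit lines. The sample lines are the ones quoted in the header section of this documentation.

```python
import bisect
from typing import NamedTuple, Sequence

# Default span limits (p1, p2, p3) quoted in this documentation.
DEFAULT_LIMITS = (2000, 100000, 3000000)


class ChunkHeader(NamedTuple):
    section: str       # P, D or X
    reference: str     # reference name, e.g. "mm01"
    span_section: int  # 1 to 4
    n_reads: int       # reads (or read pairs) in the chunk
    n_bytes: int       # size of the chunk in bytes
    offset: int        # first byte, relative to the end of the header


def parse_chunk_header(line: str) -> ChunkHeader:
    """Parse one '@CO Xsam:<section>:<ref>:<span>:<reads>:<bytes>:<offset>' line."""
    prefix, payload = line.split(None, 1)
    parts = payload.strip().split(":")
    if prefix != "@CO" or len(parts) != 7 or parts[0] != "Xsam":
        raise ValueError(f"not an Xsam chunk header: {line!r}")
    _, section, reference, span, n_reads, n_bytes, offset = parts
    return ChunkHeader(section, reference, int(span),
                       int(n_reads), int(n_bytes), int(offset))


def span_section(span_length: int, limits: Sequence[int] = DEFAULT_LIMITS) -> int:
    """Return the span section (1 to 4) for a span length.

    Assumption: a span equal to a limit stays in the lower section.
    """
    if span_length < 0:
        raise ValueError("span length must be non-negative")
    return bisect.bisect_left(limits, span_length) + 1


HEADER = [
    "@CO Xsam:P:mm01:1:173390:128234817:4409752",
    "@CO Xsam:P:mm01:2:13:9409:132644569",
    "@CO Xsam:P:mm01:3:51:37784:132653978",
    "@CO Xsam:P:mm01:4:417:310647:132691762",
]
chunks = [parse_chunk_header(line) for line in HEADER]

# In this sample each chunk starts where the previous one ends.
contiguous = all(a.offset + a.n_bytes == b.offset
                 for a, b in zip(chunks, chunks[1:]))
```

In the sample header, each chunk's byte position plus its byte count equals the byte position of the next chunk, so `contiguous` is true for these four lines. A reader could use the same offsets to seek straight to one reference name and span section without scanning the rest of the file.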

## Advanced Pile Up

The advanced pile up (APU) command creates an alignment coverage histogram (ACH) of the input Xsam file. The user must specify a footprint for the read, which denotes the point in the read to be incremented in the histogram. Two types of footprint are available for APU conversion. The first are line footprints:

| Line Footprint | Definition |
|----------------|------------|
| Read | Increments the ACH for each base position in the read sequence |
| Span | Increments the ACH for the full span width |

## APU Kit

COMING SOON

APUkit is a separate piece of software designed to output peaks found within the .apu file, as well as to convert the step size and bin size of the file. APUkit requires the input .apu file to have a resolution of bin size 1 and step size 1, which matches the output of the APU function of Xsamkit.

Command line:

    $ apukit --if=input.apu --p=[stepSize,binSize [peakThreshold, peakBase]]

where:

## Found a bug?

Xsamkit is a work in progress and by no means flawless. If you've found a bug, please let us know.

## De-duplication of Xsam files

De-duplication works by looking at user-defined coordinates. Xsamkit takes an Xsam file and filters for any reads which have the same values for those coordinates. The read with the highest PHRED score is output into a primos file, whilst all of the duplicates with lower PHRED scores are placed into a duplicates file. Any combination of these coordinates is valid for de-duplication:

| Coordinate | Definition |
|------------|------------|
| POS1 | Pos value of the first read in the . |