
Sunday, January 14, 2018

Faceted browsing of chemicals in Wikidata

A few days ago Aidan and José introduced GraFa on the Wikidata mailing list. It is a faceted browser for content in Wikidata, and the screenshot on the right shows it applied to chemical compounds. They welcome feedback.

GraFa run on things of type chemical compound.
Besides this screenshot, I have not played with it a lot. It looks quite promising, and my initial feedback would be to ask for a feature to sort the results, and the ability to export the full list to some other tool, e.g. to download all those items as RDF.

Saturday, January 06, 2018

"All things must come to an end"

Cover of the book.
No worries, this is just about my Groovy Cheminformatics book. Seven years ago I started a project that was very educational to me: self-publishing a book. With help from Lulu.com I managed to get a book out that sold over 100 copies and that was regularly updated. But therein lies the problem: supply creates demand. I had an automated setup that reran scripts and recreated text output and even figures for the book (2D chemical diagrams). I wanted to make an edition for every CDK release. All in all, I got quite far with that: eleven editions.

But the current research setting, at least in academia, does not provide me with the means to keep this going. Sad thing is, the hardest part is actually updating the graphics for the cover, which needs to be resized each time the book gets thicker. But John Mayfield introduced so many API changes, I just did not have the time to update the book. I tried, and I have a twelfth edition on my desk. But where my automated setup scales quite nicely, I don't.

It may be worth reiterating why I started the book. We have had several places where information was given, and questions were answered: the mailing list, wiki pages, JavaDoc, the Chemistry Toolkit Rosetta Wiki, and more. Nothing in the book was not already answered somewhere else. The book was just a boon for me to answer those questions and provide an easy way for people to get many answers.

Now, because I could not keep up with the recent API changes, I am no longer feeling comfortable with releasing the book. As such, I have "retired" the book.

I am now working out how to move on from here. An earlier edition is already online under a Creative Commons license, and it's tempting to release the latest version like this too. That said, I have also been talking with the other CDK project leaders about alternatives. More on this soon, I guess.

Here's an overview of posts about the book:

Saturday, December 30, 2017

Adding SMILES, InChI, etc to Wikidata alkane pages

Ten alkanes in Wikidata. The ones without CAS registry number previously did not have InChIKey or PubChem CID. But no more; I added those.
While working on the 'chemical class' aspect for Scholia yesterday I noted that the page for alkanes was quite large, with a list of more than 50 long chain alkanes with pages in the Japanese Wikipedia with no SMILES, InChI, InChIKey, etc.

So, I dug up my Bioclipse scripts to add chemicals to Wikidata starting with a SMILES (btw, the script has significantly evolved since) and extended the query of that Scholia aspect to list just the Wikidata Q-code and name. This script starts with one or more SMILES strings and generates QuickStatements (a must-learner).
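For readers who have not seen QuickStatements before, here is a minimal sketch of what generating them can look like. It is written in Python rather than Bioclipse Groovy and is not the script referred to above. The helper name `quickstatements` and the placeholder item `Q00000` are made up for illustration, and the property numbers for SMILES (P233) and InChI (P234) are assumptions that should be checked against Wikidata; P235 and P662 match the SPARQL query later on this page.

```python
# Hypothetical helper, not the Bioclipse script from the post.
# Assumed Wikidata properties: P233 (canonical SMILES), P234 (InChI),
# P235 (InChIKey), P662 (PubChem CID).
PROPERTIES = {
    "smiles": "P233",
    "inchi": "P234",
    "inchikey": "P235",
    "pubchem_cid": "P662",
}


def quickstatements(qid, identifiers):
    """Return one tab-separated 'item, property, "value"' line per identifier."""
    lines = []
    for key, prop in PROPERTIES.items():
        value = identifiers.get(key)
        if value:  # skip identifiers that could not be computed or looked up
            lines.append(f'{qid}\t{prop}\t"{value}"')
    return lines


if __name__ == "__main__":
    # Q00000 is a placeholder item; the identifiers are those of propane.
    propane = {
        "smiles": "CCC",
        "inchi": "InChI=1S/C3H8/c1-3-2/h3H2,1-2H3",
    }
    print("\n".join(quickstatements("Q00000", propane)))
```

Identifiers that are missing or empty are simply left out, so a compound without a PubChem CID still gets its SMILES and InChI statements.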

Because the Wikidata entries also had the English IUPAC name, I can use that to autogenerate SMILES. Enter the OPSIN (doi:10.1021/ci100384d) plugin for Bioclipse which in combination with the CDK allowed me to create the matching SMILES, InChI, InChIKey, and use the latter to look up the PubChem compound identifier (CID). This is the script I ended up with:

inputFile = "/Wikidata/Alkanes/alkanes.tsv"
new File(bioclipse.fullPath(inputFile)).eachLine { line ->
  fields = line.split("\t")
  if (fields[0].startsWith("http://www.wikidata.org/entity/")) {
    wdid = fields[0].substring("http://www.wikidata.org/entity/".length())
    name = fields[1]
    if (fields.length > 2) { // skip entities that already have an InChIKey
      inchikey = fields[2]
      // println "Skipping: $wdid $inchikey"
    } else { // ok, consider adding it
      // println "Considering $wdid $name"
      try {
        mol = opsin.parseIUPACName(name)
        smiles = cdk.calculateSMILES(
          cdk.addImplicitHydrogens(
            cdk.removeExplicitHydrogens(mol)
          )
        )
        //println "  SMILES: $smiles"
        println "${smiles}\t${wdid}"
      } catch (Exception error) {
        //println "Could not parse $name with OPSIN: ${error.message}" 
      }
    }
  }
}
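The script above stops at printing the SMILES and the Q-code. The InChIKey-to-CID lookup mentioned earlier can be done in several ways. The sketch below, in Python rather than Bioclipse Groovy, assumes PubChem's PUG REST service and is not necessarily what the Bioclipse script does. The function names are made up for illustration.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Base URL of PubChem's PUG REST service (assumed to be the lookup route here).
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def cid_lookup_url(inchikey):
    """Build the PUG REST URL that lists the CIDs matching an InChIKey."""
    return f"{PUG_REST}/compound/inchikey/{quote(inchikey)}/cids/TXT"


def parse_cids(text):
    """Parse the plain-text response, which has one CID per line."""
    return [int(token) for token in text.split() if token.isdigit()]


def lookup_cids(inchikey, timeout=30):
    """Fetch the CIDs for an InChIKey. Needs network access; an unknown
    InChIKey makes urlopen raise urllib.error.HTTPError."""
    with urlopen(cid_lookup_url(inchikey), timeout=timeout) as response:
        return parse_cids(response.read().decode("utf-8"))
```

Keeping the URL building and the response parsing separate from the network call makes it possible to check both without hitting PubChem.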

That way, I ended up with changes like this:


Friday, December 29, 2017

Using Scholia as Open Notebook Science tool to support literature searching

Source: Compound Interest, Andy Brunning. CC-BY-ND-NC.
I have blogged about Scholia and the underlying Wikidata before. Following the example of this WikiProject Zika Corpus I am using Scholia (doi:10.1007/978-3-319-70407-4_36, or in Scholia, of course :) as a tool to support a literature study, to collect articles about a certain topic. Previously I used it to track the publication trail around the Elsevier-SciHub interactions. But when I was linking the Compound Interest infographics for the Advent 2017 series to Wikidata items (aiming to archive them on Zenodo), I ran into the poisonous mistletoe graphic of day 9. This graphic mentions the phoratoxins. Sadly, not too much was recorded about them in Wikidata.

So, I did a quick scan of the literature (about half an hour, using Google Scholar). I ended up with a few articles about the chemistry of these compounds, and as a good open scientist I used Wikidata and Scholia as a notebook:


From these papers I found references to six specific compounds, phoratoxins A-F, for which I subsequently created Wikidata items:


I have a lot to discover about these cyclic peptides and they cannot be found in PubChem or ChemSpider (yet):


The SPARQL I use is as follows, and you can run it yourself (note the "edit" link in the left corner of this link):

SELECT ?mol ?molLabel ?InChIKey ?CAS ?ChemSpider ?PubChem_CID WITH {
    SELECT DISTINCT ?mol WHERE {
      ?mol wdt:P31/wdt:P279* wd:Q46995757 .
    } LIMIT 500
  } AS %result
  WHERE {
    INCLUDE %result
    OPTIONAL { ?mol wdt:P235 ?InChIKey }
    OPTIONAL { ?mol wdt:P231 ?CAS }
    OPTIONAL { ?mol wdt:P661 ?ChemSpider }
    OPTIONAL { ?mol wdt:P662 ?PubChem_CID }
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
  }
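To run a query like this outside the browser, something along the following lines should work. This is a minimal Python sketch, assuming the public Wikidata Query Service endpoint and the standard SPARQL JSON results format. The function names and the User-Agent string are placeholders, not part of Scholia.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Public Wikidata Query Service endpoint (assumed target for the query).
ENDPOINT = "https://query.wikidata.org/sparql"


def build_request(query, user_agent="example-notebook/0.1 (placeholder contact)"):
    """Prepare a GET request that asks the endpoint for JSON results."""
    url = ENDPOINT + "?" + urlencode({"query": query, "format": "json"})
    headers = {
        "User-Agent": user_agent,
        "Accept": "application/sparql-results+json",
    }
    return Request(url, headers=headers)


def rows(result):
    """Flatten SPARQL JSON results into dicts; unbound OPTIONALs become None."""
    variables = result["head"]["vars"]
    return [
        {var: binding[var]["value"] if var in binding else None
         for var in variables}
        for binding in result["results"]["bindings"]
    ]


def run(query, timeout=60):
    """Send the query and return the flattened rows (needs network access)."""
    with urlopen(build_request(query), timeout=timeout) as response:
        return rows(json.load(response))
```

Because every identifier in the query sits in an OPTIONAL clause, a compound without, say, a ChemSpider ID still shows up as a row, with `None` in that column.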

And since I had a few other compound classes there, and in our metabolomics research too, of course, I finally hacked up an extension of Scholia for chemical classes (pull request pending). This is what it looks like for fatty acid:



That makes browsing information about chemicals in Wikidata a lot easier and supports our effort to link WikiPathways to Wikidata considerably.

I also used this approach for other topics:

Looking at these pages again, it's great to see the community nature of Wikidata in action. The pages grow in richness over time :)

Wednesday, December 27, 2017

Two Papers about Adverse Outcome Pathways

Slice of Fig. 5 of Penny's paper (see main text).
Over the past year our group got involved in two projects where Adverse Outcome Pathways (AOPs) are used. These AOPs are risk assessment tools, but when linked to biological pathways they get a lot more interesting. The key events (KEs) in the AOPs can be linked to bioassays and thus biological processes. Earlier this year Marvin Martens (follow him on Twitter) started as a PhD candidate to work on EU-ToxRisk and OpenRiskNet on exactly this integration. The first of these two projects was recently described in "Adverse outcome pathways: opportunities, limitations and open questions" (doi:10.1007/s00204-017-2045-3).

But at the end of eNanoMapper we collaborated with Penny Nymark and Roland Grafström on bioassay measurements of biological responses to nanomaterial exposure. "We" is particularly Freddie and Linda, who worked with Penny to develop an approach to link AOPs with biological pathways, and they worked on this lung fibrosis pathway:


Penny's paper that describes this approach and this pathway was also recently published: "A Data Fusion Pipeline for Generating and Enriching Adverse Outcome Pathway Descriptions" (doi:10.1093/toxsci/kfx252).

BTW, this <iframe> embeds this pathway in the page using the JavaScript library pvjs by Anders Riutta from the Gladstone Institutes. You can click the genes to get identifiers for various databases. This page explains how to use it on your own webpage or blog.