Technical Appendix

Data provenance, algorithms, equations, and known limitations. Written to the standard of a scientific paper's Methods section.

1. Data Provenance

Every external data source used by Molara, including retrieval method, version, and caching strategy.

Source | URL / Version | Data Retrieved | Access | Cache / Rate
PubChem (NIH/NLM) | pubchem.ncbi.nlm.nih.gov/rest/pug | CID, molecular formula, MW, IUPAC name, canonical SMILES, XLogP, TPSA, HBD, HBA, description, 3D SDF coordinates | REST API (no key) | 5 req/sec; 5-min TTL
RDKit | rdkit.org ≥ 2024.3.5 | MW, LogP (Wildman-Crippen), TPSA, HBD, HBA, rotatable bonds, ring count, heavy atoms, Lipinski evaluation, 3D conformer generation | Local Python library | N/A
DDInter 2.0 | ddinter2.scbdd.com (2024) | Drug-drug interaction severity (Major / Moderate / Minor) for ~302,516 pairs across ~2,310 drugs | Local SQLite (pre-loaded from 8 ATC CSV files) | N/A (local)
OpenFDA | api.fda.gov/drug/label.json | Drug interaction warning text from FDA-approved drug labels | REST API (optional key) | 1,000/day; 10-min TTL
RCSB PDB | files.rcsb.org/download | Protein-drug co-crystal PDB coordinate files (5 curated pairs: COX-2, A2A, Penicillin Acylase, Neuraminidase, DHFR) | REST download (no key) | 10-min TTL
3Dmol.js | 3dmol.csb.pitt.edu v2.x | Client-side WebGL molecular visualization (stick, sphere, cartoon render modes) | npm package | N/A
AI Language Model | LLM | AI-generated pharmacology Q&A; system prompt enriched with current molecule context | REST API (key required) | 1,024 tokens/response

2. Algorithms & Equations

Every computation performed by Molara, documented with the exact formulas and methods used.

2.1 Molecular Property Calculation

Molecular properties are computed from a canonical SMILES string using the RDKit cheminformatics library. PubChem provides the SMILES; RDKit recalculates selected descriptors server-side for consistency.

Property | RDKit Method | Description
Molecular Weight | Descriptors.MolWt() | Sum of average atomic masses (Da)
LogP | Descriptors.MolLogP() | Wildman-Crippen octanol-water partition coefficient
TPSA | Descriptors.TPSA() | Topological polar surface area (Ertl method, Ų)
H-Bond Donors | Descriptors.NumHDonors() | Count of N–H and O–H groups
H-Bond Acceptors | Descriptors.NumHAcceptors() | Count of N and O atoms
Rotatable Bonds | Descriptors.NumRotatableBonds() | Non-terminal, non-ring single bonds
Ring Count | Descriptors.RingCount() | Number of rings in the smallest set of smallest rings (SSSR)

2.2 3D Structure Generation

When a pre-computed 3D structure is unavailable from PubChem, Molara generates one from the SMILES string using a two-step process:

  1. Embedding — Explicit hydrogens are added, then a 3D conformer is generated with RDKit's ETKDGv3 (Experimental Torsion Knowledge Distance Geometry v3) algorithm using a fixed random seed of 42 for reproducibility.
  2. Optimization — The conformer is energy-minimized with the MMFF (Merck Molecular Force Field) for up to 500 iterations. The result is exported as an SDF mol block.

Pipeline: SMILES → Chem.MolFromSmiles() → Chem.AddHs() → AllChem.EmbedMolecule(ETKDGv3, seed=42) → AllChem.MMFFOptimizeMolecule(maxIters=500) → SDF
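The pipeline can be sketched in a few lines of RDKit. This is a minimal illustration under stated assumptions, not Molara's actual code: the function name `smiles_to_sdf_block`, the error handling, and the decision to skip minimization when MMFF has no parameters for the molecule are all hypothetical choices made for the example.

```python
from rdkit import Chem
from rdkit.Chem import AllChem


def smiles_to_sdf_block(smiles: str, seed: int = 42, max_iters: int = 500) -> str:
    """Return a 3D mol block for `smiles`; raise ValueError if a step fails."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")

    # Step 1: add explicit hydrogens, then embed with ETKDGv3 and a fixed seed.
    mol = Chem.AddHs(mol)
    params = AllChem.ETKDGv3()
    params.randomSeed = seed
    if AllChem.EmbedMolecule(mol, params) == -1:  # -1 signals embedding failure
        raise ValueError(f"3D embedding failed for {smiles!r}")

    # Step 2: MMFF energy minimization. Skipping molecules that MMFF cannot
    # parameterize is an assumption of this sketch, not documented behavior.
    if AllChem.MMFFHasAllMoleculeParams(mol):
        AllChem.MMFFOptimizeMolecule(mol, maxIters=max_iters)

    return Chem.MolToMolBlock(mol)
```

Calling `smiles_to_sdf_block("CCO")` yields a mol block for ethanol with all nine atoms (hydrogens included) and 3D coordinates. Because the seed is fixed, repeated calls on the same RDKit version should produce the same conformer.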

2.3 Lipinski's Rule of Five

Lipinski's Rule of Five (Lipinski et al., 1997) predicts whether a compound is likely to be orally bioavailable. A molecule passes if all four criteria are satisfied:

$$\text{MW} \leq 500 \text{ Da} \qquad \log P \leq 5 \qquad \text{HBD} \leq 5 \qquad \text{HBA} \leq 10$$

Where MW is molecular weight, LogP is the Wildman-Crippen partition coefficient, HBD is the count of hydrogen-bond donors (N–H, O–H), and HBA is the count of hydrogen-bond acceptors (N, O). Violations are counted and displayed alongside each rule's pass/fail status.
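The check itself is a handful of comparisons. The sketch below assumes the four descriptors have already been computed; the names `LipinskiResult` and `evaluate_lipinski` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LipinskiResult:
    rules: dict       # rule label -> True if the rule is satisfied
    violations: int   # number of rules that failed
    passes: bool      # True only when all four rules are satisfied


def evaluate_lipinski(mw: float, logp: float, hbd: int, hba: int) -> LipinskiResult:
    """Evaluate the four Rule of Five criteria and count violations."""
    rules = {
        "MW <= 500 Da": mw <= 500,
        "logP <= 5": logp <= 5,
        "HBD <= 5": hbd <= 5,
        "HBA <= 10": hba <= 10,
    }
    violations = sum(not ok for ok in rules.values())
    return LipinskiResult(rules=rules, violations=violations, passes=violations == 0)
```

For instance, a hypothetical compound with MW 650, logP 3, 2 donors, and 12 acceptors would report two violations (MW and HBA) and fail overall, while the per-rule dictionary supplies the pass/fail status shown next to each criterion.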

2.4 Pharmacokinetic Simulation

Molara uses a one-compartment oral dosing model with first-order absorption and first-order elimination. The body is modeled as a single, well-mixed compartment.

Symbol | Parameter | Unit
F | Oral bioavailability | fraction (0–1)
D | Dose | mg
V_d | Volume of distribution | L
k_a | Absorption rate constant | hr⁻¹
k_e | Elimination rate constant | hr⁻¹

Plasma concentration at time t

$$C(t) = \frac{F \cdot D \cdot k_a}{V_d \cdot (k_a - k_e)} \left( e^{-k_e t} - e^{-k_a t} \right)$$

Time to maximum concentration

$$t_{\max} = \frac{\ln(k_a \,/\, k_e)}{k_a - k_e}$$
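This expression follows from setting the time derivative of C(t) to zero. A short derivation, using only the symbols defined in the parameter table and assuming k_a ≠ k_e:

$$
\begin{aligned}
\frac{dC}{dt} &= \frac{F \cdot D \cdot k_a}{V_d \cdot (k_a - k_e)} \left( -k_e e^{-k_e t} + k_a e^{-k_a t} \right) = 0 \\
k_a e^{-k_a t_{\max}} &= k_e e^{-k_e t_{\max}} \\
e^{(k_a - k_e)\, t_{\max}} &= \frac{k_a}{k_e} \\
t_{\max} &= \frac{\ln(k_a / k_e)}{k_a - k_e}
\end{aligned}
$$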

Area under the curve (AUC)

$$\text{AUC}_{0 \to \infty} = \frac{F \cdot D}{V_d \cdot k_e}$$

Elimination half-life

$$t_{1/2} = \frac{\ln 2}{k_e} \approx \frac{0.693}{k_e}$$

The simulation generates 200 evenly spaced time points over a default window of 24 hours. C_max is evaluated at t_max. When k_a = k_e, the general formula for C(t) is undefined (0/0), so its limit obtained via L'Hôpital's rule is used instead: C(t) = (F · D · k_a · t · e^(−k_e t)) / V_d, with t_max = 1/k_e.
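The equations in this section translate directly into code. The following is a minimal sketch, not Molara's implementation: the function names, the tolerance used to detect k_a = k_e, and the choice to include both endpoints (0 and 24 hours) in the 200-point grid are assumptions made for the example.

```python
import math


def concentration(t, dose, f, vd, ka, ke):
    """Plasma concentration (mg/L) at time t (hr) for one oral dose."""
    if math.isclose(ka, ke, rel_tol=1e-9):
        # Limiting form when ka == ke (the general formula would be 0/0).
        return f * dose * ka * t * math.exp(-ke * t) / vd
    coeff = f * dose * ka / (vd * (ka - ke))
    return coeff * (math.exp(-ke * t) - math.exp(-ka * t))


def t_max(ka, ke):
    """Time of peak concentration (hr)."""
    if math.isclose(ka, ke, rel_tol=1e-9):
        return 1.0 / ke
    return math.log(ka / ke) / (ka - ke)


def simulate(dose, f, vd, ka, ke, hours=24.0, n_points=200):
    """Return the sampled curve plus the summary metrics defined above."""
    # Evenly spaced grid; including both endpoints is an assumption.
    times = [hours * i / (n_points - 1) for i in range(n_points)]
    conc = [concentration(t, dose, f, vd, ka, ke) for t in times]
    tmax = t_max(ka, ke)
    return {
        "times": times,
        "conc": conc,
        "tmax": tmax,
        "cmax": concentration(tmax, dose, f, vd, ka, ke),  # Cmax evaluated at tmax
        "auc": f * dose / (vd * ke),                       # AUC from 0 to infinity
        "half_life": math.log(2) / ke,
    }
```

Evaluating C_max analytically at t_max, rather than taking the maximum of the sampled curve, avoids underestimating the peak when it falls between two grid points.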

2.5 Drug Interaction Detection

Drug-drug interactions are detected using a two-source strategy that combines a local database with live FDA label queries.

Source A — DDInter 2.0 (local)

A pre-loaded SQLite database containing ~302,516 drug-drug interaction pairs from 8 ATC classification categories (A, B, D, H, L, P, R, V). Each pair has a severity classification: Major, Moderate, or Minor. Lookups are bidirectional and case-insensitive. An alias map resolves 24 common brand names (e.g., Tylenol → Acetaminophen, Advil → Ibuprofen) to their generic names before the lookup.
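A bidirectional, case-insensitive lookup with alias resolution might look like the sketch below. The table name `interactions`, its columns (`drug_a`, `drug_b`, `severity`), and the function names are hypothetical, since the actual schema is not described here; the alias map shows only the two entries named above.

```python
import sqlite3

# Two of the 24 brand-name aliases; keys are lowercase for matching.
ALIASES = {"tylenol": "acetaminophen", "advil": "ibuprofen"}


def normalize(name: str) -> str:
    """Lowercase the name and map a known brand name to its generic name."""
    key = name.strip().lower()
    return ALIASES.get(key, key)


def lookup_severity(conn: sqlite3.Connection, drug_1: str, drug_2: str):
    """Return the severity for a pair in either order, or None if absent."""
    a, b = normalize(drug_1), normalize(drug_2)
    row = conn.execute(
        """
        SELECT severity FROM interactions
        WHERE (lower(drug_a) = ? AND lower(drug_b) = ?)
           OR (lower(drug_a) = ? AND lower(drug_b) = ?)
        LIMIT 1
        """,
        (a, b, b, a),  # second pair of parameters covers the reversed order
    ).fetchone()
    return row[0] if row else None
```

With this shape, `lookup_severity(conn, "warfarin", "ADVIL")` and `lookup_severity(conn, "Ibuprofen", "Warfarin")` resolve to the same stored row, whichever order the pair was stored in.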

Source B — OpenFDA (live)

The FDA Drug Label API is queried in real time for each drug using the openfda.generic_name field (with fallback to openfda.brand_name). The drug_interactions section of the label is extracted and displayed as free-text interaction information. Results are cached for 10 minutes.

3. Known Limitations

Molara is an educational tool, not a clinical decision support system. Users should be aware of the following constraints.

Molecular Data

  • RDKit-generated 3D coordinates are energy-minimized approximations (MMFF force field), not experimentally determined crystal structures.
  • PubChem molecular properties are computationally derived; experimental values may differ, especially for LogP and TPSA.
  • SMILES-based analysis does not distinguish stereoisomers (R/S, E/Z) unless the SMILES explicitly encodes stereochemistry.

Pharmacokinetics

  • The one-compartment model assumes instantaneous and uniform distribution to all tissues. Multi-compartment behavior (e.g., CNS penetration, adipose sequestration) is not captured.
  • The model does not account for plasma protein binding, active metabolites, enterohepatic recirculation, renal/hepatic impairment, or drug-drug PK interactions.
  • Parameters reflect population-level averages. Individual pharmacokinetic variability (age, weight, genetics) is not modeled.

Drug Interactions

  • DDInter 2.0 covers approximately 2,310 drugs. Interactions involving newer, biosimilar, or less common agents may not be present.
  • Severity classifications (Major / Moderate / Minor) are categorical and do not capture dose-dependent or patient-specific risk.
  • OpenFDA label text reflects US FDA-approved labeling only. Regional regulatory differences are not represented.

AI Assistant

  • AI responses are generated by a large language model and should never be used for clinical decision-making. Always consult a healthcare professional.
  • The model may produce pharmacologically plausible but factually incorrect statements (hallucinations), particularly for rare drugs or novel research.
  • A training data cutoff applies; the newest approved drugs or recent safety warnings may not be reflected.

4. Attribution

  • Drug interaction data from DDInter 2.0 (Xiong et al., Nucleic Acids Research, 2022)
  • Molecular data from PubChem PUG-REST API (National Library of Medicine, NIH)
  • Protein structures from RCSB Protein Data Bank (Berman et al., 2000)
  • RDKit: Open-source cheminformatics (rdkit.org)
  • 3D conformer generation via ETKDGv3 (Riniker & Landrum, J. Chem. Inf. Model., 2015)
  • 3D visualization powered by 3Dmol.js (Rego & Koes, 2015)
  • Lipinski's Rule of Five (Lipinski et al., Adv. Drug Deliv. Rev., 1997)
  • AI-powered pharmacology assistant