Monorepo for Tangled tangled.org

proposal/discussion: extensible markup lexicon #383

open, opened by boltless.me (edited)

several issues with current markup situation

Related Issues:

I recommend reading all related issues before discussing.

markup format is not extensible from lexicon

The current lexicon definition doesn't specify the markup format. Right now, we only support a blessed, Tangled-specific markdown variant. But in the future, we may want to support other syntaxes such as org-mode, as requested in #197.

markdown facets

It is pretty common to reference objects like users, issues, pulls, repositories, blobs, or even git commits via markdown. And if someone references something, we want that reference to be permanent.

For example, if alice referenced bob as @bob.tngl.sh and bob later changed their handle to something else, @bob.tngl.sh should still point to the same user (bob). We currently include mentioned/referenced identities in the record to invalidate the legacy link, but this isn't enough. Bluesky uses app.bsky.richtext.facet to embed resolved metadata into rich text, but it's hard to adopt the same solution because we would need to apply byte-wise facets to a markup language. Byte-wise facets are quite doable for markdown variants or djot, but I assume not every markup language/parser will allow this.
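To make the byte-wise part concrete, here is a minimal sketch of how a mention facet could be computed over raw markdown. The `index`/`features` shape follows app.bsky.richtext.facet; the mention regex, the `resolve` callback, and the DID value are assumptions made up for illustration, not anything Tangled defines.

```python
import re

# Hypothetical mention pattern: "@" followed by a domain-like handle.
MENTION_RE = re.compile(r"@([a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)+)")


def mention_facets(text, resolve):
    """Return bsky-style mention facets indexed by UTF-8 byte offsets."""
    facets = []
    for m in MENTION_RE.finditer(text):
        did = resolve(m.group(1))  # hypothetical handle -> DID lookup
        if did is None:
            continue
        # Facet indices count bytes of the UTF-8 encoding, not characters,
        # so multi-byte characters before the mention shift the offsets.
        byte_start = len(text[: m.start()].encode("utf-8"))
        byte_end = len(text[: m.end()].encode("utf-8"))
        facets.append({
            "index": {"byteStart": byte_start, "byteEnd": byte_end},
            "features": [
                {"$type": "app.bsky.richtext.facet#mention", "did": did}
            ],
        })
    return facets


# "é" takes two bytes, so the mention starts at byte 7 rather than index 6.
facets = mention_facets("héllo @bob.tngl.sh", {"bob.tngl.sh": "did:plc:bob"}.get)
```

The offsets point into the raw markup, so a renderer would still have to map them onto whatever nodes its markdown parser produces, which is the hard part described above.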

Proposal

Introduce sh.tangled.markup lexicon.

sh.tangled.markup#markdown

Represents the title/body text of an issue/pull/comment.[1]

Both lexicons have two fields:

  • text (raw text)
  • refMap (uri -> item map)

refMap maps every uri used in text to a resolved identifier such as a did, an at-uri or a blob. For maximum extensibility, it would be better to make the key (uri) extensible too.
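A rough sketch of what building such a record could look like is shown below. It assumes references are written as markdown inline links; the `resolve` callback, the example URL, and the DID are hypothetical placeholders.

```python
import re

# Assumption: references appear as markdown inline links, "[label](uri)".
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")


def build_markup(text, resolve):
    """Wrap raw markdown together with a refMap of uri -> resolved identifier."""
    ref_map = {}
    for uri in LINK_RE.findall(text):
        resolved = resolve(uri)  # hypothetical resolver; None if unknown
        if resolved is not None:
            ref_map[uri] = resolved
    return {
        "$type": "sh.tangled.markup#markdown",
        "text": text,  # the raw text is stored untouched
        "refMap": ref_map,
    }


# Illustrative data only: the URL shape and DID are made up.
known = {"https://example.com/@bob.tngl.sh": "did:plc:bob"}
record = build_markup(
    "ping [bob](https://example.com/@bob.tngl.sh), see [docs](https://example.com/docs)",
    known.get,
)
```

Only uris the resolver recognises end up in refMap; everything else stays an ordinary link in text.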


Honestly, I'm not satisfied with my own solution, but I think we do need some kind of dedicated lexicon to represent the markup content instead of using a raw string type.

I'm open to more thoughts.

[1]: Title might use sh.tangled.markup#markdown_inline instead to be more specific

i am open to the idea of defining a rich markdown facet-y lexicon for our use case. it is quite an undertaking to represent a markdown AST as a lexicon and the usefulness is questionable, given that other implementors need to be able to lower markdown AST into the lexicon AST. but we can be sure that issues/comments render identically on all tangled appviews.

one reason to prefer raw string markdown might be: other rendered content such as README files are plain text, any alternate appview would need to understand how to render plaintext markdown anyway.

having only an ast instead of a raw string makes it difficult to fix parsing errors or retroactively choose to support a new syntax on existing issues/etc, which seems suboptimal (e.g. suppose u wanted to begin auto-linkifying issue references, and have that work for all existing issues that use a supported syntax)

one other thing to consider is post-processing -- instead of using facets to, say, convert an @-reference to a did, have it be converted on submission to a did-link -- e.g. writing @directxman12.dev might get autolinkified as [did:plc:xyz] on saving, and the renderers are expected to resolve that reference link back to @directxman12.dev or whatever my current handle is at the time. this accomplishes much the same thing as facets here (u already have a markup language, so u don't get the bsky "no need for a markup parser" benefit), but means that implementers don't need to xref markdown ast with byte offsets as they render (possible, but annoying in many markdown libraries), and instead embed all the needed data inline.
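As a rough illustration of the "convert on submission" step, the sketch below rewrites @-mentions into did reference links before the text is stored. The regex, the `resolve_handle` callback, and the DID are assumptions for the example; a real implementation would more likely hook into the markdown parser than run a regex over raw text.

```python
import re

# Hypothetical mention pattern; the lookbehind skips things like
# "user@host.com" and mentions that are already inside a "[" bracket.
MENTION_RE = re.compile(r"(?<![\w\[])@([a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)+)")


def linkify_on_save(text, resolve_handle):
    """Rewrite resolvable @handle mentions as [did] reference links."""
    def repl(m):
        did = resolve_handle(m.group(1))  # hypothetical handle -> DID lookup
        # Leave the mention untouched when the handle does not resolve.
        return "[" + did + "]" if did else m.group(0)
    return MENTION_RE.sub(repl, text)


saved = linkify_on_save(
    "hello @alice.com and @unknown.example",
    {"alice.com": "did:plc:alice"}.get,
)
# saved == "hello [did:plc:alice] and @unknown.example"
```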

for links, at least, this also doesn't require any markdown extensions, just reference link hooks, which is a feature already common in a number of parsers (e.g. pulldown-cmark calls this the "broken link callback")
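The render side can be sketched in a similar way. The function below is a plain-regex stand-in for a parser's reference link hook, not pulldown-cmark's actual API; the `current_handle` lookup and the `at://` link target are assumptions chosen for illustration.

```python
import re

# Matches shortcut reference links whose label is a DID, e.g. "[did:plc:xyz]",
# but not ones already followed by "(" or "[" (those are complete links).
DID_REF_RE = re.compile(r"\[(did:[a-z]+:[a-zA-Z0-9._:%-]+)\](?![(\[])")


def render_did_refs(text, current_handle):
    """Turn [did] reference links into inline links labelled with the current handle."""
    def repl(m):
        did = m.group(1)
        handle = current_handle(did)  # hypothetical DID -> handle lookup
        # Fall back to showing the DID itself if no handle is known.
        label = "@" + handle if handle else did
        return "[" + label + "](at://" + did + ")"
    return DID_REF_RE.sub(repl, text)


rendered = render_did_refs(
    "hello [did:plc:alice]",
    {"did:plc:alice": "alice.example"}.get,
)
# rendered == "hello [@alice.example](at://did:plc:alice)"
```

Because the handle is looked up at render time, a later handle change shows up automatically while the stored text keeps pointing at the same DID.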

generally, i do think being able to have different markup languages be supported would be nice, but i think instead of trying to do a 1-size-fits-all solution for things like x-refs, it's probably worth leaning on the language itself when u can. e.g. markdown has reference links, html has custom attributes, org-mode has link abbreviations, etc.

@oppi.li to clarify, I'm not suggesting we use an AST here. I know that is, well, impossible. afaik the only way to fully support the markdown spec is to not use an AST and convert directly to HTML while parsing (I suppose the goldmark package is not fully CommonMark/GFM compatible).

What I'm suggesting here is:

  • use a raw string to store markdown content, but with a wrapper type to specify which markup language the raw string uses (to support other formats like org-mode or djot in the future)
  • embed the resolved uris to keep markup links consistent (mostly for user mentions)
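The first point could look roughly like the sketch below: a renderer picked by the wrapper's `$type`, with bare strings treated as markdown. The registry, the placeholder renderer output, and the `sh.tangled.markup#djot` name in the usage line are all hypothetical.

```python
import html

RENDERERS = {}


def renderer(type_name):
    """Register a render function for one markup $type."""
    def register(fn):
        RENDERERS[type_name] = fn
        return fn
    return register


@renderer("sh.tangled.markup#markdown")
def render_markdown(body):
    # Placeholder: a real appview would call its markdown renderer here.
    return "<markdown>" + html.escape(body["text"]) + "</markdown>"


def render_body(body):
    """Render a body that is either a bare string or a typed wrapper object."""
    # Bodies stored as a bare string are treated as markdown.
    if isinstance(body, str):
        body = {"$type": "sh.tangled.markup#markdown", "text": body}
    fn = RENDERERS.get(body.get("$type"))
    if fn is None:
        # Unknown format: fall back to escaped plain text.
        return "<pre>" + html.escape(body.get("text", "")) + "</pre>"
    return fn(body)


out_known = render_body("hi <b>")
out_unknown = render_body({"$type": "sh.tangled.markup#djot", "text": "x"})
```

A client that only understands markdown can still show something sensible for a format it does not know, which is the extensibility being argued for.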

Or we can have a resolvedDids map. Issue/pull references aren't necessary once we atprotate the issue/pull ids, so the only thing we need to embed is the user did. The sh.tangled.markup#markdown object can hold a handle -> did map, which can be used to resolve a mentioned user from their old handle. When the appview parses the mentioned users from the markdown content, it can prioritize these pre-resolved DIDs.
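A minimal sketch of that lookup order, assuming the map is stored under a `resolvedDids` key and that `live_resolve` is some hypothetical handle resolver available to the appview:

```python
def resolve_mention(handle, body, live_resolve):
    """Prefer the DID recorded at write time over a fresh handle lookup."""
    pre_resolved = body.get("resolvedDids", {})
    if handle in pre_resolved:
        # The handle may since have been released or reassigned, so the
        # DID captured when the comment was written takes priority.
        return pre_resolved[handle]
    return live_resolve(handle)


body = {
    "$type": "sh.tangled.markup#markdown",
    "text": "hello @alice.com and @bob.example",
    "resolvedDids": {"alice.com": "did:plc:alice"},
}
# Illustrative data: alice.com now points at a different account.
live = {"alice.com": "did:plc:someone-else", "bob.example": "did:plc:bob"}.get

alice_did = resolve_mention("alice.com", body, live)
bob_did = resolve_mention("bob.example", body, live)
```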

i don't think we would ever support authoring content in org mode or djot, and probably best to constrain the input format.

i do like the "format on save" approach by @directxman12.dev , we can then expect appviews to resolve DIDs upon render.

> i don't think we would ever support authoring content in org mode or djot, and probably best to constrain the input format.

This is the saddest thing I've heard among Tangled's future plans.

If so, markdown is not a good choice; please reconsider unless we are going to maintain our own spec. It's not about personal taste, but because we can't really define the markdown spec. See how GitHub and GitLab had to maintain their own markdown variants to define a strict format. It's not just because they extend the syntax with @ mentions and # references, but because markdown is an incomplete spec whose undefined behavior still exists to this day. Markdown is a really slippery spec, and it is extremely hard to embed other syntax in markdown. For example, we cannot format an inapplicable task in a task list unless we write it as raw HTML. Markdown is not designed for that kind of use.

If we go with markdown only, I'd argue that it becomes even more necessary to manage this as a discrete type that can be extended. I'm pretty sure we will change how we parse and render markdown several times in the future. So even if Tangled won't support specs other than the blessed markdown (whatever that is), imo having a lexicon to track the version would be valuable.

Is wrapping raw string+metadata as an object too much abstraction?

{
  "$type": "sh.tangled.comment",
  "body": {
    "$type": "sh.tangled.markup#markdown",
    "text": "hello @alice.com",
    "resolvedIds": [
      {"handle": "alice.com", "did": "did:plc:alice"}
    ]
  }
}

I'm not sure why both of you mention facets. I haven't said we should support facets. I said we should make richtext extensible for future updates, which could be new markdown syntax, like I did with mentions and link-based references, or support for a completely new markup language.

Forgive my yapping here; as someone who has dived deep into the markup language rabbit hole, that decision hurts quite a bit. I absolutely agree that we should store markdown in raw format, that we should not focus on supporting other formats right now, and that using facets is not a good idea here. I'm just suggesting we at least wrap the current body, mentions and references fields into a more manageable type.

> This is the saddest thing I've heard among Tangled's future plans.

don't really think this is a controversial decision. all platforms have minor extensions to markdown; and sure we can write a more detailed spec about how we render markdown, but it would be heavily inspired by gitea, gitlab and github. note that i said authoring content, and not rendering readmes and such, we could support more formats there. the reason for this is so that other clients will not have to implement a slew of formats to render UGC well.

> Is wrapping raw string+metadata as an object too much abstraction?

this is fine, but converting to links at write time would achieve the same thing without a bespoke lexicon! and it is baseline markdown syntax as well.

> don't really think this is a controversial decision. ...

The reason I'd be sad is that Tangled would not be extensible here while using one of the most extensible protocols. Imagine if we didn't try to support any VCS other than git. While git is still mainstream and Tangled will be based on it for a long time, I wish it would leave some room for future extensibility. See how standard.site simply didn't specify the content type (though their use case is quite different from Tangled comments and closer to README rendering).

I absolutely agree that markdown should be the standard here! (wait, then why be opinionated about a partially patch-based workflow?) I just wish Tangled were more open to improvements, as I expect it to last a long time. It is more of a "we should not limit the commit sha to sha1" proposal than a "we should support sha256 commit hashes today" one.

> this is fine, but converting to links at write time would achieve the same thing without a bespoke lexicon! and it is baseline markdown syntax as well.

Ok, I do agree with that. As long as we give some semantic meaning to the raw text with a wrapper object, I'm fine. I don't care much about the details of how we embed the links; I'm just trying to group the exposed body, mentions and references fields into one semantic type. Actually, building on the post-processing idea, we can include the original text from before post-processing so that we can restore it on edit!

{
  "$type": "sh.tangled.comment",
  "body": {
    "$type": "sh.tangled.markup#markdown",
    "original": "hello @alice.com",
    "text": "hello [did:plc:alice]"
  }
}
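A small sketch of how the `original`/`text` pair might be used, with a hypothetical `post_process` step standing in for the write-time link conversion:

```python
def make_body(original, post_process):
    """Store both the author's input and the post-processed text."""
    return {
        "$type": "sh.tangled.markup#markdown",
        "original": original,
        "text": post_process(original),
    }


def text_for_editor(body):
    """On edit, show what the author typed, falling back to the stored text."""
    return body.get("original", body["text"])


# Stand-in for the real post-processing step; the DID is illustrative.
def fake_post_process(s):
    return s.replace("@alice.com", "[did:plc:alice]")


body = make_body("hello @alice.com", fake_post_process)
editable = text_for_editor(body)
```

Renderers would read `text`, while the edit form would read `original`, so the author never sees the rewritten did links.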

AT URI
at://did:plc:xasnlahkri4ewmbuzly2rlc5/sh.tangled.repo.issue/3mctoic4vhe22