A Uniform Resource Identifier (URI) is a structured sequence of characters that identifies a resource (such as a file) on the Internet. Lists of URIs, such as those generated from web server logs and Hypertext Transfer Protocol (HTTP) form submissions, are a common input for text analysis functions.
URI syntax is defined by the Internet Engineering Task Force (IETF) in RFC 3986. The following table describes the key components of a hierarchical URI. The examples in the table are from this URI: https://www.google.com/webhp?p1=chrome&p2=hello%20world&p3=UTF-8#fragment1
Component | Example |
---|---|
scheme | https |
host | www.google.com |
path | /webhp |
query | ?p1=chrome&p2=hello%20world&p3=UTF-8 A query starts with a question mark (?). An ampersand (&) separates the query parameters, and an equal sign (=) separates each parameter name from its value. Here, the query parameters are p1, p2, and p3. Their values are chrome, hello%20world, and UTF-8, respectively. %20 represents a space character. |
fragment | #fragment1 |
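
As an illustration of the component breakdown in the table, the following minimal sketch splits the example URI with Python's standard `urllib.parse` module. The variable names (`uri`, `parts`, `components`, `params`) are chosen here for illustration only. Note that `urlsplit` returns the query and fragment without their leading `?` and `#` delimiters, and that `parse_qs` decodes percent-encoded values.

```python
from urllib.parse import urlsplit, parse_qs

# Example URI from the table above.
uri = "https://www.google.com/webhp?p1=chrome&p2=hello%20world&p3=UTF-8#fragment1"

# Split the URI into its components.
parts = urlsplit(uri)

components = {
    "scheme": parts.scheme,      # "https"
    "host": parts.hostname,      # "www.google.com"
    "path": parts.path,          # "/webhp"
    "query": parts.query,        # query string without the leading "?"
    "fragment": parts.fragment,  # fragment without the leading "#"
}

# Parse the query into a dict mapping each parameter name to a list of
# decoded values (a name can appear more than once in a query).
params = parse_qs(parts.query)

for name, value in components.items():
    print(f"{name}: {value}")
print(params)
```

Each value in `params` is a list because a parameter name may repeat within one query; for this URI every list holds a single decoded value, so `p2` maps to `hello world` rather than `hello%20world`.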
A URI can contain the following US-ASCII characters without encoding: the lowercase and uppercase letters of the English alphabet, the Arabic numerals, and the hyphen (-), period (.), underscore (_), and tilde (~). Reserved characters such as ?, &, and # act as delimiters between components. Any other character is percent-encoded; that is, each byte of its encoding (typically UTF-8) is converted to a sequence of the form %hh, where each h is a hexadecimal digit. The space character is percent-encoded as %20. For example, "San José" is encoded as "San%20Jos%C3%A9". In a query built from an HTML form submission, the space character is instead encoded as the plus character (+). For example, "San José" is encoded as "San+Jos%C3%A9".
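
The two encodings of "San José" shown above can be reproduced with Python's standard `urllib.parse` module. This is a minimal sketch, not a description of how any particular text analysis function encodes its input; the variable names are illustrative. `quote` produces the %20 form, and `quote_plus` produces the + form that Python's documentation describes as intended for HTML form values.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

text = "San José"

# Percent-encoding: the space becomes %20, and each UTF-8 byte of
# "é" (0xC3, 0xA9) becomes its own %hh sequence.
percent_encoded = quote(text)

# Form-style encoding: same as above, except the space becomes "+".
form_encoded = quote_plus(text)

print(percent_encoded)  # San%20Jos%C3%A9
print(form_encoded)     # San+Jos%C3%A9

# Decoding must match the encoding convention: unquote leaves "+"
# untouched, while unquote_plus turns it back into a space.
print(unquote(percent_encoded))    # San José
print(unquote_plus(form_encoded))  # San José
```

Decoding with the wrong function is a common source of stray plus signs: `unquote("San+Jos%C3%A9")` returns `San+José`, because `+` is only treated as a space under the form-encoding convention.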